Implement a pair of new system calls to provide extended and further extensible
stat functions.
The second of the associated patches is the main patch that provides these new
system calls:
	ssize_t ret = xstat(int dfd,
			    const char *filename,
			    unsigned atflag,
			    struct xstat_parameters *params,
			    struct xstat *buffer,
			    size_t bufsize);
	ssize_t ret = fxstat(int fd,
			     struct xstat_parameters *params,
			     struct xstat *buffer,
			     size_t bufsize);
which are more fully documented in that patch's description.
These new stat functions provide a number of useful features, in summary:
(1) More information: creation time, inode generation number, data version
number, flags/attributes. A subset of these is available through each of:
CIFS, NFS, AFS, Ext4, BTRFS and others.
(2) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server.
(3) Heavyweight stat: Force a netfs to go to the server, even if it thinks its
cached attributes are up to date.
(4) Allow the filesystem to indicate what it can/cannot provide: A filesystem
can now say it doesn't support a standard stat feature if that isn't
available.
(5) Make the fields a consistent size on all arches, and make them large.
(6) Can be extended by using more request flags and appending further data
after the end of the standard return data.
Note that no lstat() equivalent is required as that can be implemented through
xstat() with atflag == 0.
==================
ADDITIONAL PATCHES
==================
The first patch makes const a bunch of system call userspace string/buffer
arguments. I can then make sys_xstat()'s filename pointer const too (though
the entire first patch is not required for that).
The third patch makes the AFS filesystem use i_generation for the vnode ID
uniquifier rather than i_version, ...
Store the AFS vnode uniquifier in the i_generation field, not the i_version
field of the inode struct. i_version can then be given the AFS data version
number.
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/afs/dir.c | 8 ++++----
fs/afs/fsclient.c | 3 ++-
fs/afs/inode.c | 10 +++++-----
3 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index b42d5cc..afb9ff8 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -542,11 +542,11 @@ static struct dentry *afs_lookup(struct inode *dir, struct dentry *dentry,
dentry->d_op = &afs_fs_dentry_operations;
d_add(dentry, inode);
- _leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%llu }",
+ _leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%u }",
fid.vnode,
fid.unique,
dentry->d_inode->i_ino,
- (unsigned long long)dentry->d_inode->i_version);
+ dentry->d_inode->i_generation);
return NULL;
}
@@ -626,10 +626,10 @@ static int afs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
* been deleted and replaced, and the original vnode ID has
* been reused */
if (fid.unique != vnode->fid.unique) {
- _debug("%s: file deleted (uq %u -> %u I:%llu)",
+ _debug("%s: file deleted (uq %u -> %u I:%u)",
dentry->d_name.name, fid.unique,
vnode->fid.unique,
- (unsigned long long)dentry->d_inode->i_version);
+ dentry->d_inode->i_generation);
spin_lock(&vnode->lock);
set_bit(AFS_VNODE_DELETED, &vnode->flags);
spin_unlock(&vnode->lock);
diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 4bd0218..346e328 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -89,7 +89,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
i_size_write(&vnode->vfs_inode, size);
vnode->vfs_inode.i_uid = status->owner;
vnode->vfs_inode.i_gid = status->group;
- vnode->vfs_inode.i_version = vnode->fid.unique;
+ vnode->vfs_inode.i_generation = ...
Return extended attributes from the CIFS filesystem. This includes the
following:
(1) Return the file creation time as btime. We assume that the creation time
won't change over the life of the inode.
(2) FS_AUTOMOUNT_FL on referral/submount directories.
(3) Deasserting XSTAT_REQUEST_INO in st_result_mask if we made up the inode
number and didn't get it from the server.
(4) Map various Windows file attributes to FS_xxx_FL flags in st_inode_flags,
fetching them from the server if we don't have them yet or don't have a
current copy.
Furthermore, what cifs_getattr() does can be controlled as follows:
(1) If AT_FORCE_ATTR_SYNC is indicated, or if the inode flags or creation time
are requested but not yet collected, then the attributes will be reread
unconditionally.
(2) If the basic stats are requested or if the inode flags are requested and
have been collected previously, then the attributes will be reread if out
of date.
(3) Otherwise the cached attributes will be used - even if expired - without
reference to the server.
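The three cases above can be sketched as a small decision function. This is only an illustration of the policy as described, not the code from the patch: the enum, the struct and the name choose_action() are hypothetical, and the booleans stand in for tests on AT_FORCE_ATTR_SYNC and the request mask.

```c
#include <stdbool.h>

/* Hypothetical outcome of the policy described in cases (1)-(3). */
enum revalidate_action {
	USE_CACHE,		/* case (3): cached attributes, even if expired */
	REREAD_IF_STALE,	/* case (2): reread only if out of date */
	REREAD_ALWAYS,		/* case (1): reread unconditionally */
};

/* Hypothetical summary of what the caller asked for. */
struct getattr_query {
	bool force_sync;	/* AT_FORCE_ATTR_SYNC was indicated */
	bool want_basic;	/* any of the basic stats requested */
	bool want_inode_flags;	/* inode flags requested */
	bool want_btime;	/* creation time requested */
	bool have_extra;	/* flags and creation time already collected */
};

enum revalidate_action choose_action(const struct getattr_query *q)
{
	/* (1) Forced, or extra data wanted that we have never fetched. */
	if (q->force_sync)
		return REREAD_ALWAYS;
	if ((q->want_inode_flags || q->want_btime) && !q->have_extra)
		return REREAD_ALWAYS;

	/* (2) Basic stats, or inode flags we already hold a copy of. */
	if (q->want_basic || q->want_inode_flags)
		return REREAD_IF_STALE;

	/* (3) Nothing that needs the server; creation time is assumed
	 * not to change, so a collected copy is served from cache. */
	return USE_CACHE;
}
```

Under this reading, a request for only the creation time costs one round trip the first time and none afterwards.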
Note that cifs_revalidate_dentry() will issue an extra operation to get the
FILE_ALL_INFO in addition to the FILE_UNIX_BASIC_INFO if it needs to collect
creation time and attributes on behalf of cifs_getattr().
[NOTE: THIS PATCH IS UNTESTED!]
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/cifs/cifsfs.h | 2 +
fs/cifs/cifsglob.h | 5 +++
fs/cifs/dir.c | 2 +
fs/cifs/inode.c | 76 ++++++++++++++++++++++++++++++++++++++++++++--------
4 files changed, 71 insertions(+), 14 deletions(-)
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index a7eb65c..50bf70b 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -62,7 +62,7 @@ extern int cifs_rmdir(struct inode *, struct dentry *);
extern int cifs_rename(struct inode *, struct dentry *, struct inode *,
struct dentry *);
extern int cifs_revalidate_file(struct file *filp);
-extern int ...
Make AFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/afs/dir.c | 1 +
fs/afs/internal.h | 1 +
fs/afs/mntpt.c | 46 +++++++++++++++-------------------------------
3 files changed, 17 insertions(+), 31 deletions(-)
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index afb9ff8..d2dd137 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -65,6 +65,7 @@ static const struct dentry_operations afs_fs_dentry_operations = {
.d_revalidate = afs_d_revalidate,
.d_delete = afs_d_delete,
.d_release = afs_d_release,
+ .d_automount = afs_d_automount,
};
#define AFS_DIR_HASHTBL_SIZE 128
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 5f679b7..2c700dc 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -583,6 +583,7 @@ extern int afs_abort_to_error(u32);
extern const struct inode_operations afs_mntpt_inode_operations;
extern const struct file_operations afs_mntpt_file_operations;
+extern struct vfsmount *afs_d_automount(struct path *);
extern int afs_mntpt_check_symlink(struct afs_vnode *, struct key *);
extern void afs_mntpt_kill_timer(void);
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index a9e2303..ea9cfee 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -24,7 +24,6 @@ static struct dentry *afs_mntpt_lookup(struct inode *dir,
struct dentry *dentry,
struct nameidata *nd);
static int afs_mntpt_open(struct inode *inode, struct file *file);
-static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd);
static void afs_mntpt_expiry_timed_out(struct work_struct *work);
const struct file_operations afs_mntpt_file_operations = {
@@ -33,7 +32,6 @@ const struct file_operations afs_mntpt_file_operations = {
const struct inode_operations afs_mntpt_inode_operations = {
.lookup = afs_mntpt_lookup,
- .follow_link = afs_mntpt_follow_link,
.readlink = page_readlink,
.getattr = ...
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
inode flag (S_AUTOMOUNT).
This makes it easier to add an AT_ flag to suppress terminal segment automount
during pathwalk. It should also remove the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
I've only changed __follow_mount() to handle automount points, but it might be
necessary to change follow_mount() too. The latter is only used from
follow_dotdot(), but any automounts on ".." should be pinned whilst we're using
a child of it.
Note that autofs4's use of follow_mount() will need examining if this patch is
committed.
Signed-off-by: David Howells <dhowells@redhat.com>
---
Documentation/filesystems/Locking | 2 +
Documentation/filesystems/vfs.txt | 13 ++++++
fs/namei.c | 85 +++++++++++++++++++++++++++++--------
fs/stat.c | 2 +
include/linux/dcache.h | 5 ++
include/linux/fs.h | 2 +
6 files changed, 91 insertions(+), 18 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 96d4293..ccbfa98 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -16,6 +16,7 @@ prototypes:
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
+ struct vfsmount *(*d_automount)(struct path *path);
locking rules:
none have BKL
@@ -27,6 +28,7 @@ d_delete: yes no yes no
d_release: no no no yes
d_iput: no no no yes
d_dname: no no no no
+d_automount: no no no yes
--------------------------- inode_operations ---------------------------
prototypes:
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 94677e7..31a9e8f 100644
--- ...
Moving this out of ->follow_link is a good idea, but please submit this as a
separate patch series, as it has very little to do with stat().
--
Except that I want to use it to create a new AT flag for xstat() (and also
fstatat()), but fair enough.
David
--
Provide a mechanism in the kernel by which extra results beyond those allocated
space in the xstat struct can be returned to userspace.
[I'm not sure this is the best way to do this; it's a bit unwieldy. However,
I'd rather not overburden struct kstat with fields for every extra result we
might want to return as it's allocated on the stack in various places.
Possibly the pass_result of struct xstat_extra_result could be placed in
struct kstat to be used if pass_result is non-NULL, and struct kstat could be
passed to container_of().]
This is modelled on the filldir approach used to read directory entries. This
allows kernel routines (such as NFSD) to access this information too.
A new inode operation (getattr_extra) is provided that interested filesystems
need to implement. If this is not provided, then it is assumed that no extra
results will be returned.
The getattr_extra() routine is passed a token to represent the request:
	struct xstat_extra_result {
		u64			request_mask;
		struct kstat		*stat;
		xstat_extra_result_t	pass_result;
	};
The three fields in this struct are: the request_mask (with bits not
representing extra results edited out); the pointer to the kstat structure as
passed to getattr() (stat->query_flags may be useful); and a pointer to a
function to which each individual result should be passed.
The requests can be handled in order with something like the following:
	u64 request_mask = token->request_mask;
	do {
		int request = __ffs64(request_mask);
		request_mask &= ~(1ULL << request);
		switch (request) {
		case ilog2(XSTAT_REQUEST_FOO): {
			struct xstat_foo foo;
			ret = myfs_get_foo(inode, token, &foo);
			if (!ret)
				token->pass_result(token, request,
						   &foo, sizeof(foo));
			break;
		}
		default:
			ret = 0;
			break;
		}
	} while (ret == 0 && request_mask);
The caller should probably embed token in something so that they can retrieve
it in the pass_result() function with container_of().
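The embedding suggested here might look like the following userspace sketch. It assumes a callback signature inferred from the pass_result() call above (the int return type is a guess), and the wrapper struct extra_collector and the function collect_result() are hypothetical names. container_of() is redefined locally so the fragment compiles outside the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct kstat;				/* opaque in this sketch */
struct xstat_extra_result;

/* Assumed callback type: token, request number, result data, length. */
typedef int (*xstat_extra_result_t)(struct xstat_extra_result *token,
				    int request,
				    const void *data, size_t len);

struct xstat_extra_result {
	uint64_t		request_mask;
	struct kstat		*stat;
	xstat_extra_result_t	pass_result;
};

/* Hypothetical caller state with the token embedded in it. */
struct extra_collector {
	unsigned char			buf[64];
	size_t				used;
	struct xstat_extra_result	token;
};

/* Recover the collector from the token and append the result to it. */
int collect_result(struct xstat_extra_result *token, int request,
		   const void *data, size_t len)
{
	struct extra_collector *c =
		container_of(token, struct extra_collector, token);

	(void)request;			/* a real caller would dispatch on it */
	if (len > sizeof(c->buf) - c->used)
		return -1;		/* no room left for this result */
	memcpy(c->buf + c->used, data, len);
	c->used += len;
	return 0;
}
```

The filesystem only ever sees the token; the buffer and its fill level stay private to whoever issued the request, which is the same split that filldir uses.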
Signed-off-by: ...
As mentioned before this is total overkill. The request/response flags
together with the buffer size already provide enough ways to extend the
structure in a backwards-compatible way if needed.
--
Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/namei.c | 17 +++--------------
1 files changed, 3 insertions(+), 14 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index fcec3c6..86068a2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,17 +845,6 @@ fail:
}
/*
- * This is a temporary kludge to deal with "automount" symlinks; proper
- * solution is to trigger them on follow_mount(), so that do_lookup()
- * would DTRT. To be killed before 2.6.34-final.
- */
-static inline int follow_on_final(struct inode *inode, unsigned lookup_flags)
-{
- return inode && unlikely(inode->i_op->follow_link) &&
- ((lookup_flags & LOOKUP_FOLLOW) || S_ISDIR(inode->i_mode));
-}
-
-/*
* Name resolution.
* This is the basic name resolution function, turning a pathname into
* the final dentry. We expect 'base' to be positive and a directory.
@@ -975,7 +964,8 @@ last_component:
if (err)
break;
inode = next.dentry->d_inode;
- if (follow_on_final(inode, lookup_flags)) {
+ if (inode && unlikely(inode->i_op->follow_link) &&
+ (lookup_flags & LOOKUP_FOLLOW)) {
err = do_follow_link(&next, nd);
if (err)
goto return_err;
@@ -1888,8 +1878,7 @@ reval:
struct inode *inode = path.dentry->d_inode;
void *cookie;
error = -ELOOP;
- /* S_ISDIR part is a temporary automount kludge */
- if (!(nd.flags & LOOKUP_FOLLOW) && !S_ISDIR(inode->i_mode))
+ if (!(nd.flags & LOOKUP_FOLLOW))
goto exit_dput;
if (count++ == 32)
goto exit_dput;
--
Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of directories
with follow_link semantics. This can be used by fstatat()/xstat() users to
permit the gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/namei.c | 15 ++++++++++-----
fs/stat.c | 4 +++-
include/linux/fcntl.h | 1 +
include/linux/namei.h | 2 ++
4 files changed, 16 insertions(+), 6 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 86068a2..056427e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -654,7 +654,8 @@ static int follow_automount(struct path *path, int res)
/* no need for dcache_lock, as serialization is taken care in
* namespace.c
*/
-static int __follow_mount(struct path *path, unsigned nofollow)
+static int __follow_mount(struct path *path, unsigned nofollow,
+ struct nameidata *nd)
{
struct vfsmount *mounted;
int ret, res = 0;
@@ -674,8 +675,12 @@ static int __follow_mount(struct path *path, unsigned nofollow)
}
if (!d_automount_point(path->dentry))
break;
- if (nofollow)
- return -ELOOP;
+ if (!(nd->flags & LOOKUP_CONTINUE)) {
+ if (nofollow)
+ return -ELOOP;
+ if (nd->flags & LOOKUP_NO_AUTOMOUNT)
+ break;
+ }
ret = follow_automount(path, res);
if (ret < 0)
return ret;
@@ -769,7 +774,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
done:
path->mnt = mnt;
path->dentry = dentry;
- ret = __follow_mount(path, 0);
+ ret = __follow_mount(path, 0, nd);
if (unlikely(ret < 0))
path_put(path);
return ret;
@@ -1762,7 +1767,7 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (open_flag & O_EXCL)
goto exit_dput;
- error = __follow_mount(path, open_flag & O_NOFOLLOW);
+ error = __follow_mount(path, open_flag & O_NOFOLLOW, nd);
if (error < 0)
goto exit_dput;
diff --git a/fs/stat.c ...
Make CIFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
[NOTE: THIS IS UNTESTED!]
[Question: Why does cifs_dfs_do_refmount() when the caller has already done
that and could pass the result through?]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
---
fs/cifs/cifs_dfs_ref.c | 145 +++++++++++++++++++++++-------------------------
fs/cifs/cifsfs.h | 6 ++
fs/cifs/dir.c | 2 +
fs/cifs/inode.c | 8 ++-
4 files changed, 83 insertions(+), 78 deletions(-)
diff --git a/fs/cifs/cifs_dfs_ref.c b/fs/cifs/cifs_dfs_ref.c
index 4516867..500b952 100644
--- a/fs/cifs/cifs_dfs_ref.c
+++ b/fs/cifs/cifs_dfs_ref.c
@@ -230,8 +230,8 @@ compose_mount_options_err:
}
-static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,
- struct dentry *dentry, const struct dfs_info3_param *ref)
+static struct vfsmount *cifs_dfs_do_refmount(struct dentry *mntpt,
+ const struct dfs_info3_param *ref)
{
struct cifs_sb_info *cifs_sb;
struct vfsmount *mnt;
@@ -239,12 +239,12 @@ static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,
char *devname = NULL;
char *fullpath;
- cifs_sb = CIFS_SB(dentry->d_inode->i_sb);
+ cifs_sb = CIFS_SB(mntpt->d_inode->i_sb);
/*
* this function gives us a path with a double backslash prefix. We
* require a single backslash for DFS.
*/
- fullpath = build_path_from_dentry(dentry);
+ fullpath = build_path_from_dentry(mntpt);
if (!fullpath)
return ERR_PTR(-ENOMEM);
@@ -262,35 +262,6 @@ static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,
}
-static int add_mount_helper(struct vfsmount *newmnt, struct nameidata *nd,
- struct list_head *mntlist)
-{
- /* stolen from afs code */
- int err;
-
- mntget(newmnt);
- err = do_add_mount(newmnt, &nd->path, nd->path.mnt->mnt_flags | MNT_SHRINKABLE, mntlist);
- switch (err) ...
Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/nfs/dir.c | 2 +
fs/nfs/inode.c | 1 +
fs/nfs/internal.h | 1 +
fs/nfs/namespace.c | 87 ++++++++++++++++++++++++----------------------------
4 files changed, 44 insertions(+), 47 deletions(-)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 782b431..d7e5810 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -927,6 +927,7 @@ const struct dentry_operations nfs_dentry_operations = {
.d_revalidate = nfs_lookup_revalidate,
.d_delete = nfs_dentry_delete,
.d_iput = nfs_dentry_iput,
+ .d_automount = nfs_d_automount,
};
static struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
@@ -1002,6 +1003,7 @@ const struct dentry_operations nfs4_dentry_operations = {
.d_revalidate = nfs_open_revalidate,
.d_delete = nfs_dentry_delete,
.d_iput = nfs_dentry_iput,
+ .d_automount = nfs_d_automount,
};
/*
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 8c6de96..f9737bd 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -296,6 +296,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
inode->i_op = &nfs_mountpoint_inode_operations;
inode->i_fop = NULL;
set_bit(NFS_INO_MOUNTPOINT, &nfsi->flags);
+ inode->i_flags |= S_AUTOMOUNT;
}
} else if (S_ISLNK(inode->i_mode))
inode->i_op = &nfs_symlink_inode_operations;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index d8bd619..48de6f8 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -238,6 +238,7 @@ extern char *nfs_path(const char *base,
const struct dentry *droot,
const struct dentry *dentry,
char *buffer, ssize_t buflen);
+extern struct vfsmount *nfs_d_automount(struct path *path);
/* getroot.c */
extern struct dentry *nfs_get_root(struct super_block *, struct nfs_fh *);
diff --git ...
Make automounter filesystems return FS_AUTOMOUNT_FL in st_inode_flags to
xstat().
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/autofs/init.c | 1 +
fs/autofs4/init.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/autofs/init.c b/fs/autofs/init.c
index cea5219..2c06d4b 100644
--- a/fs/autofs/init.c
+++ b/fs/autofs/init.c
@@ -23,6 +23,7 @@ static int autofs_get_sb(struct file_system_type *fs_type,
static struct file_system_type autofs_fs_type = {
.owner = THIS_MODULE,
.name = "autofs",
+ .inode_flags = FS_AUTOMOUNT_FL,
.get_sb = autofs_get_sb,
.kill_sb = autofs_kill_sb,
};
diff --git a/fs/autofs4/init.c b/fs/autofs4/init.c
index 9722e4b..43df431 100644
--- a/fs/autofs4/init.c
+++ b/fs/autofs4/init.c
@@ -23,6 +23,7 @@ static int autofs_get_sb(struct file_system_type *fs_type,
static struct file_system_type autofs_fs_type = {
.owner = THIS_MODULE,
.name = "autofs",
+ .inode_flags = FS_AUTOMOUNT_FL,
.get_sb = autofs_get_sb,
.kill_sb = autofs4_kill_sb,
};
--
Make network filesystems return FS_REMOTE_FL in st_inode_flags to xstat().
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/afs/super.c | 1 +
fs/ceph/super.c | 1 +
fs/cifs/cifsfs.c | 1 +
fs/coda/inode.c | 1 +
fs/ncpfs/inode.c | 1 +
fs/nfs/super.c | 7 +++++++
fs/smbfs/inode.c | 1 +
7 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/fs/afs/super.c b/fs/afs/super.c
index e932e5a..daaa3d4 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -40,6 +40,7 @@ static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);
struct file_system_type afs_fs_type = {
.owner = THIS_MODULE,
.name = "afs",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = afs_get_sb,
.kill_sb = kill_anon_super,
.fs_flags = 0,
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index fa87f51..f486ac8 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1019,6 +1019,7 @@ static void ceph_kill_sb(struct super_block *s)
static struct file_system_type ceph_fs_type = {
.owner = THIS_MODULE,
.name = "ceph",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = ceph_get_sb,
.kill_sb = ceph_kill_sb,
.fs_flags = FS_RENAME_DOES_D_MOVE,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index ef9a773..eb2c517 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -586,6 +586,7 @@ static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
struct file_system_type cifs_fs_type = {
.owner = THIS_MODULE,
.name = "cifs",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = cifs_get_sb,
.kill_sb = kill_anon_super,
/* .fs_flags */
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index d97f993..cb05427 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -308,6 +308,7 @@ static int coda_get_sb(struct file_system_type *fs_type,
struct file_system_type coda_fs_type = {
.owner = THIS_MODULE,
.name = "coda",
+ .inode_flags = FS_REMOTE_FL,
.get_sb = coda_get_sb,
.kill_sb = kill_anon_super,
.fs_flags = ...
Make special system filesystems return FS_SPECIAL_FL in st_inode_flags to
xstat().
Signed-off-by: David Howells <dhowells@redhat.com>
---
arch/ia64/kernel/perfmon.c | 7 ++++---
arch/powerpc/platforms/cell/spufs/inode.c | 1 +
arch/s390/hypfs/inode.c | 1 +
drivers/infiniband/hw/ipath/ipath_fs.c | 1 +
drivers/infiniband/hw/qib/qib_fs.c | 1 +
drivers/isdn/capi/capifs.c | 1 +
drivers/misc/ibmasm/ibmasmfs.c | 1 +
drivers/mtd/mtdchar.c | 1 +
drivers/oprofile/oprofilefs.c | 1 +
drivers/usb/core/inode.c | 1 +
drivers/usb/gadget/f_fs.c | 1 +
drivers/usb/gadget/inode.c | 1 +
drivers/xen/xenfs/super.c | 1 +
fs/anon_inodes.c | 1 +
fs/binfmt_misc.c | 1 +
fs/configfs/mount.c | 1 +
fs/debugfs/inode.c | 1 +
fs/fuse/control.c | 1 +
fs/hostfs/hostfs_kern.c | 1 +
fs/nfsd/nfsctl.c | 1 +
fs/ocfs2/dlmfs/dlmfs.c | 1 +
fs/openpromfs/inode.c | 1 +
fs/pipe.c | 1 +
fs/proc/root.c | 1 +
fs/sysfs/mount.c | 1 +
ipc/mqueue.c | 1 +
kernel/cgroup.c | 1 +
kernel/cpuset.c | 1 +
net/socket.c | 1 +
net/sunrpc/rpc_pipe.c | 1 +
security/inode.c | 1 +
security/selinux/selinuxfs.c | 1 +
security/smack/smackfs.c | 1 +
33 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index ...
Actually, that last is not true; FS_REMOTE_FL is per-file, not per-fs. You can
have a filesystem that has both fabricated files and remote files. For
instance, with kAFS at some point you will be able to go into /afs and do a
lookup for a directory that doesn't exist but whose name represents a
cell+volume; the filesystem will then fabricate a local directory and attempt
to mount a remote directory on to it.
David
--
Return extended attributes from the NFS filesystem. This includes the
following:
(1) The change attribute as st_data_version if NFSv4.
(2) FS_AUTOMOUNT_FL on referral/submount directories.
Furthermore, what nfs_getattr() does can be controlled as follows:
(1) If AT_FORCE_ATTR_SYNC is indicated, or mtime, ctime or data_version (NFSv4
only) are requested then the outstanding writes will be written to the
server first.
(2) The inode's attributes may be synchronised with the server:
(a) If AT_FORCE_ATTR_SYNC is indicated or if atime is requested (and atime
updating is not suppressed by a mount flag) then the attributes will
be reread unconditionally.
(b) If the data version or any of basic stats are requested then the
attributes will be reread if the cached attributes have expired.
(c) Otherwise the cached attributes will be used - even if expired -
without reference to the server.
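As a rough model of the rules above (not the code in the diff below), the two decisions nfs_getattr() makes can be written as two small functions. All names here are hypothetical, the booleans stand in for tests on stat->query_flags and stat->request_mask, and the regular-file check that guards the flush in the existing code is left out.

```c
#include <stdbool.h>

enum sync_mode {
	SYNC_NONE,		/* (2c) use the cache, even if expired */
	SYNC_IF_EXPIRED,	/* (2b) reread if the cache has expired */
	SYNC_ALWAYS,		/* (2a) reread unconditionally */
};

/* Hypothetical summary of one request. */
struct nfs_query {
	bool force_sync;	/* AT_FORCE_ATTR_SYNC was indicated */
	bool is_nfsv4;		/* data_version is only honoured on NFSv4 */
	bool noatime;		/* atime updating suppressed by a mount flag */
	bool want_atime;
	bool want_mtime;
	bool want_ctime;
	bool want_data_version;
	bool want_other_basic;	/* any basic stat other than the timestamps */
};

static bool wants_data_version(const struct nfs_query *q)
{
	return q->is_nfsv4 && q->want_data_version;
}

/* Rule (1): should outstanding writes be flushed to the server first? */
bool must_flush_writes(const struct nfs_query *q)
{
	return q->force_sync || q->want_mtime || q->want_ctime ||
	       wants_data_version(q);
}

/* Rule (2): how hard should the attributes be synchronised? */
enum sync_mode choose_sync(const struct nfs_query *q)
{
	bool any_basic = q->want_atime || q->want_mtime || q->want_ctime ||
			 q->want_other_basic;

	if (q->force_sync || (q->want_atime && !q->noatime))
		return SYNC_ALWAYS;
	if (wants_data_version(q) || any_basic)
		return SYNC_IF_EXPIRED;
	return SYNC_NONE;
}
```

One interpretation is built in: atime is treated as a basic stat, so an atime request on a noatime mount falls through to rule (2b) rather than (2c).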
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/nfs/inode.c | 46 ++++++++++++++++++++++++++++++++++------------
1 files changed, 34 insertions(+), 12 deletions(-)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 099b351..8c6de96 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -495,11 +495,21 @@ void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr)
int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
struct inode *inode = dentry->d_inode;
+ unsigned force = stat->query_flags & AT_FORCE_ATTR_SYNC;
int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
int err;
- /* Flush out writes to the server in order to update c/mtime. */
- if (S_ISREG(inode->i_mode)) {
+ if (NFS_SERVER(inode)->nfs_client->rpc_ops->version < 4)
+ stat->request_mask &= ~XSTAT_REQUEST_DATA_VERSION;
+
+ /* Flush out writes to the server in order to update c/mtime
+ * or data version if the user wants them */
+ if ((force || stat->request_mask & ...
Return extended attributes from the Ext4 filesystem. This includes the
following:
(1) The inode creation time (i_crtime) as st_btime.
(2) The inode i_generation as st_gen if not the root directory.
(3) The inode i_version as st_data_version if a file with I_VERSION set or a
directory.
(4) FS_xxx_FL flags as for FS_IOC_GETFLAGS.
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/file.c | 2 +-
fs/ext4/inode.c | 32 +++++++++++++++++++++++++++++---
fs/ext4/namei.c | 2 ++
fs/ext4/symlink.c | 2 ++
5 files changed, 36 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..96823f3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1571,6 +1571,8 @@ extern int ext4_write_inode(struct inode *, struct writeback_control *);
extern int ext4_setattr(struct dentry *, struct iattr *);
extern int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat);
+extern int ext4_file_getattr(struct vfsmount *mnt, struct dentry *dentry,
+ struct kstat *stat);
extern void ext4_delete_inode(struct inode *);
extern int ext4_sync_inode(handle_t *, struct inode *);
extern void ext4_dirty_inode(struct inode *);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 5313ae4..18c29ab 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -150,7 +150,7 @@ const struct file_operations ext4_file_operations = {
const struct inode_operations ext4_file_inode_operations = {
.truncate = ext4_truncate,
.setattr = ext4_setattr,
- .getattr = ext4_getattr,
+ .getattr = ext4_file_getattr,
#ifdef CONFIG_EXT4_FS_XATTR
.setxattr = generic_setxattr,
.getxattr = generic_getxattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..822a4ad 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5550,12 +5550,38 @@ err_out:
int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat)
{
- struct inode ...
Return extended attributes from the AFS filesystem. This includes the
following:
(1) The vnode uniquifier as st_gen.
(2) The data version number as st_data_version.
(3) FS_AUTOMOUNT_FL on mountpoint directories.
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/afs/inode.c | 13 ++++++++-----
1 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index ee3190a..02f115f 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -300,16 +300,19 @@ error_unlock:
/*
* read the attributes of an inode
*/
-int afs_getattr(struct vfsmount *mnt, struct dentry *dentry,
- struct kstat *stat)
+int afs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
{
- struct inode *inode;
-
- inode = dentry->d_inode;
+ struct inode *inode = dentry->d_inode;
_enter("{ ino=%lu v=%u }", inode->i_ino, inode->i_generation);
generic_fillattr(inode, stat);
+
+ stat->result_mask |= XSTAT_REQUEST_GEN | XSTAT_REQUEST_DATA_VERSION;
+ stat->gen = inode->i_generation;
+ stat->data_version = inode->i_version;
+ if (test_bit(AFS_VNODE_MOUNTPOINT, &AFS_FS_I(inode)->flags))
+ stat->inode_flags |= FS_AUTOMOUNT_FL;
return 0;
}
--
Add a pair of system calls to make extended file stats available, including
file creation time, inode version and data version where available through the
underlying filesystem.
[This depends on the previously posted pair of patches to (a) constify a number
of syscall string and buffer arguments and (b) rearrange AFS's use of
i_version and i_generation].
This has a number of uses:
(1) Creation time: The SMB protocol carries the creation time, which could be
exported by Samba, which will in turn help CIFS make use of FS-Cache as
that can be used for coherency data.
This is also specified in NFSv4 as a recommended attribute and could be
exported by NFSD [Steve French].
(2) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper].
(3) Heavyweight stat: Force a netfs to go to the server, even if it thinks its
cached attributes are up to date [Trond Myklebust].
(4) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd
Schubert].
(5) Data version number: Could be used by userspace NFS servers [Aneesh Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get it
from the kstat struct if it used vfs_xgetattr() instead.
(6) BSD stat compatibility: Including more fields from the BSD stat such as
creation time (st_btime) and inode generation number (st_gen) [Jeremy
Allison, Bernd Schubert].
(7) Extra coherency data may be useful in making backups [Andreas Dilger].
(8) Allow the filesystem to indicate what it can/cannot provide: A filesystem
can now say it doesn't support a standard stat feature if that isn't
available.
(9) Make the fields a consistent size on all arches, and make them large.
(10) Can be extended by using more ...
I don't think I'd call this general preference. Three of the four are fixed
length and could easily be done inside the structure if you leave a bit of
space instead of a variable-length field at the end.
For the volume id, I could not find any file system that requires more than 32
bytes here, which is also reasonable to put into the structure. Make it 36 if
you want to cover ASCII-encoded UUIDs.
That's at most 60 bytes for the extensions you're considering already, plus the
152 you have already is still less than a cache line on some machines. Padding
it to 256 bytes would make it nice and round,
I'd also still argue that 32 bits would be better since you can put them into
the argument list instead of having to use a pointer to xstat_parameters. You
only use 15 bits so far, so the remaining 17 bits should go a long way. It's
not as important to me as the
The resulting syscall I'd hope for would be
	int xstat(int dfd, const char *filename, unsigned flags,
		  unsigned mask, struct xstat *buf);
Everything else in your patch looks very good and has my full support.
Arnd
--
?
Maybe I wasn't clear: I meant having an extended stat() syscall rather than
You should also include a length. Volume IDs may be binary rather than
Which we currently allocate on the kernel stack, plus up to a couple of kstat
structs if something like eCryptFS is used. Admittedly, the base xstat struct
could be kmalloc()'d instead, but why use up all that space if you don't need
it?
David
--
unsigned? Existing filesystems support on-disk timestamps representing times
prior to the epoch.
--
Ok, I misparsed your statement there. I don't think anyone was
objecting to the use of xstat for this.
The controversial part is only how the extension happens. I would
already feel better about it if you just dropped the
'unsigned long long st_extra_results[0];' at the end and
added a comment saying that the structure may grow in the future, though
Yes, maybe. There are several possible encodings for this. I was actually
thinking of fixed-length string rather than zero-terminated, but that
is possible as well. If this gets added, we need to audit every possible
use to make sure each of them is covered. My point was mostly that if we
If you're worried about stack utilization, xstat could also be embedded into
kstat, like
	struct kstat {
		u64 request_mask;
		struct xstat x;
	};
Then you only need one of them on the stack for sys_xstat, or have both
struct kstat and struct stat/stat64 for the other syscalls.
Arnd
--
You could also define the tv_gran_units to be power-of-ten nanoseconds,
making it a decimal floating point number like
	enum {
		XSTAT_NANOSECONDS_GRANULARITY = 0,
		XSTAT_MICROSECONDS_GRANULARITY = 3,
		XSTAT_MILLISECONDS_GRANULARITY = 6,
		XSTAT_SECONDS_GRANULARITY = 9,
	};
That would make it easier to define an xstat_time_before() function, though
it means that you could no longer do XSTAT_MINUTES_GRANULARITY and
I wouldn't even go that far if we needed sub-ns (I don't think we do), because
that breaks old compilers that cannot do bit fields.
Arnd
--
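To make the xstat_time_before() idea concrete, here is a minimal sketch in C. It assumes the granularity is stored as a power-of-ten exponent in nanoseconds (0 to 9), as in the enum above. The struct ex_time layout and both helper names are hypothetical and not taken from the patch. Two timestamps are compared only after both are rounded down to the coarser of the two granularities, so times that fall in the same granule compare as equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timestamp: granularity is 10^gran_exp nanoseconds. */
struct ex_time {
	int64_t  tv_sec;
	uint32_t tv_nsec;	/* 0..999999999 */
	uint8_t  gran_exp;	/* 0 = ns, 3 = us, 6 = ms, 9 = s */
};

/* Round tv_nsec down to a multiple of 10^exp nanoseconds (exp 0..9). */
static uint32_t ex_truncate_nsec(uint32_t nsec, unsigned int exp)
{
	uint32_t granule = 1;

	assert(exp <= 9);
	while (exp--)
		granule *= 10;
	return nsec - nsec % granule;
}

/* True if a is strictly before b at the coarser of the two granularities. */
static bool ex_time_before(const struct ex_time *a, const struct ex_time *b)
{
	unsigned int exp = a->gran_exp > b->gran_exp ? a->gran_exp : b->gran_exp;

	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec;
	return ex_truncate_nsec(a->tv_nsec, exp) <
	       ex_truncate_nsec(b->tv_nsec, exp);
}
```

With this rule, a nanosecond-resolution timestamp and a one-second-resolution timestamp within the same second are neither before nor after each other.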
So you're thinking of indicating time (in)equality based on overlapping time
granules?
Your suggestion would suffice, I think. With a 2:2 split between exponent
(tv_gran_units) and mantissa (tv_granularity), you can do:
	UNIT          SECONDS/UNIT  EXPONENT  MANTISSA
	nanoseconds   0.000000001   -9        1
	microseconds  0.000001      -6        1
	milliseconds  0.001         -3        1
	seconds       1             0         1
	minutes       60            1         6
	hours         3600          2         36
	days          86400         2         864
	weeks         604800        2         6048
Any units beyond that are variable length and not worth considering, IMO.
And if you don't want negative numbers in your exponent, you can make the base
unit nS instead of S.
Is it worth allowing a filesystem to indicate that it has granularity smaller
than nS, even if the resolution can't be handled here? We could even have:
	struct xstat_time {
		signed long long tv_sec;	/* seconds */
		unsigned int tv_nsec;		/* nanoseconds */
		unsigned char tv_psec4;		/* picoseconds/4 */
		signed char tv_gran_exp;	/* exponent */
		unsigned short tv_gran_mant;	/* mantissa */
	};
Though it's probably still an unnecessary extravagance to have the pS field.
It's probably best left as padding for now; we can always change our minds
later...
David
--
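The exponent/mantissa encoding in the table above can be checked with a small conversion helper. This is only a sketch: it assumes the granularity means mantissa * 10^exponent seconds, with the exponent limited to -9..9, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert a (mantissa, exponent) granularity, read as mant * 10^exp seconds,
 * into nanoseconds.  Returns false if the exponent is outside -9..9 or the
 * result does not fit in 64 bits.
 */
static bool ex_gran_to_ns(uint16_t mant, int8_t exp, uint64_t *ns)
{
	uint64_t v = mant;
	int i;

	if (exp < -9 || exp > 9)
		return false;
	/* Shift the base unit from seconds to nanoseconds: 10^(exp + 9). */
	for (i = 0; i < exp + 9; i++) {
		if (v > UINT64_MAX / 10)
			return false;
		v *= 10;
	}
	*ns = v;
	return true;
}
```

For example, the "weeks" row (mantissa 6048, exponent 2) comes out as 604800 seconds expressed in nanoseconds.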
No, just tv_granularity. Most users won't need to care that this
Yes, for example rsync could use this to determine whether a local (e.g. FAT)
and a remote (e.g. NFS) file are identical or not. Right now, you can pass
the granularity in seconds as a command line argument, but it would be nice
There are also two extra bits in tv_nsec ;-). No, I don't think we
need picoseconds any time soon.
One byte padding might not be the worst thing to have in here, like
	struct xstat_time {
		signed long long tv_sec;	/* seconds */
		unsigned int tv_nsec;		/* nanoseconds */
		unsigned short tv_gran_mant;	/* mantissa */
		signed char tv_gran_exp;	/* exponent */
		unsigned char unused;
	};
Arnd
--
At least for the in-tree filesystems, I do not see any that keep timestamps with a granularity larger than 2s. For that, a simple 32-bit tv_granularity in nanoseconds (not limited to 1e9) would suffice, and there is no need for the complexity of dealing with a separate exponent. If there is a need to handle larger granularity, its msb could potentially be used to indicate that the number is in seconds instead of nanoseconds. This is convenient because the timestamp is already broken down into sec and nsec fields. So this bit would then indicate that the granularity applies to the tv_sec field, and that tv_nsec is not in use. But even this is overkill if no one uses a granularity larger than 2s. - Mark --
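A sketch of how the single 32-bit tv_granularity field described above might be decoded. The flag name and helper are hypothetical. The only assumption carried over from the message is that the most significant bit switches the unit from nanoseconds to seconds.

```c
#include <stdint.h>

/* Hypothetical flag: when the msb is set, the low 31 bits count seconds. */
#define EX_GRAN_IN_SECONDS 0x80000000u

/* Decode a 32-bit granularity value into nanoseconds. */
static uint64_t ex_granularity_ns(uint32_t tv_granularity)
{
	if (tv_granularity & EX_GRAN_IN_SECONDS)
		return (uint64_t)(tv_granularity & ~EX_GRAN_IN_SECONDS) *
		       1000000000ULL;
	return tv_granularity;
}
```

A 2s granularity still fits in the plain nanosecond form (2000000000 is below 2^31), which is why the flag would only be needed for anything coarser.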
Yes, good point. That would indeed be a significant simplification. Arnd --
Adding Uli to the Cc list to make sure this system call is useful for glibc / can be exported by it. Otherwise it's rather pointless
Why make them large for the sake of it? We'll need massive changes all through libc and applications to ever make use of this. So please
Just pass this as a single flag by value. And just make it an unsigned
No point in adding special types here that aren't generically useful. Also this is the first and only system call using split major/minor
What's the point of the REQUEST in the name? Also no double underscores inside the identifier. Instead adding a _MASK postfix
Please don't overload the FL_ namespace even more. It's already a complete mess given that it overloads the extN on-disk namespace.
If you already have a buflen parameter there is absolutely no need for the extra results field. Just define new fields at the end and include them if the bufsize is big enough and it's in the mask of requested
Why add a special case like that? Especially if we make the request
Please don't introduce tons of special cases. Instead use a simple rule like:
- a filesystem must return all attributes requested, or return an error if it can't.
- a filesystem may return additional attributes, the caller can detect this by looking at st_mask.
plus possibly a list of attributes the filesystem must be able to provide if requested. I don't see a reason to make that mask different from the attributes required by POSIX.
--
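The simple rule proposed above implies that the caller can work everything out from two masks. A minimal sketch of the caller-side checks follows. The attribute bit names are hypothetical and chosen only for illustration; st_mask is the result mask named in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical attribute bits, for illustration only. */
#define EX_ATTR_MODE	0x01u
#define EX_ATTR_SIZE	0x02u
#define EX_ATTR_MTIME	0x04u
#define EX_ATTR_BTIME	0x08u

/* Did the filesystem fill in every attribute the caller needs? */
static bool ex_have_all(uint32_t st_mask, uint32_t wanted)
{
	return (st_mask & wanted) == wanted;
}

/* Attributes that were returned without having been requested. */
static uint32_t ex_extras(uint32_t st_mask, uint32_t requested)
{
	return st_mask & ~requested;
}
```

With a result mask like this, a caller can also tell a genuinely zero value apart from a field that the filesystem never filled in, by testing the corresponding bit before reading the field.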
Given xstat.otime=0, how would you determine whether the file is really tagged with a date of 1970, or whether it's just the fs which did not store this kind of information?
--
I was thinking more of stuff that's already in the Linux stat struct, some of which is fabricated because the underlying fs doesn't support it. Take RomFS for example: it fabricates all of st_mtime, st_atime, st_ctime, st_nlinks, st_blocks, st_uid and st_gid because none of them are stored in the medium Similarly, UbiFS fabricates st_blocks and complains in a comment that it makes no sense for that type of filesystem. There are other examples. David --
There are extra dates and version numbers potentially available. This may be
So that you can decide not to use it. Some of our filesystems fabricate things
Otherwise we end up with #ifdefs and duplicated fields of different sizes
within stat structs, and fields of "long" types which vary in size, depending
on the environment.
I just want to make sure that:
- st_ino is stored as 64-bit
- st_size and st_blocks are stored 64-bit
- st.{a,b,c,m}time.tv_sec are stored 64-bit
We could probably stand to make st_blksize 32-bit. I'd quite like to leave
I can perhaps agree on the device numbers, though some filesystems we have can
store numbers that can't be represented by dev_t. I think, however, everything
we have can be handled by a 32:32 split. The numbers could then be encoded as
desired in userspace.
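As an illustration of the 32:32 split, the following sketch packs separate 32-bit major and minor numbers into one 64-bit value and back. This packing is an assumption made for the example; it is not the encoding glibc uses for dev_t, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Pack 32-bit major and minor device numbers into one 64-bit value. */
static uint64_t ex_dev_pack(uint32_t maj, uint32_t mnr)
{
	return ((uint64_t)maj << 32) | mnr;
}

/* Recover the major number from a packed value. */
static uint32_t ex_dev_major(uint64_t dev)
{
	return (uint32_t)(dev >> 32);
}

/* Recover the minor number from a packed value. */
static uint32_t ex_dev_minor(uint64_t dev)
{
	return (uint32_t)(dev & 0xffffffffu);
}
```

Because both halves are full 32-bit fields, userspace can re-encode them into whatever dev_t layout it prefers without losing bits.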
The problem with using extant time structs is they use "long" or "unsigned
long". And I specifically want to get away from that, since it might be
Perhaps, but it contrasts nicely with request_mask, and makes it easier to
Firstly: Lightweight stat: I want to say that the filesystem may return data
that is out of date if it isn't asked for specifically, but the filesystem has
a copy available. But I'm not sure that this should apply to non-standard
fields.
Secondly: It doesn't matter what POSIX wants; not all filesystems we support
have everything available. Where something that's standard is not available,
we have the opportunity to indicate this, whilst still providing a fabricated
result, so that the user can take note of this fact if they choose to, whilst
totally ignoring the indication if they prefer, and just using the fabrication.
David
--
Ugh. So I think this is pretty disgusting. For a few reasons:
- that whole xstat buffer handling is just a mess. I think you already fixed the "xstat_parameters" crud and just made it a simple unsigned long and a direct argument, but the "buffer+buflen" thing is still disgusting. Why not just leave a few empty fields at the end, and make the rule be: "We don't just add random crap, so don't expect it to grow widely in the future".
- you use "long long" all over the place. Don't do that. If you want a fixed size, say so, and use "u64/s64". That's the _real_ fixed size, and "long long" just _happens_ to be the same size on all current architectures. Put another way: "long" just _happened_ to be 32 bits way back when on pretty much all targets. That's where all the 64-bit compatibility mess came from. Don't make the same mistake. Besides, if the point is to make things be the same, _document_ that point by using a type that is explicitly sized.
- why create that new kind of xstat() that realistically absolutely nobody will use outside of some very special cases, and that has no real advantages for 99.9% of all people? You could make it an "atomic stat+open" by replacing the useless "size" return value with a "fd" return value, add a flag saying "we're also interested in opening it" (in the same result set flags), and instead of that stupid "buflen" input, give the "mode" input that open needs. Tadaa! You now have something that more people might be interested in, if only because it avoids a system call and might be a performance win. Who knows. Ask the Wine people what strange
Quite frankly, my gut feel is that once you do "xstat(dfd, filename, ...)" then it's damn stupid to do a separate "fxstat()", when you might as well say that "xstat(dfd, NULL, ...)" is the same as "fxstat(fd, ...)"
Now, the difference between adding one or two system calls may not be huge, but just from a cleanliness angle, I really don't see the point of having another ...
I was thinking more of an unsigned int argument, since it can't have more than
Because it gets allocated on the kernel stack. It's already 160 bytes, and
expanding it will eat more kernel stack space. Now, I can offset that by: (a)
embedding it in struct kstat so that we allocate less stack space in xstat()
overall, and (b) allocating kstat/xstat structs with kmalloc() rather than on
I was following struct stat/stat64 in arch/x86/include/asm/stat.h which do the
same. Also, if this is going to be seen by userspace, isn't it better to use
The new information is useful for some cases. Samba for example. At least
two of the fields I'm adding are also made available through BSD's stat()
call, and will automatically be used for some things by autoconf magic if they
become available.
I'm still trying to get a handle on what people think will be truly useful. I
can see things *could* be useful, particularly to GUI file managers and ls,
but not everyone is of the same opinion.
Perhaps you or others can offer answers to the following questions as these
might help:
(1) Should I offer information that's effectively free to come by, but could
be got through:
(a) An extra statfs() call - such as whether a file is remote, whether
it's some kernel special file? Or what the volume label is for this
file?
(b) An extra getxattr() call - such as a file's security label.
(c) An extra ioctl() call - such as FS_IOC_GETFLAGS.
(2) Should I offer information that's appropriate to non-UNIX filesystems
such as FAT, NTFS or CIFS. Some of this may map onto other fields, such
as FS_IOC_GETFLAGS.
(3) Should I offer information about which results that I've returned are
actually useful, as opposed to being fabricated on the spot? Such as
UID/GID in FAT or blocks in UBIFS. This may be of use to df or a GUI.
For instance, a GUI, seeing that UID/GID aren't useful, could ask the
filesystem to provide information ...Using implementation issues like that as a reason for some odd interface that we'll have to live with for the next decades sounds bad. It's basically a broken form of versioning, since if you end up using buffer sizes, everybody will just use "sizeof()" except for some random crazy developer that decides to re-use a buffer they use for something else, and then use the size of that instead. End result: the kernel gets passed in some random constant that depends on just which version of glibc they were compiled against _or_ on just how crazy they were. And it all just encourages people to do odd things. For example, the glibc developers, who love adding their own random fields for crazy "forwards compatibility", will start extending the xstat structure on their own and then just pass in the larger size and emulate a few new fields à la that whole vfstat thing. And then if/when we want to extend on it, we're screwed. So making it fixed is not only simpler, it avoids all the "I'm passing in random integers" crud. You don't need to allocate the whole thing inside the kernel anyway. Quite the reverse. You probably want to continue using the kernel "kstat" interface with some extensions. That's the point of kstat, after all - allowing the filesystem interfaces to share _one_ interface rather than having new interfaces at the VFS level for every damn new stat implementation we have to do for user space. In short, your stack space usage is all totally bogus. You should copy the kstat to the user xstat one field at a time, and NOT allocate an xstat on the kernel stack at all. There is no advantage to using "memcpy_to_user()" (after having filled in the kernel struct one field at a time) over just filling in the user struct directly. Just do "access_ok() + several __put_user() calls", in other words. I think you wanted to use "memcpy_to_user()" just because you had that broken "bufsize" argument to begin with. 
If you get rid of the bufsize, you also get rid of the potential ...
That's not what I meant at all. I meant there may be things out there that will just use st_btime and st_gen as soon as they appear without anything having to be done to them because these fields already exist in the BSD stat struct. Samba is such an example as this. It will use st_btime immediately if it Not having ls cause a mass automount just because you did an ls of a directory Perhaps. As previously mentioned, BSD (and other unices) already make some of these fields available (notably st_btime and st_gen). We could also make a I suspect they would, though maybe they can say otherwise. What about SMB directory enumeration? I believe that is effectively getdents-with-stat. Having to do open+stat for each file for that would be painful. David --
Yeah, but do you need xstat information at all for something like
that? Most people try very hard to make do with the information
returned by readdir itself (d_type and inode number), because if you
end up looking up each name you've already pretty much lost in a
performance model.
(And I do agree that a "readdirplus()" is probably something that a
lot of server people would find useful, but obviously that's another
cross-filesystem nightmare. Only a few filesystems can cheaply give
you anything but d_type/d_ino, and not all do even that),
Linus
--
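For reference, this is roughly what "making do with readdir" looks like in userspace. The sketch uses only the standard opendir()/readdir() calls and the d_type field that glibc exposes; the helper name is hypothetical. Entries whose type the filesystem does not report come back as DT_UNKNOWN, and those are the ones that would still need a stat call.

```c
#define _DEFAULT_SOURCE		/* for d_type and the DT_* constants on glibc */
#include <dirent.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the subdirectories of path using only what readdir() returns.
 * Entries of unreported type (DT_UNKNOWN) are counted separately, since a
 * real tool would have to stat those.
 * Returns 0 on success, -1 if the directory cannot be opened.
 */
static int ex_count_subdirs(const char *path, size_t *dirs, size_t *unknown)
{
	DIR *d = opendir(path);
	struct dirent *de;

	if (!d)
		return -1;
	*dirs = *unknown = 0;
	while ((de = readdir(d)) != NULL) {
		if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
			continue;
		if (de->d_type == DT_DIR)
			(*dirs)++;
		else if (de->d_type == DT_UNKNOWN)
			(*unknown)++;
	}
	closedir(d);
	return 0;
}
```

On filesystems that fill in d_type, the unknown count stays at zero and no per-name lookup is needed at all.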
This lightweight stat() interface is exactly needed for things like "color ls", Having a readdirplus() syscall would be even better, but again only with the ability to request specific attributes. Otherwise the filesystem may be doing a lot of extra work to collect all of the file attributes, and then userspace will probably be throwing most of them away. Cheers, Andreas --
It is? It's called crtime in Ext4. st_btime, however, would be compatible with BSD's stat, and Samba would just use it by way of autoconf magic if it appeared. David --
Samba has the following check:
	# recent FreeBSD, NetBSD have creation timestamps called birthtime:
	AC_CHECK_MEMBERS([struct stat.st_birthtimespec.tv_nsec])
	AC_CHECK_MEMBERS([struct stat.st_birthtime],
	AC_CHECK_MEMBERS([struct stat.st_birthtimensec]))
and the supporting code around that. "birth" might also be where the "b" comes from :-)
Volker
--
Of course you can find remnants of btime in Linux's BSD-style task accounting, but Linux always looked more like SysV than BSD, speaking for otime. And if you are using autoconf, the cost of using otime over Well, in all reference to the Matrix movie, files aren't born. Except for Directory Default ACLs and possibly security labels, they usually don't inherit either :) And on a CS level, it's more like copy than inherit, because if the parent changes, the file does not (with the potential exception of security relabeling, bla). --
On Thu, Jul 22, 2010 at 5:17 AM, Volker Lendecke
Oh wow. And all of this just convinces me that we should _not_ do any
of this, since clearly it's all totally useless and people can't even
agree on a name.
Let's wait five years and see if there is actually any consensus on it
being needed and used at all, rather than rush into something just
because "we can".
Linus
--
The nice thing about this is also that if this is supposed to be fully usable for Windows clients, the birthtime needs to be changeable. That's what NTFS semantics gives you, thus Windows clients tend to require it. Just as a hint, nothing that Linux should necessarily have to be bothered with, this is Samba's duty :-) Volker --
On Thu, Jul 22, 2010 at 8:36 AM, Volker Lendecke
Ok. So it's not really a creation date, exactly the same way ctime
isn't at all a creation date.
And maybe that actually hints at a better solution: maybe a better
model is to create a new per-thread flag that says "do ctime updates
the way windows does them".
So instead of adding another "btime" - which isn't actually what even
windows does - just admit that the _real_ issue is that Unix and
Windows semantics are different for the pre-existing "ctime".
The fact is, windows has "access time", "modification time" and
"creation time" _exactly_ like UNIX. It's just that the ctime has
slightly different semantics in windows vs unix. So quite frankly,
it's totally insane to introduce a "birthtime", when that isn't even
what windows wants, just because people cannot face the actual real
difference.
Tell me why we shouldn't just do this right?
Linus
--
On Thu, Jul 22, 2010 at 11:47 AM, Linus Torvalds
I haven't been keeping up with this thread, but I believe NTFS has a number of timestamps, not just 3. This blog post references 8 in the left hand column. The 4 standard (most common) ones are:
	File last access
	File last modified
	File created
	MFT last modified
My understanding is that "MFT last modified" has semantics very similar to Linux ctime. But there is not a generic equivalent to NTFS created. Thus if trying to have the Linux kernel match NTFS semantics for the benefit of Samba is the goal, it seems a new field should be preferred instead of having linux ctime try to do different jobs.
Greg
--
I forgot the blog post url: http://blogs.sans.org/computer-forensics/2010/04/12/windows-7-mft-entry-timestamp-prop... --
No, ctime isn't the same as Windows "create time". Windows "create time" semantics are that the timestamp is set to current time on file creation, but afterwards anyone with sufficient access can then modify it (!). Which is different from the "birthtime" spec on *BSD, as they can't be modified. Currently on *BSD we look for our special EA containing any modified create times on a file, and return that as "create time" if found, if not we return the st_birthtime from the stat struct. That works well enough for systems where you don't want to allow birthtime to be changed. Having said that I'm not sure how they cope with doing restores to a filesystem where you would need to set st_birthtime :-). Jeremy. --
Umm. What kind of reading problems do you guys have? I know effin well that ctime isn't the same as Windows create time. THAT WAS MY POINT.
But the fact is, the Unix ctime semantics are insane and largely useless. There's a damn good reason almost nobody uses ctime under unix. So what I'm suggesting is that we have a flag - either per-process or per-mount - that just says "use windows semantics for ctime".
And yes, I'm very aware that the "c" in ctime doesn't stand for "create". But anybody who points that out is - once more - totally missing the point.
My point is that we have three timestamps, and windows wants three timestamps (somebody claims that NTFS has four timestamps, but the Windows file time access functions certainly only show three times, so any potential extra on-disk times have no relevance because they are invisible to pretty much everybody). We can have unix semantics for mtime/atime/ctime, or we can have windows semantics for those three values.
So let's say that we introduce a mount flag that says "ctime=winctime", which basically just sets a flag that instead of changing ctime on chmod/chown/etc, it just changes mtime instead (or, as mentioned, we could make it a process flag instead).
Let's face it, Unix semantics are not sacred. Especially not something like ctime, which is pretty damn useless. If you're a samba server, why not just say "let's do ctime the way windows does creation times", and let it be at that?
I personally think that Unix ctime is insane. There is no real reason why "write()" should change mtime, but "chmod" changes ctime. It was just a random decision way back when, and it's clearly not what samba wants, and it's equally clearly not what even most _unix_ people want (just google for "ctime" and "creation time", and watch the confusion - exactly because unix semantics are simply _random_ and odd semantics in this area)
I would not be at all surprised if it turns out that people might want to really turn ctime into ...
I beg to differ. ctime is not completely useless. It reflects changes on the inode when you don't change the content. It's like an mtime for the metadata. It comes in useful when you go around in your filesystem trying to figure out who of your co-admins screwed up the permissions on /etc/passwd... and if the mtime is the same as that of the last backup, I can at least have a reasonable assurance that it was /only/ the metadata that was tampered with. (SHA1 check, yeah yeah, costly on large --
Errr... Only if you eliminate utimes() from your syscall table. Otherwise it is trivial to reset the mtime after changing the file contents. Cheers Trond --
Well yes; I had implicitly implied that evil people with malicious intent are absent. --
Uh. Yes. Except that why is file metadata really different from file data? Most people really don't care. And a lot of people have asked for creation dates - and I seriously doubt that Windows people complain a lot about the fact that there you have mtime for metadata changes too. The point being that Unix ctime semantics certainly have well-defined semantics, but they are in no way "better" than having a real creation time, and are often worse. Just imagine what you could do as an MIS person if you actually had a creation time you could somewhat trust? You talk about seeing somebody change the permissions of /etc/passwd, but realistically, absent preexisting semantics, who would really ask for that? The only reason you mention that as an example of what you can do with ctime is that that is indeed pretty much the _only_ thing you can do with ctime, and it really isn't that useful. In contrast, with a creation date, you see the difference between people overwriting files by writing to them, or overwriting files by creating a new one and moving it over the old one. At a guess, that would be quite as useful to a sysadmin as ctime is now (my gut feel is that it would be more so, but whatever). IOW, there really isn't anything magically good about UNIX ctime semantics, and in fact they are totally broken in the presence of extended attributes (that's file data, but it only changes ctime? WTF is up with that? Yes, I know why it happens, and it makes sense within the insane unix ctime rules, but no way does it make sense in a bigger picture unless you are in total denial and try to claim that xattrs are just metadata despite having contents). And yes, I am also sure that there are applications that do depend on ctime semantics. Trond mentioned NFS serving, and that's unfortunate. I bet there are others. That's inevitable when you have 40 years of history. 
So I'm not claiming that re-using ctime is painfree, but for somebody that cares about samba a lot, I bet it's a _lot_ ...
Samba mostly ignores ctime, for just the reasons you mention. But re-using ctime as create time will lead to more horrible confusion (IMHO). Easier to add a btime field to stat (or whatever you want to call it), especially as some of the filesystems already support it, the code for it exists inside Samba and is working on other UNIX-style OS'es, and for filesystems that don't support it, just return Yep. We even have to do that on systems with an immutable btime to get Windows semantics. Jeremy. --
Yeah, having create time would be important. That said, having a non user-settable modify timestamp is crucial for quickly determining whether a file has changed. --
How would "cp --archive" and a host of backup/restore tools work without user-settable modify timestamps? Or are you proposing another timestamp? I do computer forensics, I like timestamps, but enough is enough. Greg --
mtime and atime are already user settable and archive programs use this on the destination, but ctime would be different after copy/restore. When updating the archive, just comparing mtime to determine if the source changed is problematic as it can be set to any value after the change, but src.ctime would be greater than dest.ctime in this case. With posix semantics (http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_07) this is not perfect either as there can be false-positives when the file stat changed but the file has not, e.g. when st_nlink changed. --
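The comparison described above can be written down as a small predicate over two struct stat results. This is only a sketch of the heuristic as stated in the message, with a hypothetical function name: it assumes the destination's mtime was set to match the source at copy time, so a source ctime newer than the destination ctime signals a change even when mtime was reset afterwards.

```c
#include <stdbool.h>
#include <sys/stat.h>

/*
 * Heuristic: does the source need to be copied again?
 * src is the original file, dst is the archived or restored copy.
 */
static bool ex_source_may_have_changed(const struct stat *src,
					const struct stat *dst)
{
	/* Differing size or mtime is the usual, cheap indication. */
	if (src->st_size != dst->st_size || src->st_mtime != dst->st_mtime)
		return true;
	/*
	 * mtime can be set back by the user, but ctime cannot: any change
	 * made to the source after the copy leaves src ctime > dst ctime.
	 */
	return src->st_ctime > dst->st_ctime;
}
```

As the message notes, this can report false positives, since metadata-only changes such as a new hard link also advance the source's ctime.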
On Thu, Jul 22, 2010 at 1:24 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
But Windows doesn't work that way for I'm fairly sure. Windows' mtime is only affected by file content updates. (I don't know about xattr updates).
If you look at the first and fourth rows of the table at:
http://blogs.sans.org/computer-forensics/2010/04/12/windows-7-mft-entry-timestamp-prop...
You see that there are a number of activities that update the "$STD Info MFT Entry Modified Field" that don't update the "$STD Info Modification Time"
Again, "$STD Info MFT Entry Modified Field" has semantics close to linux ctime. And "$STD Info Modification Time" similar to mtime.
I don't know if there are APIs to present MFT Entry Modified to user space or if Samba uses that info. I just know it's part of the on-disk NTFS filesystem data.
Greg
--
On Thu, 22 Jul 2010 10:24:17 -0700
Much as I despise xattrs, this would definitely be my preference.
ctime and mtime have real cache-coherence semantics which require them being updated by the kernel (whether the cache is on an NFS client, in a backup archive, or in a .o translation of a .c file).
create-time, on the other hand, would never be updated by the kernel, and might sometimes be updated by an application. So it is a very different sort of attribute, much like a hypothetical 'last archived' time.
The only role the kernel might have would be setting the 'creation time' when the file was created, but it seems even that isn't always what is wanted, because people don't so much want the time of creation of the container-on-disk, but the time of creation of the data-content.
I would want to see a pretty convincing use-case that cannot be solved with xattrs before 'creation time' was added to a generic kernel interface.
So just use xattrs and don't involve the kernel in any detailed knowledge of this value.
Maybe xstat should take a list of xattrs to be retrieved as well?? or maybe not.
But I hope the xstat debate doesn't get bogged down about whether 'create time' is sensible or not. Quite apart from the ability to return more attributes, I think it has real value in being able to return fewer attributes, and being allowed to ask for 'best guess' values. Being able to do an 'fstat' and being certain that you won't be blocked by a non-responsive NFS server would be a GOOD THING (TM).
NeilBrown
--
So does creation time, at least for CIFS caching. Creation time has potential for spotting when the object at a pathname has changed for something else, given the lack of inode number and inode generation from windows servers. That should be a timestamp in the content itself, not a filesystem metadata Then there's no point even considering this. You could emulate the entirety of stat() with getxattr(). I've previously posted a patch to implement the retrieval of creation time, inode gen and data version as xattrs and been told Why not? BSD has it in its stat struct. Windows has it in its Win32 equivalents. Samba for one will look for it there, and use it if it is. Using an xattr means an extra pathwalk and extra locking per access for any program that wants it. It's a reasonable bet such a program will also be stat'ing the file it wants the creation time for. If we are going to extend stat anyway, then why not make out a short list of extra things we could usefully return and consider adding them? Something like creation time is reasonably easy to come by for little extra overhead. The idea of xstat() having a variable-length buffer and variable arguments has been well derided. It ain't going to happen, much though I'd like it to. I'd quite like to offer the opportunity to return the security label, for example. David --
On Wed, 28 Jul 2010 18:28:02 +0100 This justifies for me why a CIFS client would want to extract the creation-time from the CIFS protocol, but not why you want to expose it via a generic interface. The kernel/filesystem doesn't need to maintain creation-time to meet this need, only the CIFS server needs to maintain it - the kernel/filesystem just needs to provide somewhere to store it - xattrs. Given that we have an extensible attribute framework, it seems wrong to be adding new attributes to *stat. If a given filesystem wants to store certain attributes more efficiently, then it is welcome to intercept xattr calls and store (say) "cifs.birthtime" directly at a known offset in the inode. The flip-side of extracting these various attributes is setting them. One presumably doesn't want to set st_data_version and possibly not st_gen, but there seems to be a need to set st_btime and FS_SYSTEM_FL and FS_TEMPORARY_FL might want to be set. Your xstat doesn't give any way to do that, xattrs already does - you just need to define names for the attributes. So I'm against adding new attributes that simply involve the fs storing some information for the application to use. I'm still pondering those extra flags: FS_SPECIAL_FL FS_AUTOMOUNT_FL FS_AUTOMOUNT_ANY_FL FS_REMOTE_FL FS_ENCRYPTED_FL FS_OFFLINE_FL They sound like they might be useful, they are not file-metadata (like btime) but rather implementation details (like st_blocks). So it is probably sensible to include them as you have done. However I would really like to see clear and complete documentation for them. When exactly should a filesystem set these flag, and what exactly can an application assume if they are (or are not) set. If a filesystem is mounted on an network-block-device, or a loop-back of a file on NFS, is FS_REMOTE_FL set? Is ROT13 enough for FS_ENCRYPTED_FL to be set? If the NFS server is "not responding, still trying", should FS_OFFLINE_FL get set on all files? And I cannot even ...
For what it's worth, the NFSv4 server would also export creation time if we had it. --b. --
On Thu, 29 Jul 2010 09:04:01 +1000
The problem with the above approach is that you're assuming that the data in question is always accessed via the CIFS server. If someone comes along and messes with the data outside of CIFS, then samba won't have knowledge of that and the birthtime will be wrong.
There's some history behind this as well -- samba tracks windows ACLs via xattr and it can be very problematic keeping those up to date when the data is accessed outside of samba.
I think presenting this data via xattr makes the most sense. It's simple and as Neil points out, it also provides us with a clearly settable interface. If we ever get an xstat-like syscall, we can always present the same data via that as well.
I also think it's quite reasonable to consider tracking birthtime in a generic inode field. In the absence of that, filesystems could track this themselves in their filesystem-specific inode structs.
Furthermore, I'll go ahead and propose the following (simple) semantics:
1) birthtime is initialized to the current time when a new inode is created
2) it's settable via the xattr to an arbitrary value
Either way, the xattr for this ought to be named the same on all filesystems. Samba shouldn't need to know or care what the underlying filesystem is, as long as it presents the correct xattr.
That should make samba happy, and be reasonably simple to implement.
--
Jeff Layton <jlayton@redhat.com>
--
It would also be easier for NFSD if the creation time was in struct kstat.
It's included as an optional element in NFSv4. The same goes for the data
version number. I'm not sure about the inode generation, I suspect that's used
as part of the FH construction.
However, someone was talking about a userspace NFS daemon, and there they may
want all three bits. Even Samba may want multiple bits. Calling getxattr
multiple times per file starts to add up, even for internal values.
Consider further: NFS, for example, could be made to retrieve the creation time
from the server. This can be merged with the attribute fetch done by the
getattr() call, or it could be done separately by getxattr. Unless it's stored
in RAM, that's one NFS RPC op versus two. Okay, that's a bit of an artificial
It's not attribute storage I'm thinking about, but making attribute retrieval
I acknowledge that if we went down the getxattr() route, then that
automatically makes setxattr() the obvious candidate for setting things.
But think about it another way: what if you want to set several attributes?
You have to make a bunch of setxattr() calls. But what if it were possible to
do all of chmod, chgrp, chown, truncate, utimes, set_btime, etc. all in one go,
atomically? We more or less have this internally in the kernel, and it might
stand to be exposed to userspace.
I've split these away from ioc flags as ioc flags is very ext2/3/4 centric, and
those filesystems happily create their own ioc flags sets without updating the
Yeah. I have plans to write documentation for it, but I'd like to have a
clearer idea of what the interface might be before doing that.
But to give you an idea of the flags:
(*) FS_SPECIAL_FL - Kernel API file from a quasi-filesystem such as /proc or
/sys - the sort of thing you might not want to expose through NFSD.
(*) FS_AUTOMOUNT_FL - A named automount/referral point. You attempt to
transit this directory and the backing fs will mount something over ...
On Thu, 29 Jul 2010 17:15:15 +0100
Thanks for these. It particularly helps when you identify how the flag might
be used - guiding GUI icon choice is certainly valid and tells me that if I
don't set the flag 'correctly' (maybe because it is too difficult) then it
isn't the end of the world.
I get the AUTOMOUNT distinction too - FS_AUTOMOUNT_ANY_FL would be good for a
GUI as it could allow you to type in a filename for it to try to follow.
I'm not sure exactly how FS_ENCRYPTED_FL would be used - if the gui might be
prompted to ask for a key there would either need to be a completely general
interface for presenting keys, or the flag should be specific to CIFS and
should mean that a key must be given to CIFS to unlock the file.
Similarly, what can you do with an OFFLINE file? Do CIFS and AFS offline
files behave the same way? If not there should be two different flags. If
so then that behaviour should be specified with the flag ... unless this flag
is just for GUI cosmetics too.
Anyway, I've been thinking more about this and have refined my position
somewhat. I'll present it here for what it is worth - feel free to ignore
bits you don't like.
Your proposed 'xstat' seems to combine a number of different goals - doing
that is always a bit dangerous as you have to defend it on multiple fronts...
I see the separate goals are:
A/ allowing attributes to be accessed independently - an explicit list of
required attributes is given and the FS doesn't need to collect the other
attributes.
B/ allowing synthetic attributes to be identified - if the FS doesn't
natively support some attribute but must synthesise it, you can now
discover that fact
C/ add an ad-hoc collection of new attributes that filesystems can return if
they happen to support them
D/ do all the above with a single system call for efficiency.
I think pushing all these together is asking for trouble - arguments about one
aspect will interfere with completion of the others.
Given ...
Linus Torvalds wrote: I personally think that Unix ctime is insane. There is no real reason why "write()" should change mtime, but "chmod" changes ctime. It was just a random decision way back when... I believe it was done that way so "dump" could backup just the inode and not the data if only the inode had changed. Full history here: http://blog.plover.com/Unix/ctime.html --
Yes, the dump reasoning makes sense, and that history also shows that
originally chmod just changed mtime (since that's the _sane_ thing to
do). So if it wasn't for dump - that nobody uses any more and that was
considered a hack even back when and never supported things like
xattrs etc - unix probably wouldn't have a ctime at all (or would have
implemented a "creation time" because people would have asked for it).
So I'm sure there are reasons for ctime. That just doesn't mean that
it's really "good", the same way there were reasons to name "creat()"
without the "e".
Linus
--
Ask NetApp about that :-). They have built a rather large business on just that fact :-). Jeremy. --
Get sued out of existence by software patent trolls who have lost the ability to write code, apparently :-). --
The time is counted in years, not hours :-). --
I said "limited", not "non-existent". The fact remains that most of us would be hard pressed to name an application that requires you to share the same dataset to both Windows/CIFS and posix NFS clients. Everything from ACL models through caseless vs case-aware filesystems and Windows vs posix locking semantics tends to discourage mixing the two environments. Trond --
Your Mac has a perfectly functional CIFS client, as do your Linux boxes. They both interoperate just fine with Samba, and would presumably continue to do so if someone were to decide to reuse the ctime field on your Samba box as storage for a create time. Trond --
It didn't, at one point. Some version of Mac OS X would cause a client kernel crash when unmounting the CIFS share. I think it's been fixed, but we had to have some OS X clients switch to NFS because of it. -Phil --
CIFS doesn't support symlinks (they just appear as the referenced file), so I've had applications that scan the filesystem recurse indefinitely due to symlinked directories on a CIFS share appearing as hard-linked directories on the client. This doesn't happen when the filesystem is accessed via NFS. Cheers, Andreas --
This shouldn't go on indefinitely - PATH_MAX is reached at some point. --
Sigh... So please explain how it would be useful to export that particular filesystem through _both_ CIFS and NFS? My point was that in most circumstances you want to export either through CIFS or through NFS, but very rarely both. I also made the point that converting ctime into a creation time would break NFS, but it would be a limited breakage, mainly affecting the client's ability to detect ACL changes, and possibly causing the inode to get temporarily updated with stale attribute information on occasion due to out-of-order RPC replies. Trond --
Seems like a reasonable case for, say, a public "ftp server". For example, I keep ftp5.gwdg.de:/ftp/pub mounted, that's a little more convenient than always having to start an ftp client. Conversely, since NFS is, well, non-existent on Windows, one would use CIFS there (had it ftp5 opened) to get the same convenience. --
>>>>> "Jeremy" == Jeremy Allison <jra@samba.org> writes: Jeremy> Ask NetApp about that :-). They have built a rather large Jeremy> business on just that fact :-). And it does work, as long as you also go with either unix or windows semantics for the security and permissions bits. If you try to use the mixed-mode, you're in for a world of hurt. Oh yeah, Netapp still uses dump/restore for it's backups. :] Though whether it's still dependent on the optimization of ctime being used to know whether to just dump the inode only or not, I can't say. John --
Hi Linus,
> My point is that we have three timestamps, and
> windows wants three timestamps (somebody claims that NTFS has four
> timestamps, but the Windows file time access functions certainly only
> shows three times, so any potential extra on-disk times have no
> relevance because they are invisible to pretty much everybody).
Not quite. The underlying structure available to Windows programmers
is this one:
typedef struct _FILE_BASIC_INFORMATION {
LARGE_INTEGER CreationTime;
LARGE_INTEGER LastAccessTime;
LARGE_INTEGER LastWriteTime;
LARGE_INTEGER ChangeTime;
ULONG FileAttributes;
} FILE_BASIC_INFORMATION, *PFILE_BASIC_INFORMATION;
See http://msdn.microsoft.com/en-us/library/ff545762%28v=VS.85%29.aspx
These are the definitions:
CreationTime
Specifies the time that the file was created.
LastAccessTime
Specifies the time that the file was last accessed.
LastWriteTime
Specifies the time that the file was last written to.
ChangeTime
Specifies the last time the file was changed.
You are right that the more commonly used APIs (such as
GetFileInformationByHandle()) omit the ChangeTime field in the return
value. The ChangeTime is also not visible via the normal Windows GUI
or command line tools.
But there are APIs that are used by quite a few programs that do get
all 4 timestamps. For example, GetFileInformationByHandleEx() returns
all 4 fields. I include an example program that uses that API to show
all the timestamps below.
and yes, we think that real applications (such as Excel), look at
these values separately.
The other big difference from POSIX timestamps is that the
CreationTime is settable on Windows, and some of the windows UI
behaviour relies on this.
Cheers, Tridge
PS: Sorry for coming into this discussion so late
/*
show all 4 file times
tridge@samba.org, July 2010
*/
#define _WIN32_WINNT 0x0600
#include <stdio.h>
#include <stdlib.h>
#include "windows.h"
#include ...
Well, not POSIX, because POSIX doesn't have CreationTime at all. BSD's birthtime doesn't allow it to be set, and the question here is largely philosophical. Does it literally mean "file creation time" in terms of when the OS created the file, or does it mean "file" in the sense of application contents. For example, if an application edits the file and saves it out using "write file to foo.new; sync; rename foo to foo.bak; rename foo.new to foo", should the creation time for the newly written file "foo" be the time when the editor saved out the file (i.e., when "foo.new" was created), or copied from the original file "foo"'s creation time. This is something (whether or not the application is allowed to set the creation time) that I think makes sense to be either a filesystem level mount option, or superblock tunable, or even a per-process personality flag. However, I think Linus's idea of using a per-process flag to control whether or not "ctime" has the original POSIX semantics or some new "creation time" semantics would lead to a huge amount of confusion. Given that a number of new filesystems, including both ext4 and btrfs, have creation time, it makes sense for us to have a fourth timestamp. Whether or not our creation time is settable is a separate question, and I don't think we need to follow BSD's lead on this. If GNOME and/or KDE applications start using it, I could see this becoming something that gets wide adoption fairly quickly. - Ted --
Hi Ted, > Does it literally mean "file creation time" in terms of when the OS > created the file, or does it mean "file" in the sense of > application contents. For example, if an application edits the > file and saves it out using "write file to foo.new; sync; rename > foo to foo.bak; rename foo.new to foo", should the creation time > for the newly written file "foo" be the time when the editor saved > out the file (i.e., when "foo.new" was created), or copied from the > original file "foo"'s creation time. In Windows this can be controlled by applications, but it also is done at the filesystem level in NTFS using a technique that Microsoft call "File System Tunneling". If you create a file with the same name within a short time (default 15s and settable in the registry) of when the file previously existed then it will get the same CreationTime as the previous file. For details see http://support.microsoft.com/kb/172190 Some applications also do this regardless of the registry setting for MaximumTunnelEntryAgeInSeconds. They use the ability to set the CreationTime to get the same behaviour. Cheers, Tridge --
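The tunnelling rule described above can be modelled in a few lines. This is a toy sketch, not NTFS's implementation: the struct names, the fixed-size cache and the exact-match name comparison are all assumptions made for illustration (Windows name matching is case-insensitive, for one), and only the 15-second default comes from the message.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

#define TUNNEL_WINDOW_SEC 15   /* default window quoted in the message above */
#define TUNNEL_SLOTS      8    /* arbitrary size for this sketch */
#define TUNNEL_NAME_MAX   64

struct tunnel_entry {
    char   name[TUNNEL_NAME_MAX];
    time_t creation;   /* creation time of the file that went away */
    time_t removed;    /* when the name stopped existing */
    int    used;
};

/* Must be zero-initialised before first use. */
struct tunnel_cache {
    struct tunnel_entry slot[TUNNEL_SLOTS];
    unsigned next;
};

/* Remember the creation time of a name that was just deleted or renamed away. */
void tunnel_remember(struct tunnel_cache *c, const char *name,
                     time_t creation, time_t now)
{
    struct tunnel_entry *e;

    assert(c != NULL && name != NULL);
    e = &c->slot[c->next++ % TUNNEL_SLOTS];   /* overwrite oldest slot */
    strncpy(e->name, name, TUNNEL_NAME_MAX - 1);
    e->name[TUNNEL_NAME_MAX - 1] = '\0';
    e->creation = creation;
    e->removed = now;
    e->used = 1;
}

/* Creation time to give a newly created `name`: the tunnelled value if the
 * same name existed within the window, otherwise the current time. */
time_t tunnel_creation_time(struct tunnel_cache *c, const char *name, time_t now)
{
    unsigned i;

    assert(c != NULL && name != NULL);
    for (i = 0; i < TUNNEL_SLOTS; i++) {
        struct tunnel_entry *e = &c->slot[i];

        if (e->used && strcmp(e->name, name) == 0 &&
            now - e->removed <= TUNNEL_WINDOW_SEC) {
            e->used = 0;            /* each entry is consumed once */
            return e->creation;
        }
    }
    return now;
}
```

In the editor example from the quoted text, renaming "foo" to "foo.bak" would call tunnel_remember() for "foo", and the later rename of "foo.new" to "foo" would pick the old creation time back up if it happened within the window.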
actually, it can (partly :). But the way it can be done is an insane hack: <quote "http://ace.delos.com/kirk/"> To provide a sensible birth time for applications that are unaware of the birth time attribute, we changed the semantics of the "utimes" system call so that if the birth time was newer than the value of the modification time that it was setting, it sets the birth time to the same time as the modification time. An application that is aware of the birth time attribute can set both the birth time and the modification time by doing two calls to "utimes". First it calls "utimes" with a modification time equal to the saved birth time, then it calls "utimes" a second time with a modification time equal to the (presumably newer) saved modification time. </quote> Thus it can also only be set further into the past. Cheers Björn -- SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen phone: +49-551-370000-0, fax: +49-551-370000-9 AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen --
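The quoted utimes behaviour is easy to misread, so here is a toy model of it. The struct and function names are hypothetical and this is not BSD code; it only encodes the rule stated in the quote and shows why the two-call sequence restores both timestamps.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the two inode fields involved. */
struct toy_inode {
    time_t btime;
    time_t mtime;
};

/* Model of the quoted rule: setting an mtime older than the birth time
 * drags the birth time back with it; the birth time never moves forward. */
void toy_utimes(struct toy_inode *ino, time_t new_mtime)
{
    assert(ino != NULL);
    if (ino->btime > new_mtime)
        ino->btime = new_mtime;
    ino->mtime = new_mtime;
}

/* The two-call sequence from the quote: first pull btime back to the saved
 * birth time, then set the (presumably newer) saved modification time. */
void toy_restore_times(struct toy_inode *ino, time_t saved_btime,
                       time_t saved_mtime)
{
    toy_utimes(ino, saved_btime);
    toy_utimes(ino, saved_mtime);
}
```

Because the second call uses a time at or after the birth time, it leaves the birth time alone, which is also why the value can only ever be moved into the past.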
When abusing an existing time stamp use atime not ctime please. ctime has its uses. atime was just a mistake and is nearly useless. And with noatime we already have creation time semantics for atime. utz --
noatime was a late afterthought, and because it can interfere with some programs, relatime came along too. --
I know mutt uses atime to detect new messages. But there are better and There are people who prefer noatime over relatime. Using an existing time stamp for creation time is a bad idea IMHO. But when doing this use the least important one. Which is atime. For example ctime is used by backup programs. Anyway when we want to support creation time it should be an additional time stamp. utz --
On Fri, 30 Jul 2010 23:22:58 +0200 Ugh. Honestly all of this talk of abusing different time fields seems like craziness to me. It's going to be very hard to do that without breaking *something*. There's also very little reason to do this when xattrs are a much cleaner approach. Neil Brown has put forth a very reasoned justification for putting the birthtime in an xattr. After reading it, I think that makes more sense than anything. It's also something that can be done without any extra infrastructure. If at some point in the future we get an xstat-like syscall then we can always add birthtime to that as well. Ditto for the other fields under discussion (i_generation and the like). -- Jeff Layton <jlayton@redhat.com> --
Just my 2 cents (as a Samba server implementor). I *hate* the idea of adding a "virtual" EA for birthtime. If you're going to add it, just add it to the stat struct like *BSD does. Don't abuse the other time fields, it's a new one. Jeff, please don't advocate for an EA for the Samba server to use. Don't add it as an EA. It's *not* an EA, it's a timestamp. Jeremy. --
On Thu, 5 Aug 2010 16:52:18 -0700 I'm curious. Why do you particularly care what interface the kernel uses to provide you with access to this attribute? And given that it is an attribute that is not part of 'POSIX' or "UNIX", it would seem to be an extension - an extended attribute. As the Linux kernel does virtually nothing with this attribute except provide access, it seems to be a very different class of thing to other timestamps. Surely it is simply some storage associated with a file which is capable of storing a timestamp, which can be set or retrieved by an application, and which happens to be initialised to the current time when a file is created. Yes, to you it is a timestamp. But to Linux it is a few bytes of user-settable metadata. Sounds like an EA to me. Or do you really want something like BSD's 'btime' which as I understand it cannot be set. Would that be really useful to you? Is there something important that I am missing? NeilBrown --
Obviously the cifs and SMB2 protocols which the Samba server supports can ask the server to set the create time of a file (this is handled through xattrs today along with the "dos attribute" flags such as archive/hidden/system), but certainly it is much more common (and It is another syscall that the Samba server would have to make - and xattr performance is extremely slow on some file systems (although presumably this one would be more likely to be stored in the inode and perhaps not as bad on ext4, cifs and a few others such as ntfs). -- Thanks, Steve --
On Thu, 5 Aug 2010 22:55:06 -0500 Right. One has to consider that samba has to satisfy READDIRPLUS-like calls, and on a large directory all of those extra syscalls are likely to impact performance. In my view, the ideal thing would be to add this field as an EA and continue work on implementing xstat(). Adding it as an EA gives userland a way to set this value, without needing to add a new utimes() variant. If/when xstat becomes available, samba could use that instead of the EA for reading this value. -- Jeff Layton <jlayton@redhat.com> --
On Thu, 5 Aug 2010 22:55:06 -0500 Just a point of clarification - when you say it is common and important to be able to read the creation time on an existing file, are you still talking in the context of cifs/smb windows compatibility, or are you talking in the broader context? If you are referring to a broader context could you please give more details because I have not heard any mention of any real value of creation-time outside of Windows interoperability - having such a use clearly documented would assist the conversation I think. If on the other hand you are just referring to the Windows interoperability context ... given that you have to read an EA if the create-time has been changed, you will always have to read an EA so having something else is Obviously if we were to make xattrs the preferred way to get create time out of the filesystem we would want to make sure it is efficient. It would seem to make perfect sense to add a 'getxattrat' syscall and allow an AT_NONBLOCK flag (which would probably be useful for statat too). The AT_NONBLOCK flag would only get attributes if they were available immediately without going to storage/network/whatever. And if it is simply a case of too many syscalls per file, then getxattrat_multi would seem to be the most general way to go. NeilBrown --
There are other cases, less common than cifs and smb2. One
that comes to mind is NFS version 4, but there are a few other
cases that I have heard of (backup/archive applications).
The RFC recommends that servers return attribute 50 (creation
time). See below text:
time_create 50 nfstime4 R/W The time of creation
of the object. This
attribute does not
have any relation to
the traditional UNIX
file attribute
"ctime" or "change
time".
--
Thanks,
Steve
--
On Fri, 6 Aug 2010 18:58:42 -0500 I really don't think NFSv4 is a separate justification. I'm fairly sure that attribute was only included in NFSv4 for enhanced Windows compatibility (windows interoperation was a big issue during the protocol development). That leaves hypothetical "backup/archive applications". Do you have a concrete example? Or we are left with just various flavours of Windows compatibility (not that I have a problem with Windows compatibility, but if that is the only reason that we have creation-time then I think it is important to be clear and open about that). NeilBrown --
Perhaps also useful for MacOS (and other BSD), not just Windows, although MacOS may use cifs more often than nfs. -- Thanks, Steve --
A quick search for backup applications in Wikipedia came up with a reference fairly easily (to a backup app which uses creation time) for Linux: http://www.aqualab.cs.northwestern.edu/publications/Cornell04VFS.html Presumably Windows compat. is a stronger motivation than BSD/MacOS NFSv4 (returning birth time) compat, and backup applications are a lesser motivation. There may also be some value in using creation time as a generation number where no generation number is available. Intuitively it seems like creation time would be as "useful" as ctime (and probably more so) to app developers ... but that is hard to prove. -- Thanks, Steve --
On Fri, 6 Aug 2010 21:54:49 -0500 That publication seems to mention 'creation time' only as an abstract concept. The backup architecture keeps a history of the file all the way back to its "creation time". It doesn't appear to need or use a 'creation time' attribute stored with any I agree, it does seem like an intuitively valuable number - after all we each have a birthday which we are very aware of and often make use of. It is often treated as part of our identity - just like you were mentioning that the CIFS client uses creation-time to help identify files which lack the 'inode number' identifier that is the common tool in Unix and derivatives. But I'm not convinced that it is *practically* useful. The only practical use beyond windows-compatibility that has been mentioned is a stronger 'identity' tag. However inode+generation number, or "file-handle-fragment" are better things to use for identifying a file than "creation time", especially when the latter is settable. So if we were to add something for native applications to use, I doubt that it would be 'creation time' (but I'm still open to hearing a convincing use-case). So we are left with an attribute that is needed for windows compatibility, and so just needs to be understood by samba and wine. Some filesystems might support it efficiently, others might require the use of generic extended-attributes, still others might not support it at all (I guess you store it in some 'tdb' and hope it works well enough). Core-linux doesn't really need to know about this - there just needs to be a channel to pass it between samba/wine and the filesystem. xattr still seems the best mechanism to pass this stuff around. Team-samba can negotiate with fs developers to optimise/accelerate certain attributes, and linux-VFS doesn't need to know or care (except maybe to provide generic non-blocking or multiple-access interfaces). What is 'creation time' used for in the windows world??? Maybe there really is something ...
On Sat, 7 Aug 2010 13:32:40 +1000 IIUC, you're saying that we should basically just have samba stuff the current time into an xattr when it creates the file and leave the filesystems alone. If so, I disagree here. The problem with treating this as *just* an xattr is that it doesn't account for files that are created outside of samba but are then shared out by it. To handle this correctly, I believe it needs to be initialized by the kernel to the current time whenever an inode is created, even if samba doesn't create it. After that, it can be treated as just another xattr. -- Jeff Layton <jlayton@samba.org> --
On Sat, 7 Aug 2010 06:34:00 -0400 I'm not quite saying that (though there is a temptation). Some attributes are initialised by the filesystem rather than by common code. i_uid is a simple example. I have no problem with the filesystem initialising the storage that is used for this well-known-EA to the current time at creation. If something is created in a different universe, then brought into this one - when is its date of birth? The moment of creation, or the moment of entry into this universe? If both universes have a common time line (although with a 10 year offset) then I guess the former, though I think it is a bit of Yes, I suspect that would be ideal, and trivial for the fs to implement (it has to initialise it to something after all). i.e. I agree. NeilBrown --
It's a matter of taste. The *BSD's have this right IMHO. It should be part of the stat information. A file timestamp is not an EA. Making it available that way just feels like an appallingly It is *already* useful to us, and is widely used in existing code. The occasions when btime is set are relatively rare, and at that point we store it in a separate EA for Windows reporting purposes. Jeremy. --
On Sun, 8 Aug 2010 05:12:09 -0700 It would be more convenient if this were part of stat() but adding a new stat call is non-trivial. Even if we did that, it still doesn't solve the problem of being able to set the create time. The fact that that's rarely done doesn't really matter much -- we ought to shoot for the semantics that are needed to handle this properly. If we do settle on a xstat() interface, it might also end up being able to report things like selinux labels which are also available and settable via xattr. I don't see a problem with presenting the same data via multiple interfaces. If presenting this data via xattr solves the immediate problem of being able to properly store and report the create If that's the case, don't you have to query for this EA every time you need to return the create time anyway? If so, then doing this really isn't any more costly -- you'd just be querying a different EA, right? -- Jeff Layton <jlayton@redhat.com> --
*BSD didn't. They just added something that was useful to UNIX. I'd be happy with that. We don't need to ape Windows in everything. The coming ACL disaster will show that (we will go from an ACL model that is slightly too complex to use, to one that is impossibly No, we'd be querying an additional EA. The EA we query contains the DOS attributes as well as the create time. Jeremy. --
Care to elaborate? And what would native ACL support mean for Samba? --b. --
POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here, people are asking for this. But Windows ACLs are a nightmare beyond human comprehension :-). In the "too complex to be RichACLs'll do it, but I feel sorry for the admins :-). Jeremy. --
Not much choice - even community colleges now have Yes - RichACLs and Windows ACLs allow you to set some strange combinations of permission bits. RichACLs will make a more natural mapping for Samba and NFSv4 - and it is far too late to remove the requirement for Windows and MacOS (among other clients) support. -- Thanks, Steve --
Well, for one, ACLs in NT can be recursive IIRC. You can't say that of Linux ACLs - instead you have to setfacl -R and setfacl -Rd to give one user access to a directory and all its subdirs including future new inodes. --
You do realize that Windows does exactly the same thing under the covers, right ? Watch SMB or SMB2 traffic between a client and Windows server when someone changes an ACL sometime :-). Jeremy --
Yeah. There's some explanation here: http://tools.ietf.org/search/rfc5661#section-6.4.3.2 What NT-style ACLs provide is a few bits that help a setfacl-like application decide how to propagate the change. But it's still up to the application to do the recursive traversal. --b. --
I was curious whether you can support that with any data (or even just anecdotes) about real-world sysadmins. The NT-style ACLs give me a headache, honestly. But that may just be because I've been involved with the implementation. Admins may have the luxury of using only the subset that they're comfortable with. --b. --
Just an anecdote, but I remember giving a talk to a room full of admins, all of whom told me it was essential for Samba to implement "full Windows ACL compatibility" (we were in the process of coding it up at the time). I asked them to tell me the difference between object inherit, container inherit, and inherit only. Only one hand remained up (out of a room containing a couple of hundred Windows admins). I asked him where he worked, and the reply was Yeah. I think most sites set a group as the owner of a share and the directory so exported, set the directory to inherit everything down below, and just leave it up to the members of that group without getting further involved :-). Jeremy. --
On Sun, 8 Aug 2010 05:12:09 -0700 Unfortunately whenever you work on a collaborative project someone has to make concessions to taste, as we all taste different.. (or have different taste.. or something). So I think it is very important to clearly differentiate the practical issues from the aesthetic issues as I think we can hope for unity on the former, but I'm probably sounding like a scratched record, but when you say "is widely used" do you mean "is used in samba which is widely used" or do you mean "is used in a wide variety of applications"? Because if you are only saying the former, then I don't think we should copy BSD, but rather I think we should provide exactly the semantics that are most useful to samba - and that would seem to be creation-time and DOS flags which the filesystem can store directly in the inode and which samba can access cheaply. (and I would prefer to use xattrs, but that is a taste thing and as I'm not writing the code, I don't get to choose the taste). But if you are saying the latter, then sharing those details might help us see that copying bsd is actually the best thing to do, or maybe that something else is better. I'm just afraid that if some new interface is added without clear, comprehensive and up-front justification then we will end up getting a sub-optimal interface. NeilBrown --
CacheFiles currently uses atime to determine least-recently-usedness. David --
How does this work correctly with noatime or relatime (which is the default)? We have used FS-Cache with a few tens of thousands of files cached. Doesn't it mean that the cleanup has to stat them all? Why doesn't cachefilesd manage the cache index in a separate database like other caches? --
Because using atime is much simpler since the filesystem updates it automatically. If you have a separate database then you have redundant information and you need to maintain metadata integrity which has a cost, both in terms of disk usage and performance. I'm working on it, but you don't get it for free. David --
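To make the trade-off concrete, this is roughly what "use atime as the LRU key" amounts to once the timestamps have been collected with stat(). It is a sketch only: the struct and function names are hypothetical and it is not how cachefilesd is actually written.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical record: one cached file and the atime that stat() reported
 * for it in st_atim. */
struct cache_entry {
    const char     *name;
    struct timespec atime;
};

/* Returns non-zero if `a` is strictly earlier than `b`. */
static int timespec_before(const struct timespec *a, const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec;
    return a->tv_nsec < b->tv_nsec;
}

/* Index of the least recently accessed entry, i.e. the first cull candidate.
 * Every entry has to be examined, which is the scanning cost raised in the
 * question above. */
size_t oldest_by_atime(const struct cache_entry *entries, size_t count)
{
    size_t i, oldest = 0;

    assert(entries != NULL && count > 0);
    for (i = 1; i < count; i++) {
        if (timespec_before(&entries[i].atime, &entries[oldest].atime))
            oldest = i;
    }
    return oldest;
}
```

The filesystem keeps the key up to date for free, but with noatime or relatime the recorded value may lag behind the real last access, so the ordering is only approximate.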
There just is no way currently to store creation times. Abusing ctimes for write-once archives also stops working once you rsync it from one place to another. (Which brings me to the side question of why the ctime isn't settable through futimesnat.) --
What do you mean? Ext4 and BtrFS can both do so; it's just that there's no user interface to it. David --
Return extended attributes from the eCryptFS filesystem, dredged up from the
lower filesystem.
Possibly eCryptFS should also set FS_COMPR_FL on its compressed files.
Signed-off-by: David Howells <dhowells@redhat.com>
---
fs/ecryptfs/inode.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 31ef525..41bc407 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -994,8 +994,10 @@ int ecryptfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat lower_stat;
int rc;
- rc = vfs_getattr(ecryptfs_dentry_to_lower_mnt(dentry),
- ecryptfs_dentry_to_lower(dentry), &lower_stat);
+ lower_stat.query_flags = stat->query_flags;
+ lower_stat.request_mask = stat->request_mask | XSTAT_REQUEST_BLOCKS;
+ rc = vfs_xgetattr(ecryptfs_dentry_to_lower_mnt(dentry),
+ ecryptfs_dentry_to_lower(dentry), &lower_stat);
if (!rc) {
generic_fillattr(dentry->d_inode, stat);
stat->blocks = lower_stat.blocks;
--
Mark arguments to certain system calls as being const where they should be but
aren't. The list includes:
(*) The filename arguments of various stat syscalls, execve(), various utimes
syscalls and some mount syscalls.
(*) The filename arguments of some syscall helpers relating to the above.
(*) The buffer argument of various write syscalls.
Signed-off-by: David Howells <dhowells@redhat.com>
---
arch/alpha/kernel/osf_sys.c | 6 +++---
arch/alpha/kernel/process.c | 2 +-
arch/arm/kernel/sys_arm.c | 4 ++--
arch/arm/kernel/sys_oabi-compat.c | 6 +++---
arch/avr32/include/asm/syscalls.h | 2 +-
arch/avr32/kernel/process.c | 3 ++-
arch/blackfin/kernel/process.c | 2 +-
arch/frv/kernel/process.c | 3 ++-
arch/h8300/kernel/process.c | 2 +-
arch/ia64/include/asm/unistd.h | 2 +-
arch/ia64/kernel/process.c | 2 +-
arch/m32r/kernel/process.c | 3 ++-
arch/m68k/kernel/process.c | 2 +-
arch/m68knommu/kernel/process.c | 2 +-
arch/microblaze/kernel/sys_microblaze.c | 2 +-
arch/mips/kernel/syscall.c | 2 +-
arch/mn10300/kernel/process.c | 2 +-
arch/parisc/hpux/fs.c | 7 ++++---
arch/powerpc/kernel/process.c | 2 +-
arch/powerpc/kernel/sys_ppc32.c | 2 +-
arch/s390/kernel/compat_linux.c | 10 +++++-----
arch/s390/kernel/compat_linux.h | 10 +++++-----
arch/s390/kernel/entry.h | 2 +-
arch/s390/kernel/process.c | 2 +-
arch/sh/include/asm/syscalls_32.h | 2 +-
arch/sh/include/asm/syscalls_64.h | 2 +-
arch/sh/kernel/process_64.c | 2 +-
arch/sparc/kernel/sys_sparc32.c | 7 ++++---
arch/um/kernel/exec.c | 6 +++---
arch/um/kernel/internal.h | 2 +-
...