git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin.git vfs-scale-working Here is an new set of vfs patches for review, not that there was much interest last time they were posted. It is structured like: * preparation patches * introduce new locks to take over dcache_lock, then remove it * cleaning up and reworking things for new locks * rcu-walk path walking * start on some fine grained locking steps Thanks, Nick Nick Piggin (46): Revert "fs: use RCU read side protection in d_validate" fs: d_validate fixes kernel: kmem_ptr_validate considered harmful fs: dcache documentation cleanup fs: change d_delete semantics cifs: dont overwrite dentry name in d_revalidate jfs: dont overwrite dentry name in d_revalidate fs: change d_compare for rcu-walk fs: change d_hash for rcu-walk hostfs: simplify locking fs: dcache scale hash fs: dcache scale lru fs: dcache scale dentry refcount fs: dcache scale d_unhashed fs: dcache scale subdirs fs: scale inode alias list fs: Use rename lock and RCU for multi-step operations fs: increase d_name lock coverage fs: dcache remove dcache_lock fs: dcache avoid starvation in dcache multi-step operations fs: dcache reduce dput locking fs: dcache reduce locking in d_alloc fs: dcache reduce dcache_inode_lock fs: dcache rationalise dget variants fs: dcache reduce d_parent locking fs: dcache reduce prune_one_dentry locking fs: reduce dcache_inode_lock width in lru scanning fs: use RCU in shrink_dentry_list to reduce lock nesting fs: consolidate dentry kill sequence fs: icache RCU free inodes fs: avoid inode RCU freeing for pseudo fs kernel: optimise seqlock fs: rcu-walk for path lookup fs: fs_struct use seqlock fs: dcache remove d_mounted fs: dcache reduce branches in lookup path fs: cache optimise dentry and inode for rcu-walk fs: prefetch inode data in dcache lookup fs: d_revalidate_rcu for rcu-walk fs: provide rcu-walk aware permission ...
RCU free the struct inode. This will allow: - Subsequent store-free path walking patch. The inode must be consulted for permissions when walking, so an RCU inode reference is a must. - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want to take i_lock no longer need to take sb_inode_list_lock to walk the list in the first place. This will simplify and optimize locking. - Could remove some nested trylock loops in dcache code - Could potentially simplify things a bit in VM land. Do not need to take the page lock to follow page->mapping. The downsides of this is the performance cost of using RCU. In a simple creat/unlink microbenchmark, performance drops by about 10% due to inability to reuse cache-hot slab objects. As iterations increase and RCU freeing starts kicking over, this increases to about 20%. In cases where inode lifetimes are longer (ie. many inodes may be allocated during the average life span of a single inode), a lot of this cache reuse is not applicable, so the regression caused by this patch is smaller. The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU, however this adds some complexity to list walking and store-free path walking, so I prefer to implement this at a later date, if it is shown to be a win in real situations. I haven't found a regression in any non-micro benchmark so I doubt it will be a problem. Signed-off-by: Nick Piggin <npiggin@kernel.dk> --- Documentation/filesystems/porting | 14 ++++++++++++++ arch/powerpc/platforms/cell/spufs/inode.c | 10 ++++++++-- drivers/staging/pohmelfs/inode.c | 11 +++++++++-- fs/9p/vfs_inode.c | 9 ++++++++- fs/adfs/super.c | 9 ++++++++- fs/affs/super.c | 9 ++++++++- fs/afs/super.c | 10 +++++++++- fs/befs/linuxvfs.c | 10 ++++++++-- fs/bfs/inode.c | 9 ...
Rather than keep a d_mounted count in the dentry, set a dentry flag instead. The flag can be cleared by checking the hash table to see if there are any mounts left, which is not time critical because it is performed at detach time. The mounted state of a dentry is only used to speculatively take a look in the mount hash table if it is set -- before following the mount, vfsmount lock is taken and mount re-checked without races. This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I might use later (and some configs have larger than 32-bit spinlocks which might make use of the hole). Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>: In autofs4, when expring direct (or offset) mounts we need to ensure that we block user path walks into the autofs mount, which is covered by another mount. To do this we clear the mounted status so that follows stop before walking into the mount and are essentially blocked until the expire is completed. The automount daemon still finds the correct dentry for the umount due to the follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as an inode operation for direct and offset mounts only and is called following the lookup that stopped at the covered mount. At the end of the expire the covering mount probably has gone away so the mounted status need not be restored. But we need to check this and only restore the mounted status if the expire failed. XXX: autofs may not work right if we have other mounts go over the top of it? Signed-off-by: Nick Piggin <npiggin@kernel.dk> --- fs/autofs4/expire.c | 13 +++++++++++-- fs/dcache.c | 1 - fs/namespace.c | 29 ++++++++++++++++++++++++++--- include/linux/dcache.h | 42 +++++++++++++++++++++--------------------- 4 files changed, 58 insertions(+), 27 deletions(-) diff --git a/fs/autofs4/expire.c b/fs/autofs4/expire.c index 7869b3a..d9f9a15 100644 --- a/fs/autofs4/expire.c +++ b/fs/autofs4/expire.c @@ -295,7 ...
Reduce some branches and memory accesses in dcache lookup by adding dentry flags to indicate common d_ops are set, rather than having to check them. This saves a pointer memory access (dentry->d_op) in common path lookup situations, and saves another pointer load and branch in cases where we have d_op but not the particular operation. Patched with: git grep -l "d_op =" | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i Signed-off-by: Nick Piggin <npiggin@kernel.dk> --- arch/ia64/kernel/perfmon.c | 2 +- drivers/staging/autofs/root.c | 2 +- fs/9p/vfs_inode.c | 26 +++++++++++++------------- fs/adfs/dir.c | 2 +- fs/adfs/super.c | 2 +- fs/affs/namei.c | 2 +- fs/affs/super.c | 2 +- fs/afs/dir.c | 2 +- fs/anon_inodes.c | 2 +- fs/autofs4/inode.c | 2 +- fs/autofs4/root.c | 10 +++++----- fs/btrfs/export.c | 4 ++-- fs/btrfs/inode.c | 2 +- fs/ceph/dir.c | 6 +++--- fs/cifs/dir.c | 16 ++++++++-------- fs/cifs/inode.c | 8 ++++---- fs/cifs/link.c | 4 ++-- fs/cifs/readdir.c | 4 ++-- fs/coda/dir.c | 2 +- fs/configfs/dir.c | 8 ++++---- fs/dcache.c | 25 +++++++++++++++++++++---- fs/ecryptfs/inode.c | 2 +- fs/ecryptfs/main.c | 4 ++-- fs/fat/inode.c | 4 ++-- fs/fat/namei_msdos.c | 6 +++--- fs/fat/namei_vfat.c | 8 ++++---- fs/fuse/dir.c | 2 +- fs/fuse/inode.c | 4 ++-- fs/gfs2/export.c | 4 ++-- fs/gfs2/ops_fstype.c | 2 +- fs/gfs2/ops_inode.c | 2 +- fs/hfs/dir.c | 2 +- fs/hfs/super.c ...
dcache_inode_lock can be replaced with per-inode locking. Use existing
inode->i_lock for this. This is slightly non-trivial because we sometimes
need to find the inode from the dentry, which requires d_inode to be
stabilised (either with refcount or d_lock).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/9p/vfs_inode.c | 4 +-
fs/affs/amigaffs.c | 4 +-
fs/dcache.c | 84 +++++++++++++++++++++++++----------------------
fs/exportfs/expfs.c | 12 ++++---
fs/nfs/getroot.c | 4 +-
fs/notify/fsnotify.c | 4 +-
fs/ocfs2/dcache.c | 4 +-
include/linux/dcache.h | 1 -
8 files changed, 62 insertions(+), 55 deletions(-)
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index df8bbb3..5978298 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -277,11 +277,11 @@ static struct dentry *v9fs_dentry_from_dir_inode(struct inode *inode)
{
struct dentry *dentry;
- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
/* Directory should have only one entry. */
BUG_ON(S_ISDIR(inode->i_mode) && !list_is_singular(&inode->i_dentry));
dentry = list_entry(inode->i_dentry.next, struct dentry, d_alias);
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
return dentry;
}
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 600101a..3a4557e 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -128,7 +128,7 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
void *data = dentry->d_fsdata;
struct list_head *head, *next;
- spin_lock(&dcache_inode_lock);
+ spin_lock(&inode->i_lock);
head = &inode->i_dentry;
next = head->next;
while (next != head) {
@@ -139,7 +139,7 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
}
next = next->next;
}
- spin_unlock(&dcache_inode_lock);
+ spin_unlock(&inode->i_lock);
}
diff --git a/fs/dcache.c b/fs/dcache.c
index 5e19940..809ec46 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -39,8 +39,8 ...We can turn the dcache hash locking from a global dcache_hash_lock into
per-bucket locking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/dcache.c | 131 ++++++++++++++++++++++++++++++++---------------
fs/super.c | 3 +-
include/linux/dcache.h | 23 ++-------
include/linux/fs.h | 3 +-
4 files changed, 97 insertions(+), 63 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 67a08d4..5e19940 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -33,13 +33,15 @@
#include <linux/bootmem.h>
#include <linux/fs_struct.h>
#include <linux/hardirq.h>
+#include <linux/bit_spinlock.h>
+#include <linux/rculist_bl.h>
#include "internal.h"
/*
* Usage:
* dcache_inode_lock protects:
* - i_dentry, d_alias, d_inode
- * dcache_hash_lock protects:
+ * dcache_hash_bucket lock protects:
* - the dcache hash table
* dcache_lru_lock protects:
* - the dcache lru lists and counters
@@ -57,7 +59,7 @@
* dcache_inode_lock
* dentry->d_lock
* dcache_lru_lock
- * dcache_hash_lock
+ * dcache_hash_bucket lock
*
* If there is an ancestor relationship:
* dentry->d_parent->...->d_parent->d_lock
@@ -74,13 +76,11 @@ int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_inode_lock);
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
EXPORT_SYMBOL(rename_lock);
EXPORT_SYMBOL(dcache_inode_lock);
-EXPORT_SYMBOL(dcache_hash_lock);
static struct kmem_cache *dentry_cache __read_mostly;
@@ -97,13 +97,35 @@ static struct kmem_cache *dentry_cache __read_mostly;
static unsigned int d_hash_mask __read_mostly;
static unsigned int d_hash_shift __read_mostly;
-static struct hlist_head *dentry_hashtable __read_mostly;
+
+struct dcache_hash_bucket {
+ struct ...Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/namei.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
include/linux/fs.h | 6 ++
2 files changed, 107 insertions(+), 30 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index d8f7ece..7fa6119 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -169,8 +169,36 @@ EXPORT_SYMBOL(putname);
/*
* This does basic POSIX ACL permission checking
*/
-static inline int __acl_permission_check(struct inode *inode, int mask,
- int (*check_acl)(struct inode *inode, int mask), int rcu)
+static int acl_permission_check_rcu(struct inode *inode, int mask, unsigned int flags,
+ int (*check_acl_rcu)(struct inode *inode, int mask, unsigned int flags))
+{
+ umode_t mode = inode->i_mode;
+
+ mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
+
+ if (current_fsuid() == inode->i_uid)
+ mode >>= 6;
+ else {
+ if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl_rcu) {
+ int error = check_acl_rcu(inode, mask, flags);
+ if (error != -EAGAIN)
+ return error;
+ }
+
+ if (in_group_p(inode->i_gid))
+ mode >>= 3;
+ }
+
+ /*
+ * If the DACs are ok we don't need any capability check.
+ */
+ if ((mask & ~mode) == 0)
+ return 0;
+ return -EACCES;
+}
+
+static int acl_permission_check(struct inode *inode, int mask, unsigned int flags,
+ int (*check_acl)(struct inode *inode, int mask))
{
umode_t mode = inode->i_mode;
@@ -180,7 +208,7 @@ static inline int __acl_permission_check(struct inode *inode, int mask,
mode >>= 6;
else {
if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
- if (rcu) {
+ if (flags) {
return -ECHILD;
} else {
int error = check_acl(inode, mask);
@@ -201,10 +229,52 @@ static inline int __acl_permission_check(struct inode *inode, int mask,
return -EACCES;
}
-static inline int acl_permission_check(struct inode *inode, int mask,
- int (*check_acl)(struct inode *inode, int mask))
+/**
+ * generic_permission_rcu - check for access ...prune_one_dentry can avoid quite a bit of locking in the common case where
ancestors have an elevated refcount. Alternatively, we could have gone the
other way and made fewer trylocks in the case where d_count goes to zero, but
is probably less common.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/dcache.c | 27 +++++++++++++++------------
1 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 23f6fed..4fa27b5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -547,26 +547,29 @@ static void prune_one_dentry(struct dentry *dentry, struct dentry *parent)
* Prune ancestors.
*/
while (dentry) {
- spin_lock(&dcache_inode_lock);
-again:
+relock:
spin_lock(&dentry->d_lock);
+ if (dentry->d_count > 1) {
+ dentry->d_count--;
+ spin_unlock(&dentry->d_lock);
+ return;
+ }
+ if (!spin_trylock(&dcache_inode_lock)) {
+relock2:
+ spin_unlock(&dentry->d_lock);
+ cpu_relax();
+ goto relock;
+ }
+
if (IS_ROOT(dentry))
parent = NULL;
else
parent = dentry->d_parent;
if (parent && !spin_trylock(&parent->d_lock)) {
- spin_unlock(&dentry->d_lock);
- goto again;
- }
- dentry->d_count--;
- if (dentry->d_count) {
- if (parent)
- spin_unlock(&parent->d_lock);
- spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_inode_lock);
- return;
+ goto relock2;
}
-
+ dentry->d_count--;
dentry_lru_del(dentry);
__d_drop(dentry);
dentry = d_kill(dentry, parent);
--
1.7.1
--
dcache_inode_lock can be avoided in d_delete() and d_materialise_unique()
in cases where it is not required.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/dcache.c | 24 ++++++++++++------------
1 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 3e4f7c1..6fe387d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1794,10 +1794,15 @@ void d_delete(struct dentry * dentry)
/*
* Are we the only user?
*/
- spin_lock(&dcache_inode_lock);
+again:
spin_lock(&dentry->d_lock);
isdir = S_ISDIR(dentry->d_inode->i_mode);
if (dentry->d_count == 1) {
+ if (!spin_trylock(&dcache_inode_lock)) {
+ spin_unlock(&dentry->d_lock);
+ cpu_relax();
+ goto again;
+ }
dentry->d_flags &= ~DCACHE_CANT_MOUNT;
dentry_iput(dentry);
fsnotify_nameremove(dentry, isdir);
@@ -1808,7 +1813,6 @@ void d_delete(struct dentry * dentry)
__d_drop(dentry);
spin_unlock(&dentry->d_lock);
- spin_unlock(&dcache_inode_lock);
fsnotify_nameremove(dentry, isdir);
}
@@ -2106,14 +2110,15 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
BUG_ON(!d_unhashed(dentry));
- spin_lock(&dcache_inode_lock);
-
if (!inode) {
actual = dentry;
__d_instantiate(dentry, NULL);
- goto found_lock;
+ d_rehash(actual);
+ goto out_nolock;
}
+ spin_lock(&dcache_inode_lock);
+
if (S_ISDIR(inode->i_mode)) {
struct dentry *alias;
@@ -2145,10 +2150,9 @@ struct dentry *d_materialise_unique(struct dentry *dentry, struct inode *inode)
actual = __d_instantiate_unique(dentry, inode);
if (!actual)
actual = dentry;
- else if (unlikely(!d_unhashed(actual)))
- goto shouldnt_be_hashed;
+ else
+ BUG_ON(!d_unhashed(actual));
-found_lock:
spin_lock(&actual->d_lock);
found:
spin_lock(&dcache_hash_lock);
@@ -2164,10 +2168,6 @@ out_nolock:
iput(inode);
return actual;
-
-shouldnt_be_hashed:
- spin_unlock(&dcache_inode_lock);
- BUG();
}
...Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
using dcache_lock for these anyway (eg. using i_mutex).
Note: if we change the locking rule in future so that ->d_child protection is
provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
But it would be an exception to an otherwise regular locking scheme, so we'd
have to see some good results. Probably not worthwhile.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
drivers/staging/smbfs/cache.c | 4 +
drivers/usb/core/inode.c | 8 +-
fs/autofs4/autofs_i.h | 11 +++
fs/autofs4/expire.c | 129 +++++++++++++--------------
fs/autofs4/root.c | 18 ++++-
fs/ceph/dir.c | 6 +-
fs/ceph/inode.c | 8 ++-
fs/coda/cache.c | 2 +
fs/dcache.c | 195 +++++++++++++++++++++++++++++++----------
fs/libfs.c | 24 ++++--
fs/ncpfs/dir.c | 3 +
fs/ncpfs/ncplib_kernel.h | 4 +
fs/notify/fsnotify.c | 4 +-
include/linux/dcache.h | 1 +
kernel/cgroup.c | 19 ++++-
security/selinux/selinuxfs.c | 12 ++-
16 files changed, 314 insertions(+), 134 deletions(-)
diff --git a/drivers/staging/smbfs/cache.c b/drivers/staging/smbfs/cache.c
index dbb9865..29afb05 100644
--- a/drivers/staging/smbfs/cache.c
+++ b/drivers/staging/smbfs/cache.c
@@ -63,6 +63,7 @@ smb_invalidate_dircache_entries(struct dentry *parent)
struct dentry *dentry;
spin_lock(&dcache_lock);
+ spin_lock(&parent->d_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
dentry = list_entry(next, struct dentry, d_u.d_child);
@@ -70,6 +71,7 @@ smb_invalidate_dircache_entries(struct dentry *parent)
smb_age_dentry(server, dentry);
next = next->next;
}
+ spin_unlock(&parent->d_lock);
spin_unlock(&dcache_lock);
}
@@ -97,6 +99,7 @@ smb_dget_fpos(struct dentry *dentry, struct dentry ...The remaining usages for dcache_lock is to allow atomic, multi-step read-side
operations over the directory tree by excluding modifications to the tree.
Also, to walk in the leaf->root direction in the tree where we don't have
a natural d_lock ordering.
This could be accomplished by taking every d_lock, but this would mean a
huge number of locks and actually gets very tricky.
Solve this instead by using the rename seqlock for multi-step read-side
operations, retry in case of a rename so we don't walk up the wrong parent.
Concurrent dentry insertions are not serialised against. Concurrent deletes
are tricky when walking up the directory: our parent might have been deleted
when dropping locks so also need to check and retry for that.
We can also use the rename lock in cases where livelock is a worry (and it
is introduced in subsequent patch).
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
drivers/staging/pohmelfs/path_entry.c | 15 +++-
fs/autofs4/waitq.c | 18 ++++-
fs/dcache.c | 136 +++++++++++++++++++++++++++------
fs/nfs/namespace.c | 14 +++-
include/linux/dcache.h | 1 +
5 files changed, 153 insertions(+), 31 deletions(-)
diff --git a/drivers/staging/pohmelfs/path_entry.c b/drivers/staging/pohmelfs/path_entry.c
index 8ec83d2..bbe42f4 100644
--- a/drivers/staging/pohmelfs/path_entry.c
+++ b/drivers/staging/pohmelfs/path_entry.c
@@ -83,10 +83,11 @@ out:
int pohmelfs_path_length(struct pohmelfs_inode *pi)
{
struct dentry *d, *root, *first;
- int len = 1; /* Root slash */
+ int len;
+ unsigned seq;
- first = d = d_find_alias(&pi->vfs_inode);
- if (!d) {
+ first = d_find_alias(&pi->vfs_inode);
+ if (!first) {
dprintk("%s: ino: %llu, mode: %o.\n", __func__, pi->ino, pi->vfs_inode.i_mode);
return -ENOENT;
}
@@ -95,6 +96,11 @@ int pohmelfs_path_length(struct pohmelfs_inode *pi)
root = dget(current->fs->root.dentry);
spin_unlock(&current->fs->lock);
...Add a new lock, dcache_inode_lock, to protect the inode's i_dentry list
from concurrent modification. d_alias is also protected by d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/9p/vfs_inode.c | 2 +
fs/affs/amigaffs.c | 2 +
fs/cifs/inode.c | 3 ++
fs/dcache.c | 66 ++++++++++++++++++++++++++++++++++++++++++------
fs/exportfs/expfs.c | 4 +++
fs/nfs/getroot.c | 4 +++
fs/notify/fsnotify.c | 2 +
fs/ocfs2/dcache.c | 3 +-
include/linux/dcache.h | 1 +
9 files changed, 78 insertions(+), 9 deletions(-)
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 34bf71b..47dfd5d 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -271,9 +271,11 @@ static struct dentry *v9fs_dentry_from_dir_inode(struct inode *inode)
struct dentry *dentry;
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
/* Directory should have only one entry. */
BUG_ON(S_ISDIR(inode->i_mode) && !list_is_singular(&inode->i_dentry));
dentry = list_entry(inode->i_dentry.next, struct dentry, d_alias);
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
return dentry;
}
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 7d0f0a3..2321cc9 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -129,6 +129,7 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
struct list_head *head, *next;
spin_lock(&dcache_lock);
+ spin_lock(&dcache_inode_lock);
head = &inode->i_dentry;
next = head->next;
while (next != head) {
@@ -139,6 +140,7 @@ affs_fix_dcache(struct dentry *dentry, u32 entry_ino)
}
next = next->next;
}
+ spin_unlock(&dcache_inode_lock);
spin_unlock(&dcache_lock);
}
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index ef3a55b..0ee1767 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -805,12 +805,15 @@ inode_has_hashed_dentries(struct inode *inode)
struct dentry *dentry;
...Change d_hash so it may be called from lock-free RCU lookups. See similar patch for d_compare for details. For in-tree filesystems, this is just a mechanical change. Signed-off-by: Nick Piggin <npiggin@kernel.dk> --- Documentation/filesystems/Locking | 5 +++-- Documentation/filesystems/porting | 7 +++++++ Documentation/filesystems/vfs.txt | 8 ++++++-- fs/adfs/dir.c | 3 ++- fs/affs/namei.c | 20 ++++++++++++-------- fs/cifs/dir.c | 5 +++-- fs/cifs/readdir.c | 2 +- fs/dcache.c | 2 +- fs/ecryptfs/inode.c | 4 ++-- fs/fat/namei_msdos.c | 3 ++- fs/fat/namei_vfat.c | 8 +++++--- fs/gfs2/dentry.c | 3 ++- fs/hfs/hfs_fs.h | 3 ++- fs/hfs/string.c | 3 ++- fs/hfsplus/hfsplus_fs.h | 3 ++- fs/hfsplus/unicode.c | 3 ++- fs/hpfs/dentry.c | 3 ++- fs/isofs/inode.c | 28 ++++++++++++++++++---------- fs/jfs/namei.c | 3 ++- fs/namei.c | 5 +++-- fs/ncpfs/dir.c | 10 +++++----- fs/sysv/namei.c | 3 ++- include/linux/dcache.h | 3 ++- 23 files changed, 88 insertions(+), 49 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index fc5e1b7..5bceb19 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -10,7 +10,8 @@ be able to use diff(1). --------------------------- dentry_operations -------------------------- prototypes: int (*d_revalidate)(struct dentry *, int); - int (*d_hash) (struct dentry *, struct qstr *); + int (*d_hash)(const struct dentry *, const struct inode *, + struct qstr *); int (*d_compare)(const struct dentry *, const struct dentry *, const struct inode *, ...
Change d_compare so it may be called from lock-free RCU lookups. This does put significant restrictions on what may be done from the callback, however there don't seem to have been any problems with in-tree fses. If some strange use case pops up that _really_ cannot cope with the rcu-walk rules, we can just add new rcu-unaware callbacks, which would cause name lookup to drop out of rcu-walk mode. For in-tree filesystems, this is just a mechanical change. Signed-off-by: Nick Piggin <npiggin@kernel.dk> --- Documentation/filesystems/Locking | 4 +- Documentation/filesystems/porting | 7 +++ Documentation/filesystems/vfs.txt | 25 +++++++++- fs/adfs/dir.c | 8 ++- fs/affs/namei.c | 44 +++++++++++-------- fs/cifs/dir.c | 11 +++-- fs/dcache.c | 4 +- fs/fat/namei_msdos.c | 15 ++++--- fs/fat/namei_vfat.c | 39 +++++++++++----- fs/hfs/hfs_fs.h | 4 +- fs/hfs/string.c | 14 +++--- fs/hfsplus/hfsplus_fs.h | 4 +- fs/hfsplus/unicode.c | 14 +++--- fs/hpfs/dentry.c | 21 +++++--- fs/isofs/inode.c | 88 ++++++++++++++++++------------------- fs/isofs/namei.c | 3 +- fs/jfs/namei.c | 10 +++-- fs/ncpfs/dir.c | 29 ++++++++---- fs/ncpfs/ncplib_kernel.h | 8 ++-- fs/proc/proc_sysctl.c | 12 +++--- include/linux/dcache.h | 12 ++--- include/linux/ncp_fs.h | 4 +- 22 files changed, 228 insertions(+), 152 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index a91f308..fc5e1b7 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -11,7 +11,9 @@ be able to use diff(1). prototypes: int (*d_revalidate)(struct dentry *, int); int (*d_hash) (struct dentry ...
Add a new lock, dcache_hash_lock, to protect the dcache hash table from
concurrent modification. d_hash is also protected by d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/dcache.c | 38 +++++++++++++++++++++++++++-----------
include/linux/dcache.h | 3 +++
2 files changed, 30 insertions(+), 11 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 4f9ccbe..50c65c7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -35,12 +35,27 @@
#include <linux/hardirq.h>
#include "internal.h"
+/*
+ * Usage:
+ * dcache_hash_lock protects dcache hash table
+ *
+ * Ordering:
+ * dcache_lock
+ * dentry->d_lock
+ * dcache_hash_lock
+ *
+ * if (dentry1 < dentry2)
+ * dentry1->d_lock
+ * dentry2->d_lock
+ */
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
- __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_hash_lock);
+__cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);
+EXPORT_SYMBOL(dcache_hash_lock);
EXPORT_SYMBOL(dcache_lock);
static struct kmem_cache *dentry_cache __read_mostly;
@@ -1195,7 +1210,9 @@ struct dentry *d_obtain_alias(struct inode *inode)
tmp->d_flags |= DCACHE_DISCONNECTED;
tmp->d_flags &= ~DCACHE_UNHASHED;
list_add(&tmp->d_alias, &inode->i_dentry);
+ spin_lock(&dcache_hash_lock);
hlist_add_head(&tmp->d_hash, &inode->i_sb->s_anon);
+ spin_unlock(&dcache_hash_lock);
spin_unlock(&tmp->d_lock);
spin_unlock(&dcache_lock);
@@ -1581,7 +1598,9 @@ void d_rehash(struct dentry * entry)
{
spin_lock(&dcache_lock);
spin_lock(&entry->d_lock);
+ spin_lock(&dcache_hash_lock);
_d_rehash(entry);
+ spin_unlock(&dcache_hash_lock);
spin_unlock(&entry->d_lock);
spin_unlock(&dcache_lock);
}
@@ -1661,8 +1680,6 @@ static void switch_names(struct dentry *dentry, struct dentry *target)
*/
static void d_move_locked(struct ...What locking is used to keep DCACHE_UNHASHED/d_unhashed() in check
with the whether the dentry is on the hash list or not? It looks to
me that to make any hash modification, you have to hold both the
dentry->d_lock and the dcache_hash_lock to keep them in step. If
Perhaps the places where we need to lock two dentries should use a
wrapper like we do for other objects. Such as:
void dentry_dlock_two(struct dentry *d1, struct dentry *d2)
{
if (d1 < d2) {
spin_lock(&d1->d_lock);
spin_lock_nested(&d2->d_lock, DENTRY_D_LOCK_NESTED);
} else {
spin_lock(&d2->d_lock);
spin_lock_nested(&d1->d_lock, DENTRY_D_LOCK_NESTED);
}
Shouldn't we really kill _d_rehash() by replacing all the callers
with direct calls to __d_rehash() first? There doesn't seem to be much
Un-inline __d_drop so you don't need to make the dcache_hash_lock
visible outside of fs/dcache.c. That happens later in the series
anyway, so may as well do it now...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
No, dcache_lock still does that. d_unhashed is protected with It only happens once in rename, so I don't think it's useful. No. Several filesystems are using it, and it's an exported symbol. I'm focusing on changed to locking, and keeping APIs the same, where possible. I don't want just more and more depencendies on pushing through filesystem changes before this series. Like I said, there are infinite cleanups or improvements you can make. It does not particularly matter that they happen before or after the scaling work, except if there are classes of APIs that the new locking Yeah that makes sense. Thanks, Nick --
d_unhashed() is just a flag bit in ->d_flags, which is apparently protected by ->d_lock in the next the LRU patch. I say apparently because that patch doesn't appear to protect anything to do with d_flags. I'm struggling to work out what is being protected in each patch because it is not clearexactly what is being done. It makes much more sense to me to start by making all access to d_flags atomic (i.e. protected by ->d_lock) in a separate patch than to do it hodge-podge across multiple patches as you currently are. Its hard to follow when things that are intimately related are That's __d_rehash(). _d_rehash() (single underscore) is static and only called by d_rehash() and d_materialise_unique() And is one line of code We do plenty of cleanups when changing code when the result gives us simpler and easier to understand code. It's a trivial change that, IMO, makes the code more consistent and easier to follow. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
No, d_flags in upstream kernel is protected by d_lock (and also you Well hopefully the code and changelog is more clear. The lockorder doc changes probably aren't a good source for a running commentary. Not that it's wrong, it just may have a few nits like this where an existing It might be easier to follow if you know d_flags is already protected by d_lock. The d_unhashed patch uses d_lock to keep _both_ the d_flags and the hash list membership status in sync. It does not make the d_flags Anyway, cleanup. It can equally well be done before or after, and seeing as we're being nice and not breaking my tree with minutiae, can you base it on That doesn't belong in this patch either, it's shuffling. Anyway I like Unrelated "cleanups" in the same patch as non trivial locking change is stupid. Necessary changes to prevent bad ugliness resulting, or preventing repeated steps for the particular changes, etc. of course. Killing un related functions no. --
So put it in another prepartory patch. It makes the locking changes Ok, I get the picture. You don't want a code review, you want a rubber stamp. Find someone else to get it from. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I didn't change that, though, the ordering of locking unrelated dentries and the code is already in rename code and is not touched Of course I want code review. I am not going to just do everything you say that I don't agree with, but I will explain why every time (as I have done to all your points). I would prefer more in-depth review than from someone who doesn't know d_lock protects d_flags, but any and all help is welcome. Even minor nitpicking or cleanups are welcome if they are relevant to the patches. Thanks, Nick PS. don't accuse me of not wanting a code review, because you're just projecting. To paraphrase you: I don't have to justify myself to you, nick, only the maintainers, so I'm not answering. In response to my questions. --
Which generally comes down to "I disagree with you". That's hard to argue against because you aren't willing to compromise. So, to address your next comment, I'll restate what I was proposing. That is, to ensure all the d_flags accesses protected by d_lock as an initial patch rather than cleaning it up in an ad-hoc fashion later on, such as this later patch in your series: [PATCH 14/46] fs: dcache scale d_unhashed which has the description: Protect d_unhashed(dentry) condition with d_lock. which illustrates my point that not all accesses to d_flags are Your implication about my competence is incorrect and entirely inappropriate. Ad hominen attacks don't improve your argument or If _you_ decide they are relevant. Nick, in the past couple of months you've burnt everyone who has tried to review your changes in any meaningful way. Nobody wants to engage with you because you've aggressively disagreed with every significant change that has been requested. You have shown no desire to compromise, instead you argue that you are right until you've had the last word, and you have frequently resorted to condesending and disrespectful attacks on reviewers. You would do well to keep that in mind next time you wonder why nobody is stepping up to review your code. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I have been trying to explain my reasoning. For example, the suggestion to change _d_rehash and to put by-address ordering of locking 2 dentries in its own function I simply said that I don't want to pull in such changes because they're not related or really touched by the patches. I think that's reasonable, and so if I have a reasonable objection to a It depends what you mean by accesses to d_flags. No, not all of them are, because there are in fact some cases where d_flags is read without any locking, when races don't matter or aren't applicable. But all writes to d_flags, in code where the dentry is live and there can be concurrent writes to d_flags *are* protected by d_lock. d_unhashed() is defined to: Returns true if the dentry passed is not currently hashed. So what I have called the d_unhashed condition, I mean the combination of DCACHE_UNHASHED and dentry membership on the hash list. I'll improve that changelog because now you've brought it to my Well you said that my patch adds d_flags protection in bits and pieces in a random manner. d_flags is already protected by d_lock Well you keep escalating it too, like you swore at me when I try several times to explain an issue. This is exactly what you and Christoph did, to me, actually. And you're wrong, nobody was reviewing my code long before that little episode. I certainly did compromise with Al, regarding the merging of the inode lock stuff, and although I disagreed with some parts, I said OK fine. You can't seem to concede a single time that I am right or have a valid point. The best you can possibly manage to to go silent (and then maybe bring it up again a few weeks later). This is perhaps why I appear so insolent, because when I disagree with you, I'm wrong so my reasoning must be irrelevant. It just keeps happening (recently again with the vfs percpu counters thread). So as far as I can see, there never was a bridge there to begin with. I wish we could work together because I don't in ...
I'll also just add that if you don't like ad-hominen attacks, you shouldn't have made implications about my integrity and honesty by accusing me of wanting a rubber stamp, rather than a real review. Which was before I suggested that you were confused about d_flags locking, you'll note. --
Remove dcache_lock locking from hostfs filesystem, and move it into dcache
helpers. All that should really be required is a coherent path name, protection
from concurrent modification is not provided outside path name generation
because dcache_lock is dropped before the path is used.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/dcache.c | 15 +++++++++++++--
fs/hostfs/hostfs_kern.c | 24 ++++++++++--------------
include/linux/dcache.h | 2 +-
3 files changed, 24 insertions(+), 17 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 9fd5180..4f9ccbe 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2140,7 +2140,7 @@ char *dynamic_dname(struct dentry *dentry, char *buffer, int buflen,
/*
* Write full pathname from the root of the filesystem into the buffer.
*/
-char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
+static char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
{
char *end = buf + buflen;
char *retval;
@@ -2167,7 +2167,18 @@ char *__dentry_path(struct dentry *dentry, char *buf, int buflen)
Elong:
return ERR_PTR(-ENAMETOOLONG);
}
-EXPORT_SYMBOL(__dentry_path);
+
+char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen)
+{
+ char *retval;
+
+ spin_lock(&dcache_lock);
+ retval = __dentry_path(dentry, buf, buflen);
+ spin_unlock(&dcache_lock);
+
+ return retval;
+}
+EXPORT_SYMBOL(dentry_path_raw);
char *dentry_path(struct dentry *dentry, char *buf, int buflen)
{
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index cfe8bc7..39dc505 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -92,12 +92,10 @@ __uml_setup("hostfs=", hostfs_args,
static char *__dentry_name(struct dentry *dentry, char *name)
{
- char *p = __dentry_path(dentry, name, PATH_MAX);
+ char *p = dentry_path_raw(dentry, name, PATH_MAX);
char *root;
size_t len;
- spin_unlock(&dcache_lock);
-
root = dentry->d_sb->s_fs_info;
len = strlen(root);
if ...Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/jfs/namei.c | 43 +++++++++++++++++++++++++++++++++++--------
1 files changed, 35 insertions(+), 8 deletions(-)
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 231ca4a..2da1546 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -18,6 +18,7 @@
*/
#include <linux/fs.h>
+#include <linux/namei.h>
#include <linux/ctype.h>
#include <linux/quotaops.h>
#include <linux/exportfs.h>
@@ -1597,21 +1598,47 @@ static int jfs_ci_compare(struct dentry *dir, struct qstr *a, struct qstr *b)
goto out;
}
result = 0;
+out:
+ return result;
+}
+static int jfs_ci_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
/*
- * We want creates to preserve case. A negative dentry, a, that
- * has a different case than b may cause a new entry to be created
- * with the wrong case. Since we can't tell if a comes from a negative
- * dentry, we blindly replace it with b. This should be harmless if
- * a is not a negative dentry.
+ * This is not negative dentry. Always valid.
+ *
+ * Note, rename() to existing directory entry will have ->d_inode,
+ * and will use existing name which isn't specified name by user.
+ *
+ * We may be able to drop this positive dentry here. But dropping
+ * positive dentry isn't good idea. So it's unsupported like
+ * rename("filename", "FILENAME") for now.
*/
- memcpy((unsigned char *)a->name, b->name, a->len);
-out:
- return result;
+ if (dentry->d_inode)
+ return 1;
+
+ /*
+ * This may be nfsd (or something), anyway, we can't see the
+ * intent of this. So, since this can be for creation, drop it.
+ */
+ if (!nd)
+ return 0;
+
+ /*
+ * Drop the negative dentry, in order to make sure to use the
+ * case sensitive name which is specified by user ...Use vfat's method for dealing with negative dentries to preserve case,
rather than overwrite dentry name in d_revalidate, which is a bit ugly
and also gets in the way of doing lock-free path walking.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/cifs/dir.c | 43 ++++++++++++++++++++++++-------------------
1 files changed, 24 insertions(+), 19 deletions(-)
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 3840edd..521d841 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -656,22 +656,34 @@ lookup_out:
static int
cifs_d_revalidate(struct dentry *direntry, struct nameidata *nd)
{
- int isValid = 1;
-
if (direntry->d_inode) {
if (cifs_revalidate_dentry(direntry))
return 0;
- } else {
- cFYI(1, "neg dentry 0x%p name = %s",
- direntry, direntry->d_name.name);
- if (time_after(jiffies, direntry->d_time + HZ) ||
- !lookupCacheEnabled) {
- d_drop(direntry);
- isValid = 0;
- }
+ else
+ return 1;
}
- return isValid;
+ /*
+ * This may be nfsd (or something), anyway, we can't see the
+ * intent of this. So, since this can be for creation, drop it.
+ */
+ if (!nd)
+ return 0;
+
+ /*
+ * Drop the negative dentry, in order to make sure to use the
+ * case sensitive name which is specified by user if this is
+ * for creation.
+ */
+ if (!(nd->flags & (LOOKUP_CONTINUE | LOOKUP_PARENT))) {
+ if (nd->flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET))
+ return 0;
+ }
+
+ if (time_after(jiffies, direntry->d_time + HZ) || !lookupCacheEnabled)
+ return 0;
+
+ return 1;
}
/* static int cifs_d_delete(struct dentry *direntry)
@@ -709,15 +721,8 @@ static int cifs_ci_compare(struct dentry *dentry, struct qstr *a,
struct nls_table *codepage = CIFS_SB(dentry->d_inode->i_sb)->local_nls;
if ((a->len == b->len) &&
- (nls_strnicmp(codepage, a->name, b->name, a->len) == 0)) {
- /*
- * To preserve case, don't let an existing negative dentry's
- * case take precedence. If a is not a negative dentry, this
- * ...d_validate has been broken for a long time.
kmem_ptr_validate does not guarantee that a pointer can be dereferenced
if it can go away at any time. Even rcu_read_lock doesn't help, because
the pointer might be queued in RCU callbacks but not executed yet.
So the parent cannot be checked, nor the name hashed. The dentry pointer
can not be touched until it can be verified under lock. Hashing simply
cannot be used.
Instead, verify the parent/child relationship by traversing parent's
d_child list. It's slow, but only ncpfs and the destaged smbfs care
about it, at this point.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/dcache.c | 25 +++++++------------------
1 files changed, 7 insertions(+), 18 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index cc2b938..9d1a59d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1483,41 +1483,30 @@ out:
}
/**
- * d_validate - verify dentry provided from insecure source
+ * d_validate - verify dentry provided from insecure source (deprecated)
* @dentry: The dentry alleged to be valid child of @dparent
* @dparent: The parent dentry (known to be valid)
*
* An insecure source has sent us a dentry, here we verify it and dget() it.
* This is used by ncpfs in its readdir implementation.
* Zero is returned in the dentry is invalid.
+ *
+ * This function is slow for big directories, and deprecated, do not use it.
*/
-
int d_validate(struct dentry *dentry, struct dentry *dparent)
{
- struct hlist_head *base;
- struct hlist_node *lhp;
-
- /* Check whether the ptr might be valid at all.. */
- if (!kmem_ptr_validate(dentry_cache, dentry))
- goto out;
-
- if (dentry->d_parent != dparent)
- goto out;
+ struct dentry *child;
spin_lock(&dcache_lock);
- base = d_hash(dparent, dentry->d_name.hash);
- hlist_for_each(lhp,base) {
- /* hlist_for_each_entry_rcu() not required for d_hash list
- * as it is parsed under dcache_lock
- */
- if (dentry == hlist_entry(lhp, struct dentry, d_hash)) ...I'd drop the previous revert patch and just convert the RCU hash traversal straight to the d_child traversal code you introduce here. This is a much better explanation of why the d_validate mechanism needs to be changed, and the revert is really an unnecessary extra step... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Has to be backported, though. Patch that is to be reverted obviously adds more brokenness and is a good example that you cannot dget() under rcu read protection even if the rest of the surrounding function is bugfree. I wouldn't have thought it's a big deal. --
Reverting something broken to something already broken just to fix to the less broken version seems like an unnecessary step. Just fix the brokenneѕs in a single patch - no need to indirect the real fix through a revert. One less patch to worry about. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Backported to stable/distro kernels I suppose. I'm not sure what your OK but I disagree. Firstly, reverting that patch gives a good record of that particular pattern of bug (that Christoph and Al both missed). With more RCU going into the vfs, people need to be pretty clear about the pitfalls. Secondly, as I said, reverting means that I can use exact same patch for upstream and stable kernels. And finally, it gives better bisectability. If somebody hits a bug in my patch, I would rather have them bisect into the well-worn (if buggy) version of the code than bisect into a different type of brokenness. It isn't indirecting the real fix through a revert, they are broken in different ways. My fix is for the bug that it doesn't guarantee the persistence of *memory* we are using, and the revert is for the bug that it doesn't guarantee the persistence/validity of the *object*, and which is actually more likely to be a problem if you think about it, because the window is much larger. Git has no problem with lots of patches, so I don't see any advantage to doing one patch, and you lose the advantages above. --
This reverts commit 3825bdb7ed920845961f32f364454bee5f469abb.
Patch is broken, you can't dget() without holding any locks!
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
---
fs/dcache.c | 31 +++++++++++++++++++------------
1 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 23702a9..cc2b938 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1491,26 +1491,33 @@ out:
* This is used by ncpfs in its readdir implementation.
* Zero is returned in the dentry is invalid.
*/
-int d_validate(struct dentry *dentry, struct dentry *parent)
+
+int d_validate(struct dentry *dentry, struct dentry *dparent)
{
- struct hlist_head *head = d_hash(parent, dentry->d_name.hash);
- struct hlist_node *node;
- struct dentry *d;
+ struct hlist_head *base;
+ struct hlist_node *lhp;
/* Check whether the ptr might be valid at all.. */
if (!kmem_ptr_validate(dentry_cache, dentry))
- return 0;
- if (dentry->d_parent != parent)
- return 0;
+ goto out;
- rcu_read_lock();
- hlist_for_each_entry_rcu(d, node, head, d_hash) {
- if (d == dentry) {
- dget(dentry);
+ if (dentry->d_parent != dparent)
+ goto out;
+
+ spin_lock(&dcache_lock);
+ base = d_hash(dparent, dentry->d_name.hash);
+ hlist_for_each(lhp,base) {
+ /* hlist_for_each_entry_rcu() not required for d_hash list
+ * as it is parsed under dcache_lock
+ */
+ if (dentry == hlist_entry(lhp, struct dentry, d_hash)) {
+ __dget_locked(dentry);
+ spin_unlock(&dcache_lock);
return 1;
}
}
- rcu_read_unlock();
+ spin_unlock(&dcache_lock);
+out:
return 0;
}
EXPORT_SYMBOL(d_validate);
--
1.7.1
--
I believe you can - for the same reasons we can take a reference to an inode without holding the inode_lock. That is, as long as the caller already holds an active reference to the dentry, dget() can be used to take another reference without needing the dcache_lock. Such usage appears to be described in the comment above dget() and there's a BUG_ON() in dget() to catch callers that don't already have an active reference. An example of a valid unlocked dget(): d_alloc() does an unlocked dget() to take a reference to the parent dentry which we already are guaranteed to have a reference to. As to d_validate() - it depends on the caller behaviour as to whether the unlocked dget() is valid or not. From a cursory check of the NCP and SMB readdir caches, both appear to hold an active reference to the dentry it is passing to d_validate(). If that is the case then there is nothing wrong with the way d_validate uses dget(). Can someone with more SMB/NCP expertise than me validate the use of cached dentries? Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I don't see where? Can you point to where the refcount is taken? AFAIKS it drops the reference 3 lines after it puts the pointer Then why would it have to use d_validate if it has a reference? That is supposed to be for an "untrusted" pointer (which is why it had all the crazy checks that it's in kmem and in the right slab etc). --
Right, so the commit message is wrong. Can you update it to tell us why dget() can't be used there - the commit message from the second Yeah, you're right, I missed that one - I spent more tiem checking the validation part of the code than the initial insertion. Hence Code changes. It may not be doing what it was originally needed/intended to be doing - I don't need to waste time on code archeology and second guessing when there are others around that can tell me this off the top oftheir head. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I suppose if you're not reading it in the context of d_validate, then yes. And as an historical record, I'll clarify. Obviously if we do have a reference, then we can take another, and if we don't, then we need more than RCU because RCU only provides persistence guarantee for the memory, not any persistence Well the d_validate API is meant to provide that, so it's broken whether or not its callers use it correctly. It's also exported to external modules... Yes we should remove smbfs and rip the cache out of ncpfs and remove d_validate entirely when possible (or, provide a more reasonable API and caching library entirely in the dcache code that a filesystem might use). But this is the right first step. --
The reason why I don't remove the caching crap entirely is because it is not a trivial change to properly remove it, and I don't have the time or inclination to do it properly and have it tested when it's easy to provide a correct, simple, and back compatible API. The patch Christoph posted for smbfs and ncpfs just hacks out where the dentry is used from the cache, but leaves a lot of cache infrastructure For the record, if anyone cares, the sanest and simplest way to do this would be to store the dentry's hash along with its pointer, and have interfaces to allocate, destroy, insert, and delete from cache, and lookup based on a supplied compare function. (but obviously we shouldn't bother unless some good numbers turn up) --
I get: CC [M] fs/cifs/inode.o fs/cifs/inode.c: In function ‘inode_has_hashed_dentries’: fs/cifs/inode.c:807: error: ‘dcache_inode_lock’ undeclared (first use in this function) fs/cifs/inode.c:807: error: (Each undeclared identifier is reported only once fs/cifs/inode.c:807: error: for each function it appears in.) make[3]: *** [fs/cifs/inode.o] Error 1 make[2]: *** [fs/cifs] Error 2 make[1]: *** [fs] Error 2 I used the latest mainline. --
Sorry, missed a conversion, it just needs to be changed to inode->i_lock. Pushed the fix to git. Thanks, Nick --
I attached a patch to my posting in [1] but it was somehow "eaten" (here in my mbox the patch is definitely attached). Patchwork is also not listing my patch. Anyway, it's fixed - that's good. I have to check why I get a Call trace with systemd-v15 but not with sysvinit package here on Debian. - Sedat - [1] http://lkml.org/lkml/2010/11/27/145
From: Nick Piggin <npiggin@kernel.dk> Just want to say that I've been running this code for the past few days on my 128 cpu box and it seems quite sturdy. If there are any kinds of vfs benchmarks you want me to run on this machine both with and without the scaling changes, just let me know. --
Hi Dave, great, thanks! I am mostly interested in single-thread performance on different ISAs and different microarchitectures. Microbenchmarks are easy, I attached a couple of quick ones I use in an offline mail. Single-threaded (preloadindex=false), cached and uncached, git diff on an unmodified tree is slightly more real world. And then anything else you can think of. Thanks, Nick --
So I vfs-scale-working branch may not be entirely in the clear, seeing
as it touches
the code lower in the call chain. However I don't know what can cause
lockdep to go off
the rails like this.
There is a sequence I used to hack around lockdep nesting
restrictions, following this
pattern:
repeat:
spin_lock(&parent->d_lock);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
/* do stuff */
spin_unlock(&parent->d_lock);
spin_release(&dentry->d_lock.dep_map, 1, _RET_IP_);
parent = dentry;
spin_acquire(&this_parent->d_lock.dep_map, 0, 1, _RET_IP_);
goto repeat;
It's not directly in this call chain, but I wonder if it could have
caused any problem?
--
shouldn't that be s/this_parent/parent/ ? So what you're trying to do is: A -> B -> C -> ... lock A lock B, nested unlock A flip B from nested to top lock C, nested unlock B flip C from nested to top lock ... Anyway, the way to write that is something like: lock_set_subclass(&detry->d_lock.dep_map, 0, _RET_IP_); Which will reset the subclass of the held lock from DENTRY_D_LOCK_NESTED to 0. This is also used in double_unlock_balance(), we go into double_lock_balance() with this_rq locked and want to lock busiest, because of the lock ordering we might need to unlock this_rq and lock busiest first, at which point this_rq is nested. On unlock we thus need to map this_rq back to subclass 0 (which it had before double_lock_balance(), because otherwise subsequent lock operations will be done against the subclass and confuse things. --
OK, thanks. My version should not have caused any problems though, Thanks, Nick --
I tihnk so, yes, altough looking at it again I wonder why you use spin_aquire(.trylock=1) -- but that too shouldn't cause anything like Not directly, no. Usually lockdep crashes indicate use after free like things, where we try to lock a lock that's been scribbled on. But that usually explodes a bit earlier. You faulting in the middle of that breath-first-search does suggest some data corruption, but I'm not quite sure what kind, I can't remember ever having seem something like this before. I've CC'ed Ming Lei who wrote the bfs search, maybe he's got an idea. Copy of the splat: --
Ah, may have been a stupid little bug. The list entry check was being done
and then the pointer reloaded to be used.
What does the asm for shrink_dentry_list look like (before this patch)?
Thanks,
Nick
Index: linux-2.6/fs/dcache.c
===================================================================
--- linux-2.6.orig/fs/dcache.c 2010-12-08 12:41:35.000000000 +1100
+++ linux-2.6/fs/dcache.c 2010-12-08 14:23:46.000000000 +1100
@@ -657,10 +657,10 @@ static void shrink_dentry_list(struct li
struct dentry *dentry;
rcu_read_lock();
- while (!list_empty(list)) {
- dentry = list_entry(list->prev, struct dentry, d_lru);
-
- /* Don't need RCU dereference because we recheck under lock */
+ for (;;) {
+ dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
+ if (&dentry->d_lru == list)
+ break;
spin_lock(&dentry->d_lock);
if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
spin_unlock(&dentry->d_lock);
--
Stress test doing: single thread 50M inode create single thread rm -rf 2-way 50M inode create 2-way rm -rf 4-way 50M inode create 4-way rm -rf 8-way 50M inode create 8-way rm -rf 8-way 250M inode create 8-way rm -rf Failed about 5 minutes into the "4-way rm -rf" (~3 hours into the test) with a CPU stuck spinning on here: [37372.084012] NMI backtrace for cpu 5 [37372.084012] CPU 5 [37372.084012] Modules linked in: [37372.084012] [37372.084012] Pid: 15214, comm: rm Not tainted 2.6.37-rc4-dgc+ #797 /Bochs [37372.084012] RIP: 0010:[<ffffffff810643c4>] [<ffffffff810643c4>] __ticket_spin_lock+0x14/0x20 [37372.084012] RSP: 0018:ffff880114643c98 EFLAGS: 00000213 [37372.084012] RAX: 0000000000008801 RBX: ffff8800687be6c0 RCX: ffff8800c4eb2688 [37372.084012] RDX: ffff880114643d38 RSI: ffff8800dfd4ea80 RDI: ffff880114643d14 [37372.084012] RBP: ffff880114643c98 R08: 0000000000000003 R09: 0000000000000000 [37372.084012] R10: 0000000000000000 R11: dead000000200200 R12: ffff880114643d14 [37372.084012] R13: ffff880114643cb8 R14: ffff880114643d38 R15: ffff8800687be71c [37372.084012] FS: 00007fd6d7c93700(0000) GS:ffff8800dfd40000(0000) knlGS:0000000000000000 [37372.084012] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [37372.084012] CR2: 0000000000bbd108 CR3: 0000000107146000 CR4: 00000000000006e0 [37372.084012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [37372.084012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [37372.084012] Process rm (pid: 15214, threadinfo ffff880114642000, task ffff88011b16f890) [37372.084012] Stack: [37372.084012] ffff880114643ca8 ffffffff81ad044e ffff880114643cf8 ffffffff81167ae7 [37372.084012] 0000000000000000 ffff880114643d38 000000000000000e ffff88011901d800 [37372.084012] ffff8800cdb7cf5c ffff88011901d8e0 0000000000000000 0000000000000000 [37372.084012] Call Trace: [37372.084012] [<ffffffff81ad044e>] _raw_spin_lock+0xe/0x20 [37372.084012] [<ffffffff81167ae7>] ...
OK good, with any luck, that's the same bug. Is this XFS? Is there any concurrent activity happening on the same dentries? Ie. are the rm -rf threads running on the same directories, or is there any reclaim happening in the background? Thanks, Nick --
IIRC, kswapd was consuming about 5-10% of a CPU during parallel unlink tests. Mainly reclaiming XFS inodes, I think, but there may be dentry cache reclaim going as well. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Turns out that the kswapd peaks are upwards of 50% of a CPU for a
few seconds, then idle for 10-15s. Typical perf top output of kswapd
while it is active during unlinks is:
samples pcnt function DSO
_______ _____ ___________________________ _________________
17168.00 10.2% __call_rcu [kernel.kallsyms]
13223.00 7.8% kmem_cache_free [kernel.kallsyms]
12917.00 7.6% down_write [kernel.kallsyms]
12665.00 7.5% xfs_iunlock [kernel.kallsyms]
10493.00 6.2% xfs_reclaim_inode_grab [kernel.kallsyms]
9314.00 5.5% __lookup_tag [kernel.kallsyms]
9040.00 5.4% radix_tree_delete [kernel.kallsyms]
8694.00 5.1% is_bad_inode [kernel.kallsyms]
7639.00 4.5% __ticket_spin_lock [kernel.kallsyms]
6821.00 4.0% _raw_spin_unlock_irqrestore [kernel.kallsyms]
5484.00 3.2% __d_drop [kernel.kallsyms]
5114.00 3.0% xfs_reclaim_inode [kernel.kallsyms]
4626.00 2.7% __rcu_process_callbacks [kernel.kallsyms]
3556.00 2.1% up_write [kernel.kallsyms]
3206.00 1.9% _cond_resched [kernel.kallsyms]
3129.00 1.9% xfs_qm_dqdetach [kernel.kallsyms]
2327.00 1.4% radix_tree_tag_clear [kernel.kallsyms]
2327.00 1.4% call_rcu_sched [kernel.kallsyms]
2262.00 1.3% __ticket_spin_unlock [kernel.kallsyms]
2215.00 1.3% xfs_ilock [kernel.kallsyms]
2200.00 1.3% radix_tree_gang_lookup_tag [kernel.kallsyms]
1982.00 1.2% xfs_reclaim_inodes_ag [kernel.kallsyms]
1736.00 1.0% xfs_trans_unlocked_item [kernel.kallsyms]
1707.00 1.0% ...call_rcu shouldn't be doing much, except for disabling irqs and linking the object into the list. I have a patch somewhere to reduce the irq disable overhead a bit, but it really shouldn't be doing a lot of work. Sometimes you find that touching the rcu head field needs to get a cacheline exclusive, so a bit of work gets transferred there.... But it may also be something going a bit wrong in RCU. I blew it up once already, after the files_lock splitup that enabled all CPUs to create and destroy files :) --
Could you please enable CONFIG_RCU_TRACE, mount debugfs somewhere, and look at rcu/rcudata? There will be a "ql=" number printed for each CPU, and if that number is too large, __call_rcu() does take what it considers to be corrective action, which can incur some overhead. If this is the problem, then increasing the value of the qhimark module parameter might help. If this is not the problem, I could make a patch that disables some of __call_rcu()'s grace-period acceleration code if you are willing to try I would certainly like the opportunity to fix any bugs that might be in RCU... ;-) Thanx, Paul --
Another thing that might help is to reduce the value of CONFIG_RCU_FANOUT to something like 16. If this does help, then there is a reasonably straightforward change I can make to RCU. --
