This patch set is derived from Nick Piggin's VFS scalability tree. This is an attempt to push the process of finer grained review of the series for upstream inclusion. I'm hitting VFS lock contention problems with XFS on 8-16p machines now, so I need to get this stuff moving. This patch set is just the basic inode_lock breakup patches plus a few more simple changes to the inode code. It stops short of introducing RCU inode freeing because those changes are not completely baked yet. As a result, the full inode handling improvements of Nick's patch set are not realised with this short series. However, my own testing indicates that the amount of lock traffic and contention is down by an order of magnitude on an 8-way box for parallel inode create and unlink workloads, so there is still significant improvements from just this patch set. Version 2 of this series was a complete rework of the original patch series. I've pulled in several of the cleanups and re-ordered the series such that cleanups, factoring and list splitting are done before any of the locking changes. Instead of converting the inode state flags first, I've converted them last, ensuring that manipulations are kept inside other locks rather than outside them. The series is made up of the following steps: - inode counters are made per-cpu - inode LRU manipulations are made lazy - i_list is split into two lists (grows inode by 2 pointers), one for tracking lru status, one for writeback status - reference counting is factored, then renamed and locked differently - protect iunique counter with it's own lock - hash lookups and reference counting is cleaned up - inode hash operations are factored, then locked per bucket - superblock inode listis locked per-superblock - inode LRU is locked via a global lock - unclear what the best way to split this up from here is, so no attempt is made to optimise further. - Currently not showing signs of contention under any workload on an ...
From: Nick Piggin <npiggin@suse.de>
The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/block_dev.c | 4 ++--
fs/fs-writeback.c | 27 ++++++++++++++-------------
fs/inode.c | 38 +++++++++++++++++++++-----------------
fs/nilfs2/mdt.c | 3 ++-
include/linux/fs.h | 3 ++-
include/linux/writeback.h | 1 -
mm/backing-dev.c | 6 +++---
7 files changed, 44 insertions(+), 38 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 501eab5..63b1c4c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,8 +58,8 @@ static void bdev_inode_switch_bdi(struct inode *inode,
{
spin_lock(&inode_lock);
inode->i_data.backing_dev_info = dst;
- if (inode->i_state & I_DIRTY)
- list_move(&inode->i_list, &dst->wb.b_dirty);
+ if (!list_empty(&inode->i_wb_list))
+ list_move(&inode->i_wb_list, &dst->wb.b_dirty);
spin_unlock(&inode_lock);
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 33e9857..92d73b6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -172,11 +172,11 @@ static void redirty_tail(struct inode *inode)
if (!list_empty(&wb->b_dirty)) {
struct inode *tail;
- tail = list_entry(wb->b_dirty.next, struct inode, i_list);
+ tail = list_entry(wb->b_dirty.next, struct inode, i_wb_list);
if (time_before(inode->dirtied_when, tail->dirtied_when))
inode->dirtied_when = jiffies;
}
- list_move(&inode->i_list, &wb->b_dirty);
+ list_move(&inode->i_wb_list, &wb->b_dirty);
}
...From: Dave Chinner <dchinner@redhat.com> The inode reference count is currently an atomic variable so that it can be sampled/modified outside the inode_lock. However, the inode_lock is still needed to synchronise the final reference count and checks against the inode state. To avoid needing the protection of the inode lock, protect the inode reference count with the per-inode i_lock and convert it to a normal variable. To avoid existing out-of-tree code accidentally compiling against the new method, rename the i_count field to i_ref. This is relatively straight forward as there are limited external references to the i_count field remaining. Based on work originally from Nick Piggin. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- Documentation/filesystems/vfs.txt | 14 +++--- arch/powerpc/platforms/cell/spufs/file.c | 2 +- fs/btrfs/inode.c | 14 ++++-- fs/ceph/mds_client.c | 2 +- fs/cifs/inode.c | 2 +- fs/drop_caches.c | 4 +- fs/ext3/ialloc.c | 4 +- fs/ext4/ialloc.c | 4 +- fs/fs-writeback.c | 12 +++-- fs/hpfs/inode.c | 2 +- fs/inode.c | 79 ++++++++++++++++++++++------- fs/locks.c | 2 +- fs/logfs/readwrite.c | 2 +- fs/nfs/inode.c | 4 +- fs/nfs/nfs4state.c | 2 +- fs/nilfs2/mdt.c | 2 +- fs/notify/inode_mark.c | 25 ++++++--- fs/ntfs/inode.c | 6 +- fs/ntfs/super.c | 2 +- fs/quota/dquot.c | 4 +- fs/reiserfs/stree.c | 2 +- fs/smbfs/inode.c | 2 +- fs/ubifs/super.c ...
BTW, the same thing as with Nick's set - separate patch for "clone the reference to inode we are already holding" helper, in front of queue. --
Isn't that already done by the patch 6 "fs: Clean up inode reference counting"? That patch does the converting of stand-alone atomic_inc(&inode->i-count) into iref(inode), not this one. Maybe I've misunderstood what you are wanting to be changed with this patch - can you clarify, Al? Cheers, Dave. -- Dave Chinner david@fromorbit.com --
From: Christoph Hellwig <hch@lst.de>
Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
Introduce a new iunique_lock to protect the iunique counters once inode_lock
is removed.
Based on a patch originally from Nick Piggin.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 30 +++++++++++++++++++++++++-----
1 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index cfcafee..77ff091 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -913,6 +913,27 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
return inode;
}
+/*
+ * search the inode cache for a matching inode number.
+ * If we find one, then the inode number we are trying to
+ * allocate is not unique and so we should not use it.
+ *
+ * Returns 1 if the inode number is unique, 0 if it is not.
+ */
+static int test_inode_iunique(struct super_block *sb, unsigned long ino)
+{
+ struct hlist_head *b = inode_hashtable + hash(sb, ino);
+ struct hlist_node *node;
+ struct inode *inode;
+
+ hlist_for_each_entry(inode, node, b, i_hash) {
+ if (inode->i_ino == ino && inode->i_sb == sb)
+ return 0;
+ }
+
+ return 1;
+}
+
/**
* iunique - get a unique inode number
* @sb: superblock
@@ -934,19 +955,18 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
+ static DEFINE_SPINLOCK(iunique_lock);
static unsigned int counter;
- struct inode *inode;
- struct hlist_head *head;
ino_t res;
spin_lock(&inode_lock);
+ spin_lock(&iunique_lock);
do {
if (counter <= max_reserved)
counter = max_reserved + 1;
res = counter++;
- head = inode_hashtable + hash(sb, res);
- inode = find_inode_fast(sb, head, res);
- } while (inode != NULL);
+ } while (!test_inode_iunique(sb, res));
+ spin_unlock(&iunique_lock);
...Protect the inode hash with a single lock is not scalable. Convert
the inode hash to use the new bit-locked hash list implementation
that allows per-bucket locks to be used. This allows us to replace
the global inode_lock with finer grained locking without increasing
the size of the hash table.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/btrfs/inode.c | 2 +-
fs/fs-writeback.c | 2 +-
fs/hfs/hfs_fs.h | 2 +-
fs/hfs/inode.c | 2 +-
fs/hfsplus/hfsplus_fs.h | 2 +-
fs/hfsplus/inode.c | 2 +-
fs/inode.c | 168 ++++++++++++++++++++++++++++-------------------
fs/nilfs2/gcinode.c | 22 ++++---
fs/nilfs2/segment.c | 2 +-
fs/nilfs2/the_nilfs.h | 2 +-
fs/reiserfs/xattr.c | 2 +-
include/linux/fs.h | 8 ++-
include/linux/list_bl.h | 1 +
mm/shmem.c | 4 +-
14 files changed, 132 insertions(+), 89 deletions(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7947bf0..c7a2bef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3855,7 +3855,7 @@ again:
p = &root->inode_tree.rb_node;
parent = NULL;
- if (hlist_unhashed(&inode->i_hash))
+ if (inode_unhashed(inode))
return;
spin_lock(&root->inode_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9832beb..1fb5d95 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -964,7 +964,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* dirty list. Add blockdev inodes as well.
*/
if (!S_ISBLK(inode->i_mode)) {
- if (hlist_unhashed(&inode->i_hash))
+ if (inode_unhashed(inode))
goto out;
}
if (inode->i_state & I_FREEING)
diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h
index 4f55651..24591be 100644
--- a/fs/hfs/hfs_fs.h
+++ b/fs/hfs/hfs_fs.h
@@ -148,7 +148,7 @@ struct hfs_sb_info {
int fs_div;
- struct hlist_head ...From: Dave Chinner <dchinner@redhat.com>
Given that the inode LRU and IO lists are split apart, they do not
need to be protected by the same lock. So in preparation for removal
of the inode_lock, add new locks for them. The writeback lists are
only ever accessed in the context of a bdi, so add a per-BDI lock to
protect manipulations of these lists.
For the inode LRU, introduce a simple global lock to protect it.
While this could be made per-sb, it is unclear yet as to what is the
next step for optimising/parallelising reclaim of inodes. Rather
than optimise now, leave it as a global list and lock until further
analysis can be done.
Because there will now be a situation where the inode is on
different lists protected by different locks during the freeing of
the inode (i.e. not an atomic state transition), we need to ensure
that we set the I_FREEING state flag before we start removing inodes
from the IO and LRU lists. This ensures that if we race with other
threads during freeing, they will notice the I_FREEING flag is set
and be able to take appropriate action to avoid problems.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/block_dev.c | 5 +++
fs/fs-writeback.c | 51 +++++++++++++++++++++++++++++++++---
fs/inode.c | 61 ++++++++++++++++++++++++++++++++++++++-----
fs/internal.h | 5 +++
include/linux/backing-dev.h | 3 ++
mm/backing-dev.c | 18 ++++++++++++
6 files changed, 132 insertions(+), 11 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 11ad53d..7909775 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -56,10 +56,15 @@ EXPORT_SYMBOL(I_BDEV);
static void bdev_inode_switch_bdi(struct inode *inode,
struct backing_dev_info *dst)
{
+ struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
spin_lock(&inode_lock);
+ bdi_lock_two(old, dst);
...From: Dave Chinner <dchinner@redhat.com> Inode reclaim can push many inodes into the I_FREEING state before it actually frees them. During the time it gathers these inodes, it can call iput(), invalidate_mapping_pages, be preempted, etc. As a result, holding inodes in I_FREEING can cause pauses. After the inode scalability work, there is not a big reason to batch up inodes to reclaim them, so we can dispose them as they are found from the LRU. Unmount does a very similar reclaim process via invalidate_list(), but currently uses the i_lru list to aggregate inodes for a batched disposal. This requires taking the inode_lru_lock for every inode we want to dispose. Instead, take the inodes off the superblock inode list (as we already hold the lock) and use i_sb_list as the aggregator for inodes to dispose to reduce lock traffic. Further, iput_final() does the same inode cleanup as reclaim and unmount, so convert them all to use a single function for destroying inodes. This is written such that the callers can optimise list removals to avoid unneccessary lock round trips when removing inodes from lists. Based on a patch originally from Nick Piggin. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/inode.c | 112 ++++++++++++++++++++++++++++++++---------------------------- 1 files changed, 60 insertions(+), 52 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 0046ea8..d60e3b5 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -49,8 +49,8 @@ * * sb inode lock * inode_lru_lock - * wb->b_lock - * inode->i_lock + * wb->b_lock + * inode->i_lock * * wb->b_lock * sb_lock (pin sb for writeback) @@ -471,6 +471,48 @@ static void evict(struct inode *inode) } /* + * Free the inode passed in, removing it from the lists it is still connected + * to but avoiding unnecessary lock round-trips for the lists it is no longer + * on. + * + * An inode must already be marked I_FREEING so that we avoid the inode being + * moved back ...
From: Dave Chinner <dchinner@redhat.com>
Currently we have an optimisation in place in unlock_new_inode() to
avoid taking the inode_lock. This uses memory barriers to ensure
that the the clearing of I_NEW can be seen by all other CPUs. It
also means the wake_up_inode() call relies on a memory barrier to
ensure that processes it is waking see any changes the i_state field
prior to the wakeup being issued.
This serialisation is also necessary to protect the inode against
lookup/freeing races once the inode_lock is removed. The current
lookup and wait code is atomic as it is all performed under the
inode_lock, while the modified code now splits the locks. Hence we
can get the following race:
Thread 1: Thread 2:
iput_final
spin_lock(&inode->i_lock);
hlist_bl_lock()
Find inode
spin_lock(&inode->i_lock)
......
inode->i_state = I_FREEING;
spin_unlock(&inode->i_lock);
remove_inode_hash()
hlist_bl_lock()
......
if (I_FREEING)
hlist_bl_unlock()
......
hlist_bl_del_init()
hlist_bl_unlock()
wake_up_inode()
__wait_on_freeing_inode()
put on waitqueue
spin_unlock(&inode->i_lock);
schedule()
To avoid this race, wake ups need to be serialised against the
waiters, and using inode->i_lock is the natural solution.
The memory barrier optimisation is no longer needed to avoid the
global inode_lock contention as the inode->i_state field is now
protected by inode->i_lock. Hence we can revert the code to a much
simpler form and correctly serialise wake ups against
__wait_on_freeing_inode() by holding the inode->i_lock while we do
the wakeup.
Given the newfound simplicity of wake_up_inode() and the fact that
we need to change i_state in unlock_new_inode() before the wakeup,
just open code the wakeup in the couple of spots it is used and
remove wake_up_inode() entirely.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 41 ...Looks good, and thanks for documenting unlock_new_inode. Reviewed-by: Christoph Hellwig <hch@lst.de> --
You wanted an example of how the irregular locking requires handling of new concurrency situations? This is another good one. --
From: Dave Chinner <dchinner@redhat.com> All the functionality that the inode_lock protected has now been wrapped up in new independent locks and/or functionality. Hence the inode_lock does not serve a purpose any longer and hence can now be removed. Based on work originally done by Nick Piggin. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- Documentation/filesystems/Locking | 2 +- Documentation/filesystems/porting | 8 ++- Documentation/filesystems/vfs.txt | 2 +- fs/block_dev.c | 2 - fs/buffer.c | 2 +- fs/drop_caches.c | 4 - fs/fs-writeback.c | 85 ++++++---------------- fs/inode.c | 147 ++++++++----------------------------- fs/logfs/inode.c | 2 +- fs/notify/inode_mark.c | 10 +-- fs/notify/mark.c | 1 - fs/notify/vfsmount_mark.c | 1 - fs/ntfs/inode.c | 4 +- fs/ocfs2/inode.c | 2 +- fs/quota/dquot.c | 12 +-- include/linux/fs.h | 2 +- include/linux/writeback.h | 2 - mm/backing-dev.c | 4 - mm/filemap.c | 6 +- mm/rmap.c | 6 +- 20 files changed, 81 insertions(+), 223 deletions(-) diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 2db4283..7f98cd5 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -114,7 +114,7 @@ alloc_inode: destroy_inode: dirty_inode: (must not sleep) write_inode: -drop_inode: !!!inode_lock!!! +drop_inode: !!!i_lock!!! evict_inode: put_super: write write_super: read diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index b12c895..f182795 100644 --- a/Documentation/filesystems/porting +++ ...
From: Dave Chinner <dchinner@redhat.com>
We currently protect the per-inode state flags with the inode_lock.
Using a global lock to protect per-object state is overkill when we
coul duse a per-inode lock to protect the state. Use the
inode->i_lock for this, and wrap all the state changes and checks
with the inode->i_lock.
Based on work originally written by Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/drop_caches.c | 9 +++--
fs/fs-writeback.c | 48 +++++++++++++++++++++------
fs/inode.c | 85 ++++++++++++++++++++++++++++++++++--------------
fs/nilfs2/gcdat.c | 1 +
fs/notify/inode_mark.c | 6 ++-
fs/quota/dquot.c | 12 +++---
6 files changed, 113 insertions(+), 48 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index dfe8cb1..f958dd8 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -19,11 +19,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
spin_lock(&inode_lock);
spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
- if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
- continue;
- if (inode->i_mapping->nrpages == 0)
- continue;
spin_lock(&inode->i_lock);
+ if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+ (inode->i_mapping->nrpages == 0)) {
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
inode->i_ref++;
spin_unlock(&inode->i_lock);
spin_unlock(&sb->s_inodes_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36106e6..807d936 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -304,10 +304,12 @@ static void inode_wait_for_writeback(struct inode *inode)
wait_queue_head_t *wqh;
wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
- while (inode->i_state & I_SYNC) {
+ while (inode->i_state & I_SYNC) {
+ spin_unlock(&inode->i_lock);
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, ...Sorry, that's broken. Leaving aside leaking inode_lock here (you remove taking it later anyway), this is racy. Look: inode is hashed. It's alive and well. It just has no references outside of the lists. Right? Now, what's to stop another CPU from a) looking it up in icache b) doing unlink() c) dropping the last reference d) freeing the sucker ... just as you are about to call inode_lru_list_add() here? For paths in iput() where we do set I_FREEING/I_WILL_FREE it's perfectly fine to drop all locks once that's done. Inode is ours, nobody will pick it and we are free to do as we wish. For the path that leaves the inode alive and hashed - sorry, can't do. AFAICS, unlike hash, wb and sb locks, lru lock should nest *inside* ->i_lock. And yes, it means trylock in prune_icache(), with "put it in the end of the list for one more spin" if we fail. In that case it's really cleaner that way. --
I still have found that nesting them all inside i_lock is much more natural a way to protect the inode's icache state. Typically, we either have an icache data structure that we want to look up one inode from, or we have an inode that we need to do one *or more* icache state manipulations on. When it involves putting the inode on or off different data structures, holding i_lock over the sequence allows us to lift inode_lock without much further thought. One downside is the trylocks -- most can be subsequently removed quite easily by doing data structure lookups without locks. I prefer this approach, and be easily able to hold i_lock over the entirety of something like iput_final. At least for the initial lock break work. --
Nothing - I hadn't considered that as a potential inode freeing race window, so my assumption that it was OK to do this is wrong. It Yes, I knew that bit - I went wrong making the same assumptions I left it outside i_lock to be consistent with all the new locks being introduced. fs/fs-writeback.c::sync_inode() has a similar inode life-time wart when adding clean inodes to the lru which I was never really happy about. I suspect it has similar problems. I had a bit of a think about playing refcounting games to avoid doing the LRU add without holding the i_lock (to avoid the above freeing problem), but that ends up with more complex and messy iput/ iput_final interactions. Likewise, adding trylocks into the lru list add sites is doesn't solve the inode-goes-away-after-i_lock- is-dropped problems. A couple of other ideas I had also drowned in complexity at birth. AFAICT, moving the inode_lru_lock inside i_lock doesn't affect the locking order of anything else, and I agree that putting a single trylock in the prune_icache loop is definitely cleaner than anything else I've been able to think of that retains the current locking order. It will also remove the wart in sync_inode(). So, I'll swallow my rhetoric and agree with you that inode_lru_lock inside the i_lock is the natural and easiest way to nest these locks. I'll rework the series to do this. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
FWIW, here's what I'd prefer: * move the trivial parts in front of queue (including exofs fix, etc., etc.) * make sure that _everything_ walking the lists honors I_FREEING/I_WILL_FREE as the first step. We are very close to that already * protect all accesses to ->i_state with ->i_lock * separate lru list from wb * protect lru list by spinlock nested inside ->i_lock, with trylock in prune_icache() * at that point we can rip the inode_lock off the initial part of iput moving it down to the point after having marked the inode with I_FREEING at that point we can take hash, etc. out of inode_lock and under locks of their own one by one. And that kills inode_lock completely, at which point the hierarchy is established and we can do the rest (non-atomic refcount, etc.) Note that I'd rather leave *all* non-trivialities along the lines of per-zone vs per-sb for after the hierarchy is done. I.e. let's start with really simple "here's the single spinlock for hash, here's the single spinlock for all sb lists". If we get that right, the rest will be localized; let's deal with the skeleton first. What I'm going to do is to put together a branch with essentially cleanups and trivial fixes, with both patchsets forked off its tip. Then move stuff to common stem, rediffing the branches as we go. Then see what's left. One more note: IMO sb list lock is better off inside the hash one; when we do per-chain hash locks, we'll be better off if we don't have to hold sb one over the entire chain search. --
Why would you nest these two at all? --
OK, the current (partial) set is in git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ #merge-stem What remains to be done (I'm about to fall down right now, so that'll have to wait until tomorrow): * writeback_sb_inode() told to ignore I_FREEING ones in addition to I_NEW and I_WILL_FREE ones it ignores now. Currently I_FREEING can't be found there at all, so that'll change nothing. * invalidate_inodes() - collect I_FREEING/I_WILL_FREE on a separate list, then (after we'd evicted the stuff we'd decided to evict) wait until they get freed by whatever's freeing them already. * remove_dquot_ref() - looks like we might be OK with that one being as it is - it walks sb list of inodes and for things like prune_icache() the inodes stay on said list all the way through evict(), so it either doesn't care or it's already broken. And no, I'm not discounting either possibility - it needs further analysis. That's it - after that we'll be OK with dropping and regaining inode_lock between the moment when we set I_FREEING and removals from the lists. --
Note that we would only have to do this for the umount case. For others it's pretty pointless. But I think there's a better way to do it, and that's per-sb inode lru lists. By adopting the scheme from prune_dcache we'd always have s_umount exclusive for inode reclaims, and per defintion we would not have any ongoing reclaim when we do enter umount. It would also allow us to get rid of iprune_sem and the nasty unsolved locking issues it causes. --
Now that I've looked into it I think we basically fine right now. If we're in umount there should be no other I_FREEING inodes. - concurrent prune_icache is prevented by iprune_sem. - concurrent other invalidate_inodes should not happen because we're in unmount and the filesystem should not be reachable any more, and even if it did iprune_sem would protect us. - how could a concurrent iput_final happen? filesystem is not accessible anymore, and iput of fs internal inodes is single-threaded with the rest of the actual umount process. So just skipping over I_FREEING inodes here should be fine for non-umount callers, and for umount we could even do a WARN_ON. --
FWIW, I think we should kill most of invalidate_inodes() callers. Look: * call in generic_shutdown_super() is legitimate. The first one, that is. The second should be replaced with check for ->s_list being non-empty. Note that after the first pass we should have kicked out everything with zero i_count. Everything that gets dropped to zero i_count after that (i.e. during ->put_super()) will be evicted immediately and won't stay. I.e. the second call will evict *nothing*; it's just an overblown way to check if there are any inodes left. * call in ext2_remount() is hogwash - we do that with at least root inode pinned down, so it will fail, along with the remount attempt. * ntfs_fill_super() call - no-op. MS_ACTIVE hasn't been set yet, so there will be no inodes with zero i_count sitting around. * gfs2 calls - same story (no MS_ACTIVE yet in fill_super(), MS_ACTIVE already removed *and* invalidate_inodes() already called in gfs2_put_super()) * smb reconnect logics. AFAICS, that's complete crap; we *never* retain inodes on smbfs. IOW, nothing for invalidate_inodes() to do, other than evict fsnotify marks. Which is to say, we are calling the wrong function there, even assuming that fsnotify should try to work there. * finally, __invalidate_device(). Which has a slew of callers of its own and is *very* different from normal situation. Here we have underlying device gone bad. So I'm going to do the following: 1) split evict_inodes() off invalidate_inodes() and simplify it. 2) switch generic_shutdown_super() to that sucker, called once. 3) kill all calls of invalidate_inodes() except __invalidate_device() one. 4) think hard about __invalidate_device() situation. evict_inodes() should *not* see any inodes with I_NEW/I_FREEING/I_WILL_FREE. Just skip. It might see I_DIRTY/I_SYNC, but that's OK - evict_inode() will wait for that. OTOH, invalidate_inodes() from __invalidate_device() can run in parallel with e.g. final iput(). Currently it's not a problem, ...
And having it fail is a good thing. XIP mode means different file and address_space operations, which we don't even try to deal with right I don't think it should mess with fsnotify. fsnotify_unmount_inodes assumes it's only called on umount right now, and sends umount notifications to userspace (see my mail from a few days ago). So if you split invalidate_inodes it really should only go into the umount one. --
Exactly. But that should be done without that ridiculous call to invalidate_inodes() - we should simply fail remount() and be done No, I mean that it's not obvious that fsnotify clients can realistically work on smbfs in the first place. I.e. I suspect that fsnotify should refuse to set marks on that sucker. --
I don't know what's so unclean or wrong about the locking order. I'm still leaning much further than Al as to putting locks inside i_lock. With inode-rcu, we _always_ arrive at the inode _first_, before taking _any_ other locks (because we either come in via an external reference, or we can find the inode from any of the data structures using RCU rather than taking their locks). [The exception is hash insertion uniqueness enforcement, because we have no inode to lock, by definition. But in that case we're OK because the new inode we're about to insert has no concurrency on its i_lock so that can be initialised locked.] So what we have is an inode, which we need to do stuff to. That stuff involves moving it on or off data structures, and updating its refcount and state, etc. That suggests the i_lock -> data-structure-lock locking order. And look at the flexibility -- I could implement that without changing any other code or ordering in the icache. Sure you can reorder things around to cope with different locking strategies, but I'm not actually given a good reason _why_ i_lock first is not the "natural" ordering. Doing that means we simply don't _need_ to change any ordering or concurrency rules (although it doesn't prevent changes). --
From: Christoph Hellwig <hch@lst.de>
Split up inode_add_to_list/__inode_add_to_list. Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.
The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 70 +++++++++++++++++++-----------------------
fs/xfs/linux-2.6/xfs_iops.c | 4 ++-
include/linux/fs.h | 5 ++-
3 files changed, 38 insertions(+), 41 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 6b97eb7..301dff5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -359,6 +359,28 @@ void inode_lru_list_del(struct inode *inode)
}
}
+static inline void __inode_sb_list_add(struct inode *inode)
+{
+ list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
+}
+
+/**
+ * inode_sb_list_add - add inode to the superblock list of inodes
+ * @inode: inode to add
+ */
+void inode_sb_list_add(struct inode *inode)
+{
+ spin_lock(&inode_lock);
+ __inode_sb_list_add(inode);
+ spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL_GPL(inode_sb_list_add);
+
+static inline void __inode_sb_list_del(struct inode *inode)
+{
+ list_del_init(&inode->i_sb_list);
+}
+
static unsigned long hash(struct super_block *sb, unsigned long hashval)
{
unsigned long tmp;
@@ -379,9 +401,10 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
*/
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
- struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+ struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+
spin_lock(&inode_lock);
- hlist_add_head(&inode->i_hash, head);
+ hlist_add_head(&inode->i_hash, b);
spin_unlock(&inode_lock);
}
EXPORT_SYMBOL(__insert_inode_hash);
@@ -460,7 +483,7 @@ static void dispose_list(struct ...From: Christoph Hellwig <hch@lst.de> Instead of always assigning an increasing inode number in new_inode move the call to assign it into those callers that actually need it. For now callers that need it is estimated conservatively, that is the call is added to all filesystems that do not assign an i_ino by themselves. For a few more filesystems we can avoid assigning any inode number given that they aren't user visible, and for others it could be done lazily when an inode number is actually needed, but that's left for later patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- drivers/infiniband/hw/ipath/ipath_fs.c | 1 + drivers/infiniband/hw/qib/qib_fs.c | 1 + drivers/misc/ibmasm/ibmasmfs.c | 1 + drivers/oprofile/oprofilefs.c | 1 + drivers/usb/core/inode.c | 1 + drivers/usb/gadget/f_fs.c | 1 + drivers/usb/gadget/inode.c | 1 + fs/anon_inodes.c | 1 + fs/autofs4/inode.c | 1 + fs/binfmt_misc.c | 1 + fs/configfs/inode.c | 1 + fs/debugfs/inode.c | 1 + fs/ext4/mballoc.c | 1 + fs/freevxfs/vxfs_inode.c | 1 + fs/fuse/control.c | 1 + fs/hugetlbfs/inode.c | 1 + fs/inode.c | 4 ++-- fs/ocfs2/dlmfs/dlmfs.c | 2 ++ fs/pipe.c | 2 ++ fs/proc/base.c | 2 ++ fs/proc/proc_sysctl.c | 2 ++ fs/ramfs/inode.c | 1 + fs/xfs/linux-2.6/xfs_buf.c | 1 + include/linux/fs.h | 1 + ipc/mqueue.c | 1 + kernel/cgroup.c | 1 + mm/shmem.c | 1 + net/socket.c ...
From: Dave Chinner <dchinner@redhat.com>
To allow removal of the inode_lock, we first need to protect the
superblock inode list with its own lock instead of using the
inode_lock. Add a lock to the superblock to protect this list and
nest the new lock inside the inode_lock around the list operations
it needs to protect.
Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/drop_caches.c | 4 ++++
fs/fs-writeback.c | 4 ++++
fs/inode.c | 29 +++++++++++++++++++++++------
fs/notify/inode_mark.c | 3 +++
fs/quota/dquot.c | 6 ++++++
fs/super.c | 1 +
include/linux/fs.h | 1 +
7 files changed, 42 insertions(+), 6 deletions(-)
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 10c8c5a..dfe8cb1 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,6 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
struct inode *inode, *toput_inode = NULL;
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
continue;
@@ -25,12 +26,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
spin_lock(&inode->i_lock);
inode->i_ref++;
spin_unlock(&inode->i_lock);
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;
spin_lock(&inode_lock);
+ spin_lock(&sb->s_inodes_lock);
}
+ spin_unlock(&sb->s_inodes_lock);
spin_unlock(&inode_lock);
iput(toput_inode);
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1fb5d95..676e048 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1031,6 +1031,7 @@ static void wait_sb_inodes(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
...From: Christoph Hellwig <hch@lst.de>
Now that iunique is not abusing find_inode anymore we can move the i_ref
increment back to where it belongs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/inode.c | 30 +++++++++++-------------------
1 files changed, 11 insertions(+), 19 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 77ff091..6b97eb7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -657,11 +657,9 @@ static struct shrinker icache_shrinker = {
};
static void __wait_on_freeing_inode(struct inode *inode);
+
/*
* Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must take a reference
- * by hand after calling find_inode now! This simplifies iunique and won't
- * add any additional branch in the common code.
*/
static struct inode *find_inode(struct super_block *sb,
struct hlist_head *head,
@@ -681,9 +679,12 @@ repeat:
__wait_on_freeing_inode(inode);
goto repeat;
}
- break;
+ spin_lock(&inode->i_lock);
+ inode->i_ref++;
+ spin_unlock(&inode->i_lock);
+ return inode;
}
- return node ? inode : NULL;
+ return NULL;
}
/*
@@ -706,9 +707,12 @@ repeat:
__wait_on_freeing_inode(inode);
goto repeat;
}
- break;
+ spin_lock(&inode->i_lock);
+ inode->i_ref++;
+ spin_unlock(&inode->i_lock);
+ return inode;
}
- return node ? inode : NULL;
+ return NULL;
}
static inline void
@@ -853,9 +857,6 @@ static struct inode *get_new_inode(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
*/
- spin_lock(&old->i_lock);
- old->i_ref++;
- spin_unlock(&old->i_lock);
spin_unlock(&inode_lock);
destroy_inode(inode);
inode = old;
@@ -902,9 +903,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
* us. Use the old inode instead of the one we just
* allocated.
...From: Eric Dumazet <eric.dumazet@gmail.com>
new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.
Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations. This reduces contention on
the shared last_ino, and give same spreading ino numbers than before
(i.e. same wraparound after 2^32 allocations).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/inode.c | 45 ++++++++++++++++++++++++++++++++++++++-------
1 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index ba514a1..b33b57c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -815,6 +815,43 @@ repeat:
return NULL;
}
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
+ * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
+ * overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+#define LAST_INO_BATCH 1024
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+static unsigned int get_next_ino(void)
+{
+ unsigned int *p = &get_cpu_var(last_ino);
+ unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+ if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
+ static atomic_t shared_last_ino;
+ int next = ...From: Dave Chinner <dchinner@redhat.com> Lots of filesystem code open codes the act of getting a reference to an inode. Factor the open coded inode lock, increment, unlock into a function iref(). This removes most direct external references to the inode reference count. Originally based on a patch from Nick Piggin. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/9p/vfs_inode.c | 5 +++-- fs/affs/inode.c | 2 +- fs/afs/dir.c | 2 +- fs/anon_inodes.c | 7 +------ fs/bfs/dir.c | 2 +- fs/block_dev.c | 13 ++++++------- fs/btrfs/inode.c | 2 +- fs/coda/dir.c | 2 +- fs/drop_caches.c | 2 +- fs/exofs/inode.c | 2 +- fs/exofs/namei.c | 2 +- fs/ext2/namei.c | 2 +- fs/ext3/namei.c | 2 +- fs/ext4/namei.c | 2 +- fs/fs-writeback.c | 7 +++---- fs/gfs2/ops_inode.c | 2 +- fs/hfsplus/dir.c | 2 +- fs/inode.c | 34 ++++++++++++++++++++++------------ fs/jffs2/dir.c | 4 ++-- fs/jfs/jfs_txnmgr.c | 2 +- fs/jfs/namei.c | 2 +- fs/libfs.c | 2 +- fs/logfs/dir.c | 2 +- fs/minix/namei.c | 2 +- fs/namei.c | 2 +- fs/nfs/dir.c | 2 +- fs/nfs/getroot.c | 2 +- fs/nfs/write.c | 2 +- fs/nilfs2/namei.c | 2 +- fs/notify/inode_mark.c | 8 ++++---- fs/ntfs/super.c | 4 ++-- fs/ocfs2/namei.c | 2 +- fs/quota/dquot.c | 2 +- fs/reiserfs/namei.c | 2 +- fs/sysv/namei.c | 2 +- fs/ubifs/dir.c | 2 +- fs/udf/namei.c | 2 +- fs/ufs/namei.c | 2 +- ...
This still needs to be an unlocked increment, as point out in reply to the last iteration. --
From: Dave Chinner <dchinner@redhat.com>
Direct modification of the inode reference count is a no-no. Convert
the exofs decrements to call iput() instead of acting directly on
i_count.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/exofs/inode.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b631ff3..0fb4d4c 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1101,7 +1101,7 @@ static void create_done(struct exofs_io_state *ios, void *p)
set_obj_created(oi);
- atomic_dec(&inode->i_count);
+ iput(inode);
wake_up(&oi->i_wq);
}
@@ -1161,7 +1161,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
ios->cred = oi->i_cred;
ret = exofs_sbi_create(ios);
if (ret) {
- atomic_dec(&inode->i_count);
+ iput(inode);
exofs_put_io_state(ios);
return ERR_PTR(ret);
}
--
1.7.1
--
From: Nick Piggin <npiggin@suse.de>
Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic. We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.
This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/fs-writeback.c | 14 +++---
fs/inode.c | 111 +++++++++++++++++++++++++++-----------------
fs/internal.h | 6 +++
include/linux/fs.h | 13 +++---
include/linux/writeback.h | 1 -
5 files changed, 88 insertions(+), 57 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 58a95b7..33e9857 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -408,16 +408,16 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
* completion.
*/
redirty_tail(inode);
- } else if (atomic_read(&inode->i_count)) {
- /*
- * The inode is clean, inuse
- */
- list_move(&inode->i_list, &inode_in_use);
} else {
/*
- * The inode is clean, unused
+ * The inode is clean. If it is unused, then make sure
+ * that it is put on the LRU correctly as iput_final()
+ * does not move dirty inodes to the LRU and dirty
+ * inodes are removed from the LRU during scanning.
*/
- list_move(&inode->i_list, ...This "optimisation" is surely wrong. How could we have no reference Avoiding the reclaim optimisation? As I said, I noticed some increased scanning in heavy reclaim from removing this. --
Good question. iput_final does so for unlinked inodes or umount, and that should be about it as it's the only place setting I_WILL_FREE and we require that for a 0 refcount at the beginning of writeback_single_inode. But adding it to the LRU case for that is rather pointless as we will remove it a little bit later. So I think the assignment can be safely removed, but I'd rather do in a separate, properly documented patch rather than hiding it somewhere unrelated. That patch could however go towards the beggining of the series to make things easier. --
From: Dave Chinner <dchinner@redhat.com>
The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.
Based on a patch originally from Eric Dumazet.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/fs-writeback.c | 5 +--
fs/inode.c | 64 ++++++++++++++++++++++++++++++++++++---------------
include/linux/fs.h | 4 ++-
kernel/sysctl.c | 4 +-
4 files changed, 52 insertions(+), 25 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ab38fef..58a95b7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -723,7 +723,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
wb->last_old_flush = jiffies;
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ get_nr_dirty_inodes();
if (nr_pages) {
struct wb_writeback_work work = {
@@ -1090,8 +1090,7 @@ void writeback_inodes_sb(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- work.nr_pages = nr_dirty + nr_unstable +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ work.nr_pages = nr_dirty + nr_unstable + get_nr_dirty_inodes();
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..b3b6a4b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -103,8 +103,41 @@ static DECLARE_RWSEM(iprune_sem);
*/
struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
+static struct percpu_counter nr_inodes_unused ...Folks,
I just pushed a new version of this patchset which is pretty much a rebase on
2.6.36 out. I'm not going to post all the patches as not much changed - mostly
comments were changed. The changelog for the update is:
Version 7:
- rebase on 2.6.36
- iref() to inode->i_ref++ conversion in fs/nfs/write.c
- removed stray inode hash removal call from patches it didn't
belong in.
- cleaned up another stale remove_inode_hash comment.
- cleaned up more comments as reported by Christian Stroetmann
<stroetmann@ontolinux.com>.
--
The following changes since commit f6f94e2ab1b33f0082ac22d71f66385a60d8157f:
Linux 2.6.36 (2010-10-20 13:30:22 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale
Christoph Hellwig (4):
fs: Stop abusing find_inode_fast in iunique
fs: move i_ref increments into find_inode/find_inode_fast
fs: remove inode_add_to_list/__inode_add_to_list
fs: do not assign default i_ino in new_inode
Dave Chinner (13):
fs: switch bdev inode bdi's correctly
fs: Convert nr_inodes and nr_unused to per-cpu counters
fs: Clean up inode reference counting
exofs: use iput() for inode reference count decrements
fs: rework icount to be a locked variable
fs: Factor inode hash operations into functions
fs: Introduce per-bucket inode hash locks
fs: add a per-superblock lock for the inode list
fs: split locking of inode writeback and LRU lists
fs: Protect inode->i_state with the inode->i_lock
fs: protect wake_up_inode with inode->i_lock
fs: icache remove inode_lock
fs: Reduce inode I_FREEING and factor inode disposal
Eric Dumazet (1):
fs: introduce a per-cpu last_ino allocator
Nick Piggin (3):
kernel: add bl_list
fs: Implement lazy LRU updates for inodes
fs: inode split IO and LRU lists
Documentation/filesystems/Locking | 2 +-
...It seems we are at an impasse. It doesn't help that you are ignoring the most important concerns I've been raising with these patches. The locking model and the patch split up. I'd really like not to get deadlocked on this (haha), so please let's try to debate points. I've tried to reply to each point others have questioned me about, whether I agree or not I've given reasons. So, you know my objections to this approach already... I've got an update on my patchset coming, so I'd like to get some discussion going. I've cut out some of the stuff from mine so we don't get bogged down in boring things like per-zone locking or changing of the hash table data structure. Thanks, Nick --
No point appealing to me, Nick, it's not me that you have to convince. As I've said from the start, all I really care about is getting the code into shape that is acceptable to the reviewers. As such, I don't think there is anything _new_ to discuss - I'd simply be rehashing the same points I've already made to you over the past couple of weeks. That has done nothing to change you mind about anything, so it strikes me as a continuing exercise in futility. We have different ways of acheiving the same thing which have their pros and cons, and I think that the reviewers of the patch sets are aware of this. The reviewers are the people that will make the decision on the best way to proceed, and I'll follow their lead exactly as I have been since I started this process. So, if you want to continue arguing that your locking model is the One True Way, you need to convince the reviewers of the fact, not me. I'm just going to continue to address the concerns of the reviewers and fix bugs that reported until a decision is made one way or the other.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
"The reviewers"? I _am_ a reviewer of your code, and I've made some points, and you've ignored them. When you've reviewed my code and had comments, I've responded to them every time. I might not have agreed, but I tried to give you No you didn't make these points to me over the past couple of weeks. Specifically, do you agree or disagree about these points: - introducing new concurrency situations from not having a single lock for an inode's icache state is a negative? - if yes, then what aspect of your locking model justifies and outweighs it? - before the inode_lock is lifted, locking changes should be as simple and verifiable as absolutely possible, so that bisection has less chance of hitting the inode_lock wall? - further locking changes making the locking less regular and more complex should be done in small steps, after inode_lock is lifted And I have kept saying I would welcome your idea to reduce i_lock width in a small incremental patch. I still haven't figured out quite what is so important that can't be achieved in simpler ways (like rcu, or Yes of course I want to continue to argue that, because that is what my opinion is. What I need from you is to know why you believe yours is better, that I might concede I was wrong; point out where you are wrong; or show that one can be extended to have the positive aspects of another etc. --
No, it's not a small incremental. It's your locking order being wrong; the natural one is [hash, wb, sb] > ->i_lock > [lru] and that's one hell of a difference compared to what you are doing. Look: * iput_final() should happen under ->i_lock * if it leaves the inode alive, that's it; we can put it on LRU list since lru lock nests inside ->i_lock * if it decides to kill the inode, it sets I_FREEING or I_WILL_FREE before dropping ->i_lock. Once that's done, the inode is ours and nobody will pick it through the lists. We can release ->i_lock and then do what's needed. Safely. * accesses of ->i_state are under ->i_lock, including the switchover from I_WILL_FREE to I_FREEING * walkers of the sb, wb and hash lists can grab ->i_lock at will; it nests inside their locks. * prune_icache() grabs lru lock, then trylocks ->i_lock on the first element. If trylock fails, we just give inode another spin through the list by moving it to the tail; if it doesn't, we are holding ->i_lock and can proceed safely. What you seem to miss is that there are very few places accessing inode through the lists (i.e. via pointers that do not contribute to refcount) and absolute majority already checks for I_FREEING/I_WILL_FREE, refusing to pick such inodes. It's not an accidental subtle property of the code, it's bloody fundamental. As I've said, I've no religious problems with trylocks; we *do* need them for prune_icache() to get a sane locking scheme. But the way you put ->i_lock on the top of hierarchy is simply wrong. --
Importantly, to be able to manipulate the icache state in any number of steps, under a consistent lock. Exactly like we have with inode_lock today. Stepping away from that, adding code to handle new concurrencies, before inode_lock is able to be lifted is just wrong. The locking in my lock break patch is ugly and wrong, yes. But it is always an intermediate step. I want to argue that with RCU inode work *anyway*, there is not much point to reducing the strength of the i_lock property because locking can be cleaned up nicely and still keep i_lock ~= inode_lock (for a single inode). --
The other thing is that with RCU, the idea of locking an object in
the data structure with a per object lock actually *is* much more
natural. It's hard to do it properly with just a big data structure
lock.
If I want to take a reference to an inode from a data structre, how
to do it with RCU?
rcu_read_lock()
list_for_each(inode) {
spin_lock(&big_lock); /* oops, might as well not even use RCU then */
if (!unhashed) {
iget();
}
}
--
Huh? Why the hell does it have to be a big lock? You grab ->i_lock, then look at the damn thing. You also grab it on eviction from the list - *inside* the lock used for serializing the write access to your RCU list. --
That sucks, it requires more acquiring and dropping of i_lock and it hits single threaded performance. I looked at that. But it also loses the i_lock = inode_lock property. --
Look at the code. You are overengineering it. We do *not* need a framework for messing with these lists in arbitrary ways. Where would we need to do that to an inode we don't hold a reference to or had placed I_FREEING on and would need i_lock held by caller? Even assuming that we need to keep [present in hash, present on sb list] in sync (which I seriously doubt), Code outside of fs/inode.c and fs/fs-writeback.c generally has no business looking at the full icache state, period. --
Look, my point is that I believe it is an easier step to get from inode_lock to i_lock, and then from there we can go wild. What is your criteria for a particular lock ordering being "natural" versus not? In almost all cases we have [stuff with data structure] -> [stuff with inode] and [stuff with inode] -> [stuff with data structure] So neither is inherently more natural, I think. So it comes down to how the code fits together and works. The difficulty with inode_lock breaking is not the data structures. We know how to lock and modify them. The hardest part is verifying that a particular inode has no new, unhandled concurrency introduced to it (eg. the particular concurrency you pointed out in Dave's patch just now). Agree? So in that case, I think it is much more natural to be able to take an inode and with i_lock, cover it from all icache state concurrency. I object to it being called over engineering -- it's actually just a big hammer on the inode, compared with fiddling with more complex I'm not saying there is. Most of the problems would be between a particular inode state versus its membership on one of the lists. However, with my patches, I *don't care* if there is an issue there or not. It simply doesn't matter because it has the same protection as inode_lock at that point. If you want to micro optimise it, change lock orders around, and open more concurrency, that is easily possible to do after my patches lift inode_lock. If you do all the changes *before* inode_lock removal, then it's not bisectable at all. --
And yes, being a big hammer, it is actually ugly and clunky for the first pass. The intention is always that we can start steps to streamline it now. I had been looking at switching lock orders around, reducing i_lock coverage etc, but I found that with RCU, things got a lot cleaner without reducing i_lock coverage. With RCU the important part of the locking shifts back, from the read side, to the write side. Not surprisingly, this made my lock more natural, wheras it does nothing for a lock ordering which is the other way around. And I think you give too little credit to i_lock being used to protect all i_state. Sure it's not strictly needed, and we could start breaking bits and pieces. But it works really nicely, and is maintainable and easier to convince yourself of correctness. Have an inode? Want to do something to it? Take i_lock. You're done. We don't _need_ to ever think about concurrent modifications. Really, this is the "big dumb" approach IMO. Breaking things out finer than per-inode basis is premature optimisation at this point. Note that my series never precluded such incremental changes, in fact it makes them easier. --
