Re: [PATCH 16/21] fs: Protect inode->i_state with the inode->i_lock

Previous thread: [PATCH 02/21] kernel: add bl_list by Dave Chinner on Wednesday, October 20, 2010 - 5:49 pm. (1 message)

Next thread: [PATCH v2] power: bq24617: Adding initial charger support by rklein on Wednesday, October 20, 2010 - 5:56 pm. (25 messages)
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

This patch set is derived from Nick Piggin's VFS scalability tree.
This is an attempt to push the process of finer grained review of
the series for upstream inclusion. I'm hitting VFS lock contention
problems with XFS on 8-16p machines now, so I need to get this stuff
moving.

This patch set is just the basic inode_lock breakup patches plus a
few more simple changes to the inode code. It stops short of
introducing RCU inode freeing because those changes are not
completely baked yet.

As a result, the full inode handling improvements of Nick's patch
set are not realised with this short series. However, my own testing
indicates that the amount of lock traffic and contention is down by
an order of magnitude on an 8-way box for parallel inode create and
unlink workloads, so there is still significant improvements from
just this patch set.

Version 2 of this series was a complete rework of the original patch
series.  I've pulled in several of the cleanups and re-ordered the
series such that cleanups, factoring and list splitting are done
before any of the locking changes. Instead of converting the inode
state flags first, I've converted them last, ensuring that
manipulations are kept inside other locks rather than outside them.

The series is made up of the following steps:

	- inode counters are made per-cpu
	- inode LRU manipulations are made lazy
	- i_list is split into two lists (grows inode by 2
	  pointers), one for tracking lru status, one for writeback
	  status
	- reference counting is factored, then renamed and locked
	  differently
	- protect iunique counter with it's own lock
	- hash lookups and reference counting is cleaned up
	- inode hash operations are factored, then locked per bucket
	- superblock inode listis locked per-superblock
	- inode LRU is locked via a global lock
		- unclear what the best way to split this up from
		  here is, so no attempt is made to optimise
		  further.
		- Currently not showing signs of contention under
		  any workload on an ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Nick Piggin <npiggin@suse.de>

The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c            |    4 ++--
 fs/fs-writeback.c         |   27 ++++++++++++++-------------
 fs/inode.c                |   38 +++++++++++++++++++++-----------------
 fs/nilfs2/mdt.c           |    3 ++-
 include/linux/fs.h        |    3 ++-
 include/linux/writeback.h |    1 -
 mm/backing-dev.c          |    6 +++---
 7 files changed, 44 insertions(+), 38 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 501eab5..63b1c4c 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,8 +58,8 @@ static void bdev_inode_switch_bdi(struct inode *inode,
 {
 	spin_lock(&inode_lock);
 	inode->i_data.backing_dev_info = dst;
-	if (inode->i_state & I_DIRTY)
-		list_move(&inode->i_list, &dst->wb.b_dirty);
+	if (!list_empty(&inode->i_wb_list))
+		list_move(&inode->i_wb_list, &dst->wb.b_dirty);
 	spin_unlock(&inode_lock);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 33e9857..92d73b6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -172,11 +172,11 @@ static void redirty_tail(struct inode *inode)
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
-		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
+		tail = list_entry(wb->b_dirty.next, struct inode, i_wb_list);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &wb->b_dirty);
+	list_move(&inode->i_wb_list, &wb->b_dirty);
 }
 
 ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

The inode reference count is currently an atomic variable so that it
can be sampled/modified outside the inode_lock. However, the
inode_lock is still needed to synchronise the final reference count
and checks against the inode state.

To avoid needing the protection of the inode lock, protect the inode
reference count with the per-inode i_lock and convert it to a normal
variable. To avoid existing out-of-tree code accidentally compiling
against the new method, rename the i_count field to i_ref. This is
relatively straight forward as there are limited external references
to the i_count field remaining.

Based on work originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/vfs.txt        |   14 +++---
 arch/powerpc/platforms/cell/spufs/file.c |    2 +-
 fs/btrfs/inode.c                         |   14 ++++--
 fs/ceph/mds_client.c                     |    2 +-
 fs/cifs/inode.c                          |    2 +-
 fs/drop_caches.c                         |    4 +-
 fs/ext3/ialloc.c                         |    4 +-
 fs/ext4/ialloc.c                         |    4 +-
 fs/fs-writeback.c                        |   12 +++--
 fs/hpfs/inode.c                          |    2 +-
 fs/inode.c                               |   79 ++++++++++++++++++++++-------
 fs/locks.c                               |    2 +-
 fs/logfs/readwrite.c                     |    2 +-
 fs/nfs/inode.c                           |    4 +-
 fs/nfs/nfs4state.c                       |    2 +-
 fs/nilfs2/mdt.c                          |    2 +-
 fs/notify/inode_mark.c                   |   25 ++++++---
 fs/ntfs/inode.c                          |    6 +-
 fs/ntfs/super.c                          |    2 +-
 fs/quota/dquot.c                         |    4 +-
 fs/reiserfs/stree.c                      |    2 +-
 fs/smbfs/inode.c                         |    2 +-
 fs/ubifs/super.c    ...
From: Al Viro
Date: Thursday, October 21, 2010 - 12:40 pm

BTW, the same thing as with Nick's set - separate patch for "clone the
reference to inode we are already holding" helper, in front of queue.
--

From: Dave Chinner
Date: Thursday, October 21, 2010 - 3:32 pm

Isn't that already done by the patch 6 "fs: Clean up inode reference
counting"? That patch does the converting of stand-alone
atomic_inc(&inode->i-count) into iref(inode), not this one.
Maybe I've misunderstood what you are wanting to be changed with
this patch - can you clarify, Al?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
--

From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Christoph Hellwig <hch@lst.de>

Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
Introduce a new iunique_lock to protect the iunique counters once inode_lock
is removed.

Based on a patch originally from Nick Piggin.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |   30 +++++++++++++++++++++++++-----
 1 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index cfcafee..77ff091 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -913,6 +913,27 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 	return inode;
 }
 
+/*
+ * search the inode cache for a matching inode number.
+ * If we find one, then the inode number we are trying to
+ * allocate is not unique and so we should not use it.
+ *
+ * Returns 1 if the inode number is unique, 0 if it is not.
+ */
+static int test_inode_iunique(struct super_block *sb, unsigned long ino)
+{
+	struct hlist_head *b = inode_hashtable + hash(sb, ino);
+	struct hlist_node *node;
+	struct inode *inode;
+
+	hlist_for_each_entry(inode, node, b, i_hash) {
+		if (inode->i_ino == ino && inode->i_sb == sb)
+			return 0;
+	}
+
+	return 1;
+}
+
 /**
  *	iunique - get a unique inode number
  *	@sb: superblock
@@ -934,19 +955,18 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 	 * error if st_ino won't fit in target struct field. Use 32bit counter
 	 * here to attempt to avoid that.
 	 */
+	static DEFINE_SPINLOCK(iunique_lock);
 	static unsigned int counter;
-	struct inode *inode;
-	struct hlist_head *head;
 	ino_t res;
 
 	spin_lock(&inode_lock);
+	spin_lock(&iunique_lock);
 	do {
 		if (counter <= max_reserved)
 			counter = max_reserved + 1;
 		res = counter++;
-		head = inode_hashtable + hash(sb, res);
-		inode = find_inode_fast(sb, head, res);
-	} while (inode != NULL);
+	} while (!test_inode_iunique(sb, res));
+	spin_unlock(&iunique_lock);
 ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

Protect the inode hash with a single lock is not scalable.  Convert
the inode hash to use the new bit-locked hash list implementation
that allows per-bucket locks to be used. This allows us to replace
the global inode_lock with finer grained locking without increasing
the size of the hash table.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c        |    2 +-
 fs/fs-writeback.c       |    2 +-
 fs/hfs/hfs_fs.h         |    2 +-
 fs/hfs/inode.c          |    2 +-
 fs/hfsplus/hfsplus_fs.h |    2 +-
 fs/hfsplus/inode.c      |    2 +-
 fs/inode.c              |  168 ++++++++++++++++++++++++++++-------------------
 fs/nilfs2/gcinode.c     |   22 ++++---
 fs/nilfs2/segment.c     |    2 +-
 fs/nilfs2/the_nilfs.h   |    2 +-
 fs/reiserfs/xattr.c     |    2 +-
 include/linux/fs.h      |    8 ++-
 include/linux/list_bl.h |    1 +
 mm/shmem.c              |    4 +-
 14 files changed, 132 insertions(+), 89 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7947bf0..c7a2bef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3855,7 +3855,7 @@ again:
 	p = &root->inode_tree.rb_node;
 	parent = NULL;
 
-	if (hlist_unhashed(&inode->i_hash))
+	if (inode_unhashed(inode))
 		return;
 
 	spin_lock(&root->inode_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9832beb..1fb5d95 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -964,7 +964,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 * dirty list.  Add blockdev inodes as well.
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
-			if (hlist_unhashed(&inode->i_hash))
+			if (inode_unhashed(inode))
 				goto out;
 		}
 		if (inode->i_state & I_FREEING)
diff --git a/fs/hfs/hfs_fs.h b/fs/hfs/hfs_fs.h
index 4f55651..24591be 100644
--- a/fs/hfs/hfs_fs.h
+++ b/fs/hfs/hfs_fs.h
@@ -148,7 +148,7 @@ struct hfs_sb_info {
 
 	int fs_div;
 
-	struct hlist_head ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

Given that the inode LRU and IO lists are split apart, they do not
need to be protected by the same lock. So in preparation for removal
of the inode_lock, add new locks for them. The writeback lists are
only ever accessed in the context of a bdi, so add a per-BDI lock to
protect manipulations of these lists.

For the inode LRU, introduce a simple global lock to protect it.
While this could be made per-sb, it is unclear yet as to what is the
next step for optimising/parallelising reclaim of inodes. Rather
than optimise now, leave it as a global list and lock until further
analysis can be done.

Because there will now be a situation where the inode is on
different lists protected by different locks during the freeing of
the inode (i.e. not an atomic state transition), we need to ensure
that we set the I_FREEING state flag before we start removing inodes
from the IO and LRU lists. This ensures that if we race with other
threads during freeing, they will notice the I_FREEING flag is set
and be able to take appropriate action to avoid problems.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/block_dev.c              |    5 +++
 fs/fs-writeback.c           |   51 +++++++++++++++++++++++++++++++++---
 fs/inode.c                  |   61 ++++++++++++++++++++++++++++++++++++++-----
 fs/internal.h               |    5 +++
 include/linux/backing-dev.h |    3 ++
 mm/backing-dev.c            |   18 ++++++++++++
 6 files changed, 132 insertions(+), 11 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 11ad53d..7909775 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -56,10 +56,15 @@ EXPORT_SYMBOL(I_BDEV);
 static void bdev_inode_switch_bdi(struct inode *inode,
 			struct backing_dev_info *dst)
 {
+	struct backing_dev_info *old = inode->i_data.backing_dev_info;
+
 	spin_lock(&inode_lock);
+	bdi_lock_two(old, dst);
 ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

Inode reclaim can push many inodes into the I_FREEING state before
it actually frees them. During the time it gathers these inodes, it
can call iput(), invalidate_mapping_pages, be preempted, etc. As a
result, holding inodes in I_FREEING can cause pauses.

After the inode scalability work, there is not a big reason to batch
up inodes to reclaim them, so we can dispose them as they are found
from the LRU.

Unmount does a very similar reclaim process via invalidate_list(),
but currently uses the i_lru list to aggregate inodes for a batched
disposal. This requires taking the inode_lru_lock for every inode we
want to dispose. Instead, take the inodes off the superblock inode
list (as we already hold the lock) and use i_sb_list as the
aggregator for inodes to dispose to reduce lock traffic.

Further, iput_final() does the same inode cleanup as reclaim and
unmount, so convert them all to use a single function for destroying
inodes. This is written such that the callers can optimise list
removals to avoid unneccessary lock round trips when removing inodes
from lists.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |  112 ++++++++++++++++++++++++++++++++----------------------------
 1 files changed, 60 insertions(+), 52 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 0046ea8..d60e3b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -49,8 +49,8 @@
  *
  * sb inode lock
  *   inode_lru_lock
- *     wb->b_lock
- *       inode->i_lock
+ *   wb->b_lock
+ *     inode->i_lock
  *
  * wb->b_lock
  *   sb_lock (pin sb for writeback)
@@ -471,6 +471,48 @@ static void evict(struct inode *inode)
 }
 
 /*
+ * Free the inode passed in, removing it from the lists it is still connected
+ * to but avoiding unnecessary lock round-trips for the lists it is no longer
+ * on.
+ *
+ * An inode must already be marked I_FREEING so that we avoid the inode being
+ * moved back ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

Currently we have an optimisation in place in unlock_new_inode() to
avoid taking the inode_lock. This uses memory barriers to ensure
that the the clearing of I_NEW can be seen by all other CPUs. It
also means the wake_up_inode() call relies on a memory barrier to
ensure that processes it is waking see any changes the i_state field
prior to the wakeup being issued.

This serialisation is also necessary to protect the inode against
lookup/freeing races once the inode_lock is removed. The current
lookup and wait code is atomic as it is all performed under the
inode_lock, while the modified code now splits the locks. Hence we
can get the following race:

Thread 1:				Thread 2:
					iput_final
					spin_lock(&inode->i_lock);
hlist_bl_lock()
  Find inode
    spin_lock(&inode->i_lock)
					......
					inode->i_state = I_FREEING;
					spin_unlock(&inode->i_lock);
					remove_inode_hash()
					   hlist_bl_lock()
    ......
    if (I_FREEING)
	hlist_bl_unlock()
					   ......
					   hlist_bl_del_init()
					   hlist_bl_unlock()
					wake_up_inode()

	__wait_on_freeing_inode()
	  put on waitqueue
	  spin_unlock(&inode->i_lock);
	  schedule()

To avoid this race, wake ups need to be serialised against the
waiters, and using inode->i_lock is the natural solution.

The memory barrier optimisation is no longer needed to avoid the
global inode_lock contention as the inode->i_state field is now
protected by inode->i_lock. Hence we can revert the code to a much
simpler form and correctly serialise wake ups against
__wait_on_freeing_inode() by holding the inode->i_lock while we do
the wakeup.

Given the newfound simplicity of wake_up_inode() and the fact that
we need to change i_state in unlock_new_inode() before the wakeup,
just open code the wakeup in the couple of spots it is used and
remove wake_up_inode() entirely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |   41 ...
From: Christoph Hellwig
Date: Wednesday, October 20, 2010 - 7:17 pm

Looks good, and thanks for documenting unlock_new_inode.


Reviewed-by: Christoph Hellwig <hch@lst.de>

--

From: Nick Piggin
Date: Thursday, October 21, 2010 - 6:16 am

You wanted an example of how the irregular locking requires
handling of new concurrency situations? This is another good
one.

--

From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

All the functionality that the inode_lock protected has now been
wrapped up in new independent locks and/or functionality. Hence the
inode_lock does not serve a purpose any longer and hence can now be
removed.

Based on work originally done by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/filesystems/Locking |    2 +-
 Documentation/filesystems/porting |    8 ++-
 Documentation/filesystems/vfs.txt |    2 +-
 fs/block_dev.c                    |    2 -
 fs/buffer.c                       |    2 +-
 fs/drop_caches.c                  |    4 -
 fs/fs-writeback.c                 |   85 ++++++----------------
 fs/inode.c                        |  147 ++++++++-----------------------------
 fs/logfs/inode.c                  |    2 +-
 fs/notify/inode_mark.c            |   10 +--
 fs/notify/mark.c                  |    1 -
 fs/notify/vfsmount_mark.c         |    1 -
 fs/ntfs/inode.c                   |    4 +-
 fs/ocfs2/inode.c                  |    2 +-
 fs/quota/dquot.c                  |   12 +--
 include/linux/fs.h                |    2 +-
 include/linux/writeback.h         |    2 -
 mm/backing-dev.c                  |    4 -
 mm/filemap.c                      |    6 +-
 mm/rmap.c                         |    6 +-
 20 files changed, 81 insertions(+), 223 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 2db4283..7f98cd5 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -114,7 +114,7 @@ alloc_inode:
 destroy_inode:
 dirty_inode:				(must not sleep)
 write_inode:
-drop_inode:				!!!inode_lock!!!
+drop_inode:				!!!i_lock!!!
 evict_inode:
 put_super:		write
 write_super:		read
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index b12c895..f182795 100644
--- a/Documentation/filesystems/porting
+++ ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

We currently protect the per-inode state flags with the inode_lock.
Using a global lock to protect per-object state is overkill when we
coul duse a per-inode lock to protect the state.  Use the
inode->i_lock for this, and wrap all the state changes and checks
with the inode->i_lock.

Based on work originally written by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/drop_caches.c       |    9 +++--
 fs/fs-writeback.c      |   48 +++++++++++++++++++++------
 fs/inode.c             |   85 ++++++++++++++++++++++++++++++++++--------------
 fs/nilfs2/gcdat.c      |    1 +
 fs/notify/inode_mark.c |    6 ++-
 fs/quota/dquot.c       |   12 +++---
 6 files changed, 113 insertions(+), 48 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index dfe8cb1..f958dd8 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -19,11 +19,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
-			continue;
-		if (inode->i_mapping->nrpages == 0)
-			continue;
 		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    (inode->i_mapping->nrpages == 0)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36106e6..807d936 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -304,10 +304,12 @@ static void inode_wait_for_writeback(struct inode *inode)
 	wait_queue_head_t *wqh;
 
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	 while (inode->i_state & I_SYNC) {
+	while (inode->i_state & I_SYNC) {
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, ...
From: Al Viro
Date: Thursday, October 21, 2010 - 6:56 pm

Sorry, that's broken.  Leaving aside leaking inode_lock here (you remove
taking it later anyway), this is racy.

Look: inode is hashed.  It's alive and well.  It just has no references outside
of the lists.  Right?  Now, what's to stop another CPU from
	a) looking it up in icache
	b) doing unlink()
	c) dropping the last reference
	d) freeing the sucker
... just as you are about to call inode_lru_list_add() here?

For paths in iput() where we do set I_FREEING/I_WILL_FREE it's perfectly
fine to drop all locks once that's done.  Inode is ours, nobody will pick
it and we are free to do as we wish.

For the path that leaves the inode alive and hashed - sorry, can't do.
AFAICS, unlike hash, wb and sb locks, lru lock should nest *inside*
->i_lock.  And yes, it means trylock in prune_icache(), with "put it in
the end of the list for one more spin" if we fail.  In that case it's
really cleaner that way.
--

From: Nick Piggin
Date: Thursday, October 21, 2010 - 7:26 pm

I still have found that nesting them all inside i_lock is much more
natural a way to protect the inode's icache state.

Typically, we either have an icache data structure that we want to
look up one inode from, or we have an inode that we need to do one
*or more* icache state manipulations on.

When it involves putting the inode on or off different data structures,
holding i_lock over the sequence allows us to lift inode_lock without
much further thought.

One downside is the trylocks -- most can be subsequently removed quite
easily by doing data structure lookups without locks. I prefer this
approach, and be easily able to hold i_lock over the entirety of
something like iput_final. At least for the initial lock break work.

--

From: Dave Chinner
Date: Thursday, October 21, 2010 - 8:14 pm

Nothing - I hadn't considered that as a potential inode freeing
race window, so my assumption that it was OK to do this is wrong. It

Yes, I knew that bit - I went wrong making the same assumptions

I left it outside i_lock to be consistent with all the new
locks being introduced. fs/fs-writeback.c::sync_inode() has a
similar inode life-time wart when adding clean inodes to the lru
which I was never really happy about. I suspect it has similar
problems.

I had a bit of a think about playing refcounting games to avoid
doing the LRU add without holding the i_lock (to avoid the above
freeing problem), but that ends up with more complex and messy iput/
iput_final interactions.  Likewise, adding trylocks into the lru
list add sites is doesn't solve the inode-goes-away-after-i_lock-
is-dropped problems.  A couple of other ideas I had also drowned in
complexity at birth.

AFAICT, moving the inode_lru_lock inside i_lock doesn't affect the
locking order of anything else, and I agree that putting a single
trylock in the prune_icache loop is definitely cleaner than anything
else I've been able to think of that retains the current locking
order. It will also remove the wart in sync_inode().

So, I'll swallow my rhetoric and agree with you that inode_lru_lock
inside the i_lock is the natural and easiest way to nest these
locks. I'll rework the series to do this. 

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Al Viro
Date: Friday, October 22, 2010 - 3:37 am

FWIW, here's what I'd prefer:

* move the trivial parts in front of queue (including exofs fix, etc., etc.)
* make sure that _everything_ walking the lists honors I_FREEING/I_WILL_FREE
as the first step.  We are very close to that already 
* protect all accesses to ->i_state with ->i_lock
* separate lru list from wb
* protect lru list by spinlock nested inside ->i_lock, with trylock in
prune_icache()
* at that point we can rip the inode_lock off the initial part of iput
moving it down to the point after having marked the inode with I_FREEING
at that point we can take hash, etc. out of inode_lock and under locks of
their own one by one.  And that kills inode_lock completely, at which point
the hierarchy is established and we can do the rest (non-atomic refcount,
etc.)

Note that I'd rather leave *all* non-trivialities along the lines of
per-zone vs per-sb for after the hierarchy is done.  I.e. let's start
with really simple "here's the single spinlock for hash, here's the
single spinlock for all sb lists".  If we get that right, the rest will
be localized; let's deal with the skeleton first.

What I'm going to do is to put together a branch with essentially cleanups
and trivial fixes, with both patchsets forked off its tip.  Then move stuff
to common stem, rediffing the branches as we go.  Then see what's left.

One more note: IMO sb list lock is better off inside the hash one; when we
do per-chain hash locks, we'll be better off if we don't have to hold sb
one over the entire chain search.
--

From: Christoph Hellwig
Date: Friday, October 22, 2010 - 4:40 am

Why would you nest these two at all?

--

From: Al Viro
Date: Saturday, October 23, 2010 - 2:37 pm

OK, the current (partial) set is in

git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ #merge-stem

What remains to be done (I'm about to fall down right now, so that'll have
to wait until tomorrow):

	* writeback_sb_inode() told to ignore I_FREEING ones in addition to
I_NEW and I_WILL_FREE ones it ignores now.  Currently I_FREEING can't be
found there at all, so that'll change nothing.
	* invalidate_inodes() - collect I_FREEING/I_WILL_FREE on a separate
list, then (after we'd evicted the stuff we'd decided to evict) wait until
they get freed by whatever's freeing them already.
	* remove_dquot_ref() - looks like we might be OK with that one being
as it is - it walks sb list of inodes and for things like prune_icache()
the inodes stay on said list all the way through evict(), so it either
doesn't care or it's already broken.  And no, I'm not discounting either
possibility - it needs further analysis.

That's it - after that we'll be OK with dropping and regaining inode_lock
between the moment when we set I_FREEING and removals from the lists.
--

From: Christoph Hellwig
Date: Sunday, October 24, 2010 - 7:13 am

Note that we would only have to do this for the umount case.  For others
it's pretty pointless.

But I think there's a better way to do it, and that's per-sb inode lru
lists.  By adopting the scheme from prune_dcache we'd always have
s_umount exclusive for inode reclaims, and per defintion we would not
have any ongoing reclaim when we do enter umount.  It would also allow
us to get rid of iprune_sem and the nasty unsolved locking issues it
causes.

--

From: Christoph Hellwig
Date: Sunday, October 24, 2010 - 9:21 am

Now that I've looked into it I think we basically fine right now.

If we're in umount there should be no other I_FREEING inodes.

 - concurrent prune_icache is prevented by iprune_sem.
 - concurrent other invalidate_inodes should not happen because we're
   in unmount and the filesystem should not be reachable any more,
   and even if it did iprune_sem would protect us.
 - how could a concurrent iput_final happen?  filesystem is not
   accessible anymore, and iput of fs internal inodes is single-threaded
   with the rest of the actual umount process.

So just skipping over I_FREEING inodes here should be fine for
non-umount callers, and for umount we could even do a WARN_ON.
--

From: Al Viro
Date: Sunday, October 24, 2010 - 12:17 pm

FWIW, I think we should kill most of invalidate_inodes() callers.  Look:
	* call in generic_shutdown_super() is legitimate.  The first one,
that is.  The second should be replaced with check for ->s_list being
non-empty.  Note that after the first pass we should have kicked out
everything with zero i_count.  Everything that gets dropped to zero
i_count after that (i.e. during ->put_super()) will be evicted immediately
and won't stay.  I.e. the second call will evict *nothing*; it's just
an overblown way to check if there are any inodes left.
	* call in ext2_remount() is hogwash - we do that with at least
root inode pinned down, so it will fail, along with the remount attempt.
	* ntfs_fill_super() call - no-op.  MS_ACTIVE hasn't been set
yet, so there will be no inodes with zero i_count sitting around.
	* gfs2 calls - same story (no MS_ACTIVE yet in fill_super(),
MS_ACTIVE already removed *and* invalidate_inodes() already called
in gfs2_put_super())
	* smb reconnect logics.  AFAICS, that's complete crap; we *never*
retain inodes on smbfs.  IOW, nothing for invalidate_inodes() to do, other
than evict fsnotify marks.  Which is to say, we are calling the wrong
function there, even assuming that fsnotify should try to work there.
	* finally, __invalidate_device().  Which has a slew of callers of
its own and is *very* different from normal situation.  Here we have
underlying device gone bad.

So I'm going to do the following:
	1) split evict_inodes() off invalidate_inodes() and simplify it.
	2) switch generic_shutdown_super() to that sucker, called once.
	3) kill all calls of invalidate_inodes() except __invalidate_device()
one.
	4) think hard about __invalidate_device() situation.

	evict_inodes() should *not* see any inodes with
I_NEW/I_FREEING/I_WILL_FREE.  Just skip.  It might see I_DIRTY/I_SYNC,
but that's OK - evict_inode() will wait for that.

	OTOH, invalidate_inodes() from __invalidate_device() can run in
parallel with e.g. final iput().  Currently it's not a problem, ...
From: Christoph Hellwig
Date: Sunday, October 24, 2010 - 1:04 pm

And having it fail is a good thing.  XIP mode means different file and
address_space operations, which we don't even try to deal with right

I don't think it should mess with fsnotify.  fsnotify_unmount_inodes
assumes it's only called on umount right now, and sends umount
notifications to userspace (see my mail from a few days ago).  So if
you split invalidate_inodes it really should only go into the umount
one.

--

From: Al Viro
Date: Sunday, October 24, 2010 - 1:36 pm

Exactly.  But that should be done without that ridiculous call to
invalidate_inodes() - we should simply fail remount() and be done

No, I mean that it's not obvious that fsnotify clients can realistically
work on smbfs in the first place.  I.e. I suspect that fsnotify should
refuse to set marks on that sucker.
--

From: Nick Piggin
Date: Saturday, October 23, 2010 - 7:18 pm

I don't know what's so unclean or wrong about the locking order.  I'm
still leaning much further than Al as to putting locks inside i_lock.

With inode-rcu, we _always_ arrive at the inode _first_, before taking
_any_ other locks (because we either come in via an external reference,
or we can find the inode from any of the data structures using RCU
rather than taking their locks).

[The exception is hash insertion uniqueness enforcement, because we have
 no inode to lock, by definition. But in that case we're OK because the
 new inode we're about to insert has no concurrency on its i_lock so that
 can be initialised locked.]

So what we have is an inode, which we need to do stuff to. That stuff
involves moving it on or off data structures, and updating its refcount
and state, etc.

That suggests the i_lock -> data-structure-lock locking order.

And look at the flexibility -- I could implement that without changing
any other code or ordering in the icache. Sure you can reorder things
around to cope with different locking strategies, but I'm not actually
given a good reason _why_ i_lock first is not the "natural" ordering.
Doing that means we simply don't _need_ to change any ordering or
concurrency rules (although it doesn't prevent changes).

--

From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Christoph Hellwig <hch@lst.de>

Split up inode_add_to_list/__inode_add_to_list.  Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.

The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c                  |   70 +++++++++++++++++++-----------------------
 fs/xfs/linux-2.6/xfs_iops.c |    4 ++-
 include/linux/fs.h          |    5 ++-
 3 files changed, 38 insertions(+), 41 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 6b97eb7..301dff5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -359,6 +359,28 @@ void inode_lru_list_del(struct inode *inode)
 	}
 }
 
+static inline void __inode_sb_list_add(struct inode *inode)
+{
+	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
+}
+
+/**
+ * inode_sb_list_add - add inode to the superblock list of inodes
+ * @inode: inode to add
+ */
+void inode_sb_list_add(struct inode *inode)
+{
+	spin_lock(&inode_lock);
+	__inode_sb_list_add(inode);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL_GPL(inode_sb_list_add);
+
+static inline void __inode_sb_list_del(struct inode *inode)
+{
+	list_del_init(&inode->i_sb_list);
+}
+
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
 {
 	unsigned long tmp;
@@ -379,9 +401,10 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
  */
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+
 	spin_lock(&inode_lock);
-	hlist_add_head(&inode->i_hash, head);
+	hlist_add_head(&inode->i_hash, b);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
@@ -460,7 +483,7 @@ static void dispose_list(struct ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Christoph Hellwig <hch@lst.de>

Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves.  For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 drivers/infiniband/hw/ipath/ipath_fs.c |    1 +
 drivers/infiniband/hw/qib/qib_fs.c     |    1 +
 drivers/misc/ibmasm/ibmasmfs.c         |    1 +
 drivers/oprofile/oprofilefs.c          |    1 +
 drivers/usb/core/inode.c               |    1 +
 drivers/usb/gadget/f_fs.c              |    1 +
 drivers/usb/gadget/inode.c             |    1 +
 fs/anon_inodes.c                       |    1 +
 fs/autofs4/inode.c                     |    1 +
 fs/binfmt_misc.c                       |    1 +
 fs/configfs/inode.c                    |    1 +
 fs/debugfs/inode.c                     |    1 +
 fs/ext4/mballoc.c                      |    1 +
 fs/freevxfs/vxfs_inode.c               |    1 +
 fs/fuse/control.c                      |    1 +
 fs/hugetlbfs/inode.c                   |    1 +
 fs/inode.c                             |    4 ++--
 fs/ocfs2/dlmfs/dlmfs.c                 |    2 ++
 fs/pipe.c                              |    2 ++
 fs/proc/base.c                         |    2 ++
 fs/proc/proc_sysctl.c                  |    2 ++
 fs/ramfs/inode.c                       |    1 +
 fs/xfs/linux-2.6/xfs_buf.c             |    1 +
 include/linux/fs.h                     |    1 +
 ipc/mqueue.c                           |    1 +
 kernel/cgroup.c                        |    1 +
 mm/shmem.c                             |    1 +
 net/socket.c                   ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

To allow removal of the inode_lock, we first need to protect the
superblock inode list with its own lock instead of using the
inode_lock. Add a lock to the superblock to protect this list and
nest the new lock inside the inode_lock around the list operations
it needs to protect.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/drop_caches.c       |    4 ++++
 fs/fs-writeback.c      |    4 ++++
 fs/inode.c             |   29 +++++++++++++++++++++++------
 fs/notify/inode_mark.c |    3 +++
 fs/quota/dquot.c       |    6 ++++++
 fs/super.c             |    1 +
 include/linux/fs.h     |    1 +
 7 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 10c8c5a..dfe8cb1 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,6 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
@@ -25,12 +26,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		spin_lock(&inode->i_lock);
 		inode->i_ref++;
 		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1fb5d95..676e048 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1031,6 +1031,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Christoph Hellwig <hch@lst.de>

Now that iunique is not abusing find_inode anymore we can move the i_ref
increment back to where it belongs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |   30 +++++++++++-------------------
 1 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 77ff091..6b97eb7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -657,11 +657,9 @@ static struct shrinker icache_shrinker = {
 };
 
 static void __wait_on_freeing_inode(struct inode *inode);
+
 /*
  * Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must take a reference
- * by hand after calling find_inode now! This simplifies iunique and won't
- * add any additional branch in the common code.
  */
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
@@ -681,9 +679,12 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
-		break;
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+		return inode;
 	}
-	return node ? inode : NULL;
+	return NULL;
 }
 
 /*
@@ -706,9 +707,12 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
-		break;
+		spin_lock(&inode->i_lock);
+		inode->i_ref++;
+		spin_unlock(&inode->i_lock);
+		return inode;
 	}
-	return node ? inode : NULL;
+	return NULL;
 }
 
 static inline void
@@ -853,9 +857,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_lock(&old->i_lock);
-		old->i_ref++;
-		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -902,9 +903,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.

Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations.  This reduces contention on
the shared last_ino, and give same spreading ino numbers than before
(i.e. same wraparound after 2^32 allocations).

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/inode.c |   45 ++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ba514a1..b33b57c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -815,6 +815,43 @@ repeat:
 	return NULL;
 }
 
+/*
+ * Each cpu owns a range of LAST_INO_BATCH numbers.
+ * 'shared_last_ino' is dirtied only once out of LAST_INO_BATCH allocations,
+ * to renew the exhausted range.
+ *
+ * This does not significantly increase overflow rate because every CPU can
+ * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
+ * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
+ * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
+ * overflow rate by 2x, which does not seem too significant.
+ *
+ * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
+ * error if st_ino won't fit in target struct field. Use 32bit counter
+ * here to attempt to avoid that.
+ */
+#define LAST_INO_BATCH 1024
+static DEFINE_PER_CPU(unsigned int, last_ino);
+
+static unsigned int get_next_ino(void)
+{
+	unsigned int *p = &get_cpu_var(last_ino);
+	unsigned int res = *p;
+
+#ifdef CONFIG_SMP
+	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
+		static atomic_t shared_last_ino;
+		int next = ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

Lots of filesystem code open codes the act of getting a reference to
an inode.  Factor the open coded inode lock, increment, unlock into
a function iref(). This removes most direct external references to
the inode reference count.

Originally based on a patch from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/9p/vfs_inode.c           |    5 +++--
 fs/affs/inode.c             |    2 +-
 fs/afs/dir.c                |    2 +-
 fs/anon_inodes.c            |    7 +------
 fs/bfs/dir.c                |    2 +-
 fs/block_dev.c              |   13 ++++++-------
 fs/btrfs/inode.c            |    2 +-
 fs/coda/dir.c               |    2 +-
 fs/drop_caches.c            |    2 +-
 fs/exofs/inode.c            |    2 +-
 fs/exofs/namei.c            |    2 +-
 fs/ext2/namei.c             |    2 +-
 fs/ext3/namei.c             |    2 +-
 fs/ext4/namei.c             |    2 +-
 fs/fs-writeback.c           |    7 +++----
 fs/gfs2/ops_inode.c         |    2 +-
 fs/hfsplus/dir.c            |    2 +-
 fs/inode.c                  |   34 ++++++++++++++++++++++------------
 fs/jffs2/dir.c              |    4 ++--
 fs/jfs/jfs_txnmgr.c         |    2 +-
 fs/jfs/namei.c              |    2 +-
 fs/libfs.c                  |    2 +-
 fs/logfs/dir.c              |    2 +-
 fs/minix/namei.c            |    2 +-
 fs/namei.c                  |    2 +-
 fs/nfs/dir.c                |    2 +-
 fs/nfs/getroot.c            |    2 +-
 fs/nfs/write.c              |    2 +-
 fs/nilfs2/namei.c           |    2 +-
 fs/notify/inode_mark.c      |    8 ++++----
 fs/ntfs/super.c             |    4 ++--
 fs/ocfs2/namei.c            |    2 +-
 fs/quota/dquot.c            |    2 +-
 fs/reiserfs/namei.c         |    2 +-
 fs/sysv/namei.c             |    2 +-
 fs/ubifs/dir.c              |    2 +-
 fs/udf/namei.c              |    2 +-
 fs/ufs/namei.c              |    2 +-
 ...
From: Christoph Hellwig
Date: Wednesday, October 20, 2010 - 6:41 pm

This still needs to be an unlocked increment, as point out in reply to
the last iteration.

--

From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

Direct modification of the inode reference count is a no-no. Convert
the exofs decrements to call iput() instead of acting directly on
i_count.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/exofs/inode.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index b631ff3..0fb4d4c 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1101,7 +1101,7 @@ static void create_done(struct exofs_io_state *ios, void *p)
 
 	set_obj_created(oi);
 
-	atomic_dec(&inode->i_count);
+	iput(inode);
 	wake_up(&oi->i_wq);
 }
 
@@ -1161,7 +1161,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
 	ios->cred = oi->i_cred;
 	ret = exofs_sbi_create(ios);
 	if (ret) {
-		atomic_dec(&inode->i_count);
+		iput(inode);
 		exofs_put_io_state(ios);
 		return ERR_PTR(ret);
 	}
-- 
1.7.1

--

From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Nick Piggin <npiggin@suse.de>

Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic.  We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.

This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/fs-writeback.c         |   14 +++---
 fs/inode.c                |  111 +++++++++++++++++++++++++++-----------------
 fs/internal.h             |    6 +++
 include/linux/fs.h        |   13 +++---
 include/linux/writeback.h |    1 -
 5 files changed, 88 insertions(+), 57 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 58a95b7..33e9857 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -408,16 +408,16 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * completion.
 			 */
 			redirty_tail(inode);
-		} else if (atomic_read(&inode->i_count)) {
-			/*
-			 * The inode is clean, inuse
-			 */
-			list_move(&inode->i_list, &inode_in_use);
 		} else {
 			/*
-			 * The inode is clean, unused
+			 * The inode is clean. If it is unused, then make sure
+			 * that it is put on the LRU correctly as iput_final()
+			 * does not move dirty inodes to the LRU and dirty
+			 * inodes are removed from the LRU during scanning.
 			 */
-			list_move(&inode->i_list, ...
From: Nick Piggin
Date: Thursday, October 21, 2010 - 3:07 am

This "optimisation" is surely wrong. How could we have no reference

Avoiding the reclaim optimisation? As I said, I noticed some increased
scanning in heavy reclaim from removing this.

--

From: Christoph Hellwig
Date: Thursday, October 21, 2010 - 5:22 am

Good question.  iput_final does so for unlinked inodes or umount,
and that should be about it as it's the only place setting I_WILL_FREE
and we require that for a 0 refcount at the beginning of
writeback_single_inode.  But adding it to the LRU case for that
is rather pointless as we will remove it a little bit later.

So I think the assignment can be safely removed, but I'd rather do in
a separate, properly documented patch rather than hiding it somewhere
unrelated.  That patch could however go towards the beggining of the
series to make things easier.
--

From: Dave Chinner
Date: Wednesday, October 20, 2010 - 5:49 pm

From: Dave Chinner <dchinner@redhat.com>

The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.

Based on a patch originally from Eric Dumazet.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/fs-writeback.c  |    5 +--
 fs/inode.c         |   64 ++++++++++++++++++++++++++++++++++++---------------
 include/linux/fs.h |    4 ++-
 kernel/sysctl.c    |    4 +-
 4 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ab38fef..58a95b7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -723,7 +723,7 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 	wb->last_old_flush = jiffies;
 	nr_pages = global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+			get_nr_dirty_inodes();
 
 	if (nr_pages) {
 		struct wb_writeback_work work = {
@@ -1090,8 +1090,7 @@ void writeback_inodes_sb(struct super_block *sb)
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	work.nr_pages = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+	work.nr_pages = nr_dirty + nr_unstable + get_nr_dirty_inodes();
 
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..b3b6a4b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -103,8 +103,41 @@ static DECLARE_RWSEM(iprune_sem);
  */
 struct inodes_stat_t inodes_stat;
 
+static struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
+static struct percpu_counter nr_inodes_unused ...
From: Dave Chinner
Date: Wednesday, October 20, 2010 - 10:04 pm

Folks,

I just pushed a new version of this patchset which is pretty much a rebase on
2.6.36 out. I'm not going to post all the patches as not much changed - mostly
comments were changed. The changelog for the update is:

Version 7:
- rebase on 2.6.36
- iref() to inode->i_ref++ conversion in fs/nfs/write.c
- removed stray inode hash removal call from patches it didn't
  belong in.
- cleaned up another stale remove_inode_hash comment.
- cleaned up more comments as reported by Christian Stroetmann
  <stroetmann@ontolinux.com>.

--

The following changes since commit f6f94e2ab1b33f0082ac22d71f66385a60d8157f:

  Linux 2.6.36 (2010-10-20 13:30:22 -0700)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git inode-scale

Christoph Hellwig (4):
      fs: Stop abusing find_inode_fast in iunique
      fs: move i_ref increments into find_inode/find_inode_fast
      fs: remove inode_add_to_list/__inode_add_to_list
      fs: do not assign default i_ino in new_inode

Dave Chinner (13):
      fs: switch bdev inode bdi's correctly
      fs: Convert nr_inodes and nr_unused to per-cpu counters
      fs: Clean up inode reference counting
      exofs: use iput() for inode reference count decrements
      fs: rework icount to be a locked variable
      fs: Factor inode hash operations into functions
      fs: Introduce per-bucket inode hash locks
      fs: add a per-superblock lock for the inode list
      fs: split locking of inode writeback and LRU lists
      fs: Protect inode->i_state with the inode->i_lock
      fs: protect wake_up_inode with inode->i_lock
      fs: icache remove inode_lock
      fs: Reduce inode I_FREEING and factor inode disposal

Eric Dumazet (1):
      fs: introduce a per-cpu last_ino allocator

Nick Piggin (3):
      kernel: add bl_list
      fs: Implement lazy LRU updates for inodes
      fs: inode split IO and LRU lists

 Documentation/filesystems/Locking        |    2 +-
 ...
From: Nick Piggin
Date: Thursday, October 21, 2010 - 6:20 am

It seems we are at an impasse.

It doesn't help that you are ignoring the most important concerns
I've been raising with these patches. The locking model and the
patch split up. I'd really like not to get deadlocked on this (haha),
so please let's try to debate points. I've tried to reply to each
point others have questioned me about, whether I agree or not I've
given reasons.

So, you know my objections to this approach already... I've got an
update on my patchset coming, so I'd like to get some discussion
going. I've cut out some of the stuff from mine so we don't get
bogged down in boring things like per-zone locking or changing of
the hash table data structure.

Thanks,
Nick

--

From: Dave Chinner
Date: Thursday, October 21, 2010 - 4:52 pm

No point appealing to me, Nick, it's not me that you have to
convince. As I've said from the start, all I really care about is
getting the code into shape that is acceptable to the reviewers.

As such, I don't think there is anything _new_ to discuss - I'd
simply be rehashing the same points I've already made to you over
the past couple of weeks. That has done nothing to change you mind
about anything, so it strikes me as a continuing exercise in
futility.

We have different ways of acheiving the same thing which have their
pros and cons, and I think that the reviewers of the patch sets are
aware of this. The reviewers are the people that will make the
decision on the best way to proceed, and I'll follow their lead
exactly as I have been since I started this process.

So, if you want to continue arguing that your locking model is the
One True Way, you need to convince the reviewers of the fact, not
me. I'm just going to continue to address the concerns of the
reviewers and fix bugs that reported until a decision is made one
way or the other....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Thursday, October 21, 2010 - 5:45 pm

"The reviewers"? I _am_ a reviewer of your code, and I've made
some points, and you've ignored them.

When you've reviewed my code and had comments, I've responded to
them every time. I might not have agreed, but I tried to give you

No you didn't make these points to me over the past couple of weeks.
Specifically, do you agree or disagree about these points:
- introducing new concurrency situations from not having a single lock
  for an inode's icache state is a negative?
- if yes, then what aspect of your locking model justifies and outweighs
  it?
- before the inode_lock is lifted, locking changes should be as simple
  and verifiable as absolutely possible, so that bisection has less
  chance of hitting the inode_lock wall?
- further locking changes making the locking less regular and more
  complex should be done in small steps, after inode_lock is lifted

And I have kept saying I would welcome your idea to reduce i_lock width
in a small incremental patch. I still haven't figured out quite what
is so important that can't be achieved in simpler ways (like rcu, or


Yes of course I want to continue to argue that, because that is what
my opinion is. What I need from you is to know why you believe yours
is better, that I might concede I was wrong; point out where you are
wrong; or show that one can be extended to have the positive aspects
of another etc.


--

From: Al Viro
Date: Thursday, October 21, 2010 - 7:20 pm

No, it's not a small incremental.  It's your locking order being wrong;
the natural one is
	[hash, wb, sb] > ->i_lock > [lru]
and that's one hell of a difference compared to what you are doing.

Look:
	* iput_final() should happen under ->i_lock
	* if it leaves the inode alive, that's it; we can put it on LRU list
since lru lock nests inside ->i_lock
	* if it decides to kill the inode, it sets I_FREEING or I_WILL_FREE
before dropping ->i_lock.  Once that's done, the inode is ours and nobody
will pick it through the lists.  We can release ->i_lock and then do what's
needed.  Safely.
	* accesses of ->i_state are under ->i_lock, including the switchover
from I_WILL_FREE to I_FREEING
	* walkers of the sb, wb and hash lists can grab ->i_lock at will;
it nests inside their locks.
	* prune_icache() grabs lru lock, then trylocks ->i_lock on the
first element.  If trylock fails, we just give inode another spin through
the list by moving it to the tail; if it doesn't, we are holding ->i_lock
and can proceed safely.

What you seem to miss is that there are very few places accessing inode through
the lists (i.e. via pointers that do not contribute to refcount) and absolute
majority already checks for I_FREEING/I_WILL_FREE, refusing to pick such
inodes.  It's not an accidental subtle property of the code, it's bloody
fundamental.

As I've said, I've no religious problems with trylocks; we *do* need them for
prune_icache() to get a sane locking scheme.  But the way you put ->i_lock on
the top of hierarchy is simply wrong.
--

From: Nick Piggin
Date: Thursday, October 21, 2010 - 7:41 pm

Importantly, to be able to manipulate the icache state in any number of
steps, under a consistent lock. Exactly like we have with inode_lock
today.

Stepping away from that, adding code to handle new concurrencies, before
inode_lock is able to be lifted is just wrong.

The locking in my lock break patch is ugly and wrong, yes. But it is
always an intermediate step. I want to argue that with RCU inode work
*anyway*, there is not much point to reducing the strength of the
i_lock property because locking can be cleaned up nicely and still
keep i_lock ~= inode_lock (for a single inode).
--

From: Nick Piggin
Date: Thursday, October 21, 2010 - 7:48 pm

The other thing is that with RCU, the idea of locking an object in
the data structure with a per object lock actually *is* much more
natural. It's hard to do it properly with just a big data structure
lock.

If I want to take a reference to an inode from a data structre, how
to do it with RCU?

rcu_read_lock()
list_for_each(inode) {
  spin_lock(&big_lock); /* oops, might as well not even use RCU then */
  if (!unhashed) {
    iget();
  }
}


--

From: Al Viro
Date: Thursday, October 21, 2010 - 8:12 pm

Huh?  Why the hell does it have to be a big lock?  You grab ->i_lock,
then look at the damn thing.  You also grab it on eviction from the
list - *inside* the lock used for serializing the write access to
your RCU list.
--

From: Nick Piggin
Date: Thursday, October 21, 2010 - 9:48 pm

That sucks, it requires more acquiring and dropping of i_lock and
it hits single threaded performance. I looked at that.

But it also loses the i_lock = inode_lock property.

--

From: Al Viro
Date: Thursday, October 21, 2010 - 8:07 pm

Look at the code.  You are overengineering it.  We do *not* need a framework
for messing with these lists in arbitrary ways.  Where would we need to
do that to an inode we don't hold a reference to or had placed I_FREEING
on and would need i_lock held by caller?  Even assuming that we need to
keep [present in hash, present on sb list] in sync (which I seriously doubt),

Code outside of fs/inode.c and fs/fs-writeback.c generally has no business
looking at the full icache state, period.
--

From: Nick Piggin
Date: Thursday, October 21, 2010 - 9:46 pm

Look, my point is that I believe it is an easier step to get from
inode_lock to i_lock, and then from there we can go wild.

What is your criteria for a particular lock ordering being "natural"
versus not? In almost all cases we have

[stuff with data structure] -> [stuff with inode]
and
[stuff with inode] -> [stuff with data structure]

So neither is inherently more natural, I think. So it comes down to
how the code fits together and works.

The difficulty with inode_lock breaking is not the data structures.
We know how to lock and modify them. The hardest part is verifying
that a particular inode has no new, unhandled concurrency introduced
to it (eg. the particular concurrency you pointed out in Dave's patch
just now). Agree?

So in that case, I think it is much more natural to be able to take
an inode and with i_lock, cover it from all icache state concurrency.
I object to it being called over engineering -- it's actually just a
big hammer on the inode, compared with fiddling with more complex

I'm not saying there is. Most of the problems would be between a
particular inode state versus its membership on one of the lists.
However, with my patches, I *don't care* if there is an issue there
or not. It simply doesn't matter because it has the same protection
as inode_lock at that point.

If you want to micro optimise it, change lock orders around, and
open more concurrency, that is easily possible to do after my patches
lift inode_lock. If you do all the changes *before* inode_lock removal,
then it's not bisectable at all.

--

From: Nick Piggin
Date: Thursday, October 21, 2010 - 10:01 pm

And yes, being a big hammer, it is actually ugly and clunky for
the first pass.

The intention is always that we can start steps to streamline it
now. I had been looking at switching lock orders around, reducing
i_lock coverage etc, but I found that with RCU, things got a lot
cleaner without reducing i_lock coverage. With RCU the important
part of the locking shifts back, from the read side, to the write
side. Not surprisingly, this made my lock more natural, wheras it
does nothing for a lock ordering which is the other way around.

And I think you give too little credit to i_lock being used to
protect all i_state. Sure it's not strictly needed, and we could
start breaking bits and pieces. But it works really nicely, and
is maintainable and easier to convince yourself of correctness.

Have an inode? Want to do something to it? Take i_lock. You're
done. We don't _need_ to ever think about concurrent modifications.
Really, this is the "big dumb" approach IMO. Breaking things out
finer than per-inode basis is premature optimisation at this point.
Note that my series never precluded such incremental changes, in
fact it makes them easier.


--

Previous thread: [PATCH 02/21] kernel: add bl_list by Dave Chinner on Wednesday, October 20, 2010 - 5:49 pm. (1 message)

Next thread: [PATCH v2] power: bq24617: Adding initial charger support by rklein on Wednesday, October 20, 2010 - 5:56 pm. (25 messages)