Re: [2.6.26 PATCH, RESEND]: fs_stack/eCryptfs: fsstack_copy_* updates

Previous thread: Re: -mm merge plans for 2.6.26 (memcgroup) by Balbir Singh on Sunday, April 20, 2008 - 11:47 pm. (10 messages)

Next thread: OOM killer doesn't kill the right task.... by David Chinner on Monday, April 21, 2008 - 12:01 am. (4 messages)
From: Erez Zadok
Date: Sunday, April 20, 2008 - 11:50 pm

1. remove the 3rd arg to fsstack_copy_attr_all.  There are no users for it:
   ecryptfs never used the 3rd arg; unionfs stopped using it a long time
   ago.  Halcrow ok'ed this patch some time ago.

2. add necessary locking for 32-bit smp systems in fsstack_copy_inode_size
   (courtesy Hugh Dickins).

3. minor commenting style changes, and addition of copyrights which were
   missing.

Acked-by: Mike Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>

diff --git a/fs/ecryptfs/dentry.c b/fs/ecryptfs/dentry.c
index 5e59658..4621f89 100644
--- a/fs/ecryptfs/dentry.c
+++ b/fs/ecryptfs/dentry.c
@@ -62,7 +62,7 @@ static int ecryptfs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
 		struct inode *lower_inode =
 			ecryptfs_inode_to_lower(dentry->d_inode);
 
-		fsstack_copy_attr_all(dentry->d_inode, lower_inode, NULL);
+		fsstack_copy_attr_all(dentry->d_inode, lower_inode);
 	}
 out:
 	return rc;
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e238611..687819d 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -575,9 +575,9 @@ ecryptfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 			lower_new_dir_dentry->d_inode, lower_new_dentry);
 	if (rc)
 		goto out_lock;
-	fsstack_copy_attr_all(new_dir, lower_new_dir_dentry->d_inode, NULL);
+	fsstack_copy_attr_all(new_dir, lower_new_dir_dentry->d_inode);
 	if (new_dir != old_dir)
-		fsstack_copy_attr_all(old_dir, lower_old_dir_dentry->d_inode, NULL);
+		fsstack_copy_attr_all(old_dir, lower_old_dir_dentry->d_inode);
 out_lock:
 	unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
 	dput(lower_new_dentry->d_parent);
@@ -910,7 +910,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 
 	rc = notify_change(lower_dentry, ia);
 out:
-	fsstack_copy_attr_all(inode, lower_inode, NULL);
+	fsstack_copy_attr_all(inode, lower_inode);
 	return rc;
 }
 
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index d25ac95..7eac062 100644
--- ...
From: Andrew Morton
Date: Wednesday, April 30, 2008 - 10:17 am

On Mon, 21 Apr 2008 02:50:42 -0400

The defined(CONFIG_SMP) is wrong.  The spinlock is here to protect
dst->i_blocks, but it can be corrupted via preemption on uniprocessor as
well.  So a plain old

#if BITS_PER_LONG == 32


However, what about src->i_blocks?  It is protected by src->i_lock.  The
code as you have it here could read transient values.

Furthermore, i_lock is defined as an innermost lock, for protection of
inode internals.  But here we're proposing "taking" inode->i_size_seqcount
inside i_lock.  Not necessarily a problem, but it broke the old rule.

We're also doing a read_seqlock of a _different_ inode inside this inode's
i_lock.  Again, this is not necessarily a problem (but it might be!) but it
adds complexity and needs thought.


Can we avoid having to think?

void fsstack_copy_inode_size(struct inode *dst, const struct inode *src)
{
	blkcnt_t i_blocks;
	loff_t i_size;

	i_size = i_size_read(src);
	spin_lock_32bit(&src->i_lock);
	i_blocks = src->i_blocks;
	spin_unlock_32bit(&src->i_lock);

	i_size_write(dst, i_size);
	spin_lock_32bit(&dst->i_lock)
	dst->i_blocks = i_blocks;
	spin_unlock_32bit(&dst->i_lock)
}


--

From: Erez Zadok
Date: Wednesday, April 30, 2008 - 2:09 pm

Thanks.  I can't say that I'm an expert in these SMP issues.  But I'll run
your rewritten function through my 32 and 64-bit SMP and non-SMP systems,
and see how it behaves.

Erez.
--

From: Andrew Morton
Date: Wednesday, April 30, 2008 - 2:25 pm

On Wed, 30 Apr 2008 17:09:15 -0400

The obvious risk here is that there's no synchronisation between the
copying of i_size and i_blocks.  If that's a problem, I _expect_ that
i_mutex wold give pretty good coverage (but insufficient for
mmap-write-over-a-hole, I guess).

And someone needs to write spin_lock_32bit() ;)
--

From: Erez Zadok
Date: Thursday, May 1, 2008 - 10:18 am

In message <20080430101704.9cbd6384.akpm@linux-foundation.org>, Andrew Morton writes:

I can't find spin_[um]lock_32bit anywhere (checkd latest mmotm and linus's
tree).  I therefore assume this was just your way of saying I should:

#if BITS_PER_LONG == 32
	spin_unlock(&dst->i_lock);
#endif

Erez.
--

From: Andrew Morton
Date: Thursday, May 1, 2008 - 12:21 pm

On Thu, 1 May 2008 13:18:09 -0400

Nope, it was my way of suggesting that you implement it ;)
include/linux/spinlock.h would be a good place.

--

From: Erez Zadok
Date: Thursday, May 1, 2008 - 4:44 pm

Sure, I can implement it and submit a patch.  But I've got two questions
first:

1. i_size_read has a tri-state behaviour:

#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
	// play with seqcount
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
	// wrap in preempt_disable/enable
#else
	// just return value
#endif

Shouldn't the same be done for i_blocks, with matching i_blocks_read(),
i_blocks_write(), and adding an inode->i_blocks_seqcount field?


2. I've rewritten your suggested code a bit to reduce stack use.  Modulo
   having 32-bit spin_lock/unlock variants, do you see any problem with this
   code below?  My testing of it so far on 32/64-bit SMP/UMP have all
   passed.

void fsstack_copy_inode_size(struct inode *dst, struct inode *src)
{
#if BITS_PER_LONG == 32
	blkcnt_t i_blocks;

	spin_lock(&src->i_lock);
	i_blocks = src->i_blocks;
	spin_unlock(&src->i_lock);
	spin_lock(&dst->i_lock);
	dst->i_blocks = i_blocks;
	spin_unlock(&dst->i_lock);
#else
	dst->i_blocks = src->i_blocks;
#endif
	i_size_write(dst, i_size_read(src));
}

Thanks,
Erez.
--

From: Andrew Morton
Date: Thursday, May 1, 2008 - 5:08 pm

On Thu, 1 May 2008 19:44:18 -0400

I guess so, yes.  Maybe there are other things too.  We just haven't cared
enough to do all this fuss for i_blocks.  Even i_size was a bit marginal.

i_size is much more important because glitches in there can result in
incorrect data being returned from read() and things like that.  i_blocks

That looks sane, as long as we don't care about i_size-vs-i_blocks
coherency.

However I expect that approximately zero of the sites which modify i_blocks
are taking i_lock to do so.

otoh, many of them _will_ accidentally have i_mutex coverage.  On the
write() path at least.  What happened to that idea?
--

From: Erez Zadok
Date: Thursday, May 1, 2008 - 10:58 pm

If i_blocks is indeed less important than i_size, then we can live with some
incoherency b/t i_size and i_blocks, for now.  Nevertheless, I propose
adding this to linux/fs.h:

static inline blkcnt_t i_blocks_read(const struct inode *inode)
{
#if BITS_PER_LONG == 32
	blkcnt_t i_blocks;
	spin_lock(&src->i_lock);
	i_blocks = src->i_blocks;
	spin_unlock(&src->i_lock);
	return i_blocks;
#else
	return src->i_blocks;
#endif
}

and a matching i_blocks_write function.  We can then gradually convert those
"unsafe" users of i_blocks to use the new i_blocks_read/write helpers.

The nice thing about these two helpers is fsstack_copy_inode_size becomes a
lot cleaner and more elegant:

void fsstack_copy_inode_size(struct inode *dst, struct inode *src)
{
	i_blocks_write(dst, i_blocks_read(src));
	i_size_write(dst, i_size_read(src));
}

And, if we ever wanted to ensure coherency b/t i_blocks and i_size, we'll
need to create helpers that merge the functionality of i_size_read/write and
i_blocks_read/write.

What do you think?

Erez.
--

From: Andrew Morton
Date: Thursday, May 1, 2008 - 11:11 pm

We actually only need the spinlocked version if blkcnt_t is 64-bit.

So #if BITS_PER_LONG == 32 && defined(CONFIG_LSF), plus explanatory comment.


You'll also need i_blocks_mod() for things like



--

From: hooanon05
Date: Sunday, May 11, 2008 - 5:44 pm

While this function looks simple, I think it is more generic and better
that the caller of fsstack_copy_inode_size() holds i_lock instead of
lock it in the callee.


Junjiro Okajima
--

From: Hugh Dickins
Date: Friday, May 2, 2008 - 6:17 am

Sorry, I keep putting off a reply until I can append a patch,
but still won't get to do so today.  Comments below.


Some confusion here - which would have been avoided if I'd been
so good as to put much-needed comment in the code of the original
patch - sorry.  There's two separate issues, i_size and i_blocks.

i_size is what I was addressing by adding the #ifdef'ed spin_lock,
i_blocks (in the CONFIG_LSF case more precisely than the 32-bit case)
is an arguable problem that nobody noticed until you did so now.
Here's my original patch comment to Erez:

+ LTP's iogen01 doio tests hang nicely on 32-bit SMP when /tmp is a unionfs
+ mount of a tmpfs.  See the comment on i_size_write in linux/fs.h: it needs
+ to be locked, otherwise i_size_read can spin forever waiting for a lost
+ seqcount update.
+ 
+ Most filesystems are already holding i_mutex for this, but unionfs calls
+ fsstack_copy_inode_size from many places, not necessarily holding i_mutex.
+ Use the low-level i_lock within fsstack_copy_inode_size when 32-bit SMP.

So far as that goes, the CONFIG_SMP is correct, because i_size_write
does preempt_disable/preempt_enable in the 32-bit PREEMPT but not SMP
case.  The ifdef chosen reflects that in i_size_write, but I've no
objection to removing the CONFIG_SMP part of it if it ends up more
palatable that way - I imagine the cost of a double preempt_disable
where a single would do is just too minuscule to worry about.

But the new use of i_lock, together with the way I only unlock after
dealing with i_blocks (I thought, if we have to have preemption off,
let's do i_blocks too while it's still off), led you to think that I
was trying to keep the two halves of a CONFIG_LSF i_blocks together -
whereas I'd never even glanced at the typedef of i_blocks.

Personally I don't think it's worth worrying about i_blocks coherency
here.  stat's generic_fillattr doesn't worry about it.  unionfs (ah,
but here we're in general stackable filesystem territory) won't very
often be handling ...
Previous thread: Re: -mm merge plans for 2.6.26 (memcgroup) by Balbir Singh on Sunday, April 20, 2008 - 11:47 pm. (10 messages)

Next thread: OOM killer doesn't kill the right task.... by David Chinner on Monday, April 21, 2008 - 12:01 am. (4 messages)