[PATCH 36/39] ocfs2: Limit inode allocation to 32bits.

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

Hi,

	The following patches comprise the bulk of Ocfs2 updates for the
2.6.28 merge window. They can roughly be broken up into 4 sets which add
incremental features to Ocfs2. The patches are presented as they come in
git.


EA Support

The largest set adds support for extended attributes in Ocfs2. Extended
attributes are stored both within the inode block, and externally, when their
numbers grow. Individual attributes can be arbitrarily sized. Smaller ones
have their data stored inline. Larger attributes grow out to a btree. In
theory the btrees have similar limits to inode data. In practice though, the
VFS limits EA sizes to 64K.

When inode space for attributes runs low, new ones are created in an external
disk block. When the block fills up, external attributes are moved to an
indexed btree. The btree can store many thousands of attributes, if needed.

The patches leading up to EA support further abstracted portions of the Ocfs2
btree code. Ultimately, this means we can "add" a btree to any Ocfs2 structure
by embedding a header, and providing the proper callbacks to manipulate
certain key fields. The xattr code makes use of this, as will future Ocfs2
features.

Joel made some further improvements to our 'generic' (for Ocfs2 at least)
btree support which completed the interface by cleaning things up and
providing for proper callbacks in a static operations structure. Those patches
follow the xattr series as they were developed afterwards.


JBD2 Support

Ocfs2 can now use JBD2. Amongst other benefits, this allows us to support
large block devices with more than 32 bits worth of block numbers. As a part
of these patches, an 'inode64' mount option is added which toggles creation
of inodes whose inode number requires more than 32 bits to be adequately
described.
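
The 32-bit boundary behind 'inode64' can be sketched as follows. This is a
minimal illustration rather than the Ocfs2 code: it assumes an inode number
is a 64-bit value, and every sketch_ name as well as the allow/deny policy
is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the inode number can be described in 32 bits. */
static bool sketch_ino_fits_32bits(uint64_t ino)
{
	return ino <= UINT32_MAX;
}

/*
 * Hypothetical policy: without inode64 only numbers that fit in 32 bits
 * are handed out; with inode64 any 64-bit number is acceptable.
 */
static bool sketch_ino_allowed(uint64_t ino, bool inode64)
{
	return inode64 || sketch_ino_fits_32bits(ino);
}

/* Narrow to 32 bits; the caller must already have checked the range. */
static uint32_t sketch_ino_to_u32(uint64_t ino)
{
	assert(sketch_ino_fits_32bits(ino));
	return (uint32_t)ino;
}
```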

JBD2 support in Ocfs2 is compiled in by default. However, since journaling is
so central to the operation of a file system, we kept our 'legacy' JBD
support. We did this to provide a fallback for any users who ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

This is actually pretty easy since fs/dlm already handles the bulk of the
work. The Ocfs2 userspace cluster stack module already uses fs/dlm as the
underlying lock manager, so I only had to add the right calls.

Cluster-aware POSIX locks ("plocks") can be turned off by the same means as
UNIX locks - mount with 'noflocks', or create a local-only Ocfs2 volume.
Internally, the file system uses two sets of file_operations, depending on
whether cluster-aware plocks are required. This turns out to be easier than
implementing local-only versions of ->lock.
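
The two-table approach can be sketched roughly as below. This is an
illustration under assumptions, not the patch itself: the sketch_ types and
functions are hypothetical, and the selection criteria (a local-only flag and
whether the cluster stack supports plocks) are a guess at the kind of test
the file system might apply when it sets up an inode.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct file_operations. */
struct sketch_file_ops {
	int (*lock)(int cmd);	/* NULL: fall back to local POSIX locking */
	int (*flock)(int cmd);
};

static int sketch_cluster_lock(int cmd) { return cmd; }
static int sketch_flock(int cmd) { return cmd; }

/* Identical tables except for ->lock; they must be kept in sync. */
static const struct sketch_file_ops sketch_fops = {
	.lock	= sketch_cluster_lock,
	.flock	= sketch_flock,
};

static const struct sketch_file_ops sketch_fops_no_plocks = {
	.lock	= NULL,
	.flock	= sketch_flock,
};

/* Pick the table once, instead of branching inside ->lock. */
static const struct sketch_file_ops *
sketch_pick_fops(int local_only, int stack_supports_plocks)
{
	if (local_only || !stack_supports_plocks)
		return &sketch_fops_no_plocks;
	return &sketch_fops;
}
```

Choosing the table up front means no ->lock implementation needs a
local-only code path of its own.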

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/file.c       |   51 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/file.h       |    2 +
 fs/ocfs2/inode.c      |   15 ++++++++++++-
 fs/ocfs2/locks.c      |   15 ++++++++++++++
 fs/ocfs2/locks.h      |    1 +
 fs/ocfs2/stack_user.c |   33 +++++++++++++++++++++++++++++++
 fs/ocfs2/stackglue.c  |   20 +++++++++++++++++++
 fs/ocfs2/stackglue.h  |   19 ++++++++++++++++++
 8 files changed, 154 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index ec2ed15..60232b1 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2236,6 +2236,10 @@ const struct inode_operations ocfs2_special_file_iops = {
 	.permission	= ocfs2_permission,
 };
 
+/*
+ * Other than ->lock, keep ocfs2_fops and ocfs2_dops in sync with
+ * ocfs2_fops_no_plocks and ocfs2_dops_no_plocks!
+ */
 const struct file_operations ocfs2_fops = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
@@ -2250,6 +2254,7 @@ const struct file_operations ocfs2_fops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ocfs2_compat_ioctl,
 #endif
+	.lock		= ocfs2_lock,
 	.flock		= ocfs2_flock,
 	.splice_read	= ocfs2_file_splice_read,
 	.splice_write	= ocfs2_file_splice_write,
@@ -2266,5 +2271,51 @@ const struct file_operations ocfs2_dops = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl   = ocfs2_compat_ioctl,
 #endif
+	.lock		= ocfs2_lock,
+	.flock		= ocfs2_flock,
+};
+
+/*
+ ...
From: Andrew Morton
Date: Wednesday, October 1, 2008 - 11:11 pm

It's pointless doing !! on something which is already 0 or 1.
--

From: Mark Fasheh
Date: Tuesday, October 7, 2008 - 1:09 pm

Sure - the following patch is now on the 'merge_window' branch of ocfs2.git.

Also, thanks for all the review you did on these.
	--Mark

--
Mark Fasheh

From: Mark Fasheh <mfasheh@suse.com>

ocfs2: Remove pointless !!

ocfs2_stack_supports_plocks() doesn't need this to properly return a zero or
one value.
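
For reference, a small standalone sketch of why the !! is redundant: in C the
&& operator already evaluates to an int that is either 0 or 1. The structures
below are simplified, hypothetical stand-ins for the stack glue types.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_ops { int (*plock)(void); };
struct sketch_stack { const struct sketch_ops *sp_ops; };

static int sketch_plock(void) { return 0; }

static int sketch_supports_plocks(const struct sketch_stack *active)
{
	/* A stack, if present, is assumed to always carry an ops table. */
	assert(!active || active->sp_ops);

	/* && yields exactly 0 or 1, so wrapping it in !! changes nothing. */
	return active && active->sp_ops->plock;
}
```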

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/stackglue.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/stackglue.c b/fs/ocfs2/stackglue.c
index 7150f5d..68b668b 100644
--- a/fs/ocfs2/stackglue.c
+++ b/fs/ocfs2/stackglue.c
@@ -290,7 +290,7 @@ EXPORT_SYMBOL_GPL(ocfs2_dlm_dump_lksb);
 
 int ocfs2_stack_supports_plocks(void)
 {
-	return !!(active_stack && active_stack->sp_ops->plock);
+	return active_stack && active_stack->sp_ops->plock;
 }
 EXPORT_SYMBOL_GPL(ocfs2_stack_supports_plocks);
 
-- 
1.5.4.1

--

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

Do this instead of tracking absolute local alloc size. This avoids
needless re-calculation of bits from bytes in localalloc.c. Additionally,
the value is now in a more natural unit for internal file system bitmap
work.
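
The conversion that no longer has to be repeated looks roughly like this. It
mirrors the arithmetic of the helper removed in the diff below (one bitmap
bit per cluster, window sized by the megabyte); the function name is
hypothetical and assert() stands in for BUG_ON().

```c
#include <assert.h>

/*
 * Convert a window size in megabytes to a count of cluster bits.
 * 1MB is 2^20 bytes, so it holds 2^(20 - clustersize_bits) clusters.
 */
static unsigned int sketch_megabytes_to_bits(unsigned int megabytes,
					     unsigned int clustersize_bits)
{
	assert(clustersize_bits <= 20);

	return megabytes << (20 - clustersize_bits);
}
```

With 4K clusters (clustersize_bits of 12), an 8MB window covers 8 << 8, or
2048 bits. Storing that value once at mount time removes the shift from
every caller.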

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/localalloc.c |   34 ++++++++++++----------------------
 fs/ocfs2/ocfs2.h      |   10 +++++++++-
 fs/ocfs2/super.c      |    8 +++++---
 3 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 28e492e..b05ce66 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -47,8 +47,6 @@
 
 #define OCFS2_LOCAL_ALLOC(dinode)	(&((dinode)->id2.i_lab))
 
-static inline int ocfs2_local_alloc_window_bits(struct ocfs2_super *osb);
-
 static u32 ocfs2_local_alloc_count_bits(struct ocfs2_dinode *alloc);
 
 static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
@@ -75,21 +73,13 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super *osb,
 static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb,
 					  struct inode *local_alloc_inode);
 
-static inline int ocfs2_local_alloc_window_bits(struct ocfs2_super *osb)
-{
-	BUG_ON(osb->s_clustersize_bits > 20);
-
-	/* Size local alloc windows by the megabyte */
-	return osb->local_alloc_size << (20 - osb->s_clustersize_bits);
-}
-
 /*
  * Tell us whether a given allocation should use the local alloc
  * file. Otherwise, it has to go to the main bitmap.
  */
 int ocfs2_alloc_should_use_local(struct ocfs2_super *osb, u64 bits)
 {
-	int la_bits = ocfs2_local_alloc_window_bits(osb);
+	int la_bits = osb->local_alloc_bits;
 	int ret = 0;
 
 	if (osb->local_alloc_state != OCFS2_LA_ENABLED)
@@ -120,14 +110,16 @@ int ocfs2_load_local_alloc(struct ocfs2_super *osb)
 
 	mlog_entry_void();
 
-	if (osb->local_alloc_size == 0)
+	if (osb->local_alloc_bits == 0)
 		goto bail;
 
-	if (ocfs2_local_alloc_window_bits(osb) >= osb->bitmap_cpg) {
+	if (osb->local_alloc_bits >= ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

Factor out the non-inode specifics of ocfs2_do_extend_allocation() into a more generic
function, ocfs2_do_cluster_allocation(). ocfs2_do_extend_allocation calls
ocfs2_do_cluster_allocation() now, but the latter can be used for other
btree types as well.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |  110 +++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/alloc.h |   17 +++++++
 fs/ocfs2/aops.c  |    8 ++--
 fs/ocfs2/dir.c   |    6 +-
 fs/ocfs2/file.c  |  136 +++++++++++-------------------------------------------
 fs/ocfs2/file.h  |   26 ++++------
 fs/ocfs2/namei.c |    8 ++--
 7 files changed, 176 insertions(+), 135 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 90cefc5..1332309 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -4302,6 +4302,116 @@ bail:
 	return status;
 }
 
+/*
+ * Allcate and add clusters into the extent b-tree.
+ * The new clusters(clusters_to_add) will be inserted at logical_offset.
+ * The extent b-tree's root is root_el and it should be in root_bh, and
+ * it is not limited to the file storage. Any extent tree can use this
+ * function if it implements the proper ocfs2_extent_tree.
+ */
+int ocfs2_add_clusters_in_btree(struct ocfs2_super *osb,
+				struct inode *inode,
+				u32 *logical_offset,
+				u32 clusters_to_add,
+				int mark_unwritten,
+				struct buffer_head *root_bh,
+				struct ocfs2_extent_list *root_el,
+				handle_t *handle,
+				struct ocfs2_alloc_context *data_ac,
+				struct ocfs2_alloc_context *meta_ac,
+				enum ocfs2_alloc_restarted *reason_ret,
+				enum ocfs2_extent_tree_type type)
+{
+	int status = 0;
+	int free_extents;
+	enum ocfs2_alloc_restarted reason = RESTART_NONE;
+	u32 bit_off, num_bits;
+	u64 block;
+	u8 flags = 0;
+
+	BUG_ON(!clusters_to_add);
+
+	if (mark_unwritten)
+		flags = OCFS2_EXT_UNWRITTEN;
+
+	free_extents = ocfs2_num_free_extents(osb, inode, root_bh, ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

Ocfs2 uses a very flexible structure for storing extended attributes on
disk. A small number of attributes are stored directly in the inode block - up
to 256 bytes worth. If that fills up, attributes are also stored in an
external block, linked to from the inode block. That block can in turn
expand to a btree, capable of storing large numbers of attributes.

Individual attribute values are stored inline if they're small enough
(currently about 80 bytes, this can be changed though), and otherwise are
expanded to a btree. The theoretical limit to the size of an individual
attribute is about the same as an inode, though the kernel's upper bound on
the size of an attribute's data is far smaller.
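
The placement rules above can be sketched as two small decisions. This is
only an illustration: the thresholds are the approximate figures quoted in
this thread (about 80 bytes inline, a 64K VFS cap), not on-disk constants,
and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative limits taken from the description, not from ocfs2_fs.h. */
#define SKETCH_INLINE_VALUE_MAX	80u		/* "about 80 bytes" */
#define SKETCH_VFS_VALUE_MAX	(64u * 1024u)	/* VFS cap on EA size */

enum sketch_value_store { SKETCH_VALUE_INLINE, SKETCH_VALUE_BTREE };

/* Where does one attribute's value live? */
static enum sketch_value_store sketch_place_value(size_t value_len)
{
	assert(value_len <= SKETCH_VFS_VALUE_MAX);

	return value_len <= SKETCH_INLINE_VALUE_MAX ? SKETCH_VALUE_INLINE
						    : SKETCH_VALUE_BTREE;
}

enum sketch_entry_store {
	SKETCH_IN_INODE,
	SKETCH_IN_BLOCK,
	SKETCH_IN_INDEXED_TREE,
};

/* Entries spill from the inode, to an external block, to an indexed btree. */
static enum sketch_entry_store sketch_place_entry(int inode_full,
						  int block_full)
{
	if (!inode_full)
		return SKETCH_IN_INODE;
	if (!block_full)
		return SKETCH_IN_BLOCK;
	return SKETCH_IN_INDEXED_TREE;
}
```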

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/ocfs2_fs.h |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 118 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 4f61985..1b46505 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -64,6 +64,7 @@
 #define OCFS2_INODE_SIGNATURE		"INODE01"
 #define OCFS2_EXTENT_BLOCK_SIGNATURE	"EXBLK01"
 #define OCFS2_GROUP_DESC_SIGNATURE      "GROUP01"
+#define OCFS2_XATTR_BLOCK_SIGNATURE	"XATTR01"
 
 /* Compatibility flags */
 #define OCFS2_HAS_COMPAT_FEATURE(sb,mask)			\
@@ -715,6 +716,123 @@ struct ocfs2_group_desc
 /*40*/	__u8    bg_bitmap[0];
 };
 
+/*
+ * On disk extended attribute structure for OCFS2.
+ */
+
+/*
+ * ocfs2_xattr_entry indicates one extend attribute.
+ *
+ * Note that it can be stored in inode, one block or one xattr bucket.
+ */
+struct ocfs2_xattr_entry {
+	__le32	xe_name_hash;    /* hash value of xattr prefix+suffix. */
+	__le16	xe_name_offset;  /* byte offset from the 1st etnry in the local
+				    local xattr storage(inode, xattr block or
+				    xattr bucket). */
+	__u8	xe_name_len;	 /* xattr name len, does't include prefix. */
+	__u8	xe_type;      ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

The old uptodate code only handles removing one buffer_head from an
ocfs2 inode's buffer cache. With xattr clusters, we may need to remove
multiple buffer_heads at a time.
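
The shape of the change can be sketched with a toy cache: a per-block removal
helper, plus a range helper that walks every block in a run of clusters. The
array-based cache and all sketch_ names are hypothetical; the real cache is
more elaborate.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CACHE_MAX 16

/* Toy metadata cache: an unordered array of cached block numbers. */
struct sketch_cache {
	uint64_t blocks[SKETCH_CACHE_MAX];
	unsigned int count;
};

/* Remove one block if it is cached; silently ignore it otherwise. */
static void sketch_remove_block(struct sketch_cache *ci, uint64_t block)
{
	assert(ci->count <= SKETCH_CACHE_MAX);

	for (unsigned int i = 0; i < ci->count; i++) {
		if (ci->blocks[i] == block) {
			/* Fill the hole with the last cached entry. */
			ci->blocks[i] = ci->blocks[--ci->count];
			return;
		}
	}
}

/* Remove every block covered by c_len clusters starting at block. */
static void sketch_remove_clusters(struct sketch_cache *ci, uint64_t block,
				   unsigned int c_len,
				   unsigned int blocks_per_cluster)
{
	unsigned int b_len = blocks_per_cluster * c_len;

	for (unsigned int i = 0; i < b_len; i++, block++)
		sketch_remove_block(ci, block);
}
```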

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/uptodate.c |   32 ++++++++++++++++++++++++++------
 fs/ocfs2/uptodate.h |    3 +++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/uptodate.c b/fs/ocfs2/uptodate.c
index 4da8851..e26459e 100644
--- a/fs/ocfs2/uptodate.c
+++ b/fs/ocfs2/uptodate.c
@@ -511,14 +511,10 @@ static void ocfs2_remove_metadata_tree(struct ocfs2_caching_info *ci,
 	ci->ci_num_cached--;
 }
 
-/* Called when we remove a chunk of metadata from an inode. We don't
- * bother reverting things to an inlined array in the case of a remove
- * which moves us back under the limit. */
-void ocfs2_remove_from_cache(struct inode *inode,
-			     struct buffer_head *bh)
+static void ocfs2_remove_block_from_cache(struct inode *inode,
+					  sector_t block)
 {
 	int index;
-	sector_t block = bh->b_blocknr;
 	struct ocfs2_meta_cache_item *item = NULL;
 	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 	struct ocfs2_caching_info *ci = &oi->ip_metadata_cache;
@@ -544,6 +540,30 @@ void ocfs2_remove_from_cache(struct inode *inode,
 		kmem_cache_free(ocfs2_uptodate_cachep, item);
 }
 
+/*
+ * Called when we remove a chunk of metadata from an inode. We don't
+ * bother reverting things to an inlined array in the case of a remove
+ * which moves us back under the limit.
+ */
+void ocfs2_remove_from_cache(struct inode *inode,
+			     struct buffer_head *bh)
+{
+	sector_t block = bh->b_blocknr;
+
+	ocfs2_remove_block_from_cache(inode, block);
+}
+
+/* Called when we remove xattr clusters from an inode. */
+void ocfs2_remove_xattr_clusters_from_cache(struct inode *inode,
+					    sector_t block,
+					    u32 c_len)
+{
+	u64 i, b_len = ocfs2_clusters_to_blocks(inode->i_sb, ...
From: Andrew Morton
Date: Wednesday, October 1, 2008 - 11:11 pm

I really really hope that `i' and `b_len' didn't really need to be
64-bit here.
--

From: Mark Fasheh
Date: Tuesday, October 7, 2008 - 1:18 pm

Yeah, there's no way currently that any of those variables should get even
close to that large. I made them unsigned ints with the patch below.
	--Mark

--
Mark Fasheh

From: Mark Fasheh <mfasheh@suse.com>

ocfs2: use smaller counters in ocfs2_remove_xattr_clusters_from_cache

i and b_len don't really need to be u64's. Xattr extent lengths should be
limited by the VFS, and then the size of our on-disk length field.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/uptodate.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/uptodate.c b/fs/ocfs2/uptodate.c
index 5235140..187b99f 100644
--- a/fs/ocfs2/uptodate.c
+++ b/fs/ocfs2/uptodate.c
@@ -562,7 +562,7 @@ void ocfs2_remove_xattr_clusters_from_cache(struct inode *inode,
 					    sector_t block,
 					    u32 c_len)
 {
-	u64 i, b_len = ocfs2_clusters_to_blocks(inode->i_sb, 1) * c_len;
+	unsigned int i, b_len = ocfs2_clusters_to_blocks(inode->i_sb, 1) * c_len;
 
 	for (i = 0; i < b_len; i++, block++)
 		ocfs2_remove_block_from_cache(inode, block);
-- 
1.5.4.1

--

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

In the old extent tree operation, we assume that we are using the
ocfs2_extent_list in ocfs2_dinode as the tree root. As xattr will also
use ocfs2_extent_list to store a large value for an xattr entry, we
refactor the tree operation so that xattr can use it directly.

The refactoring includes 4 steps:
1. Abstract set/get of last_eb_blk and update_clusters since they may
   be stored in different locations for dinode and xattr.
2. Add a new structure named ocfs2_extent_tree to indicate the
   extent tree the operation will work on.
3. Remove all uses of fe_bh and di, and use root_bh and root_el in the
   extent tree instead. So now every fe_bh is replaced with
   et->root_bh, and el with root_el accordingly.
4. Make ocfs2_lock_allocators generic. Until now it could only be used
   for file extend allocation, but the whole function is useful when we
   want to store large EAs.

Note: This patch doesn't touch ocfs2_commit_truncate() since it is not used
for anything other than truncating inode data btrees.
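
Steps 1 and 2 can be sketched as a callback table hanging off a small tree
descriptor. This is a simplified illustration, not the ocfs2_extent_tree
layout: the field names, the two toy root structures and every sketch_
identifier are assumptions.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_extent_tree;

/* Callbacks hiding where each root type keeps its bookkeeping fields. */
struct sketch_et_ops {
	void (*set_last_eb_blk)(struct sketch_extent_tree *et, uint64_t blkno);
	uint64_t (*get_last_eb_blk)(struct sketch_extent_tree *et);
	void (*update_clusters)(struct sketch_extent_tree *et, uint32_t n);
};

struct sketch_extent_tree {
	const struct sketch_et_ops *ops;
	void *root;		/* structure that embeds the tree root */
};

/* Two hypothetical roots holding the same fields in different places. */
struct sketch_dinode { uint32_t clusters; uint64_t last_eb_blk; };
struct sketch_xattr_value { uint64_t last_eb_blk; uint32_t clusters; };

static void sketch_di_set_last(struct sketch_extent_tree *et, uint64_t blkno)
{
	((struct sketch_dinode *)et->root)->last_eb_blk = blkno;
}

static uint64_t sketch_di_get_last(struct sketch_extent_tree *et)
{
	return ((struct sketch_dinode *)et->root)->last_eb_blk;
}

static void sketch_di_update(struct sketch_extent_tree *et, uint32_t n)
{
	((struct sketch_dinode *)et->root)->clusters += n;
}

static void sketch_xv_set_last(struct sketch_extent_tree *et, uint64_t blkno)
{
	((struct sketch_xattr_value *)et->root)->last_eb_blk = blkno;
}

static uint64_t sketch_xv_get_last(struct sketch_extent_tree *et)
{
	return ((struct sketch_xattr_value *)et->root)->last_eb_blk;
}

static void sketch_xv_update(struct sketch_extent_tree *et, uint32_t n)
{
	((struct sketch_xattr_value *)et->root)->clusters += n;
}

static const struct sketch_et_ops sketch_dinode_et_ops = {
	.set_last_eb_blk	= sketch_di_set_last,
	.get_last_eb_blk	= sketch_di_get_last,
	.update_clusters	= sketch_di_update,
};

static const struct sketch_et_ops sketch_xattr_value_et_ops = {
	.set_last_eb_blk	= sketch_xv_set_last,
	.get_last_eb_blk	= sketch_xv_get_last,
	.update_clusters	= sketch_xv_update,
};

/* Generic tree code: never looks inside the root structure itself. */
static void sketch_tree_grew(struct sketch_extent_tree *et,
			     uint64_t last_eb_blk, uint32_t added)
{
	assert(et && et->ops && et->root);

	et->ops->set_last_eb_blk(et, last_eb_blk);
	et->ops->update_clusters(et, added);
}
```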

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c    |  508 +++++++++++++++++++++++++++++++++------------------
 fs/ocfs2/alloc.h    |   23 ++-
 fs/ocfs2/aops.c     |   11 +-
 fs/ocfs2/dir.c      |    7 +-
 fs/ocfs2/file.c     |  104 ++---------
 fs/ocfs2/file.h     |    4 -
 fs/ocfs2/suballoc.c |   82 ++++++++
 fs/ocfs2/suballoc.h |    5 +
 8 files changed, 456 insertions(+), 288 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index dc844df..90cefc5 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -49,6 +49,143 @@
 
 #include "buffer_head_io.h"
 
+/*
+ * ocfs2_extent_tree and ocfs2_extent_tree_operations are used to abstract
+ * the b-tree operations in ocfs2. Now all the b-tree operations are not
+ * limited to ocfs2_dinode only. Any data which need to allocate clusters
+ * to store can use b-tree. And it only needs to implement its ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

ocfs2_extend_meta_needed(), ocfs2_calc_extend_credits() and
ocfs2_reserve_new_metadata() are all useful for extent tree operations. But
they are all limited to an inode btree because they use a struct
ocfs2_dinode parameter. Change their parameter to struct ocfs2_extent_list
(the part of an ocfs2_dinode they actually use) so that the xattr btree code
can use these functions.
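
A minimal sketch of the parameter change: once a helper takes only the
embedded extent list, any structure that embeds one can call it. The
structures are simplified stand-ins, and the "depth + 2" estimate is an
assumed worst case chosen for illustration, not a quote of the real helper.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_extent_list { uint16_t l_tree_depth; };

/* Two hypothetical structures that each embed an extent list. */
struct sketch_dinode {
	uint64_t i_size;
	struct sketch_extent_list i_list;
};

struct sketch_xattr_value_root {
	uint32_t xr_clusters;
	struct sketch_extent_list xr_list;
};

/*
 * Assumed worst case, for illustration only: one block per existing
 * tree level plus two more for a depth shift.
 */
static int sketch_extend_meta_needed(const struct sketch_extent_list *root_el)
{
	assert(root_el);

	return root_el->l_tree_depth + 2;
}
```

A caller holding an inode passes &di->i_list, while xattr code passes the
list embedded in its own root, and neither needs a struct ocfs2_dinode.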

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c    |    3 ++-
 fs/ocfs2/alloc.h    |   12 +++++++++---
 fs/ocfs2/aops.c     |    3 ++-
 fs/ocfs2/dir.c      |    5 +++--
 fs/ocfs2/file.c     |    9 +++++----
 fs/ocfs2/journal.h  |   17 +++++++++++------
 fs/ocfs2/suballoc.c |    4 ++--
 fs/ocfs2/suballoc.h |    7 ++++++-
 8 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index c74711f..dc844df 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -4536,7 +4536,8 @@ static int ocfs2_split_tree(struct inode *inode, struct buffer_head *di_bh,
 	} else
 		rightmost_el = path_leaf_el(path);
 
-	credits += path->p_tree_depth + ocfs2_extend_meta_needed(di);
+	credits += path->p_tree_depth +
+		   ocfs2_extend_meta_needed(&di->id2.i_list);
 	ret = ocfs2_extend_trans(handle, credits);
 	if (ret) {
 		mlog_errno(ret);
diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h
index 758dbda..249e79e 100644
--- a/fs/ocfs2/alloc.h
+++ b/fs/ocfs2/alloc.h
@@ -48,8 +48,14 @@ int ocfs2_remove_extent(struct inode *inode, struct buffer_head *di_bh,
 int ocfs2_num_free_extents(struct ocfs2_super *osb,
 			   struct inode *inode,
 			   struct buffer_head *bh);
-/* how many new metadata chunks would an allocation need at maximum? */
-static inline int ocfs2_extend_meta_needed(struct ocfs2_dinode *fe)
+/*
+ * how many new metadata chunks would an allocation need at maximum?
+ *
+ * Please note that the caller must make sure that root_el is the root
+ * of extent tree. So for ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

Add some thin wrappers around ocfs2_insert_extent() for each of the 3
different btree types: ocfs2_dinode_insert_extent(),
ocfs2_xattr_value_insert_extent() and ocfs2_xattr_tree_insert_extent(). The
last is for the xattr index btree, which will be used in a followup patch.

All the old callers in file.c etc. will call ocfs2_dinode_insert_extent(),
while the other two handle the xattr cases. Initialization of the extent
tree is handled by these functions.

When storing an xattr value which is too large, we will allocate some
clusters for it, and here ocfs2_extent_list and ocfs2_extent_rec will also be
used. In order to re-use the b-tree operation code, a new parameter named
"private" is added into ocfs2_extent_tree and it is used to indicate the root
of ocfs2_extent_list. The reason is that we can't deduce the root from the
buffer_head now. It may be in an inode, an ocfs2_xattr_block or even worse,
in any place in an ocfs2_xattr_bucket.
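
The need for the extra root pointer can be sketched as follows. The block
layout, the counter field and all sketch_ names are hypothetical; the point
is only that the inode wrapper can find its root from the block, while the
xattr value wrapper has to be told where the root is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_extent_list { uint16_t l_next_free_rec; };

/* Stand-in for a disk block: a root list may live in several places. */
struct sketch_block {
	struct sketch_extent_list inode_root;		/* fixed location */
	struct sketch_extent_list value_roots[4];	/* location varies */
};

struct sketch_extent_tree {
	struct sketch_block *root_bh;
	struct sketch_extent_list *root_el;
};

/* Generic insert: only ever touches et->root_el. */
static int sketch_insert_extent(struct sketch_extent_tree *et)
{
	assert(et->root_bh && et->root_el);

	et->root_el->l_next_free_rec++;
	return 0;
}

/* Inode wrapper: the root can be deduced from the block itself. */
static int sketch_dinode_insert_extent(struct sketch_block *bh)
{
	struct sketch_extent_tree et = { bh, &bh->inode_root };

	return sketch_insert_extent(&et);
}

/* Xattr value wrapper: the caller supplies the root via priv. */
static int sketch_xattr_value_insert_extent(struct sketch_block *bh,
					    void *priv)
{
	struct sketch_extent_tree et = { bh, priv };

	return sketch_insert_extent(&et);
}
```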

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/Makefile          |    3 +-
 fs/ocfs2/alloc.c           |  184 +++++++++++++++++++++-----
 fs/ocfs2/alloc.h           |   42 ++++--
 fs/ocfs2/aops.c            |    5 +-
 fs/ocfs2/cluster/masklog.c |    1 +
 fs/ocfs2/cluster/masklog.h |    1 +
 fs/ocfs2/dir.c             |   11 +-
 fs/ocfs2/extent_map.c      |   60 +++++++++
 fs/ocfs2/extent_map.h      |    3 +
 fs/ocfs2/file.c            |    9 +-
 fs/ocfs2/suballoc.c        |    5 +-
 fs/ocfs2/suballoc.h        |    3 +-
 fs/ocfs2/xattr.c           |  305 ++++++++++++++++++++++++++++++++++++++++++++
 13 files changed, 568 insertions(+), 64 deletions(-)
 create mode 100644 fs/ocfs2/xattr.c

diff --git a/fs/ocfs2/Makefile b/fs/ocfs2/Makefile
index f6956de..af63980 100644
--- a/fs/ocfs2/Makefile
+++ b/fs/ocfs2/Makefile
@@ -34,7 +34,8 @@ ocfs2-objs := \
 	symlink.o 		\
 	sysfile.o 		\
 	uptodate.o		\
-	ver.o
+	ver.o			\
+	xattr.o			\
 
 ocfs2_stackglue-objs ...
From: Andrew Morton
Date: Wednesday, October 1, 2008 - 11:12 pm

many etceteras.
--

From: Mark Fasheh
Date: Tuesday, October 7, 2008 - 1:19 pm

Thanks, luckily though, it seems these got fixed later in the series :)
	--Mark

--
Mark Fasheh
--

From: Andrew Morton
Date: Wednesday, October 1, 2008 - 11:12 pm

brelse(0) is legal.  Please do an fs-wide review for this.

It shouldn't affect code generation because brelse() is inlined.
--

From: Mark Fasheh
Date: Tuesday, October 7, 2008 - 2:19 pm

Ok, I'll queue that up. I'll watch to make sure that new patches from here

Fair enough, thanks.
	--Mark


--
Mark Fasheh
--

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

When necessary, an ocfs2_xattr_block will embed an ocfs2_extent_list to
store large numbers of EAs. This patch adds a new type in
ocfs2_extent_tree_type and adds the implementation so that we can re-use the
b-tree code to handle the storage of many EAs.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/alloc.h |   10 ++++++
 2 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index d175db1..47cdea6 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -177,6 +177,48 @@ static struct ocfs2_extent_tree_operations ocfs2_xattr_et_ops = {
 	.sanity_check		= ocfs2_xattr_value_sanity_check,
 };
 
+static void ocfs2_xattr_tree_set_last_eb_blk(struct ocfs2_extent_tree *et,
+					     u64 blkno)
+{
+	struct ocfs2_xattr_block *xb =
+		(struct ocfs2_xattr_block *) et->root_bh->b_data;
+	struct ocfs2_xattr_tree_root *xt = &xb->xb_attrs.xb_root;
+
+	xt->xt_last_eb_blk = cpu_to_le64(blkno);
+}
+
+static u64 ocfs2_xattr_tree_get_last_eb_blk(struct ocfs2_extent_tree *et)
+{
+	struct ocfs2_xattr_block *xb =
+		(struct ocfs2_xattr_block *) et->root_bh->b_data;
+	struct ocfs2_xattr_tree_root *xt = &xb->xb_attrs.xb_root;
+
+	return le64_to_cpu(xt->xt_last_eb_blk);
+}
+
+static void ocfs2_xattr_tree_update_clusters(struct inode *inode,
+					     struct ocfs2_extent_tree *et,
+					     u32 clusters)
+{
+	struct ocfs2_xattr_block *xb =
+			(struct ocfs2_xattr_block *)et->root_bh->b_data;
+
+	le32_add_cpu(&xb->xb_attrs.xb_root.xt_clusters, clusters);
+}
+
+static int ocfs2_xattr_tree_sanity_check(struct inode *inode,
+					 struct ocfs2_extent_tree *et)
+{
+	return 0;
+}
+
+static struct ocfs2_extent_tree_operations ocfs2_xattr_tree_et_ops = {
+	.set_last_eb_blk	= ocfs2_xattr_tree_set_last_eb_blk,
+	.get_last_eb_blk	= ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tiger Yang <tiger.yang@oracle.com>

Add the structures and helper functions we want for handling inline extended
attributes. We also update the inline-data handlers so that they properly
function in the event that we have both inline data and inline attributes
sharing an inode block.
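
The space accounting for the shared block can be sketched like this. It
follows the arithmetic visible in the ocfs2_zero_dinode_id2_with_xattr() hunk
below (block size, minus the offset of id2, minus the inline xattr size when
the flag is set); the function name and the sample numbers are illustrative
assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of the inode block that the data area (id2) may use. */
static size_t sketch_id2_bytes(size_t blocksize, size_t id2_offset,
			       int has_inline_xattr,
			       size_t xattr_inline_size)
{
	size_t avail;

	assert(id2_offset <= blocksize);
	avail = blocksize - id2_offset;

	if (has_inline_xattr) {
		/* Inline attributes take their share out of the data area. */
		assert(xattr_inline_size <= avail);
		avail -= xattr_inline_size;
	}

	return avail;
}
```

For example, with a 4096 byte block, an id2 offset of 192 and 256 bytes of
inline attributes, the data area shrinks from 3904 to 3648 bytes.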

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c    |   22 ++++++++++++++++------
 fs/ocfs2/ocfs2.h    |    1 +
 fs/ocfs2/ocfs2_fs.h |   46 +++++++++++++++++++++++++++++++++++++++++++---
 fs/ocfs2/super.c    |    2 ++
 4 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 130988f..d175db1 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -6586,20 +6586,29 @@ out:
 	return ret;
 }
 
-static void ocfs2_zero_dinode_id2(struct inode *inode, struct ocfs2_dinode *di)
+static void ocfs2_zero_dinode_id2_with_xattr(struct inode *inode,
+					     struct ocfs2_dinode *di)
 {
 	unsigned int blocksize = 1 << inode->i_sb->s_blocksize_bits;
+	unsigned int xattrsize = le16_to_cpu(di->i_xattr_inline_size);
 
-	memset(&di->id2, 0, blocksize - offsetof(struct ocfs2_dinode, id2));
+	if (le16_to_cpu(di->i_dyn_features) & OCFS2_INLINE_XATTR_FL)
+		memset(&di->id2, 0, blocksize -
+				    offsetof(struct ocfs2_dinode, id2) -
+				    xattrsize);
+	else
+		memset(&di->id2, 0, blocksize -
+				    offsetof(struct ocfs2_dinode, id2));
 }
 
 void ocfs2_dinode_new_extent_list(struct inode *inode,
 				  struct ocfs2_dinode *di)
 {
-	ocfs2_zero_dinode_id2(inode, di);
+	ocfs2_zero_dinode_id2_with_xattr(inode, di);
 	di->id2.i_list.l_tree_depth = 0;
 	di->id2.i_list.l_next_free_rec = 0;
-	di->id2.i_list.l_count = cpu_to_le16(ocfs2_extent_recs_per_inode(inode->i_sb));
+	di->id2.i_list.l_count = cpu_to_le16(
+		ocfs2_extent_recs_per_inode_with_xattr(inode->i_sb, di));
 }
 
 void ocfs2_set_inode_data_inline(struct inode *inode, struct ocfs2_dinode *di)
@@ -6616,9 +6625,10 @@ ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

ocfs2_num_free_extents() is used to find the number of free extent records
in an inode btree. Hence, it takes an "ocfs2_dinode" parameter. We want to
use this for extended attribute trees in the future, so genericize the
interface to take a buffer head. A future patch will allow that buffer_head
to contain any structure rooting an ocfs2 btree.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |    3 ++-
 fs/ocfs2/alloc.h |    2 +-
 fs/ocfs2/aops.c  |    5 +++--
 fs/ocfs2/dir.c   |    3 ++-
 fs/ocfs2/file.c  |   11 ++++++-----
 fs/ocfs2/file.h  |    2 +-
 6 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 10bfb46..c74711f 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -368,12 +368,13 @@ struct ocfs2_merge_ctxt {
  */
 int ocfs2_num_free_extents(struct ocfs2_super *osb,
 			   struct inode *inode,
-			   struct ocfs2_dinode *fe)
+			   struct buffer_head *bh)
 {
 	int retval;
 	struct ocfs2_extent_list *el;
 	struct ocfs2_extent_block *eb;
 	struct buffer_head *eb_bh = NULL;
+	struct ocfs2_dinode *fe = (struct ocfs2_dinode *)bh->b_data;
 
 	mlog_entry_void();
 
diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h
index 42ff94b..758dbda 100644
--- a/fs/ocfs2/alloc.h
+++ b/fs/ocfs2/alloc.h
@@ -47,7 +47,7 @@ int ocfs2_remove_extent(struct inode *inode, struct buffer_head *di_bh,
 			struct ocfs2_cached_dealloc_ctxt *dealloc);
 int ocfs2_num_free_extents(struct ocfs2_super *osb,
 			   struct inode *inode,
-			   struct ocfs2_dinode *fe);
+			   struct buffer_head *bh);
 /* how many new metadata chunks would an allocation need at maximum? */
 static inline int ocfs2_extend_meta_needed(struct ocfs2_dinode *fe)
 {
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index a53da14..e2008dc 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -1712,8 +1712,9 @@ int ocfs2_write_begin_nolock(struct address_space ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

Add code to lookup a given extended attribute in the xattr btree. Lookup
follows this general scheme:

1. Use ocfs2_xattr_get_rec to find the xattr extent record

2. Find the xattr bucket within the extent which may contain this xattr

3. Iterate the bucket to find the xattr. In ocfs2_xattr_block_get(), we need
   to recalculate the block offset and name offset for the right position of
   name/value.
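
Steps 2 and 3 can be sketched with toy structures. This is not the xattr.c
implementation: it assumes buckets are non-empty and ordered by the hash of
their first entry, ignores hash runs that span buckets, leaves out step 1
(finding the extent record), and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sketch_entry { uint32_t name_hash; const char *name; };
struct sketch_bucket { const struct sketch_entry *entries; size_t count; };

/* Step 2: find the last bucket whose first hash is <= the wanted hash. */
static const struct sketch_bucket *
sketch_find_bucket(const struct sketch_bucket *buckets, size_t n,
		   uint32_t hash)
{
	size_t lo = 0, hi = n;	/* candidate range is [lo, hi) */

	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		assert(buckets[mid].count > 0);
		if (buckets[mid].entries[0].name_hash <= hash)
			lo = mid;
		else
			hi = mid;
	}

	return n ? &buckets[lo] : NULL;
}

/* Step 3: scan the bucket, matching on hash first and then on name. */
static const struct sketch_entry *
sketch_find_in_bucket(const struct sketch_bucket *b, uint32_t hash,
		      const char *name)
{
	for (size_t i = 0; b && i < b->count; i++) {
		const struct sketch_entry *xe = &b->entries[i];

		if (xe->name_hash == hash && strcmp(xe->name, name) == 0)
			return xe;
	}

	return NULL;
}
```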

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/xattr.c |  351 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 328 insertions(+), 23 deletions(-)

diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index fb17f7f..acccdfa 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -99,12 +99,25 @@ struct ocfs2_xattr_search {
 	 */
 	struct buffer_head *xattr_bh;
 	struct ocfs2_xattr_header *header;
+	struct ocfs2_xattr_bucket bucket;
 	void *base;
 	void *end;
 	struct ocfs2_xattr_entry *here;
 	int not_found;
 };
 
+static int ocfs2_xattr_bucket_get_name_value(struct inode *inode,
+					     struct ocfs2_xattr_header *xh,
+					     int index,
+					     int *block_off,
+					     int *new_offset);
+
+static int ocfs2_xattr_index_block_find(struct inode *inode,
+					struct buffer_head *root_bh,
+					int name_index,
+					const char *name,
+					struct ocfs2_xattr_search *xs);
+
 static int ocfs2_xattr_tree_list_index_block(struct inode *inode,
 					struct ocfs2_xattr_tree_root *xt,
 					char *buffer,
@@ -604,7 +617,7 @@ static int ocfs2_xattr_find_entry(int name_index,
 }
 
 static int ocfs2_xattr_get_value_outside(struct inode *inode,
-					 struct ocfs2_xattr_search *xs,
+					 struct ocfs2_xattr_value_root *xv,
 					 void *buffer,
 					 size_t len)
 {
@@ -613,12 +626,8 @@ static int ocfs2_xattr_get_value_outside(struct inode *inode,
 	int i, ret = 0;
 	size_t cplen, blocksize;
 	struct buffer_head *bh = NULL;
-	struct ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Tao Ma <tao.ma@oracle.com>

In inode removal, we need to iterate all the buckets, remove any
externally-stored EA values and delete the xattr buckets.
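
The per-bucket pass can be sketched as below. It follows the loop visible in
the ocfs2_delete_xattr_in_bucket() hunk (skip entries whose value is stored
locally, release the rest), but the structures, the freed flag and the
sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_entry {
	int is_local;		/* value stored inside the bucket */
	int value_freed;	/* set once the outside value is released */
};

struct sketch_bucket {
	struct sketch_entry *entries;
	size_t count;
};

/* Release every externally stored value in one bucket. */
static size_t sketch_delete_xattr_in_bucket(struct sketch_bucket *bucket)
{
	size_t freed = 0;

	assert(bucket);
	for (size_t i = 0; i < bucket->count; i++) {
		struct sketch_entry *xe = &bucket->entries[i];

		if (xe->is_local)
			continue;	/* goes away with the bucket itself */

		xe->value_freed = 1;
		freed++;
	}

	return freed;
}

/* Walk all buckets; deleting the buckets themselves would follow. */
static size_t sketch_delete_all(struct sketch_bucket *buckets, size_t n)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++)
		freed += sketch_delete_xattr_in_bucket(&buckets[i]);

	return freed;
}
```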

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/xattr.c |   84 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 5e8fae9..9ec7136 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -131,6 +131,9 @@ static int ocfs2_xattr_set_entry_index_block(struct inode *inode,
 					     struct ocfs2_xattr_info *xi,
 					     struct ocfs2_xattr_search *xs);
 
+static int ocfs2_delete_xattr_index_block(struct inode *inode,
+					  struct buffer_head *xb_bh);
+
 static inline struct xattr_handler *ocfs2_xattr_handler(int name_index)
 {
 	struct xattr_handler *handler = NULL;
@@ -1511,13 +1514,14 @@ static int ocfs2_xattr_block_remove(struct inode *inode,
 				    struct buffer_head *blk_bh)
 {
 	struct ocfs2_xattr_block *xb;
-	struct ocfs2_xattr_header *header;
 	int ret = 0;
 
 	xb = (struct ocfs2_xattr_block *)blk_bh->b_data;
-	header = &(xb->xb_attrs.xb_header);
-
-	ret = ocfs2_remove_value_outside(inode, blk_bh, header);
+	if (!(le16_to_cpu(xb->xb_flags) & OCFS2_XATTR_INDEXED)) {
+		struct ocfs2_xattr_header *header = &(xb->xb_attrs.xb_header);
+		ret = ocfs2_remove_value_outside(inode, blk_bh, header);
+	} else
+		ret = ocfs2_delete_xattr_index_block(inode, blk_bh);
 
 	return ret;
 }
@@ -4738,3 +4742,75 @@ out:
 	mlog_exit(ret);
 	return ret;
 }
+
+static int ocfs2_delete_xattr_in_bucket(struct inode *inode,
+					struct ocfs2_xattr_bucket *bucket,
+					void *para)
+{
+	int ret = 0;
+	struct ocfs2_xattr_header *xh = bucket->xh;
+	u16 i;
+	struct ocfs2_xattr_entry *xe;
+
+	for (i = 0; i < le16_to_cpu(xh->xh_count); i++) {
+		xe = &xh->xh_entries[i];
+		if (ocfs2_xattr_is_local(xe))
+			continue;
+
+		ret = ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Tiger Yang <tiger.yang@oracle.com>

This patch adds the s_incompat flag for extended attribute support. This
helps us ensure that older versions of Ocfs2 or ocfs2-tools will not be able
to mount a volume with xattr support.
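
As a rough sketch of how an incompat bit keeps older code away from a volume: a mount succeeds only if every incompat bit set on the volume is one the running code knows. The demo_* names and the bit values below are illustrative assumptions; the real definitions live in fs/ocfs2/ocfs2_fs.h.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values only. */
#define DEMO_INCOMPAT_INLINE_DATA	0x0040u
#define DEMO_INCOMPAT_XATTR		0x0200u

/* Returns 1 if the volume uses no incompat feature outside 'supported'. */
static int demo_can_mount(uint32_t volume_incompat, uint32_t supported)
{
	return (volume_incompat & ~supported) == 0;
}

/* Returns 1 if the volume has the xattr feature bit set. */
static int demo_supports_xattr(uint32_t volume_incompat)
{
	return (volume_incompat & DEMO_INCOMPAT_XATTR) != 0;
}
```

An older driver whose supported mask lacks the xattr bit would fail demo_can_mount() for a volume that has it set.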

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/ocfs2.h    |    7 +++++++
 fs/ocfs2/ocfs2_fs.h |   19 +++++++++++++------
 fs/ocfs2/super.c    |    3 ++-
 fs/ocfs2/xattr.c    |   12 ++++++++++++
 4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index cae0dd4..6d3c10d 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -363,6 +363,13 @@ static inline int ocfs2_supports_inline_data(struct ocfs2_super *osb)
 	return 0;
 }
 
+static inline int ocfs2_supports_xattr(struct ocfs2_super *osb)
+{
+	if (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_XATTR)
+		return 1;
+	return 0;
+}
+
 /* set / clear functions because cluster events can make these happen
  * in parallel so we want the transitions to be atomic. this also
  * means that any future flags osb_flags must be protected by spinlock
diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 8d5e72f..f24ce3d 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -91,7 +91,8 @@
 					 | OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC \
 					 | OCFS2_FEATURE_INCOMPAT_INLINE_DATA \
 					 | OCFS2_FEATURE_INCOMPAT_EXTENDED_SLOT_MAP \
-					 | OCFS2_FEATURE_INCOMPAT_USERSPACE_STACK)
+					 | OCFS2_FEATURE_INCOMPAT_USERSPACE_STACK \
+					 | OCFS2_FEATURE_INCOMPAT_XATTR)
 #define OCFS2_FEATURE_RO_COMPAT_SUPP	OCFS2_FEATURE_RO_COMPAT_UNWRITTEN
 
 /*
@@ -128,10 +129,6 @@
 /* Support for data packed into inode blocks */
 #define OCFS2_FEATURE_INCOMPAT_INLINE_DATA	0x0040
 
-/* Support for the extended slot map */
-#define OCFS2_FEATURE_INCOMPAT_EXTENDED_SLOT_MAP 0x100
-
-
 /*
  * Support for alternate, userspace cluster stacks.  If set, the superblock
  * field ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

This patch fixes the following build warnings:

fs/ocfs2/xattr.c: In function 'ocfs2_half_xattr_bucket':
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 7 has type 'long int'
fs/ocfs2/xattr.c:3282: warning: format '%d' expects type 'int', but argument 8 has type 'long int'
fs/ocfs2/xattr.c: In function 'ocfs2_xattr_set_entry_in_bucket':
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'
fs/ocfs2/xattr.c:4092: warning: format '%d' expects type 'int', but argument 6 has type 'size_t'
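
For reference, the same class of warning can also be avoided without casts by using the C99 length modifiers. The sketch below is plain user-space C, not the patch itself (the patch casts to int); demo_format_offsets() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a size and a pointer difference without casts:
 * %zu takes a size_t and %td takes a ptrdiff_t.
 */
static int demo_format_offsets(char *out, size_t out_len,
			       const char *base, const char *pos, size_t len)
{
	assert(pos >= base);
	return snprintf(out, out_len, "len %zu at offset %td", len, pos - base);
}
```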

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/xattr.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 090449f..1b349c7 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -3264,7 +3264,8 @@ static int ocfs2_half_xattr_bucket(struct inode *inode,
 	xe = &xh->xh_entries[start];
 	len = sizeof(struct ocfs2_xattr_entry) * (count - start);
 	mlog(0, "mv xattr entry len %d from %d to %d\n", len,
-	     (char *)xe - (char *)xh, (char *)xh->xh_entries - (char *)xh);
+	     (int)((char *)xe - (char *)xh),
+	     (int)((char *)xh->xh_entries - (char *)xh));
 	memmove((char *)xh->xh_entries, (char *)xe, len);
 	xe = &xh->xh_entries[count - start];
 	len = sizeof(struct ocfs2_xattr_entry) * start;
@@ -4073,8 +4074,8 @@ static int ocfs2_xattr_set_entry_in_bucket(struct inode *inode,
 	u16 blk_per_bucket = ocfs2_blocks_per_xattr_bucket(inode->i_sb);
 	struct ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

Where the previous patches added the ability to list/get xattrs in buckets
for ocfs2, this patch enables ocfs2 to store large numbers of EAs.

The original design doc was written by Mark Fasheh, and it can be found at
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/IndexedEATrees. I only had to
make small modifications to it.

First, because the bucket size is 4K, a new field named xh_free_start is added
in ocfs2_xattr_header to indicate the next valid name/value offset in a bucket.
It is used when we store new EA name/value. With this field, we can find the
place more quickly and what's more, we don't need to sort the name/value every
time to let the last entry indicate the next unused space. This makes the
insert operation more efficient for blocksizes smaller than 4k.

Because of the new xh_free_start, another field named as xh_name_value_len is
also added in ocfs2_xattr_header. It records the total length of all the
name/values in the bucket. We need this so that we can check it and defragment
the bucket if there is not enough contiguous free space.

An xattr insertion looks like this:
1. xattr_index_block_find: find the right bucket by the name_hash, say bucketA.
2. check whether there is enough space in bucketA. If yes, insert it directly
   and modify xh_free_start and xh_name_value_len accordingly. If not, check
   xh_name_value_len to see whether we can store this by defragmenting the
   bucket. If yes, defragment it and go on with the insertion.
3. If defragmenting doesn't work, check whether there is a new empty bucket in
   the clusters within this extent record. If yes, init the new bucket and move
   all the buckets after bucketA one by one to the next bucket. Move half of the
   entries in bucketA to the next bucket and go on with the insertion.
4. If there is no new bucket, grow the extent tree.
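
The decision in step 2 can be sketched as follows. This is a simplified model under stated assumptions: the index is taken to grow up from the start of the bucket, name/value data to grow down from the end, and 'need' to cover everything the new xattr adds. The demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BUCKET_SIZE 4096u

enum demo_insert_action {
	DEMO_INSERT_DIRECT,	/* enough contiguous free space */
	DEMO_INSERT_DEFRAG,	/* enough space once holes are compacted */
	DEMO_INSERT_SPLIT	/* bucket is really full: split or grow */
};

/* Hypothetical in-memory summary of one bucket. */
struct demo_bucket_state {
	size_t index_end;	/* end of header + entry index */
	size_t free_start;	/* lowest offset used by name/value data */
	size_t name_value_len;	/* bytes of live name/value data */
};

static enum demo_insert_action
demo_choose_insert(const struct demo_bucket_state *b, size_t need)
{
	size_t contiguous, total_free;

	assert(b->index_end <= b->free_start);
	assert(b->free_start <= DEMO_BUCKET_SIZE);
	assert(b->index_end + b->name_value_len <= DEMO_BUCKET_SIZE);

	/* Gap between the index and the name/value area. */
	contiguous = b->free_start - b->index_end;
	/* Space available if all dead name/value holes were squeezed out. */
	total_free = DEMO_BUCKET_SIZE - b->index_end - b->name_value_len;

	if (need <= contiguous)
		return DEMO_INSERT_DIRECT;
	if (need <= total_free)
		return DEMO_INSERT_DEFRAG;
	return DEMO_INSERT_SPLIT;
}
```

The two counters serve different checks: free_start answers "does it fit right now", while name_value_len answers "would it fit after defragmenting".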

As for xattr deletion, we will delete an xattr bucket when all its xattrs
are removed and move all the buckets after it to the previous one. When all
the ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

The ocfs2_extent_tree_operations structure gains a field prefix on its
members.  The ->eo_sanity_check() operation gains a wrapper function for
completeness.  All of the extent tree operation wrappers gain a
consistent name (ocfs2_et_*()).

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   85 +++++++++++++++++++++++++++++------------------------
 1 files changed, 46 insertions(+), 39 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 16879bd..9fe49f2 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -65,12 +65,13 @@
 struct ocfs2_extent_tree;
 
 struct ocfs2_extent_tree_operations {
-	void (*set_last_eb_blk) (struct ocfs2_extent_tree *et, u64 blkno);
-	u64 (*get_last_eb_blk) (struct ocfs2_extent_tree *et);
-	void (*update_clusters) (struct inode *inode,
-				 struct ocfs2_extent_tree *et,
-				 u32 new_clusters);
-	int (*sanity_check) (struct inode *inode, struct ocfs2_extent_tree *et);
+	void (*eo_set_last_eb_blk)(struct ocfs2_extent_tree *et,
+				   u64 blkno);
+	u64 (*eo_get_last_eb_blk)(struct ocfs2_extent_tree *et);
+	void (*eo_update_clusters)(struct inode *inode,
+				   struct ocfs2_extent_tree *et,
+				   u32 new_clusters);
+	int (*eo_sanity_check)(struct inode *inode, struct ocfs2_extent_tree *et);
 };
 
 struct ocfs2_extent_tree {
@@ -132,10 +133,10 @@ static int ocfs2_dinode_sanity_check(struct inode *inode,
 }
 
 static struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = {
-	.set_last_eb_blk	= ocfs2_dinode_set_last_eb_blk,
-	.get_last_eb_blk	= ocfs2_dinode_get_last_eb_blk,
-	.update_clusters	= ocfs2_dinode_update_clusters,
-	.sanity_check		= ocfs2_dinode_sanity_check,
+	.eo_set_last_eb_blk	= ocfs2_dinode_set_last_eb_blk,
+	.eo_get_last_eb_blk	= ocfs2_dinode_get_last_eb_blk,
+	.eo_update_clusters	= ocfs2_dinode_update_clusters,
+	.eo_sanity_check	= ocfs2_dinode_sanity_check,
 };
 
 static void ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

In xattr bucket, we want to limit the maximum size of a btree leaf,
otherwise we'll lose the benefits of hashing because we'll have to search
large leaves.

So add a new field in ocfs2_extent_tree which indicates the maximum leaf cluster
size we want so that we can prevent ocfs2_insert_extent() from merging the leaf
record even if it is contiguous with an adjacent record.

Other btree types are not affected by this change.
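
A minimal sketch of the merge rule, assuming a limit of 0 means "no limit" and with physical positions simplified to cluster units. The demo_* names are hypothetical and this is not the code in ocfs2_figure_contig_type().

```c
#include <assert.h>

struct demo_extent {
	unsigned int cpos;	/* logical start, in clusters */
	unsigned int clusters;	/* length, in clusters */
	unsigned int pstart;	/* physical start, in clusters */
};

/*
 * Returns 1 if 'ins' may be merged onto the right edge of 'left'.
 * max_leaf_clusters == 0 disables the size limit.
 */
static int demo_can_merge_right(const struct demo_extent *left,
				const struct demo_extent *ins,
				unsigned int max_leaf_clusters)
{
	assert(left != 0 && ins != 0);

	if (left->cpos + left->clusters != ins->cpos)
		return 0;	/* not logically adjacent */
	if (left->pstart + left->clusters != ins->pstart)
		return 0;	/* not physically adjacent */
	if (max_leaf_clusters &&
	    left->clusters + ins->clusters > max_leaf_clusters)
		return 0;	/* merged record would exceed the limit */
	return 1;
}
```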

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   39 ++++++++++++++++++++++++++++++---------
 fs/ocfs2/alloc.h |    5 +++++
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 47cdea6..16879bd 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -79,6 +79,7 @@ struct ocfs2_extent_tree {
 	struct buffer_head *root_bh;
 	struct ocfs2_extent_list *root_el;
 	void *private;
+	unsigned int max_leaf_clusters;
 };
 
 static void ocfs2_dinode_set_last_eb_blk(struct ocfs2_extent_tree *et,
@@ -220,7 +221,8 @@ static struct ocfs2_extent_tree_operations ocfs2_xattr_tree_et_ops = {
 };
 
 static struct ocfs2_extent_tree*
-	 ocfs2_new_extent_tree(struct buffer_head *bh,
+	 ocfs2_new_extent_tree(struct inode *inode,
+			       struct buffer_head *bh,
 			       enum ocfs2_extent_tree_type et_type,
 			       void *private)
 {
@@ -248,6 +250,8 @@ static struct ocfs2_extent_tree*
 			(struct ocfs2_xattr_block *)bh->b_data;
 		et->root_el = &xb->xb_attrs.xb_root.xt_list;
 		et->eops = &ocfs2_xattr_tree_et_ops;
+		et->max_leaf_clusters = ocfs2_clusters_for_bytes(inode->i_sb,
+						OCFS2_MAX_XATTR_TREE_LEAF_SIZE);
 	}
 
 	return et;
@@ -4118,7 +4122,8 @@ out:
 static void ocfs2_figure_contig_type(struct inode *inode,
 				     struct ocfs2_insert_type *insert,
 				     struct ocfs2_extent_list *el,
-				     struct ocfs2_extent_rec *insert_rec)
+				     struct ocfs2_extent_rec *insert_rec,
+				    ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tao Ma <tao.ma@oracle.com>

Ocfs2 breaks up xattr index tree leaves into 4k regions, called buckets.
Attributes are stored within a given bucket, depending on hash value.

After a discussion with Mark, we decided that the per-bucket index
(xe_entry[]) would only exist in the 1st block of a bucket. Likewise,
name/value pairs will not straddle more than one block. This allows the
majority of operations to work directly on the buffer heads in a leaf block.

This patch adds code to iterate the buckets in an EA. A new abstraction,
ocfs2_xattr_bucket, is added. It records the bhs in this bucket and the
ocfs2_xattr_header. This keeps the code neat, improving readability.
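
To make the bucket/block relationship concrete, here is a small sketch of the arithmetic, assuming the 4k bucket size given above and a block size that divides it evenly. The demo_* helpers are hypothetical.

```c
#include <assert.h>

#define DEMO_XATTR_BUCKET_SIZE 4096u

/* Number of filesystem blocks that make up one bucket. */
static unsigned int demo_blocks_per_bucket(unsigned int blocksize)
{
	assert(blocksize > 0 && blocksize <= DEMO_XATTR_BUCKET_SIZE);
	assert(DEMO_XATTR_BUCKET_SIZE % blocksize == 0);
	return DEMO_XATTR_BUCKET_SIZE / blocksize;
}

/* First block of the n-th bucket in a leaf that starts at leaf_blkno. */
static unsigned long long demo_bucket_start(unsigned long long leaf_blkno,
					    unsigned int n,
					    unsigned int blocksize)
{
	return leaf_blkno +
	       (unsigned long long)n * demo_blocks_per_bucket(blocksize);
}
```

With a 4k block size a bucket is a single block, so the per-bucket index and every name/value pair trivially live in one buffer head.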

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/ocfs2_fs.h |   35 +++++++-
 fs/ocfs2/xattr.c    |  255 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ocfs2/xattr.h    |    9 ++
 3 files changed, 293 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 98e1f8b..8d5e72f 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -755,8 +755,13 @@ struct ocfs2_xattr_header {
 	__le16	xh_count;                       /* contains the count of how
 						   many records are in the
 						   local xattr storage. */
-	__le16	xh_reserved1;
-	__le32	xh_reserved2;
+	__le16	xh_free_start;                  /* current offset for storing
+						   xattr. */
+	__le16	xh_name_value_len;              /* total length of name/value
+						   length in this bucket. */
+	__le16	xh_num_buckets;                 /* bucket nums in one extent
+						   record, only valid in the
+						   first bucket. */
 	__le64  xh_csum;
 	struct ocfs2_xattr_entry xh_entries[0]; /* xattr entry list. */
 };
@@ -793,6 +798,10 @@ struct ocfs2_xattr_tree_root {
 #define OCFS2_XATTR_SIZE(size)	(((size) + OCFS2_XATTR_ROUND) & \
 				~(OCFS2_XATTR_ROUND))
 
+#define OCFS2_XATTR_BUCKET_SIZE			4096
+#define OCFS2_XATTR_MAX_BLOCKS_PER_BUCKET ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

From: Tiger Yang <tiger.yang@oracle.com>

This patch implements storing extended attributes either in the inode or in a
single external block. We only store EAs in-inode when blocksize > 512 or that
inode block has free space for it. When an EA's value is larger than 80
bytes, we will store the value via a b-tree outside the inode or block.
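
The value placement rule can be sketched as a one-line check. The 80-byte threshold comes from the text above; the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_XATTR_INLINE_SIZE 80u	/* threshold quoted above */

enum demo_value_place {
	DEMO_VALUE_INLINE,	/* stored next to the name */
	DEMO_VALUE_BTREE	/* stored in clusters reached via a b-tree */
};

static enum demo_value_place demo_place_value(size_t value_len)
{
	return value_len > DEMO_XATTR_INLINE_SIZE ? DEMO_VALUE_BTREE
						  : DEMO_VALUE_INLINE;
}
```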

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/Makefile        |    2 +
 fs/ocfs2/file.c          |    5 +
 fs/ocfs2/inode.c         |    8 +
 fs/ocfs2/inode.h         |    3 +
 fs/ocfs2/journal.h       |   10 +
 fs/ocfs2/namei.c         |    5 +
 fs/ocfs2/ocfs2.h         |    2 +
 fs/ocfs2/ocfs2_fs.h      |    8 +-
 fs/ocfs2/suballoc.c      |   17 +-
 fs/ocfs2/suballoc.h      |    3 +
 fs/ocfs2/super.c         |   14 +
 fs/ocfs2/symlink.c       |    9 +
 fs/ocfs2/xattr.c         | 1620 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/xattr.h         |   51 ++
 fs/ocfs2/xattr_trusted.c |   82 +++
 fs/ocfs2/xattr_user.c    |   94 +++
 16 files changed, 1927 insertions(+), 6 deletions(-)
 create mode 100644 fs/ocfs2/xattr.h
 create mode 100644 fs/ocfs2/xattr_trusted.c
 create mode 100644 fs/ocfs2/xattr_user.c

diff --git a/fs/ocfs2/Makefile b/fs/ocfs2/Makefile
index af63980..21323da 100644
--- a/fs/ocfs2/Makefile
+++ b/fs/ocfs2/Makefile
@@ -36,6 +36,8 @@ ocfs2-objs := \
 	uptodate.o		\
 	ver.o			\
 	xattr.o			\
+	xattr_user.o		\
+	xattr_trusted.o
 
 ocfs2_stackglue-objs := stackglue.o
 ocfs2_stack_o2cb-objs := stack_o2cb.o
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4dc5edf..7ddb363 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -55,6 +55,7 @@
 #include "mmap.h"
 #include "suballoc.h"
 #include "super.h"
+#include "xattr.h"
 
 #include "buffer_head_io.h"
 
@@ -2070,6 +2071,10 @@ const struct inode_operations ocfs2_file_iops = {
 	.setattr	= ocfs2_setattr,
 	.getattr	= ocfs2_getattr,
 	.permission	= ocfs2_permission,
+	.setxattr	= ...
From: Andrew Morton
Date: Wednesday, October 1, 2008 - 11:12 pm

Is there a documentation update for these?
--

From: Mark Fasheh
Date: Tuesday, October 7, 2008 - 1:22 pm

There is now  :) I'm actually usually a bit of a stickler for those too, but
obviously I missed this one.
	--Mark

--
Mark Fasheh

From: Mark Fasheh <mfasheh@suse.com>

ocfs2: Documentation update for user_xattr / nouser_xattr mount options

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 Documentation/filesystems/ocfs2.txt |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
index 6acf1b4..4340cc8 100644
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.txt
@@ -80,3 +80,5 @@ inode64			Indicates that Ocfs2 is allowed to create inodes at
 			any location in the filesystem, including those which
 			will result in inode numbers occupying more than 32
 			bits of significance.
+user_xattr	(*)	Enables Extended User Attributes.
+nouser_xattr		Disables Extended User Attributes.
-- 
1.5.4.1

--

From: Christoph Hellwig
Date: Thursday, October 2, 2008 - 1:16 am

Please don't split this up, it's always been a really stupid idea in
extN.  The only difference between secure, trusted and user attrs is
that they go into a different namespace bit (and have different
permission checking, but that's handled in the VFS).  I have some
upcoming patches to store a fs private flag in struct xattr_handler
so that even those flags wrappers can go away, and each of the
namespaces will just be five lines of code for the xattr_handler

You seem to need the handler mostly for getting back to the prefix
from the handler.  This is a pretty clear indicator that you don't
want to use the xattr_handler splitting but deal with the whole
attr name.  Take a look at the btrfs code after my recent xattr changes

And I think there's far too much inlining going on in here..

--

From: Mark Fasheh
Date: Tuesday, October 7, 2008 - 3:08 pm

Ok. The following patch (in ocfs2.git now) removes those two files, and puts
the code for user and trusted xattrs at the bottom of xattr.c. Is that


Yep, I went ahead and un-inlined that function.

Thanks for the review,
	--Mark

--
Mark Fasheh

From: Mark Fasheh <mfasheh@suse.com>

ocfs2: Move trusted and user attribute support into xattr.c

Per Christoph Hellwig's suggestion - don't split these up. It's not like we
gained much by having the two tiny files around.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/Makefile        |    4 +-
 fs/ocfs2/xattr.c         |  110 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/xattr_trusted.c |   82 ----------------------------------
 fs/ocfs2/xattr_user.c    |   94 ---------------------------------------
 4 files changed, 111 insertions(+), 179 deletions(-)
 delete mode 100644 fs/ocfs2/xattr_trusted.c
 delete mode 100644 fs/ocfs2/xattr_user.c

diff --git a/fs/ocfs2/Makefile b/fs/ocfs2/Makefile
index 21323da..589dcdf 100644
--- a/fs/ocfs2/Makefile
+++ b/fs/ocfs2/Makefile
@@ -35,9 +35,7 @@ ocfs2-objs := \
 	sysfile.o 		\
 	uptodate.o		\
 	ver.o			\
-	xattr.o			\
-	xattr_user.o		\
-	xattr_trusted.o
+	xattr.o
 
 ocfs2_stackglue-objs := stackglue.o
 ocfs2_stack_o2cb-objs := stack_o2cb.o
diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index e21a1a8..0f556b0 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -37,6 +37,9 @@
 #include <linux/writeback.h>
 #include <linux/falloc.h>
 #include <linux/sort.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/string.h>
 
 #define MLOG_MASK_PREFIX ML_XATTR
 #include <cluster/masklog.h>
@@ -4740,3 +4743,110 @@ static int ocfs2_delete_xattr_index_block(struct inode *inode,
 out:
 	return ret;
 }
+
+/*
+ * 'trusted' attributes support
+ */
+
+#define XATTR_TRUSTED_PREFIX "trusted."
+
+static size_t ocfs2_xattr_trusted_list(struct inode *inode, char *list,
+				       size_t list_size, const char *name,
+				       ...
From: Tiger Yang
Date: Tuesday, October 7, 2008 - 6:56 pm

I have looked at the btrfs patch for this. We are different.
Btrfs stores the whole xattr name, including the prefix ("user." or
"trusted."); we store an index number instead.

regards,
tiger
--

From: Christoph Hellwig
Date: Wednesday, October 8, 2008 - 6:16 am

In which case you shouldn't need to look the handler up anyway.  I'll
re-review the code once you post the next version.

--

From: Christoph Hellwig
Date: Wednesday, October 8, 2008 - 6:34 am

I looked at the git tree and there are two users of
ocfs2_xattr_handler().

 (1) for using the ->list handler in listattr.  That's something I fixed
     in btrfs that I wanted to point you to.  The whole concept of a
     ->list handler is stupid, and it was only added as a hack for
     the tmpfs "generic" xattr support which is a mess.  Instead of
     looking up a handler that would only do the same thing anyway
     for all on-disk attributes just call the code directly and
     have a map from index to prefix (look at
     fs/xfs/linux-2.6/xfs_xattr.c for an example).  You
     also have a check for OCFS2_MOUNT_NOUSERXATTR for the user
     attributes, but that's much easier done by just checking the
     index in an if (and I'd personally just kill it completely, the
     option doesn't seem useful - but that's an unrelated bit)
 
 (2) For generating the hash.  I don't quite understand why you want to
     also hash the prefix if it's not stored on disk anyway but sorted
     into the numeric buckets.
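
A minimal sketch of the index-to-prefix map suggested in (1), with the user-xattr mount option reduced to a plain check on the index. The index values and demo_* names are hypothetical, not the ocfs2 on-disk values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical index values. */
enum demo_xattr_index {
	DEMO_XATTR_INDEX_USER = 1,
	DEMO_XATTR_INDEX_TRUSTED = 2,
	DEMO_XATTR_INDEX_MAX
};

static const char *const demo_prefixes[DEMO_XATTR_INDEX_MAX] = {
	[DEMO_XATTR_INDEX_USER]		= "user.",
	[DEMO_XATTR_INDEX_TRUSTED]	= "trusted.",
};

/*
 * Return the namespace prefix for an index, or NULL if the index is
 * unknown or user xattrs are disabled.
 */
static const char *demo_xattr_prefix(int index, int nouser_xattr)
{
	if (index <= 0 || index >= DEMO_XATTR_INDEX_MAX)
		return NULL;
	if (index == DEMO_XATTR_INDEX_USER && nouser_xattr)
		return NULL;
	return demo_prefixes[index];
}
```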
--

From: Tao Ma
Date: Wednesday, October 8, 2008 - 7:04 am

yes, you are right. The handler for list is borrowed from ext3 and 
somewhat ugly. We just need the prefix name but use such a complicated 
This is done intentionally. See the design doc 
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/ExtendedAttributes.
"Each entry has a 32-bit hash value associated with it. The hash value 
is calculated using the full (prefix.suffix) name of the xattr to avoid 
hash collisions when the same suffix is used in multiple attribute 
namespaces. "
So Mark, do you think we need this prefix hash?
Anyway, if we reach consensus that the hash calculation doesn't need the
prefix any more, we can safely remove ocfs2_xattr_handler.

Regards,
Tao
--

From: Mark Fasheh
Date: Wednesday, October 8, 2008 - 5:38 pm

Removing the prefix hash should be fine. Technically, this changes the disk
format, but nobody should be using this for production yet anyway.
	--Mark

--
Mark Fasheh
--

From: Christoph Hellwig
Date: Wednesday, October 8, 2008 - 6:22 am

Yeah.

--

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

A per-mount debugfs file, "local_alloc", is created which, when read, will
expose live state of the node's local alloc file. Performance impact is
minimal, only a bit of memory overhead per mount point. Still, the code is
hidden behind CONFIG_OCFS2_FS_STATS. This feature will help us debug
local alloc performance problems on a live system.
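
For anyone consuming the file from user space, the line produced by the read handler in the diff below (version, last group descriptor, default bits, current bits, state, tab-separated) could be parsed roughly like this. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one line of the local_alloc debugfs file. */
struct demo_la_debug {
	unsigned int version;
	unsigned long long last_gd;
	unsigned int default_bits;
	unsigned int bits;
	unsigned int state;
};

/* Returns 0 on success, -1 if the line does not have all five fields. */
static int demo_la_debug_parse(const char *line, struct demo_la_debug *d)
{
	int n;

	assert(line != NULL && d != NULL);
	/* %x and %llx accept the leading "0x"; spaces match the tabs. */
	n = sscanf(line, "%x %llx %u %u %x", &d->version, &d->last_gd,
		   &d->default_bits, &d->bits, &d->state);
	return n == 5 ? 0 : -1;
}
```

Checking the leading version field first would let a tool reject formats it does not understand.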

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/localalloc.c |   87 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/ocfs2.h      |    5 +++
 2 files changed, 92 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index f71658a..b889f10 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -28,6 +28,7 @@
 #include <linux/slab.h>
 #include <linux/highmem.h>
 #include <linux/bitops.h>
+#include <linux/debugfs.h>
 
 #define MLOG_MASK_PREFIX ML_DISK_ALLOC
 #include <cluster/masklog.h>
@@ -73,6 +74,85 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super *osb,
 static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb,
 					  struct inode *local_alloc_inode);
 
+#ifdef CONFIG_OCFS2_FS_STATS
+
+DEFINE_MUTEX(la_debug_mutex);
+
+static int ocfs2_la_debug_open(struct inode *inode, struct file *file)
+{
+	file->private_data = inode->i_private;
+	return 0;
+}
+
+#define LA_DEBUG_BUF_SZ	PAGE_CACHE_SIZE
+#define LA_DEBUG_VER	1
+static ssize_t ocfs2_la_debug_read(struct file *file, char __user *userbuf,
+				   size_t count, loff_t *ppos)
+{
+	struct ocfs2_super *osb = file->private_data;
+	int written, ret;
+	char *buf = osb->local_alloc_debug_buf;
+
+	mutex_lock(&la_debug_mutex);
+	memset(buf, 0, LA_DEBUG_BUF_SZ);
+
+	written = snprintf(buf, LA_DEBUG_BUF_SZ,
+			   "0x%x\t0x%llx\t%u\t%u\t0x%x\n",
+			   LA_DEBUG_VER,
+			   (unsigned long long)osb->la_last_gd,
+			   osb->local_alloc_default_bits,
+			   osb->local_alloc_bits, osb->local_alloc_state);
+
+	ret = simple_read_from_buffer(userbuf, count, ppos, buf, ...
From: Andrew Morton
Date: Wednesday, October 1, 2008 - 11:11 pm

From: Mark Fasheh
Date: Tuesday, October 7, 2008 - 1:10 pm

Thanks, fixed in 'merge_window' branch of ocfs2.git.
	--Mark

--
Mark Fasheh

From: Mark Fasheh <mfasheh@suse.com>

ocfs2: make la_debug_mutex static

It can also be moved into ocfs2_la_debug_read().

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/localalloc.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index 02227c3..b1c634d 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -76,8 +76,6 @@ static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb,
 
 #ifdef CONFIG_OCFS2_FS_STATS
 
-DEFINE_MUTEX(la_debug_mutex);
-
 static int ocfs2_la_debug_open(struct inode *inode, struct file *file)
 {
 	file->private_data = inode->i_private;
@@ -89,6 +87,7 @@ static int ocfs2_la_debug_open(struct inode *inode, struct file *file)
 static ssize_t ocfs2_la_debug_read(struct file *file, char __user *userbuf,
 				   size_t count, loff_t *ppos)
 {
+	static DEFINE_MUTEX(la_debug_mutex);
 	struct ocfs2_super *osb = file->private_data;
 	int written, ret;
 	char *buf = osb->local_alloc_debug_buf;
-- 
1.5.4.1

--

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:00 pm

Ocfs2's local allocator disables itself for the duration of a mount point
when it has trouble allocating a large enough area from the primary bitmap.
That can cause performance problems, especially for disks which were only
temporarily full or fragmented. This patch allows for the allocator to
shrink its window first, before being disabled. Later, it can also be
re-enabled so that any performance drop is minimized.

To do this, we allow the value of osb->local_alloc_bits to be shrunk when
needed. The default value is recorded in a mostly read-only variable so that
we can re-initialize when required.

Locking had to be updated so that we could protect changes to
local_alloc_bits. Mostly this involves protecting various local alloc values
with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
is used when the local allocator has shrunk, but is not disabled. If the
available space dips below 1 megabyte, the local alloc file is disabled. In
either case, local alloc is re-enabled 30 seconds after the event, or when
an appropriate amount of bits is seen in the primary bitmap.
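
The state transitions can be sketched as below. This is a simplified model: halving the window on each failure is an assumption made for illustration, locking and the delayed work are left out, and all demo_* names are hypothetical.

```c
#include <assert.h>

enum demo_la_state_id {
	DEMO_LA_DISABLED,
	DEMO_LA_ENABLED,
	DEMO_LA_THROTTLED
};

struct demo_la {
	enum demo_la_state_id state;
	unsigned int bits;		/* current window size, in clusters */
	unsigned int default_bits;	/* window size chosen at mount */
	unsigned int min_bits;		/* clusters in one megabyte */
};

/* Called when the main bitmap cannot supply a window of la->bits. */
static void demo_la_alloc_failed(struct demo_la *la)
{
	assert(la != 0);
	if (la->bits > la->min_bits) {
		la->bits /= 2;		/* assumed shrink policy */
		if (la->bits < la->min_bits)
			la->bits = la->min_bits;
		la->state = DEMO_LA_THROTTLED;
	} else {
		la->state = DEMO_LA_DISABLED;
	}
}

/* Called by the delayed re-enable, or when enough free bits are seen. */
static void demo_la_reenable(struct demo_la *la)
{
	assert(la != 0);
	la->bits = la->default_bits;
	la->state = DEMO_LA_ENABLED;
}

/* A throttled allocator still counts as enabled. */
static int demo_la_state_enabled(const struct demo_la *la)
{
	return la->state == DEMO_LA_THROTTLED ||
	       la->state == DEMO_LA_ENABLED;
}
```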

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/localalloc.c |  198 ++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ocfs2/localalloc.h |    4 +
 fs/ocfs2/ocfs2.h      |   23 +++++-
 fs/ocfs2/suballoc.c   |   31 ++++----
 fs/ocfs2/suballoc.h   |    1 +
 fs/ocfs2/super.c      |    4 +-
 6 files changed, 230 insertions(+), 31 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index b05ce66..f71658a 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -73,16 +73,51 @@ static int ocfs2_local_alloc_new_window(struct ocfs2_super *osb,
 static int ocfs2_local_alloc_slide_window(struct ocfs2_super *osb,
 					  struct inode *local_alloc_inode);
 
+static inline int ocfs2_la_state_enabled(struct ocfs2_super *osb)
+{
+	return (osb->local_alloc_state == OCFS2_LA_THROTTLED ||
+		osb->local_alloc_state == OCFS2_LA_ENABLED);
+}
+
+void ...
From: Andrew Morton
Date: Wednesday, October 1, 2008 - 11:11 pm

cancel_delayed_work() is a pretty risky function.  The work handler
(ocfs2_la_enable_worker) can execute an arbitrarily long time after
cancel_delayed_work() has returned.  Can all the code here cope with such a
surprise alteration of ->local_alloc_state()?

And you cannot use cancel_delayed_work_sync() here due to a deadlock on
->osb_lock().

--

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

Provide an optional extent_tree_operation to specify the
max_leaf_clusters of an ocfs2_extent_tree.  If not provided, the value
is 0 (unlimited).
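
The "optional operation with a default" pattern looks roughly like this in isolation. The demo_* types are hypothetical and the limit of 16 clusters is an arbitrary illustration.

```c
#include <assert.h>
#include <stddef.h>

struct demo_tree;

struct demo_tree_ops {
	/* Optional: leave NULL for tree types with no leaf size limit. */
	void (*fill_max_leaf_clusters)(struct demo_tree *et);
};

struct demo_tree {
	const struct demo_tree_ops *ops;
	unsigned int max_leaf_clusters;	/* 0 means unlimited */
};

static void demo_xattr_fill_max_leaf_clusters(struct demo_tree *et)
{
	et->max_leaf_clusters = 16;	/* hypothetical limit */
}

static const struct demo_tree_ops demo_dinode_ops = {
	.fill_max_leaf_clusters = NULL,
};

static const struct demo_tree_ops demo_xattr_ops = {
	.fill_max_leaf_clusters = demo_xattr_fill_max_leaf_clusters,
};

/* Initialise a tree; a missing op leaves the limit at 0. */
static void demo_tree_init(struct demo_tree *et,
			   const struct demo_tree_ops *ops)
{
	assert(et != NULL && ops != NULL);
	et->ops = ops;
	if (ops->fill_max_leaf_clusters)
		ops->fill_max_leaf_clusters(et);
	else
		et->max_leaf_clusters = 0;
}
```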

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 0b900f6..7c0721d 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -76,6 +76,8 @@ struct ocfs2_extent_tree_operations {
 	/* These are internal to ocfs2_extent_tree and don't have
 	 * accessor functions */
 	void (*eo_fill_root_el)(struct ocfs2_extent_tree *et);
+	void (*eo_fill_max_leaf_clusters)(struct inode *inode,
+					  struct ocfs2_extent_tree *et);
 };
 
 struct ocfs2_extent_tree {
@@ -205,6 +207,14 @@ static void ocfs2_xattr_tree_fill_root_el(struct ocfs2_extent_tree *et)
 	et->et_root_el = &xb->xb_attrs.xb_root.xt_list;
 }
 
+static void ocfs2_xattr_tree_fill_max_leaf_clusters(struct inode *inode,
+						    struct ocfs2_extent_tree *et)
+{
+	et->et_max_leaf_clusters =
+		ocfs2_clusters_for_bytes(inode->i_sb,
+					 OCFS2_MAX_XATTR_TREE_LEAF_SIZE);
+}
+
 static void ocfs2_xattr_tree_set_last_eb_blk(struct ocfs2_extent_tree *et,
 					     u64 blkno)
 {
@@ -243,6 +253,7 @@ static struct ocfs2_extent_tree_operations ocfs2_xattr_tree_et_ops = {
 	.eo_update_clusters	= ocfs2_xattr_tree_update_clusters,
 	.eo_sanity_check	= ocfs2_xattr_tree_sanity_check,
 	.eo_fill_root_el	= ocfs2_xattr_tree_fill_root_el,
+	.eo_fill_max_leaf_clusters = ocfs2_xattr_tree_fill_max_leaf_clusters,
 };
 
 static void ocfs2_get_extent_tree(struct ocfs2_extent_tree *et,
@@ -254,7 +265,6 @@ static void ocfs2_get_extent_tree(struct ocfs2_extent_tree *et,
 	et->et_type = et_type;
 	get_bh(bh);
 	et->et_root_bh = bh;
-	et->et_max_leaf_clusters = 0;
 	if (!obj)
 		obj = (void *)bh->b_data;
 	et->et_object = obj;
@@ -265,11 +275,13 @@ ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

A couple places check an extent_tree for a valid inode.  We move that
out to add an eo_insert_check() operation.  It can be called from
ocfs2_insert_extent() and elsewhere.

We also have the wrapper calls ocfs2_et_insert_check() and
ocfs2_et_sanity_check() ignore NULL ops.  That way we don't have to
provide useless operations for xattr types.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   69 ++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 44 insertions(+), 25 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 243bacf..2083c2c 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -71,6 +71,9 @@ struct ocfs2_extent_tree_operations {
 	void (*eo_update_clusters)(struct inode *inode,
 				   struct ocfs2_extent_tree *et,
 				   u32 new_clusters);
+	int (*eo_insert_check)(struct inode *inode,
+			       struct ocfs2_extent_tree *et,
+			       struct ocfs2_extent_rec *rec);
 	int (*eo_sanity_check)(struct inode *inode, struct ocfs2_extent_tree *et);
 
 	/* These are internal to ocfs2_extent_tree and don't have
@@ -125,6 +128,25 @@ static void ocfs2_dinode_update_clusters(struct inode *inode,
 	spin_unlock(&OCFS2_I(inode)->ip_lock);
 }
 
+static int ocfs2_dinode_insert_check(struct inode *inode,
+				     struct ocfs2_extent_tree *et,
+				     struct ocfs2_extent_rec *rec)
+{
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+
+	BUG_ON(OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL);
+	mlog_bug_on_msg(!ocfs2_sparse_alloc(osb) &&
+			(OCFS2_I(inode)->ip_clusters != rec->e_cpos),
+			"Device %s, asking for sparse allocation: inode %llu, "
+			"cpos %u, clusters %u\n",
+			osb->dev_str,
+			(unsigned long long)OCFS2_I(inode)->ip_blkno,
+			rec->e_cpos,
+			OCFS2_I(inode)->ip_clusters);
+
+	return 0;
+}
+
 static int ocfs2_dinode_sanity_check(struct inode *inode,
 				     struct ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

We now have three different kinds of extent trees in ocfs2: inode data
(dinode), extended attributes (xattr_tree), and extended attribute
values (xattr_value).  There is a nice abstraction for them,
ocfs2_extent_tree, but it is hidden in alloc.c.  All the calling
functions have to pick amongst a varied API and pass in type bits and
often extraneous pointers.

A better way is to make ocfs2_extent_tree a first-class object.
Everyone converts their object to an ocfs2_extent_tree() via the
ocfs2_get_*_extent_tree() calls, then uses the ocfs2_extent_tree for all
tree calls to alloc.c.

This simplifies a lot of callers, improving readability.  It also
provides an easy way to add additional extent tree types, as they only
need to be defined in alloc.c with an ocfs2_get_<new>_extent_tree()
function.
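
The calling convention described above can be sketched as follows: each caller wraps its own object with a type-specific "get" function, and generic code then only sees the tree handle. All demo_* names are hypothetical simplifications of the ocfs2 structures.

```c
#include <assert.h>

struct demo_et;

struct demo_et_ops {
	void (*update_clusters)(struct demo_et *et, unsigned int added);
};

struct demo_et {
	const struct demo_et_ops *ops;
	void *object;	/* the structure that embeds the tree root */
};

struct demo_dinode { unsigned int i_clusters; };
struct demo_xattr_value { unsigned int xr_clusters; };

static void demo_dinode_update_clusters(struct demo_et *et, unsigned int added)
{
	struct demo_dinode *di = et->object;

	di->i_clusters += added;
}

static void demo_xattr_value_update_clusters(struct demo_et *et,
					     unsigned int added)
{
	struct demo_xattr_value *xv = et->object;

	xv->xr_clusters += added;
}

static const struct demo_et_ops demo_dinode_et_ops = {
	.update_clusters = demo_dinode_update_clusters,
};

static const struct demo_et_ops demo_xattr_value_et_ops = {
	.update_clusters = demo_xattr_value_update_clusters,
};

/* Type-specific constructors: the only place the type is named. */
static void demo_get_dinode_extent_tree(struct demo_et *et,
					struct demo_dinode *di)
{
	et->ops = &demo_dinode_et_ops;
	et->object = di;
}

static void demo_get_xattr_value_extent_tree(struct demo_et *et,
					     struct demo_xattr_value *xv)
{
	et->ops = &demo_xattr_value_et_ops;
	et->object = xv;
}

/* Generic tree code works on the handle, whatever it wraps. */
static void demo_et_update_clusters(struct demo_et *et, unsigned int added)
{
	assert(et->ops && et->ops->update_clusters);
	et->ops->update_clusters(et, added);
}
```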

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c    |  300 +++++++++++++++-----------------------------------
 fs/ocfs2/alloc.h    |  111 +++++++++++---------
 fs/ocfs2/aops.c     |   16 ++-
 fs/ocfs2/dir.c      |   20 ++--
 fs/ocfs2/file.c     |   36 ++++---
 fs/ocfs2/suballoc.c |   12 +--
 fs/ocfs2/suballoc.h |    6 +-
 fs/ocfs2/xattr.c    |   71 +++++++------
 8 files changed, 240 insertions(+), 332 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 2083c2c..d196d40 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -49,20 +49,6 @@
 
 #include "buffer_head_io.h"
 
-/*
- * ocfs2_extent_tree and ocfs2_extent_tree_operations are used to abstract
- * the b-tree operations in ocfs2. Now all the b-tree operations are not
- * limited to ocfs2_dinode only. Any data which need to allocate clusters
- * to store can use b-tree. And it only needs to implement its ocfs2_extent_tree
- * and operation.
- *
- * ocfs2_extent_tree contains info for the root of the b-tree, it must have a
- * root ocfs2_extent_list and a root_bh so that they can be used ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

struct ocfs2_extent_tree_operations provides methods for the different
on-disk btrees in ocfs2.  Describing what those methods do is probably a
good idea.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   45 +++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index d196d40..51c3183 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -50,21 +50,62 @@
 #include "buffer_head_io.h"
 
 
+/*
+ * Operations for a specific extent tree type.
+ *
+ * To implement an on-disk btree (extent tree) type in ocfs2, add
+ * an ocfs2_extent_tree_operations structure and the matching
+ * ocfs2_get_<thingy>_extent_tree() function.  That's pretty much it
+ * for the allocation portion of the extent tree.
+ */
 struct ocfs2_extent_tree_operations {
+	/*
+	 * last_eb_blk is the block number of the right most leaf extent
+	 * block.  Most on-disk structures containing an extent tree store
+	 * this value for fast access.  The ->eo_set_last_eb_blk() and
+	 * ->eo_get_last_eb_blk() operations access this value.  They are
+	 *  both required.
+	 */
 	void (*eo_set_last_eb_blk)(struct ocfs2_extent_tree *et,
 				   u64 blkno);
 	u64 (*eo_get_last_eb_blk)(struct ocfs2_extent_tree *et);
+
+	/*
+	 * The on-disk structure usually keeps track of how many total
+	 * clusters are stored in this extent tree.  This function updates
+	 * that value.  new_clusters is the delta, and must be
+	 * added to the total.  Required.
+	 */
 	void (*eo_update_clusters)(struct inode *inode,
 				   struct ocfs2_extent_tree *et,
 				   u32 new_clusters);
+
+	/*
+	 * If ->eo_insert_check() exists, it is called before rec is
+	 * inserted into the extent tree.  It is optional.
+	 */
 	int (*eo_insert_check)(struct inode *inode,
 			       struct ocfs2_extent_tree *et,
 			       struct ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

A caller knows what kind of extent tree they have.  There's no reason
they have to call ocfs2_get_extent_tree() with a NULL when they could
just as easily call a specific function to their type of extent tree.

Introduce ocfs2_dinode_get_extent_tree(),
ocfs2_xattr_tree_get_extent_tree(), and
ocfs2_xattr_value_get_extent_tree().  They only take the necessary
arguments, calling into the underlying __ocfs2_get_extent_tree() to do
the real work.

__ocfs2_get_extent_tree() is the old ocfs2_get_extent_tree(), but
without needing any switch-by-type logic.

ocfs2_get_extent_tree() is now a wrapper around the specific calls.  It
exists because a couple of alloc.c functions can take et_type.  This will
go away later.

Another benefit is that ocfs2_xattr_value_get_extent_tree() can take a
struct ocfs2_xattr_value_root* instead of void*.  This gives us
typechecking where we didn't have it before.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   76 +++++++++++++++++++++++++++++++++++++++---------------
 fs/ocfs2/alloc.h |    2 +-
 2 files changed, 56 insertions(+), 22 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 7c0721d..243bacf 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -192,7 +192,7 @@ static int ocfs2_xattr_value_sanity_check(struct inode *inode,
 	return 0;
 }
 
-static struct ocfs2_extent_tree_operations ocfs2_xattr_et_ops = {
+static struct ocfs2_extent_tree_operations ocfs2_xattr_value_et_ops = {
 	.eo_set_last_eb_blk	= ocfs2_xattr_value_set_last_eb_blk,
 	.eo_get_last_eb_blk	= ocfs2_xattr_value_get_last_eb_blk,
 	.eo_update_clusters	= ocfs2_xattr_value_update_clusters,
@@ -256,27 +256,21 @@ static struct ocfs2_extent_tree_operations ocfs2_xattr_tree_et_ops = {
 	.eo_fill_max_leaf_clusters = ocfs2_xattr_tree_fill_max_leaf_clusters,
 };
 
-static void ocfs2_get_extent_tree(struct ocfs2_extent_tree *et,
-				  struct inode ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

The root_el of an ocfs2_extent_tree needs to be calculated from
et->et_object.  Make it an operation on et->et_ops.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   38 ++++++++++++++++++++++++++++++--------
 1 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 93f44f4..fb6ae67 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -72,6 +72,10 @@ struct ocfs2_extent_tree_operations {
 				   struct ocfs2_extent_tree *et,
 				   u32 new_clusters);
 	int (*eo_sanity_check)(struct inode *inode, struct ocfs2_extent_tree *et);
+
+	/* These are internal to ocfs2_extent_tree and don't have
+	 * accessor functions */
+	void (*eo_fill_root_el)(struct ocfs2_extent_tree *et);
 };
 
 struct ocfs2_extent_tree {
@@ -83,6 +87,13 @@ struct ocfs2_extent_tree {
 	unsigned int				et_max_leaf_clusters;
 };
 
+static void ocfs2_dinode_fill_root_el(struct ocfs2_extent_tree *et)
+{
+	struct ocfs2_dinode *di = et->et_object;
+
+	et->et_root_el = &di->id2.i_list;
+}
+
 static void ocfs2_dinode_set_last_eb_blk(struct ocfs2_extent_tree *et,
 					 u64 blkno)
 {
@@ -136,8 +147,16 @@ static struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = {
 	.eo_get_last_eb_blk	= ocfs2_dinode_get_last_eb_blk,
 	.eo_update_clusters	= ocfs2_dinode_update_clusters,
 	.eo_sanity_check	= ocfs2_dinode_sanity_check,
+	.eo_fill_root_el	= ocfs2_dinode_fill_root_el,
 };
 
+static void ocfs2_xattr_value_fill_root_el(struct ocfs2_extent_tree *et)
+{
+	struct ocfs2_xattr_value_root *xv = et->et_object;
+
+	et->et_root_el = &xv->xr_list;
+}
+
 static void ocfs2_xattr_value_set_last_eb_blk(struct ocfs2_extent_tree *et,
 					      u64 blkno)
 {
@@ -176,8 +195,16 @@ static struct ocfs2_extent_tree_operations ocfs2_xattr_et_ops = {
 	.eo_get_last_eb_blk	= ocfs2_xattr_value_get_last_eb_blk,
 	.eo_update_clusters	= ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
limiting our maximum filesystem size.

It's a pretty trivial change.  Most functions are just renamed.  The
only functional change is moving to Jan's inode-based ordered data mode.
It's better, too.

Because JBD2 reads and writes JBD journals, this is compatible with any
existing filesystem.  It can even interact with JBD-based ocfs2 as long
as the journal is formatted for JBD.

We provide a compatibility option so that paranoid people can still use
JBD for the time being.  This will go away shortly.

[ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
  ocfs2_truncate_for_delete(). --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/Kconfig                  |   40 +++++++++++++--------
 fs/ocfs2/alloc.c            |   28 ++++++---------
 fs/ocfs2/aops.c             |   21 ++++++++---
 fs/ocfs2/file.c             |   14 +++++--
 fs/ocfs2/inode.c            |    5 +++
 fs/ocfs2/inode.h            |    1 +
 fs/ocfs2/journal.c          |   72 ++++++++++++++++++++------------------
 fs/ocfs2/journal.h          |   25 +++++++++++--
 fs/ocfs2/ocfs2.h            |    7 +++-
 fs/ocfs2/ocfs2_jbd_compat.h |   82 +++++++++++++++++++++++++++++++++++++++++++
 fs/ocfs2/super.c            |   10 +++--
 fs/ocfs2/uptodate.c         |    6 +++-
 12 files changed, 227 insertions(+), 84 deletions(-)
 create mode 100644 fs/ocfs2/ocfs2_jbd_compat.h

diff --git a/fs/Kconfig b/fs/Kconfig
index abccb5d..e651a36 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -206,17 +206,16 @@ config JBD
 	tristate
 	help
 	  This is a generic journalling layer for block devices.  It is
-	  currently used by the ext3 and OCFS2 file systems, but it could
-	  also be used to add journal support to other file systems or block
+	  currently used by the ext3 file system, but it could also be
+	  used to add ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

ocfs2_num_free_extents() re-implements the logic of
ocfs2_get_extent_tree().  Now that ocfs2_get_extent_tree() does not
allocate, let's use it in ocfs2_num_free_extents() to simplify the code.

The inode validation code in ocfs2_num_free_extents() is not needed.
All callers are passing in pre-validated inodes.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   30 +++++-------------------------
 1 files changed, 5 insertions(+), 25 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index fb6ae67..0b900f6 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -618,34 +618,13 @@ int ocfs2_num_free_extents(struct ocfs2_super *osb,
 	struct ocfs2_extent_block *eb;
 	struct buffer_head *eb_bh = NULL;
 	u64 last_eb_blk = 0;
+	struct ocfs2_extent_tree et;
 
 	mlog_entry_void();
 
-	if (type == OCFS2_DINODE_EXTENT) {
-		struct ocfs2_dinode *fe =
-				(struct ocfs2_dinode *)root_bh->b_data;
-		if (!OCFS2_IS_VALID_DINODE(fe)) {
-			OCFS2_RO_ON_INVALID_DINODE(inode->i_sb, fe);
-			retval = -EIO;
-			goto bail;
-		}
-
-		if (fe->i_last_eb_blk)
-			last_eb_blk = le64_to_cpu(fe->i_last_eb_blk);
-		el = &fe->id2.i_list;
-	} else if (type == OCFS2_XATTR_VALUE_EXTENT) {
-		struct ocfs2_xattr_value_root *xv =
-			(struct ocfs2_xattr_value_root *) obj;
-
-		last_eb_blk = le64_to_cpu(xv->xr_last_eb_blk);
-		el = &xv->xr_list;
-	} else if (type == OCFS2_XATTR_TREE_EXTENT) {
-		struct ocfs2_xattr_block *xb =
-			(struct ocfs2_xattr_block *)root_bh->b_data;
-
-		last_eb_blk = le64_to_cpu(xb->xb_attrs.xb_root.xt_last_eb_blk);
-		el = &xb->xb_attrs.xb_root.xt_list;
-	}
+	ocfs2_get_extent_tree(&et, inode, root_bh, type, obj);
+	el = et.et_root_el;
+	last_eb_blk = ocfs2_et_get_last_eb_blk(&et);
 
 	if (last_eb_blk) {
 		retval = ocfs2_read_block(osb, last_eb_blk,
@@ -665,6 +644,7 @@ bail:
 	if (eb_bh)
 		brelse(eb_bh);
 ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Sunil Mushran <sunil.mushran@oracle.com>

This patch adds a check for [no]user_xattr to ocfs2_show_options(),
completing the list of mount options it reports.
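
For readers unfamiliar with show_options-style code, here is a small userspace
sketch of the idea: turn a flag word into the comma-separated option string.
It uses snprintf() rather than the kernel's seq_file interface, and the
TOY_MOUNT_* names and toy_show_options() are hypothetical; the bit positions
simply mirror the enum values shown elsewhere in this series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag names; bit positions mirror the ocfs2 enum. */
enum {
	TOY_MOUNT_NOUSERXATTR = 1 << 6,	/* No user xattr */
	TOY_MOUNT_INODE64     = 1 << 7,	/* Allow inode numbers > 2^32 */
};

/*
 * Append the xattr and inode64 options to buf.  Exactly one of
 * ",nouser_xattr" or ",user_xattr" is always emitted, so the output
 * reflects the effective setting even when it is the default.
 */
static void toy_show_options(unsigned long opts, char *buf, size_t len)
{
	int n;

	assert(buf != NULL);

	n = snprintf(buf, len, "%s",
		     (opts & TOY_MOUNT_NOUSERXATTR) ? ",nouser_xattr"
						    : ",user_xattr");

	/* Only append if the first write fit entirely in the buffer. */
	if ((opts & TOY_MOUNT_INODE64) && n >= 0 && (size_t)n < len)
		snprintf(buf + n, len - (size_t)n, ",inode64");
}
```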

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/super.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 39f6238..6b4b86e 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1010,6 +1010,11 @@ static int ocfs2_show_options(struct seq_file *s, struct vfsmount *mnt)
 		seq_printf(s, ",cluster_stack=%.*s", OCFS2_STACK_LABEL_LEN,
 			   osb->osb_cluster_stack);
 
+	if (opts & OCFS2_MOUNT_NOUSERXATTR)
+		seq_printf(s, ",nouser_xattr");
+	else
+		seq_printf(s, ",user_xattr");
+
 	if (opts & OCFS2_MOUNT_INODE64)
 		seq_printf(s, ",inode64");
 
-- 
1.5.4.5

--

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

ocfs2 inode numbers are block numbers.  For any filesystem with less
than 2^32 blocks, this is not a problem.  However, when ocfs2 starts
using JBD2, it will be able to support filesystems with more than 2^32
blocks.  This would result in inode numbers higher than 2^32.

The problem is that stat(2) can't handle those numbers on 32bit
machines.  The simple solution is to have ocfs2 allocate all inodes
below that boundary.

The suballoc code is changed to honor an optional block limit.  Only the
inode suballocator sets that limit - all other allocations stay unlimited.

The biggest trick is to grow the inode suballocator beneath that limit.
There's no point in allocating block groups that are above the limit,
then rejecting their elements later on.  We want to prevent the inode
allocator from ever having block groups above the limit.  This involves
a little gyration with the local alloc code.  If the local alloc window
is above the limit, it signals the caller to try the global bitmap but
does not disable the local alloc file (which can be used for other
allocations).

[ Minor cleanup - removed an ML_NOTICE comment. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/localalloc.c |   55 ++++++++++++++++++++++++++++++++
 fs/ocfs2/suballoc.c   |   83 ++++++++++++++++++++++++++++++++++++++++---------
 fs/ocfs2/suballoc.h   |   11 ++++--
 3 files changed, 130 insertions(+), 19 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index b889f10..02227c3 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -570,6 +570,46 @@ out:
 	return status;
 }
 
+/* Check to see if the local alloc window is within ac->ac_max_block */
+static int ocfs2_local_alloc_in_range(struct inode *inode,
+				      struct ocfs2_alloc_context *ac,
+				      u32 bits_wanted)
+{
+	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	struct ocfs2_dinode ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

Now that ocfs2 limits inode numbers to 32bits, add a mount option to
disable the limit.  This parallels XFS.  64bit systems can handle the
larger inode numbers.

[ Added description of inode64 mount option in ocfs2.txt. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 Documentation/filesystems/ocfs2.txt |    4 ++++
 fs/ocfs2/ocfs2.h                    |    1 +
 fs/ocfs2/suballoc.c                 |    5 +++--
 fs/ocfs2/super.c                    |   17 +++++++++++++++++
 4 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
index c318a8b..6acf1b4 100644
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.txt
@@ -76,3 +76,7 @@ localalloc=8(*)		Allows custom localalloc size in MB. If the value is too
 			large, the fs will silently revert it to the default.
 			Localalloc is not enabled for local mounts.
 localflocks		This disables cluster aware flock.
+inode64			Indicates that Ocfs2 is allowed to create inodes at
+			any location in the filesystem, including those which
+			will result in inode numbers occupying more than 32
+			bits of significance.
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 6d3c10d..78ae4f8 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -189,6 +189,7 @@ enum ocfs2_mount_options
 	OCFS2_MOUNT_DATA_WRITEBACK = 1 << 4, /* No data ordering */
 	OCFS2_MOUNT_LOCALFLOCKS = 1 << 5, /* No cluster aware user file locks */
 	OCFS2_MOUNT_NOUSERXATTR = 1 << 6, /* No user xattr */
+	OCFS2_MOUNT_INODE64 = 1 << 7,	/* Allow inode numbers > 2^32 */
 };
 
 #define OCFS2_OSB_SOFT_RO	0x0001
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 213bdca..d7a6f92 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -601,9 +601,10 @@ int ocfs2_reserve_new_inode(struct ocfs2_super *osb,
 	/*
 	 * stat(2) can't handle ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Tao Ma <tao.ma@oracle.com>

In ocfs2_xattr_free_block, we take a cluster lock on xb_alloc_inode while we
have a transaction open. This will deadlock the downconvert thread, so fix
it.

We can clean up how xattr blocks are removed while here - this patch also
moves the mechanism of releasing an xattr block (including the values, the
xattr tree, and the xattr block itself) into this function.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/xattr.c |  152 +++++++++++++++++++++++++++++-------------------------
 1 files changed, 82 insertions(+), 70 deletions(-)

diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 38e3e5e..b2e25a8 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -1427,51 +1427,6 @@ out:
 
 }
 
-static int ocfs2_xattr_free_block(handle_t *handle,
-				  struct ocfs2_super *osb,
-				  struct ocfs2_xattr_block *xb)
-{
-	struct inode *xb_alloc_inode;
-	struct buffer_head *xb_alloc_bh = NULL;
-	u64 blk = le64_to_cpu(xb->xb_blkno);
-	u16 bit = le16_to_cpu(xb->xb_suballoc_bit);
-	u64 bg_blkno = ocfs2_which_suballoc_group(blk, bit);
-	int ret = 0;
-
-	xb_alloc_inode = ocfs2_get_system_file_inode(osb,
-				EXTENT_ALLOC_SYSTEM_INODE,
-				le16_to_cpu(xb->xb_suballoc_slot));
-	if (!xb_alloc_inode) {
-		ret = -ENOMEM;
-		mlog_errno(ret);
-		goto out;
-	}
-	mutex_lock(&xb_alloc_inode->i_mutex);
-
-	ret = ocfs2_inode_lock(xb_alloc_inode, &xb_alloc_bh, 1);
-	if (ret < 0) {
-		mlog_errno(ret);
-		goto out_mutex;
-	}
-	ret = ocfs2_extend_trans(handle, OCFS2_SUBALLOC_FREE);
-	if (ret < 0) {
-		mlog_errno(ret);
-		goto out_unlock;
-	}
-	ret = ocfs2_free_suballoc_bits(handle, xb_alloc_inode, xb_alloc_bh,
-				       bit, bg_blkno, 1);
-	if (ret < 0)
-		mlog_errno(ret);
-out_unlock:
-	ocfs2_inode_unlock(xb_alloc_inode, 1);
-	brelse(xb_alloc_bh);
-out_mutex:
-	mutex_unlock(&xb_alloc_inode->i_mutex);
-	iput(xb_alloc_inode);
-out:
-	return ret;
-}
-
 static int ocfs2_remove_value_outside(struct ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Tao Ma <tao.ma@oracle.com>

In ocfs2_extend_trans(), when we can't extend the current
transaction, it commits the current transaction and restarts
a new one. So if the credits we allocated earlier haven't been
used (the block wasn't dirtied before our extend), we will not
have enough credits for any future operation (which causes jbd
to complain and bug out). So check for this and re-extend.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/xattr.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 1a4de3d..38e3e5e 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -1336,8 +1336,9 @@ static int ocfs2_xattr_set_entry(struct inode *inode,
 	}
 
 	if (!(flag & OCFS2_INLINE_XATTR_FL)) {
-		/*set extended attribue in external blcok*/
+		/* set extended attribute in external block. */
 		ret = ocfs2_extend_trans(handle,
+					 OCFS2_INODE_UPDATE_CREDITS +
 					 OCFS2_XATTR_BLOCK_UPDATE_CREDITS);
 		if (ret) {
 			mlog_errno(ret);
@@ -3701,6 +3702,18 @@ static int ocfs2_add_new_xattr_cluster(struct inode *inode,
 		}
 	}
 
+	if (handle->h_buffer_credits < credits) {
+		/*
+		 * The journal has been restarted before, and don't
+		 * have enough space for the insertion, so extend it
+		 * here.
+		 */
+		ret = ocfs2_extend_trans(handle, credits);
+		if (ret) {
+			mlog_errno(ret);
+			goto leave;
+		}
+	}
 	mlog(0, "Insert %u clusters at block %llu for xattr at %u\n",
 	     num_bits, block, v_start);
 	ret = ocfs2_insert_extent(osb, handle, inode, &et, v_start, block,
-- 
1.5.4.5

--

From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

The original get/put_extent_tree() functions held a reference on
et_root_bh.  However, every single caller already has a safe reference,
making the get/put cycle irrelevant.

We change ocfs2_get_*_extent_tree() to ocfs2_init_*_extent_tree().  It
no longer gets a reference on et_root_bh.  ocfs2_put_extent_tree() is
removed.  Callers now have a simpler init+use pattern.
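
A minimal sketch of the resulting init+use pattern, assuming toy stand-ins for
the buffer head and tree (struct toy_bh, struct toy_extent_tree, and
toy_init_extent_tree() are all hypothetical).  The point is only that init
borrows the caller's reference instead of taking its own, so there is nothing
to put afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer with a reference count, loosely like a buffer_head. */
struct toy_bh {
	int b_count;
	char b_data[16];
};

struct toy_extent_tree {
	struct toy_bh *et_root_bh;
	void *et_object;
};

/*
 * Initialise et on the caller's stack.  The caller already holds a
 * reference on bh for at least as long as et is in use, so b_count is
 * left alone.  A NULL obj defaults to the start of the buffer data.
 */
static void toy_init_extent_tree(struct toy_extent_tree *et,
				 struct toy_bh *bh, void *obj)
{
	assert(et != NULL && bh != NULL);

	et->et_root_bh = bh;
	et->et_object = obj ? obj : bh->b_data;
}
```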

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   49 +++++++++++++++++++++----------------------------
 fs/ocfs2/alloc.h |   26 ++++++++++++--------------
 fs/ocfs2/aops.c  |    6 ++----
 fs/ocfs2/dir.c   |    6 ++----
 fs/ocfs2/file.c  |   10 +++-------
 fs/ocfs2/xattr.c |   14 ++++----------
 6 files changed, 44 insertions(+), 67 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 51c3183..5f44ef8 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -55,7 +55,7 @@
  *
  * To implement an on-disk btree (extent tree) type in ocfs2, add
  * an ocfs2_extent_tree_operations structure and the matching
- * ocfs2_get_<thingy>_extent_tree() function.  That's pretty much it
+ * ocfs2_init_<thingy>_extent_tree() function.  That's pretty much it
  * for the allocation portion of the extent tree.
  */
 struct ocfs2_extent_tree_operations {
@@ -301,14 +301,13 @@ static struct ocfs2_extent_tree_operations ocfs2_xattr_tree_et_ops = {
 	.eo_fill_max_leaf_clusters = ocfs2_xattr_tree_fill_max_leaf_clusters,
 };
 
-static void __ocfs2_get_extent_tree(struct ocfs2_extent_tree *et,
-				    struct inode *inode,
-				    struct buffer_head *bh,
-				    void *obj,
-				    struct ocfs2_extent_tree_operations *ops)
+static void __ocfs2_init_extent_tree(struct ocfs2_extent_tree *et,
+				     struct inode *inode,
+				     struct buffer_head *bh,
+				     void *obj,
+				     struct ocfs2_extent_tree_operations *ops)
 {
 	et->et_ops = ops;
-	get_bh(bh);
 	et->et_root_bh = bh;
 ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

The 'private' pointer was a way to store off xattr values, which don't
live at a set place in the bh.  But the concept of "the object
containing the extent tree" is much more generic.  For an inode it's the
struct ocfs2_dinode, for an xattr value it's the value.  Let's save off
the 'object' at all times.  If NULL is passed to
ocfs2_get_extent_tree(), 'object' is set to bh->b_data.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |   62 +++++++++++++++++++++++++----------------------------
 1 files changed, 29 insertions(+), 33 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 0abf11e..93f44f4 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -79,15 +79,14 @@ struct ocfs2_extent_tree {
 	struct ocfs2_extent_tree_operations	*et_ops;
 	struct buffer_head			*et_root_bh;
 	struct ocfs2_extent_list		*et_root_el;
-	void					*et_private;
+	void					*et_object;
 	unsigned int				et_max_leaf_clusters;
 };
 
 static void ocfs2_dinode_set_last_eb_blk(struct ocfs2_extent_tree *et,
 					 u64 blkno)
 {
-	struct ocfs2_dinode *di =
-		(struct ocfs2_dinode *)et->et_root_bh->b_data;
+	struct ocfs2_dinode *di = et->et_object;
 
 	BUG_ON(et->et_type != OCFS2_DINODE_EXTENT);
 	di->i_last_eb_blk = cpu_to_le64(blkno);
@@ -95,8 +94,7 @@ static void ocfs2_dinode_set_last_eb_blk(struct ocfs2_extent_tree *et,
 
 static u64 ocfs2_dinode_get_last_eb_blk(struct ocfs2_extent_tree *et)
 {
-	struct ocfs2_dinode *di =
-		(struct ocfs2_dinode *)et->et_root_bh->b_data;
+	struct ocfs2_dinode *di = et->et_object;
 
 	BUG_ON(et->et_type != OCFS2_DINODE_EXTENT);
 	return le64_to_cpu(di->i_last_eb_blk);
@@ -106,8 +104,7 @@ static void ocfs2_dinode_update_clusters(struct inode *inode,
 					 struct ocfs2_extent_tree *et,
 					 u32 clusters)
 {
-	struct ocfs2_dinode *di =
-			(struct ocfs2_dinode *)et->et_root_bh->b_data;
+	struct ocfs2_dinode *di = ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

The members of the ocfs2_extent_tree structure gain a prefix of 'et_'.
All users are updated.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |  118 ++++++++++++++++++++++++++++--------------------------
 1 files changed, 61 insertions(+), 57 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 9fe49f2..ab16b89 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -75,28 +75,30 @@ struct ocfs2_extent_tree_operations {
 };
 
 struct ocfs2_extent_tree {
-	enum ocfs2_extent_tree_type type;
-	struct ocfs2_extent_tree_operations *eops;
-	struct buffer_head *root_bh;
-	struct ocfs2_extent_list *root_el;
-	void *private;
-	unsigned int max_leaf_clusters;
+	enum ocfs2_extent_tree_type		et_type;
+	struct ocfs2_extent_tree_operations	*et_ops;
+	struct buffer_head			*et_root_bh;
+	struct ocfs2_extent_list		*et_root_el;
+	void					*et_private;
+	unsigned int				et_max_leaf_clusters;
 };
 
 static void ocfs2_dinode_set_last_eb_blk(struct ocfs2_extent_tree *et,
 					 u64 blkno)
 {
-	struct ocfs2_dinode *di = (struct ocfs2_dinode *)et->root_bh->b_data;
+	struct ocfs2_dinode *di =
+		(struct ocfs2_dinode *)et->et_root_bh->b_data;
 
-	BUG_ON(et->type != OCFS2_DINODE_EXTENT);
+	BUG_ON(et->et_type != OCFS2_DINODE_EXTENT);
 	di->i_last_eb_blk = cpu_to_le64(blkno);
 }
 
 static u64 ocfs2_dinode_get_last_eb_blk(struct ocfs2_extent_tree *et)
 {
-	struct ocfs2_dinode *di = (struct ocfs2_dinode *)et->root_bh->b_data;
+	struct ocfs2_dinode *di =
+		(struct ocfs2_dinode *)et->et_root_bh->b_data;
 
-	BUG_ON(et->type != OCFS2_DINODE_EXTENT);
+	BUG_ON(et->et_type != OCFS2_DINODE_EXTENT);
 	return le64_to_cpu(di->i_last_eb_blk);
 }
 
@@ -105,7 +107,7 @@ static void ocfs2_dinode_update_clusters(struct inode *inode,
 					 u32 clusters)
 {
 	struct ocfs2_dinode *di =
-			(struct ocfs2_dinode *)et->root_bh->b_data;
+			(struct ocfs2_dinode ...
From: Mark Fasheh
Date: Wednesday, September 24, 2008 - 3:01 pm

From: Joel Becker <joel.becker@oracle.com>

Rather than allocating a struct ocfs2_extent_tree, just put it on the
stack.  Fill it with ocfs2_get_extent_tree() and drop it with
ocfs2_put_extent_tree().  Now the callers don't have to ENOMEM, yet
still safely ref the root_bh.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
---
 fs/ocfs2/alloc.c |  117 ++++++++++++++++-------------------------------------
 1 files changed, 36 insertions(+), 81 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index ab16b89..0abf11e 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -223,22 +223,17 @@ static struct ocfs2_extent_tree_operations ocfs2_xattr_tree_et_ops = {
 	.eo_sanity_check	= ocfs2_xattr_tree_sanity_check,
 };
 
-static struct ocfs2_extent_tree*
-	 ocfs2_new_extent_tree(struct inode *inode,
-			       struct buffer_head *bh,
-			       enum ocfs2_extent_tree_type et_type,
-			       void *private)
+static void ocfs2_get_extent_tree(struct ocfs2_extent_tree *et,
+				  struct inode *inode,
+				  struct buffer_head *bh,
+				  enum ocfs2_extent_tree_type et_type,
+				  void *private)
 {
-	struct ocfs2_extent_tree *et;
-
-	et = kzalloc(sizeof(*et), GFP_NOFS);
-	if (!et)
-		return NULL;
-
 	et->et_type = et_type;
 	get_bh(bh);
 	et->et_root_bh = bh;
 	et->et_private = private;
+	et->et_max_leaf_clusters = 0;
 
 	if (et_type == OCFS2_DINODE_EXTENT) {
 		et->et_root_el =
@@ -257,16 +252,11 @@ static struct ocfs2_extent_tree*
 		et->et_max_leaf_clusters = ocfs2_clusters_for_bytes(inode->i_sb,
 						OCFS2_MAX_XATTR_TREE_LEAF_SIZE);
 	}
-
-	return et;
 }
 
-static void ocfs2_free_extent_tree(struct ocfs2_extent_tree *et)
+static void ocfs2_put_extent_tree(struct ocfs2_extent_tree *et)
 {
-	if (et) {
-		brelse(et->et_root_bh);
-		kfree(et);
-	}
+	brelse(et->et_root_bh);
 }
 
 static inline void ocfs2_et_set_last_eb_blk(struct ocfs2_extent_tree *et,
@@ -4439,22 +4429,15 @@ int ...
From: Tao Ma
Date: Saturday, September 27, 2008 - 10:16 pm

Hi Mark,
	Did you see my 2 patches for xattr?
http://oss.oracle.com/pipermail/ocfs2-devel/2008-September/002839.html
this is pretty straightforward and I think it can be committed with it.
http://oss.oracle.com/pipermail/ocfs2-devel/2008-September/002839.html
this is the new support for empty bucket.

Regards,
Tao


--
