Re: fallocate() man page - draft 2

From: Amit K. Arora
Date: Friday, July 13, 2007 - 5:38 am

This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley  (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included at the end is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can
   add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group
   alignment in ext4 (Patch 6/6). Please refer to the following post:
http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from "FA_" to "FALLOC_"
6) Added manpage (updated version of the one initially submitted by
   David Chinner).


Todos:
-----
1> Implementation on other architectures (other than i386, x86_64,
   and ppc64). s390(x) and ia64 patches are ready and will be pushed
   by platform maintainers once fallocate is in mainline.
2> A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> Patch to e2fsprogs to recognize and display uninitialized extents.


The following patches follow:
Patch 1/6 : manpage for fallocate
Patch 2/6 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/6 : revalidate write permissions for fallocate
Patch 4/6 : ext4: fallocate support in ext4
Patch 5/6 : ext4: write support for preallocated blocks
Patch 6/6 : ext4: change for better extent-to-group alignment

Note: Attached below is a small testcase to test fallocate. The __NR_fallocate
will need to be changed depending on the system ...
From: Amit K. Arora
Date: Friday, July 13, 2007 - 5:46 am

Following is the modified version of the manpage originally submitted by
David Chinner. Please use `nroff -man fallocate.2 | less` to view.

This includes changes suggested by Heikki Orsila and Barry Naujok.


.TH fallocate 2
.SH NAME
fallocate \- allocate or remove file space
.SH SYNOPSIS
.nf
.B #include <fcntl.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len);
.SH DESCRIPTION
The
.B fallocate
syscall allows a user to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
The
.I mode
parameter determines the operation to be performed on the given range.
Currently there are two modes:
.TP
.B FALLOC_ALLOCATE
allocates and initialises to zero the disk space within the given range.
After a successful call, subsequent writes are guaranteed not to fail because
of lack of disk space.  If the size of the file is less than
.IR offset + len ,
then the file is increased to this size; otherwise the file size is left
unchanged.
.B FALLOC_ALLOCATE
closely resembles
.BR posix_fallocate (3)
and is intended as a method of optimally implementing this function.
.B FALLOC_ALLOCATE
may allocate a larger range than that was specified.
.TP
.B FALLOC_RESV_SPACE
provides the same functionality as
.B FALLOC_ALLOCATE
except it does not ever change the file size. This allows allocation
of zero blocks beyond the end of file and is useful for optimising
append workloads.
.SH RETURN VALUE
.B fallocate
returns zero on success, or an error number on failure.
Note that
.I errno
is not set.
.SH ERRORS
.TP
.B EBADF
.I fd
is not a valid file descriptor, or is not opened for writing.
.TP
.B EFBIG
.IR offset + len
exceeds the maximum file size.
.TP
.B EINVAL
.I offset
was less than 0, or
.I len
was less than or equal to 0.
.TP
.B ENODEV
.I fd
does not refer to a regular file or a directory.
.TP
.B ENOSPC
There is not enough space left on ...
From: David Chinner
Date: Friday, July 13, 2007 - 7:06 am

If fallocate is just being used for allocating space this is wrong.
maybe - "manipulate file space" instead?


"of zeroed blocks"

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Amit K. Arora
Date: Friday, July 13, 2007 - 7:27 am

Ok.

--
Regards,
-

From: Michael Kerrisk
Date: Saturday, July 14, 2007 - 1:23 am

[CC += mtk-manpages@gmx.net]

Amit,

Thanks for this page.  I will endeavour to review it in 
the coming days.  In the meantime, the better address to CC
me on for man pages stuff is mtk-manpages@gmx.net.

Cheers,


-

From: Amit K. Arora
Date: Sunday, July 15, 2007 - 10:32 pm

Sure.

BTW, this man page has changed a bit and the one in TAKE8 of fallocate
patches is the latest one. You are copied on that too.
I will also forward that mail to "mtk-manpages@gmx.net", so that you
do not miss it. Thanks!

--
Regards,
-

From: Michael Kerrisk
Date: Sunday, July 22, 2007 - 11:09 pm

Amit,

I've taken the page that you sent and made various minor formatting and
wording fixes.  I've also added various FIXMEs to the page.  Some of these
("FIXME .") are things that I need to check up later.  Some others are
questions for which I need input from you, David, or someone else with the
relevant info (I've marked these "FIXME Amit:").  Could you please review,
and send a new draft of the page back to me.

Cheers,

Michael


.\" FIXME Amit: I need author and license information for this page.
.TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.\" FIXME . eventually this #include will probably be something
.\" different when support is added in glibc.
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
", loff_t " len ");
.\" FIXME . check later what feature test macros are required in
.\" glibc
.SH DESCRIPTION
.BR fallocate ()
allows the caller to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.

The
.I mode
argument determines the operation to be performed on the given range.
Currently only one flag is supported for
.IR mode :
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initializes to zero the disk space within the given range.
.\" FIXME Amit: The next two sentences seem to contradict
.\" each other somewhat.  On the one hand, later writes
.\" are guaranteed not to fail for lack of space; on the other
.\" hand, the file size is not changed even if it is currently
.\" smaller than offset+len bytes.
.\" Could you explain this a little further.  (E.g., how does
.\" the kernel guarantee space without changing the size
.\" of the file?)
After a successful call,
subsequent writes are guaranteed not to fail because
of lack of disk space.
Even if the size of the file is less than
.IR offset + len ,
the file size is not ...
From: Amit K. Arora
Date: Monday, July 23, 2007 - 6:10 am

Hi Michael,


Thanks for going through the manpage and improving it!

My comments are below in between <Amit> ... </Amit> tags.

Thanks!
--
Regards,
Amit Arora



.\" FIXME Amit: I need author and license information for this page.
.\" <Amit>
.\"    David Chinner is the original author, hence he can help with this.
.\" </Amit>
.TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.\" FIXME . eventually this #include will probably be something
.\" different when support is added in glibc.
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
", loff_t " len ");
.\" FIXME . check later what feature test macros are required in
.\" glibc
.SH DESCRIPTION
.BR fallocate ()
allows the caller to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.

The
.I mode
argument determines the operation to be performed on the given range.
Currently only one flag is supported for
.IR mode :
.TP
.B FALLOC_FL_KEEP_SIZE
allocates and initializes to zero the disk space within the given range.
.\" FIXME Amit: The next two sentences seem to contradict
.\" each other somewhat.  On the one hand, later writes
.\" are guaranteed not to fail for lack of space; on the other
.\" hand, the file size is not changed even if it is currently
.\" smaller than offset+len bytes.
.\" Could you explain this a little further.  (E.g., how does
.\" the kernel guarantee space without changing the size
.\" of the file?)
.\" <Amit>
.\"     Well, this is a feature where you can allocate/reserve space for
.\" a file without changing the file size. This is done by allocating blocks
.\" to the file, but still not changing the size. As mentioned below, this
.\" helps applications that use append mode a lot. These can open
.\" a file in append mode and start writing to "preallocated" ...
From: David Chinner
Date: Tuesday, July 24, 2007 - 12:06 am

Patch below.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

diff -u orig/fallocate.3 new/fallocate.3
--- orig/fallocate.3	Tue Jul 24 17:00:42 2007
+++ new/fallocate.3	Tue Jul 24 17:02:44 2007
@@ -1,7 +1,6 @@
-.\" FIXME Amit: I need author and license information for this page.
-.\" <Amit>
-.\"    David Chinner is the original author, hence he can help with this.
-.\" </Amit>
+.\" Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved
+.\" Written by Dave Chinner <dgc@sgi.com>
+.\" May be distributed as per GPLv2
 .TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
 .SH NAME
 fallocate \- manipulate file space
-

From: Michael Kerrisk
Date: Sunday, July 29, 2007 - 11:21 pm

Thanks David.  Applied, but I wrote "GNU General Public License version 2".

Cheers,

Michael

-

From: Michael Kerrisk
Date: Monday, July 30, 2007 - 12:43 pm

Okay -- I tried rewording the text here a little to make this clearer.  Can
you review the new version to see that it's okay.


Thanks for the info.



I made the sentence:

    Because allocation is done in block size chunks, fallocate()
    may allocate a larger range than that which was specified.

okay?


Okay -- thanks.  I reworded the text for the ENODEV error to make this
clearer.  (Please check the wording in the next draft.)

By the way in fs/open.c I see the comment:

        /*
         * Let individual file system decide if it supports preallocation
         * for directories or not.
         */
        if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
                goto out_fput;

But that comment doesn't seem to accord with the line of code immediately
below it (S_ISDIR() check is done regardless of file system type).  Do I
misunderstand something -- or is the comment wrong?


I made it:

    The mode is not supported by the file system containing the file
    referred to by fd.

Okay?

[...]

New version of the page on its way soon.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
-

From: Amit K. Arora
Date: Tuesday, July 31, 2007 - 6:56 am

Hi Michael,

<Amit>
Ok. Will review the draft version soon and will get back to you.

<Amit>
Ok.

<Amit>
Sure.

<Amit>
I think it is correct. We are failing ("goto out_fput;") _only_ if it is
not a regular file AND also not a directory. In the case when the
concerned object is a directory, the above "if" condition won't be true
and thus the "goto" won't get called. Hence, the individual file
system's ->fallocate() inode op will be called, which will decide if it
wants to support directories or not.

<Amit>
Ok.

<Amit>
I have received it. Will review it soon (maybe by tomorrow) and get
back. Thanks!
</Amit>

--
Regards,
-

From: Michael Kerrisk
Date: Monday, July 30, 2007 - 12:44 pm

Amit, David,

I've edited the previous version of the page, adding David's license, and
integrating Amit's comments.  I've also added a few new FIXMEs.  ("FIXME
Amit" again.)

Could you please review the changes, and the FIXMEs.

Cheers,

Michael



.\" Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved
.\" Written by Dave Chinner <dgc@sgi.com>
.\" May be distributed as per GNU General Public License version 2.
.\"
.TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual"
.SH NAME
fallocate \- manipulate file space
.SH SYNOPSIS
.nf
.\" FIXME . eventually this #include will probably be something
.\" different when support is added in glibc.
.B #include <linux/falloc.h>
.PP
.BI "long fallocate(int " fd ", int " mode ", loff_t " offset \
", loff_t " len ");
.\" FIXME . check later what feature test macros are required in
.\" glibc
.SH DESCRIPTION
.BR fallocate ()
allows the caller to directly manipulate the allocated disk space
for the file referred to by
.I fd
for the byte range starting at
.I offset
and continuing for
.I len
bytes.
.\" FIXME Amit: in other words the affected byte range
.\" is the bytes from (offset) to (offset + len - 1), right?

The
.I mode
argument determines the operation to be performed on the given range.
Currently only one flag is supported for
.IR mode :
.TP
.B FALLOC_FL_KEEP_SIZE
This flag allocates and initializes to zero the disk space
within the range specified by
.I offset
and
.IR len .
After a successful call, subsequent writes into this range
are guaranteed not to fail because of lack of disk space.
Preallocating zeroed blocks beyond the end of the file
is useful for optimizing append workloads.
Preallocating blocks does not change
the file size (as reported by
.BR stat (2))
even if it is less than
.\" FIXME Amit: "offset + len" is written here.  But should it be
.\" "offset + len - 1" ?
.IR offset + len .
.\"
.\" Note from Amit Arora:
.\" There were few more flags which were discussed, but none ...
From: Amit K. Arora
Date: Thursday, August 2, 2007 - 10:36 am

Hi Michael,




--
Regards,

<Amit>
Yes, you are right.

<Amit>
Good point. This text was taken directly from the man page of
posix_fallocate and also appears in the POSIX specification at:
http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html

The current posix_fallocate() implementation and the fallocate()
implementation in ext4 are both based on the above documentation, wherein
EOF is compared with "offset + len" and not with "offset + len - 1".

I am not sure if this is right or wrong. But this is as per the POSIX
specification. ;)

<Amit>
Please see my previous comment.

<Amit>
There is a typo above. We have "file system" repeated twice in the above
sentence. The second one should be "file".
-

From: Michael Kerrisk
Date: Friday, August 3, 2007 - 4:59 am

Thanks.



Ahhh -- the off by one error was inside my head!  Obviously if we allocate
bytes for offset 1000, len 100, then the affected byte range would run to
offset 1099, giving a file size of 1100 bytes -- that is (offset + len) --
not (offset + len - 1), which is of course the offset of the last byte.
Sorry for the confusion.
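
The arithmetic in this reply can be written down as a short sketch. The helper names range_last_byte() and size_after_alloc() are hypothetical, and the second helper assumes a size-extending allocation (one that does not keep the file size).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset of the last byte covered by a request of `len` bytes starting
 * at `offset`. Hypothetical helper; requires offset >= 0 and len > 0.
 */
static int64_t range_last_byte(int64_t offset, int64_t len)
{
    assert(offset >= 0 && len > 0);
    return offset + len - 1;
}

/*
 * File size after a size-extending allocation: the file grows to
 * offset + len only if it is currently smaller than that.
 */
static int64_t size_after_alloc(int64_t cur_size, int64_t offset, int64_t len)
{
    int64_t end = offset + len; /* one past the last affected byte */
    return cur_size < end ? end : cur_size;
}
```

With offset 1000 and len 100, range_last_byte() gives 1099 and size_after_alloc() for an empty file gives 1100, which is why EOF is compared with offset + len.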


Thanks for catching that.

Okay -- it seems that this page is pretty much ready for publication,
right?  I'll hold off for a bit, until nearer the end of the 2.6.23 cycle.

Cheers,

Michael

-- 
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?  Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
-

From: Amit K. Arora
Date: Sunday, August 5, 2007 - 11:10 pm

I agree. Thanks!

--
Regards,
Amit Arora
-

From: Amit K. Arora
Date: Friday, July 13, 2007 - 5:47 am

From: Amit Arora <aarora@in.ibm.com>

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called ->fallocate().
Applications can use this feature to avoid fragmentation to a certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if the
system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which
can be used for a similar purpose. Though this has the advantage of
working on all file systems, it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and, in case the kernel/filesystem does not implement it, fall
back to the current implementation of writing zeroes to the new blocks.
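
The "try the system call first, then fall back" idea can be sketched in user space as follows. This is not glibc's code: prealloc_with_fallback() and demo_prealloc() are hypothetical names, the sketch assumes a glibc that already provides the fallocate() wrapper (declared in <fcntl.h> under _GNU_SOURCE), and the slow path is simplified so that it only writes zeroes past the current end of file and leaves existing data alone.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to preallocate, and fall back to
 * writing zeroes if the kernel or file system lacks support.
 * Returns 0 on success or an errno value on failure.
 */
static int prealloc_with_fallback(int fd, off_t offset, off_t len)
{
    struct stat st;
    char zeroes[4096];
    off_t pos, end = offset + len;

    if (offset < 0 || len <= 0)
        return EINVAL;

    if (fallocate(fd, 0, offset, len) == 0)
        return 0;                       /* fast path worked */
    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return errno;                   /* a real error, e.g. ENOSPC */

    /* Slow path (simplified): only touch bytes past EOF. */
    if (fstat(fd, &st) != 0)
        return errno;
    memset(zeroes, 0, sizeof(zeroes));
    for (pos = st.st_size > offset ? st.st_size : offset; pos < end; ) {
        size_t want = (size_t)(end - pos) < sizeof(zeroes)
                          ? (size_t)(end - pos) : sizeof(zeroes);
        ssize_t n = pwrite(fd, zeroes, want, pos);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno;
        }
        if (n == 0)
            return EIO;
        pos += n;
    }
    return 0;
}

/* Creates a temporary file, preallocates 8 KiB and checks the size. */
static int demo_prealloc(void)
{
    char path[] = "/tmp/falloc-demo-XXXXXX";
    struct stat st;
    int fd = mkstemp(path);
    int rc;

    if (fd < 0)
        return errno;
    rc = prealloc_with_fallback(fd, 0, 8192);
    if (rc == 0 && fstat(fd, &st) == 0)
        assert(st.st_size == 8192);
    close(fd);
    unlink(path);
    return rc;
}
```

Either path leaves the file at least offset + len bytes long. Only the fast path avoids writing every block, which is the speed difference described above.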

ToDos:
1. Implementation on other architectures (other than i386, x86_64,
   and ppc). Patches for s390(x) and ia64 are already available from
   previous posts, but it was decided that they should be added later
   once fallocate is in the mainline. Hence not including those patches
   in this take.
2. A generic file system operation to handle fallocate
   (generic_fallocate), for filesystems that do _not_ have the fallocate
   inode operation implemented.
3. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()


Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===================================================================
--- ...
From: Christoph Hellwig
Date: Friday, July 13, 2007 - 6:21 am

kerneldoc comments are for in-kernel APIs which syscalls aren't.  I'd say

Please remove the comment, adding a generic fallback in kernelspace is a

Just remove FALLOC_ALLOCATE, 0 flags should be the default.  I'm also
not sure there is any point in having two namespaces now that we have a
flags-based ABI.

Also please don't add this to fs.h.  fs.h is a complete mess and the
falloc flags are a new user ABI.  Add a linux/falloc.h instead which can
be added to headers-y so the ABI constant can be exported to userspace.

-

From: Amit K. Arora
Date: Friday, July 13, 2007 - 7:18 am

Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want
to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this
flag and remove the other mode too (FALLOC_RESV_SPACE).
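
A user-space sketch of the resulting interface, with mode 0 as the default and FALLOC_FL_KEEP_SIZE as the only flag, might look like this. It assumes the fallocate() wrapper and <linux/falloc.h> header that later became available; demo_keep_size() and size_of() are hypothetical names, and the demo treats a file system without fallocate support as a skipped run rather than a failure.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* File size as reported by fstat(), or -1 on error. */
static off_t size_of(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : -1;
}

/*
 * Returns 0 on success (or when the file system does not support
 * fallocate at all), otherwise an errno value.
 */
static int demo_keep_size(void)
{
    char path[] = "/tmp/falloc-keep-XXXXXX";
    int fd = mkstemp(path);
    int rc = 0;

    if (fd < 0)
        return errno;

    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 4096) == 0) {
        /* Blocks are reserved, but the size stays at 0. */
        assert(size_of(fd) == 0);
        /* Default mode (no flags): the size grows to offset + len. */
        if (fallocate(fd, 0, 0, 4096) == 0)
            assert(size_of(fd) == 4096);
        else
            rc = errno;
    } else if (errno != EOPNOTSUPP && errno != ENOSYS) {
        rc = errno;
    }

    close(fd);
    unlink(path);
    return rc;
}
```

With a single flag the two earlier modes collapse into "flag clear" (the old FALLOC_ALLOCATE behaviour) and "flag set" (the old FALLOC_RESV_SPACE behaviour).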

Do we need a header file just to declare one flag - i.e.
FALLOC_FL_KEEP_SIZE (since now there is no point in declaring the two
modes)? If "linux/fs.h" is not a good place, would "asm-generic/fcntl.h"
be a sane place for this flag?

Thanks!
--
Regards,
Amit Arora
-

From: Christoph Hellwig
Date: Friday, July 13, 2007 - 7:46 am

It might sound a little silly but is the cleanest thing we could do by
far.  And I suspect there will be more flags soon.

-

From: Amit K. Arora
Date: Friday, July 13, 2007 - 5:50 am

From: Amit Arora <aarora@in.ibm.com>

fallocate support in ext4

This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation. The current implementation only supports
preallocation for regular files with extent maps (directories are not
supported yet). This patch does not currently support block-mapped files.
Only the FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are supported for now.


Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
 		} else if (path->p_ext) {
 			ext_debug("  %d:%d:%llu ",
 				  le32_to_cpu(path->p_ext->ee_block),
-				  le16_to_cpu(path->p_ext->ee_len),
+				  ext4_ext_get_actual_len(path->p_ext),
 				  ext_pblock(path->p_ext));
 		} else
 			ext_debug("  []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
 
 	for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
 		ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
-			  le16_to_cpu(ex->ee_len), ext_pblock(ex));
+			  ext4_ext_get_actual_len(ex), ext_pblock(ex));
 	}
 	ext_debug("\n");
 }
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode, 
 	ext_debug("  -> %d:%llu:%d ",
 			le32_to_cpu(path->p_ext->ee_block),
 			ext_pblock(path->p_ext),
-			le16_to_cpu(path->p_ext->ee_len));
+			ext4_ext_get_actual_len(path->p_ext));
 
 #ifdef CHECK_BINSEARCH
 	{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
 		ext_debug("move %d:%llu:%d in new leaf %llu\n",
 				le32_to_cpu(path[depth].p_ext->ee_block),
 				ext_pblock(path[depth].p_ext),
-				le16_to_cpu(path[depth].p_ext->ee_len),
+				ext4_ext_get_actual_len(path[depth].p_ext),
 				newblock);
 		/*memmove(ex++, path[depth].p_ext++,
 ...
From: Amit K. Arora
Date: Friday, July 13, 2007 - 5:48 am

From: David P. Quigley <dpquigl@tycho.nsa.gov>

Revalidate the write permissions for fallocate(2), in case security policy has
changed since the files were opened.

Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>

---
 fs/open.c |    3 +++
 1 files changed, 3 insertions(+)

Index: linux-2.6.22/fs/open.c
===================================================================
--- linux-2.6.22.orig/fs/open.c
+++ linux-2.6.22/fs/open.c
@@ -407,6 +407,9 @@ asmlinkage long sys_fallocate(int fd, in
 		goto out;
 	if (!(file->f_mode & FMODE_WRITE))
 		goto out_fput;
+	ret = security_file_permission(file, MAY_WRITE);
+	if (ret)
+		goto out_fput;
 
 	inode = file->f_path.dentry->d_inode;
 
-

From: Christoph Hellwig
Date: Friday, July 13, 2007 - 6:21 am

This should be merged into the main falloc patch.

-

From: Amit K. Arora
Date: Friday, July 13, 2007 - 7:28 am

Ok. Will merge it...

--
Regards,
Amit Arora
-

From: Amit K. Arora
Date: Friday, July 13, 2007 - 5:52 am

From:  Amit Arora <aarora@in.ibm.com>

write support for preallocated blocks

This patch adds write support for the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.
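
The splitting step can be illustrated with a simplified model. This is a sketch only: struct piece and split_for_write() are hypothetical and work on plain block numbers rather than on the ext4 extent tree, the write is assumed to lie entirely inside one uninitialized extent, and the merge step is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for an extent: logical start, length, state. */
struct piece {
    uint32_t start;
    uint32_t len;
    bool uninit;
};

/*
 * Split the uninitialized extent [ext_start, ext_start + ext_len) for a
 * write to [w_start, w_start + w_len). Produces an optional uninitialized
 * left part, the now-initialized written part, and an optional
 * uninitialized right part. Returns the number of pieces (1 to 3).
 */
static int split_for_write(uint32_t ext_start, uint32_t ext_len,
                           uint32_t w_start, uint32_t w_len,
                           struct piece out[3])
{
    uint32_t ext_end = ext_start + ext_len;
    uint32_t w_end = w_start + w_len;
    int n = 0;

    assert(w_len > 0 && w_start >= ext_start && w_end <= ext_end);

    if (w_start > ext_start)
        out[n++] = (struct piece){ ext_start, w_start - ext_start, true };
    out[n++] = (struct piece){ w_start, w_len, false };
    if (w_end < ext_end)
        out[n++] = (struct piece){ w_end, ext_end - w_end, true };
    return n;
}
```

A write in the middle of the extent yields three pieces, a write at either edge yields two, and a write covering the whole extent simply converts it to a single initialized piece.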

Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode 
 }
 
 /*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+			  struct ext4_ext_path *path,
+			  struct ext4_extent *ex)
+{
+	struct ext4_extent_header *eh;
+	unsigned int depth, len;
+	int merge_done = 0;
+	int uninitialized = 0;
+
+	depth = ext_depth(inode);
+	BUG_ON(path[depth].p_hdr == NULL);
+	eh = path[depth].p_hdr;
+
+	while (ex < EXT_LAST_EXTENT(eh)) {
+		if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+			break;
+		/* merge with next extent! */
+		if (ext4_ext_is_uninitialized(ex))
+			uninitialized = 1;
+		ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+				+ ext4_ext_get_actual_len(ex + 1));
+		if (uninitialized)
+			ext4_ext_mark_uninitialized(ex);
+
+		if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+			len = (EXT_LAST_EXTENT(eh) - ex - 1)
+				* sizeof(struct ext4_extent);
+			memmove(ex + 1, ex + 2, len);
+		}
+		eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+		merge_done = 1;
+		WARN_ON(eh->eh_entries == 0);
+		if (!eh->eh_entries)
+			ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+			 ...
From: Amit K. Arora
Date: Friday, July 13, 2007 - 5:52 am

From: Amit Arora <aarora@in.ibm.com>

Change on-disk format for extent to represent uninitialized/initialized extents

This change was suggested by Andreas Dilger. 
This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-block alignment.
Now, the maximum number of blocks we can have in an initialized extent is
2^15 and in an uninitialized extent is 2^15 - 1.
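
The two limits can be modelled with a host-endian sketch of the 16-bit length field. It assumes that a stored value above 0x8000 marks an uninitialized extent and that exactly 0x8000 is a full-length initialized extent, which is what the limits above and the "top bit of ee_len" comment in the patch suggest. The helper names are hypothetical simplifications of the patch's functions, and the merge check ignores the contiguity tests that the real code also performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXT_INIT_MAX_LEN   (1U << 15)             /* 2^15 blocks */
#define EXT_UNINIT_MAX_LEN (EXT_INIT_MAX_LEN - 1) /* 2^15 - 1 blocks */

/* Stored values above 0x8000 denote an uninitialized extent. */
static bool ext_is_uninitialized(uint16_t ee_len)
{
    return ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks actually covered, whatever the state. */
static unsigned ext_actual_len(uint16_t ee_len)
{
    return ee_len <= EXT_INIT_MAX_LEN ? ee_len : ee_len - EXT_INIT_MAX_LEN;
}

/* Encode an uninitialized extent of `len` blocks. */
static uint16_t ext_mark_uninitialized(uint16_t len)
{
    assert(len >= 1 && len <= EXT_UNINIT_MAX_LEN);
    return (uint16_t)(len | EXT_INIT_MAX_LEN);
}

/* Length-only part of the merge test: same state, sum within the limit. */
static bool ext_len_allows_merge(uint16_t a, uint16_t b)
{
    unsigned max_len;

    if (ext_is_uninitialized(a) != ext_is_uninitialized(b))
        return false;
    max_len = ext_is_uninitialized(a) ? EXT_UNINIT_MAX_LEN
                                      : EXT_INIT_MAX_LEN;
    return ext_actual_len(a) + ext_actual_len(b) <= max_len;
}
```

Under this model two initialized extents of 2^14 blocks each can merge into one of 2^15 blocks, while two uninitialized extents of the same size cannot, because the sum exceeds 2^15 - 1.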

This patch takes care of Andreas's suggestion of using EXT_INIT_MAX_LEN
instead of 0x8000 at some places.

Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
 ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
 				struct ext4_extent *ex2)
 {
-	unsigned short ext1_ee_len, ext2_ee_len;
+	unsigned short ext1_ee_len, ext2_ee_len, max_len;
 
 	/*
 	 * Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode 
 	if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
 		return 0;
 
+	if (ext4_ext_is_uninitialized(ex1))
+		max_len = EXT_UNINIT_MAX_LEN;
+	else
+		max_len = EXT_INIT_MAX_LEN;
+
 	ext1_ee_len = ext4_ext_get_actual_len(ex1);
 	ext2_ee_len = ext4_ext_get_actual_len(ex2);
 
@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode 
 	 * as an RO_COMPAT feature, refuse to merge to extents if
 	 * this can result in the top bit of ee_len being set.
 	 */
-	if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+	if (ext1_ee_len + ext2_ee_len > max_len)
 		return 0;
 #ifdef AGGRESSIVE_TEST
 	if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
 
 ...