This is the latest fallocate patchset and is based on 2.6.22.

* Following are the changes from TAKE6:
1) We now just have two modes (and no deallocation modes).
2) Updated the man page
3) Added a new patch submitted by David P. Quigley (Patch 3/6).
4) Used EXT_INIT_MAX_LEN instead of 0x8000 in Patch 6/6.
5) Included below in the end is a small testcase to test fallocate.

* Following are the changes from TAKE5 to TAKE6:
1) Rebased to 2.6.22
2) Added compat wrapper for x86_64
3) Dropped s390 and ia64 patches, since the platform maintainers can add the support for fallocate once it is in mainline.
4) Added a change suggested by Andreas for better extent-to-group alignment in ext4 (Patch 6/6). Please refer to the following post: http://www.mail-archive.com/linux-ext4@vger.kernel.org/msg02445.html
5) Renamed mode flags and values from "FA_" to "FALLOC_"
6) Added manpage (updated version of the one initially submitted by David Chinner).

Todos:
-----
1> Implementation on other architectures (other than i386, x86_64, and ppc64). s390(x) and ia64 patches are ready and will be pushed by platform maintainers when the fallocate is in mainline.
2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented.
3> Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()
4> Patch to e2fsprogs to recognize and display uninitialized extents.

The following patches follow:
Patch 1/6 : manpage for fallocate
Patch 2/6 : fallocate() implementation in i386, x86_64 and powerpc
Patch 3/6 : revalidate write permissions for fallocate
Patch 4/6 : ext4: fallocate support in ext4
Patch 5/6 : ext4: write support for preallocated blocks
Patch 6/6 : ext4: change for better extent-to-group alignment

Note: Attached below is a small testcase to test fallocate. The __NR_fallocate will need to be changed depending on the system ...
Following is the modified version of the manpage originally submitted by David Chinner. Please use `nroff -man fallocate.2 | less` to view. This includes changes suggested by Heikki Orsila and Barry Naujok. .TH fallocate 2 .SH NAME fallocate \- allocate or remove file space .SH SYNOPSIS .nf .B #include <fcntl.h> .PP .BI "long fallocate(int " fd ", int " mode ", loff_t " offset ", loff_t " len); .SH DESCRIPTION The .B fallocate syscall allows a user to directly manipulate the allocated disk space for the file referred to by .I fd for the byte range starting at .I offset and continuing for .I len bytes. The .I mode parameter determines the operation to be performed on the given range. Currently there are two modes: .TP .B FALLOC_ALLOCATE allocates and initialises to zero the disk space within the given range. After a successful call, subsequent writes are guaranteed not to fail because of lack of disk space. If the size of the file is less than .IR offset + len , then the file is increased to this size; otherwise the file size is left unchanged. .B FALLOC_ALLOCATE closely resembles .BR posix_fallocate (3) and is intended as a method of optimally implementing this function. .B FALLOC_ALLOCATE may allocate a larger range than that was specified. .TP .B FALLOC_RESV_SPACE provides the same functionality as .B FALLOC_ALLOCATE except it does not ever change the file size. This allows allocation of zero blocks beyond the end of file and is useful for optimising append workloads. .SH RETURN VALUE .B fallocate returns zero on success, or an error number on failure. Note that .I errno is not set. .SH ERRORS .TP .B EBADF .I fd is not a valid file descriptor, or is not opened for writing. .TP .B EFBIG .IR offset + len exceeds the maximum file size. .TP .B EINVAL .I offset was less than 0, or .I len was less than or equal to 0. .TP .B ENODEV .I fd does not refer to a regular file or a directory. .TP .B ENOSPC There is not enough space left on ...
If fallocate is just being used for allocating space this is wrong. maybe - "manipulate file space" instead? "of zeroed blocks" -- Dave Chinner Principal Engineer SGI Australian Software Group -
[CC += mtk-manpages@gmx.net] Amit, Thanks for this page. I will endeavour to review it in the coming days. In the meantime, the better address to CC me on for man pages stuff is mtk-manpages@gmx.net. Cheers, -
Sure. BTW, this man page has changed a bit and the one in TAKE8 of fallocate patches is the latest one. You are copied on that too. I will forward that mail to "mtk-manpages@gmx.net" id also, so that you do not miss it. Thanks! -- Regards, -
Amit, I've taken the page that you sent and made various minor formatting and wording fixes. I've also added various FIXMEs to the page. Some of these ("FIXME .") are things that I need to check up later. Some others are questions for which I need input from you, David, or someone else with the relevant info (I've marked these "FIXME Amit:"). Could you please review, and send a new draft of the page back to me. Cheers, Michael .\" FIXME Amit: I need author and license information for this page. .TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual" .SH NAME fallocate \- manipulate file space .SH SYNOPSIS .nf .\" FIXME . eventually this #include will probably be something .\" different when support is added in glibc. .B #include <linux/falloc.h> .PP .BI "long fallocate(int " fd ", int " mode ", loff_t " offset \ ", loff_t " len "); .\" FIXME . check later what feature text macros are required in .\" glibc .SH DESCRIPTION .BR fallocate () allows the caller to directly manipulate the allocated disk space for the file referred to by .I fd for the byte range starting at .I offset and continuing for .I len bytes. The .I mode argument determines the operation to be performed on the given range. Currently only one flag is supported for .IR mode : .TP .B FALLOC_FL_KEEP_SIZE allocates and initializes to zero the disk space within the given range. .\" FIXME Amit: The next two sentences seem to contradict .\" each other somewhat. On the one hand, later writes .\" are guaranteed not to fail for lack of space; on the other .\" hand, the file size id not changed even if it is currently .\" smaller than offset+len bytes. .\" Could you explain this a little further. (E.g., how does .\" the kernel guarantee space without changing the size .\" of the file?) After a successful call, subsequent writes are guaranteed not to fail because of lack of disk space. Even if the size of the file is less than .IR offset + len , the file size is not ...
Hi Michael, Thanks for going through the manpage and improving it! My comments are below in between <Amit> ... </Amit> tags. Thanks! -- Regards, Amit Arora .\" FIXME Amit: I need author and license information for this page. .\" <Amit> .\" David Chinner is the original author, hence he can help with this. .\" </Amit> .TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual" .SH NAME fallocate \- manipulate file space .SH SYNOPSIS .nf .\" FIXME . eventually this #include will probably be something .\" different when support is added in glibc. .B #include <linux/falloc.h> .PP .BI "long fallocate(int " fd ", int " mode ", loff_t " offset \ ", loff_t " len "); .\" FIXME . check later what feature text macros are required in .\" glibc .SH DESCRIPTION .BR fallocate () allows the caller to directly manipulate the allocated disk space for the file referred to by .I fd for the byte range starting at .I offset and continuing for .I len bytes. The .I mode argument determines the operation to be performed on the given range. Currently only one flag is supported for .IR mode : .TP .B FALLOC_FL_KEEP_SIZE allocates and initializes to zero the disk space within the given range. .\" FIXME Amit: The next two sentences seem to contradict .\" each other somewhat. On the one hand, later writes .\" are guaranteed not to fail for lack of space; on the other .\" hand, the file size id not changed even if it is currently .\" smaller than offset+len bytes. .\" Could you explain this a little further. (E.g., how does .\" the kernel guarantee space without changing the size .\" of the file?) .\" <Amit> .\" Well, this is a feature where you can allocate/reserve space for .\" a file without changing the file size. This is done by allocating blocks .\" to the file, but still not changing the size. As mentioned below, this .\" helps applications that use append mode a lot. These can open .\" a file in append mode and start writing to "preallocated" ...
Patch below. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group diff -u orig/fallocate.3 new/fallocate.3 --- orig/fallocate.3 Tue Jul 24 17:00:42 2007 +++ new/fallocate.3 Tue Jul 24 17:02:44 2007 @@ -1,7 +1,6 @@ -.\" FIXME Amit: I need author and license information for this page. -.\" <Amit> -.\" David Chinner is the original author, hence he can help with this. -.\" </Amit> +.\" Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved +.\" Written by Dave Chinner <dgc@sgi.com> +.\" May be distributed as per GPLv2 .TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual" .SH NAME fallocate \- manipulate file space -
Thanks David. Applied, but I wrote "GNU General Public License version 2". Cheers, Michael -
Okay -- I tried rewording the text here a little to make this clearer. Can
you review the new version to see that it's okay?
Thanks for the info.
I made the sentence:
Because allocation is done in block size chunks, fallocate()
may allocate a larger range than that which was specified.
okay?
Okay -- thanks. I reworded the text for the ENODEV error to make this
clearer. (Please check the wording in the next draft.)
By the way in fs/open.c I see the comment:
        /*
         * Let individual file system decide if it supports preallocation
         * for directories or not.
         */
        if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
                goto out_fput;
But that comment doesn't seem to accord with the line of code immediately
below it (the S_ISDIR() check is done regardless of file system type). Do I
misunderstand something -- or is the comment wrong?
I made it:
The mode is not supported by the file system containing the file
referred to by fd.
Okay?
[...]
New version of the page on its way soon.
Cheers,
Michael
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
-
Hi Michael, <Amit> Ok. Will review the draft version soon and will get back to you. <Amit> Ok. <Amit> Sure. <Amit> I think it is correct. We are failing ("goto out_fput;") _only_ if it is not a regular file AND also not a directory. In the case when the concerned object is a directory, the above "if" condition won't be true and thus the "goto" won't get called. Hence, the individual file system's ->fallocate() inode op will be called, which will decide if it wants to support directories or not. <Amit> Ok. <Amit> I have received it. Will review it soon (maybe by tomorrow) and get back. Thanks! </Amit> -- Regards, -
Amit, David, I've edited the previous version of the page, adding David's license, and integrating Amit's comments. I've also added a few new FIXMES. ("FIXME Amit" again.) Could you please review the changes, and the FIXMEs. Cheers, Michael .\" Copyright (c) 2007 Silicon Graphics, Inc. All Rights Reserved .\" Written by Dave Chinner <dgc@sgi.com> .\" May be distributed as per GNU General Public License version 2. .\" .TH FALLOCATE 2 2007-07-20 "Linux" "Linux Programmer's Manual" .SH NAME fallocate \- manipulate file space .SH SYNOPSIS .nf .\" FIXME . eventually this #include will probably be something .\" different when support is added in glibc. .B #include <linux/falloc.h> .PP .BI "long fallocate(int " fd ", int " mode ", loff_t " offset \ ", loff_t " len "); .\" FIXME . check later what feature text macros are required in .\" glibc .SH DESCRIPTION .BR fallocate () allows the caller to directly manipulate the allocated disk space for the file referred to by .I fd for the byte range starting at .I offset and continuing for .I len bytes. .\" FIXME Amit: in other words the affected byte range .\" is the bytes from (offset) to (offset + len - 1), right? The .I mode argument determines the operation to be performed on the given range. Currently only one flag is supported for .IR mode : .TP .B FALLOC_FL_KEEP_SIZE This flag allocates and initializes to zero the disk space within the range specified by .I offset and .IR len . After a successful call, subsequent writes into this range are guaranteed not to fail because of lack of disk space. Preallocating zeroed blocks beyond the end of the file is useful for optimizing append workloads. Preallocating blocks does not change the file size (as reported by .BR stat (2)) even if it is less than .\" FIXME Amit: "offset + len" is written here. But should it be .\" "offset + len - 1" ? .IR offset + len . .\" .\" Note from Amit Arora: .\" There were few more flags which were discussed, but none ...
Hi Michael, -- Regards, <Amit> Yes, you are right. <Amit> Good point. This text was directly taken from the man page of posix_fallocate and is also there on the posix specifications at: http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html The current posix_fallocate() implementation and also the fallocate() implementation in ext4 are based on above documentation, wherein EOF is compared with "offset + len" and not with "offset + len - 1". I am not sure if this is right or wrong. But, this is as per posix specifications. ;) <Amit> Please see my previous comment. <Amit> There is a typo above. We have "file system" repeated twice in above sentence. Second one should be "file". -
Thanks. Ahhh -- the off by one error was inside my head! Obviously if we allocate bytes for offset 1000, len 100, then the affected byte range would run to offset 1099, giving a file size of 1100 bytes -- that is (offset + len) -- not (offset + len - 1), which is of course the offset of the last byte. Sorry for the confusion. Thanks for catching that. Okay -- it seems that this page is pretty much ready for publication, right? I'll hold off for a bit, until nearer the end of the 2.6.23 cycle. Cheers, Michael -
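The arithmetic above can be written down as a tiny sketch. The helper names are hypothetical, and the size rule is the one from the man page draft for the mode that extends the file (with FALLOC_FL_KEEP_SIZE the size would stay as it is).

```c
#include <sys/types.h>

/* Offset of the last byte in the range [offset, offset + len). */
static off_t last_byte_offset(off_t offset, off_t len)
{
	return offset + len - 1;
}

/*
 * File size after a size-extending allocation: the file grows to
 * offset + len if it was smaller, otherwise it is left unchanged.
 */
static off_t size_after_alloc(off_t cur_size, off_t offset, off_t len)
{
	off_t end = offset + len;

	return end > cur_size ? end : cur_size;
}
```

For offset 1000 and len 100 this gives a last byte offset of 1099 and, for an initially empty file, a resulting size of 1100.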
I agree. Thanks! -- Regards, Amit Arora -
From: Amit Arora <aarora@in.ibm.com>

sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate(). Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the system becomes full.

Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Though this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, it should fall back to the current implementation of writing zeroes to the new blocks.

ToDos:
1. Implementation on other architectures (other than i386, x86_64, and ppc). Patches for s390(x) and ia64 are already available from previous posts, but it was decided that they should be added later once fallocate is in the mainline. Hence not including those patches in this take.
2. A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented.
3. Changes to glibc,
   a) to support fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Signed-off-by: Amit Arora <aarora@in.ibm.com>

Index: linux-2.6.22/arch/i386/kernel/syscall_table.S
===================================================================
--- ...
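The "try fallocate() first, fall back to writing zeroes" idea from the description could look roughly like the sketch below. It is not the glibc implementation. It assumes the fallocate() library wrapper that later appeared in glibc (which returns -1 and sets errno, unlike the raw syscall convention discussed in the man page drafts), and both helper names are hypothetical. To stay safe it only ever touches space past the current end of file, so existing data is never overwritten.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Slow path (hypothetical helper): write zeroes over [offset, offset + len). */
static int zero_fill(int fd, off_t offset, off_t len)
{
	char buf[4096];

	memset(buf, 0, sizeof(buf));
	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(buf) ? (size_t)len : sizeof(buf);
		ssize_t n = pwrite(fd, buf, chunk, offset);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return errno;
		}
		offset += n;
		len -= n;
	}
	return 0;
}

/*
 * Hypothetical helper: extend the file by len bytes of allocated space.
 * Returns 0 on success or an errno value on failure.
 */
static int extend_preallocated(int fd, off_t len)
{
	struct stat st;

	if (len <= 0)
		return EINVAL;
	if (fstat(fd, &st) < 0)
		return errno;

	/* Fast path: let the file system allocate the blocks. */
	if (fallocate(fd, 0, st.st_size, len) == 0)
		return 0;
	/* Only fall back when preallocation is simply not available. */
	if (errno != EOPNOTSUPP && errno != ENOSYS)
		return errno;

	return zero_fill(fd, st.st_size, len);
}
```

Either path leaves the file len bytes longer; the difference is only in how much I/O it takes to get there.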
kerneldoc comments are for in-kernel APIs which syscalls aren't. I'd say Please remove the comment, adding a generic fallback in kernelspace is a Just remove FALLOC_ALLOCATE, 0 flags should be the default. I'm also not sure there is any point in having two namespace now that we have a flags- based ABI. Also please don't add this to fs.h. fs.h is a complete mess and the falloc flags are a new user ABI. Add a linux/falloc.h instead which can be added to headers-y so the ABI constant can be exported to userspace. -
Ok. Since we have only one flag (FALLOC_FL_KEEP_SIZE) and we do not want to declare the default mode (FALLOC_ALLOCATE), we can _just_ have this flag and remove the other mode too (FALLOC_RESV_SPACE). Should we need a header file just to declare one flag - i.e. FALLOC_FL_KEEP_SIZE (since now there is no point of declaring the two modes) ? If "linux/fs.h" is not a good place, will "asm-generic/fcntl.h" be a sane place for this flag ? Thanks! -- Regards, Amit Arora -
It might sound a little silly but is the cleanest thing we could do by far. And I suspect there will be more flags soon.. -
From: Amit Arora <aarora@in.ibm.com>
fallocate support in ext4
This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation. Current implementation only supports
preallocation for regular files (directories not supported as of date)
with extent maps. This patch does not support block-mapped files currently.
Only FALLOC_ALLOCATE and FALLOC_RESV_SPACE modes are being supported as of
now.
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -282,7 +282,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug(" %d:%d:%llu ",
le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
ext_pblock(path->p_ext));
} else
ext_debug(" []");
@@ -305,7 +305,7 @@ static void ext4_ext_show_leaf(struct in
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
}
@@ -425,7 +425,7 @@ ext4_ext_binsearch(struct inode *inode,
ext_debug(" -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
- le16_to_cpu(path->p_ext->ee_len));
+ ext4_ext_get_actual_len(path->p_ext));
#ifdef CHECK_BINSEARCH
{
@@ -686,7 +686,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
- le16_to_cpu(path[depth].p_ext->ee_len),
+ ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
...From: David P. Quigley <dpquigl@tycho.nsa.gov> Revalidate the write permissions for fallocate(2), in case security policy has changed since the files were opened. Acked-by: James Morris <jmorris@namei.org> Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov> --- fs/open.c | 3 +++ 1 files changed, 3 insertions(+) Index: linux-2.6.22/fs/open.c =================================================================== --- linux-2.6.22.orig/fs/open.c +++ linux-2.6.22/fs/open.c @@ -407,6 +407,9 @@ asmlinkage long sys_fallocate(int fd, in goto out; if (!(file->f_mode & FMODE_WRITE)) goto out_fput; + ret = security_file_permission(file, MAY_WRITE); + if (ret) + goto out_fput; inode = file->f_path.dentry->d_inode; -
This should be merged into the main falloc patch. -
Ok. Will merge it... -- Regards, Amit Arora -
From: Amit Arora <aarora@in.ibm.com>
write support for preallocated blocks
This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (up to three) extents and merging the
new split extents with neighbouring ones, if possible.
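As a rough illustration of the "up to three extents" case, here is a userspace sketch, not the ext4 code. The struct and function names are hypothetical, units are logical blocks, and merging with neighbouring extents is left out. A write into the middle of an uninitialized extent leaves an uninitialized head, an initialized middle covering the written blocks, and an uninitialized tail; a write touching either edge produces fewer pieces.

```c
#include <stddef.h>

/* Hypothetical, simplified view of an extent: a block range plus a flag. */
struct piece {
	unsigned int start;	/* first logical block */
	unsigned int len;	/* number of blocks */
	int uninit;		/* 1 = preallocated but not yet written */
};

/*
 * Split the uninitialized extent [start, start + len) around a write
 * covering [wstart, wstart + wlen). Fills out[] with 1 to 3 pieces and
 * returns how many were produced, or 0 if the write does not lie
 * entirely inside the extent.
 */
static size_t split_for_write(unsigned int start, unsigned int len,
			      unsigned int wstart, unsigned int wlen,
			      struct piece out[3])
{
	unsigned int end = start + len;
	unsigned int wend = wstart + wlen;
	size_t n = 0;

	if (wlen == 0 || wstart < start || wend > end)
		return 0;

	if (wstart > start)	/* untouched head stays uninitialized */
		out[n++] = (struct piece){ start, wstart - start, 1 };

	out[n++] = (struct piece){ wstart, wlen, 0 };	/* written blocks */

	if (wend < end)		/* untouched tail stays uninitialized */
		out[n++] = (struct piece){ wend, end - wend, 1 };

	return n;
}
```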
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1140,6 +1140,53 @@ ext4_can_extents_be_merged(struct inode
}
/*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+ struct ext4_extent_header *eh;
+ unsigned int depth, len;
+ int merge_done = 0;
+ int uninitialized = 0;
+
+ depth = ext_depth(inode);
+ BUG_ON(path[depth].p_hdr == NULL);
+ eh = path[depth].p_hdr;
+
+ while (ex < EXT_LAST_EXTENT(eh)) {
+ if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+ break;
+ /* merge with next extent! */
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+ + ext4_ext_get_actual_len(ex + 1));
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(ex);
+
+ if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+ len = (EXT_LAST_EXTENT(eh) - ex - 1)
+ * sizeof(struct ext4_extent);
+ memmove(ex + 1, ex + 2, len);
+ }
+ eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries) - 1);
+ merge_done = 1;
+ WARN_ON(eh->eh_entries == 0);
+ if (!eh->eh_entries)
+ ext4_error(inode->i_sb, "ext4_ext_try_to_merge",
+ ...From: Amit Arora <aarora@in.ibm.com>
Change on-disk format for extent to represent uninitialized/initialized extents
This change was suggested by Andreas Dilger.
This patch changes the EXT_MAX_LEN value and extent code which marks/checks
uninitialized extents. With this change it will be possible to have
initialized extents with 2^15 blocks (earlier the max blocks we could have
was 2^15 - 1). This way we can have better extent-to-block alignment.
Now, maximum number of blocks we can have in an initialized extent is 2^15
and in an uninitialized extent is 2^15 - 1.
This patch takes care of Andreas's suggestion of using EXT_INIT_MAX_LEN
instead of 0x8000 at some places.
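One encoding that is consistent with these limits can be sketched as follows. This is an assumption about how the flag shares the 16-bit length field, since the relevant hunks are not shown in full here; the SKETCH_ macros and helper names are hypothetical stand-ins for EXT_INIT_MAX_LEN and EXT_UNINIT_MAX_LEN. The idea is that a raw value of exactly 0x8000 can be read as an initialized extent of 2^15 blocks, because an uninitialized extent of zero length would be meaningless.

```c
#include <stdint.h>

#define SKETCH_INIT_MAX_LEN	(1U << 15)			/* 32768 blocks */
#define SKETCH_UNINIT_MAX_LEN	(SKETCH_INIT_MAX_LEN - 1)	/* 32767 blocks */

/* Raw values above 0x8000 are treated as uninitialized extents. */
static int sketch_is_uninit(uint16_t ee_len)
{
	return ee_len > SKETCH_INIT_MAX_LEN;
}

/* Number of blocks covered, with the marker stripped if present. */
static uint16_t sketch_actual_len(uint16_t ee_len)
{
	return ee_len <= SKETCH_INIT_MAX_LEN ?
		ee_len : (uint16_t)(ee_len - SKETCH_INIT_MAX_LEN);
}

/* Mark a length in 1..SKETCH_UNINIT_MAX_LEN as uninitialized. */
static uint16_t sketch_mark_uninit(uint16_t len)
{
	return (uint16_t)(len | SKETCH_INIT_MAX_LEN);
}

/* Length part of the merge check: the limit depends on the extent type. */
static int sketch_len_fits(int uninit, uint16_t len1, uint16_t len2)
{
	unsigned int max_len = uninit ?
		SKETCH_UNINIT_MAX_LEN : SKETCH_INIT_MAX_LEN;

	return (unsigned int)len1 + len2 <= max_len;
}
```

Under this scheme two initialized extents of 20000 and 12768 blocks may merge into one of 32768, while two uninitialized extents of the same lengths may not.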
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Index: linux-2.6.22/fs/ext4/extents.c
===================================================================
--- linux-2.6.22.orig/fs/ext4/extents.c
+++ linux-2.6.22/fs/ext4/extents.c
@@ -1106,7 +1106,7 @@ static int
ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ext4_extent *ex2)
{
- unsigned short ext1_ee_len, ext2_ee_len;
+ unsigned short ext1_ee_len, ext2_ee_len, max_len;
/*
* Make sure that either both extents are uninitialized, or
@@ -1115,6 +1115,11 @@ ext4_can_extents_be_merged(struct inode
if (ext4_ext_is_uninitialized(ex1) ^ ext4_ext_is_uninitialized(ex2))
return 0;
+ if (ext4_ext_is_uninitialized(ex1))
+ max_len = EXT_UNINIT_MAX_LEN;
+ else
+ max_len = EXT_INIT_MAX_LEN;
+
ext1_ee_len = ext4_ext_get_actual_len(ex1);
ext2_ee_len = ext4_ext_get_actual_len(ex2);
@@ -1127,7 +1132,7 @@ ext4_can_extents_be_merged(struct inode
* as an RO_COMPAT feature, refuse to merge to extents if
* this can result in the top bit of ee_len being set.
*/
- if (ext1_ee_len + ext2_ee_len > EXT_MAX_LEN)
+ if (ext1_ee_len + ext2_ee_len > max_len)
return 0;
#ifdef AGGRESSIVE_TEST
if (le16_to_cpu(ex1->ee_len) >= 4)
@@ -1814,7 +1819,11 @@ ext4_ext_rm_leaf(handle_t *handle, struc
...