This is to give a heads up on few patches that we will be soon coming up with. These patches implement a new system call sys_fallocate() and a new inode operation "fallocate", for persistent preallocation. The new system call, as Andrew suggested, will look like: asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len); As we are developing and testing the required patches, we decided to post a preliminary patch and get inputs from the community to give it a right direction and shape. First, a little description on the feature. Persistent preallocation is a file system feature using which an application (say, relational database servers) can explicitly preallocate blocks to a particular file. This feature can be used to reserve space for a file to get mainly the following benefits: 1> contiguity - less defragmentation and thus faster access speed, and 2> guarantee for a minimum space availibility (depending on how many blocks were preallocated) for the file, even if the filesystem becomes full. XFS already has an implementation for this, using an ioctl interface. And, ext4 is now coming up with this feature. In coming time we may see a few more file systems implementing this. Thus, it makes sense to have a more standard interface for this, like this new system call. Here is the initial and incomplete version of the patch, which can be used for the discussion, till we come up with a set of more complete patches. --- arch/i386/kernel/syscall_table.S | 1 + fs/ext4/file.c | 1 + fs/open.c | 18 ++++++++++++++++++ include/asm-i386/unistd.h | 3 ++- include/linux/fs.h | 1 + include/linux/syscalls.h | 1 + 6 files changed, 24 insertions(+), 1 deletion(-) Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S =================================================================== --- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S +++ ...
On Fri, 2 Mar 2007 00:04:45 +0530 It is intended that glibc use this same syscall for both posix_fallocate() and posix_fallocate64(). I'd agree with Eric on the "command" flag extension. That new argument might need to come after "fd" - ARM has funny requirements on Please always put a blank line between the variable definitions and the first statement. Please always use hard tabs, not bunch-of-spaces. This seems to happening rather a lot in the ext4 patches. It's a trivial thing, but also trivial to fix. A grep across the diffs is needed. ENOTTY is a bit unconventional - we often use EINVAL for this sort of thing. But EINVAL has other meanings for posix_fallocate() and isn't really appropriate here anyway. So I'm not sure what would be better... -
Seems like a separate syscall would be better, "command" sounds a bit ioctl like, especially if that command is passed into the filesystems.. cheers. -- Nathan -
On Fri, 02 Mar 2007 09:40:54 +1100 madvise, fadvise, lseek, etc seem to work OK. I get repeatedly traumatised by patch rejects whenever a new syscall gets added, so I'm biased. The advantage of a command flag is that we can add new modes in the future without causing lots of churn, waiting for arch maintainers to catch up, potentially adding new compat code, etc. Rename it to "mode"? ;) I'm inclined to merge this patch nice and early, so the syscall number is stabilised. Otherwise the people who are working on out-of-tree code (ie: ext4) will have to keep playing catchup. -
I am wondering if it is useful to add another mode to advise block allocation policy? Something like indicating which physical block/block group to allocate from (goal), and whether ask for strict contigous blocks. This will help preallocation or reservation to choose the right blocks for the file. Right now neither ext4 preallocation implementation or reservation are guranteed to allocate/reserve contigugous extents. If the application told it so, it could do more searching to satisfy the requirement. Or fadvise is the right interface? -
Yes, I also think this would be useful so you can "guide" preallocation for things like defragmentation (e.g. preallocate space for the file being defragmented and move the file to it). Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
Yep, I think it makes sense to use preallocation for defragmentation. After all both preallocation and defragmentation shall call underlying filesystem multiple block allocator to try to allocate a chunk of contiguous blocks on disk. ext4 online defrag implementation by Takashi already support to choose a "goal" allocation block to guide the ext4 block allocator to place the defraged file is a specific location. Passing a little bit more hint to sys_fallocate() (i.e, goal block, and/or whether the goal block is important over the size of prealloc extent), might make it more useful for the orginial goal (get contigous and guranteed blocks) and for defragmentation. Regards, Mingming -
fallocate with the whence argument and flags is already quite complicated, I'd rather have another call for placement decisions, that would be called on an fd to do placement decissions for any further allocations (prealloc, write, etc) -
Yes, posix_fallocate shouldn't be made more complicated. But I don't understand why requesting linear layout of the blocks should be an option. It's always an advantage if the blocks requested this way are linear on disk. So, the kernel should always do its best to make this happen, without needing an additional option. --=20 =E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro St = =E2=9E=A7 Mountain View, CA =E2=9D=96
Actually, it's not that simple. You want linear layout of blocks you are going to read. That is not necessary a linear layout of blocks in a single file - trace sometime a start of some complicated app like KDE. You find it's seeking like a hell because it needs a few blocks from a ton of distinct files (shared libs, config files, etc). As these files are mostly read only, it's advantageous to interleave them on disk or at least keep them close. Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
At some point shouldn't the apps be fixed, rather than do crazy things with the filesystem? :) -Eric -
Yes :) That's basically what we told KDE developpers when they were complaining ;) But it's hard to fix it for them too (because of some desktop specs requiring lots of different text config files which can change anytime - don't ask me who designed it). Moreover for example for loading shared libraries from which you need just a few blocks scattered all over the place the problem is in ELF itself. I'll probably first write some userspace fs-reorganizer to find out how much these changes in layout are able to give you in performance (i.e. whether it's worth the effort of more complicated kernel online defragmenter). Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
Have tried profiling the read accesses and prereading them asynchronously on startup? That appears to have improved E17 a lot. See http://lca2007.linux.org.au/talk/101 (and watch the video). Jörn -- The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague. -- Edsger W. Dijkstra -
There are HPC workloads where you have multi writers on multiple machines that write to different parts of a file. You preferably want each of those regions in separate allocation groups. (Or tell the customers to use separate files for the regions..) -
Agreed on both points. The hints would be for things like start block, or speculative EOF preallocation, not contiguity, which I think should always be the goal. -Eric -
ISTR having had this discussion before ;) About guided preallocation for defrag: http://marc.info/?t=116247859500001&r=1&w=2 e.g.: The sorts of policies we need for effective use of preallocation: http://marc.info/?l=linux-fsdevel&m=116184475308164&w=2 http://marc.info/?l=linux-fsdevel&m=116278169519095&w=2 Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Hints & policies for allocation would certainly be useful, but I think they belong outside this interface. i.e. you could flag an inode for whatever allocation you choose, and -then- call posix_fallocate so that the allocator will take the hints you've given it. See also this blurb from the posix_fallocate definition: "It is implementation-defined whether a previous posix_fadvise() call influences allocation strategy." FWIW I don't see a lot of point in asking for "strict contiguous blocks" - the allocator will presumeably try to do this in any case, and I'm not sure when you would want to fail if you get more than one extent...? -Eric -
I'm fine with that too, I'd just like the functionality :) -Eric -
Would EINVAL (or whatever) make it back to the caller of posix_fallocate(), or would glibc fall back to its current implementation? Forgive me if I haven't put enough thought into it, but would it be useful to create a generic_fallocate() that writes zeroed pages for any non-existent pages in the range? I don't know how glibc currently implements posix_fallocate(), but maybe the kernel could do it more efficiently, even in generic code. Maybe we don't care, since the major file systems can probably do something better in their own code. -- David Kleikamp IBM Linux Technology Center -
On Thu, 01 Mar 2007 22:44:16 +0000 Given that glibc already implements fallocate for all filesystems, it will need to continue to do so for filesystems which don't implement this syscall - otherwise applications would start breaking. However with this kernel change, glibc will need to look at the errno, so that it can correctly propagate EIO, ENOSPC and whatever. So we will need to return a reliable and stable and sensible value so that glibc knows when it should emulate and when it should propagate. Perhaps Ulrich can comment. -
I didn't make it clear, but my point was to call generic_fallocate if the file system did not define i_op->allocate(). if (inode->i_op && inode->i_op->fallocate) ret = inode->i_op->fallocate(inode, offset, len); else ret = generic_fallocate(inode, offset, len); I'm not sure it's worth the effort, but I thought I'd throw the idea out there. -- David Kleikamp IBM Linux Technology Center -
Writing zeroes using glibc emu most likely means write() -- so generic_fallocate should be preferable (think splice). Or does glibc use mmap() and it's all different? Jan -- -
I was out of town, hence the delay. I think that if there is no support for the syscall the correct answer is to return ENOSYS. In this case the current userlevel code would be used and ENOSYS is also used to trigger the use of the compat code in glibc in case the syscall does not exist at all. --=20 =E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro St = =E2=9E=A7 Mountain View, CA =E2=9D=96
I'd be more happy to have the write out zeroes loop in glibc. And glibc needs to have it anyway, for older kernels. -
A generic_fallocate makes sense to me iff we can do it in the kernel more significantly more efficiently than in glibc, e.g. by using only a single page in page cache instead of one for each page to be preallocated. If glibc is smart enough to do an optimal implementation, I fully agree with you. Arnd <>< -
glibc cannot ever be smart enough because a file system driver will always know better and be able to do things in a much more optimized way. For example on NTFS fallocate() only needs to involve the setting of a few bits in the volume block allocation bitmap (one bit for each logical block being allocated) and update the extent map in the on- disk inode to reflect that those blocks are now allocated to the inode. Then it just needs to update the allocated size and optionally the data size (if fallocate wants to increase the file size rather than just the allocated size). And that is it. No zeroing needs to happen at all because we have not updated the initialized size of the inode! glibc can only dream of an implementation like this. (-; Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ -
Ok, that's not what I meant. It's obvious that the file system itself can do better than both VFS and glibc. The question is whether VFS can be better than glibc on file systems that don't offer their own implementation of the fallocate operation. Arnd <>< -
When you do it like this, who can the kernel/filesystem *guarantee* that when the data is written there actually is room on the harddrive? What you described seems like using truncate/ftruncate to increase the file's size. That is not at all what posix_fallocate is for. posix_fallocate must make sure that the requested blocks on the disk are reserved (allocated) for the file's use and that at no point in the future will, say, a msync() fail because a mmap(MAP_SHARED) page has been written to. --=20 =E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro St = =E2=9E=A7 Mountain View, CA =E2=9D=96
Hi, The blocks are allocated so of course it is guaranteed. Subsequent writes to this file will not generate any allocations thus No that is different. I described performing the allocations in the volume bitmap, i.e. for each allocated block the corresponding "in use" bit is set in the bitmap (NTFS uses a linear bitmap where byte 0, bit 0 == physical block 0 of volume, byte 0, bit 1 == physical block 1 of volume, ... byte 1, bit 0 == block 8 of volume, ...). Also I described updating the extent map of the inode such that it describes the physical blocks as belonging to the file, thus you would have "logical file block X corresponds to physical block Y on volume" entries entered into the extent map of the inode and they would describe the just allocated blocks. Finally I described updating the allocated size in the inode which basically says "there are that many bytes worth of blocks allocated to this inode". And optionally I described updating the data size in the inode which basically says "this file has size Z bytes". And I specifically did NOT update the initialized size in the inode thus it will remain at its old value thus all new allocated blocks will be considered as present but not initialized thus a read will always return zero whilst a write will do the right thing and pad with zeroes as necessary (if the write is smaller than the block size, etc). Note that you are right that this is like truncate in NTFS for non- sparse enabled inodes/volumes. But for sparse ones, instead of doing any allocations in the bitmap and entering them in the extent map, you would simply add a single entry to the extent map that says "X blocks allocated starting at logical block Y corresponding to no physical blocks, i.e. they are sparse". You would then also update the allocated size and data size as above and now you can even (but do not have to) update the initialized size to be equal to the data size as the file can ...
Anton, You're describing a method of doing in-advance preallocation where the filesystem format explicitly has support for this kind of feature in a way that doesn't require pre-zeroing the data blocks in question. The question which this subthread was concerned about was whether the kernel should get involved in initializing datablocks in the case where the filesystem format does not have this support, or whether this functionality should continue to be done in userspace. Given that glibc already has to support this for older kernels, I would argue that there's no point putting in generic support for filesystem that can't support a more advanced way of doing things. Regards, - Ted -
Yes, I understood that after I had sent my post... And yes, I would agree. If glibc already does this there does not appear to be any value in just moving existing functionality into the kernel. Simply let "dumb" file systems return ENOSYS and let glibc do it... And any FS which can do it better can implement the function and then glibc should not go anywhere near it. Best regards, Anton -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ -
Well, I'm sure the kernel can do better than the code we have in libc now. The kernel has access to the bitmasks which say which blocks have already been allocated. The libc code does not and we have to be very simple-minded and simply touch every block. And this means reading it and then writing it back. The kernel would know when the reading part is not necessary. Add to then the block granularity (we use f_bsize as returned from fstatfs but that's not the best value in some cases) and you have compelling data to have generic code in the kernel. Then libc implementation can then go away completely which is a good thing. --=20 =E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro St = =E2=9E=A7 Mountain View, CA =E2=9D=96
The layer of the kernel where a totally generic fallback would be implemented does not have access to this information. We could do a mostly generic helper for block filesystems that allows to implement fallocate this way without a lot of their own code. -
You have a very good point; indeed since we don't export an interface which allows userspace to determine whether or not a block is in use, that does mean a huge amount of churn in the page cache. So maybe it would be worth doing in the kernel as a result, although the libc implementation still wouldn't be able to go away for long time due to the need to be backwards compatible with older kernels that didn't have this support. Regards, - Ted -
It's better than that. If somebody compiles glibc to not run on older kernels at all (tested at runtime) then the code is dropped. E.g., the current Fedora glibc does not support 2.6.8 or earlier. So, don't let the compat code be a factor in the decision making. --=20 =E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro St = =E2=9E=A7 Mountain View, CA =E2=9D=96
That actually causes an interesting problem for compressing filesystems. The space consumed by blocks depends on their contents and how well it compresses. At the moment, the only option I see to support posix_fallocate for LogFS is to set an inode flag disabling compression, then allocate the blocks. But if the file already contains large amounts of compressed data, I have a problem. Disabling compression for a range within a file is not supported, so I can only return an error. But which one? Jörn -- A surrounded army must be given a way out. -- Sun Tzu -
Please read the thread again. That is not what anyone proposed. The issues we're discussing is whether fallback for a filesystem that does not support preallocation natively should be done in kernelspace or in userspace. -
We can't do that with the current page cache interfaces. But what might make sense is to have a block_dump_prealloc that takes a get_block callback to do what you propose. It still wouldn't be entirely generic, but would allow block based filesystems to do a not entirely dumb implementation. -
FYI the 32bit ppc ABI does too, from arch/powerpc/kernel/sys_ppc32.c: /* * long long munging: * The 32 bit ABI passes long longs in an odd even register pair. */ and the first argument in a function call is in r3. Anton -
One thing I'd like to see is a cmd argument as well, to allow for example allocation vs. reservation (i.e. allocating blocks vs. simply reserving a number), as well as the inverse of those functions (un-reservation, de-allocation)? If the allocation interface allows allocation/reservation within arbitrary ranges, if the only way to un-allocate is via a truncate, that's pretty asymmetric. -Eric -
I'd rather we just get the oft-discussed punch() syscall instead. This is really what "unallocate" would do for persistent allocations and it would be useful for files that were not preallocated. For filesystems that don't implement punch glibc() would do zero-filling of the punched area I guess (to make it equivalent to reading from a hole in the file). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
I can see a difference though. punch() would throw away written data as well as pre-allocated-but-never-written-to data. I can see where a user might preallocate a large file and do a lot of random writes. At some point, he decides the file isn't going to grow much more, so let's free up the remaining pre-allocated blocks. This makes even more sense with reservation. The alternative would be to have punch() take a flag to specify if only Or it could just fail. Writing zeroes may be really slow and not give the caller any benefit. (The intention was to free blocks back to the file system.) Shaggy -- David Kleikamp IBM Linux Technology Center -
I certainly agree that we want something like this. posix_fallocate() is the glibc interface we want to be compatible with (which your definition is, AFAICS). Jeff -
This would be great for Samba. Windows clients do this a lot.... Jeremy. -
You can only allocate space on typewriters? ;)
J
-
On Thu, 01 Mar 2007 13:14:32 -0800 A lot of people get confused about -ENOTTY, but it is the return for attempting to use an ioctl on the wrong type of object, so this appears to be quite correct. -
This is a syscall though; ENOSYS is probably a better match.
J
-
On Thu, 01 Mar 2007 14:05:36 -0800 ENOSYS indicates quite different things and ENOTTY is also used for syscalls. I still think ENOTTY is correct. -
Yes, ENOSYS tends to me "operation flat out not support" rather than
"not on this object". I think we can do better than ENOTTY though -
ENOTSUP for example (modulo the confusion over EOPNOTSUPP).
(You can tell the patch has very little real substance if we're arguing
over errnos at this point :)
J
-
Amit K. Arora wrote: Might want more error checking in there, something like (rough cut)... > + > + ret = -EINVAL; > + if (len == 0 || offset < 0) > + goto out; > + if (!(file->f_mode & FMODE_WRITE)) > + ret = -ESPIPE; > + if (S_ISFIFO(inode->i_mode)) > + goto out_fput; > + ret = -ENODEV; > + if (!S_ISREG(inode->i_mode)) > + goto out_fput; > + ret = -EFBIG; > + if (offset + len > inode->i_sb->s_maxbytes) which would keep things in line with posix_fallocate's specified errors, too? -Eric -
Yeah, we need to have this checks. We can't rely on userspace not passing arguments that might corrupt your filesystem or let you Yes, very good idea. -
Thanks a lot, this has been long overdue.
Please don't forget to Cc the XFS list to keep developers of the only
Linux filesystem supporting persistant allocations for a long time :)
Various people will beat you up for the above syscall as lots of
architectures really want 64bit arguments aligned in a proper way,
e.g. you at least need a pad after 'int fd'. Then again I already
have suggestions for filling up that slot with useful information:
- you really want a whence argument as to lseek, as it makes a lot
of sense for applications to allocate from the end of the file
or the current file positions. The existing XFS ioctl already
has this, and it's trivial to support this in any preallocation
implementation I could imagine.
- we should think about having a flag value for which kind of preallocation
we want. XFS currently has two:
ALLOCSP which updates the inode size and physically zeroes blocks
RESVSP which does not update inode size but creates and unwritten
extent
the current posix_fallocate semantics are somewhere in the middle, as
it requires and update to the inode size, but does not specify at
all what happens if you read from the newly allocated space.
And yes, as and heads up to developers implementing this feature
on new filesystems: don't just return new blocks, that's a gapping
This should use fget_light, and I'm sure the code could be written
in a slightly more readable:
asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
{
struct file *file = fget(fd);
ret = -EINVAL;
if (file)
struct inode *inode = file->f_path.dentry->d_inode;
if (inode->i_op && inode->i_op->fallocate)
ret = inode->i_op->fallocate(inode, offset, len);
else
ret = -ENOTTY;
fput(file);
}
return ret;
}
p.s. you reference ext4_fallocate in the patch but don't actually
introduce it, it definitively won't compile as-is :)
-
First of all, thanks for the overwhelming response! Based on the suggestions received, I have added a new parameter to the sys_fallocate() system call - an interger called "mode", just after the "fd". Now the system call looks like this: asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for preallocation and deallocation of preallocated blocks respectively. More modes can be added, when required. And these modes can be renamed, since I am sure these are no way the best ones ! :) Attached below is the patch which implements this system call. It has been currently implemented and tested on i386, ppc64 and x86_64 architectures. I am facing some problems while trying to implement this on s390, and thus the delay. While I try to get it right on s390(x), we thought of posting this patch, so that we can save some time. Parallely we will work on getting the patch work on s390, and probably it will come as a separate patch. ToDos: ===== Following is pending: 1> Implementation on other architectures (other than i386, x86_64 and ppc64) like s390(x) 2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented. 3> ext4 patches that support fallocate inode operation are ready. I plan to submit those separately to just ext4 mailing list. 4> Changes to glibc, so that posix_fallocate() and posix_fallocate64() call fallocate() system call 5> Changes to XFS to implement the fallocate inode operation Signed-off-by: Amit K Arora <aarora@in.ibm.com> --- arch/i386/kernel/syscall_table.S | 1 arch/x86_64/kernel/functionlist | 1 fs/open.c | 41 +++++++++++++++++++++++++++++++++++++++ include/asm-i386/unistd.h | 3 +- include/asm-powerpc/systbl.h | 1 include/asm-powerpc/unistd.h | 3 +- include/asm-x86_64/unistd.h | 4 ++- ...
What's the problem you face on s390? If it's just the compat wrapper, you may look at sys_sync_file_range_wrapper. Or I will send a patch if needed. -
Hi Heiko, Yes, the problem was adding compat wrapper for this. I will appreciate your help in writing it. Only thing is that we might have to wait till the order of the arguments is decided upon. Thanks! -- Regards, Amit Arora -
There is probably not much choice. If you want to stay with the loff_t arguments it won't work on 31-bit s390 or 32-bit powerpc dependent on the order of the arguments. So you should go for what Matthew Wilcox suggested: asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high, u32 len_low, u32 len_high); That way it will work an all architectures and in addition no architecture has to do some magic to combine the splitted 64 bit arguments in compat mode. -
There is something here that will not work on s390 (31bit): the arguments would end up in: fd -> r2 mode -> r3 offset -> r4 + r5 len -> r6 + second halve on stack But the s390 ABI says that a long long will be put into two consecutive registers if the first register is smaller than 6, or it will be put completely on the stack. So both 32 bit parts of len will end up on the stack. That would make it a syscall with seven arguments which we currently don't support on s390. There is no way to access the second half of len from kernel space and that is why it is not working for you. So you either rearrange the parameters or convert the loff_t's to pointers. e.g. asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len, int mode) would work even on s390 ;) -
... but wouldn't work on 32-bit powerpc. :( We would end up with a pad argument between fd and offset, giving 7 arguments in all (counting the loff_t's as 2), but we only support 6. Paul. -
Can't be. Or: mips supports 7 arguments and parisc doesn't pad. Otherwise they couldn't have wired up sys_sync_file_range(int fd, loff_t offset, loff_t nbytes, unsigned int flags) But from what I read, it's currently not possible for 32-bit powerpc to wire up the already present sync_file_range system call. -
32bit native is fine (as the ABI in user mode is the same as that in the kernel). For 32bit on a 64bit kernel you need the arch specific comapt routine that I used in the patch I posteda little while ago, -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/
Sorry, I take that back ... -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/
How about: asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high, u32 len_low, u32 len_high); That way we all suffer equally ... -
As suggested by you and Russel, I have made this change to the patch.
Here is how it looks like now. Please let me know if anyone has concerns
about passing arguments this way (breaking each "loff_t" into two "u32"s).
Signed-off-by: Amit K Arora <aarora@in.ibm.com>
---
arch/i386/kernel/syscall_table.S | 1
arch/x86_64/kernel/functionlist | 1
fs/open.c | 46 +++++++++++++++++++++++++++++++++++++++
include/asm-i386/unistd.h | 3 +-
include/asm-powerpc/systbl.h | 1
include/asm-powerpc/unistd.h | 3 +-
include/asm-x86_64/unistd.h | 4 ++-
include/linux/fs.h | 7 +++++
include/linux/syscalls.h | 2 +
9 files changed, 65 insertions(+), 3 deletions(-)
Index: linux-2.6.20.1/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.20.1.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.20.1/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_fallocate /* 320 */
Index: linux-2.6.20.1/fs/open.c
===================================================================
--- linux-2.6.20.1.orig/fs/open.c
+++ linux-2.6.20.1/fs/open.c
@@ -350,6 +350,52 @@ asmlinkage long sys_ftruncate64(unsigned
}
#endif
+asmlinkage long sys_fallocate(int fd, int mode, u32 off_low, u32 off_high,
+ u32 len_low, u32 len_high)
+{
+ struct file *file;
+ struct inode *inode;
+ loff_t offset, len;
+ long ret = -EINVAL;
+
+ offset = (off_high << 32) + off_low;
+ len = (len_high << 32) + len_low;
+
+ if (len == 0 || offset < 0)
+ goto out;
+
+ ret = -EBADF;
+ file = fget(fd);
+ if (!file)
+ goto out;
+ if (!(file->f_mode & FMODE_WRITE))
+ goto out_fput;
+
+ inode = file->f_path.dentry->d_inode;
+
+ ret = -ESPIPE;
+ if (S_ISFIFO(inode->i_mode))
+ goto out_fput;
+
+ ret = -ENODEV;
+ if (!S_ISREG(inode->i_mode))
+ goto ...I hate to comment at this late stage, especially on something that I think is really a great idea (I did similar more complex, sys_blkalloc with even more arguments time ago --- I'm glad given how complex this thread has become I didn't post them now). In the past there wasn't that much incentive to get this functionality exposed because of various other issues (mmap + page dirty didn't flush reliably) which are close to being resolve, so I think the timing of this is really great.... I really dislike breaking 64-bit args up unless it's necessary. I given there are the only TWO modes right now, why not leave the arguments as 64-bit sane and simply have two syscalls, one for each? -
Hello, We need to come up with the best possible layout of arguments for the fallocate() system call. Various architectures have different requirements for how the arguments should look like. Since the mail chain has become huge, here is the summary of various inputs received so far. Platform: s390 -------------- s390 prefers following layout: int fallocate(int fd, loff_t offset, loff_t len, int mode) For details on why and how "int, int, loff_t, loff_t" is a problem on s390, please see Heiko's mail on 16th March. Here is the link: http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg133595.html Platform: ppc, arm ------------------ ppc (32 bit) has a problem with "int, loff_t, loff_t, int" layout, since this will result in a pad between fd and offset, making seven arguments total - which is not supported by ppc32. It supports only 6 arguments. Thus the desired layout by ppc32 is: int fallocate(int fd, int mode, loff_t offset, loff_t len) Even ARM prefers above kind of layout. For details please see the definition of sys_arm_sync_file_range(). Option of loff_t => high u32 + low u32 -------------------------------------- Matthew and Russell have suggested another option of breaking each "loff_t" into two "u32"s. This will result in 6 arguments in total. Following think that this is a good alternative: Matthew Wilcox, Russell King, Heiko Carstens Following do not like this idea: Chris Wedgwood What are your thoughts on this ? What layout should we finalize on ? Perhaps, since sync_file_range() system call has similar arguments, we can take hint from the challenges faced on implementing it on various architectures, and decide. Please suggest. Thanks! -- Regards, Amit Arora -
Right now there are only two possible values for mode --- it's not
clear what additional values there will be in the future.
How about two syscalls? If we decide later on we need something more
complicated we can revisit this and *THEN* add another system call
which may end up being a superset of the other two.
I know that sounds somewhat icky but:
* it's fairly simple
* we get nice argument handling on all arches by dropping u32 mode
(don't we?)
* syscalls don't really cost a lot to keep about, they do cost in
terms on maintenance though, but in this case i don't see it being
all that much of a problem
* IMO badly/over designed syscalls are going to be a bigger problem
long term
Given that *NO* single fs in mainline right now can *reliably* use
this functionality for a while maybe whatever solution people come up
with next should sit in -mm for a while? At least that gives people
exposure to it and a chance to make some changes as once it's merged
to mainline it's pretty hard to change.
-
Hi,
Quoting that...
|len -> r6 + second halve on stack
Then, is not this a gcc glitch? (IMO, it should put all of "len" on the
Does it actually matter? Glibc can have its own argument ordering
different from the syscalls, so at least it would be possible to lay out
the syscall arguments in the most portable way while retaining nice
userspace C code. Hey, glibc might even wrap it up in a struct! (Using a
pointer, as suggested in one of the proposals.)
int fallocate(int fd, loff_t offset, loff_t len, int mode)
{
struct fallocate_foobar d = {fd, offset, len, mode};
return _syscall(..., &d);
}
Jan
--
-
I think it's always better to put only a pointer on the stack as above. Cheers, Dick Johnson Penguin : Linux version 2.6.16.24 on an i686 machine (5592.62 BogoMips). New book: http://www.AbominableFirebug.com/ _ **************************************************************** The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to DeliveryErrors@analogic.com - and destroy all copies of this information, including any attachments, without reading or disclosing them. Thank you. -
I have to disagree, since wrapping it into a struct and copying the struct in kernelspace from userspace requires more code. Pointers only become useful at 3 (rarely) or 4 (yeah, more likely) and 5+ (definitely) arguments, (3) see above about copying, (4) middle thing and (5) tons of arguments like mmap() should be wrapped up... for simplicity of dealing with it later. Jan -- -
Not just more code, but more security issues too. Passing system call arguments by value means that there are no subtle security issues - the value you use is the value you got. But once you pass-by-reference, you have to make damn sure that you do the proper user space accesses and verify the pointer correctly. User-space (aka "user-supplied") pointers are just more dangerous. We obviously can't avoid them, but they need much more care than just a random value directly passed in a register. Linus -
It _does_ put all of "len" on the stack. That is what I tried to explain in the section that follows what you quoted. -
This is a clean-looking option. Can s390 be changed to support seven-arg It's a bit weird-looking, but the six-32-bit-args approach is simple enought to understand and implement. Presumably the glibc wrapper -
Wouldn't int fallocate(loff_t offset, loff_t len, int fd, int mode) work on both s390 and ppc/arm? glibc will certainly wrap it and reorder the arguments as needed, so there is no need to keep fd first. Jakub -
This should work on all the platforms. The only concern I can think of
here is the convention being followed till now, where all the entities on
which the action has to be performed by the kernel (say fd, file/device
name, pid etc.) is the first argument of the system call. If we can live
with the small exception here, fine.
Or else, we may have to implement the
int fd, int mode, loff_t offset, loff_t len
as the layout of arguments here. I think only s390 will have a problem
with this, and we can think of a workaround for it (may be similar to
what ARM did to implement sync_file_range() system call) :
asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
{
return sys_fallocate(fd, offset, len, mode);
}
To me both the approaches look slightly unconventional. But, we need to
compromise somewhere to make things work on all the platforms.
Any thoughts on which one of the above should we finalize on ?
Thanks!
--
Regards,
Amit Arora
-
On Thu, Apr 05, 2007 at 04:56:19PM +0530, Amit K. Arora wrote: -- Regards, Amit Arora -
If s390 can work around the calling order that easily, I certainly --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** -
Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
I think more people are comfirtable with this approach. Since glibc will wrap the system call and export the "conventional" interface (with fd first) to applications, we may not worry about keeping fd first in kernel code. I am personally fine with this approach. Still, if people have major concerns, we can think of getting rid of the "mode" argument itself. Anyhow we may, in future, need to have a policy based system call (say, for providing the goal block by applications for performance reasons). "mode" can then be made part of it. Comments ? -- Regards, Amit Arora -
Really? I thought from the last postings that "fd first, wrap on s390" It would seem to make more sense to wrap the syscall on those architectures We need at least mode="unallocate" or a separate funallocate() call to allow allocated-but-unwritten blocks to be unallocated without actually punching out written data. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
Ok. In this case we may have to consider following things: 1) Obviously, for this glibc will have to call fallocate() syscall with different arguments on s390, than other archs. I think this should be doable and should not be an issue with glibc folks (right?). 2) we also need to see how strace behaves in this case. With little knowledge that I have of strace, I don't think it should depend on argument ordering of a system call on different archs (since it uses ptrace internally and that should take care of it). But, it will be nice if someone can confirm this. Thanks! -- Regards, Amit Arora -
glibc can cope with this easily, will just add sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override strace would solve this with #ifdef mess, it already does that in many places so guess another few lines don't make it significantly worse. Jakub -
I will work on the revised fallocate patchset and will post it soon. Thanks! -- Regards, Amit Arora -
Based on the discussion, this new patchset uses following as the interface for fallocate() system call: asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) It seems that only s390 architecture has a problem with such a layout of arguments in fallocate(). Thus for s390, we plan to have a wrapper (say, sys_s390_fallocate()) for the sys_fallocate(), which will get called by glibc when an application issues a fallocate() system call on s390. The s390 arch specific changes will be part of a separate patch (PATCH 2/5). It will be great if some s390 expert can verify the patch, since I have not been able to test it on s390 so far. It was also noted that minor changes might be required to strace code to take care of "different arguments on s390" issue. Currently we have two modes FA_ALLOCATE and FA_DEALLOCATE, for preallocation and deallocation of preallocated blocks respectively. More modes can be added, when required. ToDos: ===== 1> Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)) 2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented. 3> Changes to glibc, a) to support fallocate() system call b) so that posix_fallocate() and posix_fallocate64() call fallocate() system call 4> Changes to XFS to implement the fallocate inode operation Following patches follow: Patch 1/5 : fallocate() implementation in i86, x86_64 and powerpc Patch 2/5 : fallocate() on s390 Patch 3/5 : ext4: Extent overlap bugfix Patch 4/5 : ext4: fallocate support in ext4 Patch 5/5 : ext4: write support for preallocated blocks -- Regards, Amit Arora -
This patch implements the fallocate() system call and adds support for
i386, x86_64 and powerpc.
NOTE: It is based on 2.6.21 kernel version.
Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
arch/i386/kernel/syscall_table.S | 1
arch/powerpc/kernel/sys_ppc32.c | 7 ++++++
arch/x86_64/kernel/functionlist | 1
fs/open.c | 41 +++++++++++++++++++++++++++++++++++++++
include/asm-i386/unistd.h | 3 +-
include/asm-powerpc/systbl.h | 1
include/asm-powerpc/unistd.h | 3 +-
include/asm-x86_64/unistd.h | 4 ++-
include/linux/fs.h | 7 ++++++
include/linux/syscalls.h | 1
10 files changed, 66 insertions(+), 3 deletions(-)
Index: linux-2.6.21/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.21.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.21/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_fallocate /* 320 */
Index: linux-2.6.21/arch/x86_64/kernel/functionlist
===================================================================
--- linux-2.6.21.orig/arch/x86_64/kernel/functionlist
+++ linux-2.6.21/arch/x86_64/kernel/functionlist
@@ -931,6 +931,7 @@
*(.text.sys_getitimer)
*(.text.sys_getgroups)
*(.text.sys_ftruncate)
+*(.text.sys_fallocate)
*(.text.sysfs_lookup)
*(.text.sys_exit_group)
*(.text.stub_fork)
Index: linux-2.6.21/fs/open.c
===================================================================
--- linux-2.6.21.orig/fs/open.c
+++ linux-2.6.21/fs/open.c
@@ -350,6 +350,47 @@ asmlinkage long sys_ftruncate64(unsigned
}
#endif
+asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
+{
+ struct file *file;
+ struct inode *inode;
+ long ret = -EINVAL;
+
+ if (len == 0 || offset < 0)
+ goto out;
+
+ ret = -EBADF;
+ file = fget(fd);
+ if (!file)
+ goto ...Please add a comment over this function which specifies its behaviour. Really it should be enough material from which a full manpage can be written. If that's all too much, this material should at least be spelled out in the changelog. Because there's no way in which this change can be fully reviewed unless someone (ie: you) tells us what it is setting out to achieve. If we 100% implement some standard then a URL for what we claim to implement would suffice. Given that we're at least using different types from posix I doubt if such a thing would be sufficient. And given the complexity and potential variability within the filesystem implementations of this, I'd expect that _something_ additional needs to be The posix spec implies that negative `len' is permitted - presumably "allocate So we return ENODEV against an S_ISBLK fd, as per the posix spec. That This code does handle offset+len going negative, but only by accident, I suspect. It happens that s_maxbytes has unsigned type. Perhaps a comment If we _are_ going to support negative `len', as posix suggests, I think we should perform the appropriate sanity conversions to `offset' and `len' right here, rather than expecting each filesystem to do it. Now those aren't in posix. They should be documented, along with their I really do think it's better to put the variable names in definitions such as this. Especially when we have two identically-typed variables next to each other like that. Quick: which one is the offset and which is the length? -
But it doesn't handle offset+len wrapping through zero. -
This looks like it will have the same problem on s390 as sys_sync_file_range. Maybe the prototype should be: asmlinkage long sys_fallocate(loff_t offset, loff_t len, int fd, int mode) Paul. -
Yes, but the trouble is that there was a contrary viewpoint preferring that fd first be maintained as a convention like other syscalls (see the following posts) http://marc.info/?l=linux-fsdevel&m=117585330016809&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117690157917378&w=2 (Andreas) http://marc.info/?l=linux-fsdevel&m=117578821827323&w=2 (Randy) So we are kind of deadlocked, aren't we ? The debates on the proposed solution for s390 http://marc.info/?l=linux-fsdevel&m=117760995610639&w=2 http://marc.info/?l=linux-fsdevel&m=117708124913098&w=2 http://marc.info/?l=linux-fsdevel&m=117767607229807&w=2 Are there any better ideas ? Regards -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India -
Of course the interface used by an application program would have the fd first. Glibc can do the translation. Paul. -
I think that was understood. Regards -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India -
OK, then what does it matter what the glibc/kernel interface is, as long as it works? It's only a minor point; the order of arguments can vary between architectures if necessary, but it's nicer if they don't have to. 32-bit powerpc will need to have the two int arguments adjacent in order to avoid using more than 6 argument registers at the user/kernel boundary, and s390 will need to avoid having a 64-bit argument last (if I understand it correctly). Paul. -
Ah, almost but not quite the point. But I admit it is hard to understand.. The trouble started with the futex call which has been the first system call with 6 arguments. s390 supported only 5 arguments up to that point (%r2 - %r6). For futex we added a wrapper to the glibc that loaded the 6th argument to %r7. In entry.S we set up things so that %r7 gets stored to the kernel stack where normal C code expects the first overflow argument. This enabled us to use the standard futex system call with 6 arguments. fallocate now has an additional problem: the last argument is a 64 bit integers AND registers %r2-%r5 are already used. In this case the 64 bit number would have to be split into the high part in %r6 and the low part on the stack so that the glibc wrapper can load the low part to %r7. But the C compiler will skip %r6 and store the 64 bit number on the stack. If the order of the arguments if modified so that %r6 is assigned to a 32-bit argument, then the entry.S magic with %r7 would work. -- blue skies, Martin -
You are right to say that. But, it may not be _that_ a minor point, especially for the arch which is getting affected. It has other implications like what Heiko noticed in his post below: http://lkml.org/lkml/2007/4/27/377 - implications like modifying glibc and *trace utilities for a particular arch. -- Regards, Amit Arora -
I just checked the man page for posix_fallocate() and it says:
EINVAL offset or len was less than zero.
Hmmmm - I thought that the intention of sys_fallocate() was to
be generic enough to eventually allow preallocation on directories.
If that is the case, then this check will prevent that....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
Yes, I think so. I'm suspecting that http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html is just buggy. Or I can't read. I mean, if we're going to support negative `len' then is the byte at `offset' inside or outside the segment? Head spins. However it would be neat if someone could test $OTHER_OS and, perhaps more importantly, the present glibc emulation (which I assume your manpage is The above opengroup page only permits S_ISREG. Preallocating directories sounds quite useful to me, although it's something which would be pretty hard to emulate if the FS doesn't support it. And there's a decent case to be made for emulating it - run-anywhere reasons. Does glibc emulation support directories? Quite unlikely. But yes, sounds like a desirable thing. Would XFS support it easily if the above check was relaxed? -
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
struct stat64 st;
struct statfs f;
/* `off_t' is a signed type. Therefore we can determine whether
OFFSET + LEN is too large if it is a negative value. */
if (offset < 0 || len < 0)
return EINVAL;
if (offset + len < 0)
return EFBIG;
/* First thing we have to make sure is that this is really a regular
file. */
if (__fxstat64 (_STAT_VER, fd, &st) != 0)
return EBADF;
if (S_ISFIFO (st.st_mode))
return ESPIPE;
if (! S_ISREG (st.st_mode))
return ENODEV;
if (len == 0)
{
if (st.st_size < offset)
{
int ret = __ftruncate (fd, offset);
if (ret != 0)
ret = errno;
return ret;
}
return 0;
}
...
is what glibc does ATM. Seems we violate the case where len == 0, as
EINVAL in that case is "shall fail". But reading the standard to imply
negative len is ok is too much guessing, there is no word what it means
when len is negative and
"required storage for regular file data starting at offset and continuing for len bytes"
doesn't make sense for negative size.
And given the general
"Implementations may support additional errors not included in this list,
may generate errors included in this list under circumstances other than
those described here, or may contain extensions or limitations that prevent
some errors from occurring."
I believe returning EINVAL for len < 0 is not a POSIX violation.
That doesn't mean the standard shouldn't be clarified, whether by saying
EINVAL must be returned for non-positive len or saying that using negative
No, see above.
Jakub
-
This wording has already been cleaned up. The current draft for the next revision reads: [EINVAL] The len argument is less than or equal to zero, or the offset argument is less than zero, or the underlying file system does not support this operation. I still don't like it since len==0 shouldn't create an error (it's inconsistent) but len<0 is already outlawed. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -
I don't think we should care. If we provide a syscall with the semantics of "allocate from offset to offset+len" then glibc's implementation can turn negative length into two separate No - right now empty blocks are pruned from the directory immediately so I don't think we really have a concept of empty blocks in the btree structure. dir2 is bloody complex, so adding preallocation is probably not going to be simple to do. It's not high on my list to add, either, because we can typically avoid the worst case directory fragmentation by using larger directory block sizes (e.g. 16k instead of the default 4k on a 4k block size fs). IIRC directory preallocation has been talked about more for ext3/4.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
I think we may relax the check here and let the individual file system decide if they support preallocation for directories or not. What do you think ? One thing to be thought in this case is the error code which should be returned by the file system implementation, incase it doesn't support preallocation for directories. Should it be -ENODEV (to match with what posix says) , or something else (which might make more sense in this case) ? -- Regards, Amit Arora -
Andrew, Thanks for the review comments! I think we should go ahead with current glibc implementation (which Jakub poited at) of not allowing a negative 'len', since posix also Will add a check for negative 'len' and return -EINVAL. This will be done where currently we check for negative offset (i.e. at the start of Ok. Will add the variable names here. -- Regards, Amit Arora -
I have the updated patches ready which take care of Andrew's comments.
Will run some tests and post them soon.
But, before submitting these patches, I think it will be better to finalize
on certain things which might be worth some discussion here:
1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the
file size in this case. I also tend to agree with them. Does anyone
has an argument in favor of changing the filesize ?
If not, I will remove the code which changes the filesize, before I
resubmit the concerned ext4 patch.
2) For FA_UNALLOCATE mode, should the file system allow unallocation
of normal (non-preallocated) blocks (blocks allocated via
regular write/truncate operations) also (i.e. work as punch()) ?
- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
we need to finalize on the convention here as a general guideline
to all the filesystems that implement fallocate.
3) If above is true, the file size will need to be changed
for "unallocation" when block holding the EOF gets unallocated.
- If we do not "unallocate" normal (non-preallocated) blocks and we
do not change the file size on preallocation, then this is a
non-issue.
4) Should we update mtime & ctime on a successfull allocation/
unallocation ?
- David Chinner raised this question in following post:
http://lkml.org/lkml/2007/4/29/407
I think it makes sense to update the [mc]time for a successfull
preallocation/unallocation. Does anyone feel otherwise ?
It will be interesting to know how XFS behaves currently. Does XFS
update [mc]time for preallocation ?
--
Regards,
Amit Arora
-
I would only allow this on FA_ALLOCATE extents. That means it won't be possible to do this for filesystems that don't understand unwritten Not necessarily. That will just make the file sparse. If FA_ALLOCATE I would say yes. If glibc does the fallback fallocate via write() the mtime/ctime will be updated, so it makes sense to be consistent for both methods. Also, it just makes sense from the "this file was modified" point of view. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
If we chose not to update the file size beyong EOF, then for filesystem without fallocate() support (ext2,3 currently), posix_fallocate() will follow the hard way(zero-out) to do preallocation. Then we will get different behavior on filesystems w/o fallocate() support. It make sense to be consistent, IMO. My point of view, preallocation is just a efficient way to allocating blocks for files without zero-out, other than this, the new behavior should be consistent with the old way: file size update,mtime/ctime, ENOSPC etc. Mingming -
I think there needs to be both. If we don't have a mechanism to atomically change the file size with the preallocation, then applications that use stat() to work out if they need to preallocate Yes. That is the current XFS implementation for XFS_IOC_UNRESVSP, and No - we punch a hole. If you want the filesize to change, then you use ftruncate() to remove the blocks at EOF and change the No, XFS does *not* update a/m/ctime on prealloc/punch unless the file size changes. If the filesize changes, it behaves exactly the same way that ftruncate() behaves. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
By "both" above, do you mean we should give user the flexibility if it wants the filesize changed or not ? It can be done by having *two* modes for preallocation in the system call - say FA_PREALLOCATE and FA_ALLOCATE. If we use FA_PREALLOCATE mode, fallocate() will allocate blocks, but will not change the filesize and [cm]time. If FA_ALLOCATE mode is used, fallocate() will change the filesize if required (i.e. when allocation is beyond EOF) and also update [cm]time. This way, the application can decide what it wants. This will be helpfull for the partial allocation scenario also. Think of the case when we do not change the filesize in fallocate() and expect applications/posix_fallocate() to do ftruncate() after fallocate() for this. Now if fallocate() results in a partial allocation with -ENOSPC error returned, applications/posix_fallocate() will not know for what length ftruncate() has to be called. :( Hence it may be a good idea to give user the flexibility if it wants to atomically change the file size with preallocation or not. But, with more flexibility there comes inconsistency in behavior, which is worth Ok. But, some people may not expect/like this. I think, we can keep it Having additional mode (of FA_PREALLOCATE) might help here too. Please see above. -- Regards, Amit Arora -
Well, posix_fallocate() either gets all the space or it fails. If you truncate to extend the file size after an ENOSPC, then that is a buggy implementation. The same could be said for any application, or even the fallocate() call itself if it changes the filesize without having completely We've got different modes to specify different behaviour. That's what the mode field was put there for in the first place - the interface is *designed* to support different preallocation How can it be a "backburner" issue when it defines the implementation? I've already implemented some thing in XFS that sort of does what I think that the interface is supposed to do, but I need that interface to be nailed down before proceeding any further. All I'm really interested in right now is that the fallocate _interface_ can be used as a *complete replacement* for the pre-existing XFS-specific ioctls that are already used by applications. What ext4 can or can't do right now is irrelevant to this discussion - the interface definition needs to take priority over implementation.... Cheers, Dave, -- Dave Chinner Principal Engineer SGI Australian Software Group -
Would you like to write up an interface definition description (likely man page) and post it for review, possibly with a mention of apps using it today ? One reason for introducing the mode parameter was to allow the interface to evolve incrementally as more options / semantic questions are proposed, so that we don't have to make all the decisions right now. So it would be good to start with a *minimal* definition, even just one mode. The rest could follow as subsequent patches, each being reviewed and debated separately. Otherwise this discussion can drag on for a long time. Regards -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India -
Yeah, I started doing that yesterday as i figured it was the only way Minimal definition to replace what applicaitons use on XFS and to support poasix_fallocate are the thre that have been mentioned so far (FA_ALLOCATE, FA_PREALLOCATE, FA_DEALLOCATE). I'll document them all in a man page... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Hi Dave, Did you get time to write the above man page ? It will help to push further patches in time (eg. for FA_PREALLOCATE mode). The idea I had was to push the patch with bare minimum functionality (FA_ALLOCATE and FA_DEALLOCATE modes) and parallely finalize on other new mode(s) based on the man page you planned to provide. Thanks! -- Regards, -
No, I didn't. Instead of working on new preallocation stuff, I've been spending all my time fixing bugs found by new and interesting Push them. I'll just make XFS work with whatever is provided. Is there a test harness for the syscall yet? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
First pass is attached. `nroff -man fallocate.2 | less` to view. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
So this is essentially the same as "punch". There doesn't seem to be
a mechanism to only unallocate unused FA_{PRE,}ALLOCATE space at the
This also seems to be a bit of a wart, since it isn't a natural converse
of either of the above functions. How about having two modes,
similar to FA_ALLOCATE and FA_PREALLOCATE? Say, FA_PUNCH (which
would be as you describe here - deletes all data in the specified
range changing the file size if it overlaps EOF, and FA_DEALLOCATE,
which only deallocates unused FA_{PRE,}ALLOCATE space?
We might also consider making @mode be a mask instead of an enumeration:
FA_FL_DEALLOC 0x01 (default allocate)
FA_FL_KEEP_SIZE 0x02 (default extend/shrink size)
FA_FL_DEL_DATA 0x04 (default keep written data on DEALLOC)
We might then build FA_ALLOCATE and FA_DEALLOCATE out of these flags
without making the interface sub-optimal.
I suppose it might be a bit late in the game to add a "goal"
parameter and e.g. FA_FL_REQUIRE_GOAL, FA_FL_NEAR_GOAL, etc to make
the API more suitable for XFS? The goal could be a single __u64, or
a struct with e.g. __u64 byte offset (possibly also __u32 lun like
in FIEMAP). I guess the one potential limitation here is the
Should probably say whether space is removed on failure or not. In
some (primitive) implementations it might no longer be possible to
distinguish between unwritten extents and zero-filled blocks, though
at this point DEALLOC of zero-filled blocks might not be harmful either.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
-
<shrug> Punch means different things to different people. To me (and probably most XFS aware ppl) punch implies no change to the file size. i.e. anyone curently using XFS_IOC_UNRESVSP will expect punching holes to leave the file size unchanged. This is the behaviour I That's an "unwritten-to-hole" extent conversion. Is that really useful for anything? That's easily implemented with FIEMAP and FA_DEALLOCATE. Anyway, because we can't agree on a single pair of flags: FA_ALLOCATE == posix_fallocate() FA_DEALLOCATE == unwritten-to-hole ??? FA_RESV_SPACE == XFS_IOC_RESVSP64 i.e: #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE It would suffice for the simpler operations, I think, but we'll rapidly run out of flags and we'll still need another interface Right. I'd say on error you need to FA_DEALLOCATE to ensure any space allocated was freed back up. That way the error handling in the allocate functions is much simpler (i.e. no need to undo there). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
No, that will delete written data also. What I'm thinking is in cases where an application does fallocate() to reserve a lot of space, and If "punch" does not change the file size, how is it possible to determine the end of the actual written data? Say you have a file with records in it, and these records are cancelled as they are processed (e.g. a journal of sorts). One usage model for punch() that we had in the past is to punch out each record after it finishes processing, so that it will not be re-processed after a crash. If the file size doesn't change with punch then there is no way to know when the last record is hit and the But why force the application to do this instead of making the I'd think this makes sense, being natural opposites of each other. FA_ALLOCATE doesn't overwrite existing data with zeros, so FA_DEALLOCATE shouldn't erase existing data. If FA_ALLOCATE extends the file size, OK, this makes the semantics of XFS_IOC_RESVSP64 and XFS_IOC_UNRESVSP64 clear at least. The benefit is that it would also be possible (I'm not necessarily advocating this as a flag, just an example) to have semantics that are like XFS_IOC_ALLOCSP64 (zeroing written data while preallocating) with: #define FA_ZERO_SPACE FA_DEL_DATA or whatever semantics the caller actually wants, instead of restricting them to the subset of combinations given by FA_ALLOCATE and FA_DEALLOCATE Hmm, another flag? FA_FL_FREE_ENOSPC? I can imagine applications like PVRs to want to preallocate, say, an estimated 30 min of space for a show but if they only get 25 min of space returned they know some cleanup is in order (which can be done asynchronously while the show is filling the first 25 min of preallocated space). Otherwise, they have to loop in userspace trying decreasing preallocations until they fit, or starting small and incrementally preallocating space until they get an error. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
N O T E:
-------
1) Only Patches 4/7 and 7/7 are NEW. Rest of them are _already_ part
of ext4 patch queue git tree hosted by Ted.
2) The above new patches (4/7 and 7/7) are based on the dicussion
between Andreas Dilger and David Chinner on the mode argument,
when later posted a man page on fallocate.
3) All of these patches are based on 2.6.22-rc4 kernel and apply to
2.6.22-rc5 too (with some successfull hunks, though - since the
ext4 patch queue git tree has some other patches as well before
fallocate patches in the patch series).
Changelog:
---------
Changes from Take4 to Take5:
1) New Patch 4/7 implements new flags and values for mode
argument of fallocate system call.
2) New Patch 7/7 implements 2 (out of 4) modes in ext4.
Implementation of rest of the (two) modes is yet to be done.
3) Updated the interface description below to mention new modes
being supported.
4) Removed "extent overlap check" bugfix (patch 4/6 in TAKE4,
since it is now part of mainline.
5) Corrected format of couple of multi-line comments, which got
missed in earlier take.
Changes from Take2 to Take3:
1) Return type is now described in the interface description
above.
2) Patches rebased to 2.6.22-rc1 kernel.
** Each post will have an individual changelog for a particular patch.
Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.
Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.
Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the ...This patch implements sys_fallocate() and adds support on i386, x86_64
and powerpc platforms.
Changelog:
---------
Changes from Take3 to Take4:
1) Do not update c/mtime. Let each filesystem update ctime (update of
mtime will not be required for allocation since we touch only
metadata/inode and not blocks), if required.
Changes from Take2 to Take3:
1) Patches now based on 2.6.22-rc1 kernel.
Changes from Take1(initial post on 26th April, 2007) to Take2:
1) Added description before sys_fallocate() definition.
2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to,
posix_fallocate should return EINVAL for len <= 0.
3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE
4) Do not return ENODEV for dirs (let individual file systems decide if
they want to support preallocation to directories or not.
5) Check for wrap through zero.
6) Update c/mtime if fallocate() succeeds.
7) Added mode descriptions in fs.h
8) Added variable names to function definition (fallocate inode op)
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Index: linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/i386/kernel/syscall_table.S
+++ linux-2.6.22-rc4/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
.long sys_signalfd
.long sys_timerfd
.long sys_eventfd
+ .long sys_fallocate
Index: linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
===================================================================
--- linux-2.6.22-rc4.orig/arch/powerpc/kernel/sys_ppc32.c
+++ linux-2.6.22-rc4/arch/powerpc/kernel/sys_ppc32.c
@@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(con
return sys_truncate(path, (high << 32) | low);
}
+asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo,
+ u32 lenhi, u32 lenlo)
+{
+ return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo,
+ ...Btw. this is also (still?) broken. x86_64 needs a compat syscall here. -
This is the patch suggested by Martin Schwidefsky to support
sys_fallocate() on s390(x) platform.
He also suggested a wrapper in glibc to handle this system call on
s390. Posting it here so that we get feedback for this too.
.globl __fallocate
ENTRY(__fallocate)
stm %r6,%r7,28(%r15) /* save %r6/%r7 on stack */
cfi_offset (%r7, -68)
cfi_offset (%r6, -72)
lm %r6,%r7,96(%r15) /* load loff_t len from stack */
svc SYS_ify(fallocate)
lm %r6,%r7,28(%r15) /* restore %r6/%r7 from stack */
br %r14
PSEUDO_END(__fallocate)
Here are the comments and the patch to linux kernel from him.
-------------
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with the arguments of this system call.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Index: linux-2.6.22-rc4/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.22-rc4.orig/arch/s390/kernel/compat_wrapper.S 2007-06-11 16:16:01.000000000 -0700
+++ linux-2.6.22-rc4/arch/s390/kernel/compat_wrapper.S 2007-06-11 16:27:29.000000000 -0700
@@ -1683,6 +1683,16 @@
llgtr %r3,%r3 # struct compat_timeval *
jg compat_sys_utimes
+ .globl sys_fallocate_wrapper
+sys_fallocate_wrapper:
+ lgfr %r2,%r2 # int
+ lgfr %r3,%r3 # int
+ sllg %r4,%r4,32 # get high word of 64bit loff_t
+ lr %r4,%r5 # get low word of 64bit loff_t
+ sllg %r5,%r6,32 # get high word of 64bit loff_t
+ l %r5,164(%r15) # get low word of 64bit loff_t
+ jg sys_fallocate
+
.globl compat_sys_utimensat_wrapper
compat_sys_utimensat_wrapper:
llgfr %r2,%r2 # unsigned int
Index: linux-2.6.22-rc4/arch/s390/kernel/sys_s390.c
===================================================================
--- ...You need to remove the NI_SYSCALL line. Otherwise all following entries Erm... no. You use slot 314 in the syscall table but assign number 319. That won't work. Please use 314 for both. I assume this got broken when updating to newer kernel versions. -
fallocate() on ia64 ia64 fallocate syscall support. Signed-off-by: Dave Chinner <dgc@sgi.com> Index: linux-2.6.22-rc4/arch/ia64/kernel/entry.S =================================================================== --- linux-2.6.22-rc4.orig/arch/ia64/kernel/entry.S 2007-06-11 17:22:15.000000000 -0700 +++ linux-2.6.22-rc4/arch/ia64/kernel/entry.S 2007-06-11 17:30:37.000000000 -0700 @@ -1588,5 +1588,6 @@ data8 sys_signalfd data8 sys_timerfd data8 sys_eventfd + data8 sys_fallocate // 1310 .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: linux-2.6.22-rc4/include/asm-ia64/unistd.h =================================================================== --- linux-2.6.22-rc4.orig/include/asm-ia64/unistd.h 2007-06-11 17:22:15.000000000 -0700 +++ linux-2.6.22-rc4/include/asm-ia64/unistd.h 2007-06-11 17:30:37.000000000 -0700 @@ -299,11 +299,12 @@ #define __NR_signalfd 1307 #define __NR_timerfd 1308 #define __NR_eventfd 1309 +#define __NR_fallocate 1310 #ifdef __KERNEL__ -#define NR_syscalls 286 /* length of syscall table */ +#define NR_syscalls 287 /* length of syscall table */ /* * The following defines stop scripts/checksyscalls.sh from complaining about -
Implement new flags and values for mode argument. This patch implements the new flags and values for the "mode" argument of the fallocate system call. It is based on the discussion between Andreas Dilger and David Chinner on the man page proposed (by the later) on fallocate. Signed-off-by: Amit Arora <aarora@in.ibm.com> Index: linux-2.6.22-rc4/include/linux/fs.h =================================================================== --- linux-2.6.22-rc4.orig/include/linux/fs.h +++ linux-2.6.22-rc4/include/linux/fs.h @@ -267,15 +267,16 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WAIT_AFTER 4 /* - * sys_fallocate modes - * Currently sys_fallocate supports two modes: - * FA_ALLOCATE : This is the preallocate mode, using which an application/user - * may request (pre)allocation of blocks. - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free - * the preallocated blocks. + * sys_fallocate mode flags and values */ -#define FA_ALLOCATE 0x1 -#define FA_DEALLOCATE 0x2 +#define FA_FL_DEALLOC 0x01 /* default is allocate */ +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */ +#define FA_FL_DEL_DATA 0x04 /* default is keep written data on DEALLOC */ + +#define FA_ALLOCATE 0 +#define FA_DEALLOCATE FA_FL_DEALLOC +#define FA_RESV_SPACE FA_FL_KEEP_SIZE +#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) #ifdef __KERNEL__ Index: linux-2.6.22-rc4/fs/open.c =================================================================== --- linux-2.6.22-rc4.orig/fs/open.c +++ linux-2.6.22-rc4/fs/open.c @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned * sys_fallocate - preallocate blocks or free preallocated blocks * @fd: the file descriptor * @mode: mode specifies if fallocate should preallocate blocks OR free - * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and - * FA_DEALLOCATE modes are supported. + * (unallocate) preallocated blocks. * @offset: The offset ...
I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323 post. If it is decided that these flags are also needed, I will update this patch. Thanks! -
Can you clarify - what is the current behaviour when ENOSPC (or some other error) is hit? Does it keep the current fallocate() or does it free it? For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we don't want to expose uninitialized disk blocks to userspace. I'm not Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
Currently it is left on the file system implementation. In ext4, we do not undo preallocation if some error (say, ENOSPC) is hit. Hence it may end up with partial (pre)allocation. This is inline with dd and I don't think we need to make it default - atleast for filesystems which have a mechanism to distinguish preallocated blocks from "regular" ones. In ext4, for example, we will have a way to mark uninitialized extents. All the preallocated blocks will be part of these uninitialized extents. And any read on these extents will treat them as a hole, returning zeroes to user land. Thus any existing data on uninitialized blocks will not be exposed to the userspace. -- Regards, Amit Arora -
Since I believe the XFS allocation ioctls do it the opposite way (free preallocated space on error) this should be encoded into the flags. What I mean is that any data read from the file should have the "appearance" of being zeroed (whether zeroes are actually written to disk or not). What I _think_ David is proposing is to allow fallocate() to return without marking the blocks even "uninitialized" and subsequent reads would return the old data from the disk. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
Ok, got your point. Maybe we can have a flag for this, as you suggested. But, default behavior IMHO should be _not_ to undo partial allocation (thus the file system will have the option of supporting this flag or not and it will be inline with posix_fallocate; XFS will obviously I can't think of a good reason for this (i.e. returning stale data from preallocated blocks). It is infact a security issue to me. Anyhow, this may though be beneficial for file systems which have noticable overhead in marking the blocks "uninitialized/preallocated". Can you or David please throw some light on how this option might really be helpful ? Thanks! -- Regards, Amit Arora -
No, XFs does not free preallocated space on error. it is up to the Correct, but for swap files that's not an issue - no user should be able too read them, and FA_MKSWAP would really need root privileges to execute. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Since XFS also does not free preallocated space on error and this behavior is inline with dd, posix_fallocate() and the current ext4 Will the FA_MKSWAP mode still be required with your suggested change of teaching do_mpage_readpage() about unwritten extents being in place ? Or, will you still like to have FA_MKSWAP mode ? -- Regards, Amit Arora -
There's no need for a MKSWAP flag. cheers. -- Nathan -
budgie:/mnt/test # xfs_io -f -c "resvsp 0 1048576" -c "truncate 1048576" swap_file
budgie:/mnt/test # mkswap swap_file
Setting up swapspace version 1, size = 1032 kB
budgie:/mnt/test # swapon -v swap_file
swapon on swap_file
budgie:/mnt/test # swapon -s
Filename Type Size Used Priority
/dev/sda2 partition 9437152 0 -1
/mnt/test/swap_file file 992 0 -2
budgie:/mnt/test # xfs_bmap -vvp swap_file
swap_file:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..31]: 96..127 0 (96..127) 32
1: [32..2047]: 128..2143 0 (128..2143) 2016 10000
FLAG Values:
010000 Unwritten preallocated extent
001000 Doesn't begin on stripe unit
000100 Doesn't end on stripe unit
000010 Doesn't begin on stripe width
000001 Doesn't end on stripe width
Looks like the changes work, so FA_MKSWAP is not necessary for XFS.
We can drop that for the moment unless anyone else sees a need for it.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
I can't find anything in the specification of posix_fallocate (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html) that tells what should happen to allocate blocks on error. But common sense would be to not leak disk space on failure of this syscall, and this definitively should not be left up to the filesystem, either we always leak it or always free it, and I'd strongly favour This is the xfs unwritten extent behaviour. But anyway, the important bit is uninitialized blocks should never ever leak to userspace, so there is not need for the flag. -
I definitely agree that the behaviour should be specified part of
the interface. The current behaviour of both ext4 and XFS is that the
successful part of the unallocated extent is left in place when returning
ENOSPC so we considered this the "consistent" behaviour. This is the same
as e.g. sys_write() which does not remove the part of the write that was
successful if ENOSPC is hit. I think this also makes sense for some usa
cases, because application like PVR may want to preallocate approximately
30min of space, but if it gets only 25min worth then it can at least start
using this while it also begins looking for and/or freeing old files.
If the space is always freed on ENOSPC, then there may be a significant
amount of work done and undone while the application is iterating over
possible sizes until one works. It is easy for the application to
use fstat() to see the blocks/size actually preallocated on failure, and
explicitly request unallocation of this space if the outcome is undesirable.
If you think that applications have a strong preference for both kinds
of behaviour (e.g. database which requires the full allocation to succeed,
I agree that we shouldn't need FA_ZERO_SPACE. If an application wants
explicit zeros written to disk it can just do this with O_DIRECT writes
I don't think the current @mode flags introduce any significant complexity
in the implementation, and in fact one of the reasons these came up in the
first place was because David pointed out the XFS behaviour did NOT match
with posix_fallocate() and we started getting strange semantics enforced
by monolithic modes. IMHO, coding for and understanding the semantics of
the monolithic modes is much more complex and less useful than the explicit
flags.
The @mode flags that are currently under consideration are (AFAIK):
FA_FL_DEALLOC 0x01 /* deallocate unwritten extent (default allocate) */
FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
FA_FL_DEL_DATA 0x04 /* ...We now have two sets of flags - 1) the above three with which I think no one has any issues with, and 2) the ones below, for which we need some discussions before finalizing on them. I will prefer fallocate going in mainline with the above three modes, and rest of the modes can be debated upon and discussed parallely. And, each new mode/flag can be pushed as a separate patch. This will not hold fallocate feature indefinitely... Please confirm if you find this approach ok. Otherwise, please object. -- Regards, Amit Arora -
Yes, I do. FA_FL_DEL_DATA is plain stupid, a preallocation call should never delete data. FA_FL_DEALLOC should probably be a separate syscall because it's very different functionality. While we're at it I also dislike the FA_ prefix becuase it doesn't say NACK on this one. We should have just one behaviour, and from the thread NACK to these aswell. If i_size changes c/mtime need updates, if the size doesn't chamge they don't. No need to add more flags for this. -
Well, if you see the modes proposed using above flags : #define FA_ALLOCATE 0 #define FA_DEALLOCATE FA_FL_DEALLOC #define FA_RESV_SPACE FA_FL_KEEP_SIZE #define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA) FA_FL_DEL_DATA is _not_ being used for preallocation. We have two modes for preallocation FA_ALLOCATE and FA_RESV_SPACE, which do not use this flag. Hence prealloction will never delete data. This mode is required only for FA_UNRESV_SPACE, which is a deallocation mode, to support any existing XFS aware applications/usage-scenarios. And, regarding FA_FL_DEALLOC being a separate syscall - I think then the very purpose of @mode argument is not justified. We have this mode so that we can provide more features like this. That said, I don't say that we should make things very complicated; but, atleast we should provide some basic features which we expect most of the applications wanting preallocation to use. To start with, we need to cater to already existing applications/user base who use XFS preallocation feature. And further advanced features, like goal based preallocation, can be This requirement was from the point of view of HSM applications. Hope you saw Andreas previous post and are keeping that in mind. -- Regards, Amit Arora -
We use this capability in XFS at the moment. I think this is mainly for DMF (HSM) but is done via the xfs handle interface (xfs_open_by_handle) AFAICT. This sets up a set of invisible operations (xfs_invis_file_operations). xfs_file_ioctl_invis goes on to set IO_INVIS which goes on to set ATTR_DMI which is then tested in xfs_change_file_space() (which handles XFS_IOC_RESVSP & friends) for whether xfs_ichgtime(ip, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG) is called or not. --Tim -
You're not :) You're using an O_INVIBLE equivalent (as described below), which would be a useful thing to have at the VFS level, but adding hacks to some system calls only wouldn't help any HSM system. It's just useless API clutter. -
Sorry, but this doesn't make any sense. There is no need to put every feature in the XFS ioctls in the syscalls. The XFS ioctls will need to be supported forever anyway - as I suggested before they really should be moved to generic code. What needs to be supported is what makes sense as an interface. A punch a hole interface does make sense, but trying to hack this into a preallocation system call is just madness. We're not IRIX or windows that fit things into random subcall just because there was some space HSMs needs this basically for every system call, which screams for an open flag like O_INVISIBLE anyway. Adding this in a generic way is a good idea, but hacking bits and pieces that won't fit into the global design is completely wrong. -
Why don't we just merge the interface for preallocation (essentially enough to satisfy posix_fallocate() and the simple XFS requirement for space reservation without changing file size), which there is clear agreement on (I hope :)). After all, this was all that we set out to do when we started. And leave all the dealloc/punch/hsm type features for separate future patches/ debates, those really shouldn't hold up the basic fallocate interface. I agree with Christoph that we are just diverging too much in trying to club those decisions here. Dave, Andreas, Ted ? Regards -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India -
As you suggest, let us just have two modes for the time being: #define FALLOC_ALLOCATE 0x1 #define FALLOC_ALLOCATE_KEEP_SIZE 0x2 As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it will result in file size not being changed even if the preallocation is -- Regards, Amit Arora -
What does FALLOC_ALLOCATE mean vs. not passing this flag? I have no objection to this as long as the code remains with these as "flags" instead of "modes"... Essentially just dropping the FALLOC_FL_DEALLOCATE and FALLOC_FL_DEL_DATA from the interface. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
Sure. I'll just make XFS work with whatever it is that gets merged. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Great. I will post the new patches soon. -- Regards, Amit Arora -
We can't simply walk the range an remove unwritten extents, as some of them may have been present before the fallocate() call. That makes it extremely difficult to undo a failed call and not remove more pre-existing pre-allocations. Given the current behaviour for posix_fallocate() in glibc, I think that retaining the same error semantic and punting the cleanup to userspace (where the app will fail with ENOSPC anyway) is the only sane thing we can do here. Trying to undo this in the kernel leads to lots of extra rarely used code in error handling paths... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
I would not call it a "leak", since the blocks which got allocated as part of the partial success of the fallocate syscall can be strictly accounted for (i.e. they are assigned to a particular inode). And these Same is true for ext4 too. It is very difficult to keep track of which uninitialized (unwritten) extents got allocated as part of the current syscall. This is because, as David mentions, some of them might be already present; and also because some of the older ones may have got merged with the *new* uninitialized/unwritten extents as part of the Right. This gives applications the free hand if they really want to use the partially preallocated space, OR they want to free it; without introducing additional complexity in the kernel. -- Regards, Amit Arora -
Agreed, looks like we should stay with the user has to clean up behaviour. -
Someone on the XFs list had an interesting request - preallocated swap files. You can't use unwritten extents for this because of sys_swapon()s use of bmap() (XFS returns holes for reading unwritten extents), so we need a method of preallocating that does not zero or mark the extent unread. i.e. FA_MKSWAP. I thinkthis would be: #define FA_FL_NO_ZERO_SPACE 0x08 /* default is to zero space */ #define FA_MKSWAP (FA_ALLOCATE | FA_FL_NO_ZERO_SPACE) That way we can allocate large swap files that don't need zeroing in a single, fast operation, and hence potentially bring new swap space online without needed very much memory at all (i.e. should succeed in most near-OOM conditions). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
In XFS one of the (many) ALLOC modes is to zero existing data on allocate. For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on each extent. For some workloads this would be much faster than truncate and reallocate of all the blocks in a file. In that light, please change the comment to /* default is keep existing data */ so that it doesn't imply this is only for DEALLOC. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
In ext4, we already mark each extent having preallocated blocks as
uninitialized. This is done as part of following code (which is part of
patch 5/7) in ext4_ext_get_blocks() :
@@ -2122,6 +2160,8 @@ int ext4_ext_get_blocks(handle_t *handle
/* try to insert new extent into found leaf and return */
ext4_ext_store_pblock(&newex, newblock);
newex.ee_len = cpu_to_le16(allocated);
+ if (create == EXT4_CREATE_UNINITIALIZED_EXT) /* Mark uninitialized */
+ ext4_ext_mark_uninitialized(&newex);
err = ext4_ext_insert_extent(handle, inode, path, &newex);
if (err) {
Ok. Will update the comment.
Thanks!
--
Regards,
Amit Arora
-
What I meant is that with XFS_IOC_ALLOCSP the previously-written data
is ZEROED OUT, unlike with fallocate() which leaves previously-written
data alone and only allocates in holes.
So, if you had a sparse file with some data in it:
AAAAA BBBBBB
fallocate() would allocate the holes:
00000AAAAA000000000BBBBBB00000000
XFS_IOC_ALLOCSP would overwrite everything:
000000000000000000000000000000000
In order to specify this for allocation, FA_FL_DEL_DATA would need to make
sense for allocations (as well as the deallocation). This is farily easy
to do - just mark all of the existing extents as unallocated, and their
data disappears.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
-
Ok, agreed. Will add the FA_ZERO_SPACE mode too. Thanks! -- Regards, Amit Arora -
No, it wouldn't. XFS_IOC_ALLOCSP would give you:
AAAAA BBBBBB00000000
because it only allocates the space between the old EOF and the new
EOF. Graphic demonstration - write 4k @ 4k, 4k @ 16k, allocsp out to 32k:
wrote 4096/4096 bytes at offset 4096
4 KiB, 1 ops; 0.0000 sec (108.507 MiB/sec and 27777.7778 ops/sec)
wrote 4096/4096 bytes at offset 16384
4 KiB, 1 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
/mnt/test/alfred:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..7]: hole 8
1: [8..15]: 5226864..5226871 4 (1022160..1022167) 8
2: [16..31]: hole 16
3: [32..39]: 5226888..5226895 4 (1022184..1022191) 8
/mnt/test/alfred:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..7]: hole 8
1: [8..15]: 5226864..5226871 4 (1022160..1022167) 8
2: [16..31]: hole 16
3: [32..63]: 5226888..5226919 4 (1022184..1022215) 32
budgie:~ #
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
No, none of the XFS allocation modes do that. XFS_IOC_ALLOCSP, which does write zeros to disk, only allocates and writes zeros in the range between the old file size and the new file size. XFS_IOC_RESVSP, which alocates unwritten extents, only allocates where extents do not currently exist. It does not zero existing extents. IOWs, you can't overwrite existing data with XFS preallocation. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
This patch implements ->fallocate() inode operation in ext4. With this
patch users of ext4 file systems will be able to use fallocate() system
call for persistent preallocation.
Current implementation only supports preallocation for regular files
(directories not supported as of date) with extent maps. This patch
does not support block-mapped files currently.
Only FA_ALLOCATE mode is being supported as of now. Supporting
FA_DEALLOCATE mode is a <ToDo> item.
Changelog:
---------
Changes from Take3 to Take4:
1) Changed ext4_fllocate() declaration and definition to return a
"long"
and not an "int", to match with ->fallocate() inode op.
2) Update ctime if new blocks get allocated.
Changes from Take2 to Take3:
1) Patch rebased to 2.6.22-rc1 kernel version.
2) Removed unnecessary "EXPORT_SYMBOL(ext4_fallocate);".
Changes from Take1 to Take2:
1) Added more description for ext4_fallocate().
2) Now returning EOPNOTSUPP when files are block-mapped (non-extent).
3) Moved journal_start & journal_stop inside the while loop.
4) Replaced BUG_ON with WARN_ON & ext4_error.
5) Make EXT4_BLOCK_ALIGN use ALIGN macro internally.
6) Added variable names in the function declaration of ext4_fallocate()
7) Converted macros that handle uninitialized extents into inline
functions.
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -316,7 +316,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug(" %d:%d:%llu ",
le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
ext_pblock(path->p_ext));
} else
ext_debug(" []");
@@ -339,7 +339,7 @@ static void ext4_ext_show_leaf(struct in
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
...This patch adds write support to the uninitialized extents that get
created when a preallocation is done using fallocate(). It takes care of
splitting the extents into multiple (upto three) extents and merging the
new split extents with neighbouring ones, if possible.
Changelog:
---------
Changes from Take3 to Take4:
- no change -
Changes from Take2 to Take3:
1) Patch now rebased to 2.6.22-rc1 kernel.
Changes from Take1 to Take2:
1) Replaced BUG_ON with WARN_ON & ext4_error.
2) Added variable names to the function declaration of
ext4_ext_try_to_merge().
3) Updated variable declarations to use multiple-definitions-per-line.
4) "if((a=foo())).." was broken into "a=foo(); if(a).."
5) Removed extra spaces.
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -1167,6 +1167,53 @@ ext4_can_extents_be_merged(struct inode
}
/*
+ * This function tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+ struct ext4_extent_header *eh;
+ unsigned int depth, len;
+ int merge_done = 0;
+ int uninitialized = 0;
+
+ depth = ext_depth(inode);
+ BUG_ON(path[depth].p_hdr == NULL);
+ eh = path[depth].p_hdr;
+
+ while (ex < EXT_LAST_EXTENT(eh)) {
+ if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+ break;
+ /* merge with next extent! */
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+ + ext4_ext_get_actual_len(ex + 1));
+ if ...Support new values of mode in ext4.
This patch supports new mode values/flags in ext4. With this patch ext4
will be able to support FA_ALLOCATE and FA_RESV_SPACE modes. Supporting
FA_DEALLOCATE and FA_UNRESV_SPACE fallocate modes in ext4 is a work for
future.
Signed-off-by: Amit Arora <aarora@in.ibm.com>
Index: linux-2.6.22-rc4/fs/ext4/extents.c
===================================================================
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c
+++ linux-2.6.22-rc4/fs/ext4/extents.c
@@ -2477,7 +2477,8 @@ int ext4_ext_writepage_trans_blocks(stru
/*
* preallocate space for a file. This implements ext4's fallocate inode
* operation, which gets called from sys_fallocate system call.
- * Currently only FA_ALLOCATE mode is supported on extent based files.
+ * Currently only FA_ALLOCATE and FA_RESV_SPACE modes are supported on
+ * extent based files.
* We may have more modes supported in future - like FA_DEALLOCATE, which
* tells fallocate to unallocate previously (pre)allocated blocks.
* For block-mapped files, posix_fallocate should fall back to the method
@@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
* currently supporting (pre)allocate mode for extent-based
* files _only_
*/
- if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+ if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
+ !(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
return -EOPNOTSUPP;
/* preallocation to directories is currently not supported */
@@ -2572,9 +2574,10 @@ retry:
/*
* Time to update the file size.
- * Update only when preallocation was requested beyond the file size.
+ * Update only when preallocation was requested beyond the file size
+ * and when FA_FL_KEEP_SIZE mode is not specified!
*/
- if ((offset + len) > i_size_read(inode)) {
+ if (!(mode & FA_FL_KEEP_SIZE) && (offset + len) > i_size_read(inode)) {
if (ret > 0) {
/*
* if no error, we assume preallocation succeeded
-
This should probably just check for the individual flags it can support (e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA). I also thought another proposed flag was to determine whether mtime (and maybe ctime) is changed when doing prealloc/dealloc space? Default should probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else should decide if we want to allow changing the file w/o changing ctime, if that is required even though the file is not visibly changing. Maybe the ctime update should be implicit if the size or mtime are changing? Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
Hmm.. I am thinking of a scenario when the file system supports some individual flags, but does not support a particular combination of them. Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported for some reason). This means that although we support FA_FL_DEALLOC, FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the Is it really required ? I mean, why should we allow users not to update ctime/mtime even if the file metadata/data gets updated ? It sounds a bit "unnatural" to me. Is there any application scenario in your mind, when you suggest of giving this flexibility to userspace ? I think, modifying ctime/mtime should be dependent on the other flags. E.g., if we do not zero out data blocks on allocation/deallocation, update only ctime. Otherwise, update ctime and mtime both. -- Regards, Amit Arora -
That is up to the filesystem to determine then. I just thought it should be clear to return an error for flags (or as you say combinations thereof) that the filesystem doesn't understand. That said, I'd think in most cases the flags are orthogonal, so if you support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC) then you will also support other combinations of those flags just from One reason is that XFS does NOT update the mtime/ctime when doing the I'm only being the advocate for requirements David Chinner has put forward due to existing behaviour in XFS. This is one of the reasons why I think the "flags" mechanism we now have - we can encode the various different behaviours in any way we want and leave it to the caller. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
I understand. May be we can confirm once more with David Chinner if this is really required. Will it really be a compatibility issue if new XFS preallocations (ie. via fallocate) update mtime/ctime ? Will old applications really get affected ? If yes, then it might be worth implementing - even though I personally don't like it. David, can you please confirm ? Thanks! -- Regards, Amit Arora -
Not totally correct. XFS_IOC_ALLOCSP/FREESP change timestamps if they change the file size (via the truncate call made to change the file size). If they don't change the file size, then they are a no-op and should not change the file size. XFS_IOC_RESVSP/UNRESVSP don't change timestamps just like they don't change file size. That is by design AFAICT so these calls can be used by HSM-type applications that don't want to change timestamps It should be left up to the filesystem to decide. Only the filesystem knows whether something changed and the timestamp should or should not be updated. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Since Andreas had suggested FA_FL_NO_MTIME flag thinking it as a requirement from XFS (whereas XFS does not need this flag), I don't think we need to add this new flag. Please let know if someone still feels FA_FL_NO_MTIME flag can be useful. -- Regards, Amit Arora -
Can you include the man page in this patch set, please? That way it can be kept up to date with the rest of the patch set. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Why the heck are replacements for these things being sent out again when they're already in -mm and they're already in Ted's queue (from which I need to diligently drop them each time I remerge)? Are we all supposed to re-review the entire patchset (or at least #4 and #7) again? The core kernel changes are not appropriate to the ext4 tree. For a start, the syscall numbers in Ted's queue are wrong (other new syscalls are pending). Patches which add syscalls are an utter PITA to carry due to all the patch conflicts and to the relatively frequent syscall renumbering (they don't get numbered in time-of-arrival order due to differing rates at which patches mature). Please drop the non-ext4 patches from the ext4 tree and send incremental patches against the (non-ext4) fallocate patches in -mm. And try to get the code finished? Time is pressing. Thanks. -
The ext4 fallocate() patches are dependent on the core fallocate() patches, so ext4 patch-queue and git tree won't compile (it's not based on mm tree) without the core changes. We can send ext4 fallocate patches (incremental patches against mm tree) and drop the full fallocate patches(ext4 and non ext4 part) from ext4 I looked at the mm tree, there are other ext4 features/changes that are currently in ext4-patch-queue(not ext4 git tree) that not in part of ext4 series yet. Ted, can you merge those patches to your git tree? Thanks! Thanks for your patience. Mingming. -
As I mentioned in the note above, only patches #4 and #7 were new and thus these needed to be reviewed. Other patches are _not_ replacements of any of the patches which are already part of -mm and/or in Ted's patch queue. They were posted again as just "placeholders" so that the Please let us know what you think of Mingming's suggestion of posting all the fallocate patches including the ext4 ones as incremental ones against the -mm. -- Regards, Amit Arora -
I think Mingming was asking that Ted move the current quilt tree into git, presumably because she's working off git. I'm not sure what to do, really. The core kernel patches need to be in Ted's tree for testing but that'll create a mess for me. ug. Options might be a) I drop the fallocate patches from -mm and from the ext4 tree, hack up any needed build fixes, then just wait for it all to mature and then think about it again b) We do what we normally don't do and reserve the syscall slots in mainline. -
I moved the fallocate patches to the very end of the series in the quilt tree. This way the patches will be in the quilt tree for testing, but Ted can easily leave them out of the git tree so you and Linus won't pull them with the ext4 patches. Fortunately, the ext4-specific fallocate patches don't conflict with the other patches in the queue, so they can (at least for now) be handled -- David Kleikamp IBM Linux Technology Center -
If everyone agrees it's going to happen... why not? Jeff -
Could we please stop this stupid ext4-centrism? XFS is ready so we can put in the syscalls backed by XFS. We have already done this with the xattr syscalls in 2.4, btw. Then again I don't think we should put it in quite yet, because this thread has degraded into creeping featurism, please give me some more time to preparate a semi-coheret rants about this.. -
No, mingming and I both work off of the patch queue (which is also stored in git). So what mingming was asking for exactly was just posting the incremental patches and tagging them appropriately to avoid confusion. I tried building the patch queue earlier in the week and it there were multiple oops/panics as I ran things through various regression tests, but that may have been fixed since (the tree was broken over the weekend and I may have grabbed a broken patch series) or it may have been a screw up on my part feeding them into our testing grid. I haven't had time to try again this week, but I'll try to put together I don't think we have a problem here. What we have now is fine, and it was just people kvetching that Amit reposted patches that were already in -mm and ext4. In any case, the plan is to push all of the core bits into Linus tree for 2.6.22 once it opens up, which should be Real Soon Now, it looks like. - Ted -
It's fine for ext4, but not the wider world. This is a common problem Presumably you mean 2.6.23. Jeff -
Yes, sorry. I meant once Linus releases 2.6.22, and we would be aiming to merge before the 2.6.23-rc1 window. - Ted -
I think the ext4 patch queue is in good shape now. Shaggy have tested in on dbench, fsx, and tiobench, tests runs fine. and BULL team has benchmarked the latest ext4 patch queue with iozone and FFSB. Regards, -
On Fri, 29 Jun 2007 11:50:04 -0400 Which ext4 patches are you intending to merge into 2.6.23? Please send all those out to lkml for review? -
Hi Andrew, Here are the patches in ext4-patch-queue that I think can be considered to be merged to upstream. Please review. All of the patches have been posted on ext4 mailinglist before. Some are bug fixes, some are features, to summaries: - make extents on by default in ext4dev - nanosecond timestamp - 64 bit inode versioning support - remove 32k subdir limits - journal checksumming - journal stats via procfs - delayed allocation for ext4 writeback mode - fallocate() All the patches can be found at http://repo.or.cz/w/ext4-patch-queue.git and have been tested(with fsx ,dbench, FFSB, iozone) on x86,x86_64,ppc64, with extents and delayed allocation enabled And the full series can be found at http://repo.or.cz/w/ext4-patch-queue.git?a=blob;f=series;h=2f43431db28778ce8d2149bce7a... I will post the patches-in-good-shape (in 9 set of patches) to lkml in the following emails, except for the bottom two feature: *the fallocate() patches, which Amit just posted a few days ago and are under review (hopefully we can reach a agreement on the interface and the "modes" before 2.6.23-rc1 window closed). *Another one is the delayed allocation patches in ext4 patch queue. Alex mentioned in another email that he is working on another version of delalloc that can handle block size < page size, and move some work to vfs. So it's probably not very useful to post this version for people to review. So, here is the series file. # Rebased the patches to 2.6.22-rc6 # Add mount option to turn off extents ext4_noextent_mount_opt.patch # Mounted ext4dev fs with extents by default for testing purpose, # for Ext4 product release, extents mount option # will be turn on only if the fs has EXTENTS feature on ext4_extents_on_by_default.patch # Propagate inode flags ext4-propagate_flags.patch # Add extent sanity checks ext4-extent-sanity-checks.patch # Bug fix:set 64bit JBD2 feature on >32bit ext4 ...
The new patches are definitely a big improvement over the previous API, and need to go in before fallocate() goes into mainline. This last set of changes allows the behaviour of these syscalls to accomodate the various different semantics desired by XFS in a sensible manner instead of tying all of the individual behaviours (time update, size update, alloc/free, etc) into monolithic modes that will never make everyone happy. My understanding is that you only need to grab #4 and #7 to get your tree into get fallocate in sync with the ext4 patch queue (i.e. they are incremental over the previous set). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
This isn't really about "more suitable for XFS" but more about more suitable for sophisticated layout decisions. But I'm still not confident this should be shohorned into this syscall. In fact I'm already rather unhappy about the feature churn in the current patch series. The more I think about it the more I'd prefer we would just put a simple syscall in that implements nothing but the posix_fallocate(3) semantics as defined in SuS, and then go on to brainstorm about advanced preallocation / layout hint semantics. -
This patch implements support of fallocate system call on s390(x)
platform. A wrapper is added to address the issue which s390 ABI has
with "preferred" ordering of arguments in this system call (i.e. int,
int, loff_t, loff_t).
I will request s390 experts to please review this code and verify if
this patch is correct. Thanks!
Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
arch/s390/kernel/compat_wrapper.S | 10 ++++++++++
arch/s390/kernel/sys_s390.c | 10 ++++++++++
arch/s390/kernel/syscalls.S | 1 +
include/asm-s390/unistd.h | 3 ++-
4 files changed, 23 insertions(+), 1 deletion(-)
Index: linux-2.6.21/arch/s390/kernel/compat_wrapper.S
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/compat_wrapper.S
+++ linux-2.6.21/arch/s390/kernel/compat_wrapper.S
@@ -1682,3 +1682,13 @@ compat_sys_utimes_wrapper:
llgtr %r2,%r2 # char *
llgtr %r3,%r3 # struct compat_timeval *
jg compat_sys_utimes
+
+ .globl s390_fallocate_wrapper
+s390_fallocate_wrapper:
+ lgfr %r2,%r2 # int
+ sllg %r3,%r3,32 # get high word of 64bit loff_t
+ or %r3,%r4 # get low word of 64bit loff_t
+ sllg %r4,%r5,32 # get high word of 64bit loff_t
+ or %r4,%r6 # get low word of 64bit loff_t
+ llgf %r5,164(%r15) # unsigned int
+ jg s390_fallocate
Index: linux-2.6.21/arch/s390/kernel/sys_s390.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/sys_s390.c
+++ linux-2.6.21/arch/s390/kernel/sys_s390.c
@@ -268,6 +268,16 @@ s390_fadvise64_64(struct fadvise64_64_ar
}
/*
+ * This is a wrapper to call sys_fallocate(). Since s390 ABI has a problem
+ * with the int, int, loff_t, loff_t ordering of arguments, this wrapper
+ * is required.
+ */
+asmlinkage long s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
+{
+ return sys_fallocate(fd, mode, offset, len);
+}
+
+/*
* Do a system call from kernel instead of calling sys_execve so we
...This is a fix for an extent-overlap bug. The fallocate() implementation
on ext4 depends on this bugfix. Though this fix had been posted earlier,
but because it is still not part of mainline code, I have attached it
here too.
Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
fs/ext4/extents.c | 50 ++++++++++++++++++++++++++++++++++++++--
include/linux/ext4_fs_extents.h | 1
2 files changed, 49 insertions(+), 2 deletions(-)
Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1129,6 +1129,45 @@ ext4_can_extents_be_merged(struct inode
}
/*
+ * ext4_ext_check_overlap:
+ * check if a portion of the "newext" extent overlaps with an
+ * existing extent.
+ *
+ * If there is an overlap discovered, it updates the length of the newext
+ * such that there will be no overlap, and then returns 1.
+ * If there is no overlap found, it returns 0.
+ */
+unsigned int ext4_ext_check_overlap(struct inode *inode,
+ struct ext4_extent *newext,
+ struct ext4_ext_path *path)
+{
+ unsigned long b1, b2;
+ unsigned int depth, len1;
+
+ b1 = le32_to_cpu(newext->ee_block);
+ len1 = le16_to_cpu(newext->ee_len);
+ depth = ext_depth(inode);
+ if (!path[depth].p_ext)
+ goto out;
+ b2 = le32_to_cpu(path[depth].p_ext->ee_block);
+
+ /* get the next allocated block if the extent in the path
+ * is before the requested block(s) */
+ if (b2 < b1) {
+ b2 = ext4_ext_next_allocated_block(path);
+ if (b2 == EXT_MAX_BLOCK)
+ goto out;
+ }
+
+ if (b1 + len1 > b2) {
+ newext->ee_len = cpu_to_le16(b2 - b1);
+ return 1;
+ }
+out:
+ return 0;
+}
+
+/*
* ext4_ext_insert_extent:
* tries to merge requsted extent into the existing extent or
* inserts requested extent as new one into the tree,
@@ -2032,7 +2071,15 @@ int ext4_ext_get_blocks(handle_t *handle
/* allocate new block */
goal = ext4_ext_find_goal(inode, ...This patch has the ext4 implemtation of fallocate system call.
Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
fs/ext4/extents.c | 201 +++++++++++++++++++++++++++++++---------
fs/ext4/file.c | 1
include/linux/ext4_fs.h | 7 +
include/linux/ext4_fs_extents.h | 13 ++
4 files changed, 179 insertions(+), 43 deletions(-)
Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -283,7 +283,7 @@ static void ext4_ext_show_path(struct in
} else if (path->p_ext) {
ext_debug(" %d:%d:%llu ",
le32_to_cpu(path->p_ext->ee_block),
- le16_to_cpu(path->p_ext->ee_len),
+ ext4_ext_get_actual_len(path->p_ext),
ext_pblock(path->p_ext));
} else
ext_debug(" []");
@@ -306,7 +306,7 @@ static void ext4_ext_show_leaf(struct in
for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ex++) {
ext_debug("%d:%d:%llu ", le32_to_cpu(ex->ee_block),
- le16_to_cpu(ex->ee_len), ext_pblock(ex));
+ ext4_ext_get_actual_len(ex), ext_pblock(ex));
}
ext_debug("\n");
}
@@ -426,7 +426,7 @@ ext4_ext_binsearch(struct inode *inode,
ext_debug(" -> %d:%llu:%d ",
le32_to_cpu(path->p_ext->ee_block),
ext_pblock(path->p_ext),
- le16_to_cpu(path->p_ext->ee_len));
+ ext4_ext_get_actual_len(path->p_ext));
#ifdef CHECK_BINSEARCH
{
@@ -687,7 +687,7 @@ static int ext4_ext_split(handle_t *hand
ext_debug("move %d:%llu:%d in new leaf %llu\n",
le32_to_cpu(path[depth].p_ext->ee_block),
ext_pblock(path[depth].p_ext),
- le16_to_cpu(path[depth].p_ext->ee_len),
+ ext4_ext_get_actual_len(path[depth].p_ext),
newblock);
/*memmove(ex++, path[depth].p_ext++,
sizeof(struct ext4_extent));
@@ -1107,7 +1107,19 @@ static int
ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
struct ...Please always format multiline comments like this: /* * ext4_can_extents_be_merged should have checked that either * both extents are uninitialized, or both aren't. Thus we * need to check only one of them here. This description is rather thin. What is the filesystem's actual behaviour here? If the file is using extents then the implementation will do <something>. If the file is using bitmaps then we will do <something else>. So we don't implement fallocate on bitmap-based files! Well that's huge news. The changelog would be an appropriate place to communicate this, along with reasons why, or a description of the plan to fix it. Now I'm mystified. Given that we're allocating an arbitrary amount of disk space, and that this disk space will require an arbitrary amount of metadata, how can we work out how much journal space we'll be needing BUG_ON is vicious. Is it really justified here? Possibly a WARN_ON and Use buffer_new() here. A separate patch which fixes the three existing Both the lhs and the rhs here are signed. Please review for possible overflows through the sign bit and through zero. Perhaps a comment Maybe a comment describing what this does? Probably it's obvious enough. I think it could use the standard ALIGN macro. Is blkbits sufficiently parenthesised here? Even if it is, adding the inlined C functions are preferred, and I think these could be implemented that way. -
Right. I don't seem to find any suitable error from posix description. Can you please suggest an error code which might make more sense here ? Will -ENOTSUPP be ok ? Since we want to say here that we don't support You are right to say that the credits can not be fixed here. But, 'len' will not directly tell us how many extents might need to be inserted and how many block groups (if any - think about the "segment range" already being allocated case) the allocation request might touch. One solution I have thought is to check the buffer credits after a call to ext4_ext_get_blocks (in the while loop) and do a journal_extend, if the credits are falling short. Incase journal_extend fails, we call journal_restart. This will automatically take care of how much journal Ok. Will convert them to inline functions. Thanks! -- Regards, Amit Arora -
Isn't the idea that libc will interpret -ENOTTY, or whatever is returned here, and fall back to the current library code to do preallocation? This way, the caller of fallocate() will never see this return code, so it won't violate posix. -- David Kleikamp IBM Linux Technology Center -
You are right. But, we still need to "standardize" (and limit) the error codes which we should return from kernel when we want to fall back on the library implementation. The posix_fallocate() library function will have to look for a set of errors from fallocate() system call, upon receiving which it will do preallocation from user level; or else, it will return success/error-code returned by the system call to the user. I think we can make it fall back to library implementation of fallocate, whenever posix_fallocate() receives any of the following errors from fallocate() system call: 1. ENOSYS 2. EOPNOTSUPP 3. ENOTTY (?) Now the question is - should we limit the set of errors for this purpose to just 1 & 2 above ? In that case I will need to change the error being returned here to -EOPNOTSUPP (from current -ENOTTY). -- Regards, Amit Arora -
If you want my opinion, -EOPNOTSUPP is better than -ENOTTY. Shaggy -- David Kleikamp IBM Linux Technology Center -
My understanding is that glibc will handle zero-filling of files for I _think_ this is to convince glibc to do the zero-filling in userspace, Good question. The uninitialized extent can cover up to 128MB with a single entry. If @path isn't specified, then ext4_ext_calc_credits_for_insert() function returns the maximum number of extents needed to insert a leaf, including splitting all of the index blocks. That would allow up to 43GB (340 extents/block * 128MB) to be preallocated, but it still needs to take the size of the preallocation into account (adding 3 blocks per 43GB - a leaf block, a bitmap block and a group descriptor). Ouch, not very friendly error handling. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
On Mon, 7 May 2007 05:37:54 -0600
I think the use of ext4_journal_extend() (as Amit has proposed) will help
here, but it is not sufficient.
Because under some circumstances, a journal_extend() failure could mean
that we fail to allocate all the required disk space. If it is infrequent
enough, that is acceptable when the caller is using fallocate() for
performance reasons.
But it is very much not acceptable if the caller is using fallocate() for
space-reservation reasons. If you used fallocate to reserve 1GB of disk
and fallocate() "succeeded" and you later get ENOSPC then you'd have a
right to get a bit upset.
So I think the ext3/4 fallocate() implementation will need to be
implemented as a loop:
while (len) {
journal_start();
len -= do_fallocate(len, ...);
journal_stop();
}
Now the interesting question is: what do we do if we get halfway through
this loop and then run out of space? We could leave the disk all filled up
and then return failure to the caller, but that's pretty poor behaviour,
IMO.
Does the proposed implementation handle quotas correctly, btw? Has that
been tested?
Final point: it's fairly disappointing that the present implementation is
ext4-only, and extent-only. I do think we should be aiming at an ext4
bitmap-based implementation and an ext3 implementation.
-
Actually, this is a non-issue. The reason that it is handled for extent-only is that this is the only way to allocate space in the filesystem without doing the explicit zeroing. For other filesystems (including ext3 and ext4 with block-mapped files) the filesystem should return an error (e.g. -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
On Mon, 7 May 2007 15:21:04 -0700 hrm, spose so. It can be a bit suboptimal from the layout POV. The reservations code will largely save us here, but kernel support might make it a bit better. Totally blowing pagecache could be a problem. Fixable in userspace by using sync_file_range()+fadvise() or O_DIRECT, but I bet it doesn't. -
Actually, the reservations code won't matter, since glibc will fall back to its current behavior, which is it will do the preallocation by explicitly writing zeros to the file. This wlil result in the same layout as if we had done the persistent preallocation, but of course it will mean the posix_fallocate() could potentially take a long time if you're a PVR and you're reserving a gig or two for a two hour movie at high quality. That seems suboptimal, granted, and ideally the application should be warned about this before it calls posix_fallocate(). On the other hand, it's what happens today, all the time, so applications won't be too badly surprised. If we think applications programmers badly need to know in advance if posix_fallocate() will be fast or slow, probably the right thing is to define a new fpathconf() configuration option so they can query to see whether a particular file will support a fast posix_fallocate(). I'm not 100% convinced such complexity is really needed, but I'm willing to be convinced.... what do folks think? - Ted -
On Mon, 7 May 2007 19:14:42 -0400 No! Reservations code is *critical* here. Without reservations, we get disastrously-bad layout if two processes were running a large fallocate() at the same time. (This is an SMP-only problem, btw: on UP the timeslice lengths save us). My point is that even though reservations save us, we could do even-better in-kernel. But then, a smart application would bypass the glibc() fallocate() implementation and would tune the reservation window size and would use A PVR implementor would take all this over and would do it themselves, for An application could do sys_fallocate(one-byte) to work out whether it's supported in-kernel, I guess. -
In this case, since the number of blocks to preallocate (eg. N=10GB) is clear, we could improve the current reservation code, to allow callers explicitly ask for a new window that have the minimum N free blocks for the blocks-to-preallocated(rather than just have at least 1 free blocks). Before the ext4_fallocate() is called, the right reservation window size is set with the flag to indicating "please spend time if needed to find a window covers at least N free blocks". So for ex4 block mapped files, later when glibc is doing allocation and zeroing, the ext4 block-mapped allocator will knows to reserve the right amount of free blocks before allocating and zeroing 10GB space. -
Precisely /how/ do you avoid the zeroing issue, for extents? If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, otherwise the implementation is broken. Jeff -
There is a bit in the extent structure which indicates that the extent has not been initialized. When reading from a block where the extent is marked as unitialized, ext4 returns zero's, to avoid returning the uninitalized contents of the disk, which might contain someone else's love letters, p0rn, or other information which we shouldn't leak out. When writing to an extent which is uninitalized, we may potentially have to split the extent into three extents in the worst case. My understanding is that XFS uses a similar implementation; it's a pretty obvious and standard way to implement allocated-but-not-initialized extents. We thought about supporting persistent preallocation for inodes using indirect blocks, but it would require stealing a bit from each entry in the indirect block, reducing the maximum size of the filesystem by two (i.e., 2**31 blocks). It was decided it wasn't worth the complexity, given the tradeoffs. - Ted -
In ext4 (as in XFS) there is a flag stored in the extent that tells if the extent is initialized or not. Reads from uninitialized extents will return zero-filled data, and writes that don't span the whole extent will cause the uninitialized extent to be split into a regular extent and one or two uninitialized extents (depending where the write is). My comment was just that the extent doesn't have to be explicitly zero filled on the disk, by virtue of the fact that the uninitialized flag will cause reads to return zero. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
I agree. There is already a loop in Amit's current's patch to call ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to I think the calculation is based on the assumption that there is only a single extent to be inserted, which is the ideal case. But in some cases we may end up allocating several chunk of blocks(extents) for this single preallocation request when fs is fragmented (or part of preallocation request is already fulfilled) I think we should move this calculation inside the loop as well,and we really do not need to grab the lock to calculate the credit if the @path is always NULL, all the function does is mathmatics. I can't think of any good way to estimate the total credits needed for this whole preallocation request. Looked at ext4_get_block(), which is used for DIO code to deal with large amount of block allocation. The credit reservation is quite weak there too. The DIO_CREDIT is only The current code handles earlier ENOSPC by three times retries. After that if we still run out of space, then it's propably right to notify the caller there isn't much space left. We could extend the block reservation window size before the while loop I think so. The ext4_ext_get_blocks() will end up calling ext4_new_blocks() to do the real block allocation, quota is being handled there, therefor is tested already. Mingming -
On Mon, 07 May 2007 17:00:24 -0700 yes, but my point is that the proposed behaviour is really quite bad. We will attempt to allocate the disk space and then we will return failure, having consumed all the disk space and having partially and uselessly populated an unknown amount of the file. Userspace could presumably repair the mess in most situations by truncating the file back again. The kernel cannot do that because there might be live data in amongst there. So we'd need to either keep track of which blocks were newly-allocated and then free them all again on the error path (doesn't work right across commit+crash+recovery) or we could later use the space-reservation scheme which delayed allocation will need to introduce. Or we could decide to live with the above IMO-crappy behaviour. -
I agree your point, that's why I mention it only helped the Not totally useless I think. If only half of the space is preallocated because run out of space, the application can decide whether it's good enough to start to use this preallocated space or wait for the fs to In fact Amit and I had raised this issue before, whether it's okay to do allow partial preallocation. At that moment the feedback is it's no much different than the current zero-out-preallocation behavior: people might preallocating half-way then later deal with ENOSPC. We could check the total number of fs free blocks account before preallocation happens, if there isn't enough space left, there is no need to bother preallocating. If there is enough free space, we could make a reservation window that have at least N free blocks and mark it not stealable by other files. So later we will not run into the ENOSPC error. The fs free blocks account is just a estimate though. Mingming -
Checking against the fs free blocks is a good idea, since it will prevent the obvious error case where someone tries to preallocate 10GB when there is only 2GB left. But it won't help if there are multiple processes trying to allocate blocks the same time. On the other hand, that case is probably relatively rare, and in that case, the filesystem was probably going to be left completely full in any case. Actually, the kernel could do it, in that could simply release all unitialized extents back to the system. The problem is distinguishing between the unitialized extents that had just been newly added, versus the ones that had there from before. (On the other hand, if the filesystem was completely full, releasing unitialized blocks wouldn't be the worse thing in the world to do, although releasing previously fallocated blocks probably does violate the princple of least surprise, even if it's what the user would have wanted.) Could you really use a single reservation window? When the filesystem is almost full, the free extents are likely going to be scattered all over the disk. The general principle of grabbing all of the extents and keeping them in an in-memory data structure, and only adding them to the extent tree would work, though; I'm just not sure we could do it using the existing reservation window code, since it only supports a single reservation window per file, yes? - Ted -
I tend to agree with this. Having fallocate() fill up the filesystem is exactly what the caller asked. Doing a write() hit ENOSPC doesn't trucate off the whole write either, nor does "dd" delete the whole file when the filesystem is full. Even checking the statfs() space before doing the fallocate() may be counter intuitive, since it will return ENOSPC but the filesystem will not actually be full. Some applications (e.g. database) may WANT to fill the filesystem and then get the actual file size back to avoid trusting statfs() because of metadata overhead (e.g. indirect blocks). One of the design goals for sys_fallocate() was to allow FA_DELALLOC to deallocate unwritten extents in a safe manner. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
Think it again, this check is useful when preallocate blocks at EOF. It's not much useful is preallocating a range with holes. In that case 2GB space might be enough if the application tries to preallocate a True, the new uninitialized extents can be merged to the near old uninitialized extents, there is no way to distinguish the just added You are right. One reservation window per file and there is limit to the maximum window size). So yeah this way it's not going to prevent ENOSPC for sure:( Mingming -
It seems to handle quotas fine - the block allocation itself does not differ from the usual case, just the extents in the tree are marked as uninitialized... The only question is whether DQUOT_PREALLOC_BLOCK() shouldn't be called instead of DQUOT_ALLOC_BLOCK(). Then fallocate() won't be able to allocate anything after the softlimit has been reached which makes some sence but probably current behavior is kind-of less surprising. Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs -
This patch adds write support for preallocated (using fallocate system
call) blocks/extents. The preallocated extents in ext4 are marked
"uninitialized", hence they need special handling especially while
writing to them. This patch takes care of that.
Signed-off-by: Amit Arora <aarora@in.ibm.com>
---
fs/ext4/extents.c | 228 +++++++++++++++++++++++++++++++++++-----
include/linux/ext4_fs_extents.h | 1
2 files changed, 202 insertions(+), 27 deletions(-)
Index: linux-2.6.21/fs/ext4/extents.c
===================================================================
--- linux-2.6.21.orig/fs/ext4/extents.c
+++ linux-2.6.21/fs/ext4/extents.c
@@ -1141,6 +1141,51 @@ ext4_can_extents_be_merged(struct inode
}
/*
+ * ext4_ext_try_to_merge:
+ * tries to merge the "ex" extent to the next extent in the tree.
+ * It always tries to merge towards right. If you want to merge towards
+ * left, pass "ex - 1" as argument instead of "ex".
+ * Returns 0 if the extents (ex and ex+1) were _not_ merged and returns
+ * 1 if they got merged.
+ */
+int ext4_ext_try_to_merge(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_extent *ex)
+{
+ struct ext4_extent_header *eh;
+ unsigned int depth, len;
+ int merge_done=0, uninitialized = 0;
+
+ depth = ext_depth(inode);
+ BUG_ON(path[depth].p_hdr == NULL);
+ eh = path[depth].p_hdr;
+
+ while (ex < EXT_LAST_EXTENT(eh)) {
+ if (!ext4_can_extents_be_merged(inode, ex, ex + 1))
+ break;
+ /* merge with next extent! */
+ if (ext4_ext_is_uninitialized(ex))
+ uninitialized = 1;
+ ex->ee_len = cpu_to_le16(ext4_ext_get_actual_len(ex)
+ + ext4_ext_get_actual_len(ex + 1));
+ if (uninitialized)
+ ext4_ext_mark_uninitialized(ex);
+
+ if (ex + 1 < EXT_LAST_EXTENT(eh)) {
+ len = (EXT_LAST_EXTENT(eh) - ex - 1)
+ * sizeof(struct ext4_extent);
+ memmove(ex + 1, ex + 2, len);
+ }
+ eh->eh_entries = cpu_to_le16(le16_to_cpu(eh->eh_entries)-1);
+ merge_done = 1;
+ BUG_ON(eh->eh_entries == ...space around "=", please. Many people prefer not to do the multiple-definitions-per-line, btw: int merge_done = 0; int uninitialized = 0; reasons: - If gives you some space for a nice comment - It makes patches much more readable, and it makes rejects easier to fix eek, scary BUG_ON. Do we really need to be that severe? Would it be Use The preferred style is err = ext4_ext_get_access(handle, inode, path + depth); if (err) Sigh. I hope you guys know how all this works, because the extent code is a mystery to me. Is the on-disk layout and the allocation strategy Again, I do think that sticking the identifiers in there helps readability. Although it is not as important in a boring old declaration as it is in, say, inode_operations, etc. Please try to keep the code looking nice in an 80-column display. -
Ok. Will make the required changes. Thanks again for your comments! -- Regards, Amit Arora -
Please either use proper kerneldoc format or drop "ext4_ext_try_to_merge" from the comment. -
After long discussions where at least two possible implementations were suggested that would work on _all_ architectures you chose one This is not limited to strace... Besides that the s390 backend looks ok. -
I believe the long discussion also showed that every possible implementation has drawbacks. To me this one appeared to be the best of many bad choices. Is this implementation worse than we thought? Jörn -- The grand essentials of happiness are: something to do, something to love, and something to hope for. -- Allan K. Chalmers -
If one insists to have fd at first argument, what is wrong with having u32 arguments only? It's not that this syscall comes even close to It adds userspace overhead for one architecture. Every *trace and *libc needs special handling on s390 for this syscall. I would prefer to avoid this. -
I'm not that bothered about it. I would prefer it did use clean 64-bit arguments, but given it's a non-critical syscall I'm don't think the aesthetics are worth impossing crud on s390 for. -
Ok, so now for the hard questions - what are the semantics of FA_ALLOCATE and FA_DEALLOCATE? For FA_ALLOCATE, it's supposed to change the file size if we allocate past EOF, right? What's the return value supposed to be? Zero for success, error otherwise? Does this update a/m/ctime at all? How persistent is this preallocation? Should it be there "forever" or for the lifetime of the currently open fd that it was preallocated on? For FA_DEALLOCATE, does it change the filesize at all? Or does it just punch a hole in the file? If it does change file size, what happens when you punch out preallocation beyond EOF? FWIW, we definitely need a FA_PREALLOCATE mode (FA_ALLOCATE but does not change file size) so we can preallocate beyond EOF for apps which And that's what I'm doing now, hence all the questions ;) BTW, do you have a test program for this, or will I need to write one myself? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
ia64 fallocate syscall support. Signed-Off-By: Dave Chinner <dgc@sgi.com> --- arch/ia64/kernel/entry.S | 1 + include/asm-ia64/unistd.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: 2.6.x-xfs-new/arch/ia64/kernel/entry.S =================================================================== --- 2.6.x-xfs-new.orig/arch/ia64/kernel/entry.S 2007-03-29 19:01:41.000000000 +1000 +++ 2.6.x-xfs-new/arch/ia64/kernel/entry.S 2007-04-27 19:12:56.829396661 +1000 @@ -1612,5 +1612,6 @@ sys_call_table: data8 sys_vmsplice data8 sys_ni_syscall // reserved for move_pages data8 sys_getcpu + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: 2.6.x-xfs-new/include/asm-ia64/unistd.h =================================================================== --- 2.6.x-xfs-new.orig/include/asm-ia64/unistd.h 2007-03-29 19:03:37.000000000 +1000 +++ 2.6.x-xfs-new/include/asm-ia64/unistd.h 2007-04-27 19:18:18.215568425 +1000 @@ -293,11 +293,12 @@ #define __NR_vmsplice 1302 /* 1303 reserved for move_pages */ #define __NR_getcpu 1304 +#define __NR_fallocate 1305 #ifdef __KERNEL__ -#define NR_syscalls 281 /* length of syscall table */ +#define NR_syscalls 282 /* length of syscall table */ #define __ARCH_WANT_SYS_RT_SIGACTION -
Add XFS support for ->fallocate() vector.
Signed-Off-By: Dave Chinner <dgc@sgi.com>
---
fs/xfs/linux-2.6/xfs_iops.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c 2007-02-07 13:24:32.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:02:16.225095992 +1000
@@ -812,6 +812,53 @@ xfs_vn_removexattr(
return namesp->attr_remove(vp, attr, xflags);
}
+STATIC long
+xfs_vn_fallocate(
+ struct inode *inode,
+ int mode,
+ loff_t offset,
+ loff_t len)
+{
+ long error = -EOPNOTSUPP;
+ bhv_vnode_t *vp = vn_from_inode(inode);
+ bhv_desc_t *bdp;
+ int do_setattr = 0;
+ xfs_flock64_t bf;
+
+ bf.l_whence = 0;
+ bf.l_start = offset;
+ bf.l_len = len;
+
+ bdp = bhv_lookup_range(VN_BHV_HEAD(vp), VNODE_POSITION_XFS,
+ VNODE_POSITION_XFS);
+
+ switch (mode) {
+ case FA_ALLOCATE: /* changes file size */
+ error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
+ &bf, 0, NULL, 0);
+ if (offset + len > i_size_read(inode))
+ do_setattr = offset + len;
+ break;
+ case FA_DEALLOCATE:
+ /* XXX: changes file size? this just punches a hole */
+ error = xfs_change_file_space(bdp, XFS_IOC_UNRESVSP,
+ &bf, 0, NULL, 0);
+ break;
+ default:
+ break;
+ }
+
+ /* Change file size if needed */
+ if (!error && do_setattr) {
+ bhv_vattr_t va;
+
+ va.va_mask = XFS_AT_SIZE;
+ va.va_size = do_setattr;
+ error = bhv_vop_setattr(vp, &va, 0, NULL);
+ }
+
+ return error;
+}
struct inode_operations xfs_inode_operations = {
.permission = xfs_vn_permission,
@@ -822,6 +869,7 @@ struct inode_operations xfs_inode_operat
.getxattr = xfs_vn_getxattr,
.listxattr = xfs_vn_listxattr,
.removexattr = xfs_vn_removexattr,
+ .fallocate = xfs_vn_fallocate,
};
struct inode_operations xfs_dir_inode_operations = {
-
Add new mode to ->fallocate() to allow allocation to occur
beyond the current EOF without changing the file size. Implement
in XFS ->fallocate() vector.
Signed-Off-By: Dave Chinner <dgc@sgi.com>
---
fs/xfs/linux-2.6/xfs_iops.c | 8 +++++---
include/linux/fs.h | 1 +
2 files changed, 6 insertions(+), 3 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:02:16.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_iops.c 2007-04-30 11:09:48.233375382 +1000
@@ -833,11 +833,13 @@ xfs_vn_fallocate(
VNODE_POSITION_XFS);
switch (mode) {
- case FA_ALLOCATE: /* changes file size */
- error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
- &bf, 0, NULL, 0);
+ case FA_ALLOCATE: /* changes file size */
if (offset + len > i_size_read(inode))
do_setattr = offset + len;
+ /* FALL THROUGH */
+ case FA_PREALLOCATE: /* no filesize change */
+ error = xfs_change_file_space(bdp, XFS_IOC_RESVSP,
+ &bf, 0, NULL, 0);
break;
case FA_DEALLOCATE:
/* XXX: changes file size? this just punches a hole */
Index: 2.6.x-xfs-new/include/linux/fs.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/fs.h 2007-04-27 18:48:01.000000000 +1000
+++ 2.6.x-xfs-new/include/linux/fs.h 2007-04-30 11:08:05.790903661 +1000
@@ -269,6 +269,7 @@ extern int dir_notify_enable;
*/
#define FA_ALLOCATE 0x1
#define FA_DEALLOCATE 0x2
+#define FA_PREALLOCATE 0x3
#ifdef __KERNEL__
-
FA_ALLOCATE should be able to allocate past-EOF I would argue. -
I'm going from the ext4 implementation because the semantics have not been documented yet. IIRC, the argument for FA_ALLOCATE changing file size is that posix_fallocate() is supposed to change the file size. I think that having a mode for real preallocation and another for posix_fallocate is a valid thing to do... Note that the way XFS implements growing the file size after the That's would what I did because otherwise you'd use ftruncate64(). Without documented behaviour or an ext4 implementation, I have to ask what it's supposed to do, though ;) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
But it's not posix_fallocate; it's something more generic. glibc can How many *real* users are there for ext4? Why does 'what ext4 does' define 'the semantics'? Surely semantics should be decided either by precedent (if there is an existing relevant userbase) or sensible thought and some debate? -
The patch I posted for ext4 *does* change the filesize after preallocation, if required (i.e. when preallocation is after EOF). I may have to change that, if we decide on not doing this. -- Regards, Amit Arora -
I think I'd agree - it may be useful to allow preallocation beyond EOF for some kinds of applications (e.g. PVR preallocating live TV in 10 minute segments or something, but not knowing in advance how long the show will actually be recorded or the final encoded size). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
I have an application (diablo dreader) where the header-info database basically consists of ~40.000 files, one for each group (it's more complicated that that, but never mind that now). If you grow those files randomly by a few hundred bytes every update, the filesystem gets hopelessly fragmented. I'm using XFS with preallocation turned on, and biosize=18 (which makes it preallocate in blocks of 256KB), and a homebrew patch that leaves the preallocated space on disk preallocated even if the file is closed .. and it helps enormously. Mike. -
XFS always has speculative preallocation turned on - this is different to explicit preallocation which we are talking about here ;) FWIW, the reason you need your homebrew patch is that specualtive allocation does not set the PREALLOC bit on the inode, and so when you close the file the speculative prealloc gets truncated away. If you use a real preallocation (XFS_IOC_RESVSP64) or the upcoming fallocate() syscall, XFS also sets the PREALLOC bit in the inode so it doesn't get truncated away on file close. If you don't want to use XFS_IOC_RESVSP64, you could just use XFS_IOC_FSSETXATTR to set the prealloc bit on the files you care about so you don't need a hack in XFS to prevent truncation of speculative allocation on file close..... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
This is the new set of patches which take care of the review comments
received from the community (mainly from Andrew).
Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.
Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.
Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.
Interface:
---------
The proposed system call's layout is:
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
fd: The descriptor of the open file.
mode*: This specifies the behavior of the system call. Currently the
system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime/mtime in the inode of the corresponding file, marking a
successfull allocation.
FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may ...What is the return value? I'd hope it is the number of bytes preallocated, in case of interrupted preallocation for whatever reason (interrupt, out of space, etc) like a regular write(2) call. In this case the return type needs to also be an loff_t to match @len. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
The return value in current implementation has been kept as "long" where zero is returned for success and an error on failure. This is done to keep it inline with posix_fallocate behavior. This point was brought up sometime back by Badari. At that time it was decided to keep it the way posix_fallocate is designed. Here are the posts related to this: http://lkml.org/lkml/2007/3/2/18 http://lkml.org/lkml/2007/3/2/162 http://lkml.org/lkml/2007/3/2/208 Still if you feel that we should be returning number of bytes preallocated, we can again ask for opinion here. Thanks! -- Regards, Amit Arora -
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ P L E A S E N O T E : *********************** 1. Patches have been now rebased to 2.6.22-rc1 kernel. Earlier they were based on 2.6.21. 2. An unnecessary export of symbol is removed from the ext4 preallocate patch. Details in the corresponding post (PATCH 4/5). 3. Return type now described in the interface description below. 4. Besides above points, everything is exactly same as TAKE2. -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ This is the new set of patches which take care of the review comments received from the community (mainly from Andrew). Description: ----------- fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called fallocate. Applications can use this feature to avoid fragmentation to certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the the system becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for similar cause. Though this has the advantage of working on all file systems, but it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and incase the kernel/filesystem does not implement it, it should fall back to the current implementation of writing zeroes to the new blocks. Interface: --------- The proposed system call's layout is: asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) fd: The descriptor of the open file. mode*: This specifies ...
Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64
From: David Chinner <dgc@sgi.com>
Subject: [PATCH] ia64 fallocate syscall
Cc: "Amit K. Arora" <aarora@linux.vnet.ibm.com>,
akpm@linux-foundation.org, linux-ext4@vger.kernel.org,
suparna@in.ibm.com, cmm@us.ibm.com
ia64 fallocate syscall support.
Signed-Off-By: Dave Chinner <dgc@sgi.com>
---
arch/ia64/kernel/entry.S | 1 +
include/asm-ia64/unistd.h | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===================================================================
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-12 18:45:56.000000000 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-15 15:36:48.000000000 -0700
@@ -1585,5 +1585,6 @@
data8 sys_getcpu
data8 sys_epoll_pwait // 1305
data8 sys_utimensat
+ data8 sys_fallocate
.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===================================================================
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 18:45:56.000000000 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-15 15:37:51.000000000 -0700
@@ -296,6 +296,7 @@
#define __NR_getcpu 1304
#define __NR_epoll_pwait 1305
#define __NR_utimensat 1306
+#define __NR_fallocate 1307
#ifdef __KERNEL__
-
Description:
-----------
fallocate() is a new system call being proposed here which will allow
applications to preallocate space to any file(s) in a file system.
Each file system implementation that wants to use this feature will need
to support an inode operation called fallocate.
Applications can use this feature to avoid fragmentation to certain
level and thus get faster access speed. With preallocation, applications
also get a guarantee of space for particular file(s) - even if later the
the system becomes full.
Currently, glibc provides an interface called posix_fallocate() which
can be used for similar cause. Though this has the advantage of working
on all file systems, but it is quite slow (since it writes zeroes to
each block that has to be preallocated). Without a doubt, file systems
can do this more efficiently within the kernel, by implementing
the proposed fallocate() system call. It is expected that
posix_fallocate() will be modified to call this new system call first
and incase the kernel/filesystem does not implement it, it should fall
back to the current implementation of writing zeroes to the new blocks.
Interface:
---------
The proposed system call's layout is:
asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
fd: The descriptor of the open file.
mode*: This specifies the behavior of the system call. Currently the
system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
FA_ALLOCATE: Applications can use this mode to preallocate blocks to
a given file (specified by fd). This mode changes the file size if
the preallocation is done beyond the EOF. It also updates the
ctime in the inode of the corresponding file, marking a
successfull allocation.
FA_DEALLOCATE: This mode can be used by applications to deallocate the
previously preallocated blocks. This also may change the file size
and the ctime/mtime.
* New modes might get added in future. One such new mode which is
already under ...I merged the first three patches into -mm, thanks. All the system call numbers got changed due to recent additions. They may change in the future, too - nothing is stable until the code lands in mainline. I didn't merge any of the ext4 changes as they appear to be in Ted's devel tree. Although I didn't check that they are 100% the same in that tree. What's the plan to get some ext4 updates into mainline, btw? Things seem to be rather gradual. -
In case you haven't realize it, the ia64 fallocate() patch comes with Amit's takes 4 fallocate patch series (3/6) missing one line change, thus fail to compile on ia64. Here is the updated one. Patch tested on ia64. (compile and fsx) fallocate() on ia64 ia64 fallocate syscall support. Signed-Off-By: Dave Chinner <dgc@sgi.com> --- arch/ia64/kernel/entry.S | 1 + include/asm-ia64/unistd.h | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S =================================================================== --- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S 2007-05-18 16:30:16.000000000 -0700 +++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S 2007-05-18 16:32:45.000000000 -0700 @@ -1585,5 +1585,6 @@ data8 sys_getcpu data8 sys_epoll_pwait // 1305 data8 sys_utimensat + data8 sys_fallocate .org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h =================================================================== --- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-18 16:30:16.000000000 -0700 +++ linux-2.6.22-rc1/include/asm-ia64/unistd.h 2007-05-18 17:34:58.000000000 -0700 @@ -296,11 +296,12 @@ #define __NR_getcpu 1304 #define __NR_epoll_pwait 1305 #define __NR_utimensat 1306 +#define __NR_fallocate 1307 #ifdef __KERNEL__ -#define NR_syscalls 283 /* length of syscall table */ +#define NR_syscalls 285 /* length of syscall table */ #define __ARCH_WANT_SYS_RT_SIGACTION Since both Amit and Ted are traveling, I will jump in... Most likely it's not the same one. What in Ted's devel tree is "takes 2" patches. I have incorporated takes 4 patches in the backing ext4 patch git tree here: http://repo.or.cz/w/ext4-patch-queue.git I have tested these patch series on ia64,ppc64,x86 and x86_64. I am not sure if Ted got a chance to update his ext4 git tree from this patch Last time Ted ...
s390 can be changed to support seven-arg syscalls. But that would require creating an additional stackframe in *libc to save original register contents and in addition it would make our syscall hotpath slower. That is because we have to take care of an additional register that might contain user space passed contents and needs to be put on the kernel stack. If possible I'd prefer the six-32-bit-args approach. -
It does mean extra unnecessary work for 64-bit platforms, though... Paul. -
Wouldn't that work be confined to fallocate()? If I understand Heiko correctly, the alternative would slow s390 down for every syscall, including more performance-critical ones. Jörn -- tglx1 thinks that joern should get a (TM) for "Thinking Is Hard" -- Thomas Gleixner -
The alternative that Jakub suggested wouldn't slow s390 down. Paul. -
True. And it appears to be one of the least offensive options we have. Jörn -- My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed. -- Edsger W. Dijkstra -
It is going to need to be a COMPAT_SYS call in powerpc because 32 bit powerpc will pass the two loff_t's in pairs of registers while 64bit passes them in one register each. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/
Ok. Will make that change, unless it is decided to pass each loff_t argument as two "u32"s. Thanks! -- Regards, Amit Arora -
I suggest reading the very end of arch/arm/kernel/sys_arm.c; I'd rather avoid adding more and more hacks like that to the kernel if at all possible. One solution (already mentioned elsewhere) is that we start avoiding passing 64-bit arguments and instead pass two 32-bit instead. This nicely avoids the alignment restrictions for 64-bit args in ABIs. (The issue for ARM is that with anything other than the "fd, mode, offset, len" layout we will have to deal with different ABI argument layouts, or implement our own wrapper function as done for sys_arm_sync_file_range.) I think the problem comes down to "what is the argument layout which causes the least amount of problems for the complete set of architectures." For ARM, that's the "fd, mode, offset, len" layout. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian |
