Hans Reiser [interview] described a recently posted patch, saying, "it revises the existing reiser4 code to do a good job for writes that are larger than 4k at a time by assiduously adhering to the principle that things that need to be done once per write should be done once per write, not once per 4k." He went on to explain, "this code empirically proves that the generic code design which passes 4k at a time to the underlying FS can be improved. Performance results show that the new code consumes 40% less CPU when doing 'dd bs=1MB .....'" Referring to generic_file_write(), he further noted that currently when writing 64MB of data, "it may go to the kernel as a 64MB write, but VFS sends it to the FS as 64MB/4k separate 4k writes." It was acknowledged that this could also be accomplished in a non-generic way; however, earlier feedback had suggested that such improvements should be made available to all filesystems.
Andrew Morton [interview] responded to the proposed changes, saying, "there's nothing which leaps out and says 'wrong' in this. But there's nothing which leaps out and says 'right', either. It seems somewhat arbitrary, that's all." He pointed out that reiser4 was currently the only filesystem to benefit from the changes: "to be able to say 'yes, we want this' I think we'd need to understand which other filesystems would benefit from exploiting it, and with what results?" In the resulting discussion, it was determined that both FUSE [story] and XFS [story] would benefit from these changes, prompting Hans to ask, "Is it enough?" Andrew agreed, "Spose so. Let's see what the diff looks like?"
From: Hans Reiser [email blocked] To: Andrew Morton [email blocked] Subject: [PATCH] updated reiser4 - reduced cpu usage for writes by writing more than 4k at a time (has implications for generic write code and eventually for the IO layer) Date: Tue, 23 May 2006 13:14:54 -0700 ftp://ftp.namesys.com/pub/reiser4-for-2.6/2.6.17-rc4-mm1/reiser4-for-2.6.17-rc4-mm1-2.patch.gz The referenced patch replaces all reiser4 patches to mm. It revises the existing reiser4 code to do a good job for writes that are larger than 4k at a time by assiduously adhering to the principle that things that need to be done once per write should be done once per write, not once per 4k. That statement is a slight simplification: there are times when due to the limited size of RAM you want to do some things once per WRITE_GRANULARITY, where WRITE_GRANULARITY is a #define that defines some moderate number of pages to write at once. This code empirically proves that the generic code design which passes 4k at a time to the underlying FS can be improved. Performance results show that the new code consumes 40% less CPU when doing "dd bs=1MB ....." (your hardware, and whether the data is in cache, may vary this result). Note that this has only a small effect on elapsed time for most hardware. The planned future(as discussed with akpm previously): we will ship very soon (testing it now) an improved reiser4 read code that does reads in more than little 4k chunks. Then we will revise the generic code to allow an FS to receive the writes and reads in whole increments. How best to revise the generic code is still being discussed. Nate is discussing doing it in some way that improves code symmetry in the io scheduler layer as well, if there is interest by others in it maybe a thread can start on that topic, or maybe it can wait for him+zam to make a patch. Note for users: this patch also contains numerous important bug fixes.
From: Tom Vier [email blocked] Subject: Re: [PATCH] updated reiser4 - reduced cpu usage for writes by writing more than 4k at a time (has implications for generic write code and eventually for the IO layer) Date: Wed, 24 May 2006 13:53:12 -0400 On Tue, May 23, 2006 at 01:14:54PM -0700, Hans Reiser wrote: > underlying FS can be improved. Performance results show that the new > code consumes 40% less CPU when doing "dd bs=1MB ....." (your hardware, > and whether the data is in cache, may vary this result). Note that this > has only a small effect on elapsed time for most hardware. Write requests in linux are restricted to one page? -- Tom Vier [email blocked] DSA Key ID 0x15741ECE
From: Hans Reiser [email blocked] Subject: Re: [PATCH] updated reiser4 - reduced cpu usage for writes by writing more than 4k at a time (has implications for generic write code and eventually for the IO layer) Date: Wed, 24 May 2006 10:55:48 -0700 Tom Vier wrote: >On Tue, May 23, 2006 at 01:14:54PM -0700, Hans Reiser wrote: > > >>underlying FS can be improved. Performance results show that the new >>code consumes 40% less CPU when doing "dd bs=1MB ....." (your hardware, >>and whether the data is in cache, may vary this result). Note that this >>has only a small effect on elapsed time for most hardware. >> >> > >Write requests in linux are restricted to one page? > > > It may go to the kernel as a 64MB write, but VFS sends it to the FS as 64MB/4k separate 4k writes.
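To put a number on Hans's example, the arithmetic can be sketched as follows. This is only an illustration of the "64MB/4k" figure, assuming 4k pages and a page-aligned write; the function name per_page_calls is made up for this note and is not a kernel symbol.

```python
# Illustrative arithmetic only: how many page-sized pieces one large write()
# becomes when the generic path hands the filesystem one page at a time.
PAGE_SIZE = 4 * 1024             # 4k pages, as assumed throughout the thread
WRITE_SIZE = 64 * 1024 * 1024    # the 64MB write from Hans's example


def per_page_calls(write_size: int, page_size: int = PAGE_SIZE) -> int:
    """Pages needed to cover write_size bytes, assuming a page-aligned start."""
    return -(-write_size // page_size)  # ceiling division


calls = per_page_calls(WRITE_SIZE)
print(calls)  # 16384
```

Each of those pieces costs a separate trip into the filesystem in the per-page design, which is the overhead the patch tries to remove.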
From: Jens Axboe [email blocked] Subject: Re: [PATCH] updated reiser4 - reduced cpu usage for writes by writing more than 4k at a time (has implications for generic write code and eventually for the IO layer) Date: Thu, 8 Jun 2006 13:00:45 +0200 On Wed, May 24 2006, Hans Reiser wrote: > Tom Vier wrote: > > >On Tue, May 23, 2006 at 01:14:54PM -0700, Hans Reiser wrote: > > > > > >>underlying FS can be improved. Performance results show that the new > >>code consumes 40% less CPU when doing "dd bs=1MB ....." (your hardware, > >>and whether the data is in cache, may vary this result). Note that this > >>has only a small effect on elapsed time for most hardware. > >> > >> > > > >Write requests in linux are restricted to one page? > > > > > > > It may go to the kernel as a 64MB write, but VFS sends it to the FS as > 64MB/4k separate 4k writes. Nonsense, there are ways to get > PAGE_CACHE_SIZE writes in one chunk. Other file systems have been doing it for years. -- Jens Axboe
From: Vladimir V. Saveliev [email blocked] Subject: Re: [PATCH] updated reiser4 - reduced cpu usage for writes by writing more than 4k at a time (has implications for generic write code and eventually for the IO layer) Date: Thu, 08 Jun 2006 15:26:40 +0400 Hello On Thu, 2006-06-08 at 13:00 +0200, Jens Axboe wrote: > On Wed, May 24 2006, Hans Reiser wrote: > > > > It may go to the kernel as a 64MB write, but VFS sends it to the FS as > > 64MB/4k separate 4k writes. > > Nonsense, Hans refers to generic_file_write which does prepare_write copy_from_user commit_write for each page. > there are ways to get > PAGE_CACHE_SIZE writes in one chunk. > Other file systems have been doing it for years. > Would you, please, say more about it.
From: Christoph Hellwig [email blocked] Subject: Re: [PATCH] updated reiser4 - reduced cpu usage for writes by writing more than 4k at a time (has implications for generic write code and eventually for the IO layer) Date: Thu, 8 Jun 2006 13:10:06 +0100 On Thu, Jun 08, 2006 at 03:26:40PM +0400, Vladimir V. Saveliev wrote: > > > It may go to the kernel as a 64MB write, but VFS sends it to the FS as > > > 64MB/4k separate 4k writes. > > > > Nonsense, > > Hans refers to generic_file_write which does > prepare_write > copy_from_user > commit_write > for each page. That's not really the vfs but the generic pagecache routines. For some filesystems (e.g. XFS) only reservations for delayed allocations are performed in this path so it doesn't really matter. For not so advanced filesystems batching these calls would definitly be very helpful. Patches to get there are very welcome.
From: Hans Reiser [email blocked] Subject: Re: [PATCH] updated reiser4 - reduced cpu usage for writes by writing more than 4k at a time (has implications for generic write code and eventually for the IO layer) Date: Wed, 14 Jun 2006 12:37:39 -0700 Jens Axboe wrote: >On Thu, Jun 08 2006, Vladimir V. Saveliev wrote: > > >>Hello >> >>On Thu, 2006-06-08 at 13:00 +0200, Jens Axboe wrote: >> >> >>>On Wed, May 24 2006, Hans Reiser wrote: >>> >>> >>>>Tom Vier wrote: >>>> >>>> >>>> >>>>>On Tue, May 23, 2006 at 01:14:54PM -0700, Hans Reiser wrote: >>>>> >>>>> >>>>> >>>>> >>>>>>underlying FS can be improved. Performance results show that the new >>>>>>code consumes 40% less CPU when doing "dd bs=1MB ....." (your hardware, >>>>>>and whether the data is in cache, may vary this result). Note that this >>>>>>has only a small effect on elapsed time for most hardware. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>Write requests in linux are restricted to one page? >>>>> >>>>> >>>>> >>>>> >>>>> >>>>It may go to the kernel as a 64MB write, but VFS sends it to the FS as >>>>64MB/4k separate 4k writes. >>>> >>>> >>>Nonsense, >>> >>> >>Hans refers to generic_file_write which does >>prepare_write >>copy_from_user >>commit_write >>for each page. >> >> > >Provide your own f_op->write() ? > > In Unix VFS is an abstraction layer with a philosophical commitment to allow filesystems to do their own thing, but Linux is quite different, and what you suggest got vetoed with emphasis. In all fairness, the patch vs is sending is one I can live with that allows me to not worry about aio code and direct io code, neither of which interest me at this time. So I suppose there is some benefit to all this hassle. > > >>>there are ways to get > PAGE_CACHE_SIZE writes in one chunk. >>>Other file systems have been doing it for years. >>> >>> >>> >>Would you, please, say more about it. >> >> > >Use writepages? > > > writepages is flush time code, this is sys_write() code. 
sys_write first sticks things into the cache,, then memory pressure or pages reaching maximum time allowed in memory or fsync pushes them out to disk, at which time writepages might get used. This issue is about cached writes losing performance when done 4k at a time. It is very similar to why bios are better than submitting io 4k at a time, but it is at a different stage. Christoph Hellwig wrote: >That's not really the vfs but the generic pagecache routines. For some >filesystems (e.g. XFS) only reservations for delayed allocations are >performed in this path so it doesn't really matter. For not so advanced >filesystems batching these calls would definitly be very helpful. Patches >to get there are very welcome. > > > Glad we all agree. vs is sending a pseudocoded proposal.
From: Vladimir V. Saveliev [email blocked]
Subject: batched write
Date: Thu, 15 Jun 2006 02:08:32 +0400

Hello

On Thu, 2006-06-08 at 13:10 +0100, Christoph Hellwig wrote:
> On Thu, Jun 08, 2006 at 03:26:40PM +0400, Vladimir V. Saveliev wrote:
> > > > It may go to the kernel as a 64MB write, but VFS sends it to the FS as
> > > > 64MB/4k separate 4k writes.
> > >
> > > Nonsense,
> >
> > Hans refers to generic_file_write which does
> > prepare_write
> > copy_from_user
> > commit_write
> > for each page.
>
> That's not really the vfs but the generic pagecache routines. For some
> filesystems (e.g. XFS) only reservations for delayed allocations are
> performed in this path so it doesn't really matter. For not so advanced
> filesystems batching these calls would definitly be very helpful. Patches
> to get there are very welcome.
>

The core of generic_file_buffered_write is

    do {
        grab_cache_page();
        a_ops->prepare_write();
        copy_from_user();
        a_ops->commit_write();

        filemap_set_next_iovec();
        balance_dirty_pages_ratelimited();
    } while (count);

Would it make sence to rework this code with adding new address_space operation - fill_pages so that looks like:

    do {
        a_ops->fill_pages();
        filemap_set_next_iovec();
        balance_dirty_pages_ratelimited();
    } while (count);

generic implementation of fill_pages would look like:

    generic_fill_pages()
    {
        grab_cache_page();
        a_ops->prepare_write();
        copy_from_user();
        a_ops->commit_write();
    }

I believe that filesystem developers will want to exploit that operation. Any opinion on this plan is welcomed. I would try to code whatever we will have developed (I hope) in result of this discussion.
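The effect of Vladimir's proposal can be modelled in user space. The sketch below is not kernel code: the classes PerPageFS and BatchingFS, the buffered_write() driver, and the batch_pages parameter are all hypothetical names invented for this illustration. It only counts how often the generic loop enters the filesystem when fill_pages falls back to the per-page hooks versus when a filesystem supplies its own batched fill_pages.

```python
PAGE_SIZE = 4096  # assumed page size


class PerPageFS:
    """Hypothetical filesystem that only implements the per-page hooks."""

    def __init__(self):
        self.calls = 0  # number of times the generic code entered the FS

    def prepare_write(self, page_index):
        self.calls += 1

    def commit_write(self, page_index):
        self.calls += 1

    def fill_pages(self, first_page, nr_pages):
        # Stand-in for a generic fallback: one prepare/commit pair per page.
        for page in range(first_page, first_page + nr_pages):
            self.prepare_write(page)
            # copy_from_user() would copy the data into the page here
            self.commit_write(page)
        return nr_pages


class BatchingFS(PerPageFS):
    """Hypothetical filesystem that handles a whole batch in one entry."""

    def fill_pages(self, first_page, nr_pages):
        self.calls += 1  # per-call work is done once for the whole batch
        return nr_pages


def buffered_write(fs, count, batch_pages):
    """Model of the reworked loop: offer the FS up to batch_pages pages per pass."""
    page = 0
    remaining = -(-count // PAGE_SIZE)  # pages needed, rounded up
    while remaining:
        done = fs.fill_pages(page, min(batch_pages, remaining))
        page += done
        remaining -= done
    return fs.calls


one_mb = 1024 * 1024
print(buffered_write(PerPageFS(), one_mb, batch_pages=16))   # 512
print(buffered_write(BatchingFS(), one_mb, batch_pages=16))  # 16
```

The batch size of 16 pages is arbitrary. The point is only that the number of entries into the filesystem scales with the number of batches rather than with the number of pages once a filesystem overrides fill_pages.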
From: Andrew Morton [email blocked]
Subject: Re: batched write
Date: Sat, 17 Jun 2006 10:04:58 -0700

On Thu, 15 Jun 2006 02:08:32 +0400 "Vladimir V. Saveliev" [email blocked] wrote:
> The core of generic_file_buffered_write is
>     do {
>         grab_cache_page();
>         a_ops->prepare_write();
>         copy_from_user();
>         a_ops->commit_write();
>
>         filemap_set_next_iovec();
>         balance_dirty_pages_ratelimited();
>     } while (count);
>
>
> Would it make sence to rework this code with adding new address_space
> operation - fill_pages so that looks like:
>
>     do {
>         a_ops->fill_pages();
>         filemap_set_next_iovec();
>         balance_dirty_pages_ratelimited();
>     } while (count);
>
> generic implementation of fill_pages would look like:
>
>     generic_fill_pages()
>     {
>         grab_cache_page();
>         a_ops->prepare_write();
>         copy_from_user();
>         a_ops->commit_write();
>     }
>

There's nothing which leaps out and says "wrong" in this. But there's nothing which leaps out and says "right", either. It seems somewhat arbitrary, that's all.

We have one filesystem which wants such a refactoring (although I don't think you've adequately spelled out _why_ reiser4 wants this).

To be able to say "yes, we want this" I think we'd need to understand which other filesystems would benefit from exploiting it, and with what results?
From: Hans Reiser [email blocked] Subject: Re: batched write Date: Sat, 17 Jun 2006 10:51:23 -0700 Andrew Morton wrote: >We have one filesystem which wants such a refactoring (although I don't >think you've adequately spelled out _why_ reiser4 wants this). > > > When calling the filesystem for writes, there is processing that must be done: 1) per word 2) per page 3) per call to the FS If the FS is called per page, then it turns out that 3) costs more than 1) and 2) for sophisticated filesystems. As we develop fancier and fancier plugins this will just get more and more true. It decreases CPU usage by 2x to use per sys_write calls into reiser4 rather than per page calls into reiser4. (Vladimir, on Monday can you find and send your benchmarks?) This is significant for cached writes. If it violates the intuition to believe this, then let me point out that there was a similar motivation for the creation of bios: calling the block layer traverses more lines of code than copying a page of bytes does. Unfortunately, all that code turns out to be useful optimizations, so one cannot just take the attitude (whether for the block layer or reiser4) that it should just be simplified. Please note that I have no real problem with leaving the generic code unchanged and having reiser4 do its own write operation. I am modifying the generic code because you suggested it was preferred. Having reviewed the code in detail, I see that you were right and it is better to just fix the generic code to call more than 4k at a time into the FS, and then be able to reuse the generic aio and direct io code (and etc.) as a result. So, to be sociable, and to get more code reuse, we make this proposal. >To be able to say "yes, we want this" I think we'd need to understand which >other filesystems would benefit from exploiting it, and with what results? > > > > Or just let us have our own sys_write implementation without being excluded for it. 
I have shown that it is significantly faster for reiser4 to process things more than 4k at a time.
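Hans's three kinds of processing (per word, per page, per call into the FS) can be turned into a toy cost model. The unit costs below are invented purely for illustration and are not measurements from reiser4 or any other filesystem; the function name write_cost and its parameters are likewise assumptions made for this sketch. The constants were chosen so that one call costs more than the per-word plus per-page work for a page, which is the situation Hans describes.

```python
def write_cost(nbytes, fs_calls, c_word=1, c_page=200, c_call=1500,
               word_size=8, page_size=4096):
    """Toy CPU cost of a cached write; all unit costs are made-up numbers."""
    words = -(-nbytes // word_size)   # 1) work done per word copied
    pages = -(-nbytes // page_size)   # 2) work done per page touched
    return words * c_word + pages * c_page + fs_calls * c_call  # 3) per call


ONE_MB = 1024 * 1024
pages = ONE_MB // 4096

per_page = write_cost(ONE_MB, fs_calls=pages)  # FS entered once per page
per_write = write_cost(ONE_MB, fs_calls=1)     # FS entered once per write
print(per_page, per_write)  # 566272 183772
```

With different constants the gap shrinks or grows, so the output says nothing about the real 2x figure quoted above. It only shows why the per-call term dominates when the filesystem is entered once per page.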
From: Nix [email blocked] Subject: Re: batched write Date: Sun, 18 Jun 2006 12:20:00 +0100 On 17 Jun 2006, Hans Reiser prattled cheerily: > If the FS is called per page, then it turns out that 3) costs more than > 1) and 2) for sophisticated filesystems. As we develop fancier and > fancier plugins this will just get more and more true. It decreases CPU > usage by 2x to use per sys_write calls into reiser4 rather than per page > calls into reiser4. This seems to me to be something that FUSE filesystems might well like, too: I know one I'm working on would like to know the real size of the original write request (so that it can optimize layout appropriately for things frequently written in large chunks; the assumption being that if it's written in large chunks it's likely to be read in large chunks too). -- `Voting for any American political party is fundamentally incomprehensible.' --- Vadik
From: Hans Reiser [email blocked] Subject: Re: batched write Date: Mon, 19 Jun 2006 02:05:21 -0700 Nix wrote: >On 17 Jun 2006, Hans Reiser prattled cheerily: > > >>If the FS is called per page, then it turns out that 3) costs more than >>1) and 2) for sophisticated filesystems. As we develop fancier and >>fancier plugins this will just get more and more true. It decreases CPU >>usage by 2x to use per sys_write calls into reiser4 rather than per page >>calls into reiser4. >> >> > >This seems to me to be something that FUSE filesystems might well like, >too: I know one I'm working on would like to know the real size of the >original write request (so that it can optimize layout appropriately >for things frequently written in large chunks; the assumption being that >if it's written in large chunks it's likely to be read in large chunks >too). > > > Hi Nix, Forgive myn utter ignorance of fuse, but does it currently context switch to user space for every 4k written through VFS?
From: Miklos Szeredi [email blocked] Subject: Re: batched write Date: Mon, 19 Jun 2006 13:32:35 +0200 > Forgive myn utter ignorance of fuse, but does it currently context > switch to user space for every 4k written through VFS? Yes, unfortunately it does, so fuse would benefit from batched writing as well, with some constraint on the number of locked pages to avoid DoS against the page cache. Miklos
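Miklos's constraint, batching but with a bound on how many pages are locked at once, amounts to splitting a write into pieces that each touch at most a fixed number of pages. The generator below is a user-space sketch of that splitting. The name batches and the max_pages parameter are hypothetical, and the cap plays a role similar to the WRITE_GRANULARITY define mentioned in Hans's first mail, whose actual value and semantics are not given in the thread.

```python
def batches(offset, length, max_pages, page_size=4096):
    """Yield (offset, nbytes) pieces, each touching at most max_pages pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    end = offset + length
    while offset < end:
        first_page = offset // page_size
        # First byte past the last page this batch is allowed to touch.
        limit = (first_page + max_pages) * page_size
        chunk_end = min(end, limit)
        yield offset, chunk_end - offset
        offset = chunk_end


# A 10-page write with at most 4 pages locked per batch.
print(list(batches(0, 10 * 4096, max_pages=4)))
# [(0, 16384), (16384, 16384), (32768, 8192)]
```

An unaligned write is cut at page boundaries, so no batch ever spans more than max_pages pages even when it starts in the middle of a page.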
From: Hans Reiser [email blocked] Subject: Re: batched write Date: Mon, 19 Jun 2006 09:39:43 -0700 Miklos Szeredi wrote: >Yes, unfortunately it does, so fuse would benefit from batched writing >as well, with some constraint on the number of locked pages to avoid >DoS against the page cache. > >Miklos I would think that batched write is pretty essential then to FUSE performance. If we could then get the glibc authors to not sabotage the using of a large block size to indicate that we like large IOs (see thread on fseek implementation), reiser4 and FUSE would be all set for improved performance. Even without glibc developer cooperation, we will get a lot of benefits.
From: Andreas Dilger [email blocked] Subject: Re: batched write Date: Mon, 19 Jun 2006 09:27:40 -0700 On Jun 17, 2006 10:04 -0700, Andrew Morton wrote: > On Thu, 15 Jun 2006 02:08:32 +0400 > "Vladimir V. Saveliev" [email blocked] wrote: > > > The core of generic_file_buffered_write is > > do { > > grab_cache_page(); > > a_ops->prepare_write(); > > copy_from_user(); > > a_ops->commit_write(); > > > > filemap_set_next_iovec(); > > balance_dirty_pages_ratelimited(); > > } while (count); > > > > > > Would it make sence to rework this code with adding new address_space > > operation - fill_pages so that looks like: > > > > do { > > a_ops->fill_pages(); > > filemap_set_next_iovec(); > > balance_dirty_pages_ratelimited(); > > } while (count); > > > > generic implementation of fill_pages would look like: > > > > generic_fill_pages() > > { > > grab_cache_page(); > > a_ops->prepare_write(); > > copy_from_user(); > > a_ops->commit_write(); > > } > > > > There's nothing which leaps out and says "wrong" in this. But there's > nothing which leaps out and says "right", either. It seems somewhat > arbitrary, that's all. > > We have one filesystem which wants such a refactoring (although I don't > think you've adequately spelled out _why_ reiser4 wants this). > > To be able to say "yes, we want this" I think we'd need to understand which > other filesystems would benefit from exploiting it, and with what results? With the caveat that I didn't see the original patch, if this can be a step down the road toward supporting delayed allocation at the VFS level then I'm all for such changes. Lustre goes to some lengths to batch up reads and writes on the client into large (1MB+) RPCs in order to maximize performance. Similarly on the server we essentially bypass the VFS in order to allocate all of the RPC's blocks in one call and do a large bio write in a second. It just isn't possible to maximize performance if everything is split into PAGE_SIZE chunks. 
I believe XFS would benefit from delayed allocation, and the ext3-delalloc patches from Alex also provide a large part of the performance wins for userspace IO, when they allow large sys_write() and VM cache flush to efficiently call into the filesystem to allocate many blocks at once, and then push them out to disk in large chunks. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
From: Hans Reiser [email blocked] Subject: Re: batched write Date: Mon, 19 Jun 2006 09:51:18 -0700 Andreas Dilger wrote: > >With the caveat that I didn't see the original patch, if this can be a step >down the road toward supporting delayed allocation at the VFS level then >I'm all for such changes. > > What do you mean by supporting delayed allocation at the VFS level? Do you mean calling to the FS or maybe just not stepping on the FS's toes so much or? Delayed allocation is very fs specific in so far as I can imagine it.
From: Andreas Dilger [email blocked] Subject: Re: batched write Date: Mon, 19 Jun 2006 12:50:49 -0600 On Jun 19, 2006 09:51 -0700, Hans Reiser wrote: > Andreas Dilger wrote: > >With the caveat that I didn't see the original patch, if this can be a step > >down the road toward supporting delayed allocation at the VFS level then > >I'm all for such changes. > > What do you mean by supporting delayed allocation at the VFS level? Do > you mean calling to the FS or maybe just not stepping on the FS's toes > so much or? Delayed allocation is very fs specific in so far as I can > imagine it. Currently the VM/VFS call into the filesystem in ->prepare_write for each page to do block allocation for the filesystem. This is the filesystem's chance to return -ENOSPC, etc, because after that point the dirty pages are written asynchronously and there is no guarantee that the application will even be around when they are finally written to disk. If the VFS supported delayed allocation it would call into the filesystem on a per-sys_write basis to allow the filesystem to RESERVE space for all of the pages in the write call, and then later (under memory pressure, page aging, or even "pull" from the fs) submit a whole batch of contiguous pages to the fs efficiently (via ->fill_pages() or whatever). The fs can know at that time the final file size (if the file isn't still being dirtied), can allocate all these blocks in a contiguous chunk, can submit all of the IO in a single bio to the block layer or RPC/RDMA to net. As you well know, while it is possible to do this now by copying all of the generic_file_write() logic into the filesystem *_file_write() method, in practise it is hard to do this from a code maintenance point of view. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc.
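The split Andreas describes, reserve at write time so that ENOSPC can still be reported to the caller, then allocate later in one contiguous chunk, can be sketched with a toy model. Nothing here corresponds to a real kernel interface: the DelallocFS class, its write() and flush() methods, and the bump allocator are all assumptions made for this illustration.

```python
import errno


class DelallocFS:
    """Toy delayed-allocation model: reserve on write, place blocks on flush."""

    def __init__(self, total_blocks):
        self.free_blocks = total_blocks  # neither reserved nor allocated
        self.reserved = 0                # promised to dirty pages, not yet placed
        self.next_block = 0              # trivial bump allocator position
        self.extents = []                # (start_block, nr_blocks) per flush

    def write(self, nr_blocks):
        """Once per write call: reserve space, or fail now with ENOSPC."""
        if nr_blocks > self.free_blocks:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.free_blocks -= nr_blocks
        self.reserved += nr_blocks

    def flush(self):
        """At writeback time: place everything reserved so far as one extent."""
        if self.reserved == 0:
            return None
        extent = (self.next_block, self.reserved)
        self.next_block += self.reserved
        self.reserved = 0
        self.extents.append(extent)
        return extent


fs = DelallocFS(total_blocks=100)
fs.write(30)
fs.write(50)
print(fs.flush())  # (0, 80)
```

Two separate writes end up in a single 80-block extent because no block numbers are chosen until flush time, while a write that cannot be satisfied fails immediately rather than at writeback, when the application may no longer be around.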
From: David Chinner [email blocked] Subject: Re: batched write Date: Tue, 20 Jun 2006 10:01:33 +1000 On Mon, Jun 19, 2006 at 12:50:49PM -0600, Andreas Dilger wrote: > On Jun 19, 2006 09:51 -0700, Hans Reiser wrote: > > Andreas Dilger wrote: > > >With the caveat that I didn't see the original patch, if this can be a step > > >down the road toward supporting delayed allocation at the VFS level then > > >I'm all for such changes. > > > > What do you mean by supporting delayed allocation at the VFS level? Do > > you mean calling to the FS or maybe just not stepping on the FS's toes > > so much or? Delayed allocation is very fs specific in so far as I can > > imagine it. > > Currently the VM/VFS call into the filesystem in ->prepare_write for each > page to do block allocation for the filesystem. This is the filesystem's > chance to return -ENOSPC, etc, because after that point the dirty pages > are written asynchronously and there is no guarantee that the application > will even be around when they are finally written to disk. > > If the VFS supported delayed allocation it would call into the filesystem > on a per-sys_write basis to allow the filesystem to RESERVE space for all > of the pages in the write call, The VFS doesn't need to support delalloc as delalloc is fundamentally a filesystem property. The VFS it already provides a hook for delalloc space reservation that can return ENOSPC - it's called ->prepare_write(). Sure, a batch interface would be nice, but that's an optimisation that needs to be done regardless of whether the filesystem supports delalloc or not. The current ->prepare_write() interface shows its limits when having to do hundreds of thousands (millions, even) of ->prepare_write() calls per second. This makes for entertaining scaling problems that batching would make less of a problem. 
> and then later (under memory pressure, > page aging, or even "pull" from the fs) submit a whole batch of contiguous > pages to the fs efficiently (via ->fill_pages() or whatever). Can be done right now - XFS does this probe-and-pull operation already for writes. See xfs_probe_cluster(), xfs_cluster_write() and friends. Yes, it would be nice to have the VM pass us clusters of adjacent pages, but given that the file layout drives the cluster size it is more appropriate to do this from the filesystem. Also, the pages do not contain the state necessary for the VM to cluster pages in an way that results in efficient I/O patterns. Basically, the only thing really needed from the VFS/VM is a method of tagging delalloc (or unwritten) pages so that the writepage path knows how to treat the page being written. Currently we keep that state in bufferheads (e.g. see buffer_delay() usage) attached to the page...... > The fs can know at that time the final file size (if the file isn't still > being dirtied), can allocate all these blocks in a contiguous chunk, can > submit all of the IO in a single bio to the block layer or RPC/RDMA to net. You don't need to know the final file size - just what is contiguous in the page cache and in the same state as the page being flushed. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
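Dave's last point, that the filesystem only needs to know what is contiguous in the page cache and in the same state as the page being flushed, can be illustrated with a small probe function. This is a simplified model, not the logic of xfs_probe_cluster() itself: the dictionary standing in for the page cache, the probe_cluster name, and the state strings are assumptions for this sketch.

```python
def probe_cluster(pages, index):
    """Return (first, last) of the contiguous run around index sharing its state.

    pages maps page index -> state string; an index that is absent from the
    mapping stands for a page that is not in the cache.
    """
    state = pages[index]
    first = last = index
    while pages.get(first - 1) == state:  # extend backwards
        first -= 1
    while pages.get(last + 1) == state:   # extend forwards
        last += 1
    return first, last


cache = {0: "delalloc", 1: "delalloc", 2: "delalloc", 3: "unwritten", 5: "delalloc"}
print(probe_cluster(cache, 1))  # (0, 2)
```

The run stops both at a page in a different state (index 3) and at a hole in the cache (index 4), so the pages handed to the allocator in one go are always adjacent and uniform.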
From: Hans Reiser [email blocked] Subject: Re: batched write Date: Tue, 20 Jun 2006 00:19:24 -0700 So far we have XFS, FUSE, and reiser4 benefiting from the potential ability to process more than 4k at a time. Is it enough?
From: Andrew Morton [email blocked] Subject: Re: batched write Date: Tue, 20 Jun 2006 00:26:59 -0700 On Tue, 20 Jun 2006 00:19:24 -0700 Hans Reiser [email blocked] wrote: > So far we have XFS, FUSE, and reiser4 benefiting from the potential > ability to process more than 4k at a time. Is it enough? Spose so. Let's see what the diff looks like?
So finally... what is the merging status of Reiser4 in mainline and -mm after all the noise that was produced?
reiser4 will be merged in 2.7.0.
Too bad Debian Sarge is out, so I'll say "whenever Debian Etch is out".
So, I would say ... never :)
Anyway, people at Debian said that Etch would take less time than Sarge.
And, like Sarge, which aimed at high quality (and achieved it), when it is merged, it will be merged in an elegant manner, because the Linux devs and the reiser4 devs are so good ...
Don't confuse the discussion. Debian has nothing to do with the question GP asked.
Reiser4 - Fragmentation
I used the latest Reiser4 patch to 2.6.16 for several months on an actively used partition of roughly 180GB.
Initial performance was excellent (40 - 50MB/s); however, over time performance gradually dropped until even sequential block IO was unable to maintain 10MB/s, and CPU usage was also considerably higher than before.
I suspected a bad IDE cable was causing a fallback to PIO modes; however, after thorough investigation I found no such problem.
I backed up the partition, reformatted to XFS, and resumed operations, and after months with XFS the performance is still holding up with no measurable degradation, under the same usage pattern as before with Reiser4.
Although I didn't capture hard numbers, take it from one user that Reiser4 is excellent _until_ the effects of fragmentation kick in. Welcome to the dark side of Reiser4 that no one seems to talk about.
Hey, I'll talk about it ;-)
Reiser4 needs a repacker. It needs some attention from Zam, and it will get done. He has been distracted lately by things like batch_write, and other kernel merging related tasks, but we all agree that reiser4 needs a repacker.
I also have some longstanding ideas for the block allocator that should help this that I'd like to give a try. Basically, if you put large files at one end of the disk and small files at the other end, my guess is that performance will improve.
Ends of the disk
I would tend to agree with the notion that placing large files at one end and small files at the other could be a very good idea, at least for tight packing. I've noticed behavior like that in completely different realms (register allocation, distribution of particle sizes in sediment vs depth).
I personally would put large files early and small files later, since if you assume large files are constrained by bandwidth or seeks vs. length, you want to maximize bandwidth and minimize seeks--hence low-numbered blocks near the outer, faster edge of the disk. For small files, the seek to get the file (rather than seeks while reading the file) dominates everything.
The question is, what's the impact on workloads that access a mixture of such files? And, what happens when a file starts small and ends up big? Do you move it? What ends up moving it?
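The placement idea discussed in the last two comments can be sketched as a toy allocator that hands out large files from the low-numbered end of the disk and small files from the high-numbered end. Everything here is hypothetical: the TwoEndedAllocator class, the size threshold, and the choice of which end gets large files (following the comment above, not any actual reiser4 design). The sketch ignores freeing, growth, and relocation, which are exactly the open questions raised.

```python
import errno


class TwoEndedAllocator:
    """Toy allocator: large files grow up from block 0, small files down from the end."""

    def __init__(self, total_blocks, large_threshold):
        self.low = 0                  # next free block at the low-numbered end
        self.high = total_blocks      # one past the last free block at the high end
        self.large_threshold = large_threshold

    def allocate(self, nr_blocks):
        """Return the start block of a contiguous run of nr_blocks blocks."""
        if nr_blocks <= 0:
            raise ValueError("nr_blocks must be positive")
        if nr_blocks > self.high - self.low:
            raise OSError(errno.ENOSPC, "No space left on device")
        if nr_blocks >= self.large_threshold:
            start = self.low          # large file: low-numbered blocks
            self.low += nr_blocks
        else:
            self.high -= nr_blocks    # small file: high-numbered blocks
            start = self.high
        return start


alloc = TwoEndedAllocator(total_blocks=1000, large_threshold=64)
print(alloc.allocate(100))  # 0
print(alloc.allocate(4))    # 996
```

A file that starts small and later grows past the threshold would sit at the wrong end in this model, which is the relocation question the comment ends on.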
botched
Anyone else read that as "botched writes"?
*raises hand*
Me too!
Summary
Hi all,
I will try to summarize what I got from your decent discussion; please correct me if I am mistaken.
1. The current communication granularity between the VFS and FS layers is PAGE_SIZE (or 4K), right?
2. And this already shows up in ReiserFS and FUSE, right?
3. However, other filesystems like XFS have tweaks that let larger chunks be communicated. Do Ext2 or Ext3 do this?
4. Does everything that applies to write operations also apply to reads?
5. Is delayed allocation just a way to tell the FS to reserve a given number of bytes every time a write operation is issued?
6. I am currently running a 2.6.17 kernel. Is there a way to write large chunks (>4K) directly to the disk/FS (I mean with no buffering in between)? If yes, how?
That's it.
Thank you all again.
answering some of your questions
1. Yes, it is 4k.
4. We have a patch coming out which optimizes our reads similarly. It is easier to do, since the readahead code already works in large chunks, and the only thing we need to do is eliminate the line which turns off readahead when there is device congestion.
5. Delayed allocation allocates blocks NOT at write() time, but instead just before blocks are sent to disk. You can start to see that this is good by noting that files that never last long enough to reach disk don't affect block allocation. When you get down into the gritty details of FS implementation, it is just so much easier to do a good job of block allocation if you do it all at once in a big chunk just before flushing to disk.