On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

Sure, but it's not a "filesystem block", which is what you are calling it. IMO, it's overloading a well-known term with something different, and that's just confusing. Can we call it a block mapping layer or something like that? e.g. struct blkmap?

Extent based block mapping is entirely independent of block size. Please don't confuse the two....

Yes. Block based is simple, but has flexibility and scalability problems, e.g. the number of fsblocks that are required to map large files. It's not uncommon for us to have millions of bufferheads lying around after writing a single large file that only has a handful of extents. That's 5-6 orders of magnitude difference there in memory usage, and as memory and disk sizes get larger, this will become more of a problem....

For VM operations, no, but they would continue to be locked on a per-page basis. However, we can do filesystem block operations without needing to hold page locks, e.g. space reservation and allocation......

No, that's wrong. I'm not talking about VM parallelisation; I want to be able to support multiple writers to a single file, i.e. removing the i_mutex restriction on writes. To do that you've got to have a range locking scheme integrated into the block map for the file so that concurrent lookups and allocations don't trip over each other.

iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file.

The fsblocks and the VM page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this, and it matches very well with a range based API.
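
To make the range-lock idea concrete, here is a minimal sketch in Python rather than kernel C. It is not taken from any existing patch: the class name RangeLockMap and its methods are hypothetical, a sorted list stands in for the btree, and a real implementation would also need its own synchronisation and a way to wait for a busy range instead of simply failing.

```python
import bisect


class RangeLockMap:
    """Sorted set of non-overlapping, half-open [start, end) file ranges.

    Hypothetical illustration only: each entry plays the role of an
    "empty mapping" inserted into the per-file tree as a range lock.
    """

    def __init__(self):
        self._starts = []  # sorted start offsets of held ranges
        self._ends = []    # _ends[i] is the end offset paired with _starts[i]

    def try_lock(self, start, end):
        """Insert [start, end) as an exclusive placeholder.

        Returns True on success, False if the range overlaps a held one.
        """
        if start >= end:
            raise ValueError("empty or inverted range")
        i = bisect.bisect_right(self._starts, start)
        # The predecessor overlaps if it ends after our start.
        if i > 0 and self._ends[i - 1] > start:
            return False
        # The successor overlaps if it starts before our end.
        if i < len(self._starts) and self._starts[i] < end:
            return False
        self._starts.insert(i, start)
        self._ends.insert(i, end)
        return True

    def unlock(self, start, end):
        """Remove a placeholder previously inserted by try_lock()."""
        i = bisect.bisect_left(self._starts, start)
        if (i == len(self._starts)
                or self._starts[i] != start
                or self._ends[i] != end):
            raise KeyError((start, end))
        del self._starts[i]
        del self._ends[i]


locks = RangeLockMap()
writer_a = locks.try_lock(0, 4096)       # first writer takes [0, 4096)
writer_b = locks.try_lock(8192, 12288)   # disjoint range, can proceed in parallel
writer_c = locks.try_lock(2048, 6144)    # overlaps writer A, must wait or retry
```

The point of the sketch is that two writers touching disjoint ranges never contend, which a single per-file mutex cannot offer, and that the cost of the structure scales with the number of ranges held rather than with the number of blocks or pages they cover.
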
None of what I'm talking about requires any changes to the existing page cache or VM address space. I'm proposing that we should treat the block mapping as an address space in its own right, i.e. perhaps the struct page should not have block mapping objects attached to it at all.

By separating out the block mapping from the page cache, we make the page cache completely independent of filesystem block size, and it can just operate wholly on pages. We can implement a generic extent mapping tree instead of every filesystem having to (re)implement their own. And if the filesystem does its job of preventing fragmentation, the amount of memory consumed by the tree will be orders of magnitude lower than any fsblock based indexing.

I also like what this implies for keeping track of sub-block dirty ranges, i.e. no need for RMW cycles if we are doing sector sized and aligned I/O: we can keep track of sub-block dirty state in the block mapping tree easily *and* we know exactly what sector on disk it maps to. That means we don't care about filesystem block size, as it no longer has any influence on RMW boundaries.

None of this is possible with fsblocks, so I really think that fsblocks are not the step forward we need. They are just bufferheads under another name and hence have all the same restrictions that bufferheads imply. We should be looking to eliminate bufferheads entirely rather than perpetuating them as fsblocks.....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
