Warning ahead: I've only briefly skimmed over the pages, so the comments
in this mail are very high-level.
On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
Traditional unix buffer cache is always physical block indexed and
used for all data/metadata/blockdevice node access. There have been
a lot of variants where all or some of the data lives in a separate
(inode, logical block) indexed scheme. Most modern OSes, including Linux,
now always use the (inode, logical block) index, with some no-op substitute
for the metadata and block device node variants of operation.
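
To make the two indexing schemes concrete, here is a rough userspace sketch
in C. All names in it (phys_key, logical_key, bdev_key) are made up for
illustration and are not kernel interfaces. The identity mapping in bdev_key
is only my reading of what a "no-op substitute" for the block device node
could look like.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key for a traditional buffer cache: a buffer is
 * identified purely by where it lives on disk. */
struct phys_key {
    uint32_t dev;      /* block device identifier */
    uint64_t blocknr;  /* physical block number on that device */
};

/* Hypothetical key for a file-indexed cache: a block is identified
 * by the file it belongs to and its offset within that file. */
struct logical_key {
    uint64_t ino;      /* inode number */
    uint64_t lblock;   /* logical block offset within the file */
};

static inline bool phys_key_equal(struct phys_key a, struct phys_key b)
{
    return a.dev == b.dev && a.blocknr == b.blocknr;
}

static inline bool logical_key_equal(struct logical_key a, struct logical_key b)
{
    return a.ino == b.ino && a.lblock == b.lblock;
}

/* Assumed "no-op" mapping: treat the block device node as a file whose
 * logical block N is simply physical block N, so metadata can go through
 * the same (inode, logical block) index as file data. */
static inline struct logical_key bdev_key(uint64_t bdev_ino, uint64_t blocknr)
{
    struct logical_key k = { .ino = bdev_ino, .lblock = blocknr };
    return k;
}
```

With something like this, a lookup for file data and a lookup for metadata
differ only in which inode the key names, not in how the cache is indexed.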
Now what you replace is a really crappy hybrid of a traditional
unix buffercache implemented on top of the pagecache for the block
device node (for metadata) and a lot of abuse of the same data
structure as used in the buffercache for keeping metainformation
about the actual data mapping.
Actually most of the code is no older than 10 years. Just compare
fs/buffer.c in 2.2 and 2.6. buffer_head is a perfectly fine name
for one of its uses in the traditional buffercache.
I also think there is little to no reason to get rid of that use:
This buffercache is what most linux block-based filesystems (except
xfs and jfs most notably) are written to, and it fits them very nicely.
What I'd really like to see is to get rid of the abuse of struct buffer_head
in the data path, and the sometimes too intimate coupling of the buffer cache
with page cache internals.
That's what I mean. And from a quick glimpse at your code they're still
far too deeply coupled in fsblock. Really, we don't want to share
anything between the buffer cache and data mapping operations - they are
so deeply different that this sharing is what creates the enormous complexity
we have to deal with.
The whole concept of delayed allocation requires page allocations at
writeout time, as do various network protocols or even storage drivers.
Not really something that is the block layer's fault but rather the laziness
of the filesystem maintainers.
See now why people like large order page cache so much :)
And this is a complete pain in the ass. XFS uses vmap in its metadata buffer
cache due to requirements carried over from IRIX (in fact that's why I implemented
vmap in its current form). This works okay most of the time, but there are
a lot of scenarios where you run out of vmalloc space, as you mention. What's
also nasty is that you can't call vunmap from irq context, and that vunmap is
rather bad for system performance due to the TLB flushing overhead.
So as the closing comment I'd say I'd rather keep buffer_heads for metadata
for now and try to decouple the data path from it. Your fsblock patches
are a very nice start for this, but I'd rather skip the intermediate step
and go straight to the extent-based API Dave has been outlining. Having dealt
with the I/O path of a high-performance filesystem for a while, I find per-page
or sub-page structures a real pain to deal with, and I'd really prefer data
structures that each cover as many blocks with the same state as possible.
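
As a rough illustration of why one structure per run of same-state blocks is
attractive, here is a small userspace sketch in C. It is not the extent-based
API Dave has been outlining (that is not shown in this mail); the state names,
struct extent and coalesce() are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block states, for illustration only. */
enum blk_state { BLK_HOLE, BLK_DELALLOC, BLK_MAPPED, BLK_UNWRITTEN };

/* One record describes a whole run of consecutive blocks that share
 * the same state, instead of one record per block or per page. */
struct extent {
    uint64_t start;        /* first logical block of the run */
    uint64_t len;          /* number of blocks in the run */
    enum blk_state state;  /* state shared by every block in the run */
};

/* Collapse a per-block state array into extents.
 * Returns the number of extents written to out; stops early (leaving
 * the tail undescribed) if more than max extents would be needed. */
static inline size_t coalesce(const enum blk_state *blocks, size_t nblocks,
                              struct extent *out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < nblocks; i++) {
        /* Same state as the run we are building: just grow it. */
        if (n > 0 && out[n - 1].state == blocks[i]) {
            out[n - 1].len++;
            continue;
        }
        if (n == max)
            break;
        /* State changed: start a new run at block i. */
        out[n].start = i;
        out[n].len = 1;
        out[n].state = blocks[i];
        n++;
    }
    return n;
}
```

A file with three mapped blocks, two delayed-allocation blocks and one hole
needs three records here instead of six, and the gap grows with file size.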