V2->V3
- More restructuring
- It actually works!
- Add XFS support
- Fix up UP support
- Work out the direct I/O issues
- Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert back to constants. Disabled for 32bit and HIGHMEM configurations. This also allows a gradual migration to the new page cache inline functions. LARGE_BLOCKSIZE capabilities can be added gradually and if there is a problem then we can disable a subsystem.

V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.

This patchset modifies the Linux kernel so that larger block sizes than page size can be supported. Larger block sizes are handled by using compound pages of an arbitrary order for the page cache instead of single pages with order 0.

Rationales:

1. We have problems supporting devices with a higher blocksize than page size. This is for example important to support CDs and DVDs that can only read and write 32k or 64k blocks. We currently have a shim layer in there to deal with this situation which limits the speed of I/O. The developers are currently looking for ways to completely bypass the page cache because of this deficiency.

2. 32/64k blocksize is also used in flash devices. Same issues.

3. Future harddisks will support bigger block sizes that Linux cannot support since we are limited to PAGE_SIZE. Ok the on board cache may buffer this for us but what is the point of handling smaller page sizes than what the drive supports?

4. Reduce fsck times. Larger block sizes mean faster file system checking.

5. Performance. If we look at IA64 vs. x86_64 then it seems that the faster interrupt handling on x86_64 compensates for the speed loss due to a smaller page size (4k vs 16k on IA64). Supporting larger block sizes on all platforms allows a significant reduction in I/O overhead and increases the size of I/O that can be performed by hardware in a single r...
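The CONFIG_LARGE_BLOCKSIZE switch from the V2->V3 notes can be pictured roughly as follows. This is a userspace sketch, not code taken from the patchset: `struct mapping`, its `order` field, and the helper names `pc_shift()`, `pc_size()`, `pc_index()` and `pc_offset()` are all assumptions made for illustration. Only the idea that the inlines fall back to constants when the option is off comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                     /* assumed 4k base page */

/* Remove this define to see the helpers revert to constants. */
#define CONFIG_LARGE_BLOCKSIZE 1

/* Hypothetical stand-in for the kernel's per-file mapping object. */
struct mapping {
    unsigned int order;                   /* page cache order of this mapping */
};

#ifdef CONFIG_LARGE_BLOCKSIZE
/* Per-mapping shift: base page shift plus the compound page order. */
static inline unsigned int pc_shift(const struct mapping *m)
{
    assert(m != NULL);
    return PAGE_SHIFT + m->order;
}
#else
/* Option off: the "inline" is just the old constant. */
static inline unsigned int pc_shift(const struct mapping *m)
{
    (void)m;
    return PAGE_SHIFT;
}
#endif

/* Size in bytes of one page cache page for this mapping. */
static inline size_t pc_size(const struct mapping *m)
{
    return (size_t)1 << pc_shift(m);
}

/* Page cache index that holds file position pos. */
static inline uint64_t pc_index(const struct mapping *m, uint64_t pos)
{
    return pos >> pc_shift(m);
}

/* Byte offset of pos inside its page cache page. */
static inline size_t pc_offset(const struct mapping *m, uint64_t pos)
{
    return (size_t)(pos & (pc_size(m) - 1));
}
```

With `order = 2` and the option enabled, a file position of 40000 falls in page cache index 2 at offset 7232, because each page cache page then covers 16k.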
Hi. I have a few questions about that patchset: 1) Is it possible for a block device to assume that it will always get big requests (aligned to the big blocksize)? 2) Does metadata reading/writing also occur using the same big blocksize? 3) If so, how are __bread/__getblk affected? Does the returned buffer_head point to the whole block? And what do you think about my design? I want to link the parts of a compound page through buffer_heads, so the head page's bh points to the second page's (tail page's) bh, and from this bh it is possible to reference the page itself, and so on. (This will allow a compound page to be physically fragmented.) Best regards, Maxim Levitsky PS: I ask questions since this patchset does matter to me. I would really like to see this <= 4K limit lifted (all software limits are bad) and finally get good packet writing... I miss DirectCD much... Although >4K blocksizes are really only a first step. To make really fast packet writing, the UDF filesystem should be rewritten as well. -
That is one of the key problems. We hope that Mel Gorman's antifrag work Correct. -
Something I was looking for but couldn't find: suppose an application takes a pagefault against the third 4k page of an order-2 pagecache "page". We need to instantiate a pte against find_get_page(offset/4)+3. But these patches don't touch mm/memory.c at all and filemap_nopage() appears to return the zeroth 4k page all the time in that case. So.. what am I missing, and how does that part work? Also, afaict your important requirements would be met by retaining PAGE_CACHE_SIZE=4k and simply ensuring that pagecache is populated by physically contiguous pages - so instead of allocating and adding one 4k page, we allocate an order-2 page and sprinkle all four page*'s into the radix tree in one hit. That should be fairly straightforward to do, and could be made indistinguishably fast from doing a single 16k page for some common pagecache operations (gang-insert, gang-lookup). The BIO and block layers will do-the-right-thing with that pagecache and you end up with four times more data in the SG lists, worst-case. -
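The index arithmetic behind the pagefault question can be sketched as below. This is only an illustration of the `find_get_page(offset/4)+3` expression, generalised to an arbitrary order; `struct subpage_ref` and `locate_subpage()` are hypothetical names, and nothing here says how the patchset actually resolves the fault.

```c
#include <assert.h>
#include <limits.h>

/* Where a 4k file page lives inside a higher order page cache page. */
struct subpage_ref {
    unsigned long head_index;   /* page cache index of the compound page */
    unsigned long sub;          /* 4k piece within that compound page */
};

/*
 * pgoff is the faulting file offset in 4k units, order is the page
 * cache order (2 means four 4k pieces per page cache page).
 */
static struct subpage_ref locate_subpage(unsigned long pgoff, unsigned int order)
{
    struct subpage_ref r;

    assert(order < sizeof(unsigned long) * CHAR_BIT);
    r.head_index = pgoff >> order;               /* offset / 4 for order 2 */
    r.sub = pgoff & ((1UL << order) - 1);        /* offset % 4 for order 2 */
    return r;
}
```

For `pgoff = 11` and `order = 2` this gives head index 2 and sub-page 3, which is the case where returning the head page alone would map the wrong 4k of data.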
Sure, that addresses the larger I/O side of things, but it doesn't address the large filesystem blocksize issues that can only be solved with some kind of page aggregation abstraction. Compound pages and high order page cache indexing solves this extremely neatly, regardless of whether the compound page is contiguous or not..... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
a) That wasn't a part of Christoph's original rationale list, so forgive me for thinking it is not so important and got snuck in post-facto when things got tough. b) I don't immediately see why a filesystem cannot implement larger blocksizes via this scheme - instantiate and lock four pages and go for We cannot say anything about neatness until we've seen mmap. -
I've been pushing Christoph to do something like this for more than a year purely so we can support large block sizes in XFS. He's got other reasons for wanting to do this, but that doesn't mean that the large filesystem So now how do you get block aligned writeback? Or make sure that truncate doesn't race on a partial *block* truncate? You basically have to jump through nasty, nasty hoops, to handle corner cases that are introduced because the generic code can no longer reliably lock out access to a filesystem block. Eventually you end up with something like fs/xfs/linux-2.6/xfs_buf.c and doing everything inside the filesystem because it's the only sane way to serialise access to these aggregated structures. This is the way XFS used to work in its data path, and we all know how long and loud people complained about that..... A filesystem specific aggregation mechanism is not a palatable solution here because it drives filesystems away from being able to use generic code. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
in writeback and pageout: if (page->index & mapping->block_size_mask) I would expect we could (should) implement this in generic code by modifying the existing stuff. I'm not saying it's especially simple, nor fast. But it has the advantage that we're not forced to use larger pages with _its_ attendant performance problems. And it will benefit all filesystems immediately. And it doesn't introduce a rather nasty hack of pretending (in some places) that pages are larger than they really are. And it has the very significant advantage that it doesn't introduce brand new concepts and some complexity into core MM. And make no mistake: the latter disadvantage is huge. Because if we do the PAGE_CACHE_SIZE hack (sorry, but it _is_), we have to do it *for ever*. Maintaining and enhancing core MM and VFS becomes harder and more costly and slower and more buggy *for ever*. The ramp for people to become competent on core MM becomes longer. Our developer pool becomes smaller, and proportionally less skilled. And hardware gets better. If Intel & AMD come out with a 16k pagesize option in a couple of years we'll look pretty dumb. If the problems which you're presently having with that controller get sorted out in the next generation of the hardware, we'll also look pretty dumb. As always, there are tradeoffs. We can see the cons, and they are very significant. We don't yet know the pros. Perhaps they will be similarly significant. But I don't believe that the larger PAGE_CACHE_SIZE hack (sorry) is the only way in which they can be realised. -
Unfortunately, this isn't a problem with hardware getting better, but a willingness to break backwards compatibility. x86_64 uses a 4k page size to avoid breaking 32-bit applications. And unfortunately, iirc, even 64-bit applications are continuing to depend on 4k page alignments for things like the text and bss segments. If the userspace ELF and other compiler/linker specifications were appropriately written so they could handle 16k pagesizes, maybe 5 years from now we could move to a 16k pagesize. But this is going to require some coordination between the userspace binutils folks and AMD/Intel in order to plan such a migration. - Ted -
The AMD64 psABI requires binaries to work with any page size up to 64k. Whether that's true in practice is another matter entirely, of course. -- Nicholas Miell <nmiell@comcast.net> -
64-bit applications are a non-issue. The ABI requires them to handle it. It's 32-bit applications on x86-64 that are the concern. ABI emulation for them is more involved when 4K 32-bit ABI's are to be emulated on kernels compiled for larger native pagesizes. In practice, so many use getpagesize() it may not be much of an issue, but consider WINE and other sorts of non-Linux-native 32-bit apps, among other issues. -- wli -
So we might do writeback on one page in N - how do we make sure none of the other pages are reclaimed while we are doing writeback on this block? IOWs, we have to lock every page in the block, mark them all as writeback, etc. Instead of doing something once, we have to repeat it for every page in the block. This is better than a compound And the locking order? How do you enforce *kernel wide* the same locking order for all pages in the same block so that we don't get ABBA deadlocks on page locks within a block? So you're suggesting that we reintroduce a buffer-oriented filesystem So you'll take slow, inefficient and complex rather than use a non-intrusive and /optional/ interface to large pages? Words fail me...... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
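One conventional answer to the ABBA question is to always take the page locks of a block in ascending index order, starting from the first page of the block no matter which page triggered the operation. The sketch below only models that ordering rule in userspace; `struct fake_page`, `lock_block()` and `unlock_block()` are hypothetical, the "lock" is a plain flag, and it does not claim to cover the reclaim and writeback interactions raised above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page: a real kernel page lock would sleep instead of asserting. */
struct fake_page {
    int locked;
};

/*
 * Lock every page of the block that contains page `index`.
 * pages[] is indexed by page cache index. Locks are always taken
 * lowest index first, so two callers entering through different
 * pages of the same block still agree on the order.
 * Returns the index of the first page of the block.
 */
static unsigned long lock_block(struct fake_page *pages, unsigned long index,
                                unsigned long pages_per_block)
{
    unsigned long start, i;

    assert(pages != NULL && pages_per_block > 0);
    start = index - (index % pages_per_block);
    for (i = start; i < start + pages_per_block; i++) {
        assert(!pages[i].locked);   /* stand-in for waiting on the lock */
        pages[i].locked = 1;
    }
    return start;
}

/* Release the block's page locks in reverse order. */
static void unlock_block(struct fake_page *pages, unsigned long start,
                         unsigned long pages_per_block)
{
    unsigned long i;

    for (i = start + pages_per_block; i-- > start; )
        pages[i].locked = 0;
}
```

The cost Dave points at is visible even here: every operation that used to touch one lock now loops over `pages_per_block` of them.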
By marking them all dirty when one is marked dirty. David, you're perfectly capable of working all this out yourself. But I already said it'd be a bit slower. But given that those four pageframes The optionality is useful - it at least means that we can easily remove it all if/when it becomes obsolete. -
I've looked at all this but I'm trying to work out if anyone else has looked at the impact of doing this. I have direct experience with this form of block aggregation - this is pretty much what is done in irix - and it's full of nasty, ugly corner cases. I've got several year-old Irix bugs assigned that are hit every so often where one page in the aggregated set has the wrong state, and it's simply not possible to either reproduce the problem or work out how it happened. The code has grown too complex and convoluted, and by the time the problem is noticed (either by hang, panic or bug check) the cause of it is long gone. I don't want to go back to having to deal with this sort of problem - I'd much prefer to have a design that does not make the same No, I'm speaking from years of experience working on a page/buffer/chunk cache capable of using both large pages and aggregating multiple pages. It has, at times, almost driven me insane and I don't want to go back there. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
So the practical question is: was it a high level design problem or was it simply a choice of implementation issue? Until we code review an implementation that does page aggregation for linux we can't say how nasty it would be. Of course what gets confusing is when you refer to the previous implementation as a buffer cache, because that isn't at all what Linux had for a buffer cache. The linux buffer cache was the same as the current page cache except it was indexed by block number and

The suggestion seems to be to always aggregate pages (to handle PAGE_SIZE < block size), and not to even worry about the fact that it happens that the pages you are aggregating are physically contiguous. The memory allocator and the block layer can worry about that. It isn't something the page cache or filesystems need to pay attention to. I suspect the implementation in linux would be sufficiently different that it would not be prone to the same problems. Among other things we already do most things on a range of page addresses, so we would seem to have most of the infrastructure already. It looks like we would extend the current batching a little more so it covers all of the interesting cases:

- (read) Ensure the dirty bit on all pages in the group when we set it on one page.
- Add re-read when we dirty the group if we don't have it all present.
- Round the range we operate on up so we cleanly hit the beginning and end of the group size.
- Only issue the mapping operations on the first page in the group.

That is about what we would have to do to handle multiple pages in one block in the page cache. There are clearly more details but as a first approximation I don't see this being fundamentally more complex than what we are currently doing. Just taking into account a few more details. The whole physical continuity thing seems to come cleanly out of a speculative page allocator, and that would seem to work and provide improvements on smaller block sizes filesyste...
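Two of the steps Eric lists, rounding a page range out to group boundaries and dirtying the whole group when one page is dirtied, can be sketched as below. This is a minimal userspace model under the assumption that a "group" is a fixed number of 4k pages aligned on its own size; `struct page_range`, `round_to_group()` and `set_group_dirty()` are hypothetical names, and the dirty state is just a byte array.

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive range of page cache indices. */
struct page_range {
    unsigned long first;
    unsigned long last;
};

/* Widen [first, last] so it starts and ends on group boundaries. */
static struct page_range round_to_group(unsigned long first, unsigned long last,
                                        unsigned long group_pages)
{
    struct page_range r;

    assert(group_pages > 0 && first <= last);
    r.first = first - (first % group_pages);
    r.last = last - (last % group_pages) + group_pages - 1;
    return r;
}

/* Dirtying one page marks every page of its group dirty. */
static void set_group_dirty(unsigned char *dirty, unsigned long index,
                            unsigned long group_pages)
{
    struct page_range r = round_to_group(index, index, group_pages);
    unsigned long i;

    assert(dirty != NULL);
    for (i = r.first; i <= r.last; i++)
        dirty[i] = 1;
}
```

With four pages per group, a request for pages 5 to 9 is widened to pages 4 to 11, and dirtying page 9 dirties pages 8 to 11.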
Both. So many things can happen asynchronously to a page that it is just about impossible to predict all the potential race conditions that are involved. complexity arose from trying to fix We already have an implementation - I've pointed it out several times now: see fs/xfs/linux-2.6/xfs_buf.[ch]. performance problems in using discontiguous pages and needing to Filesystems don't typically do this - they work on blocks and assume Hmmm - we're not talking about using 64k block size filesystems to store lots of little files or using them on small, slow disks. We're looking at optimising for multi-petabyte filesystems with multi-terabyte sized files sustaining throughput of tens to hundreds of GB/s to/from hundreds to thousands of disks. I certainly don't consider 64k block size filesystems as something suitable for desktop use - maybe PVRs would benefit, but this is not something you'd use for your kernel build environment on a single disk in a desktop system.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Yes, but it isn't a generic implementation in mm/filemap.c, it is a compatibility layer. It lives with the current deficiencies Always? Ugh. I just realized looking at the xfs code that it doesn't But that is how mm/filemap.c works. The calls into the filesystem can be per multi-page group as they are currently per page. The point is that the existing in kernel abstraction is already larger than a Yes. You are talking about only fixing the kernel for your giant 64K block filesystems that are only interesting on peta-byte arrays. I am pointing out that the other fixes that have been discussed, optimistic contiguous page allocation and a larger linux scatter gather list, are interesting on much smaller filesystem and machine sizes where small files are still interesting. That makes them generally better improvements for linux. If you only improve the giant peta-byte raid cases 99% of linux users simply don't care, and so the code isn't very interesting. Eric -
Just to stick my two cents in here: The definition of what is meant by "large" filesystems has to change with the advances in disk drive technology. In the not too distant past, a "large" single filesystem was 100 GB. There are now consumer grade disks on the market with 1 TB available in a single unit. I don't know about you guys, but that scares the crap out of me, in terms of dealing with that much space on a desktop machine. Efficiently dealing with transferring that much data on a desktop (never mind server) means re-thinking the limitations of the I/O subsystems. What was once the realm of the data center is now the realm of the living room. Large data sets are becoming more commonplace (HD Movies, audio files, etc) with each passing day, and there is no end in sight in the progression. In addition, with the release of specs recently about larger sector sizes for disk drives (2048 bytes, or larger), this is going to become a pressing need for the general case, not just the extremely large servers, or HPC machines and clusters. Already there is no efficient way to back up that much space, in a reasonable time, except to have another disk of a similar or larger size to back up to. Anything we can do to make disk I/O *Faster* is a win. I recognize that there is a huge issue in dealing with sub block size files. The trade off of small files VS large blocks is now a non trivial problem. Once disk sector sizes increase, the problems will have to be dealt with in a more intelligent manner, possibly dividing sectors into smaller logical blocks for small files? Maybe filesystems that can understand multiple block sizes? Well, we do live in interesting times; we just have to make the most of it. Dan Weigert -----Original Message----- From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Eric W. Biederman Sent: Monday, May 07, 2007 2:56 AM To: David Chinner Cc: Andrew Morton; clameter@sgi.com; linux-kernel@vger.kernel.org...
And the 50% gain in the benchmarks means what? That the device manufacturers have to redesign their chips? -
We're talking about two separate things here - let us not conflate them. 1: The arguably-crippled HBA which wants bigger SG lists. 2: The late-breaking large-blocksizes-in-the-fs thing. None of this multiple-page-locking stuff we're discussing here is relevant to the HBA performance problem. It's pretty simple (I think) for us to ensure that, for the great majority of the time, contiguous pages in a file are also physically contiguous. Problem solved, HBA go nice and quick, move on. Now, we have the second and completely unrelated requirement: supporting fs-blocksize > PAGE_SIZE. One way to address this is via the mangle-multiple-pages-into-one approach. And it's obviously the best way to do it, if mangle-multiple-pages is already available. But I don't know how important requirement 2 is. XFS already has presumably-working private code to do it, and there is simplification and perhaps modest performance gain in the block allocator to be had here. And other filesystems (ie: ext4) _might_ use it. But ext4 is extent-based, so perhaps it's not worth churning the on-disk format to get a bit of a boost in the block allocator. So I _think_ what this boils down to is offering some simplifications in XFS, by adding complexications to core VFS and MM. I dunno if that's a good deal. So... tell us why you want feature 2? -
Well from other parts of the conversation there is a third issue. 3: large-sectorsize-on-disk. There are a handful of devices in the kernel that could benefit and be cleaned up a great deal if they could assume they always received data in their sg lists that were full sectors. Nothing needs to be physically contiguous to handle that case though. If we support large sector sizes for raw block devices we would still have an issue of what to do with filesystems that want I suspect we will still need Jens > 128 page linux scatter gather list Agreed. When we are doing things optimistically and absolutely require large pages this approach seems pretty sane. When we start requiring large 64k A good question. Eric -
Well, ext3 could definitely use it; there are people using 8k and 16k blocksizes on ia64 systems today. Those filesystems can't be mounted on x86 or x86_64 systems because our pagesize is 4k, though. And I imagine that ext4 might want to use a large blocksize too --- after all, XFS is extent based as well, and not _all_ of the advantages of using a larger blocksize are related to brain-damaged storage subsystems with short SG list support. Whether the advantages offset the internal fragmentation overhead or the complexity of adding fragments support is a different question, of course. So while the jury is out about how many other filesystems might use it, I suspect it's more than you might think. At the very least, there may be some IA64 users who might be trying to transition their way to x86_64, and have existing filesystems using 8k or 16k blocks. :-) - Ted -
How much of a problem would it be if those blocks were not necessarily contiguous in RAM, but placed in normal 4K pages in the page cache? I expect meta data operations would have to be modified but that otherwise you would not care. Eric -
If you need to treat the block in a contiguous range, then you need to vmap() the discontiguous pages. That has substantial overhead if you have to do it regularly. We do this in xfs_buf.c for > page size blocks - the overhead this caused when operating on inode clusters resulted in us doing some pointer fiddling and directly addressing the contents of each page I think you might need to modify the copy-in and copy-out operations substantially (e.g. prepare_/commit_write()) as they assume a buffer doesn't span multiple pages..... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Which is why I would prefer not to do it. I think vmap is not really compatible with the design of the linux page cache. Although we can't even count on the pages being mapped into low memory right now and have to call kmap if we want to access them so things might not be that bad. Even if it was a multipage kmap But in a filesystem like ext2, except for zeroing some unused hunks of the page, all that really happens is you set up for DMA straight out of the page cache. So this is primarily an issue for meta-data. Eric -
Right - so how do we efficiently manipulate data inside a large block that spans multiple discontiguous pages if we don't vmap Except when your structures span page boundaries. Then you can't directly reference the structure - it needs to be copied out elsewhere, modified and copied back. That's messy and will require significant modification I'm not sure I follow you here - copyin/copyout is to userspace and has to handle things like RMW cycles to a filesystem block. e.g. if we get a partial block over-write, we need to read in all the bits around it and that will span multiple discontiguous pages. Currently these functions only handle RMW operations on something up to a single page in size - to handle a RMW cycle on a block larger than a page they are going to need substantial modification or entirely new interfaces. The high order page cache avoids the need to redesign interfaces because it doesn't change the interfaces between the filesystem and the page cache - everything still effectively operates on single pages and the filesystem block size never exceeds the size of a single page..... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
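The read-modify-write case Dave describes comes down to widening a partial write to whole filesystem blocks and counting how many 4k pages that touches. A rough userspace sketch follows; `struct rmw_range` and `rmw_range_for_write()` are hypothetical names, and the sketch assumes the block size is a multiple of the page size and that the write is not empty.

```c
#include <assert.h>
#include <stdint.h>

/* Block-aligned byte range that a partial write forces us to have in cache. */
struct rmw_range {
    uint64_t start;          /* first byte, block aligned */
    uint64_t end;            /* one past the last byte, block aligned */
    unsigned long nr_pages;  /* number of base pages covering the range */
};

static struct rmw_range rmw_range_for_write(uint64_t pos, uint64_t len,
                                            uint64_t block_size,
                                            uint64_t page_size)
{
    struct rmw_range r;

    assert(len > 0 && page_size > 0);
    assert(block_size >= page_size && block_size % page_size == 0);

    r.start = pos - (pos % block_size);
    r.end = ((pos + len + block_size - 1) / block_size) * block_size;
    r.nr_pages = (unsigned long)((r.end - r.start) / page_size);
    return r;
}
```

A 100 byte write at position 70000 into a 64k block filesystem with 4k pages needs bytes 65536 to 131071 present, which is 16 separate pages if the block is built from order-0 pages and one page if it is a single order-4 page.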
You don't manipulate data except for copy_from_user, copy_to_user. That is comparatively easy to deal with, and certainly doesn't need vmap. Meta-data may be trickier, but a lot of that depends on your Potentially. This is just a meta-data problem, and possibly we solve it with something like vmap. Possibly the filesystem won't cross those kinds of boundaries and we simply never care. The fact that it is a meta-data problem suggests it isn't the fast path and we can incur a little more cost. Especially if we filesystems Bleh. It has been too many days since I have hacked that code; I forgot which piece that was. Yes. prepare_to_write is called before we write to the page cache from the filesystem. We already handle multiple page writes fairly well in that context. prepare_write, commit_write may need a page cache but it may not. All that really needs to happen is that all of the pages that are part of the block get marked dirty in the page cache so one Yes, instead of having to redesign the interface between the fs and the page cache for those filesystems that handle large blocks we instead need to redesign significant parts of the VM interface. Shift the redesign work to another group of people and call it trivial. That is hardly a gain when it looks like you can have the same effect with some moderately simple changes to mm/filemap.c and the existing interfaces. Eric -
To some extent that is true. But then there will also be additional gain: We can likely get the VM to handle larger pages too which may get rid of hugetlb fs etc. The work is pretty straightforward: No locking changes f.e. So hardly a redesign. I think the crucial point is the antifrag/defrag issue if we want to generalize it. I have an updated patch here that relies on page reservations. Adds something called page pools. On bootup you need to specify how many pages of each size you want. The page cache will then use those pages for filesystems that need larger blocksize. The interesting thing about that one is that it actually enables support for multiple blocksizes with a single larger pagesize. If f.e. we setup a pool of 64k pages then the block layer can segment that into 16k pieces. So one can actually use 16k 32k and 64k block size with a single larger page size. -
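The segmentation idea at the end of that message, one pool of 64k pages serving 16k, 32k and 64k block sizes, is plain division and remainder. The sketch below shows the arithmetic only; `struct block_loc` and `locate_block()` are hypothetical and say nothing about how the page pool patch itself is structured.

```c
#include <assert.h>

/* Position of a filesystem block inside a pool of larger pages. */
struct block_loc {
    unsigned long page;     /* which pool page holds the block */
    unsigned long offset;   /* byte offset of the block inside that page */
};

static struct block_loc locate_block(unsigned long block_nr,
                                     unsigned long block_size,
                                     unsigned long pool_page_size)
{
    struct block_loc loc;
    unsigned long per_page;

    assert(block_size > 0 && pool_page_size >= block_size);
    assert(pool_page_size % block_size == 0);

    per_page = pool_page_size / block_size;   /* e.g. 4 for 16k in 64k */
    loc.page = block_nr / per_page;
    loc.offset = (block_nr % per_page) * block_size;
    return loc;
}
```

With a 64k pool page, 16k block number 5 lands in pool page 1 at offset 16384, and 32k block number 3 lands in pool page 1 at offset 32768.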
Sadly, a backward compatibility stub must be retained in perpetuity. It should be able to be reduced to the point it doesn't need its own dedicated source files or config options, but it'll need something to deal with the arch code. -- wli -
I wonder what happened to my pagearray patches. -- wli -
I never really got the thing working, but I had an idea for a sort of
library to do this. This is/was probably against something like 2.6.5
but I honestly have no idea. Maybe this makes it something of an API
proposal.
-- wli
Index: linux-2.6/include/linux/pagearray.h
===================================================================
--- linux-2.6.orig/include/linux/pagearray.h 2004-04-06 10:56:48.000000000 -0700
+++ linux-2.6/include/linux/pagearray.h 2005-04-22 06:06:02.677494584 -0700
@@ -0,0 +1,24 @@
+#ifndef _LINUX_PAGEARRAY_H
+#define _LINUX_PAGEARRAY_H
+
+struct scatterlist;
+struct vm_area_struct;
+struct page;
+
+struct pagearray {
+ struct page **pages;
+ int nr_pages;
+ size_t length;
+};
+
+int alloc_page_array(struct pagearray *, const int, const size_t);
+void free_page_array(struct pagearray *);
+void zero_page_array(struct pagearray *);
+struct page *nopage_page_array(const struct vm_area_struct *, unsigned long, unsigned long, int *, struct pagearray *);
+int mmap_page_array(const struct vm_area_struct *, struct pagearray *, const size_t, const size_t);
+int copy_page_array_to_user(struct pagearray *, void __user *, const size_t, const size_t);
+int copy_page_array_from_user(struct pagearray *, void __user *, const size_t, const size_t);
+struct scatterlist *pagearray_to_scatterlist(struct pagearray *, size_t, size_t, int *);
+void *vmap_pagearray(struct pagearray *);
+
+#endif /* _LINUX_PAGEARRAY_H */
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2005-04-22 06:01:29.786980248 -0700
+++ linux-2.6/mm/Makefile 2005-04-22 06:06:02.677494584 -0700
@@ -10,7 +10,7 @@
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
readahead.o slab.o swap.o truncate.o vmscan.o \
- prio_tree.o $(mmu-y)
+ prio_tree.o pagearray.o $(mmu-y)
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thras...

This should probably have memcpy to/from pagearrays. Whole-hog read and write f_op implementations would be good, too, since ISTR some drivers basically do little besides that on their internal buffers.

vmap_pagearray() should take flags, esp. VM_IOREMAP but perhaps also protections besides PAGE_KERNEL in case uncachedness is desirable. I'm not entirely sure what it'd be used for if discontiguity is so heavily supported. My wild guess is drivers that do things that are just too weird to support with the discontig API, since that's how I used it. It should support vmap()'ing interior sub-ranges, too.

The pagearray mmap() support is schizophrenic as to whether it prefills or faults and not all that complete as far as manipulating the mmap() goes. Shooting down ptes, flipping pages, or whatever drivers actually do with the things should have helpers arranged. Coherent sets of helpers for faulting vs. mmap()'ing idioms would be good.

pagearray_to_scatterlist() should probably take the scatterlist as an argument instead of allocating the scatterlist itself. Something to construct bio's from pagearrays might help.

s/page_array/pagearray/g should probably be done. Prefixing with pagearray_ instead of randomly positioning it within the name would be good, too.

Some working API conversions on drivers sound like a good idea. I had some large number of API conversions about, now lost, but they'd be bitrotted anyway.

struct pagearray is better off as an opaque type so large pagearray handling can be added in later via radix trees or some such, likewise for expansion and contraction. Keeping drivers' hands off the internals is just a good idea in general.

I'm somewhat less clear on what filesystems need to do here, or if it would be useful for them to efficiently manipulate data inside a large block that spans multiple discontiguous pages.
I expect some changes are needed at the very least to fill a pagearray with whatever predetermined pages are needed. Filesystems probably nee...
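The "memcpy to/from pagearrays" helper wli says is missing could look roughly like the userspace sketch below. It is not part of the posted patch: `struct pagearray_sketch` mirrors the three fields of `struct pagearray` but uses plain byte buffers instead of `struct page`, and `pagearray_copy_in()`, `pagearray_copy_out()` and `SKETCH_PAGE_SIZE` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u            /* assumed base page size */

/* Userspace stand-in: each "page" is just a SKETCH_PAGE_SIZE buffer. */
struct pagearray_sketch {
    unsigned char **pages;
    int nr_pages;
    size_t length;                        /* usable bytes across all pages */
};

/* Copy len bytes from src into the array at byte offset off. */
static int pagearray_copy_in(struct pagearray_sketch *pa, size_t off,
                             const void *src, size_t len)
{
    const unsigned char *s = src;

    assert(pa != NULL && pa->pages != NULL);
    if (off > pa->length || len > pa->length - off)
        return -1;                        /* range falls outside the array */

    while (len > 0) {
        size_t pg = off / SKETCH_PAGE_SIZE;
        size_t in_pg = off % SKETCH_PAGE_SIZE;
        size_t chunk = SKETCH_PAGE_SIZE - in_pg;

        if (chunk > len)
            chunk = len;                  /* stop at the page boundary */
        memcpy(pa->pages[pg] + in_pg, s, chunk);
        s += chunk;
        off += chunk;
        len -= chunk;
    }
    return 0;
}

/* Copy len bytes out of the array at byte offset off into dst. */
static int pagearray_copy_out(const struct pagearray_sketch *pa, size_t off,
                              void *dst, size_t len)
{
    unsigned char *d = dst;

    assert(pa != NULL && pa->pages != NULL);
    if (off > pa->length || len > pa->length - off)
        return -1;

    while (len > 0) {
        size_t pg = off / SKETCH_PAGE_SIZE;
        size_t in_pg = off % SKETCH_PAGE_SIZE;
        size_t chunk = SKETCH_PAGE_SIZE - in_pg;

        if (chunk > len)
            chunk = len;
        memcpy(d, pa->pages[pg] + in_pg, chunk);
        d += chunk;
        off += chunk;
        len -= chunk;
    }
    return 0;
}
```

A copy that starts two bytes before a page boundary is split into two `memcpy()` calls, which is the page-by-page handling a structure spanning discontiguous pages would need without vmap().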
You would only lock a single higher order block. Truncate works on that level. If you have 4 separate pages then you need to take separate locks and you may not have contiguous memory which makes the filesystem run through all The patch is not about forcing to use large pages but about the option to The patchset would reduce complexity and making it easy to handle the page cache. Gets rid of the hacks to support larger ones right now. Its No it becomes easier. Look at the patchset. It cleans up a huge mess. What is hacky about it? It is consistently using larger pages for the page We are currently looking dumb and unable to deal with the hardware. Yes we can pressure the hardware vendors to produce hardware conforming to our It is the most consistent solution that avoid the proliferation of further hacks to address the large blocksize. -
This is completely incorrect. Of *course* they're contiguous. That's the whole point. It's not exactly hard to lock four pages which are contiguous in pagecache, No it doesn't and please stop spinning. x86 ptes map 4k pages and the core MM needs changes to continue to work with this hack in place. If x86 had larger pagesize we wouldn't be seeing any of this. It is a workaround Were any cleanups made which were not also applicable as standalone things It pretends that pages are larger than they actually are, forcing the pte-management code to also play along with the pretence. That's spin as well. Please address my point: if in five years time x86 has larger or variable pagesize, this code will be a permanent millstone around our necks which we *should not have merged*. And if in five years time x86 does not have larger pagesize support then the manufacturers would have decided that 4k pages are not a performance You cannot say this. I'm sitting here *watching* you refuse to seriously consider alternatives. And you've conspicuously failed to address my point regarding the *permanent* additional maintenance cost. Anyway. Let's await those performance numbers. If they're notably good, and if we judge that this goodness will be realised on more than one arguably-crippled present-day disk adapter then we can evaluate the *various* options which we have for stuffing more data into that adapter. -
So the verdict is wait 5 years, see if x86 did anything, and so on. Has anyone else noticed our embedded arch installed base dwarfs our x86 and "enterprise" installed bases combined? What are our priorities that make designing the core around x86 meaningful again? Pipe dreams of competing with Windows on the desktop? Optimizing for kernel compiles on kernel hackers' workstations? Something should seriously be reevaluated there at some point. As x86 is now the priority regardless, maybe checking in with Intel and AMD as far as what they'd like to see happen would be enlightening. It may be that some things are deadlocked on OS use cases. Also, is there something in particular that should be done for the case of x86 acquiring a variable pagesize? -- wli -
You missed the bit about "evaluate alternatives". -
No worries. I'm used to being on the wrong side of things. I'll have no trouble picking out the alternative least likely to be accepted. ;) -- wli -
Unfortunately, it's not really practical to increase the page size very much on most systems, because you end up wasting a lot of space in the page cache. So there is a tension between wanting a small page size so your page cache uses memory efficiently, and wanting a large page size so the TLB covers more address space and your programs run faster (not to mention other benefits such as the kernel having to manage fewer pages, and I/O being done in bigger chunks).

Thus there is not really any single page size that suits all workloads and machines. With distros wanting to just have a single kernel per architecture, and the fact that the page size is a compile-time constant, we currently end up having to pick one size and just put up with the fact that it will suck for some users. We currently have this situation on ppc64 now that POWER5+ and POWER6 machines have hardware support for 64k pages as well as 4k pages.

So I can see a few different options:

(a) Keep things more or less as they are now and just wear the fact that we will continue to show lower performance than certain proprietary OSes, or

(b) Somehow manage to make the page size a variable rather than a compile-time constant, and pick a suitable page size at boot time based on how much memory the machine has, or something. I looked at implementing this at one point and recoiled in horror. :)

(c) Make the page cache able to use small pages for small files and large pages for large files. AIUI this is basically what Christoph is proposing.

Option (a) isn't very palatable to me (nor I expect, Christoph :) since it basically says that Linux is very much focussed on the embedded and desktop end of things and isn't really suitable as a high-performance OS for large SMP systems. I don't want to believe that. ;) Option (b) would be a bit of an ugly hack. Which leaves option (c) - unless you have a further option.

So I have to say I support Christoph on this, at least as far as the general principle is conc...
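To put a rough number on the page cache waste mentioned in the message above, here is a minimal sketch (not from the thread). It assumes each cached file occupies a whole number of pages; the function name `pagecache_waste` and the file sizes used with it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes of page cache left unused when a file of file_size bytes is
 * cached in pages of page_size bytes (internal fragmentation).
 * Assumption: the file's last page is only partially filled and
 * nothing else shares it.
 */
size_t pagecache_waste(size_t file_size, size_t page_size)
{
    assert(page_size > 0);
    /* Round the page count up so the tail of the file gets a page. */
    size_t pages = (file_size + page_size - 1) / page_size;
    return pages * page_size - file_size;
}
```

For a 1000-byte file this gives 3096 wasted bytes with 4k pages and 64536 with 64k pages, which is the tension between cache efficiency and TLB reach the message describes. A machine caching many small files pays that cost once per file.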
We could approximate option (b) by setting a standard page size for the page cache and set the same page size in SLUB (slub is already boot time configurable in that respect). That will make large portions of the VM use the same page order. I can try to add similar controls to the page cache. If the page size is set too high for a mount then we use the buffer head functionality to split the higher order page into pieces of the appropriate size. That could limit the number of page sizes that we need to support. But I think we should first see how well Mel's antifrag work does. -
For the TLB issue, higher order pagecache doesn't help. If distros ship with a 4K page size on powerpc, and use some larger pages in the pagecache, some people are still going to get angry because they wanted to use 64K pages... But I agree 64K pages is too big for most things anyway, and 16K would be better as a default (which hopefully x86-64 will get one day). Anyway, for io performance, there are alternatives, despite what some people seem to be saying. We can submit larger sglists to the device for larger ios, which Jens is looking at (which could help all types of workloads, not just those with sequential large file IO). After that, I'd find it amusing if HBAs worth thousands of $ have trouble looking up sglists at the relatively glacial pace that IO requires, and/or can't spare a few more K for reasonable sglist sizes, but if that is really the case, then we could use iommus and/or just attempt to put physically contiguous pages in pagecache, rather than require it. -- SUSE Labs, Novell Inc. -
Oh? Assuming your hardware is capable of supporting a variety of page sizes, and of putting a page at any address that is a multiple of its size, it should help, potentially a great deal, as far as I can see. I'm thinking in particular of machines that have software-loaded fully-associative TLBs and support a lot of page sizes, e.g. 4kB * 4^n for n = 0 up to 8 or so, like some embedded powerpc chips. It's not as simple on 64-bit powerpc with the hash table of course, because the page size is chosen at the segment (256MB) level,

Even 16k is going to bloat the page cache, and some people will complain. One way that x86-64 could do 16k pages is by still indexing the PTE page in units of 4k, but then have an indicator in the PTE that this is a 16k page. Thus a 16k page would occupy 4 consecutive PTEs, but once it was loaded into the TLB, a single TLB entry would map the whole 16k. That would give the expanded TLB reach and allow 4k and 16k pages to be intermixed freely. Paul. -
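The 16k-page idea in the message above can be sketched as a toy model. This is only an illustration of the scheme as described, not real x86-64 behaviour: the flag bits `PTE_PRESENT` and `PTE_16K`, the flat single-level table, and the helper names are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for this toy model, not real x86 PTE bits. */
#define PTE_PRESENT 0x1u
#define PTE_16K     0x2u   /* "this slot belongs to a 16k page" indicator */
#define SMALL_SHIFT 12     /* 4k */
#define LARGE_SHIFT 14     /* 16k */

/* A toy PTE: physical base address with flags in the low bits. */
typedef uint64_t pte_t;

/* Map one ordinary 4k page: a single slot in the 4k-indexed table. */
void map_4k(pte_t *table, uint64_t va, uint64_t pa)
{
    assert((va & 0xfff) == 0 && (pa & 0xfff) == 0);
    table[va >> SMALL_SHIFT] = pa | PTE_PRESENT;
}

/* Map one 16k page: it occupies four consecutive 4k-indexed slots. */
void map_16k(pte_t *table, uint64_t va, uint64_t pa)
{
    assert((va & 0x3fff) == 0 && (pa & 0x3fff) == 0);
    uint64_t first = va >> SMALL_SHIFT;
    for (int i = 0; i < 4; i++)
        table[first + i] = pa | PTE_PRESENT | PTE_16K;
}

/*
 * Translate va. *reach receives the span that one TLB entry loaded
 * from this PTE would cover: 4k normally, 16k if the indicator is set.
 */
uint64_t translate(const pte_t *table, uint64_t va, uint64_t *reach)
{
    pte_t pte = table[va >> SMALL_SHIFT];
    assert(pte & PTE_PRESENT);
    unsigned shift = (pte & PTE_16K) ? LARGE_SHIFT : SMALL_SHIFT;
    *reach = 1ull << shift;
    uint64_t base = pte & ~(uint64_t)0xfff;   /* strip the flag bits */
    return base + (va & (*reach - 1));
}
```

Because the table is still indexed in 4k units, a 4k mapping and a 16k mapping can sit next to each other in the same table, which is the "intermixed freely" property the message points out.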
I think Christoph's variable order pagecache should be perfectly fine on ppc64. We're selecting the pagesize on a per-file basis, and the page size selection would choose which segment this mmap gets into. Ben's get_unmapped_area changes are very helpful for that. -
That's a little bit more than just the higher order pagecache patch. But I don't know if that would be impossible to do with the "attempt to allocate contiguous pagecache" approach either. Or if it would be I guess any page size bloats the pagecache relative to something smaller :) But 4K doesn't seem to be proving too much problem for x86 and I'm not talking about an actual implementation coming up, but just a size that would make sense in future (and probably last for a long time). -- SUSE Labs, Novell Inc. -
Powerpc supports multiple pagesizes. Maybe we could make mmap use those page sizes some day if we had a variable order page cache. Your stance on the issue means that powerpc will be forever crippled and not be able to

Right, this could help but it is not addressing the basic requirement for devices that need large contiguous chunks of memory for I/O. -
You can already mmap using different pagesize on ppc64. A certain infinibad adapter with interesting design choices requires 4k mappings even on 64k kernels, and for some spu-related mappings on Cell we want the same. I'm no expert on the powerpc mmu, but I'd be surprised if 64k pagecache mappings wouldn't work on 4k base page size kernels. -
Linus's favourite jokes about powerpc mmu being crippled forever, aside ;) This seems like just speculation. I would not be against something which, without, would "cripple" some relevant hardware, but you are just handwaving

Did you read the last paragraph? Or anything Andrew's been writing? "After that, I'd find it amusing if HBAs worth thousands of $ have trouble looking up sglists at the relatively glacial pace that IO requires, and/or can't spare a few more K for reasonable sglist sizes, but if that is really the case, then we could use iommus and/or just attempt to put physically contiguous pages in pagecache, rather than require it." -- SUSE Labs, Novell Inc. -
This crippling also applies to IA64. The crippling is a concern of multiple arches and its effects have been well documented. Talk to Peter Chubb for example. The requirement is to be able to do I/O to large contiguous sections of memory. Your proposals are ignoring the requirements. -
Different mmu. The desktop 32bit mmu Linus referred to has almost nothing

Real highend HBAs don't have that problem. But for example aacraid, which is very common on mid-end servers, is a _lot_ faster when it gets contiguous memory. Some benchmark was 10 or more percent faster on Windows due to this. -
Well I wasn't trying to make a point there so it isn't a big deal... but he has been known to say the 64-bit hash table is insane or broken. If he's

And that wasn't due to the 128 sg limit? I guess 10% isn't a small amount. Though it would be nice to have before/after numbers for Linux. And, like Andrew was saying, we could just _attempt_ to put contiguous pages in pagecache rather than _require_ it. Which is still robust under fragmentation, and benefits everyone, not just files with a large pagecache size. -- SUSE Labs, Novell Inc. -
No, that was due to aacraid really liking sg lists as small as possible where every entry covers areas as big as possible. The driver really liked physical merging once wli changed the page allocator to return

I'll try to find the old thread. -
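The link between physical contiguity and sg list length discussed above can be shown with a small sketch. It assumes the simplest possible merging rule (adjacent page frame numbers collapse into one entry) and ignores the per-segment size and boundary limits a real block layer or driver would apply; the function name `sg_entries` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the scatter-gather entries needed to describe n pages, given
 * their page frame numbers in I/O order. A page that physically
 * follows the previous one extends the current entry; any gap starts
 * a new entry. Simplification: no maximum segment size is enforced.
 */
size_t sg_entries(const unsigned long *pfn, size_t n)
{
    if (n == 0)
        return 0;

    size_t entries = 1;
    for (size_t i = 1; i < n; i++) {
        if (pfn[i] != pfn[i - 1] + 1)
            entries++;   /* physical gap: start a new entry */
    }
    return entries;
}
```

Four physically consecutive pages need one entry, while four scattered pages need four. Under this simplified model that is why an adapter limited to a fixed number of entries moves more data per request when the pagecache pages happen to be contiguous.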
What sort of strategy do you intend to use to speculatively populate the pagecache with contiguous pages? -- wli -
Andrew outlined it. -- SUSE Labs, Novell Inc. -
I'd like to suggest a few straightforward additions to the proposal:

(1) the interface to the page allocator tries to allocate N pages where
    (a) N is a power of 2
    (b) some effort is made to get contiguity
    (c) some effort is made to fall back to lesser contiguity
    (d) some effort is made to get N pages even with no contiguity
(2) a corresponding group freeing interface to the page allocator
(3) Pass the pages around in a list or similar so that O(1) instead of O(pages) splice operations under the lock suffice for passing them around.

Dissecting compound pages outside locks helps. -- wli -
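Points (1) and (2) of the suggestion above can be sketched in userspace over a toy free map. This is only an illustration of the fallback order, not kernel code: `frame_used`, `take_run`, `alloc_group` and `free_group` are hypothetical names, there is no locking, and the list-splicing of point (3) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 64
static bool frame_used[NFRAMES];   /* toy free map, one flag per frame */

/* Claim a naturally aligned free run of len frames; -1 if none exists. */
static long take_run(size_t len)
{
    for (size_t start = 0; start + len <= NFRAMES; start += len) {
        size_t i = 0;
        while (i < len && !frame_used[start + i])
            i++;
        if (i == len) {
            for (i = 0; i < len; i++)
                frame_used[start + i] = true;
            return (long)start;
        }
    }
    return -1;
}

/*
 * Try to get n frames (n a power of 2) into out[]: first as a single
 * contiguous run, then as progressively shorter runs, and finally as
 * single frames with no contiguity at all. Returns the number of
 * frames obtained; on a short count the caller should free_group()
 * what it got.
 */
size_t alloc_group(size_t n, unsigned long *out)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    size_t got = 0;
    size_t len = n;

    while (len >= 1 && got < n) {
        long start = (got + len <= n) ? take_run(len) : -1;
        if (start < 0) {
            len /= 2;            /* fall back to lesser contiguity */
            continue;
        }
        for (size_t i = 0; i < len; i++)
            out[got++] = (unsigned long)start + i;
    }
    return got;
}

/* Group free: hand back every frame obtained from alloc_group(). */
void free_group(const unsigned long *pfn, size_t count)
{
    for (size_t i = 0; i < count; i++)
        frame_used[pfn[i]] = false;
}
```

On an empty map a request for 4 frames comes back as one run; with every odd frame in use the same request still succeeds, but as four separate frames, which is the "attempt rather than require" behaviour argued for earlier in the thread.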
Ahh. I think I know what you mean. The current patchset is for performance testing against mainline. Let's first cover the bases and then see where we go. It is not against mm. I will submit pieces to mm depending on the outcome of our discussions. -
Thanks. There's a ludicrous amount of MM work pending in -mm. It would probably be less work at your end to see what ends up landing in 2.6.22-rc1. <wanders off to do his mm-merge-plans email> -
I am aware of that and that's why I kept this against upstream. The need right now is for justification and explanation. I had to go through a head-spinning series of VM layers to get an idea how to do this in a clean way and then had to make additional passes to do minimal modifications to get this working so that it is testable. Performance tests please... -
OK. Don't get me wrong - I do think this is neat code and is a good way of addressing the problem. (I'm surprised that the mmap protopatch didn't touch rmap.c). But I don't think it's a slam dunk and I would like you to appreciate the constraints which I believe we operate under. And I don't think we've

On various HBAs, please ;) -
If you can find them....

No, it's a fact. The patchset really allows one to switch large page

The page cache is different from pte mapping. One page struct controls them all. Look at the patches. There is no state information in the tail

The page cache functions require a mapping parameter. This is available in most places and a natural thing given that allocation etc is also bound

No, they are 16k if the filesystem wants them to be 16k. The filesystem does not need to have the data mapped into an address space. And there is

No, this code will enable us to switch to this new page size in a very fast way. Because the pagecache already supports it it is easier to add the

The manufacturers on x86 are already supporting 2M page sizes and cannot support intermediate sizes since they are married to the page table format for performance reasons. The patch could f.e. lead to

And I am sitting here in disbelief about the series of weird alternatives running over my screen just to avoid the obvious solution. Then there is this weird idea that this would hinder us from supporting additional page sizes for mmap while the patch does exactly lead to enable support such

Where? The page cache handling in the various layers is significantly

One? Spin.... The majority you mean?

Dave, where are we with the performance tests? -
How on earth can the *addition* of variable pagecache size simplify the existing code? What cleanups are in this patchset which cannot be made *without* the

Well yes. Do note that if the numbers are good, we also need to look at how generally useful this work is. For example, if it only benefits one particular arguably-crippled present-generation adapter then that of course weakens the case. -
