On Mon, Jul 09, 2007 at 09:20:31AM +1000, David Chinner wrote:

I didn't misunderstand. But the reason you can't use a blocksize larger than 4k is that the PAGE_SIZE is 4k, and CONFIG_PAGE_SHIFT raises the PAGE_SIZE to 8k or more, so you can then enlarge the filesystem blocksize too.

Of course, for I/O performance the CPU cost is mostly irrelevant, especially with slow storage.

Yes, I'm aware of this and my patch allows it too in the same way, but the fundamental difference is that it should help your I/O layout optimizations with a larger blocksize, while at the same time making the _whole_ kernel faster. And it won't even waste more pagecache than a variable order page size would (both CONFIG_PAGE_SHIFT and variable order page size will waste some pagecache compared to a 4k page size). So they had better be used for workloads manipulating large files.

That should be possible the same way with both designs.

Sorry.

Totally agreed, your approach would be much better for dvd on the desktop. If only I could trust it to be reliable (I guess I'd rather stick to growisofs). But for your _own_ usage, the big box with lots of ram where a blocksize of 4k is a blocker, my design should be much better because it'll give you many more advantages on the CPU side too (the only downside is the higher complexity in the pte manipulations).

Think about it: even if you ended up mounting xfs with a 64k blocksize on a kernel with a 16k PAGE_SIZE, that's still going to be a smaller fragmentation risk than using a 64k blocksize on a kernel with a 4k PAGE_SIZE. The risk of defrag failing because alloc_page() returns an undefragmentable 4k page is much higher than if the undefragmentable alloc_page() returns a 16k page. The CPU cost of defrag itself will be diminished by a factor of 4 too.

The equivalent waste will happen on disk if you raise the blocksize to 64k. The same waste will happen as well if you mounted the filesystem with the cache kernel tree using a variable order page size of 64k.

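To put numbers on the fragmentation and waste arguments above, here is a minimal userspace sketch in C. It is not kernel code: the helper names (`pages_per_block`, `alloc_order`, `tail_waste`) are made up for illustration, and it only does the arithmetic, assuming power-of-two page and block sizes.

```c
#include <assert.h>

/* Hypothetical helper: how many base pages back one filesystem block. */
unsigned long pages_per_block(unsigned long block_size, unsigned long page_size)
{
    assert(page_size != 0);
    return (block_size + page_size - 1) / page_size;
}

/*
 * Hypothetical helper: smallest allocation order such that
 * (page_size << order) covers block_size. A higher order means more
 * physically contiguous base pages must be found (or defragmented).
 */
unsigned int alloc_order(unsigned long block_size, unsigned long page_size)
{
    unsigned int order = 0;

    assert(page_size != 0);
    while ((page_size << order) < block_size)
        order++;
    return order;
}

/*
 * Hypothetical helper: bytes left unused in the last cache unit when a
 * file of file_size bytes is cached in units of unit_size bytes. The
 * unit may be a boosted PAGE_SIZE or a variable order page; the
 * arithmetic is the same in both cases.
 */
unsigned long tail_waste(unsigned long file_size, unsigned long unit_size)
{
    unsigned long rem;

    assert(unit_size != 0);
    rem = file_size % unit_size;
    return rem ? unit_size - rem : 0;
}
```

With these definitions, a 64k block needs `pages_per_block(65536, 4096) == 16` contiguous base pages (order 4) on a 4k PAGE_SIZE, but only `pages_per_block(65536, 16384) == 4` (order 2) on a 16k PAGE_SIZE, which is where the factor of 4 comes from. For a 5000-byte file, `tail_waste(5000, 4096)` is 3192 bytes while `tail_waste(5000, 65536)` is 60536 bytes, whichever mechanism provides the 64k unit.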
I guess for maximizing cache usage during kernel development the ideal PAGE_SIZE would be smaller than 4k...

You guys need to explain to me how you solved the defrag issue if you can't defrag the return value of alloc_page(GFP_KERNEL) = 4k. Furthermore, you never seem to account for the CPU cost of defrag on big systems that may need to memcpy a lot of ram. My design doesn't need proofs, it never requires memcpy, and it'll just always run as fast as right after boot. Boosting the PAGE_SIZE is more of a black and white and predictable thing, so I've no doubt I prefer it.

BTW, I asked Hugh to look into Bill's and Hugh's old patch to see if there's some goodness we can copy to solve things like the underlying overlapping anon page after write-protect faults over MAP_PRIVATE. Perhaps there's a better way than looking at the nearby ptes for a pte pointing to PG_anon or a swap entry, which is my current idea. This is assuming their old patches were really using a design similar to mine (btw, back then there was no PG_anon, but I guess checking page->mapping for null would have been enough to tell it was an anon page).

Hugh also reminded me that at KS some years ago their old patch boosting the PAGE_SIZE was dismissed because it looked unnecessary; the major reason for wanting it back then was the mem_map_t array size, and that's not an issue anymore on 64bit archs. But back then, nobody proposed to boost the pagecache to order > 0 allocations, so this is one reason why _now_ it's different. It's really your variable order page size, and the defrag efforts that can't mathematically guarantee defrag, that triggered my interest in CONFIG_PAGE_SHIFT.
