This patchset modifies the Linux kernel so that larger block sizes than page size can be supported. Larger block sizes are handled by using compound pages of an arbitrary order for the page cache instead of single pages with order 0.

- Support is added in a way that limits the changes to existing code. As a result filesystems can support larger page I/O with minimal changes.

- The page cache functions are mostly unchanged. Instead of a page struct representing a single page they take a head page struct (which looks the same as a regular page struct apart from the compound flags) and operate on those. Most page cache functions can stay as they are.

- No locking protocols are added or modified.

- The support is also fully transparent at the level of the OS. No specialized heuristics are added to switch to larger pages. Large page support is enabled by filesystems or device drivers when a device or volume is mounted. Larger block sizes are usually set during volume creation although the patchset supports setting these sizes per file. The formatted partition will then always be accessed with the configured blocksize.

- Large blocks also do not mean that the 4k mmap semantics need to be abandoned. The included mmap support will happily map 4k chunks of large blocks so that user space sees no changes.

Some of the changes are:

- Replace the use of PAGE_CACHE_XXX constants to calculate offsets into pages with functions that do the same and allow the constants to be parameterized.

- Extend the capabilities of compound pages so that they can be put onto the LRU and reclaimed.

- Allow setting a larger blocksize via set_blocksize()

Rationales:
-----------

1. The ability to handle memory of an arbitrarily large size using a single page struct "handle" is essential for scaling memory handling and reducing overhead in multiple kernel subsystems. This patchset is a strategic move that allows performance gains throughout the kern...
There is a limitation in the VM. Fragmentation. You keep saying this is a solved issue and just assuming you'll be able to fix any cases that come up as they happen. I still don't get the feeling you realise that there is a fundamental fragmentation issue that is unsolvable with Mel's approach. The idea that there even _is_ a bug to fail when higher order pages cannot be allocated was also brushed aside by some people at the vm/fs summit. I don't know if those people had gone through the math about this, but it goes somewhat like this: if you use a 64K page size, you can "run out of memory" with 93% of your pages free. If you use a 2MB page size, you can fail with 99.8% of your pages still free. That's 64GB of memory used on a 32TB Altix. If you don't consider that is a problem because you don't care about theoretical issues or nobody has reported it from running -mm kernels, then I simply can't argue against that on a technical basis. But I'm totally against introducing known big fundamental problems to the VM at this stage of the kernel. God knows how long it takes to ever fix them in future after they have become pervasive throughout the kernel. IMO the only thing that higher order pagecache is good for is a quick hack for filesystems to support larger block sizes. And after seeing it is fairly ugly to support mmap, I'm not even really happy for it to do that. If VM scalability is a problem, then it needs to be addressed in other areas anyway for order-0 pages, and if contiguous pages helps IO scalability or crappy hardware, then there is nothing stopping us from *attempting* to get contiguous memory in the current scheme. Basically, if you're placing your hopes for VM and IO scalability on this, then I think that's a totally broken thing to do and will end up making the kernel worse in the years to come (except maybe on some poor configurations of bad hardware). -
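The percentages quoted above can be checked with a few lines of arithmetic. The sketch below assumes a 4K base page (not stated in the message, but the figures only work out that way) and the worst case of exactly one pinned base page in every large-page-sized block; the function names are made up for illustration.

```python
KiB, MiB, GiB, TiB = 2**10, 2**20, 2**30, 2**40
BASE_PAGE = 4 * KiB  # assumption: 4K base pages


def worst_case_free_fraction(large_page_bytes, base_page_bytes=BASE_PAGE):
    """Largest fraction of memory that can be free while no large page
    can be allocated: one pinned base page in every aligned block."""
    pages_per_block = large_page_bytes // base_page_bytes
    return (pages_per_block - 1) / pages_per_block


def pinned_bytes(total_bytes, large_page_bytes, base_page_bytes=BASE_PAGE):
    """Memory actually in use in that worst case."""
    return (total_bytes // large_page_bytes) * base_page_bytes


for size in (64 * KiB, 2 * MiB):
    pct = 100 * worst_case_free_fraction(size)
    print(f"{size // KiB:5d}K large pages: fail with {pct:.2f}% free")

print(pinned_bytes(32 * TiB, 2 * MiB) // GiB, "GiB in use on a 32 TiB machine")
```

This gives 15/16 (about 93%) for 64K, 511/512 (about 99.8%) for 2MB, and 64GB in use on 32TB. It says nothing about how likely such a layout is, only that it is possible.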
Well my problem first of all is that you did not read the full message. It discusses that later and provides page pools to address the issue. Secondly you keep FUDding people with lots of theoretical concerns assuming Mel's approaches must fail. If there is an issue (I guess there must be right?) then please give us a concrete case of a failure that we Allocations can currently fail and all code has the requirement to handle failure cases in one form or another. Currently we can only handle up to order 3 allocs it seems. 2M pages (and in particular pagesizes > MAX_ORDER) will have to be handled by a separate large page pool facility discussed in the earlier message. -
And BTW, before you accuse me of FUD, I'm actually talking about the fragmentation issues on which Mel I think mostly agrees with me at this point. Also have you really a rational reason why we should just up and accept all these big changes happening just because that, while there are lots of theoretical issues, the person pointing them out to you hasn't happened to give you a concrete failure case. Oh, and the actual performance benefit is actually not really even quantified yet, crappy hardware not withstanding, and neither has a proper evaluation of the alternatives. So... would you drive over a bridge if the engineer had this mindset? -
I'm half way between you two on this one. I agree with Christoph in that it's currently very difficult to trigger a failure scenario and today we don't have a way of dealing with it. I agree with Nick in that conceivably a failure scenario does exist somewhere and the careful person (or paranoid if you prefer) would deal with it pre-emptively. The fact is that no one knows what a large block workload is going to look like to the allocator so we're all hand-waving. Right now, I can't trigger the worst failure scenarios that cannot be dealt with for fragmentation but that might change with large blocks. The worst situation I can think of is a process that continuously dirties large amounts of data on a large block filesystem while another set of processes works with large amounts of anonymous data without any swap space configured with slub_min_order set somewhere between order-0 and the large block size. Fragmentation wise, that's just a kick in the pants and might produce the failure scenario being looked for. If it does fail, I don't think it should be used to beat Christoph with as such because it was meant to be a #2 solution. What hits it is if the mmap() Performance figures would be nice. dbench is flaky as hell but can comparison figures be generated on one filesystem with 4K blocks and one with 64K? I guess we can do it ourselves too because this should work on If I had this bus that couldn't go below 50MPH, right...... never mind. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
On the other hand, you ignore the potential failure cases, and ignore the alternatives that do not have such cases. -
While I agree with your concern, those numbers are quite silly. The chances of 99.8% of pages being free and the remaining 0.2% being perfectly spread across all 2MB large_pages are lower than those of SHA1 creating a collision. I don't see anyone abandoning git or rsync, so your extreme example clearly is the wrong one. Again, I agree with your concern, even though your example makes it look silly. Jörn -- You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is. -- Rob Pike -
You may want to consider Mel's antifrag approaches which certainly decrease the chance of this occurring. Reclaim can open up the needed linear memory hole in an intentional way. The memory compaction approach can even move pages to open up these 2M holes. The more pages we make movable (see f.e. the targeted slab reclaim patchset that makes slab pages movable) the more reliable higher order allocations become.
I absolutely agree with your slab reclaim patchset. No argument here. What I'm starting to wonder about is where your approach has advantages over Andrea's. The chances of triggering something vaguely similar to Nick's worst case scenario are certainly higher for your solution. So unless there are other upsides it is just the second-best solution. Jörn -- Everything should be made as simple as possible, but not simpler. -- Albert Einstein -
Nick's worst case scenario is already at least partially addressed by the Lumpy Reclaim code in 2.6.23. His examples assume 2.6.22 or earlier code. The advantage of this approach over Andrea's is basically that the 4k filesystems still can be used as is. 4k is useful for binaries and for text processing like used for compiles. Large Page sizes are useful for file systems that contain large datasets (scientific data, multimedia stuff, databases) for applications that depend on high I/O throughput.
If you mean that with my approach you can't use a 4k filesystem as is, that's not correct. I even run the (admittedly premature but promising) benchmarks on my patch on a 4k blocksized filesystem... Guess what, you can even still mount a 1k fs on a 2.6 kernel. The main advantage I can see in your patch is that distributions won't need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be slower). -
Right you can use a 4k filesystem. The 4k blocks are buffers in a larger I would think that your approach would be slower since you always have to populate 1 << N ptes when mmapping a file? Plus there is a lot of wastage of memory because even a file with one character needs an order N page? So there are fewer pages available for the same workload. Then you are breaking mmap assumptions of applications because the order N kernel will no longer be able to map 4k pages. You likely need a new binary format that has pages correctly aligned. I know that we would need one on IA64 if we go beyond the established page sizes. -
I don't have to populate them, I could just map one at a time. The only reason I want to populate every possible pte that could map that page (by checking vma ranges) is to _improve_ performance by decreasing the number of page faults by an order of magnitude. Then with the 62nd bit after NX giving me a 64k tlb, I could decrease the frequency of the This is a known issue. The same is true for ppc64 64k. If that really No you misunderstood the whole design. My patch will be 100% backwards compatible in all respects. If I could break backwards compatibility 70% of the complexity would go away... -
Actually it'd be pretty easy to craft an application which allocates seven pages for pagecache, then one for <something>, then seven for pagecache, then one for <something>, etc. I've had test apps which do that sort of thing accidentally. The result wasn't pretty. -
Except that the application's 7 pages are movable and the <something>
would have to be unmovable. And then they should not share the same
memory region. At least they should never be allowed to interleave in
such a pattern on a larger scale.
The only way a fragmentation catastrophe can be (provably) avoided is
by having so few unmovable objects that size + max waste << ram
size. The smaller the better. Allowing movable and unmovable objects
to mix means that max waste goes way up. In your example waste would
be 7*size. With 2MB upper order limit it would be 511*size.
I keep coming back to the fact that movable objects should be moved
out of the way for unmovable ones. Anything else just allows
fragmentation to build up.
MfG
Goswin
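The "size + max waste << ram size" bound above can be written down directly. This is only a sketch of that inequality as I read it: each unmovable page is assumed to pin one whole block of `block_pages` base pages, and both function names are hypothetical.

```python
def max_waste(unmovable_bytes, block_pages):
    """Worst case: every unmovable page sits alone in its own block, so
    the other (block_pages - 1) pages of that block are unusable for a
    block-sized allocation."""
    return (block_pages - 1) * unmovable_bytes


def worst_case_footprint(unmovable_bytes, block_pages):
    """size + max waste, the quantity that has to stay well below RAM."""
    return unmovable_bytes + max_waste(unmovable_bytes, block_pages)


MiB = 2**20
size = 2 * MiB  # hypothetical amount of unmovable data

print(max_waste(size, 8) // MiB, "MiB wasted with 8-page blocks (7*size)")
print(max_waste(size, 512) // MiB, "MiB wasted with 2MB blocks (511*size)")
print(worst_case_footprint(size, 512) // MiB, "MiB footprint with 2MB blocks")
```

With 4K pages and 2MB blocks the footprint is 512 times the unmovable size, so on a 1GB machine as little as 2MB of badly placed unmovable data is enough to touch every block.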
-It is actually really easy to force regions to never share. At the moment, there is a fallback list that determines a preference for what block to mix. The reason why this isn't enforced is the cost of moving. On x86 and x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of those pages to prevent any mixing would be bad enough. On PowerPC, it's potentially 16MB. On IA64, it's 1GB. As this was fragmentation avoidance, not guarantees, the decision was made to not strictly enforce the types of pages within a block as the cost cannot be made back unless the system was making aggressive use of This is easily achieved, just really really expensive because of the amount of copying that would have to take place. It would also compel that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is a lot of free memory to keep around which is why fragmentation avoidance doesn't do it. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
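To put numbers on "a lot of free memory", the sketch below computes the min_free_kbytes that one block, and MIGRATE_TYPES blocks, would require for the block sizes mentioned in the message. The value of 4 for MIGRATE_TYPES is an assumption for illustration (it depends on the kernel version), and the helper name is made up.

```python
KiB, MiB, GiB = 2**10, 2**20, 2**30

MIGRATE_TYPES = 4  # assumption for illustration; varies by kernel version

# Block sizes as quoted in the message above.
BLOCK_BYTES = {
    "x86/x86_64 (2MB)": 2 * MiB,
    "x86 (4MB)": 4 * MiB,
    "PowerPC (16MB)": 16 * MiB,
    "IA64 (1GB)": 1 * GiB,
}


def min_free_kbytes_needed(block_bytes, nr_blocks=1):
    """kB that must stay free to keep nr_blocks whole blocks available."""
    return nr_blocks * block_bytes // KiB


for arch, block in BLOCK_BYTES.items():
    one = min_free_kbytes_needed(block)
    per_type = min_free_kbytes_needed(block, MIGRATE_TYPES)
    print(f"{arch:18s} one block: {one:8d} kB   per type: {per_type:8d} kB")
```

Under these assumptions a 2MB block needs 2048 kB (8192 kB for one block per type), while a 1GB block on IA64 needs 1048576 kB, or 4GB for one block per type.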
I don't say the group should never be mixed. The movable objects could
be moved out on demand. If 64k get allocated then up to 64k get
moved. That would reduce the impact as the kernel does not hang while
it moves 2MB or even 1GB. It also allows objects to be freed and the
space reused in the unmovable and mixed groups. There could also be a
certain number or percentage of mixed groups allowed to further
increase the chance of movable objects freeing themselves from mixed
groups.
But when you already have say 10% of the ram in mixed groups then it
is a sign the external fragmentation happens and some time should be
In your sample graphics you had 1152 groups. Reserving a few of those
doesn't sound too bad. And how many migrate types do we talk about? So
far we only had movable and unmovable. I would split unmovable into
short term (caches, I/O pages) and long term (task structures,
dentries). Reserving 6 groups for short term unmovable and long term
unmovable would be 1% of ram in your situation.
Maybe instead of reserving one could say that you can have up to 6
groups of space not used by unmovable objects before aggressive moving
starts. I don't quite see why you NEED reserving as long as there is
enough space free altogether in case something needs moving. 1 group
worth of space free might be plenty to move stuff to. Note that all
the virtual pages can be stuffed in every little free space there is
and reassembled by the MMU. There is no space lost there.
But until one tries one can't say.
MfG
Goswin
PS: How do allocations pick groups? Could one use the oldest group
dedicated to each MIGRATE_TYPE? Or lowest address for unmovable and
highest address for movable? Something to better keep the two out of
each other's way.
-This type of action makes sense in the context of Andrea's approach and large blocks. I don't think it makes sense today to do it in the general I'll play around with it on the side and see what sort of results I get. I won't be pushing anything any time soon in relation to this though. For now, I don't intend to fiddle more with grouping pages by mobility for something that may or may not be of benefit to a feature that hasn't No, which on those systems, I would suggest setting min_free_kbytes to a Movable, unmovable, reclaimable and reserve in the current incarnation Mostly done as you suggest already. Dentry are considered reclaimable, not More groups = more cost although very easy to add them. A mixed type used to exist but was removed again because it couldn't be proved to be And if the groups are 1GB in size? I tried something like this already. What you suggest sounds similar to having a type MIGRATE_MIXED where you allocate from when the preferred lists are full. It became a sizing We bias the location of unmovable and reclaimable allocations already. It's not done for movable because it wasn't necessary (as they are easily reclaimed or moved anyway). -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
I watched the videos you posted. A nice and quite clear improvement
with and without your logic. Kudos.
When you play around with it may I suggest a change to the display of
the memory information. I think it would be valuable to use a Hilbert
Curve to arrange the pages into pixels. Like this:
# #      0 3
# #
###      1 2

### ###      0 1 E F
  # #
### ###      3 2 D C
#     #
# ### #      4 7 8 B
# # # #
### ###      5 6 9 A

                    +-----------+-----------+
# ##### ##### #     |00 03 04 05|3A 3B 3C 3F|
# #   # #   # #     |           |           |
### ### ### ###     |01 02 07 06|39 38 3D 3E|
    #     #         |           |           |
### ### ### ###     |0E 0D 08 09|36 37 32 31|
# #   # #   # #     |           |           |
# ##### ##### #     |0F 0C 0B 0A|35 34 33 30|
#             #     +-----+-----+           |
### ####### ###     |10 11|1E 1F|20 21 2E 2F|
  # #     # #       |     |     |           |
### ### ### ###     |13 12|1D 1C|23 22 2D 2C|
#     # #     #     |     +-----+           |
# ### # # ### #     |14 17|18 1B|24 27 28 2B|
# # # # # # # #     |     |     |           |
### ### ### ###     |15 16|19 1A|25 26 29 2A|
                    +-----+-----+-----------+
I've drawn in allocations for 16, 8, 4, 5, 32 pages in that order in
the last one. The idea is to get near pages visually near in the
output and into an area instead of lines. Easier on the eye. It also
manages to always draw aligned order(x) blocks as squares or rectangles
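For anyone who wants to try this display, the mapping from a page index to a pixel can be computed with the usual iterative Hilbert index-to-coordinate conversion. The sketch below is not taken from any existing tool, and the function names are made up. With y growing downwards it reproduces the numbering of the 4x4 grid shown above.

```python
def hilbert_d2xy(order, d):
    """Map index d (0 .. 4**order - 1) to (x, y) on a 2**order x 2**order
    grid, x growing to the right and y growing downwards."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the sub-square built so far so the curve stays connected.
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_grid(order):
    """Return the grid of indices, one list per row, as in the diagrams."""
    n = 1 << order
    grid = [[0] * n for _ in range(n)]
    for d in range(n * n):
        x, y = hilbert_d2xy(order, d)
        grid[y][x] = d
    return grid


for row in hilbert_grid(2):
    print(" ".join(f"{v:X}" for v in row))
```

Because every aligned run of 4**k consecutive indices fills one 2**k x 2**k square, an aligned order-2k block of pages always shows up as a square, and an odd order as a 2:1 rectangle.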
You adjust group size with the number of groups total. You would not
use 1GB Huge Pages on a 2GB ram system. You could try 2MB groups. I
think for most current systems we are lucky there. 2MB groups fit
hardware support and give a large but not too large number of groups
to work with.
But you only need to stick to hardware suitable group sizes for huge
tlb support right? For better I/O and such you could have 512Kb groups
Which is different from reserving a full group as it does not count
Not really. I'm saying we should actively defragment mixed groups
during allocation and always a...
Here's an excellent example of a 0-255 numbered Hilbert curve used to enumerate the various top-level allocations of IPv4 space: http://xkcd.com/195/ Cheers, Kyle Moffett -
I don't know how it would prevent fragmentation from building up anyway. It's commonly the case that potentially unmovable objects are allowed to fill up all of ram (dentries, inodes, etc). And of course, if you craft your exploit nicely with help from higher ordered unmovable memory (eg. mm structs or unix sockets), then you don't even need to fill all memory with unmovables before you can have them take over all groups. -
Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from ZONE_MOVABLE and thus the memory that can be allocated for them is limited. -
Why would ZONE_MOVABLE require that "movable objects should be moved out of the way for unmovable ones"? It never _has_ any unmovable objects in it. Quite obviously we were not talking about reserve zones. -
This was a response to your statement all of memory could be filled up by unmovable objects. Which cannot occur if the memory for unmovable objects is limited. Not sure what you mean by reserves? Mel's reserves? The reserves for unmovable objects established by ZONE_MOVABLE? -
As Nick points out, having to configure something makes it a #2 solution. However, I at least am ok with that. ZONE_MOVABLE is a get-out clause to be able to control fragmentation no matter what the workload is as it gives hard guarantees. Even when ZONE_MOVABLE is replaced by some mechanism in grouping pages by mobility to force a number of blocks to be MIGRATE_MOVABLE_ONLY, the emergency option will exist, We still lack data on what sort of workloads really benefit from large blocks (assuming there are any that cannot also be solved by improving order-0). With Christoph's approach + grouping pages by mobility + ZONE_MOVABLE-if-it-screws-up, people can start collecting that data over the course of the next few months while we're waiting for fsblock or software pagesize to mature. Do we really need to keep discussing this as no new point has been made in a while? Can we at least take out the non-contentious parts of Christoph's patches such as the page cache macros and do something with them? -- Mel "tired of typing" Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
No we don't. All workloads benefit from larger block sizes when you've got a btree tracking 20 million inodes and a create has to search that tree for a free inode. The tree gets much wider and hence we take fewer disk seeks to traverse the tree. Same for large directories, btrees tracking free space, etc - everything goes faster with a larger filesystem block size because we spend less time doing metadata I/O. And the other advantage is that sequential I/O speeds also tend to increase with larger block sizes. e.g. XFS on an Altix (16k pages) using 16k block size is about 20-25% faster on writes than 4k block size. See the graphs at the top of page 12: http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf The benefits are really about scalability and with terabyte sized disks on the market..... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
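A rough way to see the "wider tree, fewer seeks" effect is to count btree levels for a given block size. The sketch below is a toy model, not XFS's on-disk format: the 16-byte key-plus-pointer entry size is a made-up assumption, every block is assumed full, and `btree_levels` is a hypothetical helper.

```python
def btree_levels(n_records, block_size, entry_size=16):
    """Levels needed to index n_records when each block holds
    block_size // entry_size entries (entry_size is an assumption)."""
    fanout = block_size // entry_size
    levels, capacity = 1, fanout
    while capacity < n_records:
        capacity *= fanout
        levels += 1
    return levels


N_INODES = 20_000_000  # figure taken from the message above

for block_size in (4096, 16384, 65536):
    levels = btree_levels(N_INODES, block_size)
    print(f"{block_size // 1024:2d}k blocks: {levels} levels")
```

Under these assumptions a 4k block size needs four levels for 20 million records while 16k and 64k need three, so each uncached lookup touches one block fewer.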
Compressing filesystems like jffs2 and logfs gain better compression ratio with larger blocks. Going from 4KiB to 64KiB gave somewhere around 10% benefit iirc. Testdata was a 128MiB qemu root filesystem. Granted, the same could be achieved by adding some extra code and a few bounce buffers to the filesystem. How such a hack would perform I'd prefer not to find out, though. :) Jörn -- Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. -- Doug MacIlroy -
That's incidentally exactly what the slab does, no need to reinvent the wheel for that, it's an old problem and there's room for optimization in the slab partial-reuse logic too. Just boost the order 0 page size and use the slab to get the 4k chunks. The sgi/defrag design is backwards. -
How does that help? Will slabs move objects around to combine two
partially filled slabs into a nearly full one? If not consider this:
- You create a slab for 4k objects based on 64k compound pages.
(first of all that wastes you a page already for the meta infos)
- Something movable allocates 14 4k pages in there making the slab
partially filled.
- Something unmovable allocates a 4k page making the slab mixed and
full.
- Repeat until out of memory.
OR
- Userspace allocates a lot of memory in those slabs.
- Userspace frees one in every 15 4k chunks.
- Userspace forks 1000 times causing an unmovable task structure to
appear in 1000 slabs.
MfG
Goswin
-1. It helps providing a few guarantees: when you run "/usr/bin/free" you won't get a random number, but a strong _guarantee_. That ram will be available no matter what. With variable order page size you may run oom by mlocking some half free ram in pagecache backed by largepages. "free" becomes a fake number provided by a weak design. Apps and admin need to know for sure the ram that is available to be able to fine-tune the workload to avoid running into swap but while 2. yes, slab can indeed be freed to release an excessive number of 64k pages pinned by an insignificant number of small objects. I already told to Mel even at the VM summit, that the slab defrag can payoff regardless, and this is nothing new, since it will payoff even There's not just 1 4k object in the system... The whole point is to make sure all those 4k objects goes into the same 64k page. This way for you to be able to reproduce Nick's worst case scenario you have to Movable? I rather assume all slab allocations aren't movable. Then slab defrag can try to tackle on users like dcache and inodes. Keep in mind that with the exception of updatedb, those inodes/dentries will be pinned and you won't move them, which is why I prefer to consider The entire slab being full is a perfect scenario. It means zero memory If with slabs you mean slab/slub, I can't follow, there has never been a single byte of userland memory allocated there since ever the slab I guess you're confusing the config-page-shift design with the sgi design where userland memory gets mixed with slab entries in the same 64k page... Also with config-page-shift the userland pages will all be 64k. Things will get more complicated if we later decide to allow kmalloc(4k) pagecache to be mapped in userland instead of only being available for reads. But then we can restrict that to a slab and to make it relocatable by following the ptes. That will complicate things a lot. But the whole point is that you don't need all that compl...
This and other comments in your reply show me that you completely misunderstood what I was talking about. Look at http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg The red dots (pinned) are dentries, page tables, kernel stacks, whatever kernel stuff, right? The green dots (movable) are mostly userspace pages being mapped there, right? What I was referring to is that because movable objects (green dots) aren't moved out of a mixed group (the boxes) when some unmovable object needs space all the groups become mixed over time. That means the unmovable objects are spread out over all the ram and the buddy system can't recombine regions when unmovable objects free them. There will nearly always be some movable objects in the other buddy. The system of having unmovable and movable groups breaks down and becomes useless. I'm assuming here that we want the possibility of larger order pages for unmovable objects (large contiguous regions for DMA for example) than the smallest order user space gets (or any movable object). If mmap() still works on 4k page boundaries then those will fragment all regions into 4k chunks in the worst case. Obviously if userspace has a minimum order of 64k chunks then it will never break any region smaller than 64k chunks and will never cause a fragmentation catastrophe. I know that is very roughly your approach (make order 0 bigger), and I like it, but it has some limits as to how big you can make it. I don't think my system with 1GB ram would work so well with 2MB order 0 pages. But I wasn't referring to that but to the picture. MfG Goswin -
What does the large square represent here? A "largepage"? If yes, If the largepage is the square, there can't be red pixels mixed with green pixels with the config-page-shift design, this is the whole difference... zooming in I see red pixels all over the squares mixed with green pixels in the same square. This is exactly what happens with the variable order page cache and that's why it provides zero guarantees If I understood correctly, here you agree that mixing movable and unmovable objects in the same largepage is a bad thing, and that's incidentally what config-page-shift prevents. It avoids it instead of undoing the mixture later with defrag when it's far too late for With config-page-shift mmap works on 4k chunks but it's always backed by 64k or any other largesize that you chose at compile time. And if the virtual alignment of mmap matches the physical alignment of the physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we could use the 62nd bit of the pte to use a 64k tlb (if future cpus will allow that). Nick also suggested to still set all ptes equal to Yep, exactly this is what happens, it avoids that trouble. But as far as fragmentation guarantees go, it's really about keeping the unmovable out of our way (instead of spreading the unmovable all over the buddy randomly, or with ugly boot-time-fixed-numbers-memory-reservations) than to map largepages in userland. In fact as I said we could map kmalloced 4k entries in userland to save memory if we would really want to hurt the fast paths to make a generic kernel to use on smaller systems, but that would be very complex. Since those 4k entries would be 100% movable (not like the rest of the slab, like dentries and inodes etc..) that wouldn't make the design less reliable, it'd still be 100% reliable and performance would be ok because that memory is userland memory, we've Sure! 
2M is sure way excessive for a 1G system, 64k most certainly too, of course unless you're running a db or a multim...
hmm, it was a long time ago. This one looks like 4MB large pages so Yes. I can enforce a similar situation but didn't because the evacuation costs could not be justified for hugepage allocations. Patches to do such a thing were prototyped a long time ago and abandoned based on cost. For This picture is not grouping pages by mobility so that is hardly a surprise. This picture is not running grouping pages by mobility. This is what the normal kernel looks like. Look at the videos in http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how list-based compares to vanilla. These are from February when there was less control over mixing blocks than there is today. In the current version mixing occurs in the lower blocks as much as possible not the upper ones. So there are a number of mixed blocks but the number is kept to a minimum. The number of mixed blocks could have been enforced as 0, but I felt it was better in the general case to fragment rather than regress performance. That may be different for large blocks where you will want to take the We don't take defrag steps at the moment at all. There are memory compaction patches but I'm not pushing them until we can prove they are As I said elsewhere, you can try this easily on top of grouping pages by mobility. They are not mutually exclusive and you'll have a comparison Ok, get it implemented so and we'll try it out because we're just hand-waving here and not actually producing anything to compare. It'll be interesting to see how it works out for large blocks and hugepages (although I expect the latter to fail unless grouping pages by mobility is in place). Ideally, they'll complement each other nicely but only ever having mixing take place at the 64KB boundary. I have the testing setup necessary for checking out hugepages at least and I hope to put together something that tests large blocks as well. Minimally, running the hugepage allocation tests -- -- Mel Gorman Part-time Phd Student L...
I agree that 0 is a bad value. But so is infinity. There should be
some mixing but not a lot. You say "kept to a minimum". Is that
actively done or does it already happen by itself? Hopefully the latter which
But would mapping a random 4K page out of a file then consume 64k?
That sounds like an awful lot of internal fragmentation. I hope the
unaligned bits and pieces get put into a slab or something as you
It is too bad that existing amd64 CPUs only allow such large physical
pages. But it kind of makes sense to cut away a full level or page
rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the
occasional mplayer.
I would mostly be concerned how rtorrent's totally random access of
mmapped files negatively impacts such a 64k page system.
MfG
Goswin
-Happens by itself due to biasing mixing blocks at lower PFNs. The exact For what it's worth, the last allocation failure that occurred with grouping pages by mobility was order-1 atomic failures for a wireless network card when bittorrent was running. You're likely right in that torrents will be an interesting workload in terms of fragmentation. -- -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
I have been toying with the idea of having separate caches for pinned and movable dentries. Downside of such a patch would be the number of memcpy() operations when moving dentries from one cache to the other. Upside is that a fair amount of slab cache can be made movable. memcpy() is still faster than reading an object from disk. Most likely the current reaction to such a patch would be to shoot it down due to overhead, so I didn't pursue it. All I have is an old patch to separate never-cached from possibly-cached dentries. It will increase the odds of freeing a slab, but provide no guarantee. But the point here is: dentries/inodes can be made movable if there are clear advantages to it. Maybe they should? Jörn -- Joern's library part 2: http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html -
Totally inappropriate.
I bet 99% of all "dentry_lookup()" calls involve turning the last dentry
from having a count of zero ("movable") to having a count of 1 ("pinned").
So such an approach would fundamentally be broken. It would slow down all
normal dentry lookups, since the *common* case for leaf dentries is that
they have a zero count.
So it's much better to do it on a "directory/file" basis, on the
assumption that files are *mostly* movable (or just freeable). The fact
that they aren't always (ie while kept open etc), is likely statistically
not all that important.
Linus
-My approach is to have one cache for mount points and ramfs/tmpfs/sysfs/etc., which are pinned for their entire lifetime, and another for regular files/inodes. One could take a three-way approach and have always-pinned, often-pinned and rarely-pinned.
We won't get never-pinned that way.
Jörn
--
The wise man seeks everything in himself; the ignorant man tries to get
everything from somebody else.
-- unknown
-
That sounds pretty good. The problem, of course, is that most of the time, the actual dentry allocation itself is done before you really know which case the dentry will be in, and the natural place for actually giving the dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate" it with d_add() or d_instantiate().
But it turns out that most of the filesystems we care about already use a special case of "d_add()" that *already* replaces the dentry with another one in some cases: "d_splice_alias()".
So I bet that if we just taught "d_splice_alias()" to look at the inode, and based on the inode just re-allocate the dentry to some other slab cache, we'd already handle a lot of the cases!
And yes, you'd end up with the reallocation overhead quite often, but at least it would now happen only when filling in a dentry, not in the (*much* more critical) cached lookup path.
Linus
-
There may be another approach. We could create a never-pinned cache, without trying hard to keep it full. Instead of moving a hot dentry at dput() time, we move a cold one from the end of the LRU. And if the LRU list is short, we just chicken out.
Our definition of "short LRU list" can either be based on a ratio of pinned to unpinned dentries or on a metric of cache hits vs. cache misses. I tend to dislike the cache hit metric, because updatedb would cause tons of misses and result in the same mess we have right now.
With this double cache, we have a source of slabs to cheaply reap under memory pressure, but still have a performance advantage (memcpy beats disk I/O by orders of magnitude).
Jörn
--
The story so far:
In the beginning the Universe was created. This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams
-
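The double-cache idea above can be modelled as a toy in user space. This is only a sketch of the policy, not kernel code: the names `migrate_cold`, `batch` and `min_unpinned_ratio` are invented here, the 0.5 threshold is arbitrary, and entries are plain strings standing in for dentries. Popping from the deque stands in for the memcpy() into the never-pinned cache; in a real dcache the dentry would stay on the LRU and only its backing slab would change.

```python
from collections import deque


def migrate_cold(lru, movable, pinned_count, batch=2, min_unpinned_ratio=0.5):
    """Move up to `batch` entries from the cold end of `lru` into `movable`.

    `lru` is ordered hot (left) to cold (right) and holds unpinned entries.
    If the unpinned LRU is short relative to the number of pinned entries,
    give up ("chicken out") and move nothing. Returns the number moved.
    """
    if pinned_count > 0 and len(lru) / pinned_count < min_unpinned_ratio:
        return 0
    moved = 0
    while lru and moved < batch:
        movable.append(lru.pop())  # coldest entry first
        moved += 1
    return moved


lru = deque(["a", "b", "c", "d"])   # "d" is the coldest entry
movable = []
moved = migrate_cold(lru, movable, pinned_count=2)
```

The point of picking from the cold end is that the copy cost is only paid for entries that were not about to be looked up again, so the hot lookup path is left alone.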
You would only get it for dentries that live long (or your prediction
is awfully wrong) and then the reallocation amortizes over time if you
will. :)
MfG
Goswin
-How probable is it that the dentry is needed again? If you copy it and
it is not needed then you wasted time. If you throw it out and it is
needed then you wasted time too. Depending on the probability one of
the two is cheaper overall. Ideally I would throw away dentries that
haven't been accessed recently and copy recently used ones.
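The trade-off can be written down as an expected-cost comparison. This is a minimal sketch under simple assumptions: copying always costs `copy_cost`, dropping costs `reload_cost` only if the dentry is needed again (with probability `p_reuse`), and both costs are in the same arbitrary unit. The function names and the example numbers are made up for illustration.

```python
def break_even_probability(copy_cost, reload_cost):
    """Reuse probability above which copying beats dropping."""
    return copy_cost / reload_cost


def copy_is_cheaper(p_reuse, copy_cost, reload_cost):
    """Compare the fixed copy cost with the expected cost of a later reload."""
    expected_drop_cost = p_reuse * reload_cost
    return expected_drop_cost > copy_cost


# Hypothetical units: a memcpy costs 1, re-reading from disk costs 1000.
threshold = break_even_probability(1.0, 1000.0)
hot = copy_is_cheaper(0.01, 1.0, 1000.0)      # reused 1% of the time
cold = copy_is_cheaper(0.0001, 1.0, 1000.0)   # reused 0.01% of the time
```

With a large gap between copy and reload cost the break-even probability is tiny, so copying wins for anything but entries that are almost never touched again, which is why recency of access is a reasonable proxy.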
How much of a system's RAM is spent on dentries? How much on task
structures? Does anyone have some stats on that? If it is <10% of the
total RAM combined then I don't see much point in moving them. Just
keep them out of the way of users' memory so the buddy system can work
MfG
Goswin
-As usual, the answer is "it depends". I've had up to 600MB in dentry and inode slabs on a 1GiB machine after updatedb. This machine currently has 13MB in dentries, which seems to be reasonable for my purposes.
Jörn
--
Audacity augments courage; hesitation, fear.
-- Publilius Syrus
-
Except now, as I've repeatedly pointed out, you have internal fragmentation problems. If we went with the SLAB, we would need 16MB slabs on PowerPC for example to get the same sort of results and a lot of copying and moving when
Nothing stops you altering the PAGE_SIZE so that large blocks work in the way you envision and keep grouping pages by mobility for huge page sizes.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
Well, not sure about the 16MB number, since I'm unsure what the size of the ram was. But clearly I agree there are fragmentation issues in the slab too; there have always been, except they're much less severe, and the slab is meant to deal with that regardless of the PAGE_SIZE. That is not a new problem; you are introducing a new problem instead.
We can do a lot better than slab currently does without requiring any defrag move-or-shrink at all.
slab is trying to defrag memory for small objects at nearly zero cost, by not giving pages away randomly. I thought you agreed that solving the slab fragmentation was going to provide better guarantees when in another email you suggested that you could start allocating order > 0 pages in the slab to reduce the fragmentation (to achieve most of the guarantee provided by config-page-shift, but while still keeping the
You ignore one other bit: when "/usr/bin/free" says 1G is free, with config-page-shift it's free no matter what, and the same goes for cache that is not mlocked. With variable order page cache, /usr/bin/free becomes mostly a lie as long as there's no 4k fallback (like fsblock).
And most important, you're only tackling the pagecache and I/O performance with inefficient I/O devices; the rest of the kernel has no chance to get a speedup. In fact you're making the fast paths slower, just the opposite of config-page-shift and Hugh's original large PAGE_SIZE ;).
-
% free
             total       used       free     shared    buffers     cached
Mem:       1398784    1372956      25828          0     225224     321504
-/+ buffers/cache:     826228     572556
Swap:      1048568         20    1048548
When has free ever given any useful "free" number? I can perfectly
well allocate another gigabyte of memory despite free saying 25MB. But
that is because I know that the buffer/cached are not locked in.
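The "-/+ buffers/cache" row in that (older procps) output is just arithmetic on the "Mem:" row: buffers and cache are subtracted from "used" and added to "free". A small sketch using the numbers from the output above, with a made-up function name:

```python
def buffers_cache_row(used, free, buffers, cached):
    """Return (used, free) in KiB with buffers and page cache counted as free."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


# Values in KiB, copied from the "Mem:" row of the output above.
total, used, free, buffers, cached = 1398784, 1372956, 25828, 225224, 321504
really_used, really_free = buffers_cache_row(used, free, buffers, cached)
```

This reproduces the 826228 / 572556 pair shown by free, so roughly 559MB rather than 25MB is available, provided the cache really can be dropped, which is exactly the assumption being argued about in this thread.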
On the other hand 1GB can instantly vanish when I start a xen domain
and anything relying on the free value would lose.
The only sensible thing for an application concerned with swapping is
to watch the swapping and then reduce itself, not to watch the amount
free. Although I wish there were some kernel interface to get a
pressure value of how valuable free pages would be right now. I would
like that for fuse so a userspace filesystem can do caching without
crippling the kernel.
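"Watch the swapping" could look roughly like the sketch below. It relies on the cumulative pswpin/pswpout counters that /proc/vmstat exports; the helper names and the shrink policy (react to any new swap-out between two samples) are hypothetical and only meant to illustrate the idea. The parser takes the file contents as a string so it can be tried without a live system.

```python
def parse_swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def should_shrink(before, after, threshold=0):
    """Hypothetical policy: shrink our cache if more than `threshold`
    pages were swapped out between the two samples."""
    swapped_out = after.get("pswpout", 0) - before.get("pswpout", 0)
    return swapped_out > threshold


def read_vmstat(path="/proc/vmstat"):
    """Read the live counters (Linux only)."""
    with open(path) as f:
        return parse_swap_counters(f.read())


# Two made-up samples taken some time apart.
before = parse_swap_counters("nr_free_pages 6457\npswpin 10\npswpout 20\n")
after = parse_swap_counters("nr_free_pages 6000\npswpin 10\npswpout 25\n")
```

An application would call read_vmstat() periodically and trim its own cache when should_shrink() fires. This only reacts after swapping has started; it is not the pressure interface wished for above.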
MfG
Goswin
-Well, as you said you know that buffer/cached are not locked in. If /proc/meminfo were rubbish, as you seem to imply in the first line, why would we ever bother to export that information and even
Repeated drop caches + free can help.
-
As a user I know it because I didn't put a kernel source into /tmp. A
Xen has its own memory pool and can quite aggressively reclaim memory
from dom0 when needed. I just meant to say that the number in
/proc/meminfo can change in a second so it is not much use knowing
I would kill any program that does that to find out how much free ram
the system has.
MfG
Goswin
-Various apps require you (admin/user) to tune the size of their
The whole point is if there's not enough ram of course... this is why
The numbers will change depending on what's running on your system. It's up to you to know, plus I normally keep vmstat monitored in the background to see how the cache/free levels change over
The admin should do that if he's unsure, not a program of course!
-
