On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote:1. It helps providing a few guarantees: when you run "/usr/bin/free" you won't get a random number, but a strong _guarantee_. That ram will be available no matter what. With variable order page size you may run oom by mlocking some half free ram in pagecache backed by largepages. "free" becomes a fake number provided by a weak design. Apps and admin need to know for sure the ram that is available to be able to fine-tune the workload to avoid running into swap but while using all available ram at the same time. 2. yes, slab can indeed be freed to release an excessive number of 64k pages pinned by an insignificant number of small objects. I already told to Mel even at the VM summit, that the slab defrag can payoff regardless, and this is nothing new, since it will payoff even today with 2.6.23 with regard to kmalloc(32). There's not just 1 4k object in the system... The whole point is to make sure all those 4k objects goes into the same 64k page. This way for you to be able to reproduce Nick's worst case scenario you have to allocate total_ram/4k objects large 4k... Movable? I rather assume all slab allocations aren't movable. Then slab defrag can try to tackle on users like dcache and inodes. Keep in mind that with the exception of updatedb, those inodes/dentries will be pinned and you won't move them, which is why I prefer to consider them not movable too... since there's no guarantee they are. The entire slab being full is a perfect scenario. It means zero memory waste, it's actually the ideal scenario, I can't follow your logic... for(;;) kmalloc(32); is supposed to run oom, no breaking news here... If with slabs you mean slab/slub, I can't follow, there has never been a single byte of userland memory allocated there since ever the slab existed in linux. I guess you're confusing the config-page-shift design with the sgi design where userland memory gets mixed with slab entries in the same 64k page... Also with config-page-shift the userland pages will all be 64k. Things will get more complicated if we later decide to allow kmalloc(4k) pagecache to be mapped in userland instead of only being available for reads. But then we can restrict that to a slab and to make it relocatable by following the ptes. That will complicate things a lot. But the whole point is that you don't need all that complexity, and that as long as you're ok to lose some memory, you will get a strong guarantee when "free" tells you 1G is free or available as cache. If 1000 kmem_cache_alloc(kernel_stack) in a row will keep pinned 1000 64k slab pages it means previously there have at least been 64k/8k*1000 simultaneous tasks allocated at once, not just your 1000 fork. Even if when "free" says there's 1G free, it wouldn't be a 100% strong guarantee, and even if the slab wouldn't provide strong defrag avoidance guarantees by design, splitting pages down in the core, and then merging them up outside the core, sounds less efficient than keeping the pages large in the core, and then splitting them outside the core for the few non-performance critical small users. We're not talking about laptops here, if the major load happens on tiny things and tiny objects nobody should compile a kernel with 64k page size, which is why there need to be 2 rpm to get peak performance. -
| david | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| David Woodhouse | [GIT *] Allow request_firmware() to be satisfied from in-kernel, use it in more dr... |
| Philipp Marek | Re: sys_chroot+sys_fchdir Fix |
| Greg Kroah-Hartman | [PATCH 008/196] Chinese: add translation of volatile-considered-harmful.txt |
git: | |
| Krishna Kumar | [PATCH 9/10 REV5] [IPoIB] Implement batching |
| Gerrit Renker | [PATCH 15/37] dccp: Set per-connection CCIDs via socket options |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | [GIT]: Networking |
