This is a rewrite of a patch I had written long back to remove struct page (I shared the patches with Kamezawa, but never posted them anywhere else). I spent the weekend, cleaning them up for 2.6.27-rc5-mmotm (29 Aug 2008). I've tested the patches on an x86_64 box, I've run a simple test running under the memory control group and the same test running concurrently under two different groups (and creating pressure within their groups). I've also compiled the patch with CGROUP_MEM_RES_CTLR turned off. Advantages of the patch 1. It removes the extra pointer in struct page Disadvantages 1. It adds an additional lock structure to struct page_cgroup 2. Radix tree lookup is not an O(1) operation, once the page is known getting to the page_cgroup (pc) is a little more expensive now. This is an initial RFC for comments TODOs 1. Test the page migration changes 2. Test the performance impact of the patch/approach Comments/Reviews? Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/memcontrol.h | 23 +++++- include/linux/mm_types.h | 4 - mm/memcontrol.c | 165 +++++++++++++++++++++++++++++++++------------ 3 files changed, 144 insertions(+), 48 deletions(-) diff -puN mm/memcontrol.c~memcg_move_to_radix_tree mm/memcontrol.c --- linux-2.6.27-rc5/mm/memcontrol.c~memcg_move_to_radix_tree 2008-08-30 22:49:28.000000000 +0530 +++ linux-2.6.27-rc5-balbir/mm/memcontrol.c 2008-08-31 23:03:06.000000000 +0530 @@ -24,7 +24,7 @@ #include <linux/smp.h> #include <linux/page-flags.h> #include <linux/backing-dev.h> -#include <linux/bit_spinlock.h> +#include <linux/radix-tree.h> #include <linux/rcupdate.h> #include <linux/slab.h> #include <linux/swap.h> @@ -40,6 +40,9 @@ struct cgroup_subsys mem_cgroup_subsys _ static struct kmem_cache *page_cgroup_cache __read_mostly; #define MEM_CGROUP_RECLAIM_RETRIES 5 +static struct radix_tree_root mem_cgroup_tree; +static spinlock_t mem_cgroup_tree_lock; + /* * Statistics ...
On Sun, 31 Aug 2008 23:17:56 +0530 It's just because I think there is no strong requirements for 64bit count/mapcount. There is no ZERO_PAGE() for ANON (by Nick Piggin. I add him to CC.) plz wait until lockless page cgroup.... And If you don't support radix-tree-delete(), pre-allocating all at boot is better. BTW, why pc->lock is necessary ? It increases size of struct page_cgroup and reduce the advantege of your patch's to half (8bytes -> 4bytes). Thanks, --
I understand the comment, but not it's context. Are you suggesting that the sizeof _count and _mapcount can be reduced? Hence the impact of having a member in struct page is not all that large? I think the patch is definitely very That depends, if we can get the lockless page cgroup done quickly, I don't mind waiting, but if it is going to take longer, I would rather push these changes in. There should not be too much overhead in porting lockless page cgroup patch on top of this (remove pc->lock and use pc->flags). I'll help out, so as to We do use radix-tree-delete() in the code, please see below. Pre-allocating has Yes, I've mentioned that as a disadvantage. Are you suggesting that with -- Balbir --
On Mon, 01 Sep 2008 08:58:32 +0530 Maybe they cannot be reduced. For 32bit systems, if the machine doesn't equip crazy amounts of memory (as 32GB) I don't think this 32bit is not very large. Let's calculate. 1GB/4096 x 4 bytes = 1 MB per 1GB. But you adds spinlock_t, then what this patch reduce is not so big. Maybe only The development of lockless-page_cgroup is not stalled. I'm just waiting for my 8cpu box comes back from maintainance... Not so clear at this stage. Thanks, -Kame --
On Mon, 1 Sep 2008 13:03:51 +0900 This is current status (result of unixbench.) result of 2core/1socket x86-64 system. == [disabled] Execl Throughput 3103.3 lps (29.7 secs, 3 samples) C Compiler Throughput 1052.0 lpm (60.0 secs, 3 samples) Shell Scripts (1 concurrent) 5915.0 lpm (60.0 secs, 3 samples) Shell Scripts (8 concurrent) 1142.7 lpm (60.0 secs, 3 samples) Shell Scripts (16 concurrent) 586.0 lpm (60.0 secs, 3 samples) Dc: sqrt(2) to 99 decimal places 131463.3 lpm (30.0 secs, 3 samples) [rc4mm1] Execl Throughput 3004.4 lps (29.6 secs, 3 samples) C Compiler Throughput 1017.9 lpm (60.0 secs, 3 samples) Shell Scripts (1 concurrent) 5726.3 lpm (60.0 secs, 3 samples) Shell Scripts (8 concurrent) 1124.3 lpm (60.0 secs, 3 samples) Shell Scripts (16 concurrent) 576.0 lpm (60.0 secs, 3 samples) Dc: sqrt(2) to 99 decimal places 125446.5 lpm (30.0 secs, 3 samples) [lockless] Execl Throughput 3041.0 lps (29.8 secs, 3 samples) C Compiler Throughput 1025.7 lpm (60.0 secs, 3 samples) Shell Scripts (1 concurrent) 5713.6 lpm (60.0 secs, 3 samples) Shell Scripts (8 concurrent) 1113.7 lpm (60.0 secs, 3 samples) Shell Scripts (16 concurrent) 571.3 lpm (60.0 secs, 3 samples) Dc: sqrt(2) to 99 decimal places 125417.9 lpm (30.0 secs, 3 samples) == From this, single-thread results are good. multi-process results are not good ;) So, I think the number of atomic ops are reduced but I have should-be-fixed contention or cache-bouncing problem yet. I'd like to fix this and check on 8 core system when it is back. Recently, I wonder within-3%-overhead is realistic goal. Thanks, -Kame --
It would be nice to be under 3% and lower if possible. I know it is a hard goal to achieve, but worth striving for. I'll try and extract some numbers with the radix tree changes and see if I am adding to the overhead (in terms of time) :) -- Balbir --
There are other things like sizeof(struct page) crossing cacheline boundaries and if we pass cgroup_disabled=memory, we save on the radix tree slots and I understand and I am not pushing you to completing it, but at the same time I don't want to queue up behind it for long. I suspect the cost of porting lockless page cache on top of my patches should not be high, but I'll never know -- Balbir --
On Mon, 01 Sep 2008 11:39:26 +0530 My point is, your patch adds big lock. Then, I don't have to do meaningless effort to reduce lock. Thanks, -Kame --
My patch does not add a big lock, it moves the lock from struct page->page_cgroup to struct page_cgroup. The other locking added is the locking overhead associated with inserting entries into the radix tree, true. I ran oprofile along with lockdep and lockstats enabled on my patches. I don't see the radix_tree or page_cgroup->lock showing up, I see __slab_free and __slab_alloc showing up. I'll poke a little further. Please don't let my patch stop you, we'll integrate the best of both worlds and what is good for memcg. -- Balbir --
On Mon, 01 Sep 2008 11:55:39 +0530 Hmm, one concern I have now is I don't see any contention on res_counter->lock in recent lock_stat test....which was usually on the top of list in past. Thank you. To be honest, I wonder control via page_cgroup may be too rich for 32bit archs ;(. Regards, -Kame --
I think it would be nice to reduce the impact when it is not configured anyway. Normally I would not mind so much, but this is something that many distros will want to enable but fewer users will make use of it. I think it is always a very good idea to try to reduce struct page size. When looking at the performance impact though, just be careful with the alignment of struct page... I actually think it is going to be a If you do that, it might even be an idea to allocate flat arrays with bootmem. It would just be slightly more tricky more tricky to fit this in with the memory model. But that's not a requirement, just an idea for a small optimisation. Thanks, Nick --
On Mon, 1 Sep 2008 16:56:44 +1000 On 32bit, sizeof(struct page) = 32bytes + 4bytes(page_cgroup) On 64bit, sizeof(struct page) = 56bytes + 8bytes(page_cgroup) If we make mem_res_controller available only under SPARSEMEM, I think we can do in very straightfoward way. Thanks, -Kame --
Right. Well, either one is a problem because we always prefer to have That could be a reasonable solution. Balbir has other concerns about this... so I think it is OK to try the radix tree approach first. --
Thanks, Nick! Kamezawa-San, I would like to integrate the radix tree patches after review and some more testing then integrate your patchset on top of it. Do you have any objections/concerns with the suggested approach? -- Balbir --
On Tue, 02 Sep 2008 14:54:17 +0530 please show performance number first. Thanks, --
Yes, that is why said some more testing. I am running lmbench and kernbench on it and some other tests, I'll get back with numbers. -- Balbir --
On Tue, 02 Sep 2008 15:28:34 +0530 A test which is not suffer much from I/O is better. And please don't worry about my patches. I'll reschedule if yours goes first. Thanks, -Kame --
Thanks, I'll try and find the right set of tests. -- Balbir --
On Tue, 02 Sep 2008 15:42:43 +0530 Maybe it's good time to share my concerns. IMHO, the memory resource controller is for dividing memory into groups. We have following choices to divide memory into groups, now. - cpuset(+ fake NUMA) - VM (kvm, Xen, etc...) - memory resource controller. (memcg) Considering 3 aspects peformance, flexibility, isolation(security). My expectaion is peroformance : cpuset > memcg >> VMs flexibility : memcg > VMs >> cpuset. isolation : VMs >> cpuset >= memcg The word 'flexibility' sounds sweet *but* it's just one of aspects. If the peformance is cpuset > VMs > memcg, I'll advise users to use VMs. I think VMs are getting faster and faster. memcg will be slower when we add new 'fancy' feature more. (I think we need some more features.) So, I want to keep memcg fast as much as possible at this stage. But yes, memory usage overhead of page->page_cgroup, struct page_cgroup is big on 32bit arch. I'll say users to use VMs, maybe ;) Thanks, -Kame --
I understand your concern and I am not trying to reduce memcg's performance - or add a fancy feature. I am trying to make memcg more friendly for distros. I see your point about the overhead. I just got back my results - I see a 4% overhead with the patches. Let me see if I can rework them for better performance. -- Balbir --
On Tue, 02 Sep 2008 18:07:18 +0530 Just an idea, by using atomic_ops page_cgroup patch, you can encode page_cgroup->lock to page_cgroup->flags and use bit_spinlock(), I think. (my new patch set use bit_spinlock on page_cgroup->flags for avoiding some race.) This will save extra 4 bytes. Thanks, -Kame --
Exactly the next step I was thinking about (since we already use it, in the current form). Thanks for the suggestion! -- Balbir --
Sorry for the delay in sending out the new patch, I am traveling and thus a little less responsive. Here is the update patch v3...v2 1. Convert flags to unsigned long 2. Move page_cgroup->lock to a bit spin lock in flags v2...v1 1. Fix a small bug, don't call radix_tree_preload_end(), if preload fails This is a rewrite of a patch I had written long back to remove struct page (I shared the patches with Kamezawa, but never posted them anywhere else). I spent the weekend, cleaning them up for 2.6.27-rc5-mmotm (29 Aug 2008). I've tested the patches on an x86_64 box, I've run a simple test running under the memory control group and the same test running concurrently under two different groups (and creating pressure within their groups). I've also compiled the patch with CGROUP_MEM_RES_CTLR turned off. Advantages of the patch 1. It removes the extra pointer in struct page Disadvantages 1. Radix tree lookup is not an O(1) operation, once the page is known getting to the page_cgroup (pc) is a little more expensive now. This is an initial RFC for comments TODOs 1. Test the page migration changes Performance In a unixbench run, these patches had a performance impact of 2% (slowdown). Comments/Reviews? Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/memcontrol.h | 23 +++++ include/linux/mm_types.h | 4 mm/memcontrol.c | 187 +++++++++++++++++++++++++++++---------------- 3 files changed, 144 insertions(+), 70 deletions(-) diff -puN mm/memcontrol.c~memcg_move_to_radix_tree mm/memcontrol.c --- linux-2.6.27-rc5/mm/memcontrol.c~memcg_move_to_radix_tree 2008-09-04 15:45:54.000000000 +0530 +++ linux-2.6.27-rc5-balbir/mm/memcontrol.c 2008-09-08 20:15:30.000000000 +0530 @@ -24,6 +24,7 @@ #include <linux/smp.h> #include <linux/page-flags.h> #include <linux/backing-dev.h> +#include <linux/radix-tree.h> #include <linux/bit_spinlock.h> #include <linux/rcupdate.h> #include <linux/slab.h> @@ -40,6 ...
On Mon, 8 Sep 2008 20:58:10 +0530
Hmm.. I've considered this approach for a while and my answer is that
this is not what you really want.
Because you just moves the placement of pointer from memmap to
radix_tree both in GFP_KERNEL, total kernel memory usage is not changed.
So, at least, you have to add some address calculation (as I did in March)
to getting address of page_cgroup. But page_cgroup itself consumes 32bytes
per page. Then.....
My proposal to 32bit system is following
- remove page_cgroup completely.
- As a result, there is no per-cgroup lru. But it will not be bad
bacause the number of cgroups and pages are not big.
just a trade-off between kernel-memory-space v.s. speed.
- Removing page_cgroup and just remember address of mem_cgroup per page.
How do you think ?
Thanks,
--
Just keep in mind that an important point is to make it more attractive to configure cgroup into the kernel, but have it disabled or unused at runtime. --
On Tue, 9 Sep 2008 13:58:27 +1000
Hmm..kicking out 4bytes per 4096bytes if disabled ?
maybe a routine like SPARSEMEM is a choice.
Following is pointer pre-allocation. (just pointer, not page_cgroup itself)
==
#define PCG_SECTION_SHIFT (10)
#define PCG_SECTION_SIZE (1 << PCG_SECTION_SHIFT)
struct pcg_section {
struct page_cgroup **map[PCG_SECTION_SHIFT]; //array of pointer.
};
struct page_cgroup *get_page_cgroup(unsigned long pfn)
{
struct pcg_section *sec;
sec = pcg_section[(pfn >> PCG_SECTION_SHIFT)];
return *sec->page_cgroup[(pfn & ((1 << PCG_SECTTION_SHIFT) - 1];
}
==
If we go extreme, we can use kmap_atomic() for pointer array.
Overhead of pointer-walk is not so bad, maybe.
For 64bit systems, we can find a way like SPARSEMEM_VMEMMAP.
Thanks,
-Kame
Thanks,
-Kame
--
Yeah of course. 4 or 8 bytes. Everything adds up. There is nothing special about cgroups that says it is allowed to use fields in struct page where others cannot. Put it in perspective: we try very hard not to allocate new Yes I too think that would be the ideal way to go to get the best of performance in the enabled case. However Balbir I believe is interested in memory savings if not all pages have cgroups... I don't know, I don't care so much about the "enabled" case, so I'll leave you two to fight it out :) --
On Tue, 9 Sep 2008 15:00:10 +1000 I'll add a new patch on my set. Balbir, are you ok to CONFIG_CGROUP_MEM_RES_CTLR depends on CONFIG_SPARSEMEM ? I thinks SPARSEMEM(SPARSEMEM_VMEMMAP) is widely used in various archs now. Thanks, -Kame --
Can't we make it more generic. I was thinking of allocating memory for each node for page_cgroups (of the size of spanned_pages) at initialization time. I've not yet prototyped the idea. BTW, even with your approach I fail to see why we need to add a dependency on CONFIG_SPARSEMEM (but again it is 4:30 in the morning and I might be missing the obvious) -- Balbir --
I guess it would be just a matter of coding up the implementation for each model you want to support. In some cases, you might lose memory (eg in the case of flatmem where you have holes in ram), but in those cases you lose memory from the struct pages anyway so I wouldn't worry too much. I think it would be reasonable to provide an implementation for flatmem as well (which AFAIK is the other non-deprecated memory model). It should not be hard to code AFAIKS. --
Doesn't have big issue without CONFIG_SPARSEMEM, maybe. Sorry for my confusion. Thanks, -Kame --
No problem, I am glad to know that we are not limited to a particular model. -- Balbir --
OK, here is approach #2, it works for me and gives me really good
performance (surpassing even the current memory controller). I am
seeing almost a 7% increase
Caveats
1. Uses more memory (since it allocates memory for each node based on
spanned_pages. Ignores holes, so might not be the most efficient,
but it is a tradeoff of complexity versus space. I propose refining it
as we go along.
2. Does not currently handle alloc_bootmem failure
3. Needs lots of testing/tuning and polishing
I've tested it on an x86_64 box with 4G of memory
Again, this is an early RFC patch, please review test.
Comments/Reviews?
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/memcontrol.h | 32 ++++++
include/linux/mm_types.h | 4
mm/memcontrol.c | 212 +++++++++++++++++++++++++++------------------
mm/page_alloc.c | 10 --
4 files changed, 162 insertions(+), 96 deletions(-)
diff -puN mm/memcontrol.c~memcg_move_to_radix_tree mm/memcontrol.c
--- linux-2.6.27-rc5/mm/memcontrol.c~memcg_move_to_radix_tree 2008-09-04 03:15:54.000000000 -0700
+++ linux-2.6.27-rc5-balbir/mm/memcontrol.c 2008-09-09 17:56:54.000000000 -0700
@@ -18,6 +18,7 @@
*/
#include <linux/res_counter.h>
+#include <linux/bootmem.h>
#include <linux/memcontrol.h>
#include <linux/cgroup.h>
#include <linux/mm.h>
@@ -37,9 +38,10 @@
#include <asm/uaccess.h>
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
-static struct kmem_cache *page_cgroup_cache __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
+static struct page_cgroup *pcg_map[MAX_NUMNODES];
+
/*
* Statistics for memory cgroup.
*/
@@ -137,20 +139,6 @@ struct mem_cgroup {
static struct mem_cgroup init_mem_cgroup;
/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock. We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin). But since
- * bit_spin_lock doesn't actually set ...On Tue, 9 Sep 2008 18:20:48 -0700 This number is from pre-allcation, maybe. ^^^^^ 1. This is nonsense...do you know the memory map of IBM's (maybe ppc) machine ? Node's memory are splitted into several pieces and not ordered by node number. example) Node 0 | Node 1 | Node 2 | Node 1 | Node 2 | This seems special but when I helped SPARSEMEM and MEMORY_HOTPLUG, I saw mannnny kinds of memory map. As you wrote, this should be re-designed. 2. If pre-allocating all is ok, I stop my work. Mine is of-no-use. But you have to know that by pre-allocationg, we can't use avoid-lru-lock by batch like page_vec technique. We can't delay uncharge because a page Can this happen ? Our direction should be Is this lock/unlock_page_cgroup is for what kind of race ? Thanks, -Kame --
Thanks, so that means that we cannot before hand predict the size of pcg_map[n], One of the goals of this patch is refinement, it is a starting piece, something I shared very early. I am not asking you to stop your work. While I think pre-allocating is not the best way to do this, the trade off is the sparseness of the machine. I don't mind doing it in other ways, but we'll still need to do Care to elaborate on this? Why not? If the page is reused, we act on the batch We use page_cgroup_zoneinfo(pc), we want to make sure pc does not disappear or Yes, it can.. several people trying to map the same page at once. Can't we race for setting pc->flags and for setting pc->page and pc->mem_cgroup. -- Balbir --
On Tue, 09 Sep 2008 19:11:09 -0700 Or use some "allocate a chunk of page_cgroup for a chunk of continuous pages". Hmm, maybe clarifying trade-off and comapring them is the first step. And touch vec on other cpu ? The reason "vec" is fast is because it's per-cpu. If we want to use "delaying", we'll have to make page_cgroup unused and not-on-lru Hmm...there is a confustion, maybe. The page_cgroup is now 1:1 to struct page. Then, we can guarantee that - There is no race between charge v.s. uncharge. Only problem is force_empty. (But it's difficult..) This means pc->mem_cgroup is safe here. And pc->flags should be atomic flags, anyway. I believe we have to record "Dirty bit" at el, later. Thanks, -Kame --
It seems really nice to me -- we get the best of both worlds, less overhead for those who don't enable the memory controller, and even better performance for those who do. Are you expecting many users to want to turn this on and off at runtime? I wouldn't expect so, but I don't know enough about them. --
On Thu, 11 Sep 2008 06:44:37 +1000 No trobles for me for allocating-all-at-boot policy. My small concern is - wasting page_cgroup for hugepage area. There is no runtime switch. only at boot. Thanks, -Kame --
In those cases you still waste the struct page area too. I realise that isn't a good way to justify even more wastage. But I guess it is relatively low. At least, I would think the users would be more happy to Fine --
On Thu, 11 Sep 2008 07:02:44 +1000 I guess the increase mostly because we can completely avoid kmalloc/kfree slow path. Balbir, how about fix our way to allocate-all-at-boot-policy ? If you say yes, I think I can help you and I'll find usable part from my garbage. Following is lockless+remove-page-cgroup-pointer-from-page-struct patch's result. rc5-mm1 == Execl Throughput 3006.5 lps (29.8 secs, 3 samples) C Compiler Throughput 1006.7 lpm (60.0 secs, 3 samples) Shell Scripts (1 concurrent) 4863.7 lpm (60.0 secs, 3 samples) Shell Scripts (8 concurrent) 943.7 lpm (60.0 secs, 3 samples) Shell Scripts (16 concurrent) 482.7 lpm (60.0 secs, 3 samples) Dc: sqrt(2) to 99 decimal places 124804.9 lpm (30.0 secs, 3 samples) lockless == Execl Throughput 3035.5 lps (29.6 secs, 3 samples) C Compiler Throughput 1010.3 lpm (60.0 secs, 3 samples) Shell Scripts (1 concurrent) 4881.0 lpm (60.0 secs, 3 samples) Shell Scripts (8 concurrent) 947.7 lpm (60.0 secs, 3 samples) Shell Scripts (16 concurrent) 485.0 lpm (60.0 secs, 3 samples) Dc: sqrt(2) to 99 decimal places 125437.9 lpm (30.0 secs, 3 samples) lockless + remove page cgroup pointer (my version). == Execl Throughput 3021.1 lps (29.5 secs, 3 samples) C Compiler Throughput 980.3 lpm (60.0 secs, 3 samples) Shell Scripts (1 concurrent) 4600.0 lpm (60.0 secs, 3 samples) Shell Scripts (8 concurrent) 915.7 lpm (60.0 secs, 3 samples) Shell Scripts (16 concurrent) 468.3 lpm (60.0 secs, 3 samples) Dc: sqrt(2) to 99 decimal places 124909.1 lpm (30.0 secs, 3 samples) Oh,yes. siginificant slow down. I'm glad to kick this patch out to trash box. Thanks, -Kame --
I am perfectly fine with it, I'll need your expertise to get the -- Thanks, Balbir --
This will really suck for sparse memory machines. Imagine a machine with 1GB of memory at 0x0 and another 1GB of memory at 1TB up in the address space. You also need to consider how it works with memory hotplug and how you're going to grow it at runtime. Oh, and doesn't alloc_bootmem() panic() if it fails internally anyway? I need to look at your other approach. :) -- Dave --
From: Dave Hansen <dave@linux.vnet.ibm.com> You just described the workstation I am typing this from :-) --
I would hate to re-implement the entire sparsemem code :( Kame did suggest making the memory controller depend on sparsemem (to hook in Yes, true. This is not the final version, a very very early version that I We'll need some slab_is_available() sort of checks that sparse.c uses and also deal with memory hotplug add and remove. -- Balbir --
Yeah, you could just make another mem_section member. Or, you could work to abstract the sparsemem code so that other people can use it, or maybe make it more dynamic so we can have multiple pfn->object lookups in parallel. Adding the struct member is obviously easier. -- Dave --
On Wed, 10 Sep 2008 15:56:48 -0700 Don't worry. I'll care sparse memory map and hotplug. But whether making this depends on SPARSEMEM or not is not fixed yet. I'll try generic one, at first. If it's dirty, start discussion about SPARSEMEM. (Honestly, I love sparsemem than others ;) Thanks, -Kame --
My concern is that if we depend on sparsemem, then we force distros to turn on sparsemem (which might be the default, but not on all architectures), we might end up losing those architectures (w.r.t. turning on the memory controller) where sparsemem is not the default on the distro. -- Balbir --
On Wed, 10 Sep 2008 18:47:25 -0700 Yes. I share your concern. Then, I'll try not-on-sparsemem version, at first. Thanks, -Kame --
Before trying the sparsemem approach, I tried a radix tree per node, per zone and I seemed to actually get some performance improvement.(1.5% (noise maybe)) But please do see and review (tested on my x86_64 box with unixbench and some other simple tests) v4..v3 1. Use a radix tree per node, per zone v3...v2 1. Convert flags to unsigned long 2. Move page_cgroup->lock to a bit spin lock in flags v2...v1 1. Fix a small bug, don't call radix_tree_preload_end(), if preload fails This is a rewrite of a patch I had written long back to remove struct page (I shared the patches with Kamezawa, but never posted them anywhere else). I spent the weekend, cleaning them up for 2.6.27-rc5-mmotm (29 Aug 2008). I've tested the patches on an x86_64 box, I've run a simple test running under the memory control group and the same test running concurrently under two different groups (and creating pressure within their groups). Advantages of the patch 1. It removes the extra pointer in struct page Disadvantages 1. Radix tree lookup is not an O(1) operation, once the page is known getting to the page_cgroup (pc) is a little more expensive now. This is an initial RFC (version 3) for comments TODOs 1. Test the page migration changes Performance In a unixbench run, these patches had a performance impact of 2% (slowdown). Comments/Reviews? Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/memcontrol.h | 23 +++ include/linux/mm_types.h | 4 mm/memcontrol.c | 264 ++++++++++++++++++++++++++++++++++----------- 3 files changed, 221 insertions(+), 70 deletions(-) diff -puN mm/memcontrol.c~memcg_move_to_radix_tree mm/memcontrol.c --- linux-2.6.27-rc5/mm/memcontrol.c~memcg_move_to_radix_tree 2008-09-16 17:10:15.000000000 -0700 +++ linux-2.6.27-rc5-balbir/mm/memcontrol.c 2008-09-17 16:15:01.000000000 -0700 @@ -24,6 +24,7 @@ #include <linux/smp.h> #include <linux/page-flags.h> #include ...
Why are we doing this? I can guess, but I'd rather not have to. a) It's slower. b) It uses even more memory worst-case. c) It uses less memory best-case. someone somewhere decided that (Aa + Bb) / Cc < 1.0. What are the values of A, B and C and where did they come from? ;) (IOW, your changelog is in the category "sucky", along with 90% of the others) --
Yes, I agree, to be honest we discussed the reasons on the mailing list and those should go to the changelog. I'll do that in the next version of the patches. These are early RFC patches, but the changelog does suck. -- Balbir --
On Wed, 17 Sep 2008 20:57:54 -0700 IIRC, (Aa + Bb) / Cc < 1.0 discussion was following. Because we have to maintain pointer to page_cgroup in radix-tree (ZONE_NORMAL) 1. memory usage will increase when memory cgroup is enabled. The amount memory usage increase just depends on the height of radix-tree. 2. memory usage will decrease when memory cgroup is disabled. This saves 4bytes per 4096bytes. (on x86-32) Thanks, -Kame --
I think this design is just temporary and the goal is to pre-allocate all page_cgroups at boot time if it isn't disabled. But I think each memory model type should have its own way of managing its page_cgroup arrays as doing for its struct page arrays. --
On Thu, 18 Sep 2008 13:26:13 +0900 (JST) My patch adds an interface. Then... FLATMEM support will be very easy. I'll ignore DISCONTIGMEM and SPARSEMEM (they will use my 'hash') SPARSEMEM_VMEMMAP support will took some amount of time. It will need per-arch patches. Thanks, -Kame --
What part of this do you think is the problem to implement it in the straight way for this model? I think it won't be difficult to implement it since each pgdat can have its page_cgroup array, which can care about holes in the node as well as Yes, each of ia64, powerpc and x86_64 use this memory model. --
On Wed, 17 Sep 2008 18:40:08 -0700 Balbir, don't you like pre-allocate-page-cgroup-at-boot at all ? I don't like radix-tree for objects which can spread to very vast/sparse area ;) BTW, I already have lazy-lru-by-pagevec protocol on my patch(hash version) and seems to work well. I'm now testing it and will post today if I'm enough lucky. Thanks, -Kame --
I tried one version, but before trying a pre-allocation version, I wanted to spread out the radix-tree and try and the results seemed quite impressive. We can still do pre-allocation, but it gets more complicated as we start supporting all memory models. I do have a design on paper, but it is much more complex than cool! Please do post what numbers you see as well. I would appreciate if you can try this version and see what sort of performance issues you see. -- Balbir --
On Wed, 17 Sep 2008 21:58:08 -0700 I'll see what happens. your patch is agasint rc6-mmtom ? BTW, my 8cpu/2socket/Xeon 3.16GHz host is back now. Maybe good for seeing performance. Thanks, -Kame --
On Wed, 17 Sep 2008 21:58:08 -0700 This is the result on 8cpu box. I think I have to reduce footprint of fastpath of my patch ;) Test result of your patch is (2). == Xeon 8cpu/2socket/1-node equips 48GB of memory. run shell/exec benchmark 3 times just after boot. lps ... loops per sec. lpm ... loops per min. (*) Shell tests somtimes fail because of division by zero, etc... (1). rc6-mm1(2008/9/13 version) == Run == 1st == == 2nd == ==3rd== Execl Throughput 2425.2 2534.5 2465.8 (lps) C Compiler Throughput 1438.3 1476.3 1459.1 (lpm) Shell Scripts (1 concurrent) 9360.3 9368.3 9360.0 (lpm) Shell Scripts (8 concurrent) 3868.0 3870.0 3868.0 (lpm) Shell Scripts (16 concurrent) 2207.0 2204.0 2201.0 (lpm) Dc: sqrt(2) to 99 decimal places 101644.3 102184.5 102118.5 (lpm) (2). (1) +remove-page-cgroup-pointer-v3 (radix-tree + dynamic allocation) == Run == 1st == == 2nd == == 3rd == Execl Throughput 2514.1 2548.9 2648.7 (lps) C Compiler Throughput 1353.9 1324.6 1324.7 (lpm) Shell Scripts (1 concurrent) 8866.7 8871.0 8856.0 (lpm) Shell Scripts (8 concurrent) 3674.3 3680.0 3677.7 (lpm) Shell Scripts (16 concurrent) failed. failed 2094.3 (lpm) Dc: sqrt(2) to 99 decimal places 98837.0 98206.9 98250.6 (lpm) (3). (1) + pre-allocation by "vmalloc" + hash + misc(atomic flags etc..) == Run == 1st == == 2nd == == 3rd == Execl Throughput 2385.4 2579.2 2361.5 (lps) C Compiler Throughput 1424.3 1436.3 1430.6 (lpm) Shell Scripts (1 concurrent) 9222.0 9234.0 9246.7 (lpm) Shell Scripts (8 concurrent) ...
Thanks, Kame! I'll look at the lazy lru patches and see if I can find anything. Do you have a unified patch anywhere, I seem to get confused with the patches, I see 10/9, 11/9 and 12/9. I'll do some analysis when I find some free time, I am currently at plumbers conference. -- Balbir --
I think it should try to hook into the physical memory model code. I thought it was going to do that but didn't look at the details... --
Agreed, but we do reduce the sizeof(struct page) without adding on to What address calculation do we need, sorry I don't recollect it. 32 bit systems with PAE can support quite a lot of memory, so I am not sure I I don't like the approach. -- Balbir --
On Mon, 08 Sep 2008 21:18:37 -0700 base_address = base_addrees_of_page_group_chunk_of_pfn. base_address + offset_of_pfn_from_base_pfn * sizeof (struct page_cgroup). Thanks, -Kame --
On Mon, 08 Sep 2008 21:18:37 -0700 Because it's depends on CONFIG. But ok, I'll recall my patches from grave. Maybe it can be a hint for you. Thanks, -Kame --
I agree with the last point, but then when I see http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg41546.html I am tempted to reduce the size. I did ask Andi about on which x86_64 machines are we exceeding the cache size, but heard nothing back. The questions to answer are 1. On 64 bit systems, would be OK making the size of struct page 64 bytes Yes, I thought about it, but dropped it for two reasons 1. We don't want a static overhead (irrespective of the memory used and user pages) 2. Like you said, it means fitting into every memory model to get it right. -- Balbir --
On Sun, 31 Aug 2008 23:17:56 +0530 BTW, how deep this radix-tree on 4GB/32GB/64GB/256GB machine ? Thanks, -Kame --
Good Question, My ball-park estimates are number of pfns = RADIX_TREE_TAG_LONGS/(RADIX_TREE_TAG_LONGS - 1) * (RADIX_TREE_LONGS^n - 1) and "n" is the number we are looking for. For a 64 bit system with 256 GB and 4KB page size, I've calculated it to be 9 levels deep. -- Balbir --
And besides, we also have a global lock, that protects even lookup --
Sorry, not sure I understand. The lookup is done under RCU. Updates are done using the global lock. It should not be hard to make the radix tree per node later (as an iterative refinement). -- Balbir --
OOPS, indeed. Sorry for the confusion, then. :) --
Or you could have a look at the concurrent radix tree, esp for dense trees it can save a lot on lock bouncing. Latest code available at: http://programming.kicks-ass.net/kernel-patches/concurrent-pagecache/27-rc3/ --
I'll try them later. I see a lot of slab allocations/frees (expected). I am trying to experiment with alternatives to reduce them. -- Balbir --
