Re: [RFC][PATCH] Remove cgroup member from struct page (v3)

Previous thread: re : ACPI PnP on Intel MU440EX by matthieu castet on Sunday, August 31, 2008 - 10:39 am. (5 messages)

Next thread: [BUG] x86 kenel won't boot under Virtual PC by David Sanders on Sunday, August 31, 2008 - 11:22 am. (21 messages)
From: Balbir Singh
Date: Sunday, August 31, 2008 - 10:47 am

This is a rewrite of a patch I had written long back to remove struct page
(I shared the patches with Kamezawa, but never posted them anywhere else).
I spent the weekend, cleaning them up for 2.6.27-rc5-mmotm (29 Aug 2008).

I've tested the patches on an x86_64 box, I've run a simple test running
under the memory control group and the same test running concurrently under
two different groups (and creating pressure within their groups). I've also
compiled the patch with CGROUP_MEM_RES_CTLR turned off.

Advantages of the patch

1. It removes the extra pointer in struct page

Disadvantages

1. It adds an additional lock structure to struct page_cgroup
2. Radix tree lookup is not an O(1) operation, once the page is known
   getting to the page_cgroup (pc) is a little more expensive now.

This is an initial RFC for comments

TODOs

1. Test the page migration changes
2. Test the performance impact of the patch/approach

Comments/Reviews?

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |   23 +++++-
 include/linux/mm_types.h   |    4 -
 mm/memcontrol.c            |  165 +++++++++++++++++++++++++++++++++------------
 3 files changed, 144 insertions(+), 48 deletions(-)

diff -puN mm/memcontrol.c~memcg_move_to_radix_tree mm/memcontrol.c
--- linux-2.6.27-rc5/mm/memcontrol.c~memcg_move_to_radix_tree	2008-08-30 22:49:28.000000000 +0530
+++ linux-2.6.27-rc5-balbir/mm/memcontrol.c	2008-08-31 23:03:06.000000000 +0530
@@ -24,7 +24,7 @@
 #include <linux/smp.h>
 #include <linux/page-flags.h>
 #include <linux/backing-dev.h>
-#include <linux/bit_spinlock.h>
+#include <linux/radix-tree.h>
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
@@ -40,6 +40,9 @@ struct cgroup_subsys mem_cgroup_subsys _
 static struct kmem_cache *page_cgroup_cache __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+static struct radix_tree_root mem_cgroup_tree;
+static spinlock_t mem_cgroup_tree_lock;
+
 /*
  * Statistics ...
From: KAMEZAWA Hiroyuki
Date: Sunday, August 31, 2008 - 5:01 pm

On Sun, 31 Aug 2008 23:17:56 +0530
It's just because I think there is no strong requirements for 64bit count/mapcount.
There is no ZERO_PAGE() for ANON (by Nick Piggin. I add him to CC.)
plz wait until lockless page cgroup....

And If you don't support radix-tree-delete(), pre-allocating all at boot is better.

BTW, why pc->lock is necessary ? It increases size of struct page_cgroup and reduce 
the advantege of your patch's to half (8bytes -> 4bytes).

Thanks,

--

From: Balbir Singh
Date: Sunday, August 31, 2008 - 8:28 pm

I understand the comment, but not it's context. Are you suggesting that the
sizeof _count and _mapcount can be reduced? Hence the impact of having a member
in struct page is not all that large? I think the patch is definitely very

That depends, if we can get the lockless page cgroup done quickly, I don't mind
waiting, but if it is going to take longer, I would rather push these changes
in. There should not be too much overhead in porting lockless page cgroup patch
on top of this (remove pc->lock and use pc->flags). I'll help out, so as to

We do use radix-tree-delete() in the code, please see below. Pre-allocating has

Yes, I've mentioned that as a disadvantage. Are you suggesting that with

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Sunday, August 31, 2008 - 9:03 pm

On Mon, 01 Sep 2008 08:58:32 +0530
Maybe they cannot be reduced. For 32bit systems, if the machine doesn't equip
crazy amounts of memory (as 32GB) I don't think this 32bit is not very large.

Let's calculate. 1GB/4096 x 4 bytes = 1 MB per 1GB. 
But you adds spinlock_t, then what this patch reduce is not so big. Maybe only
The development of lockless-page_cgroup is not stalled. I'm just waiting for
my 8cpu box comes back from maintainance...
Not so clear at this stage. 

Thanks,
-Kame



--

From: KAMEZAWA Hiroyuki
Date: Sunday, August 31, 2008 - 10:17 pm

On Mon, 1 Sep 2008 13:03:51 +0900
This is current status (result of unixbench.)
result of 2core/1socket x86-64 system.

==
[disabled]
Execl Throughput                           3103.3 lps   (29.7 secs, 3 samples)
C Compiler Throughput                      1052.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               5915.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)               1142.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               586.0 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         131463.3 lpm   (30.0 secs, 3 samples)

[rc4mm1]
Execl Throughput                           3004.4 lps   (29.6 secs, 3 samples)
C Compiler Throughput                      1017.9 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               5726.3 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)               1124.3 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               576.0 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         125446.5 lpm   (30.0 secs, 3 samples)

[lockless]
Execl Throughput                           3041.0 lps   (29.8 secs, 3 samples)
C Compiler Throughput                      1025.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               5713.6 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)               1113.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               571.3 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         125417.9 lpm   (30.0 secs, 3 samples)
==

From this, single-thread results are good. multi-process results are not good ;)
So, I think the number of atomic ops are reduced but I have should-be-fixed
contention or cache-bouncing problem yet. I'd like to fix this and check on 8 core
system when it is back.
Recently, I wonder within-3%-overhead is realistic goal.

Thanks,
-Kame




--

From: Balbir Singh
Date: Sunday, August 31, 2008 - 11:16 pm

It would be nice to be under 3% and lower if possible. I know it is a hard goal
to achieve, but worth striving for. I'll try and extract some numbers with the
radix tree changes and see if I am adding to the overhead (in terms of time) :)

-- 
	Balbir
--

From: Balbir Singh
Date: Sunday, August 31, 2008 - 11:09 pm

There are other things like sizeof(struct page) crossing cacheline boundaries
and if we pass cgroup_disabled=memory, we save on the radix tree slots and

I understand and I am not pushing you to completing it, but at the same time I
don't want to queue up behind it for long. I suspect the cost of porting
lockless page cache on top of my patches should not be high, but I'll never know



-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Sunday, August 31, 2008 - 11:24 pm

On Mon, 01 Sep 2008 11:39:26 +0530
My point is, your patch adds big lock. Then, I don't have to do meaningless effort
to reduce lock.

Thanks,
-Kame

--

From: Balbir Singh
Date: Sunday, August 31, 2008 - 11:25 pm

My patch does not add a big lock, it moves the lock from struct
page->page_cgroup to struct page_cgroup. The other locking added is the locking
overhead associated with inserting entries into the radix tree, true. I ran
oprofile along with lockdep and lockstats enabled on my patches. I don't see the
radix_tree or page_cgroup->lock showing up, I see __slab_free and __slab_alloc
showing up. I'll poke a little further.

Please don't let my patch stop you, we'll integrate the best of both worlds and
what is good for memcg.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Sunday, August 31, 2008 - 11:59 pm

On Mon, 01 Sep 2008 11:55:39 +0530
Hmm, one concern I have now is I don't see any contention on  res_counter->lock in
recent lock_stat test....which was usually on the top of list in past.
Thank you.

To be honest, I wonder control via page_cgroup may be too rich for 32bit archs ;(.

Regards,
-Kame

--

From: Nick Piggin
Date: Sunday, August 31, 2008 - 11:56 pm

I think it would be nice to reduce the impact when it is not configured
anyway. Normally I would not mind so much, but this is something that
many distros will want to enable but fewer users will make use of it.

I think it is always a very good idea to try to reduce struct page size.
When looking at the performance impact though, just be careful with the
alignment of struct page... I actually think it is going to be a

If you do that, it might even be an idea to allocate flat arrays with
bootmem. It would just be slightly more tricky more tricky to fit this
in with the memory model. But that's not a requirement, just an idea
for a small optimisation.

Thanks,
Nick
--

From: KAMEZAWA Hiroyuki
Date: Monday, September 1, 2008 - 12:19 am

On Mon, 1 Sep 2008 16:56:44 +1000
On 32bit, sizeof(struct page) = 32bytes + 4bytes(page_cgroup)
On 64bit, sizeof(struct page) = 56bytes + 8bytes(page_cgroup)
If we make mem_res_controller available only under SPARSEMEM, I think we can
do in very straightfoward way.

Thanks,
-Kame

--

From: Nick Piggin
Date: Monday, September 1, 2008 - 12:43 am

Right. Well, either one is a problem because we always prefer to have

That could be a reasonable solution.  Balbir has other concerns about
this... so I think it is OK to try the radix tree approach first.
--

From: Balbir Singh
Date: Tuesday, September 2, 2008 - 2:24 am

Thanks, Nick!

Kamezawa-San, I would like to integrate the radix tree patches after review and
some more testing then integrate your patchset on top of it. Do you have any
objections/concerns with the suggested approach?

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 2, 2008 - 3:02 am

On Tue, 02 Sep 2008 14:54:17 +0530
please show performance number first.

Thanks,

--

From: Balbir Singh
Date: Tuesday, September 2, 2008 - 2:58 am

Yes, that is why said some more testing. I am running lmbench and kernbench on
it and some other tests, I'll get back with numbers.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 2, 2008 - 3:07 am

On Tue, 02 Sep 2008 15:28:34 +0530
A test which is not suffer much from I/O is better.
And please don't worry about my patches. I'll reschedule if yours goes first.

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, September 2, 2008 - 3:12 am

Thanks, I'll try and find the right set of tests.
-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 2, 2008 - 3:57 am

On Tue, 02 Sep 2008 15:42:43 +0530

Maybe it's good time to share my concerns.

IMHO, the memory resource controller is for dividing memory into groups.

We have following choices to divide memory into groups, now.
  - cpuset(+ fake NUMA)
  - VM (kvm, Xen, etc...)
  - memory resource controller. (memcg)

Considering 3 aspects peformance, flexibility, isolation(security).
My expectaion is

peroformance   : cpuset > memcg >> VMs
flexibility    : memcg  > VMs >> cpuset.
isolation      : VMs >> cpuset >= memcg

The word 'flexibility' sounds sweet *but* it's just one of aspects.
If the peformance is cpuset > VMs > memcg, I'll advise users to use VMs.

I think VMs are getting faster and faster. memcg will be slower when we add new
'fancy' feature more. (I think we need some more features.)
So, I want to keep memcg fast as much as possible at this stage.

But yes, memory usage overhead of page->page_cgroup, struct page_cgroup is big
on 32bit arch. I'll say users to use VMs, maybe ;)

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, September 2, 2008 - 5:37 am

I understand your concern and I am not trying to reduce memcg's performance - or
add a fancy feature. I am trying to make memcg more friendly for distros. I see
your point about the overhead. I just got back my results - I see a 4% overhead
with the patches. Let me see if I can rework them for better performance.


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 2, 2008 - 8:33 pm

On Tue, 02 Sep 2008 18:07:18 +0530
Just an idea, by using atomic_ops page_cgroup patch, you can encode page_cgroup->lock
to page_cgroup->flags and use bit_spinlock(), I think.
(my new patch set use bit_spinlock on page_cgroup->flags for avoiding some race.)

This will save extra 4 bytes.

Thanks,
-Kame

--

From: Balbir Singh
Date: Wednesday, September 3, 2008 - 12:31 am

Exactly the next step I was thinking about (since we already use it, in the
current form). Thanks for the suggestion!

-- 
	Balbir
--

From: Balbir Singh
Date: Monday, September 8, 2008 - 8:28 am

Sorry for the delay in sending out the new patch, I am traveling and
thus a little less responsive. Here is the update patch


v3...v2
1. Convert flags to unsigned long
2. Move page_cgroup->lock to a bit spin lock in flags

v2...v1

1. Fix a small bug, don't call radix_tree_preload_end(), if preload fails

This is a rewrite of a patch I had written long back to remove struct page
(I shared the patches with Kamezawa, but never posted them anywhere else).
I spent the weekend, cleaning them up for 2.6.27-rc5-mmotm (29 Aug 2008).

I've tested the patches on an x86_64 box, I've run a simple test running
under the memory control group and the same test running concurrently under
two different groups (and creating pressure within their groups). I've also
compiled the patch with CGROUP_MEM_RES_CTLR turned off.

Advantages of the patch

1. It removes the extra pointer in struct page

Disadvantages

1. Radix tree lookup is not an O(1) operation, once the page is known
   getting to the page_cgroup (pc) is a little more expensive now.

This is an initial RFC for comments

TODOs

1. Test the page migration changes

Performance

In a unixbench run, these patches had a performance impact of 2% (slowdown).

Comments/Reviews?

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |   23 +++++
 include/linux/mm_types.h   |    4 
 mm/memcontrol.c            |  187 +++++++++++++++++++++++++++++----------------
 3 files changed, 144 insertions(+), 70 deletions(-)

diff -puN mm/memcontrol.c~memcg_move_to_radix_tree mm/memcontrol.c
--- linux-2.6.27-rc5/mm/memcontrol.c~memcg_move_to_radix_tree	2008-09-04 15:45:54.000000000 +0530
+++ linux-2.6.27-rc5-balbir/mm/memcontrol.c	2008-09-08 20:15:30.000000000 +0530
@@ -24,6 +24,7 @@
 #include <linux/smp.h>
 #include <linux/page-flags.h>
 #include <linux/backing-dev.h>
+#include <linux/radix-tree.h>
 #include <linux/bit_spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
@@ -40,6 ...
From: KAMEZAWA Hiroyuki
Date: Monday, September 8, 2008 - 8:57 pm

On Mon, 8 Sep 2008 20:58:10 +0530
Hmm.. I've considered this approach for a while and my answer is that
this is not what you really want.

Because you just moves the placement of pointer from memmap to
radix_tree both in GFP_KERNEL, total kernel memory usage is not changed.
So, at least, you have to add some address calculation (as I did in March)
to getting address of page_cgroup. But page_cgroup itself consumes 32bytes
per page. Then.....

My proposal to 32bit system is following 
 - remove page_cgroup completely.
   - As a result, there is no per-cgroup lru. But it will not be bad
     bacause the number of cgroups and pages are not big.
     just a trade-off between kernel-memory-space v.s. speed.
   - Removing page_cgroup and just remember address of mem_cgroup per page.

How do you think ?

Thanks,

--

From: Nick Piggin
Date: Monday, September 8, 2008 - 8:58 pm

Just keep in mind that an important point is to make it more attractive
to configure cgroup into the kernel, but have it disabled or unused at
runtime.
--

From: KAMEZAWA Hiroyuki
Date: Monday, September 8, 2008 - 9:53 pm

On Tue, 9 Sep 2008 13:58:27 +1000

Hmm..kicking out 4bytes per 4096bytes if disabled ?

maybe a routine like SPARSEMEM is a choice.

Following is pointer pre-allocation. (just pointer, not page_cgroup itself)
==
#define PCG_SECTION_SHIFT	(10)
#define PCG_SECTION_SIZE	(1 << PCG_SECTION_SHIFT)

struct pcg_section {
	struct page_cgroup **map[PCG_SECTION_SHIFT]; //array of pointer.
};

struct page_cgroup *get_page_cgroup(unsigned long pfn)
{
	struct pcg_section *sec;
	sec = pcg_section[(pfn >> PCG_SECTION_SHIFT)];
	return *sec->page_cgroup[(pfn & ((1 << PCG_SECTTION_SHIFT) - 1];
}
==
If we go extreme, we can use kmap_atomic() for pointer array.

Overhead of pointer-walk is not so bad, maybe.

For 64bit systems, we can find a way like SPARSEMEM_VMEMMAP.

Thanks,
-Kame




Thanks,
-Kame

--

From: Nick Piggin
Date: Monday, September 8, 2008 - 10:00 pm

Yeah of course. 4 or 8 bytes. Everything adds up. There is nothing special
about cgroups that says it is allowed to use fields in struct page where
others cannot. Put it in perspective: we try very hard not to allocate new

Yes I too think that would be the ideal way to go to get the best of
performance in the enabled case. However Balbir I believe is interested
in memory savings if not all pages have cgroups... I don't know, I don't
care so much about the "enabled" case, so I'll leave you two to fight it
out :)
--

From: KAMEZAWA Hiroyuki
Date: Monday, September 8, 2008 - 10:12 pm

On Tue, 9 Sep 2008 15:00:10 +1000
I'll add a new patch on my set.

Balbir, are you ok to CONFIG_CGROUP_MEM_RES_CTLR depends on CONFIG_SPARSEMEM ?
I thinks SPARSEMEM(SPARSEMEM_VMEMMAP) is widely used in various archs now.

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, September 9, 2008 - 5:24 am

Can't we make it more generic. I was thinking of allocating memory for each node
for page_cgroups (of the size of spanned_pages) at initialization time. I've not
yet prototyped the idea. BTW, even with your approach I fail to see why we need
to add a dependency on CONFIG_SPARSEMEM (but again it is 4:30 in the morning and
I might be missing the obvious)

-- 
	Balbir
--

From: Nick Piggin
Date: Tuesday, September 9, 2008 - 5:28 am

I guess it would be just a matter of coding up the implementation for
each model you want to support. In some cases, you might lose memory
(eg in the case of flatmem where you have holes in ram), but in those
cases you lose memory from the struct pages anyway so I wouldn't worry
too much.

I think it would be reasonable to provide an implementation for flatmem
as well (which AFAIK is the other non-deprecated memory model). It
should not be hard to code AFAIKS.
--

From: kamezawa.hiroyu
Date: Tuesday, September 9, 2008 - 5:30 am

Doesn't have big issue without CONFIG_SPARSEMEM, maybe.
Sorry for my confusion.

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, September 9, 2008 - 5:34 am

No problem, I am glad to know that we are not limited to a particular model.

-- 
	Balbir
--

From: Balbir Singh
Date: Tuesday, September 9, 2008 - 6:20 pm

OK, here is approach #2, it works for me and gives me really good
performance (surpassing even the current memory controller). I am
seeing almost a 7% increase

Caveats

1. Uses more memory (since it allocates memory for each node based on
   spanned_pages. Ignores holes, so might not be the most efficient,
   but it is a tradeoff of complexity versus space. I propose refining it
   as we go along.
2. Does not currently handle alloc_bootmem failure
3. Needs lots of testing/tuning and polishing

I've tested it on an x86_64 box with 4G of memory

Again, this is an early RFC patch, please review test. 

Comments/Reviews?

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |   32 ++++++
 include/linux/mm_types.h   |    4 
 mm/memcontrol.c            |  212 +++++++++++++++++++++++++++------------------
 mm/page_alloc.c            |   10 --
 4 files changed, 162 insertions(+), 96 deletions(-)

diff -puN mm/memcontrol.c~memcg_move_to_radix_tree mm/memcontrol.c
--- linux-2.6.27-rc5/mm/memcontrol.c~memcg_move_to_radix_tree	2008-09-04 03:15:54.000000000 -0700
+++ linux-2.6.27-rc5-balbir/mm/memcontrol.c	2008-09-09 17:56:54.000000000 -0700
@@ -18,6 +18,7 @@
  */
 
 #include <linux/res_counter.h>
+#include <linux/bootmem.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
 #include <linux/mm.h>
@@ -37,9 +38,10 @@
 #include <asm/uaccess.h>
 
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
-static struct kmem_cache *page_cgroup_cache __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
+static struct page_cgroup *pcg_map[MAX_NUMNODES];
+
 /*
  * Statistics for memory cgroup.
  */
@@ -137,20 +139,6 @@ struct mem_cgroup {
 static struct mem_cgroup init_mem_cgroup;
 
 /*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock.  We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin).  But since
- * bit_spin_lock doesn't actually set ...
From: KAMEZAWA Hiroyuki
Date: Tuesday, September 9, 2008 - 6:49 pm

On Tue, 9 Sep 2008 18:20:48 -0700
This number is from pre-allcation, maybe.
^^^^^

1. This is nonsense...do you know the memory map of IBM's (maybe ppc) machine ?
Node's memory are splitted into several pieces and not ordered by node number.
example)
   Node 0 | Node 1 | Node 2 | Node 1 | Node 2 | 

This seems special but when I helped SPARSEMEM and MEMORY_HOTPLUG,
I saw mannnny kinds of memory map. As you wrote, this should be re-designed.

2. If pre-allocating all is ok, I stop my work. Mine is of-no-use.
But you have to know that by pre-allocationg, we can't use avoid-lru-lock
by batch like page_vec technique. We can't delay uncharge because a page

Can this happen ? Our direction should be

Is this lock/unlock_page_cgroup is for what kind of race ?

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, September 9, 2008 - 7:11 pm

Thanks, so that means that we cannot before hand predict the size of pcg_map[n],

One of the goals of this patch is refinement, it is a starting piece, something
I shared very early. I am not asking you to stop your work. While I think
pre-allocating is not the best way to do this, the trade off is the sparseness
of the machine. I don't mind doing it in other ways, but we'll still need to do

Care to elaborate on this? Why not? If the page is reused, we act on the batch


We use page_cgroup_zoneinfo(pc), we want to make sure pc does not disappear or

Yes, it can.. several people trying to map the same page at once. Can't we race

for setting pc->flags and for setting pc->page and pc->mem_cgroup.


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 9, 2008 - 7:35 pm

On Tue, 09 Sep 2008 19:11:09 -0700
Or use some "allocate a chunk of page_cgroup for a chunk of continuous pages".
Hmm, maybe clarifying trade-off and comapring them is the first step.
And touch vec on other cpu ? The reason "vec" is fast is because it's per-cpu.
If we want to use "delaying", we'll have to make page_cgroup unused and not-on-lru

Hmm...there is a confustion, maybe.

The page_cgroup is now 1:1 to struct page. Then, we can guarantee that

- There is no race between charge v.s. uncharge.

Only problem is force_empty. (But it's difficult..)

This means pc->mem_cgroup is safe here. 
And pc->flags should be atomic flags, anyway. I believe we have to record
"Dirty bit" at el, later.

Thanks,
-Kame

--

From: Nick Piggin
Date: Wednesday, September 10, 2008 - 1:44 pm

It seems really nice to me -- we get the best of both worlds, less overhead
for those who don't enable the memory controller, and even better
performance for those who do.

Are you expecting many users to want to turn this on and off at runtime?
I wouldn't expect so, but I don't know enough about them.
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 10, 2008 - 4:03 am

On Thu, 11 Sep 2008 06:44:37 +1000

No trobles for me for allocating-all-at-boot policy.
My small concern is
  - wasting page_cgroup for hugepage area.
There is no runtime switch. only at boot.

Thanks,
-Kame

--

From: Nick Piggin
Date: Wednesday, September 10, 2008 - 2:02 pm

In those cases you still waste the struct page area too. I realise that
isn't a good way to justify even more wastage. But I guess it is
relatively low. At least, I would think the users would be more happy to

Fine

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 10, 2008 - 4:27 am

On Thu, 11 Sep 2008 07:02:44 +1000
I guess the increase mostly because we can completely avoid kmalloc/kfree slow path.

Balbir, how about fix our way to allocate-all-at-boot-policy ?
If you say yes, I think I can help you and I'll find usable part from my garbage.

Following is lockless+remove-page-cgroup-pointer-from-page-struct patch's result.

rc5-mm1
==
Execl Throughput                           3006.5 lps   (29.8 secs, 3 samples)
C Compiler Throughput                      1006.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               4863.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                943.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               482.7 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         124804.9 lpm   (30.0 secs, 3 samples)

lockless
==
Execl Throughput                           3035.5 lps   (29.6 secs, 3 samples)
C Compiler Throughput                      1010.3 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               4881.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                947.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               485.0 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         125437.9 lpm   (30.0 secs, 3 samples)

lockless + remove page cgroup pointer (my version).
==
Execl Throughput                           3021.1 lps   (29.5 secs, 3 samples)
C Compiler Throughput                       980.3 lpm   (60.0 secs, 3 samples)
Shell Scripts (1 concurrent)               4600.0 lpm   (60.0 secs, 3 samples)
Shell Scripts (8 concurrent)                915.7 lpm   (60.0 secs, 3 samples)
Shell Scripts (16 concurrent)               468.3 lpm   (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places         124909.1 lpm   (30.0 secs, 3 samples)

Oh,yes. siginificant slow down. I'm glad to kick this patch out to trash box.

Thanks,
-Kame

--

From: Balbir Singh
Date: Wednesday, September 10, 2008 - 7:34 am

I am perfectly fine with it, I'll need your expertise to get the


-- 
	Thanks,
	Balbir
--

From: Dave Hansen
Date: Wednesday, September 10, 2008 - 3:21 pm

This will really suck for sparse memory machines.  Imagine a machine
with 1GB of memory at 0x0 and another 1GB of memory at 1TB up in the
address space.

You also need to consider how it works with memory hotplug and how
you're going to grow it at runtime.

Oh, and doesn't alloc_bootmem() panic() if it fails internally anyway?

I need to look at your other approach. :)

-- Dave

--

From: David Miller
Date: Wednesday, September 10, 2008 - 3:31 pm

From: Dave Hansen <dave@linux.vnet.ibm.com>

You just described the workstation I am typing this from :-)
--

From: Balbir Singh
Date: Wednesday, September 10, 2008 - 3:36 pm

I would hate to re-implement the entire sparsemem code :(
Kame did suggest making the memory controller depend on sparsemem (to hook in

Yes, true. This is not the final version, a very very early version that I

We'll need some slab_is_available() sort of checks that sparse.c uses and also
deal with memory hotplug add and remove.

-- 
	Balbir
--

From: Dave Hansen
Date: Wednesday, September 10, 2008 - 3:56 pm

Yeah, you could just make another mem_section member.  Or, you could
work to abstract the sparsemem code so that other people can use it, or
maybe make it more dynamic so we can have multiple pfn->object lookups
in parallel.  Adding the struct member is obviously easier.

-- Dave

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 10, 2008 - 6:35 pm

On Wed, 10 Sep 2008 15:56:48 -0700
Don't worry. I'll care sparse memory map and hotplug.
But whether making this depends on SPARSEMEM or not is not fixed yet.
I'll try generic one, at first. If it's dirty, start discussion about SPARSEMEM.

(Honestly, I love sparsemem than others ;)

Thanks,
-Kame

--

From: Balbir Singh
Date: Wednesday, September 10, 2008 - 6:47 pm

My concern is that if we depend on sparsemem, then we force distros to turn on
sparsemem (which might be the default, but not on all architectures), we might
end up losing those architectures (w.r.t. turning on the memory controller)
where sparsemem is not the default on the distro.


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 10, 2008 - 6:56 pm

On Wed, 10 Sep 2008 18:47:25 -0700
Yes. I share your concern. Then, I'll try not-on-sparsemem version, at first.

Thanks,
-Kame

--

From: Balbir Singh
Date: Wednesday, September 17, 2008 - 4:28 pm

Before trying the sparsemem approach, I tried a radix tree per node,
per zone and I seemed to actually get some performance
improvement.(1.5% (noise maybe))

But please do see and review (tested on my x86_64 box with unixbench
and some other simple tests)

v4..v3
1. Use a radix tree per node, per zone

v3...v2
1. Convert flags to unsigned long
2. Move page_cgroup->lock to a bit spin lock in flags

v2...v1

1. Fix a small bug, don't call radix_tree_preload_end(), if preload fails

This is a rewrite of a patch I had written long back to remove struct page
(I shared the patches with Kamezawa, but never posted them anywhere else).
I spent the weekend, cleaning them up for 2.6.27-rc5-mmotm (29 Aug 2008).

I've tested the patches on an x86_64 box, I've run a simple test running
under the memory control group and the same test running concurrently under
two different groups (and creating pressure within their groups).

Advantages of the patch

1. It removes the extra pointer in struct page

Disadvantages

1. Radix tree lookup is not an O(1) operation, once the page is known
   getting to the page_cgroup (pc) is a little more expensive now.

This is an initial RFC (version 3) for comments

TODOs

1. Test the page migration changes

Performance

In a unixbench run, these patches had a performance impact of 2% (slowdown).

Comments/Reviews?

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/memcontrol.h |   23 +++
 include/linux/mm_types.h   |    4 
 mm/memcontrol.c            |  264 ++++++++++++++++++++++++++++++++++-----------
 3 files changed, 221 insertions(+), 70 deletions(-)

diff -puN mm/memcontrol.c~memcg_move_to_radix_tree mm/memcontrol.c
--- linux-2.6.27-rc5/mm/memcontrol.c~memcg_move_to_radix_tree	2008-09-16 17:10:15.000000000 -0700
+++ linux-2.6.27-rc5-balbir/mm/memcontrol.c	2008-09-17 16:15:01.000000000 -0700
@@ -24,6 +24,7 @@
 #include <linux/smp.h>
 #include <linux/page-flags.h>
 #include ...
From: Andrew Morton
Date: Wednesday, September 17, 2008 - 6:40 pm

Why are we doing this?  I can guess, but I'd rather not have to.

a) It's slower.

b) It uses even more memory worst-case.

c) It uses less memory best-case.

someone somewhere decided that (Aa + Bb) / Cc < 1.0.  What are the values
of A, B and C and where did they come from? ;)

(IOW, your changelog is in the category "sucky", along with 90% of the others)
--

From: Balbir Singh
Date: Wednesday, September 17, 2008 - 8:57 pm

Yes, I agree, to be honest we discussed the reasons on the mailing list and
those should go to the changelog. I'll do that in the next version of the
patches. These are early RFC patches, but the changelog does suck.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 17, 2008 - 10:00 pm

On Wed, 17 Sep 2008 20:57:54 -0700
IIRC, (Aa + Bb) / Cc < 1.0 discussion was following.

Because we have to maintain pointer to page_cgroup in radix-tree (ZONE_NORMAL)

1. memory usage will increase when memory cgroup is enabled.
   The amount memory usage increase just depends on the height of radix-tree.

2. memory usage will decrease when memory cgroup is disabled.
   This saves 4bytes per 4096bytes. (on x86-32)

Thanks,
-Kame  

--

From: Hirokazu Takahashi
Date: Wednesday, September 17, 2008 - 9:26 pm

I think this design is just temporary and the goal is to pre-allocate
all page_cgroups at boot time if it isn't disabled.

But I think each memory model type should have its own way of managing
its page_cgroup arrays as doing for its struct page arrays.
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 17, 2008 - 9:50 pm

On Thu, 18 Sep 2008 13:26:13 +0900 (JST)
My patch adds an interface. Then...
FLATMEM support will be very easy.
I'll ignore DISCONTIGMEM and SPARSEMEM (they will use my 'hash')
SPARSEMEM_VMEMMAP support will took some amount of time. It will need
per-arch patches.

Thanks,
-Kame

--

From: Hirokazu Takahashi
Date: Wednesday, September 17, 2008 - 11:13 pm

What part of this do you think is the problem to implement it in the
straight way for this model?
I think it won't be difficult to implement it since each pgdat can have
its page_cgroup array, which can care about holes in the node as well as

Yes, each of ia64, powerpc and x86_64 use this memory model.


--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 17, 2008 - 9:43 pm

On Wed, 17 Sep 2008 18:40:08 -0700

Balbir, don't you like pre-allocate-page-cgroup-at-boot at all ?
I don't like radix-tree for objects which can spread to very vast/sparse area ;)

BTW, I already have lazy-lru-by-pagevec protocol on my patch(hash version) and
seems to work well. I'm now testing it and will post today if I'm enough lucky.

Thanks,
-Kame

--

From: Balbir Singh
Date: Wednesday, September 17, 2008 - 9:58 pm

I tried one version, but before trying a pre-allocation version, I wanted to
spread out the radix-tree and try and the results seemed quite impressive. We
can still do pre-allocation, but it gets more complicated as we start supporting
all memory models. I do have a design on paper, but it is much more complex than

cool! Please do post what numbers you see as well. I would appreciate if you can
try this version and see what sort of performance issues you see.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 17, 2008 - 10:15 pm

On Wed, 17 Sep 2008 21:58:08 -0700
I'll see what happens. your patch is agasint rc6-mmtom ?

BTW, my 8cpu/2socket/Xeon 3.16GHz host is back now. Maybe good for seeing performance.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 18, 2008 - 4:01 am

On Wed, 17 Sep 2008 21:58:08 -0700

This is the result on 8cpu box. I think I have to reduce footprint of fastpath of
my patch ;)

Test result of your patch is (2).
==
Xeon 8cpu/2socket/1-node equips 48GB of memory.
run shell/exec benchmark 3 times just after boot.

lps ... loops per sec.
lpm ... loops per min.
(*) Shell tests somtimes fail because of division by zero, etc...

(1). rc6-mm1(2008/9/13 version)
==
Run                                       == 1st ==  == 2nd ==  ==3rd==
Execl Throughput                           2425.2     2534.5     2465.8  (lps)
C Compiler Throughput                      1438.3     1476.3     1459.1  (lpm)
Shell Scripts (1 concurrent)               9360.3     9368.3     9360.0  (lpm)
Shell Scripts (8 concurrent)               3868.0     3870.0     3868.0  (lpm)
Shell Scripts (16 concurrent)              2207.0     2204.0     2201.0  (lpm)
Dc: sqrt(2) to 99 decimal places         101644.3   102184.5   102118.5  (lpm)

(2). (1) +remove-page-cgroup-pointer-v3 (radix-tree + dynamic allocation)
==
Run                                       == 1st ==  == 2nd ==  == 3rd ==
Execl Throughput                           2514.1      2548.9    2648.7  (lps)
C Compiler Throughput                      1353.9      1324.6    1324.7  (lpm)
Shell Scripts (1 concurrent)               8866.7      8871.0    8856.0  (lpm)
Shell Scripts (8 concurrent)               3674.3      3680.0    3677.7  (lpm)
Shell Scripts (16 concurrent)              failed.     failed    2094.3  (lpm)
Dc: sqrt(2) to 99 decimal places          98837.0     98206.9   98250.6  (lpm)

(3). (1) + pre-allocation by "vmalloc" + hash + misc(atomic flags etc..)
==
Run                                       == 1st ==  == 2nd ==  == 3rd ==
Execl Throughput                           2385.4      2579.2    2361.5  (lps)
C Compiler Throughput                      1424.3      1436.3    1430.6  (lpm)
Shell Scripts (1 concurrent)               9222.0      9234.0    9246.7  (lpm)
Shell Scripts (8 concurrent)  ...
From: Balbir Singh
Date: Thursday, September 18, 2008 - 4:56 pm

Thanks, Kame! I'll look at the lazy lru patches and see if I can find anything.
Do you have a unified patch anywhere, I seem to get confused with the patches, I
see 10/9, 11/9 and 12/9. I'll do some analysis when I find some free time, I am
currently at plumbers conference.

-- 
	Balbir
--

From: Nick Piggin
Date: Wednesday, September 10, 2008 - 3:38 pm

I think it should try to hook into the physical memory model
code. I thought it was going to do that but didn't look at the
details...
--

From: Balbir Singh
Date: Monday, September 8, 2008 - 9:18 pm

Agreed, but we do reduce the sizeof(struct page) without adding on to

What address calculation do we need, sorry I don't recollect it.


32 bit systems with PAE can support quite a lot of memory, so I am not sure I


I don't like the approach.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, September 8, 2008 - 9:55 pm

On Mon, 08 Sep 2008 21:18:37 -0700
   base_address = base_addrees_of_page_group_chunk_of_pfn.
   base_address + offset_of_pfn_from_base_pfn * sizeof (struct page_cgroup).

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 9, 2008 - 12:37 am

On Mon, 08 Sep 2008 21:18:37 -0700
Because it's depends on CONFIG.

But ok, I'll recall my patches from grave. Maybe it can be a hint for you.

Thanks,
-Kame

--

From: Balbir Singh
Date: Monday, September 1, 2008 - 12:17 am

I agree with the last point, but then when I see
http://www.mail-archive.com/git-commits-head@vger.kernel.org/msg41546.html
I am tempted to reduce the size. I did ask Andi about on which x86_64 machines
are we exceeding the cache size, but heard nothing back.

The questions to answer are

1. On 64 bit systems, would be OK making the size of struct page 64 bytes

Yes, I thought about it, but dropped it for two reasons

1. We don't want a static overhead (irrespective of the memory used and user pages)
2. Like you said, it means fitting into every memory model to get it right.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Sunday, August 31, 2008 - 7:39 pm

On Sun, 31 Aug 2008 23:17:56 +0530
BTW, how deep this radix-tree on 4GB/32GB/64GB/256GB machine ?

Thanks,
-Kame

--

From: Balbir Singh
Date: Sunday, August 31, 2008 - 8:42 pm

Good Question,

My ball-park estimates are

number of pfns = RADIX_TREE_TAG_LONGS/(RADIX_TREE_TAG_LONGS - 1) *
(RADIX_TREE_LONGS^n - 1)

and "n" is the number we are looking for.

For a 64 bit system with 256 GB and 4KB page size, I've calculated it to be 9
levels deep.

-- 
	Balbir
--

From: Pavel Emelyanov
Date: Monday, September 1, 2008 - 2:03 am

And besides, we also have a global lock, that protects even lookup
--

From: Balbir Singh
Date: Monday, September 1, 2008 - 2:17 am

Sorry, not sure I understand. The lookup is done under RCU. Updates are done
using the global lock. It should not be hard to make the radix tree per node
later (as an iterative refinement).

-- 
	Balbir
--

From: Pavel Emelyanov
Date: Monday, September 1, 2008 - 2:43 am

OOPS, indeed. Sorry for the confusion, then. :)
--

From: Peter Zijlstra
Date: Monday, September 1, 2008 - 6:19 am

Or you could have a look at the concurrent radix tree, esp for dense
trees it can save a lot on lock bouncing.

Latest code available at:

http://programming.kicks-ass.net/kernel-patches/concurrent-pagecache/27-rc3/

--

From: Balbir Singh
Date: Tuesday, September 2, 2008 - 12:35 am

I'll try them later. I see a lot of slab allocations/frees (expected). I am
trying to experiment with alternatives to reduce them.

-- 
	Balbir
--

Previous thread: re : ACPI PnP on Intel MU440EX by matthieu castet on Sunday, August 31, 2008 - 10:39 am. (5 messages)

Next thread: [BUG] x86 kenel won't boot under Virtual PC by David Sanders on Sunday, August 31, 2008 - 11:22 am. (21 messages)