Re: [RFC][BUGFIX][PATCH 2/2] memcg: fix file mapped underflow at migration (v3)

Previous thread: [PATCH] Kconfig: Make config Filter access to /dev/mem default y by wzt.wzt on Monday, April 12, 2010 - 7:52 pm. (13 messages)

Next thread: Compat-wireless for 2.6.34 - 802.11 and bluetooth backported by Luis R. Rodriguez on Monday, April 12, 2010 - 9:51 pm. (2 messages)
From: Daisuke Nishimura
Date: Monday, April 12, 2010 - 9:42 pm

Hi.

When I was testing page migration, I found underflow problem of "mapped_file" field
in memory.stat. This is a fix for the problem.

This patch is based on mmotm-2010-04-05-16-09, and IIUC it conflicts with Mel's
compaction patches, so I send it as RFC for now. After next mmotm, which will
include those patches, I'll update and resend this patch.

===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

page_add_file_rmap(), which can be called from remove_migration_ptes(), is
assumed to increment memcg's stat of mapped file. But on success of page
migration, the newpage(mapped file) has not been charged yet, so the stat will
not be incremented. This behavior leads to underflow of memcg's stat because
when the newpage is unmapped afterwards, page_remove_rmap() decrements the stat.
This problem doesn't happen on failure path of page migration, because the old
page(mapped file) hasn't been uncharge at the point of remove_migration_ptes().
This patch fixes this problem by calling commit_charge(mem_cgroup_end_migration)
before remove_migration_ptes().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 mm/migrate.c |   19 ++++++++++++++-----
 1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 5938db5..915c35e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -485,7 +485,8 @@ static int fallback_migrate_page(struct address_space *mapping,
  *   < 0 - error code
  *  == 0 - success
  */
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+						struct mem_cgroup *mem)
 {
 	struct address_space *mapping;
 	int rc;
@@ -520,9 +521,16 @@ static int move_to_new_page(struct page *newpage, struct page *page)
 	else
 		rc = fallback_migrate_page(mapping, newpage, page);
 
-	if (!rc)
+	if (!rc) {
+		/*
+		 * On success of page migration, the newpage has not been
+		 * charged yet, so we must call end_migration() ...
From: KAMEZAWA Hiroyuki
Date: Monday, April 12, 2010 - 11:14 pm

On Tue, 13 Apr 2010 13:42:07 +0900

Nice catch. but...I want to make all kind of complicated things under
prepare/end migration. (And I want to avoid changes in migrate.c...)

Considering some racy condistions, I wonder memcg_update_file_mapped() itself
still need fixes..

So, how about this ? We already added FILE_MAPPED flags, then, make use of it.
==


At migrating mapped file, events happens in following sequence.

 1. allocate a new page.
 2. get memcg of an old page.
 3. charge ageinst new page, before migration. But at this point
    no changes to page_cgroup, no commit-charge.
 4. page migration replaces radix-tree, old-page and new-page.
 5. page migration remaps the new page if the old page was mapped.
 6. memcg commits the charge for newpage.

Because "commit" happens after page-remap, we lose file_mapped
accounting information at migration.

This patch fixes it by accounting file_mapped information at
commiting charge.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Index: mmotm-temp/mm/memcontrol.c
===================================================================
--- mmotm-temp.orig/mm/memcontrol.c
+++ mmotm-temp/mm/memcontrol.c
@@ -1435,11 +1435,13 @@ void mem_cgroup_update_file_mapped(struc
 
 	/*
 	 * Preemption is already disabled. We can use __this_cpu_xxx
+	 * We have no lock per page at inc/dec mapcount of pages. We have to do
+	 * check by ourselves under lock_page_cgroup().
 	 */
-	if (val > 0) {
+	if (val > 0 && !PageCgroupFileMapped(pc)) {
 		__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		SetPageCgroupFileMapped(pc);
-	} else {
+	} else if (PageCgroupFileMapped(pc)) {
 		__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		ClearPageCgroupFileMapped(pc);
 	}
@@ -2563,6 +2565,15 @@ void mem_cgroup_end_migration(struct mem
 	 ...
From: Daisuke Nishimura
Date: Tuesday, April 13, 2010 - 5:54 pm

hmm, I want to call mem_cgroup_update_file_mapped() only where we update
Adding likely() is better ? IIUC, these conditions are usually met except for
We cannot rely on page lock, because when we succeeded in page migration,
"target" = "newpage" has already unlocked in move_to_new_page(). So the "target"
can be removed from the radix-tree theoretically(it's not related to this
underflow problem, though).
Shouldn't we call lock_page(target) and check "if (!target->mapping)" to handle
this case(maybe in another patch) ?

Thanks,
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 13, 2010 - 6:03 pm

On Wed, 14 Apr 2010 09:54:08 +0900

Sounds reasonable. I think about that.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 13, 2010 - 6:40 pm

On Wed, 14 Apr 2010 10:03:08 +0900

Ah, PageCgroupUsed() is already checked under lock_page_cgroup(). It's enough.

Thanks,
-Kame


--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 13, 2010 - 6:56 pm

On Wed, 14 Apr 2010 10:40:10 +0900

Thinking again....new page is unlocked here. It means the new page may be
removed from radix-tree before commit_charge().

Haha, it seems totally wrong. please wait..

Thanks,
-Kame






--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 13, 2010 - 8:06 pm

On Wed, 14 Apr 2010 10:56:08 +0900

This is my answer. How do you think ?
I think this logic is simple. (But yes, we should check corner cases.)

==

At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
And FILE_MAPPED of migrated file cache is not properly updated, now.

At migrating mapped file, events happens in following sequence.

 1. allocate a new page.
 2. get memcg of an old page.
 3. charge ageinst new page, before migration. But at this point
    no changes to page_cgroup, no commit-charge.
 4. page migration replaces radix-tree, old-page and new-page.
 5. page migration remaps the new page if the old page was mapped.
 6. memcg commits the charge for newpage.

Because "commit" happens after page-remap, we lose file_mapped
accounting information at migration.

For fixing all, this changes parepare/end migration.
New migration sequence with memcg is:

 1. allocate a new page.
 2. charge against a new page onto old page's memcg. (here, the new page
    is marked as PageCgroupUsed.)
 3. page migration replaces radix-tree, page table, etc...
 4. At remapping, FILE_MAPPED will be properly updated. (because newpage is marked as
    USED, already)
 5. If anonymous page is freed before remap, check it and fixup accounting.
 

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    6 +-
 mm/memcontrol.c            |   95 ++++++++++++++++++++++++---------------------
 mm/migrate.c               |    2 
 3 files changed, 56 insertions(+), 47 deletions(-)

Index: mmotm-temp/mm/memcontrol.c
===================================================================
--- mmotm-temp.orig/mm/memcontrol.c
+++ mmotm-temp/mm/memcontrol.c
@@ -2501,10 +2501,12 @@ static inline int mem_cgroup_move_swap_a
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page ...
From: Daisuke Nishimura
Date: Tuesday, April 13, 2010 - 10:31 pm

Great! I like this change very much.

it should be:

	*ptr = mem;
	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
	css_put(&mem->css);

hmm... using mem_cgroup_uncharge_page() would be enough, but I think it doesn't
show what we want: we must uncharge "unused" by all means in PageCgroupUsed case,
and I feel it strange a bit to uncharge "unused" by mem_cgroup_uncharge_page(),
if it *was* a cache page.
So I think __mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE)
would be better, otherwise we need more comments to explain why
And if newpage was removed from radix-tree after unlock_page(),
the context which removed it from radix-tree uncharges it properly, because
it is charged at prepare_migration.

mem_cgroup_uncharge_page() does the same check :)


Thanks,
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 13, 2010 - 10:40 pm

On Wed, 14 Apr 2010 14:31:32 +0900

Ok. I'll fix.

Thanks,
-Kame

--

From: Daisuke Nishimura
Date: Wednesday, April 14, 2010 - 7:22 pm

Considering more, we'd better to check PageAnon() at least not to call
mem_cgroup_uncharge_page() for cache page.


Thanks,
Daisuke Nishimura.
--

From: Balbir Singh
Date: Monday, April 12, 2010 - 11:45 pm

Why do we need to pass mem, won't try_get_mem_cgroup_from_page() help?


-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, April 14, 2010 - 8:05 pm

I'd like to wait until next mmotm comes out. (So, [RFC]) I'll rebase
This patch is based on
 mmotm-2010/04/05
 +
 mm-migration-take-a-reference-to-the-anon_vma-before-migrating.patch
 mm-migration-do-not-try-to-migrate-unmapped-anonymous-pages.patch
 mm-share-the-anon_vma-ref-counts-between-ksm-and-page-migration.patch
 mm-migration-allow-the-migration-of-pageswapcache-pages.patch
 memcg-fix-prepare-migration.patch

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This is an additonal fix to memcg-fix-prepare-migration.patch

Now, try_charge can bypass charge if TIF_MEMDIE at el are marked on the caller.
In this case, the charge is bypassed. This makes accounts corrupted.
(PageCgroup will be marked as PCG_USED even if bypassed, and css->refcnt
 can leak.)

This patch clears passed "*memcg" in bypass route.

Because usual page allocater passes memcg=NULL, this patch only affects
some special case as
  - move account
  - migration
  - swapin.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: mmotm-temp/mm/memcontrol.c
===================================================================
--- mmotm-temp.orig/mm/memcontrol.c
+++ mmotm-temp/mm/memcontrol.c
@@ -1606,8 +1606,12 @@ static int __mem_cgroup_try_charge(struc
 	 * MEMDIE process.
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE)
-		     || fatal_signal_pending(current)))
+		     || fatal_signal_pending(current))) {
+		/* Showing we skipped charge */
+		if (memcg)
+			*memcg = NULL;
 		goto bypass;
+	}
 
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
@@ -2523,7 +2527,6 @@ int mem_cgroup_prepare_migration(struct 
 		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
 		css_put(&mem->css);
 	}
-	*ptr = mem;
 	return ret;
 }
 


  

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, April 14, 2010 - 8:06 pm

At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
And FILE_MAPPED of migrated file cache is not properly updated, now.

At migrating mapped file, events happens in following sequence.

 1. allocate a new page.
 2. get memcg of an old page.
 3. charge ageinst new page, before migration. But at this point
    no changes to page_cgroup, no commit-charge.
 4. page migration replaces radix-tree, old-page and new-page.
 5. page migration remaps the new page if the old page was mapped.
 6. memcg commits the charge for newpage.

Because "commit" happens after page-remap, we lose file_mapped
accounting information at migration.

For fixing all, this changes parepare/end migration.
New migration sequence with memcg is:

 1. allocate a new page.
 2. charge against a new page onto old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
 3. page migration replaces radix-tree, page table, etc...
 4. At remapping, FILE_MAPPED will be properly updated.

Changelog: 2010/04/14
 - updated onto the latest mmotm + page-compaction, etc...
 - fixed __try_charge() bypass case.
 - use CHARGE_TYPE_FORCE for uncharging an unused page.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    6 +-
 mm/memcontrol.c            |   96 ++++++++++++++++++++++++---------------------
 mm/migrate.c               |    2 
 3 files changed, 58 insertions(+), 46 deletions(-)

Index: mmotm-temp/mm/memcontrol.c
===================================================================
--- mmotm-temp.orig/mm/memcontrol.c
+++ mmotm-temp/mm/memcontrol.c
@@ -2505,10 +2505,12 @@ static inline int mem_cgroup_move_swap_a
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
  */
-int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
+int ...
From: KAMEZAWA Hiroyuki
Date: Friday, April 16, 2010 - 3:31 am

This is still RFC. 
I hope this will fix memcg v.s. page-migraion war finally ;(
But we have to be careful...
==

At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
(But it wasn't big problem..the real issue is mapcount handling.)

This has small race and we need to fix this race. By this,
FILE_MAPPED of migrated file cache is not properly updated, now.

This patch is for fixing the race by changing algorithm.

At migrating mapped file, events happens in following sequence.

 1. allocate a new page.
 2. get memcg of an old page.
 3. charge ageinst a new page before migration. But at this point,
    no changes to new page's page_cgroup, no commit-charge.
 4. page migration replaces radix-tree, old-page and new-page.
 5. page migration remaps the new page if the old page was mapped.
 6. Here, the new page is unlocked.
 7. memcg commits the charge for newpage, Mark page cgroup as USED.

Because "commit" happens after page-remap, we cannot count FILE_MAPPED
at "5", we can lose file_mapped accounting information within this
small race. FILE_MAPPED is updated only when mapcount changes 0->1.
So, if we don't catch this 0->1 event, we can underflow FILE_MAPPED
at 1->0 event.

We may be able to avoid underflow by some small technique or new hook but
we should catch mapcount 0->1 event fundamentaly. To catch this, we have to
make page_cgroup of new page as "USED".

BTW, historically, above implemntation comes from migration-failure
of anonymous page. When we charge both of old page and new page
with mapcount=0 before migration we can't catch
  - the page is really freed before remap.
  - migration fails but it's freed before remap
.....corner cases.

For fixing all, this changes parepare/end migration with MIGRATION flag on
page_cgroup.

New migration sequence with memcg is:

 1. allocate a new page.
 2. mark PageCgroupMigration to the old page.
 3. charge against a new page onto the ...
From: Daisuke Nishimura
Date: Sunday, April 18, 2010 - 8:42 pm

Hmm, before going further, will you explain why we need a new PCG_MIGRATION flag ?
What's the problem of v2 ?

Thanks,
Daisuke Nishimura.

--

From: KAMEZAWA Hiroyuki
Date: Sunday, April 18, 2010 - 9:18 pm

On Mon, 19 Apr 2010 12:42:25 +0900

v2 can't handle migration-failure case of freed swapcache and the used page
was swapped-out case. I think.

All "page" in following is ANON.


     mem_cgroup_prepare_migration()
	     charge against new page.
     
     try_to_unmap()
        -> mapcount goes down to 0.
             -> an old page is unchaged
     
     move_to_new_page()
        -> may fail. (in some case.)   ----(*1)

     remap the old page to pte.

     mem_cgroup_end_migration()
		(at success *1)
		check charge for newpage is valid or not (*2)

		(at fail *1)
		uncharge new page.
		What we should do for an old page. ---(*3)

At (*2). (*3), there are several cases.

(*2) migration was succeeded.
    1. The new page was successfully remapped.
	-> Nothing to do.
    2. The new page was remapped but finally unmapped before (*3)
	-> page_remove_rmap() will catch the event.
    3. The new page was not remapped.
	-> page_remove_rmap() can't catch the event. end_migraion() has to
	uncharge it.

(*3) migration was failed.
    1. The old page was successfully remapped.
	-> We have to recharge against the old page. (But it may hit OOM.)
    2. The old page wasn't remapped.
        -> mapcount is 0. No new charge will happen.
    3. The old page wasn't remapped but SwapCache.
        -> mapcount is 0. We have to recharge against the old page (But it may hit OOM)

Maybe other seqence I couldn't write will exist......IMHO, "we have to recharge it because
it's uncharged.." is bad idea. It seems hard to maintainace..


When we use MIGRATION flag.
After migaration.

    1. Agaisnt new page, we remove MIGRATION flag and try to uncharge() it again.

    2. Agaisnt old page, we remove MIGRATION flag and try to uncharge it again.

NOTE:  I noticed my v3 patch is buggy when the page-is-swapped-out case. It seems
       mem_cgroup_uncharge_swapcache() has to wait for migration ends or some
       other case handling. (Anyway, this race exists only after ...
From: Daisuke Nishimura
Date: Monday, April 19, 2010 - 1:07 am

Thank you for explaining in detail.

Anyway, I agree that current implementation is complicated and there might be
some cases we are missing. MIGRATION flag can make it simpler.

I have one concern for now. Reading the patch, the flag have influence on
only anonymous pages, so we'd better to note it and I feel it strange to
set(and clear) the flag of "old page" always(iow, even when !PageAnon)
in prepare_migration.


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Monday, April 19, 2010 - 1:26 am

On Mon, 19 Apr 2010 17:07:01 +0900

v2 doesn't charge. That was the bug.
"may hit OOM" is an explanation for why current implementation is used.

Hmm...Checking "Only Anon" is simpler ? It will have no meanings for migrating
file caches, but it may have some meanings for easy debugging. 
I think "mark it always but it's used only for anonymous page" is reasonable
(if it causes no bug.)


Thanks,
-Kame




--

From: Daisuke Nishimura
Date: Monday, April 19, 2010 - 9:20 pm

I just thought it was inconsistent that we always set/clear the bit about "old page",
Anyway, I don't have any strong objection.
It's all right for me as long as it is well documented or commented.


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Monday, April 19, 2010 - 9:26 pm

On Tue, 20 Apr 2010 13:20:50 +0900
Maybe I can post v4, today.

Thanks,
-Kame


--

From: KAMEZAWA Hiroyuki
Date: Tuesday, April 20, 2010 - 2:19 am

On Tue, 20 Apr 2010 13:20:50 +0900
Okay, before posting as v4, here is draft version.

I'll check again.
==

At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
(it worked but...)
This has small race and we need to fix this race. By this,
FILE_MAPPED of migrated file cache is not properly updated, now.

This patch is for fixing the race by changing algorithm.

At migrating mapped file, events happens in following sequence.

 1. allocate a new page.
 2. get memcg of an old page.
 3. charge ageinst a new page before migration. But at this point,
    no changes to new page's page_cgroup, no commit-charge.
 4. page migration replaces radix-tree, old-page and new-page.
 5. page migration remaps the new page if the old page was mapped.
 6. Here, the new page is unlocked.
 7. memcg commits the charge for newpage, Mark page cgroup as USED.

Because "commit" happens after page-remap, we can count FILE_MAPPED
at "5", we can lose file_mapped accounting information within this
small race. FILE_MAPPED is updated only when mapcount changes 0->1.
So, if we don't catch this 0->1 event, we can underflow FILE_MAPPED
at 1->0 event.

We may be able to avoid underflow by some small technique but
we should catch mapcount 0->1 event. To catch this, we have to
make page_cgroup of new page as "USED".

BTW, historically, above implemntation comes from migration-failure
of anonymous page. Because we charge both of old page and new page
with mapcount=0, we can't catch
  - the page is really freed before remap.
  - migration fails but it's freed before remap
or .....corner cases.

For fixing all, this changes parepare/end migration.

New migration sequence with memcg is:

 1. allocate a new page.
 2. mark PageCgroupMigration to the old page.
 3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
 4. mark PageCgroupMigration to the old page.

If ...
From: Daisuke Nishimura
Date: Friday, April 23, 2010 - 1:08 am

I'm sorry for my late reply.

Thank you for adding good comments about what it does and why we need it.
I like the direction that we set MIGRATION flags only on the old page.
And this patch looks good to me, except that checkpatch warns some problems
about indent :)

The reason why "This flag itself is not racy" is that we update the flag only
while the page is isolated ?
Then, we doesn't need page_cgroup lock, do we ? PCG_USED bit will avoid
double-uncharge.

Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Friday, April 23, 2010 - 1:23 am

On Fri, 23 Apr 2010 17:08:46 +0900
(--;

I'm sorry that this patch is delayed. I have to fix migration itself
yes and no.
It's not racy because a page is only under a migration thread, not under a few of
no. there is a chance to update FILE_MAPPED etc..and any other races. I guess.

Thanks,
-Kame

--

From: Daisuke Nishimura
Date: Wednesday, April 14, 2010 - 11:43 pm

I'm sorry, I can't understand what this part fixes.
We set *memcg to NULL at "bypass" part already:

   1740 bypass:
   1741         *memcg = NULL;
   1742         return 0;

I sent a patch to Andrew to fix this part yesterday :)


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, April 14, 2010 - 11:56 pm

On Thu, 15 Apr 2010 15:43:24 +0900


Ok, ignore this patch.

-Kame

--

From: Andrea Arcangeli
Date: Thursday, April 15, 2010 - 1:17 am

Ok so I'll stick to my original patch on aa.git:

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=f0a05fea58501298ab7b...

I already also merged the move from /proc to debugfs from Mel of two
files. So now I've to:

1) finish the generic doc in Documentation/ (mostly taken from
   transparent hugepage core changeset comments here:
   http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=b901f7e1ab412241d42...
   )

2) add alloc_pages_vma for numa awareness in the huge page faults

3) have the kernel stack 2m aligned and growsdown the vm_start in 2m
   chunks when enabled=always. I doubt it makes sense to decouple this
   feature from enabled=always and to add a special sysfs control for
   it, plus I don't like adding too many apis and it can always
   decoupled later.

4) I think I will not add a prctl to achieve Ingo's per-process enable
   for now. I'm quite convinced in real life madvise is enough and
   enabled=always|madvise|never is more than enough for the testing
   without having to add a prctl. This is identical issue to KSM after
   all, in the end also KSM is missing a prctl to enabled merging on a
   per process basis and that's fine. prctl really looks very much
   like libhugetlbfs to me so I'm not very attracted to it as I doubt
   its usefulness strongly and if I add it, it becomes a
   forever-existing API (actually even worse than the sysfs layout
   from the kernel API point of view) so there has to be a strong
   reason for it. And I don't think there's any point to add a
   madvise(MADV_NO_HUGEPAGE) or a prctl to selectively _disable_
   hugepages on mappings or processes when enabled=always. It makes no
   sense to use enabled=always and then to disable hugepages in a few
   apps. The opposite makes sense to save memory of course! I don't
   want to add kernel APIs in prctl useful only for testing and
   benchmarking. It can always be added later anyway...

5) Ulrich sent me a ...
From: Christoph Lameter
Date: Friday, April 16, 2010 - 9:13 am

How do interleave policies work with alloc_pages_vma? So far the semantics
is to spread 4k pages over different nodes. With 2M pages this can no
longer work the way is was.

--


static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
       	      	   				     unsigned nid)

See the order parameter, so I hope it's already solved. I assume the
idea would be to interleave 2M pages to avoid the CPU the memory
overhead of the pte layer and to decrease the tlb misses, but still
maxing out the bandwidth of the system when multiple threads accesses
memory that is stored in different nodes with random access. It should
be ideal for hugetlbfs too for the large shared memory pools of the
DB. Surely it'll be better than having all hugepages from the same
node despite MPOL_INTERLEAVE is set.

Said that, it'd also be possible to disable hugepages if the vma has
MPOL_INTERLEAVE set, but I doubt we want to do that by default. Maybe
we can add a sysfs control later for that which can be further tweaked
at boot time by per-arch quirks, dunno... It's really up to you, you
know numa better, but I've no doubt that MPOL_INTERLEAVE also can make
sense with hugepages (both hugetlbfs and transparent hugepage
support).

Thanks,
Andrea
--

Previous thread: [PATCH] Kconfig: Make config Filter access to /dev/mem default y by wzt.wzt on Monday, April 12, 2010 - 7:52 pm. (13 messages)

Next thread: Compat-wireless for 2.6.34 - 802.11 and bluetooth backported by Luis R. Rodriguez on Monday, April 12, 2010 - 9:51 pm. (2 messages)