Hi.
When I was testing page migration, I found underflow problem of "mapped_file" field
in memory.stat. This is a fix for the problem.
This patch is based on mmotm-2010-04-05-16-09, and IIUC it conflicts with Mel's
compaction patches, so I send it as RFC for now. After next mmotm, which will
include those patches, I'll update and resend this patch.
===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
page_add_file_rmap(), which can be called from remove_migration_ptes(), is
assumed to increment memcg's stat of mapped file. But on success of page
migration, the newpage(mapped file) has not been charged yet, so the stat will
not be incremented. This behavior leads to underflow of memcg's stat because
when the newpage is unmapped afterwards, page_remove_rmap() decrements the stat.
This problem doesn't happen on failure path of page migration, because the old
page(mapped file) hasn't been uncharge at the point of remove_migration_ptes().
This patch fixes this problem by calling commit_charge(mem_cgroup_end_migration)
before remove_migration_ptes().
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
mm/migrate.c | 19 ++++++++++++++-----
1 files changed, 14 insertions(+), 5 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5938db5..915c35e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -485,7 +485,8 @@ static int fallback_migrate_page(struct address_space *mapping,
* < 0 - error code
* == 0 - success
*/
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+ struct mem_cgroup *mem)
{
struct address_space *mapping;
int rc;
@@ -520,9 +521,16 @@ static int move_to_new_page(struct page *newpage, struct page *page)
else
rc = fallback_migrate_page(mapping, newpage, page);
- if (!rc)
+ if (!rc) {
+ /*
+ * On success of page migration, the newpage has not been
+ * charged yet, so we must call end_migration() ...On Tue, 13 Apr 2010 13:42:07 +0900
Nice catch. but...I want to make all kind of complicated things under
prepare/end migration. (And I want to avoid changes in migrate.c...)
Considering some racy condistions, I wonder memcg_update_file_mapped() itself
still need fixes..
So, how about this ? We already added FILE_MAPPED flags, then, make use of it.
==
At migrating mapped file, events happens in following sequence.
1. allocate a new page.
2. get memcg of an old page.
3. charge ageinst new page, before migration. But at this point
no changes to page_cgroup, no commit-charge.
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. memcg commits the charge for newpage.
Because "commit" happens after page-remap, we lose file_mapped
accounting information at migration.
This patch fixes it by accounting file_mapped information at
commiting charge.
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
Index: mmotm-temp/mm/memcontrol.c
===================================================================
--- mmotm-temp.orig/mm/memcontrol.c
+++ mmotm-temp/mm/memcontrol.c
@@ -1435,11 +1435,13 @@ void mem_cgroup_update_file_mapped(struc
/*
* Preemption is already disabled. We can use __this_cpu_xxx
+ * We have no lock per page at inc/dec mapcount of pages. We have to do
+ * check by ourselves under lock_page_cgroup().
*/
- if (val > 0) {
+ if (val > 0 && !PageCgroupFileMapped(pc)) {
__this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
SetPageCgroupFileMapped(pc);
- } else {
+ } else if (PageCgroupFileMapped(pc)) {
__this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
ClearPageCgroupFileMapped(pc);
}
@@ -2563,6 +2565,15 @@ void mem_cgroup_end_migration(struct mem
...hmm, I want to call mem_cgroup_update_file_mapped() only where we update Adding likely() is better ? IIUC, these conditions are usually met except for We cannot rely on page lock, because when we succeeded in page migration, "target" = "newpage" has already unlocked in move_to_new_page(). So the "target" can be removed from the radix-tree theoretically(it's not related to this underflow problem, though). Shouldn't we call lock_page(target) and check "if (!target->mapping)" to handle this case(maybe in another patch) ? Thanks, --
On Wed, 14 Apr 2010 09:54:08 +0900 Sounds reasonable. I think about that. Thanks, -Kame --
On Wed, 14 Apr 2010 10:03:08 +0900 Ah, PageCgroupUsed() is already checked under lock_page_cgroup(). It's enough. Thanks, -Kame --
On Wed, 14 Apr 2010 10:40:10 +0900 Thinking again....new page is unlocked here. It means the new page may be removed from radix-tree before commit_charge(). Haha, it seems totally wrong. please wait.. Thanks, -Kame --
On Wed, 14 Apr 2010 10:56:08 +0900
This is my answer. How do you think ?
I think this logic is simple. (But yes, we should check corner cases.)
==
At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
And FILE_MAPPED of migrated file cache is not properly updated, now.
At migrating mapped file, events happens in following sequence.
1. allocate a new page.
2. get memcg of an old page.
3. charge ageinst new page, before migration. But at this point
no changes to page_cgroup, no commit-charge.
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. memcg commits the charge for newpage.
Because "commit" happens after page-remap, we lose file_mapped
accounting information at migration.
For fixing all, this changes parepare/end migration.
New migration sequence with memcg is:
1. allocate a new page.
2. charge against a new page onto old page's memcg. (here, the new page
is marked as PageCgroupUsed.)
3. page migration replaces radix-tree, page table, etc...
4. At remapping, FILE_MAPPED will be properly updated. (because newpage is marked as
USED, already)
5. If anonymous page is freed before remap, check it and fixup accounting.
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 6 +-
mm/memcontrol.c | 95 ++++++++++++++++++++++++---------------------
mm/migrate.c | 2
3 files changed, 56 insertions(+), 47 deletions(-)
Index: mmotm-temp/mm/memcontrol.c
===================================================================
--- mmotm-temp.orig/mm/memcontrol.c
+++ mmotm-temp/mm/memcontrol.c
@@ -2501,10 +2501,12 @@ static inline int mem_cgroup_move_swap_a
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page ...Great! I like this change very much. it should be: *ptr = mem; ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false); css_put(&mem->css); hmm... using mem_cgroup_uncharge_page() would be enough, but I think it doesn't show what we want: we must uncharge "unused" by all means in PageCgroupUsed case, and I feel it strange a bit to uncharge "unused" by mem_cgroup_uncharge_page(), if it *was* a cache page. So I think __mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE) would be better, otherwise we need more comments to explain why And if newpage was removed from radix-tree after unlock_page(), the context which removed it from radix-tree uncharges it properly, because it is charged at prepare_migration. mem_cgroup_uncharge_page() does the same check :) Thanks, --
On Wed, 14 Apr 2010 14:31:32 +0900 Ok. I'll fix. Thanks, -Kame --
Considering more, we'd better to check PageAnon() at least not to call mem_cgroup_uncharge_page() for cache page. Thanks, Daisuke Nishimura. --
Why do we need to pass mem, won't try_get_mem_cgroup_from_page() help? -- Three Cheers, Balbir --
I'd like to wait until next mmotm comes out. (So, [RFC]) I'll rebase
This patch is based on
mmotm-2010/04/05
+
mm-migration-take-a-reference-to-the-anon_vma-before-migrating.patch
mm-migration-do-not-try-to-migrate-unmapped-anonymous-pages.patch
mm-share-the-anon_vma-ref-counts-between-ksm-and-page-migration.patch
mm-migration-allow-the-migration-of-pageswapcache-pages.patch
memcg-fix-prepare-migration.patch
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
This is an additonal fix to memcg-fix-prepare-migration.patch
Now, try_charge can bypass charge if TIF_MEMDIE at el are marked on the caller.
In this case, the charge is bypassed. This makes accounts corrupted.
(PageCgroup will be marked as PCG_USED even if bypassed, and css->refcnt
can leak.)
This patch clears passed "*memcg" in bypass route.
Because usual page allocater passes memcg=NULL, this patch only affects
some special case as
- move account
- migration
- swapin.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
Index: mmotm-temp/mm/memcontrol.c
===================================================================
--- mmotm-temp.orig/mm/memcontrol.c
+++ mmotm-temp/mm/memcontrol.c
@@ -1606,8 +1606,12 @@ static int __mem_cgroup_try_charge(struc
* MEMDIE process.
*/
if (unlikely(test_thread_flag(TIF_MEMDIE)
- || fatal_signal_pending(current)))
+ || fatal_signal_pending(current))) {
+ /* Showing we skipped charge */
+ if (memcg)
+ *memcg = NULL;
goto bypass;
+ }
/*
* We always charge the cgroup the mm_struct belongs to.
@@ -2523,7 +2527,6 @@ int mem_cgroup_prepare_migration(struct
ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
css_put(&mem->css);
}
- *ptr = mem;
return ret;
}
--
At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
And FILE_MAPPED of migrated file cache is not properly updated, now.
At migrating mapped file, events happens in following sequence.
1. allocate a new page.
2. get memcg of an old page.
3. charge ageinst new page, before migration. But at this point
no changes to page_cgroup, no commit-charge.
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. memcg commits the charge for newpage.
Because "commit" happens after page-remap, we lose file_mapped
accounting information at migration.
For fixing all, this changes parepare/end migration.
New migration sequence with memcg is:
1. allocate a new page.
2. charge against a new page onto old page's memcg. (here, new page's pc
is marked as PageCgroupUsed.)
3. page migration replaces radix-tree, page table, etc...
4. At remapping, FILE_MAPPED will be properly updated.
Changelog: 2010/04/14
- updated onto the latest mmotm + page-compaction, etc...
- fixed __try_charge() bypass case.
- use CHARGE_TYPE_FORCE for uncharging an unused page.
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 6 +-
mm/memcontrol.c | 96 ++++++++++++++++++++++++---------------------
mm/migrate.c | 2
3 files changed, 58 insertions(+), 46 deletions(-)
Index: mmotm-temp/mm/memcontrol.c
===================================================================
--- mmotm-temp.orig/mm/memcontrol.c
+++ mmotm-temp/mm/memcontrol.c
@@ -2505,10 +2505,12 @@ static inline int mem_cgroup_move_swap_a
* Before starting migration, account PAGE_SIZE to mem_cgroup that the old
* page belongs to.
*/
-int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
+int ...This is still RFC.
I hope this will fix memcg v.s. page-migraion war finally ;(
But we have to be careful...
==
At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
(But it wasn't big problem..the real issue is mapcount handling.)
This has small race and we need to fix this race. By this,
FILE_MAPPED of migrated file cache is not properly updated, now.
This patch is for fixing the race by changing algorithm.
At migrating mapped file, events happens in following sequence.
1. allocate a new page.
2. get memcg of an old page.
3. charge ageinst a new page before migration. But at this point,
no changes to new page's page_cgroup, no commit-charge.
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. Here, the new page is unlocked.
7. memcg commits the charge for newpage, Mark page cgroup as USED.
Because "commit" happens after page-remap, we cannot count FILE_MAPPED
at "5", we can lose file_mapped accounting information within this
small race. FILE_MAPPED is updated only when mapcount changes 0->1.
So, if we don't catch this 0->1 event, we can underflow FILE_MAPPED
at 1->0 event.
We may be able to avoid underflow by some small technique or new hook but
we should catch mapcount 0->1 event fundamentaly. To catch this, we have to
make page_cgroup of new page as "USED".
BTW, historically, above implemntation comes from migration-failure
of anonymous page. When we charge both of old page and new page
with mapcount=0 before migration we can't catch
- the page is really freed before remap.
- migration fails but it's freed before remap
.....corner cases.
For fixing all, this changes parepare/end migration with MIGRATION flag on
page_cgroup.
New migration sequence with memcg is:
1. allocate a new page.
2. mark PageCgroupMigration to the old page.
3. charge against a new page onto the ...Hmm, before going further, will you explain why we need a new PCG_MIGRATION flag ? What's the problem of v2 ? Thanks, Daisuke Nishimura. --
On Mon, 19 Apr 2010 12:42:25 +0900
v2 can't handle migration-failure case of freed swapcache and the used page
was swapped-out case. I think.
All "page" in following is ANON.
mem_cgroup_prepare_migration()
charge against new page.
try_to_unmap()
-> mapcount goes down to 0.
-> an old page is unchaged
move_to_new_page()
-> may fail. (in some case.) ----(*1)
remap the old page to pte.
mem_cgroup_end_migration()
(at success *1)
check charge for newpage is valid or not (*2)
(at fail *1)
uncharge new page.
What we should do for an old page. ---(*3)
At (*2). (*3), there are several cases.
(*2) migration was succeeded.
1. The new page was successfully remapped.
-> Nothing to do.
2. The new page was remapped but finally unmapped before (*3)
-> page_remove_rmap() will catch the event.
3. The new page was not remapped.
-> page_remove_rmap() can't catch the event. end_migraion() has to
uncharge it.
(*3) migration was failed.
1. The old page was successfully remapped.
-> We have to recharge against the old page. (But it may hit OOM.)
2. The old page wasn't remapped.
-> mapcount is 0. No new charge will happen.
3. The old page wasn't remapped but SwapCache.
-> mapcount is 0. We have to recharge against the old page (But it may hit OOM)
Maybe other seqence I couldn't write will exist......IMHO, "we have to recharge it because
it's uncharged.." is bad idea. It seems hard to maintainace..
When we use MIGRATION flag.
After migaration.
1. Agaisnt new page, we remove MIGRATION flag and try to uncharge() it again.
2. Agaisnt old page, we remove MIGRATION flag and try to uncharge it again.
NOTE: I noticed my v3 patch is buggy when the page-is-swapped-out case. It seems
mem_cgroup_uncharge_swapcache() has to wait for migration ends or some
other case handling. (Anyway, this race exists only after ...Thank you for explaining in detail. Anyway, I agree that current implementation is complicated and there might be some cases we are missing. MIGRATION flag can make it simpler. I have one concern for now. Reading the patch, the flag have influence on only anonymous pages, so we'd better to note it and I feel it strange to set(and clear) the flag of "old page" always(iow, even when !PageAnon) in prepare_migration. Thanks, Daisuke Nishimura. --
On Mon, 19 Apr 2010 17:07:01 +0900 v2 doesn't charge. That was the bug. "may hit OOM" is an explanation for why current implementation is used. Hmm...Checking "Only Anon" is simpler ? It will have no meanings for migrating file caches, but it may have some meanings for easy debugging. I think "mark it always but it's used only for anonymous page" is reasonable (if it causes no bug.) Thanks, -Kame --
I just thought it was inconsistent that we always set/clear the bit about "old page", Anyway, I don't have any strong objection. It's all right for me as long as it is well documented or commented. Thanks, Daisuke Nishimura. --
On Tue, 20 Apr 2010 13:20:50 +0900 Maybe I can post v4, today. Thanks, -Kame --
On Tue, 20 Apr 2010 13:20:50 +0900
Okay, before posting as v4, here is draft version.
I'll check again.
==
At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
(it worked but...)
This has small race and we need to fix this race. By this,
FILE_MAPPED of migrated file cache is not properly updated, now.
This patch is for fixing the race by changing algorithm.
At migrating mapped file, events happens in following sequence.
1. allocate a new page.
2. get memcg of an old page.
3. charge ageinst a new page before migration. But at this point,
no changes to new page's page_cgroup, no commit-charge.
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. Here, the new page is unlocked.
7. memcg commits the charge for newpage, Mark page cgroup as USED.
Because "commit" happens after page-remap, we can count FILE_MAPPED
at "5", we can lose file_mapped accounting information within this
small race. FILE_MAPPED is updated only when mapcount changes 0->1.
So, if we don't catch this 0->1 event, we can underflow FILE_MAPPED
at 1->0 event.
We may be able to avoid underflow by some small technique but
we should catch mapcount 0->1 event. To catch this, we have to
make page_cgroup of new page as "USED".
BTW, historically, above implemntation comes from migration-failure
of anonymous page. Because we charge both of old page and new page
with mapcount=0, we can't catch
- the page is really freed before remap.
- migration fails but it's freed before remap
or .....corner cases.
For fixing all, this changes parepare/end migration.
New migration sequence with memcg is:
1. allocate a new page.
2. mark PageCgroupMigration to the old page.
3. charge against a new page onto the old page's memcg. (here, new page's pc
is marked as PageCgroupUsed.)
4. mark PageCgroupMigration to the old page.
If ...I'm sorry for my late reply. Thank you for adding good comments about what it does and why we need it. I like the direction that we set MIGRATION flags only on the old page. And this patch looks good to me, except that checkpatch warns some problems about indent :) The reason why "This flag itself is not racy" is that we update the flag only while the page is isolated ? Then, we doesn't need page_cgroup lock, do we ? PCG_USED bit will avoid double-uncharge. Thanks, Daisuke Nishimura. --
On Fri, 23 Apr 2010 17:08:46 +0900 (--; I'm sorry that this patch is delayed. I have to fix migration itself yes and no. It's not racy because a page is only under a migration thread, not under a few of no. there is a chance to update FILE_MAPPED etc..and any other races. I guess. Thanks, -Kame --
I'm sorry, I can't understand what this part fixes. We set *memcg to NULL at "bypass" part already: 1740 bypass: 1741 *memcg = NULL; 1742 return 0; I sent a patch to Andrew to fix this part yesterday :) Thanks, Daisuke Nishimura. --
On Thu, 15 Apr 2010 15:43:24 +0900 Ok, ignore this patch. -Kame --
Ok so I'll stick to my original patch on aa.git: http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=patch;h=f0a05fea58501298ab7b... I already also merged the move from /proc to debugfs from Mel of two files. So now I've to: 1) finish the generic doc in Documentation/ (mostly taken from transparent hugepage core changeset comments here: http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=commit;h=b901f7e1ab412241d42... ) 2) add alloc_pages_vma for numa awareness in the huge page faults 3) have the kernel stack 2m aligned and growsdown the vm_start in 2m chunks when enabled=always. I doubt it makes sense to decouple this feature from enabled=always and to add a special sysfs control for it, plus I don't like adding too many apis and it can always decoupled later. 4) I think I will not add a prctl to achieve Ingo's per-process enable for now. I'm quite convinced in real life madvise is enough and enabled=always|madvise|never is more than enough for the testing without having to add a prctl. This is identical issue to KSM after all, in the end also KSM is missing a prctl to enabled merging on a per process basis and that's fine. prctl really looks very much like libhugetlbfs to me so I'm not very attracted to it as I doubt its usefulness strongly and if I add it, it becomes a forever-existing API (actually even worse than the sysfs layout from the kernel API point of view) so there has to be a strong reason for it. And I don't think there's any point to add a madvise(MADV_NO_HUGEPAGE) or a prctl to selectively _disable_ hugepages on mappings or processes when enabled=always. It makes no sense to use enabled=always and then to disable hugepages in a few apps. The opposite makes sense to save memory of course! I don't want to add kernel APIs in prctl useful only for testing and benchmarking. It can always be added later anyway... 5) Ulrich sent me a ...
How do interleave policies work with alloc_pages_vma? So far the semantics is to spread 4k pages over different nodes. With 2M pages this can no longer work the way is was. --
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
unsigned nid)
See the order parameter, so I hope it's already solved. I assume the
idea would be to interleave 2M pages to avoid the CPU the memory
overhead of the pte layer and to decrease the tlb misses, but still
maxing out the bandwidth of the system when multiple threads accesses
memory that is stored in different nodes with random access. It should
be ideal for hugetlbfs too for the large shared memory pools of the
DB. Surely it'll be better than having all hugepages from the same
node despite MPOL_INTERLEAVE is set.
Said that, it'd also be possible to disable hugepages if the vma has
MPOL_INTERLEAVE set, but I doubt we want to do that by default. Maybe
we can add a sysfs control later for that which can be further tweaked
at boot time by per-arch quirks, dunno... It's really up to you, you
know numa better, but I've no doubt that MPOL_INTERLEAVE also can make
sense with hugepages (both hugetlbfs and transparent hugepage
support).
Thanks,
Andrea
--
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm | [PATCH 03/07] ARM: Use shared GIC entry macros on Realview |
| Oliver Neukum | Re: [Bug #13682] The webcam stopped working when upgrading from 2.6.29 to 2.6.30 |
| Martin Schwidefsky | Re: [PATCH] optimized ktime_get[_ts] for GENERIC_TIME=y |
git: | |
| Junio C Hamano | Re: Some advanced index playing |
| Jeff King | Re: confusion over the new branch and merge config |
| Robin Rosenberg | Re: cvs2svn conversion directly to git ready for experimentation |
| Linus Torvalds | git binary size... |
| Ævar Arnfjörð Bjarmason | Re: Challenge with Git-Bash |
| Linux Kernel Mailing List | md: move allocation of ->queue from mddev_find to md_probe |
| Linux Kernel Mailing List |
