Hi, I updated the stack and reflected comments.
Against the latest mmotm. (rc7-mm1)
Major changes from previous one is
- page_cgroup allocation/lookup manner is changed.
all FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.
- force_empty is totally rewritten. and a problem that "force_empty takes long time"
in previous version is fixed (I think...)
- reordered patches.
- first half are easy ones.
- second half are big ones.
I'm still testing with full debug option. No problem found yet.
(I'm afraid of race condition which have not been caught yet.)
[1/12] avoid accounting special mappings not on LRU. (fix)
[2/12] move charege() call to swapped-in page under lock_page() (clean up)
[3/12] make root cgroup to be unlimited. (change semantics.)
[4/12] make page->mapping NULL before calling uncharge (clean up)
[5/12] make page->flags to use atomic ops. (changes in infrastructure)
[6/12] optimize stat. (clean up)
[7/12] add support function for moving account. (new function)
[8/12] rewrite force_empty to use move_account. (change semantics.)
[9/12] allocate all page_cgroup at boot. (changes in infrastructure)
[10/12] free page_cgroup from LRU in lazy way (optimize)
[11/12] add page_cgroup to LRU in lazy way (optimize)
[12/12] fix race at charging swap (fix by new logic.)
*Any* comment is welcome.
Thanks,
-Kame
--
There are not-on-LRU pages which can be mapped and they are not worth to
be accounted. (becasue we can't shrink them and need dirty codes to handle
specical case) We'd like to make use of usual objrmap/radix-tree's protcol
and don't want to account out-of-vm's control pages.
When special_mapping_fault() is called, page->mapping is tend to be NULL
and it's charged as Anonymous page.
insert_page() also handles some special pages from drivers.
This patch is for avoiding to account special pages.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memory.c | 18 ++++++------------
mm/rmap.c | 4 ++--
2 files changed, 8 insertions(+), 14 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memory.c
+++ mmotm-2.6.27-rc6+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
pte_t *pte;
spinlock_t *ptl;
- retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
- if (retval)
- goto out;
-
retval = -EINVAL;
if (PageAnon(page))
- goto out_uncharge;
+ goto out;
retval = -ENOMEM;
flush_dcache_page(page);
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
- goto out_uncharge;
+ goto out;
retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
return retval;
out_unlock:
pte_unmap_unlock(pte, ptl);
-out_uncharge:
- mem_cgroup_uncharge_page(page);
out:
return retval;
}
@@ -2542,7 +2536,7 @@ static int __do_fault(struct mm_struct *
}
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (anon && mem_cgroup_charge(page, mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
goto out;
}
@@ -2584,10 +2578,10 @@ static int __do_fault(struct mm_struct *
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, address, entry);
} else {
- mem_cgroup_uncharge_page(page);
- if (anon)
+ if (anon) ...Hmm... I am a little concerned that with these changes actual usage will much more than what we report in memory.usage_in_bytes. Why not move them to non-reclaimable LRU list as unevictable pages (once those patches go in, we can push this as well). I suspect the size of special pages is too short to affect anything or are you seeing something very different? -- Balbir --
On Fri, 26 Sep 2008 13:55:54 +0530 I don't want put pages never goes to LRU onto memcgroup's LRU. Thanks, -Kame --
I understand.. Thanks for clarifying.. my concern is w.r.t accounting, we account it in RSS (file RSS). -- Balbir --
On Fri, 26 Sep 2008 15:02:13 +0530 Yes. Because page->mapping is NULL, there are accounted as RSS. AFAIK, what bad about !PageLRU(page) is.. 1. we skips !PageLRU(page) in force_empty. and we cannot reduce account. 2. we don't know we can trust that page is treated as usual LRU page. I'll update description. Thanks, -Kame --
While page-cache's charge/uncharge is done under page_lock(), swap-cache
isn't. (anonymous page is charged when it's newly allocated.)
This patch moves do_swap_page()'s charge() call under lock. This helps
us to avoid to charge already mapped one, unnecessary calls.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memory.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memory.c
+++ mmotm-2.6.27-rc7+/mm/memory.c
@@ -2320,15 +2320,14 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}
+ lock_page(page);
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+
if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
ret = VM_FAULT_OOM;
goto out;
}
-
mark_page_accessed(page);
- lock_page(page);
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
/*
* Back out if somebody else already faulted in this pte.
--
Seems reasonable to me Just one quick comment though, as a result of this change, mark_page_accessed is now called with PageLock held, I suspect you would want to move that call prior to lock_page(). -- Balbir --
On Fri, 26 Sep 2008 14:06:02 +0530 Ok. I'll check it Thanks, --
Make root cgroup of memory resource controller to have no limit.
By this, users cannot set limit to root group. This is for making root cgroup
as a kind of trash-can.
For accounting pages which has no owner, which are created by force_empty,
we need some cgroup with no_limit. A patch for rewriting force_empty will
will follow this one.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Documentation/controllers/memory.txt | 4 ++++
mm/memcontrol.c | 7 +++++++
2 files changed, 11 insertions(+)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -136,6 +136,9 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;
+#define is_root_cgroup(cgrp) ((cgrp) == &init_mem_cgroup)
+
+
/*
* We use the lower bit of the page->page_cgroup pointer as a bit spin
* lock. We need to ensure that page->page_cgroup is at least two
@@ -945,6 +948,10 @@ static int mem_cgroup_write(struct cgrou
switch (cft->private) {
case RES_LIMIT:
+ if (is_root_cgroup(memcg)) {
+ ret = -EINVAL;
+ break;
+ }
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buffer, &val);
if (!ret)
Index: mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.27-rc7+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
@@ -121,6 +121,9 @@ The corresponding routines that remove a
a page from Page Cache is used to decrement the accounting counters of the
cgroup.
+The root cgroup is not allowed to be set limit but usage is accounted.
+For controlling usage of memory, you need to create a cgroup.
+
2.3 Shared Page Accounting
Shared pages are accounted on the basis of the first touch approach. The
@@ ...This is an ABI change (although not too many people might be using it, I wonder if we should add memory.features (a set of flags and let users enable them and provide good defaults), like sched features. -- Balbir --
On Fri, 26 Sep 2008 14:11:00 +0530 I think "feature" flag is complicated, at this stage. We'll add more features and not settled yet. Hmm, if you don't like this, calling try_to_free_page() at force_empty() instead of move_account() ? Thanks, -Kame --
I know.. but breaking ABI is a bad bad thing. We'll have to keep the feature flags extensible (add new things). If we all feel we don't have enough users Not sure I understand this. -- Balbir --
On Fri, 26 Sep 2008 14:59:48 +0530 In following series, force_empty() uses move_account() rather than forget all. By this, accounted file caches are kept as accounted and the whole accounting will be sane. Another choice is calling try_to_free_pages() at force_empty rather than forget all and makes memory usage to be Zero. This will also makes the whole accounting sange. Thanks, -Kame --
This patch tries to make page->mapping to be NULL before
mem_cgroup_uncharge_cache_page() is called.
"page->mapping == NULL" is a good check for "whether the page is still
radix-tree or not".
This patch also adds BUG_ON() to mem_cgroup_uncharge_cache_page();
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/filemap.c | 2 +-
mm/memcontrol.c | 1 +
mm/migrate.c | 12 +++++++++---
3 files changed, 11 insertions(+), 4 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/filemap.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/filemap.c
+++ mmotm-2.6.27-rc7+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
{
struct address_space *mapping = page->mapping;
- mem_cgroup_uncharge_cache_page(page);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
BUG_ON(page_mapped(page));
+ mem_cgroup_uncharge_cache_page(page);
/*
* Some filesystems seem to re-dirty the page even after
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -737,6 +737,7 @@ void mem_cgroup_uncharge_page(struct pag
void mem_cgroup_uncharge_cache_page(struct page *page)
{
VM_BUG_ON(page_mapped(page));
+ VM_BUG_ON(page->mapping);
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
Index: mmotm-2.6.27-rc7+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc7+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
__inc_zone_page_state(newpage, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
- if (!PageSwapCache(newpage))
- mem_cgroup_uncharge_cache_page(page);
return 0;
}
@@ -378,7 +376,15 @@ static ...Isn't it better and correct coding style to do /* * Uncharge obsolete file cache */ if (!PageAnon(page)) mem_cgroup_uncharge_cache_page(page); /* else - uncharged at try_to_unmap() */ page->mapping = NULL; -- Balbir --
On Fri, 26 Sep 2008 15:17:48 +0530
yea, maybe.
I always wonder what I should do when I want to add comment to if-then-else...
But ok, will remove {}.
Thanks,
-Kame
--
This patch makes page_cgroup->flags to be atomic_ops and define
functions (and macros) to access it.
This patch itself makes memcg slow but this patch's final purpose is
to remove lock_page_cgroup() and allowing fast access to page_cgroup.
(And total performance will increase after all patches applied.)
Before trying to modify memory resource controller, this atomic operation
on flags is necessary. Most of flags in this patch is for LRU and modfied
under mz->lru_lock but we'll add another flags which is not for LRU soon.
So we use atomic version here.
Changelog: (v4) -> (v5)
- removed unsued operations.
- adjusted to new ctype MEM_CGROUP_CHARGE_TYPE_SHMEM
Changelog: (v3) -> (v4)
- no changes.
Changelog: (v2) -> (v3)
- renamed macros and flags to be longer name.
- added comments.
- added "default bit set" for File, Shmem, Anon.
Changelog: (preview) -> (v1):
- patch ordering is changed.
- Added macro for defining functions for Test/Set/Clear bit.
- made the names of flags shorter.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 122 +++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 82 insertions(+), 40 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -161,12 +161,46 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
- int flags;
+ unsigned long flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+
+enum {
+ /* flags for mem_cgroup */
+ PCG_CACHE, /* charged as cache */
+ /* flags for LRU placement ...Looks good to me Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> provided we push in the lockless ones too :) -- Balbir --
Some obvious optimization to memcg.
I found mem_cgroup_charge_statistics() is a little big (in object) and
does unnecessary address calclation.
This patch is for optimization to reduce the size of this function.
And res_counter_charge() is 'likely' to success.
Changelog v3->v4:
- merged with an other leaf patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -66,11 +66,10 @@ struct mem_cgroup_stat {
/*
* For accounting under irq disable, no need for increment preempt count.
*/
-static void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat *stat,
+static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
enum mem_cgroup_stat_index idx, int val)
{
- int cpu = smp_processor_id();
- stat->cpustat[cpu].count[idx] += val;
+ stat->count[idx] += val;
}
static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
@@ -237,18 +236,21 @@ static void mem_cgroup_charge_statistics
{
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
+ struct mem_cgroup_stat_cpu *cpustat;
VM_BUG_ON(!irqs_disabled());
+
+ cpustat = &stat->cpustat[smp_processor_id()];
if (PageCgroupCache(pc))
- __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
else
- __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
if (charge)
- __mem_cgroup_stat_add_safe(stat,
+ __mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
else
- __mem_cgroup_stat_add_safe(stat,
+ __mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
}
@@ ...Looks good to me Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> -- Balbir --
This patch provides a function to move account information of a page between
mem_cgroups.
This moving of page_cgroup is done under
- lru_lock of source/destination mem_cgroup is held.
- lock_page_cgroup() is held.
Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should
confirm pc->mem_cgroup is still valid or not. Typlical code can be following.
(while page is not under lock_page())
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc)
spin_lock_irqsave(&mz->lru_lock);
if (pc->mem_cgroup == mem)
...../* some list handling */
spin_unlock_irq(&mz->lru_lock);
Or better way is
lock_page_cgroup(pc);
....
unlock_page_cgroup(pc);
But you should confirm the nest of lock and avoid deadlock.
(trylock is better if it's ok.)
If you find page_cgroup from mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
Changelog: (v4) -> (v5)
- check for lock_page() is removed.
- rewrote description.
Changelog: (v2) -> (v4)
- added lock_page_cgroup().
- splitted out from new-force-empty patch.
- added how-to-use text.
- fixed race in __mem_cgroup_uncharge_common().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 81 insertions(+), 3 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -426,6 +426,7 @@ int task_in_mem_cgroup(struct task_struc
void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
{
struct page_cgroup *pc;
+ struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
@@ -444,9 +445,14 @@ void mem_cgroup_move_lists(struct page *
pc = page_get_page_cgroup(page);
if (pc) {
+ mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
...I'm sorry, but I've not been convinced yet why these checks are needed here. (Those checks are removed by [9/12] anyway.) IIUC, pc->mem_cgroup is moved to another group only by mem_cgroup_move_account under lock_page_cgroup( and mz->lru_lock). And those two above(mem_cgroup_move_lists and __mem_cgroup_uncharge_common) sets mem = pc->mem_cgroup under lock_page_cgroup, so I don't think those checks (mem != pc->mem_cgroup) is needed. Thanks, --
On Fri, 26 Sep 2008 16:30:50 +0900 you're right. Thanks, --
What is the interface for moving the accounting? Is it an explicit call like force_empty? The other concern I have is about merging two LRU lists, when we move LRU pages from one mem_cgroup to another, where do we add them? To the head or tail? I think we need to think about it and document it well. The other thing is that once we have mult-hierarchy support (which we really Please BUG_ON() if the charging fails, we can be sure we catch assumptions that Coding style above is broken. Can this race really occur? Why do we get mem -- Balbir --
Hmm, good point. considering force_empty, head is better. (But I don't think about this seriously because I assumed root cgroup is unlimited) I don't think this kind of internal behavior should not be documented as UI yes. of course. please do as you like if this goes. adding logic to do so needs more patch, so please wait. I'll remove this charge() and change protocol to be yes, I think so. please check SwapCache behavior. It's costly to call page_cgroup_zoneinfo() again. so just saves mem and compare pc->mem_cgroup. Thanks, --
Current force_empty of memory resource controller just removes page_cgroup.
This maans the page is never accounted at all and create an in-use page which
has no page_cgroup.
This patch tries to move account to "root" cgroup. By this patch, force_empty
doesn't leak an account but move account to "root" cgroup. Maybe someone can
think of other enhancements as moving account to its parent.
(But moving to the parent means we have to handle "limit" of pages.
Need more complicated work to do that.)
For now, just moves account to root cgroup.
Note: all lock other than old mem_cgroup's lru_lock
in this path is try_lock().
Changelog (v4) -> (5)
- removed yield()
- remove lock_page().
- use list_for_each_entry_safe() instead of list_empty() loop.
- check list is empty or not rather than see usage.
- added lru_add_drain_all() at the start of loops.
Changelog (v2) -> (v4)
- splitted out mem_cgroup_move_account().
- replaced get_page() with get_page_unless_zero().
(This is necessary for avoiding confliction with migration)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Documentation/controllers/memory.txt | 7 ++-
mm/memcontrol.c | 68 ++++++++++++++++++++---------------
2 files changed, 43 insertions(+), 32 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -29,10 +29,12 @@
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/spinlock.h>
+#include <linux/pagemap.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
+#include <linux/writeback.h>
#include <asm/uaccess.h>
@@ -979,45 +981,34 @@ int mem_cgroup_resize_limit(struct mem_c
/*
- * This routine traverse page_cgroup in given list and drop them all.
- * *And* this routine doesn't reclaim page itself, just ...Allocate all page_cgroup at boot and remove page_cgroup poitner
from struct page. This patch adds an interface as
struct page_cgroup *lookup_page_cgroup(struct page*)
All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.
Remove page_cgroup pointer reduces the amount of memory by
- 4 bytes per PAGE_SIZE.
- 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)
meta data usage of this is no problem in FLATMEM/DISCONTIGMEM.
On SPARSEMEM, this makes mem_section[] size twice.
On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
(and uses xx kbytes for mem_section.)
I think this reduction makes sense.
By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
This means
- we're not necessary to be afraid of kmalloc faiulre.
(this can happen because of gfp_mask type.)
- we can avoid calling kmalloc/kfree.
- we can avoid allocating tons of small objects which can be fragmented.
- we can know what amount of memory will be used for this extra-lru handling.
I added printk message as
"allocated %ld bytes of page_cgroup"
"please try cgroup_disable=memory option if you don't want"
maybe enough informative for users.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 11 -
include/linux/mm_types.h | 4
include/linux/mmzone.h | 9 +
include/linux/page_cgroup.h | 90 +++++++++++++++
mm/Makefile | 2
mm/memcontrol.c | 258 ++++++++++++--------------------------------
mm/page_alloc.c | 12 --
mm/page_cgroup.c | 253 +++++++++++++++++++++++++++++++++++++++++++
8 files changed, 431 insertions(+), 208 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc7+/mm/page_cgroup.c
@@ -0,0 +1,253 @@
+#include ...I thought the use of this variable was under the +#ifdef CONFIG_FLAT_NODE_MEM_MAP options. Otherwise, we unconditionally bloat mem_section, right? -- Dave --
On Thu, 25 Sep 2008 11:40:47 -0700 Hmmm......Oh, yes ! nice catch. Thanks, I'll fix. -Kame --
On Fri, 26 Sep 2008 10:17:54 +0900 But in reality, this is under CONFIG_SPARSEMEM and if CONFIG_SPARSEMEM, FLAT_NODE_MEM_MAP is not true (I think). Hmm..Maybe I shouldn't use checking CONFIG_FLAT_NODE_MEM_MAP and should just use CONFIG_SPARSEMEM check. I'll rewrite. Thanks, -Kame --
Hmm, who uses this function? Just for clarification, in what sequence will the page be mapped here? Thanks, Daisuke Nishimura. --
On Fri, 26 Sep 2008 10:00:22 +0900
Uh, ok. unnecessary. (In my first version, this allocation error
just shows Warning. Now, it panics.)
Please think about folloing situation.
There is a SwapCache which is referred from 2 process, A, B.
A maps it.
B doesn't maps it.
And now, process A exits.
CPU0(process A) CPU1 (process B)
zap_pte_range()
=> page remove from rmap => charge() (do_swap_page)
=> set page->mapcount->0
=> uncharge() => set page->mapcount=1
This race is what patch 12/12 is fixed.
Thank you for review.
Regards,
-Kame
--
On Fri, 26 Sep 2008 10:43:36 +0900 Sorry, my brain seems to be sleeping.. above page_mapped() check doesn't help this situation. Maybe this page_mapped() check is not necessary because it's of no use. I think this kind of problem will not be fixed until we handle SwapCache. Thanks, -Kame --
I've not fully understood yet what [12/12] does, but if we handle swapcache properly, [12/12] would become unnecessary? If so, how about handling swapcache instead of adding new interface? I think it can be done independent of mem+swap. Thanks, Daisuke Nishimura. --
On Fri, 26 Sep 2008 14:54:22 +0900 Hmm, worth to be considered. But I'll reuse the interface itself for othres (shmem, migrate, move_account etc) But, in previous trial of SwapCache handling, we saw many troubles. Then, I'd like to go carefully step by step to handle that. Thanks, -Kame --
On Fri, 26 Sep 2008 14:54:22 +0900
Try to illustrate what is trouble more precisely.
in do_swap_page(), page is charged when SwapCache lookup ends.
Here,
- charged when page is not mapped.
- not charged when page is mapped.
set_pte() etc...are done under appropriate lock.
On the other side, when a task exits, zap_pte_range() is called.
It calls page_remove_rmap().
Case A) Following is race.
Thread A Thread B
do_swap_page() zap_pte_range()
(1)try charge (mapcount=1)
(2) page_remove_rmap()
(3) uncharge page.
(4) map it
Then,
at (1), mapcount=1 and this page is not charged.
at (2), page_remove_rmap() is called and mapcount goes down to Zero.
uncharge(3) is called.
at (4), at the end of do_swap_page(), page->mapcount=1 but not charged.
Case B) In another scenario.
Thread A Thread B
do_swap_page() zap_pte_range()
(1)try charge (mapcount=1)
(2) page_remove_rmap()
(3) map it
(4) uncharge is called.
In (4), uncharge is capped but mapcount can go up to 1.
protocol 12/12 is for case (A).
After 12/12, double-check page_mapped() under lock_page_cgroup() will be fix to
case (B).
Huu, I don't like swap-cache ;)
Anyway, we'll have to handle swap cache later.
Thanks,
-Kame
--
On Fri, 26 Sep 2008 10:43:36 +0900 I think I saw page_mapped() case (in your shmem/page test.) I'll check what cause this if I have time. Thanks, -Kame --
Because of terrible compile error in FLATMEM/DISCONTIGMEM, I post
fixed version (and avoid patch bomb again.)
I post replacements for 9(fix), and 10,11,12 (adjustment for HUNK).
==
Allocate all page_cgroup at boot and remove page_cgroup poitner
from struct page. This patch adds an interface as
struct page_cgroup *lookup_page_cgroup(struct page*)
All FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.
Remove page_cgroup pointer reduces the amount of memory by
- 4 bytes per PAGE_SIZE.
- 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)
On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.
By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
This means
- we're not necessary to be afraid of kmalloc faiulre.
(this can happen because of gfp_mask type.)
- we can avoid calling kmalloc/kfree.
- we can avoid allocating tons of small objects which can be fragmented.
- we can know what amount of memory will be used for this extra-lru handling.
I added printk message as
"allocated %ld bytes of page_cgroup"
"please try cgroup_disable=memory option if you don't want"
maybe enough informative for users.
TODO:
- small race window still remains in do_swap_page(). will be fixed by
later patch in series.
Changelog: v5 -> v6.
* removed "ctype" from uncharge.
* improved comment to show FLAT_NODE_MEM_MAP == !SPARSEMEM
* fixed errors in !SPARSEMEM codes
* removed unused function in !SPARSEMEM codes.
(start from v5 because of series..)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 11 -
include/linux/mm_types.h | 4
include/linux/mmzone.h | 14 ++
include/linux/page_cgroup.h | 90 +++++++++++++++
mm/Makefile | 2
mm/memcontrol.c | 264 ...adjustment for changes in 9/12(fixed)
==
Free page_cgroup from its LRU in batched manner.
When uncharge() is called, page is pushed onto per-cpu vector and
removed from LRU, later.. This routine resembles to global LRU's pagevec.
This patch is half of the whole patch and a set with following lazy LRU add
patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 164 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 152 insertions(+), 12 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -36,6 +36,7 @@
#include <linux/mm_inline.h>
#include <linux/writeback.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include <asm/uaccess.h>
@@ -531,6 +532,116 @@ out:
return ret;
}
+
+#define MEMCG_PCPVEC_SIZE (14) /* size of pagevec */
+struct memcg_percpu_vec {
+ int nr;
+ int limit;
+ struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ VM_BUG_ON(PageCgroupUsed(pc));
+ mz = page_cgroup_zoneinfo(pc);
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&pc->mem_cgroup->css);
+ pc->mem_cgroup = NULL;
+ }
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = ...Fixed HUNK with 9/12(fixed)
==
Delaying add_to_lru() and do it in batched manner like page_vec.
For doing that 2 flags PCG_USED and PCG_LRU.
If PCG_LRU is set, page is on LRU. It safe to access LRU via page_cgroup.
(under some lock.)
For avoiding race, this patch uses TestSetPageCgroupUsed().
and checking PCG_USED bit and PCG_LRU bit in add/free vector.
By this, lock_page_cgroup() in mem_cgroup_charge() is removed.
(I don't want to call lock_page_cgroup() under mz->lru_lock when
add/free vector core logic. So, TestSetPageCgroupUsed() logic is added.
TestSet is an easy way to avoid unneccesary nest of locks.)
Changelog: v3 -> v5.
- removed css_get/put per page_cgroup struct.
Now, *new* force_empty checks there is page_cgroup on the memcg.
We don't need to be afraid of leak.
Changelog: v2 -> v3
- added TRANSIT flag and removed lock from core logic.
Changelog: v1 -> v2:
- renamed function name from use_page_cgroup to set_page_cgroup_lru().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/page_cgroup.h | 10 +++
mm/memcontrol.c | 121 +++++++++++++++++++++++++++++++-------------
2 files changed, 96 insertions(+), 35 deletions(-)
Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -24,6 +24,7 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
+ PCG_LRU, /* this is on LRU */
/* flags for LRU placement */
PCG_ACTIVE, /* page is active in this cgroup */
PCG_FILE, /* page is file system backed */
@@ -42,11 +43,20 @@ static inline void SetPageCgroup##uname(
static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
{ clear_bit(PCG_##lname, &pc->flags); }
+#define TESTSETPCGFLAG(uname, lname)\
+static inline int ...I like the changelog very much, it tells exactly what we set out to do. Thanks Should we check for slab_is_available() before calling kmalloc_node? Otherwise, I like this code very much, even though I've not tested it. I think it combines all we've been thinking and discussing very well (from my initial flatmem based Yay! I think we should add some documentation saying, if you think about extending struct page, think again, look at how we did it for the memory I don't understand the preempt_disable() is it for atomic operations? I would like to split the removal of struct page from atomic operations for performance. Except for the the performance changes for atomic operations, I like this patch very much. For the page removal portions Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> as a co-maintainer -- Balbir --
On Wed, 01 Oct 2008 09:33:53 +0530 That check is not necessary. Because we use kmalloc()/vmalloc() in mem_cgroup_create(), after this. (We assume usual page allocateor is available here. For FLATMEM, size of ???? it's usually not accepted. I've seen such case very much. Memory hotremoval is not related to this. This preempt_disable() is avoidance for deadlock. (I thought) lock_page_cgroup(); -> spin_lock_irqsave(&mz->lru_lock); -> spin_unlock_irqrestore(&mz->lru_lock); unlock_page_cgroup(); Did you see performance regression by atomic ops ? I'll rewrite the whole stack from now for your request. Thanks, --
On Wed, 1 Oct 2008 14:07:48 +0900 BTW, do you have good idea to modify flag bit without affecting LOCK bit on page_cgroup->flags ? At least, we'll have to set ACTIVE/INACTIVE/UNEVICTABLE flags dynamically. take lock_page_cgroup() always ? __mem_cgroup_move_lists() will have some amount of changes. And we should check dead lock again. Thanks, -Kame --
In the past patches, I've used lock_page_cgroup(). In some cases like initialization at boot time, I've ignored taking the lock, since I know no-one __mem_cgroup_move_lists() is called from mem_cgroup_isolate_pages() and mem_cgroup_move_lists(). In mem_cgroup_move_lists(), we have the page_cgroup lock. I think the current code works on the assumption (although not documented anywhere I've seen), that PAGE_CGROUP_FLAG_INACTIVE/ACTIVE/UNEVICTABLE bits are protected by lru_lock. Please look at __mem_cgroup_remove_list __mem_cgroup_add_list __mem_cgroup_move_lists __mem_cgroup_charge_common (sets this flag, before the pc is associated with the page). Am I missing something (today is holiday here, so I am in a bit of a lazy/sleepy mood :) ) -- Balbir --
On Wed, 01 Oct 2008 11:29:04 +0530 But my point is lru_lock doesn't means page_cgroup is not locked by someone and we must take always lock_page_cgroup() when we modify flags. Then, mem_cgroup_isolate_page() should have to take lock. But this means we have to care preemption for avoiding deadlock. Maybe need some time to test. Thanks, -Kame --
I don't think you'll need to do a major rewrite? Will you? My concern is that this patch does too much to be a single patch. Consider someone trying to do a git-bisect to identify a problem? It is hard to review as well and I think the patch that just removes struct page member can go in faster. It will be easier to test/debug as well, we'll know if the problem is because of new page_cgroup being outside struct page rather then guessing if it was the atomic ops that caused the problem. -- Balbir --
On Wed, 01 Oct 2008 11:00:42 +0530 yes. LOCK bit is on page_cgroup->flags bit. Then, I'll check all places which modify page_cgroup->flags again and take lock always. atomic_ops patch just rewrite exisiting behavior. Thanks, -Kame --
On Wed, 1 Oct 2008 14:41:50 +0900 please forgive me to post v6 today, which passed 24h+ tests. v5 is a week old. Discussion about patch order is welcome. But please give me hint. Thanks, -Kame --
That sounds impressive. I'll test and review v6. -- Balbir --
Free page_cgroup from its LRU in batched manner.
When uncharge() is called, page is pushed onto per-cpu vector and
removed from LRU, later.. This routine resembles to global LRU's pagevec.
This patch is half of the whole patch and a set with following lazy LRU add
patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 164 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 152 insertions(+), 12 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -36,6 +36,7 @@
#include <linux/mm_inline.h>
#include <linux/writeback.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include <asm/uaccess.h>
@@ -533,6 +534,116 @@ out:
return ret;
}
+
+#define MEMCG_PCPVEC_SIZE (14) /* size of pagevec */
+struct memcg_percpu_vec {
+ int nr;
+ int limit;
+ struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ VM_BUG_ON(PageCgroupUsed(pc));
+ mz = page_cgroup_zoneinfo(pc);
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&pc->mem_cgroup->css);
+ pc->mem_cgroup = NULL;
+ }
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = &get_cpu_var(memcg_free_vec);
+ mpv->vec[mpv->nr++] = pc;
+ if (mpv->nr ...Delaying add_to_lru() and do it in batched manner like page_vec.
For doing that 2 flags PCG_USED and PCG_LRU.
If PCG_LRU is set, page is on LRU. It safe to access LRU via page_cgroup.
(under some lock.)
For avoiding race, this patch uses TestSetPageCgroupUsed().
and checking PCG_USED bit and PCG_LRU bit in add/free vector.
By this, lock_page_cgroup() in mem_cgroup_charge() is removed.
(I don't want to call lock_page_cgroup() under mz->lru_lock when
add/free vector core logic. So, TestSetPageCgroupUsed() logic is added.
This TestSet is an easy way to avoid unneccesary nest of locks.)
Changelog: v3 -> v5.
- removed css_get/put per page_cgroup struct.
Now, new force_empty checks there is page_cgroup on the memcg.
We don't need to be afraid of leak.
Changelog: v2 -> v3
- added TRANSIT flag and removed lock from core logic.
Changelog: v1 -> v2:
- renamed function name from use_page_cgroup to set_page_cgroup_lru().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/page_cgroup.h | 10 +++
mm/memcontrol.c | 121 +++++++++++++++++++++++++++++++-------------
2 files changed, 96 insertions(+), 35 deletions(-)
Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -24,6 +24,7 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
+ PCG_LRU, /* this is on LRU */
/* flags for LRU placement */
PCG_ACTIVE, /* page is active in this cgroup */
PCG_FILE, /* page is file system backed */
@@ -42,11 +43,20 @@ static inline void SetPageCgroup##uname(
static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
{ clear_bit(PCG_##lname, &pc->flags); }
+#define TESTSETPCGFLAG(uname, lname)\
+static inline int TestSetPageCgroup##uname(struct page_cgroup ...There is a small race in do_swap_page(). When the page swapped-in is charged, the mapcount can be greater than 0. But, at the same time some process (shares it ) call unmap and make mapcount 1->0 and the page is uncharged. For fixing this, I added a new interface. - precharge account to res_counter by PAGE_SIZE and try to free pages if necessary. - commit register page_cgroup and add to LRU if necessary. - cancel uncharge PAGE_SIZE because of do_swap_page failure. This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting. Usual mem_cgroup_charge_common() does precharge -> commit at a time. These precharge/commit/cancel is useful and can be used for other places, - shmem, (and other places need precharge.) - migration - move_account(force_empty) etc... etc..we'll revisit later. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> include/linux/memcontrol.h | 21 +++++++ mm/memcontrol.c | 135 +++++++++++++++++++++++++++++++-------------- mm/memory.c | 6 +- 3 files changed, 120 insertions(+), 42 deletions(-) Index: mmotm-2.6.27-rc7+/include/linux/memcontrol.h =================================================================== --- mmotm-2.6.27-rc7+.orig/include/linux/memcontrol.h +++ mmotm-2.6.27-rc7+/include/linux/memcontrol.h @@ -31,6 +31,13 @@ struct mm_struct; extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); +/* for swap handling */ +extern int mem_cgroup_precharge(struct mm_struct *mm, + gfp_t gfp_mask, struct mem_cgroup **ptr); +extern void mem_cgroup_commit_charge_swap(struct page *page, + struct mem_cgroup *ptr); +extern void mem_cgroup_cancel_charge_swap(struct mem_cgroup *ptr); + extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask); extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru); @@ -85,6 +92,20 @@ static inline int mem_cgroup_cache_charg return ...
I got general protection fault. (log from dump) general protection fault: 0000 [1] SMP last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map CPU 0 Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc microcode dm_mirror dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac lp sg e1000 ide_cd_mod cdrom button acpi_memhotp lug serio_raw rtc_cmos parport_pc rtc_core parport rtc_lib i2c_i801 i2c_core pcspkr shpchp ata_piix libata megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci _hcd Pid: 8001, comm: shmem_test_02 Tainted: G W 2.6.27-rc7-mm1-7eacf5c9 #1 RIP: 0010:[<ffffffff802a0ebb>] [<ffffffff802a0ebb>] __mem_cgroup_move_lists+0x8b/0xa2 RSP: 0018:ffff8800bb4ad888 EFLAGS: 00010046 RAX: ffff88010b253080 RBX: ffff88010c67d618 RCX: dead000000100100 RDX: dead000000200200 RSI: ffff88010b253088 RDI: ffff88010c67d630 RBP: 0000000000000000 R08: ffff88010fc020a3 R09: 000000000000000f R10: ffffffff802a204a R11: 00000000fffffffa R12: ffff88010b253080 R13: 0000000000000000 R14: ffff8800bb4ad9c8 R15: 0000000000000000 FS: 00007f4600faa6f0(0000) GS:ffffffff80638900(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000033af86c027 CR3: 00000000c1549000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process shmem_test_02 (pid: 8001, threadinfo ffff8800bb4ac000, task ffff880107d21470) Stack: ffffe200028ef8b0 0000000000000082 ffff88010c67d618 ffffffff802a1cb9 ffff880000016f80 0000000000000000 ffff880000016f80 ffffe200028ef888 ffff8800bb4adb38 ffffffff8027dd09 ffffc20001859000 0000000000000000 Call Trace: [<ffffffff802a1cb9>] mem_cgroup_move_lists+0x50/0x74 [<ffffffff8027dd09>] shrink_list+0x443/0x4ff [<ffffffff8027e04e>] shrink_zone+0x289/0x315 [<ffffffff802805d2>] congestion_wait+0x74/0x80 ...
On Fri, 26 Sep 2008 11:32:28 +0900
Thank you.
How about following ?
-Kame
==
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -597,8 +597,8 @@ __set_page_cgroup_lru(struct memcg_percp
spin_lock(&mz->lru_lock);
}
if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
- SetPageCgroupLRU(pc);
__mem_cgroup_add_list(mz, pc);
+ SetPageCgroupLRU(pc);
}
}
--
On Fri, 26 Sep 2008 11:58:10 +0900
Of course, remove side should be..
-Kame
==
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -564,8 +564,8 @@ __release_page_cgroup(struct memcg_percp
spin_lock(&mz->lru_lock);
}
if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
- __mem_cgroup_remove_list(mz, pc);
ClearPageCgroupLRU(pc);
+ __mem_cgroup_remove_list(mz, pc);
}
}
if (prev_mz)
@@ -597,8 +597,8 @@ __set_page_cgroup_lru(struct memcg_percp
spin_lock(&mz->lru_lock);
}
if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
- SetPageCgroupLRU(pc);
__mem_cgroup_add_list(mz, pc);
+ SetPageCgroupLRU(pc);
}
}
--
I'll test it with updated version of 9-11 and report you back. Thanks, Daisuke Nishimura. --
On Fri, 26 Sep 2008 12:00:19 +0900
Thank you. below is the new one...(Sorry!)
-Kame
==
Check LRU bit under lru_lock.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -340,11 +340,12 @@ void mem_cgroup_move_lists(struct page *
if (!trylock_page_cgroup(pc))
return;
- if (PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
+ if (PageCgroupUsed(pc)) {
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
+ if (PageCgroupLRU(pc))
+ __mem_cgroup_move_lists(pc, lru);
spin_unlock_irqrestore(&mz->lru_lock, flags);
}
unlock_page_cgroup(pc);
@@ -564,8 +565,8 @@ __release_page_cgroup(struct memcg_percp
spin_lock(&mz->lru_lock);
}
if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
- __mem_cgroup_remove_list(mz, pc);
ClearPageCgroupLRU(pc);
+ __mem_cgroup_remove_list(mz, pc);
}
}
if (prev_mz)
@@ -597,8 +598,8 @@ __set_page_cgroup_lru(struct memcg_percp
spin_lock(&mz->lru_lock);
}
if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
- SetPageCgroupLRU(pc);
__mem_cgroup_add_list(mz, pc);
+ SetPageCgroupLRU(pc);
}
}
--
Unfortunately, there remains some bugs yet... ------------[ cut here ]------------ WARNING: at lib/list_debug.c:51 list_del+0x5c/0x87() list_del corruption. next->prev should be ffff88010ca291e8, but was dead000000200200 Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc microcode dm_mirror dm_log dm_multipath dm_mod rfkill input_polldev sbs sbshc battery ac lp sg e1000 ide_cd_mod cdrom button acpi_memhotp lug parport_pc rtc_cmos rtc_core parport serio_raw rtc_lib i2c_i801 i2c_core shpchp pcspkr ata_piix libata megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci _hcd Pid: 3940, comm: bash Tainted: G W 2.6.27-rc7-mm1-dd8bf0fe #1 Call Trace: [<ffffffff8023ac52>] warn_slowpath+0xb4/0xd2 [<ffffffff8024c0ec>] prepare_to_wait_exclusive+0x38/0x5a [<ffffffff8024c089>] finish_wait+0x32/0x5d [<ffffffff8049ec80>] __wait_on_bit_lock+0x5b/0x66 [<ffffffff80272ff4>] __lock_page+0x5e/0x64 [<ffffffff8022c4f9>] target_load+0x2a/0x58 [<ffffffff8022cf59>] place_entity+0x85/0xb3 [<ffffffff8022f6db>] enqueue_entity+0x16e/0x18f [<ffffffff8027ff0a>] zone_statistics+0x3a/0x5d [<ffffffff8027ff0a>] zone_statistics+0x3a/0x5d [<ffffffff802788d3>] get_page_from_freelist+0x455/0x5bf [<ffffffff803402e8>] list_del+0x5c/0x87 [<ffffffff802a1530>] mem_cgroup_commit_charge+0x6f/0xdd [<ffffffff802a16ed>] mem_cgroup_charge_common+0x4c/0x62 [<ffffffff802835de>] handle_mm_fault+0x222/0x791 [<ffffffff8027ff0a>] zone_statistics+0x3a/0x5d [<ffffffff8028235a>] follow_page+0x2d/0x2c2 [<ffffffff80283e42>] __get_user_pages+0x2f5/0x3f3 [<ffffffff802a74ed>] get_arg_page+0x46/0xa5 [<ffffffff802a7724>] copy_strings+0xfc/0x1de [<ffffffff802a7827>] copy_strings_kernel+0x21/0x33 [<ffffffff802a8aff>] do_execve+0x140/0x256 [<ffffffff8020a495>] sys_execve+0x35/0x4c [<ffffffff8020c1ea>] stub_execve+0x6a/0xc0 ---[ end trace 4eaa2a86a8e2da22 ]--- ------------[ cut here ]------------ WARNING: at ...
On Fri, 26 Sep 2008 14:24:55 +0900
I confirmed I can reproduce this.
I found one chance to cause this. (and confirmed this happens by printk)
set_lru()..
TestSetPageCgroup(pc);
.... if (PageCgroupUsed(pc) && !PageCgroupLRU(pc))
pc->mem_cgroup = mem;
SetPageCgroupLRU();
__mem_cgroup_add_list();
Then, page_cgroup will be added to wrong LRU whic doesn't match pc->mem_cgroup.
But there is still more...still digging.
Thanks,
--
On Fri, 26 Sep 2008 14:24:55 +0900
Thank you for your patient good test!
I'm now testing following (and will do over-night test.)
In this an hour, this seems to work good.
(under your test which usually panics in 10-20min on my box.)
==
page_cgroup is not valid until pc->mem_cgroup is set to appropriate value.
There is a small race between Set-Used-Bit and Set-Proper-pc->mem_cgroup.
This patch tries to fix that by adding PCG_VALID bit
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/page_cgroup.h | 3 +++
mm/memcontrol.c | 22 ++++++++++++++--------
2 files changed, 17 insertions(+), 8 deletions(-)
Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -21,6 +21,7 @@ struct page_cgroup *lookup_page_cgroup(s
enum {
/* flags for mem_cgroup */
+ PCG_VALID, /* you can access this page cgroup */
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
@@ -50,6 +51,10 @@ static inline int TestSetPageCgroup##una
/* Cache flag is set only once (at allocation) */
TESTPCGFLAG(Cache, CACHE)
+TESTPCGFLAG(Valid, VALID)
+SETPCGFLAG(Valid, VALID)
+CLEARPCGFLAG(Valid, VALID)
+
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)
TESTSETPCGFLAG(Used, USED)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -340,7 +340,7 @@ void mem_cgroup_move_lists(struct page *
if (!trylock_page_cgroup(pc))
return;
- if (PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
+ if (PageCgroupValid(pc) && PageCgroupLRU(pc)) {
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
@@ -434,7 +434,7 @@ unsigned long ...On Fri, 26 Sep 2008 19:43:09 +0900 This helps some, but causes other troubles. (But I know what it is.) I'll add lock_page_cgroup() to charge side and see what happens. Thanks, --
Kame, I'm beginning to review test the patches now. It would be really nice to split the development patches from the maintenance ones. I think the full patchset has too many things and is confusing to look at. -- Balbir --
On Fri, 26 Sep 2008 13:48:58 +0530 I hope I can do....but maybe difficult. If you give me ack, 1,2,4,6, can be pushed at early stage. Thanks, -Kame --
I think (1) might be OK, except for the accounting issues pointed out (change in behaviour visible to end user again, sigh! :( ). Is (1) a serious issue? (2) seems OK, except for the locking change for mark_page_accessed. I am looking at (4) and (6) currently. -- Balbir --
On Fri, 26 Sep 2008 15:01:46 +0530 Thanks, --
On Fri, 26 Sep 2008 19:36:02 +0900
I'll do in following way in the next Monday.
Divide patches into 2 set
in early fix/optimize set.
- push (2)
- push (4)
- push (6)
- push (1)
drops (3).
I don't want to remove all? pages-never-on-LRU before fixing force_empty.
in updates
- introduce atomic flags. (5)
- add move_account() function (7)
- add memory.attribute to each memcg dir. (NEW)
- enhance force_empty (was (8))
- remove "forget all" logic. and add attribute to select following 2 behavior
- call try_to_free_page() until the usage goes down to 0.
This allows faiulre (if page is mlocked, we can't do.). (NEW)
- call move_account() to move all charges to its parent (as much as possible) (NEW)
In future, I'd liket to add trash-box cgroup for force_empty somewhere.
- allocate all page cgroup at boot (9)
- lazy lru free/add (10,11) with fixes.
- fix race at charging swap. (12)
After (9), all page and page_cgroup has one-to-one releationship and we want to
assume that "if page is alive and on LRU, it's accounted and has page_cgroup."
(other team, bio cgroup want to use page_cgroup and I want to make it easy.)
For this, fix to behavior of force_empty..."forget all" is necessary.
SwapCache handling is also necessary but I'd like to postpone until next set
because it's complicated.
After above all.
- handle swap cache
- Mem+Swap controller.
- add trashbox feature ?
- add memory.shrink_usage_to file.
It's long way to what I really want to do....
Thanks,
-Kame
-
--
Yes a long way to go, I want to add 1) Multi-hierarchy support 2) Support for soft-limits 3) get swappiness working (there are patches posted for it by Yamamoto-San, but -- Balbir --
On Mon, 29 Sep 2008 08:32:07 +0530 I'll add -EBUSY behavior to force_empty. I'm now adding(merging) code to move_account.patch for supporing force_empty. Thanks. no major changes in my current stack from already posted one. Thanks, --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer |
