Control the maximum amount of dirty pages a cgroup can have at any given time. Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, they will not be able to consume more than their designated share of dirty pages and will be forced to perform write-out if they cross that limit. The overall design is the following: - account dirty pages per cgroup - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes and memory.dirty_background_ratio / memory.dirty_background_bytes in cgroupfs - start to write-out (background or actively) when the cgroup limits are exceeded This feature is supposed to be strictly connected to any underlying IO controller implementation, so we can stop increasing dirty pages in VM layer and enforce a write-out before any cgroup will consume the global amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits. Changelog (v6 -> v7) ~~~~~~~~~~~~~~~~~~~~~~ * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup() is never called under tree_lock (no strict accounting, but better overall performance) * do not account file cache statistics for the root cgroup (zero overhead for the root cgroup) * fix: evaluate cgroup free pages as at the minimum free pages of all its parents Results ~~~~~~~ The testcase is a kernel build (2.6.33 x86_64_defconfig) on a Intel Core 2 @ 1.2GHz: <before> - root cgroup: 11m51.983s - child cgroup: 11m56.596s <after> - root cgroup: 11m51.742s - child cgroup: 12m5.016s In the previous version of this patchset, using the "complex" locking scheme with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the child cgroup required 11m57.896s and 12m9.920s with lock_page_cgroup()+irq_disabled. With this version there's no overhead for the root cgroup (the ...
Document cgroup dirty memory interfaces and statistics. Signed-off-by: Andrea Righi <arighi@develer.com> --- Documentation/cgroups/memory.txt | 36 ++++++++++++++++++++++++++++++++++++ 1 files changed, 36 insertions(+), 0 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 49f86f3..38ca499 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -310,6 +310,11 @@ cache - # of bytes of page cache memory. rss - # of bytes of anonymous and swap cache memory. pgpgin - # of pages paged in (equivalent to # of charging events). pgpgout - # of pages paged out (equivalent to # of uncharging events). +filedirty - # of pages that are waiting to get written back to the disk. +writeback - # of pages that are actively being written back to the disk. +writeback_tmp - # of pages used by FUSE for temporary writeback buffers. +nfs - # of NFS pages sent to the server, but not yet committed to + the actual storage. active_anon - # of bytes of anonymous and swap cache memory on active lru list. inactive_anon - # of bytes of anonymous memory and swap cache memory on @@ -345,6 +350,37 @@ Note: - a cgroup which uses hierarchy and it has child cgroup. - a cgroup which uses hierarchy and not the root of hierarchy. +5.4 dirty memory + + Control the maximum amount of dirty pages a cgroup can have at any given time. + + Limiting dirty memory is like fixing the max amount of dirty (hard to + reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, + they will not be able to consume more than their designated share of dirty + pages and will be forced to perform write-out if they cross that limit. + + The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*. + It is possible to configure a limit to trigger both a direct writeback or a + background writeback performed by per-bdi flusher threads. + + Per-cgroup dirty limits can be set using the ...
It would be better to note that what those files of root cgroup mean. We cannot write any value to them, IOW, we cannot control dirty limit about root cgroup. And they show the same value as the global one(strictly speaking, it's not true because global values can change. We need a hook in mem_cgroup_dirty_read()?). Thanks, --
On Mon, Mar 15, 2010 at 11:41 PM, Daisuke Nishimura Should these new memory.stat counters (filedirty, etc) report byte counts rather than page counts? I am thinking that byte counters would make reporting more obvious depending on how heterogeneous page sizes are used. Byte counters would also agree with /proc/meminfo. Within the kernel we could still maintain page counts. The only change would be to the reporting routine, mem_cgroup_get_local_stat(), which would scale the page counts by PAGE_SIZE as it does for for cache,rss,etc. -- Greg --
I agree, byte counts would be better than page counts. pgpin and pgpout are special cases where the pages matter, the size does not due to the nature of the operation. -- Three Cheers, Balbir --
OK, we can just return system-wide value if mem_cgroup_is_root() in mem_cgroup_dirty_read(). Will change this in the next version. Thanks, -Andrea --
Introduce page_cgroup flags to keep track of file cache pages.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
include/linux/page_cgroup.h | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index bf9a913..65247e4 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -40,7 +40,11 @@ enum {
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
/* for cache-status accounting */
- PCG_FILE_MAPPED,
+ PCG_FILE_MAPPED, /* page is accounted as file rss*/
+ PCG_FILE_DIRTY, /* page is dirty */
+ PCG_FILE_WRITEBACK, /* page is being written back to disk */
+ PCG_FILE_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
+ PCG_FILE_UNSTABLE_NFS, /* NFS page not yet committed to the server */
};
#define TESTPCGFLAG(uname, lname) \
@@ -83,6 +87,22 @@ TESTPCGFLAG(FileMapped, FILE_MAPPED)
SETPCGFLAG(FileMapped, FILE_MAPPED)
CLEARPCGFLAG(FileMapped, FILE_MAPPED)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+TESTPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+SETPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+CLEARPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
static inline int page_cgroup_nid(struct page_cgroup *pc)
{
return page_to_nid(pc->page);
--
1.6.3.3
--
Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:
- Direct write-out: memory.dirty_ratio, memory.dirty_bytes
- Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
Signed-off-by: Andrea Righi <arighi@develer.com>
---
include/linux/memcontrol.h | 92 ++++++++-
mm/memcontrol.c | 484 +++++++++++++++++++++++++++++++++++++++++---
2 files changed, 540 insertions(+), 36 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 88d3f9e..0602ec9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,55 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
#include <linux/cgroup.h>
+
struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+ MEMCG_NR_DIRTYABLE_PAGES,
+ MEMCG_NR_RECLAIM_PAGES,
+ MEMCG_NR_WRITEBACK,
+ MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* File cache pages accounting */
+enum mem_cgroup_write_page_stat_item {
+ MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+ MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+ MEMCG_NR_FILE_WRITEBACK_TEMP, /* # of pages under writeback using
+ temporary buffers */
+ MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+
+ MEMCG_NR_FILE_NSTAT,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+ int dirty_ratio;
+ int dirty_background_ratio;
+ unsigned long dirty_bytes;
+ unsigned long dirty_background_bytes;
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+ param->dirty_ratio = vm_dirty_ratio;
+ param->dirty_bytes = ...On Mon, 15 Mar 2010 00:26:41 +0100 Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --
hmm, mem_cgroup_page_stat() can return negative value, but you place BUG_ON()
in [5/5] to check it returns negative value. What happens if the current is moved
to root between mem_cgroup_has_dirty_limit() and mem_cgroup_page_stat() ?
How about making mem_cgroup_has_dirty_limit() return the target mem_cgroup, and
I don't have a strong objection, but I prefer showing them in bytes.
And can you add to mem_cgroup_stat_show() something like:
for (i = 0; i < NR_MCS_STAT; i++) {
if (i == MCS_SWAP && !do_swap_account)
continue;
+ if (i >= MCS_FILE_STAT_STAR && i <= MCS_FILE_STAT_END &&
+ mem_cgroup_is_root(mem_cont))
+ continue;
cb->fill(cb, memcg_stat_strings[i].local_name, mystat.stat[i]);
}
not to show file stat in root cgroup ? It's meaningless value anyway.
Of course, you'd better mention it in [2/5] too.
Thanks,
--
On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote: Hmm, if mem_cgroup_has_dirty_limit() retrun pointer to memcg, then one shall have to use rcu_read_lock() and that will look ugly. Why don't we simply look at the return value and if it is negative, we fall back to using global stats and get rid of BUG_ON()? Or, modify mem_cgroup_page_stat() to return global stats if it can't determine per cgroup stat for some reason. (mem=NULL or root cgroup etc). Vivek --
On Tue, 16 Mar 2010 10:11:50 -0400 I don't have any objection as long as we don't hit BUG_ON. Thanks, --
I vote for this one. IMHO the caller of mem_cgroup_page_stat() should fallback to the equivalent global stats. This allows to keep the things Thanks, -Andrea --
On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote: Agreed. I would be definitely more clear. Balbir, KAME-san, what do you OK. Thanks, -Andrea --
I have a proposed change to "[PATCH -mmotm 4/5] memcg: dirty pages accounting
and limiting infrastructure" v6. The change is small and I am presenting it
as a git patch (below) to be applied after 4/5 v6 has been applied.
The change is fairly simple. An alternative would be to reject my
patch (below) and enhance get_vm_dirty_param() to loop for consistenty in all
cases.
---patch snip here, rest of email is git patch of 4/5 v6 ---
Removed unneeded looping from get_vm_dirty_param(). The only caller of
get_vm_dirty_param() gracefully handles inconsistent values, so there is no
need for get_vm_dirty_param() to loop to ensure consistency. The previous
looping was inconsistent because it did not loop for the case where memory
cgroups were disabled.
Signed-off-by: Greg Thelen <gthelen@google.com>
---
mm/memcontrol.c | 28 ++++++++++++----------------
1 files changed, 12 insertions(+), 16 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4d00c0f..990a907 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1081,6 +1081,10 @@ static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
* The function fills @param with the current memcg dirty memory settings. If
* memory cgroup is disabled or in case of error the structure is filled with
* the global dirty memory settings.
+ *
+ * Because global and memcg vm_dirty_param are not protected, inconsistent
+ * values may be returned. If consistent values are required, then the caller
+ * should call this routine until dirty_param_is_valid() returns true.
*/
void get_vm_dirty_param(struct vm_dirty_param *param)
{
@@ -1090,28 +1094,20 @@ void get_vm_dirty_param(struct vm_dirty_param *param)
get_global_vm_dirty_param(param);
return;
}
+
/*
* It's possible that "current" may be moved to other cgroup while we
* access cgroup. But precise check is meaningless because the task can
* be moved after our access and writeback tends to take long time.
* At least, "memcg" ...Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.
[ NOTE: for now do not account WritebackTmp pages (FUSE) and NILFS2
bounce pages. This depends on charging also bounce pages per cgroup. ]
As a bonus, make determine_dirtyable_memory() static again: this
function isn't used anymore outside page writeback.
Signed-off-by: Andrea Righi <arighi@develer.com>
---
fs/nfs/write.c | 4 +
include/linux/writeback.h | 2 -
mm/filemap.c | 1 +
mm/page-writeback.c | 215 ++++++++++++++++++++++++++++-----------------
mm/rmap.c | 4 +-
mm/truncate.c | 1 +
6 files changed, 141 insertions(+), 86 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53ff70e..3e8b9f8 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -440,6 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req)
NFS_PAGE_TAG_COMMIT);
nfsi->ncommit++;
spin_unlock(&inode->i_lock);
+ mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -451,6 +452,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req->wb_page;
if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1277,6 +1279,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
req = nfs_list_entry(head->next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+ mem_cgroup_dec_page_stat(req->wb_page,
+ MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
BDI_RECLAIMABLE);
diff ...On Mon, 15 Mar 2010 00:26:42 +0100 Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> "dirty" seems not to be necessary. if (get_dirty_writeback_pages() < dirty_thresh) ? Thanks, -Kame --
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Now, file-mapped is maintaiend. But more generic update function
will be needed for dirty page accounting.
For accountig page status, we have to guarantee lock_page_cgroup()
will be never called under tree_lock held.
To guarantee that, we use trylock at updating status.
By this, we do fuzzy accounting, but in almost all case, it's correct.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/memcontrol.h | 7 +++-
include/linux/page_cgroup.h | 15 +++++++
mm/memcontrol.c | 94 +++++++++++++++++++++++++++++++++----------
mm/rmap.c | 4 +-
4 files changed, 95 insertions(+), 25 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44301c6..88d3f9e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,7 +124,12 @@ static inline bool mem_cgroup_disabled(void)
return false;
}
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+enum mem_cgroup_page_stat_item {
+ MEMCG_NR_FILE_MAPPED,
+ MEMCG_NR_FILE_NSTAT,
+};
+
+void mem_cgroup_update_stat(struct page *page, int idx, bool charge);
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..bf9a913 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -39,6 +39,8 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
+ /* for cache-status accounting */
+ PCG_FILE_MAPPED,
};
#define TESTPCGFLAG(uname, lname) \
@@ -57,6 +59,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+/* Page/File stat ...On Mon, 15 Mar 2010 00:26:38 +0100 Bad patch title...."use trylock for safe accounting in some contexts" ? Thanks, --
OK, sounds better. I just copy & paste the email subject, but the title was probably related to the old lock_page_cgroup()+irq_disable patch. Thanks, -Andrea --
What happens if the trylock fails, do we lose the stat? -- Three Cheers, Balbir --
I don't like this at all, but in almost all cases is not acceptable for statistics, since decisions will be made on them and having them incorrect is really bad. Could we do a form of deferred statistics and fix this. -- Three Cheers, Balbir --
On Wed, 17 Mar 2010 17:28:55 +0530 plz show your implementation which has no performance regresssion. For me, I don't neee file_mapped accounting, at all. If we can remove that, we can add simple migration lock. file_mapped is a feattue you added. please improve it. Thanks, -Kame --
On Thu, 18 Mar 2010 08:54:11 +0900 BTW, I should explain how acculate this accounting is in this patch itself. Now, lock_page_cgroup/unlock_page_cgroup happens when - charge/uncharge/migrate/move accounting Then, the lock contention (trylock failure) seems to occur in conflict with - charge, uncharge, migarate. move accounting About dirty accounting, charge/uncharge/migarate are operation in synchronous manner with radix-tree (holding treelock etc). Then no account leak. move accounting is only source for inacculacy...but I don't think this move-task is ciritial....moreover, we don't move any file pages at task-move, now. (But Nishimura-san has a plan to do so.) So, contention will happen only at confliction with force_empty. About FILE_MAPPED accounting, it's not synchronous with radix-tree operaton. Then, accounting-miss seems to happen when charge/uncharge/migrate/account move. But charge .... we don't account a page as FILE_MAPPED before it's charged. uncharge .. usual operation in turncation is unmap->remove-from-radix-tree. Then, it's sequential in almost all case. The race exists when... Assume there are 2 threads A and B. A truncate a file, B map/unmap that. This is very unusal confliction. migrate.... we do try_to_unmap before migrating pages. Then, FILE_MAPPED is properly handled. move account .... we don't have move-account-mapped-file, yet. Then, this trylock contention happens at contention with force_empty and truncate. Then, main issue for contention is force_empty. But it's called for removing memcg, accounting for such memcg is not important. Then, I say "this accounting is Okay." To do more accurate, we may need another "migration lock". But to get better performance for root cgroup, we have to call mem_cgroup_is_root() before taking lock and there will be another complicated race. Bye, -Kame --
FILE_MAPPED is updated under pte lock. OTOH, move account is also done under pte lock. page cgroup lock is held under pte lock in both cases, so move account is not so problem as for FILE_MAPPED. Thanks, --
On Thu, 18 Mar 2010 11:16:53 +0900 HmmHmm, thank you. then, only racy cases are truncate and force_empty. Thanks, -Kame --
Agreed, we need to find a simpler way of doing this without affecting the accuracy of accounting - may be two accounting routines for two code paths. I have not thought through this yet. -- Three Cheers, Balbir --
That doesn't matter, if you need it, I think the larger user base matters. Unmapped and mapped page cache is critical and I use it I will, but please don't break it silently -- Three Cheers, Balbir --
On Thu, 18 Mar 2010 09:49:44 +0530 I never do silently. All e-mails are open. Thanks, -Kame --
No, I meant it from the Documentation update perspective. The end user will be confused when looking at stats without proper documentation. -- Three Cheers, Balbir --
On Thu, 18 Mar 2010 09:49:44 +0530 Andrea, could you go in following way ? - don't touch FILE_MAPPED stuff. - add new functions for other dirty accounting stuff as in this series. (using trylock is ok.) Then, no probelm. It's ok to add mem_cgroup_udpate_stat() indpendent from mem_cgroup_update_file_mapped(). The look may be messy but it's not your fault. But please write "why add new function" to patch description. I'm sorry for wasting your time. Thanks, -Kame --
Do we need to go down this route? We could check the stat and do the correct thing. In case of FILE_MAPPED, always grab page_cgroup_lock and for others potentially look at trylock. It is OK for different stats to be protected via different locks. /me takes a look at the code again. -- Three Cheers, Balbir --
On Thu, 18 Mar 2010 21:58:55 +0530 I _don't_ want to see a mixture of spinlock and trylock in a function. Thanks, -Kame --
A well documented well written function can help. The other thing is to of-course solve this correctly by introducing different locking around the statistics. Are you suggesting the later? -- Three Cheers, Balbir --
On Fri, 19 Mar 2010 08:10:39 +0530 No. As I wrote. - don't modify codes around FILE_MAPPED in this series. - add a new functions for new statistics Then, - think about clean up later, after we confirm all things work as expected. Thanks, -Kame --
I have ported Andrea Righi's memcg dirty page accounting patches to latest
mmtom-2010-04-05-16-09. In doing so I have to address this locking issue. Does
the following look good? I will (of course) submit the entire patch for review,
but I wanted make sure I was aiming in the right direction.
void mem_cgroup_update_page_stat(struct page *page,
enum mem_cgroup_write_page_stat_item idx, bool charge)
{
static int seq;
struct page_cgroup *pc;
if (mem_cgroup_disabled())
return;
pc = lookup_page_cgroup(page);
if (!pc || mem_cgroup_is_root(pc->mem_cgroup))
return;
/*
* This routine does not disable irq when updating stats. So it is
* possible that a stat update from within interrupt routine, could
* deadlock. Use trylock_page_cgroup() to avoid such deadlock. This
* makes the memcg counters fuzzy. More complicated, or lower
* performing locking solutions avoid this fuzziness, but are not
* currently needed.
*/
if (irqs_disabled()) {
if (! trylock_page_cgroup(pc))
return;
} else
lock_page_cgroup(pc);
__mem_cgroup_update_page_stat(pc, idx, charge);
unlock_page_cgroup(pc);
}
__mem_cgroup_update_page_stat() has a switch statement that updates all of the
MEMCG_NR_FILE_{MAPPED,DIRTY,WRITEBACK,WRITEBACK_TEMP,UNSTABLE_NFS} counters
using the following form:
switch (idx) {
case MEMCG_NR_FILE_MAPPED:
if (charge) {
if (!PageCgroupFileMapped(pc))
SetPageCgroupFileMapped(pc);
else
val = 0;
} else {
if (PageCgroupFileMapped(pc))
ClearPageCgroupFileMapped(pc);
else
val = 0;
}
idx = MEM_CGROUP_STAT_FILE_MAPPED;
break;
...
}
/*
* Preemption is already disabled. We can use __this_cpu_xxx
*/
if (val > 0) {
__this_cpu_inc(mem->stat->count[idx]);
} else if (val < 0) {
__this_cpu_dec(mem->stat->count[idx]);
}
In my current tree, irq is never saved/restored by cgroup locking code. To
protect against interrupt reentrancy, trylock_page_cgroup() is used. As ...On Tue, 13 Apr 2010 23:55:12 -0700
I prefer trylock_page_cgroup() always.
I have another idea fixing this up _later_. (But I want to start from simple one.)
My rough idea is following. Similar to your idea which you gave me before.
==
DEFINE_PERCPU(account_move_ongoing);
DEFINE_MUTEX(move_account_mutex):
void memcg_start_account_move(void)
{
mutex_lock(&move_account_mutex);
for_each_online_cpu(cpu)
per_cpu(cpu, account_move_ongoing) += 1;
mutex_unlock(&move_account_mutex);
/* Wait until there are no lockless update */
synchronize_rcu();
return;
}
void memcg_end_account_move(void)
{
mutex_lock(&move_account_mutex);
for_each_online_cpu(cpu)
per_cpu(cpu, account_move_ongoing) -= 1;
mutex_unlock(&move_account_mutex);
}
/* return 1 when we took lock, return 0 if lockess OPs is guarantedd to be safe */
int memcg_start_filecache_accounting(struct page_cgroup *pc)
{
rcu_read_lock();
smp_rmb();
if (!this_cpu_read(move_account_ongoing))
return 0; /* no move account is ongoing */
lock_page_cgroup(pc);
return 1;
}
void memcg_end_filecache_accounting(struct page_cgroup *pc, int unlock)
{
if (unlock)
unlock_page_cgroup(pc);
rcu_read_unlock();
}
and call memcg_start_account_move()/end_account_move() in the start/end of
migrainting chunk of pages.
Bye.
-Kame
--
Hi Kame-san, May be I am missing something but how does it solve the issue of making sure lock_page_cgroup() is not held in interrupt context? IIUC, above code will make sure that for file cache accouting, lock_page_cgroup() is taken only if task migration is on. But say task migration is on, and then some IO completes and we update WRITEBACK stat (i think this is the one which can be called from interrupt context), then we will still take the lock_page_cgroup() and again run into the issue of deadlocks? Thanks Vivek --
I agree. I think the lock/unlock_page_cgrpoup() calls suggested by Kame-san should also include local_irq_save/restore() calls to prevent the interrupt context deadlock Vivek describes. These new local_irq_save/restore() calls would only be used if move_account_ongoing is set. They behave just like the optional calls to lock/unlock_page_cgroup(). -- Greg --
On Wed, 14 Apr 2010 10:04:30 -0400 Yes and No. At "Set", IIRC, almost all updates against DIRTY and WRITBACK accountings can be done under mapping->tree_lock, which disables IRQ always. (ex. I don't mention TestSetPageWriteback but account_page_diritied().) Then, save/restore irq flags is just a cost and no benefit, in such cases. Of course, there are cases irqs doesn't enabled. About FILE_MAPPED, it's not updated under mapping->tree_lock. So, we'll have race with charge/uncharge. We have to take lock_page_cgroup(), always. So, I think we'll have 2 or 3 interfaces, finally. mem_cgroup_update_stat_fast() // the caller must disable IRQ and lock_page() and mem_cgroup_update_stat_locked() // the caller has lock_page(). and mem_cgroup_update_stat_safe() // the caller don't have to do anything. Why update_stat_fast() is for avoiding _unnecessary_ costs. When we lock a page and disables IRQ, we don't have to do anything. There are no races. But yes, it's complicated. I'd like to see what we can do. Thanks, -Kame --
On Wed, Apr 14, 2010 at 2:29 AM, KAMEZAWA Hiroyuki What is your reason for preferring trylock_page_cgroup()? I assume it's for code simplicity, but I wanted to check. I had though about using trylock_page_cgroup() always, but I think that would make file_mapped accounting even more fuzzy that it already it is. I was trying to retain the current accuracy of file_mapped and only make new counters, like writeback/dirty/etc (those obtained in Hi Kame-san, I like the general approach. The code I previously gave you appears to work and is faster than non-root memcgs using mmotm due to mostly What's the reason for having a per-cpu account_move_ongoing flag? Would a single system-wide global be sufficient? I assume the majority of the time this value will not be changing because accounting moves are rare. Perhaps all of the per-cpu variables are packed within a per-cpu cacheline making accessing it more likely to be local, but I'm not sure if this is true. -- Greg --
On Wed, 14 Apr 2010 09:22:41 -0700 file_mapped should have different interface as mem_cgroup_update_stat_verrrry_safe(). or some. I don't think accuracy is important (if it's doesn't go minus) but if people want, Yes. this value is rarely updated but update is not enough rare to put this value to read_mostly section. We see cacheline ping-pong by random placement of global variables. This is performance critical. Recent updates for percpu variables accessor makes access to percpu very efficient. I'd like to make use of it. Thanks, -Kame --
Good catch. I will replace irqs_disabled() with in_interrupt(). Thank you. -- Greg --
I think you should check both. __remove_from_page_cache(), which will update DIRTY, is called with irq disabled(iow, under mapping->tree_lock) but not in interrupt context. Anyway, I tend to agree with KAMEZAWA-san: use trylock always(except for FILE_MAPPED), or add some new interfaces(e.g. mem_cgroup_update_stat_locked/safe...). Thanks, Daisuke Nishimura. --
On Wed, Apr 14, 2010 at 7:40 PM, Daisuke Nishimura The only reason to use trylock in this case is to prevent deadlock when running in a context that may have preempted or interrupted a routine that already holds the bit locked. In the __remove_from_page_cache() irqs are disabled, but that does not imply that a routine holding the spinlock has been preempted. When the bit is locked, preemption is disabled. The only way to interrupt a holder of the bit for an interrupt to occur (I/O, timer, etc). So I think Thank you for the input. I'm thinking more on this. -- Greg --
IIUC, it's would be enough to prevent deadlock where one CPU tries to acquire the same page cgroup lock. But there is still some possibility where 2 CPUs can cause dead lock each other(please see the commit e767e056). IOW, my point is "don't call lock_page_cgroup() under mapping->tree_lock". Thanks, Daisuke Nishimura. --
On Wed, Apr 14, 2010 at 11:21 PM, Daisuke Nishimura I see your point. Thank you for explaining. -- Greg --
On Thu, 15 Apr 2010 15:21:04 +0900 Hmm, maybe worth to try. We may be able to set/clear all DIRTY/WRITBACK bit on page_cgroup without mapping->tree_lock. In such case, of course, the page itself should be locked by lock_page(). But.Hmm..for example. account_page_dirtied() is the best place to mark page_cgroup dirty. But it's called under mapping->tree_lock. Another thinking: I wonder we may have to change our approach for dirty page acccounting. Please see task_dirty_inc(). It's for per task dirty limiting. And you'll notice soon that there is no task_dirty_dec(). Making use of lib/proportions.c's proportion calculation as task_dirty limit or per-bdi dirty limit does is worth to be considered. This is very simple and can be implemented without problems we have now. (Need to think about algorithm itself, but it's used and works well.) We'll never see complicated race condtions. I know some guys wants "accurate" accounting, but I myself don't want too much. Using propotions.c can offer us unified approach with per-task dirty accounting. or per-bid dirty accouting. If we do so, memcg will have interface like per-bdi dirty ratio (see below) [kamezawa@bluextal kvm2]$ ls /sys/block/dm-0/bdi/ max_ratio min_ratio power read_ahead_kb subsystem uevent Maybe memory.min_ratio memory.max_ratio And use this instead of task_dirty_limit(current, pbdi_dirty); As if (mem_cgroup_dirty_ratio_support(current)) // return 0 if root cgroup memcg_dirty_limit(current, pbdi_dirty, xxxx?); else task_dirty_limit(current, pbdi_diryt) To be honest, I don't want to increase caller of lock_page_cgroup() and don't want to see complicated race conditions. Thanks, -Kame --
Hello Kame-san, This is an interesting idea. If this applies to memcg dirty accounting, then would it also apply to system-wide dirty accounting? I don't think so, but I wanted to float the idea. It looks like this proportions.c code is good is at comparing the rates of events (for example: per-task dirty page events). However, in the case of system-wide dirty accounting we also want to consider the amount of dirty memory, not just the rate at which it is being dirtied. Cgroup dirty page accounting imposes the following additional accounting complexities: * hierarchical accounting * page migration between cgroups For per-memcg dirty accounting, are you thinking that each mem_cgroup would have a struct prop_local_single to represent a memcg's dirty memory usage relative to a system wide prop_descriptor? My concern is that we will still need an efficient way to determine the mem_cgroup associated with a page under a variety of conditions (page fault path for new mappings, softirq for dirty page writeback). Currently -rc4 and -mmotm use a non-irq safe lock_page_cgroup() to protect a page's cgroup membership. I think this will cause problems as we add more per-cgroup stats (dirty page counts, etc) that are adjusted in irq handlers. Proposed approaches include: 1. use try-style locking. this can lead to fuzzy counters, which some do not like. Over time these fuzzy counter may drift. 2. mask irq when calling lock_page_cgroup(). This has some performance cost, though it may be small (see below). 3. because a page's cgroup membership rarely changes, use RCU locking. This is fast, but may be more complex than we want. The performance of simple irqsave locking or more advanced RCU locking is similar to current locking (non-irqsave/non-rcu) for several workloads (kernel build, dd). Using a micro-benchmark some differences are seen: * irqsave is 1% slower than mmotm non-irqsave/non-rcu locking. * RCU locking is 4% faster than mmotm non-irqsave/non-rcu ...
Insta-bug here, there is nothing guaranteeing we're not preemptible --
Good catch. A call to smp_wmb() belongs in
mem_cgroup_begin_page_cgroup_reassignment() like so:
static void mem_cgroup_begin_page_cgroup_reassignment(void)
{
VM_BUG_ON(mem_cgroup_account_move_ongoing);
mem_cgroup_account_move_ongoing = true;
smp_wmb();
synchronize_rcu();
}
The irq-disable is not needed for current code, only for upcoming
per-memcg dirty page accounting which will be refactoring
mem_cgroup_update_file_mapped() into a generic memcg stat update
routine. I assume these locking changes should be bundled with the
dependant memcg dirty page accounting changes which need the ability to
update counters from irq routines. I'm sorry I didn't make that more
My addition of VM_BUG_ON() was to programmatic assert what the comment
was asserting. All callers of mem_cgroup_update_file_mapped() hold the
pte spinlock, which disables preemption. So I don't think this
VM_BUG_ON() will cause panic. A function level comment for
mem_cgroup_update_file_mapped() declaring that "callers must have
--
Greg
--
btw, you know synchronize_rcu() is _really_ slow? --
On Fri, 23 Apr 2010 22:57:06 +0200 IIUC, this is called once per an event when task is moved and we have to move accouting information...and once per an event when we call rmdir() to destroy cgroup. So, this is not frequenctly called. (hooks to migration in this patch is removable.) Thanks, -Kame --
Correct, the whole proportion thing is purely about comparing rates of Depending on what architecture you care about local_t might also be an option, it uses per-cpu irq/nmi safe instructions (and falls back to local_irq_save/restore for architectures lacking this support). --
On Fri, 23 Apr 2010 13:17:38 -0700 I think the direction itself is good. But, about FILE_MAPPED, it's out of control of radix-tree. So, we need lock_page_cgroup/unlock_page_cgroup always for avoiding race with charge/uncharge. And you don't need to call "begin/end reassignment" at page migration. (I'll post a patch for modifing codes around page-migration. It has BUG now.) But you have to call begin/end reassignment at force_empty. It does account move. Thanks, --
Since this is just stats can we used deferred updates?
else
Do charging + any deferred charges pending in
--
Three Cheers,
Balbir
--
On Mon, 15 Mar 2010 00:26:37 +0100 Okay, thanks. This seems good result. Optimization for children can be done under I hope, too. Thanks, -Kame --
OK, I'll wait a bit to see if someone has other fixes or issues and post a new version soon including these small changes. Thanks, -Andrea --
For me even with this version I see that group with 100M limit is getting much more BW. root cgroup ========== #time dd if=/dev/zero of=/root/zerofile bs=4K count=1M 4294967296 bytes (4.3 GB) copied, 55.7979 s, 77.0 MB/s real 0m56.209s test1 cgroup with memory limit of 100M ====================================== # time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M 4294967296 bytes (4.3 GB) copied, 20.9252 s, 205 MB/s real 0m21.096s Note, these two jobs are not running in parallel. These are running one after the other. --
Ok, here is the strange part. I am seeing similar behavior even without your patches applied. root cgroup ========== #time dd if=/dev/zero of=/root/zerofile bs=4K count=1M 4294967296 bytes (4.3 GB) copied, 56.098 s, 76.6 MB/s real 0m56.614s test1 cgroup with memory limit 100M =================================== # time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M 4294967296 bytes (4.3 GB) copied, 19.8097 s, 217 MB/s real 0m19.992s --
This is strange, did you flish the cache between the two runs? NOTE: Since the files are same, we reuse page cache from the other cgroup. -- Three Cheers, Balbir --
Files are different. Note suffix "1". Vivek --
Thanks, I'll get the perf output and see what I get. -- Three Cheers, Balbir --
One more thing I noticed and that is, it happens only if we limit the memory of cgroup to 100M. If same cgroup test1 is unlimited memory thing, then it did not happen. I also did not notice this happening on other system where I have 4G of memory. So it also seems to be related with only bigger configurations. Thanks --
The data is not always repeatable at my end. Are you able to modify the order and get repeatable results? In fact, I saw for cgroup != root ------------------ 4294967296 bytes (4.3 GB) copied, 120.359 s, 35.7 MB/s for cgroup = root ----------------- 4294967296 bytes (4.3 GB) copied, 84.504 s, 50.8 MB/s This is without the patches applied. -- Three Cheers, Balbir --
I lost the access to the big configuration machine but on that machine I could reproduce it all the time. But on smaller machine (4core, 4G), I could not. I will do some more tests later. Vivek --
I'll test this, I have a small machine to test on at the moment, I'll revert back with data. -- Three Cheers, Balbir --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
