Perform writeback and dirty throttling with awareness of cpuset mem_allowed. The theory of operation has two primary elements: 1. Add a nodemask per mapping which indicates the nodes which have set PageDirty on any page of the mappings. 2. Add a nodemask argument to wakeup_pdflush() which is propagated down to sync_sb_inodes. This leaves sync_sb_inodes() with two nodemasks. One is passed to it and specifies the nodes the caller is interested in syncing, and will either be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the caller's context. The second nodemask is attached to the inode's mapping and shows who has modified data in the inode. sync_sb_inodes() will then skip syncing of inodes if the nodemask argument does not intersect with the mapping nodemask. cpuset_current_mems_allowed will be passed in to pdflush background_writeout by try_to_free_pages and balance_dirty_pages. balance_dirty_pages also passes the nodemask in to writeback_inodes directly when doing active reclaim. Other callers do not limit inode writeback, passing in a NULL nodemask pointer. A final change is to get_dirty_limits. It takes a nodemask argument, and when it is null there is no change in behavior. If the nodemask is set, page statistics are accumulated only for specified nodes, and the background and throttle dirty ratios will be read from a new per-cpuset ratio feature. These patches are mostly unchanged from Chris Lameter's original changelist posted previously to linux-mm. -
Perform writeback and dirty throttling with awareness of cpuset mem_allowed. The theory of operation has two primary elements: 1. Add a nodemask per mapping which indicates the nodes which have set PageDirty on any page of the mappings. 2. Add a nodemask argument to wakeup_pdflush() which is propagated down to sync_sb_inodes. This leaves sync_sb_inodes() with two nodemasks. One is passed to it and specifies the nodes the caller is interested in syncing, and will either be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the caller's context. The second nodemask is attached to the inode's mapping and shows who has modified data in the inode. sync_sb_inodes() will then skip syncing of inodes if the nodemask argument does not intersect with the mapping nodemask. cpuset_current_mems_allowed will be passed in to pdflush background_writeout by try_to_free_pages and balance_dirty_pages. balance_dirty_pages also passes the nodemask in to writeback_inodes directly when doing active reclaim. Other callers do not limit inode writeback, passing in a NULL nodemask pointer. A final change is to get_dirty_limits. It takes a nodemask argument, and when it is null there is no change in behavior. If the nodemask is set, page statistics are accumulated only for specified nodes, and the background and throttle dirty ratios will be read from a new per-cpuset ratio feature. For testing I did a variety of basic tests, verifying individual features of the test. To verify that it fixes the core problem, I created a stress test which involved using cpusets and mems_allowed to split memory so that all daemons had memory set aside for them, and my memory stress test had a separate set of memory. The stress test was mmaping 7GB of a very large file on disk. It then scans the entire 7GB of memory reading and modifying each byte. 7GB is more than the amount of physical memory made available to the stress test. Using iostat I can see the initial period of reading from dis...
Per cpuset dirty ratios
This implements dirty ratios per cpuset. Two new files are added
to the cpuset directories:
background_dirty_ratio Percentage at which background writeback starts
throttle_dirty_ratio Percentage at which the application is throttled
and we start synchrononous writeout.
Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/vm_dirty_ratio and /proc/sys/vm/dirty_background_ratio)
are used for a cpuset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 5/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 5/include/linux/cpuset.h 2007-09-11 14:50:48.000000000 -0700
+++ 7/include/linux/cpuset.h 2007-09-11 14:51:12.000000000 -0700
@@ -77,6 +77,7 @@ extern void cpuset_track_online_nodes(vo
extern int current_cpuset_is_being_rebound(void);
+extern void cpuset_get_current_ratios(int *background, int *ratio);
/*
* We need macros since struct address_space is not defined yet
*/
diff -uprN -X 0/Documentation/dontdiff 5/kernel/cpuset.c 7/kernel/cpuset.c
--- 5/kernel/cpuset.c 2007-09-11 14:50:49.000000000 -0700
+++ 7/kernel/cpuset.c 2007-09-11 14:56:18.000000000 -0700
@@ -51,6 +51,7 @@
#include <linux/time.h>
#include <linux/backing-dev.h>
#include <linux/sort.h>
+#include <linux/writeback.h>
#include <asm/uaccess.h>
#include <asm/atomic.h>
@@ -92,6 +93,9 @@ struct cpuset {
int mems_generation;
struct fmeter fmeter; /* memory_pressure filter */
+
+ int background_dirty_ratio;
+ int throttle_dirty_ratio;
};
/* Retrieve the cpuset for a container */
@@ -169,6 +173,8 @@ static struct cpuset top_cpuset = {
.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
.cpus_allowed = CPU_MASK_ALL,
.mems_allowed = NODE_MASK_ALL,
+ .background_dirty_ratio = -1,
+ .throttle_dirty_ratio = -1,
};
...On Tue, 11 Sep 2007 18:42:16 -0700 ditto? -
Locking is wrong here. The lock needs to be taken before the cs pointer It is required to take the task lock while dereferencing the tasks cpuset pointer. -
I think we can just remove the callback_mutex lock. Since the change is coming from an update to a cpuset filesystem file, the cpuset is not going anywhere since the inode is open. And I don't see that any code Agreed. -- Ethan -
Throttle VM writeout in a cpuset aware way
This bases the vm throttling from the reclaim path on the dirty ratio
of the cpuset. Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.
kswapd has a cpuset context that includes the whole machine. VM throttling
will only work during synchrononous reclaim and not from kswapd.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h 2007-09-11 14:49:47.000000000 -0700
+++ 5/include/linux/writeback.h 2007-09-11 14:50:52.000000000 -0700
@@ -94,7 +94,7 @@ static inline void inode_sync_wait(struc
int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
void laptop_io_completion(void);
void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask);
/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c 2007-09-11 14:49:47.000000000 -0700
+++ 5/mm/page-writeback.c 2007-09-11 14:50:52.000000000 -0700
@@ -386,7 +386,7 @@ void balance_dirty_pages_ratelimited_nr(
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
{
struct dirty_limits dl;
@@ -401,7 +401,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
}
for ( ; ; ) {
- get_dirty_limits(&dl, NULL, &node_online_map);
+ get_dirty_limits(&dl, NULL, nodes);
/*
* Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c 2007-09-11 14:50:41.000000000 -0700
+++ 5/mm/vmscan.c 2007-09-11 14:50:52.000000000 -0700
@@ -1185,7 +1185,...Direct reclaim: cpuset aware writeout
During direct reclaim we traverse down a zonelist and are carefully
checking each zone if its a member of the active cpuset. But then we call
pdflush without enforcing the same restrictions. In a larger system this
may have the effect of a massive amount of pages being dirtied and then either
A. No writeout occurs because global dirty limits have not been reached
or
B. Writeout starts randomly for some dirty inode in the system. Pdflush
may just write out data for nodes in another cpuset and miss doing
proper dirty handling for the current cpuset.
In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.
Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c 2007-09-11 14:41:56.000000000 -0700
+++ 4/mm/vmscan.c 2007-09-11 14:50:41.000000000 -0700
@@ -1301,7 +1301,8 @@ unsigned long do_try_to_free_pages(struc
*/
if (total_scanned > sc->swap_cluster_max +
sc->swap_cluster_max / 2) {
- wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+ wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+ &cpuset_current_mems_allowed);
sc->may_writepage = 1;
}
-Make page writeback obey cpuset constraints Currently dirty throttling does not work properly in a cpuset. If f.e a cpuset contains only 1/10th of available memory then all of the memory of a cpuset can be dirtied without any writes being triggered. If all of the cpusets memory is dirty then only 10% of total memory is dirty. The background writeback threshold is usually set at 10% and the synchrononous threshold at 40%. So we are still below the global limits while the dirty ratio in the cpuset is 100%! Writeback throttling and background writeout do not work at all in such scenarios. This patch makes dirty writeout cpuset aware. When determining the dirty limits in get_dirty_limits() we calculate values based on the nodes that are reachable from the current process (that has been dirtying the page). Then we can trigger writeout based on the dirty ratio of the memory in the cpuset. We trigger writeout in a a cpuset specific way. We go through the dirty inodes and search for inodes that have dirty pages on the nodes of the active cpuset. If an inode fulfills that requirement then we begin writeout of the dirty pages of that inode. Adding up all the counters for each node in a cpuset may seem to be quite an expensive operation (in particular for large cpusets with hundreds of nodes) compared to just accessing the global counters if we do not have a cpuset. However, please remember that the global counters were only introduced recently. Before 2.6.18 we did add up per processor counters for each processor on each invocation of get_dirty_limits(). We now add per node information which I think is equal or less effort since there are less nodes than processors. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Ethan Solomita <solo@google.com> --- Patch against 2.6.23-rc4-mm1 diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-09-11 14:39:22.000000000 -0700 +++ 3/mm/page-writeback...
pdflush: Allow the passing of a nodemask parameter
If we want to support nodeset specific writeout then we need a way
to communicate the set of nodes that an operation should affect.
So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.23-rc4-mm1
diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c
--- 1/fs/buffer.c 2007-09-11 14:36:24.000000000 -0700
+++ 2/fs/buffer.c 2007-09-11 14:39:22.000000000 -0700
@@ -372,7 +372,7 @@ static void free_more_memory(void)
struct zone **zones;
pg_data_t *pgdat;
- wakeup_pdflush(1024);
+ wakeup_pdflush(1024, NULL);
yield();
for_each_online_pgdat(pgdat) {
diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c
--- 1/fs/super.c 2007-09-11 14:36:05.000000000 -0700
+++ 2/fs/super.c 2007-09-11 14:39:22.000000000 -0700
@@ -616,7 +616,7 @@ int do_remount_sb(struct super_block *sb
return 0;
}
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
{
struct super_block *sb;
@@ -644,7 +644,7 @@ static void do_emergency_remount(unsigne
void emergency_remount(void)
{
- pdflush_operation(do_emergency_remount, 0);
+ pdflush_operation(do_emergency_remount, 0, NULL);
}
/*
diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c
--- 1/fs/sync.c 2007-09-11 14:36:05.000000000 -0700
+++ 2/fs/sync.c 2007-09-11 14:39:22.000000000 -0700
@@ -21,9 +21,9 @@
* sync everything. Start out by waking pdflush, because that writes back
* all queues in parallel.
*/
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
{
- wakeup_pdflush(0);
+ wakeup_pdflush(0, NULL);
sync_inodes(0); /* All mappings, inodes and their blockdevs */
DQUOT_SYNC(NULL);
sync_supers(); /* Write...On Tue, 11 Sep 2007 18:36:34 -0700 I get a tremendous number of rejects trying to wedge this stuff on top of Peter's mm-dirty-balancing-for-tasks changes. More rejects than I am prepared to partially-fix so that I can usefully look at these changes in It'd be nice to see some discussion regarding the memory consumption of The patch is sprinkled full of this conditional. I don't understand why this is being done. afaict it isn't described in a code comment (it should be) nor even in the changelogs? Given its overall complexity and its likelihood to change in the future, I'd suggest that this conditional be centralised in a single place. Something like /* * nice comment goes here */ #if MAX_NUMNODES <= BITS_PER_LONG #define CPUSET_DIRTY_LIMITS 1 #else #define CPUSET_DIRTY_LIMITS 0 #endif Then use #if CPUSET_DIRTY_LIMITS everywhere else. (This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll et a warning if someone typos '#if CPUSET_DITRY_LIMITS') Even better would be to calculate CPUSET_DIRTY_LIMITS within Kconfig, but I suspect you'll need to jump through unfeasible hoops to do that afacit there is no code comment and no changelog text which explains the above design decision? There should be, please. There is talk of making cpusets available with CONFIG_SMP=n. Will this new That comment is a bit terse. It's always good to be lavish when commenting hmm, OK, there's a hint as to wghat's going on. It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and we might want to tweak that in the future. Yet another argument for Does it have to be atomic? atomic is weak and can fail. If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we should at least pass the gfp_t into this function so it can do the stronger Can this race with cpuset_update_dirty_nodes()? And with itself? If not, -
This isn't surprising. We're both changing the calculation of dirty limits. If his code is already into your workspace, then I'll have to do I can add something like this. Probably something like: I'm not sure how useful it would be in that scenario, but for consistency we should still be able to specify varying dirty ratios (from patch 6/6). The above code wouldn't mean anything SMP=n since there's only the one node. We'd just be indicating whether the inode has I'll add a comment to make it more obvious. The point is to only add one long to the data structure, no matter sizeof(nodemask_t), adding I was going to say that sanity would be improved by just allocing the nodemask at inode alloc time. A failure here could be a problem because below cpuset_intersects_dirty_nodes() assumes that a NULL nodemask pointer means that there are no dirty nodes, thus preventing dirty pages from getting written to disk. i.e. This must never fail. Given that we allocate it always at the beginning, I'm leaning towards just allocating it within mapping no matter its size. It will make the code much much simpler, and save me writing all the comments we've been discussing. 8-) How disastrous would this be? Is the need to support a 1024 node system with 1,000,000 open mostly-read-only files thus needing to spend 120MB I'll add a comment. Such a race should not be possible. It is called only from clear_inode() which is used when the inode is being freed "with extreme prejudice" (from its comments). I can add a check that -
Hmmm. It should assume that there is no tracking thus any node can be Consider that a 1024 node system has more than 4TB of memory. If that system is running as a fileserver then you get into some issues. But then 120MB are not that big of a deal. Its more the cache footprint issue I would think. Having a NULL there avoids touching a 128 byte nodemask. I There is already a comment saying that it cannot happen. -
None of this is very nice. Yes, it would be good to save all that memory and yes, I_DIRTY_PAGES inodes are very much the uncommon case. But if a failed GFP_ATOMIC allocation results in data loss then that's a showstopper. How hard would it be to handle the allocation failure in a more friendly manner? Say, if the allocation failed then point mapping->dirty_nodes at some global all-ones nodemask, and then special-case that nodemask in the Sounds sane. -
Ack. However, the situation dirty_nodes == NULL && inode dirty then means that unknown nodes are dirty. If we are later are successful with the alloc and we know that the pages are dirty in the mapping then the initial dirty_nodes must be all ones. If this is the first page to be dirtied then we can start with a dirty_nodes mask of all zeros like now. -
On small NUMA system < 64 nodes you basically have an additional word added to the address_space structure. On large NUMA we may have to allocate very large per node arrays of up to 1024 nodes which may result in 128 bytes used. However, the very large systems are rare thus we want Hmmmm... There was an extensive discussion on that one early in the year and I had it in the overview of the earlier releases. Ethan: Maybe get Yes. Cpuset with CONFIG_SMP=n has been run for years on IA64 with Tony So A pointer has the size of BITS_PER_LONG. If we have less nodes then we This is called from __set_page_dirty_nobuffers. There is no allocation context available at that point. If the allocation fails then we do not You just quoted the comment above that explains the locking requirements. Here it is again: /* + * Special functions for NUMA systems with a large number of nodes. + * The nodemask is pointed to from the address space structures. + * The attachment of the dirty_node mask is protected by the + * tree_lock. The nodemask is freed only when the inode is cleared + * (and therefore unused, thus no locking necessary). + */ -
Looks like just an optimization to me ... Ethan wants to economize and not bloat struct address_space too much. So, if sizeof(nodemask_t) == sizeof(long), i.e. when: MAX_NUMNODES <= BITS_PER_LONG, then we'll be adding only sizeof(long) extra bytes to the struct (by plonking the object itself into it). But even when MAX_NUMNODES > BITS_PER_LONG, because we're storing a pointer, and because sizeof(void *) == sizeof(long), so again the maximum bloat addition to struct address_space would only be sizeof(long) bytes. I didn't see the original mail, but if the #ifdeffery for this conditional is too much as a result of this optimization, Ethan should probably just do away with all of it entirely, and simply put a full nodemask_t object (irrespective of MAX_NUMNODES) into the struct. After all, struct task_struct does the same unconditionally ... but admittedly, there are several times more address_space struct's resident in memory at any given time than there are task_struct's, so this optimization does -
On Sat, 15 Sep 2007 05:17:48 +0530 yup. Note that "It's unobvious" != "It's unobvious to me". I review code for I think the optimisation is (probably) desirable, but it would be best to describe the tradeoff in the changelog and to add some suitable code-commentary for those who read the code in a year's time and to avoid sprinkling the logic all over the tree. -
True, the other option could be to put the /pointer/ in there unconditionally, but that would slow down the MAX_NUMNODES <= BITS_PER_LONG case, which (after grepping through defconfigs) appears to be the common case on all archs other than ia64. So I think your idea of making that conditional centralized in the code with an accompanying comment is the way to go here ... Satyam -
Satyam Sharma wrote: It won't be long before arch's other than ia64 will have MAX_NUMNODES > BITS_PER_LONG. While it won't be the norm, we should account for it now. Thanks, Mike -
On Tue, 17 Jul 2007 14:23:14 -0700 Thanks for keeping these patches up to date. Add you signoff if you did modifications to a patch. Also include the description of the tests in the introduction to the patchset. -
So switch from an Ack to a signed-off? OK, and I'll add descriptions of testing. Everyone other than you has been silent on these patches. Does silence equal consent? -- Ethan -
Sometimes. Howerver, the audience for NUMA and cpusets is rather limited. -
Per cpuset dirty ratios
This implements dirty ratios per cpuset. Two new files are added
to the cpuset directories:
background_dirty_ratio Percentage at which background writeback starts
throttle_dirty_ratio Percentage at which the application is throttled
and we start synchrononous writeout.
Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/vm_dirty_ratio and /proc/sys/vm/dirty_background_ratio)
are used for a cpuset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 6/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 6/include/linux/cpuset.h 2007-07-11 21:17:08.000000000 -0700
+++ 7/include/linux/cpuset.h 2007-07-11 21:17:41.000000000 -0700
@@ -76,6 +76,7 @@ extern void cpuset_track_online_nodes(vo
extern int current_cpuset_is_being_rebound(void);
+extern void cpuset_get_current_ratios(int *background, int *ratio);
/*
* We need macros since struct address_space is not defined yet
*/
diff -uprN -X 0/Documentation/dontdiff 6/kernel/cpuset.c 7/kernel/cpuset.c
--- 6/kernel/cpuset.c 2007-07-12 12:15:20.000000000 -0700
+++ 7/kernel/cpuset.c 2007-07-12 12:15:34.000000000 -0700
@@ -51,6 +51,7 @@
#include <linux/time.h>
#include <linux/backing-dev.h>
#include <linux/sort.h>
+#include <linux/writeback.h>
#include <asm/uaccess.h>
#include <asm/atomic.h>
@@ -92,6 +93,9 @@ struct cpuset {
int mems_generation;
struct fmeter fmeter; /* memory_pressure filter */
+
+ int background_dirty_ratio;
+ int throttle_dirty_ratio;
};
/* Update the cpuset for a container */
@@ -175,6 +179,8 @@ static struct cpuset top_cpuset = {
.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
.cpus_allowed = CPU_MASK_ALL,
.mems_allowed = NODE_MASK_ALL,
+ .background_dirty_ratio = -1,
+ .throttle_dirty_ratio = -1,
};
/*...Throttle VM writeout in a cpuset aware way
This bases the vm throttling from the reclaim path on the dirty ratio
of the cpuset. Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.
kswapd has a cpuset context that includes the whole machine. VM throttling
will only work during synchrononous reclaim and not from kswapd.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h 2007-07-11 21:16:25.000000000 -0700
+++ 5/include/linux/writeback.h 2007-07-11 21:16:50.000000000 -0700
@@ -95,7 +95,7 @@ static inline void inode_sync_wait(struc
int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
void laptop_io_completion(void);
void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask);
/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c 2007-07-16 18:31:13.000000000 -0700
+++ 5/mm/page-writeback.c 2007-07-16 18:32:08.000000000 -0700
@@ -384,7 +384,7 @@ void balance_dirty_pages_ratelimited_nr(
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
{
struct dirty_limits dl;
@@ -399,7 +399,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
}
for ( ; ; ) {
- get_dirty_limits(&dl, NULL, &node_online_map);
+ get_dirty_limits(&dl, NULL, nodes);
/*
* Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c 2007-07-11 21:16:26.000000000 -0700
+++ 5/mm/vmscan.c 2007-07-11 21:16:50.000000000 -0700
@@ -1064,7 +1064,...Direct reclaim: cpuset aware writeout
During direct reclaim we traverse down a zonelist and are carefully
checking each zone if its a member of the active cpuset. But then we call
pdflush without enforcing the same restrictions. In a larger system this
may have the effect of a massive amount of pages being dirtied and then either
A. No writeout occurs because global dirty limits have not been reached
or
B. Writeout starts randomly for some dirty inode in the system. Pdflush
may just write out data for nodes in another cpuset and miss doing
proper dirty handling for the current cpuset.
In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.
Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c 2007-07-11 21:16:14.000000000 -0700
+++ 4/mm/vmscan.c 2007-07-11 21:16:26.000000000 -0700
@@ -1183,7 +1183,8 @@ unsigned long try_to_free_pages(struct z
*/
if (total_scanned > sc.swap_cluster_max +
sc.swap_cluster_max / 2) {
- wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+ wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+ &cpuset_current_mems_allowed);
sc.may_writepage = 1;
}
-Make page writeback obey cpuset constraints Currently dirty throttling does not work properly in a cpuset. If f.e a cpuset contains only 1/10th of available memory then all of the memory of a cpuset can be dirtied without any writes being triggered. If all of the cpusets memory is dirty then only 10% of total memory is dirty. The background writeback threshold is usually set at 10% and the synchrononous threshold at 40%. So we are still below the global limits while the dirty ratio in the cpuset is 100%! Writeback throttling and background writeout do not work at all in such scenarios. This patch makes dirty writeout cpuset aware. When determining the dirty limits in get_dirty_limits() we calculate values based on the nodes that are reachable from the current process (that has been dirtying the page). Then we can trigger writeout based on the dirty ratio of the memory in the cpuset. We trigger writeout in a a cpuset specific way. We go through the dirty inodes and search for inodes that have dirty pages on the nodes of the active cpuset. If an inode fulfills that requirement then we begin writeout of the dirty pages of that inode. Adding up all the counters for each node in a cpuset may seem to be quite an expensive operation (in particular for large cpusets with hundreds of nodes) compared to just accessing the global counters if we do not have a cpuset. However, please remember that the global counters were only introduced recently. Before 2.6.18 we did add up per processor counters for each processor on each invocation of get_dirty_limits(). We now add per node information which I think is equal or less effort since there are less nodes than processors. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Ethan Solomita <solo@google.com> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c --- 2/mm/page-writeback.c 2007-07-11 21:15:47.000000000 -0700 +++ 3/mm/page-writeback.c 20...
pdflush: Allow the passing of a nodemask parameter
If we want to support nodeset specific writeout then we need a way
to communicate the set of nodes that an operation should affect.
So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Ethan Solomita <solo@google.com>
---
Patch against 2.6.22-rc6-mm1
diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c
--- 1/fs/buffer.c 2007-07-11 21:08:04.000000000 -0700
+++ 2/fs/buffer.c 2007-07-11 21:15:47.000000000 -0700
@@ -359,7 +359,7 @@ static void free_more_memory(void)
struct zone **zones;
pg_data_t *pgdat;
- wakeup_pdflush(1024);
+ wakeup_pdflush(1024, NULL);
yield();
for_each_online_pgdat(pgdat) {
diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c
--- 1/fs/super.c 2007-07-11 21:07:41.000000000 -0700
+++ 2/fs/super.c 2007-07-11 21:15:47.000000000 -0700
@@ -615,7 +615,7 @@ int do_remount_sb(struct super_block *sb
return 0;
}
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
{
struct super_block *sb;
@@ -643,7 +643,7 @@ static void do_emergency_remount(unsigne
void emergency_remount(void)
{
- pdflush_operation(do_emergency_remount, 0);
+ pdflush_operation(do_emergency_remount, 0, NULL);
}
/*
diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c
--- 1/fs/sync.c 2007-07-11 21:07:41.000000000 -0700
+++ 2/fs/sync.c 2007-07-11 21:15:47.000000000 -0700
@@ -21,9 +21,9 @@
* sync everything. Start out by waking pdflush, because that writes back
* all queues in parallel.
*/
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
{
- wakeup_pdflush(0);
+ wakeup_pdflush(0, NULL);
sync_inodes(0); /* All mappings, inodes and their blockdevs */
DQUOT_SYNC(NULL);
sync_supers(); /* Write...Add a dirty map to struct address_space In a NUMA system it is helpful to know where the dirty pages of a mapping are located. That way we will be able to implement writeout for applications that are constrained to a portion of the memory of the system as required by cpusets. This patch implements the management of dirty node maps for an address space through the following functions: cpuset_clear_dirty_nodes(mapping) Clear the map of dirty nodes cpuset_update_nodes(mapping, page) Record a node in the dirty nodes map cpuset_init_dirty_nodes(mapping) First time init of the map The dirty map may be stored either directly in the mapping (for NUMA systems with less then BITS_PER_LONG nodes) or separately allocated for systems with a large number of nodes (f.e. IA64 with 1024 nodes). Updating the dirty map may involve allocating it first for large configurations. Therefore we protect the allocation and setting of a node in the map through the tree_lock. The tree_lock is already taken when a page is dirtied so there is no additional locking overhead if we insert the updating of the nodemask there. The dirty map is only cleared (or freed) when the inode is cleared. At that point no pages are attached to the inode anymore and therefore it can be done without any locking. The dirty map therefore records all nodes that have been used for dirty pages by that inode until the inode is no longer used. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Ethan Solomita <solo@google.com> --- Patch against 2.6.22-rc6-mm1 diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c --- 0/fs/buffer.c 2007-07-11 20:30:55.000000000 -0700 +++ 1/fs/buffer.c 2007-07-11 21:08:04.000000000 -0700 @@ -41,6 +41,7 @@ #include <linux/bitops.h> #include <linux/mpage.h> #include <linux/bit_spinlock.h> +#include <linux/cpuset.h> static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); @@ -710,6 +711,7 @@ static...
| Linus Torvalds | Linux 2.6.26-rc4 |
| Antonino Ingargiola | [SOLVED] Serial buffer corruption [was Re: FTDI usb-serial possible bug] |
| Jan Engelhardt | Re: LSM conversion to static interface |
| Peter Teoh | Re: Slow DOWN, please!!! |
| Jacob Yocom-Piatt | Re: Real men don't attack straw men |
| Florin Andrei | firewall is very slow, something's wrong |
| Karel Kulhavy | OpenBSD kernel janitors |
| Joerg Zinke | openbsd router hardware |
git: | |
| Theodore Ts'o | How do get a specific version of a particular file? |
| Linus Torvalds | Re: git versus CVS (versus bk) |
| Ken Pratt | pack operation is thrashing my server |
| Giuseppe Bilotta | git-svn tags and branches |
| David Miller | [GIT]: Networking |
| Wang Jian | drivers/net/phy/marvell.c: 88e1111 can't get out sleep mode |
| Wei Yongjun | Re: [PATCH] DCCP: Fix to reset the connection with Reset Code 5 Option Error while... |
| Auke Kok | [PATCH] e1000e: test MSI interrupts |
