Patch #1 sets up some helper functions for account_page_dirty Patch #2 sets up some helper functions for account_page_writeback Patch #3 adds writeback visibility in /proc/vmstat To help developers and applications gain visibility into writeback behaviour this patch adds two counters to /proc/vmstat. # grep nr_dirtied /proc/vmstat nr_dirtied 3747 # grep nr_cleaned /proc/vmstat nr_cleaned 3618 These entries allow user apps to understand writeback behaviour over time and learn how it is impacting their performance. Currently there is no way to inspect dirty and writeback speed over time. It's not possible for nr_dirty/nr_writeback. These entries are necessary to give visibility into writeback behaviour. We have /proc/diskstats which lets us understand the io in the block layer. We have blktrace for more in depth understanding. We have e2fsprogs and debugsfs to give insight into the file systems behaviour, but we don't offer our users the ability understand what writeback is doing. There is no way to know how active it is over the whole system, if it's falling behind or to quantify it's efforts. With these values exported users can easily see how much data applications are sending through writeback and also at what rates writeback is processing this data. Comparing the rates of change between the two allow developers to see when writeback is not able to keep up with incoming traffic and the rate of dirty memory being sent to the IO back end. This allows folks to understand their io workloads and track kernel issues. Non kernel engineers at Google often use these counters to solve puzzling performance problems. Patch #4 add writeback thresholds to /proc/vmstat # grep threshold /proc/vmstat nr_pages_dirty_threshold 409111 nr_pages_dirty_background_threshold 818223 The files that report the dirty thresholds belong in /proc/vmstat. They are meant for application writers so should not be in debugfs. But since they are more related to internals of ...
This allows code outside of the mm core to safely manipulate page writeback state and not worry about the other accounting. Not using these routines means that some code will lose track of the accounting and we get bugs. Modified nilfs2 to use interface. Signed-off-by: Michael Rubin <mrubin@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --- fs/nilfs2/segment.c | 2 +- include/linux/mm.h | 1 + mm/page-writeback.c | 13 ++++++++++++- 3 files changed, 14 insertions(+), 2 deletions(-) diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index 9fd051a..5617f16 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1599,7 +1599,7 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out) kunmap_atomic(kaddr, KM_USER0); if (!TestSetPageWriteback(clone_page)) - inc_zone_page_state(clone_page, NR_WRITEBACK); + account_page_writeback(clone_page); unlock_page(clone_page); return 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 709f672..4b2f38b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -856,6 +856,7 @@ int __set_page_dirty_no_writeback(struct page *page); int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page); void account_page_dirtied(struct page *page, struct address_space *mapping); +void account_page_writeback(struct page *page); int set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); int clear_page_dirty_for_io(struct page *page); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 9d07a8d..ae5f5d5 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1134,6 +1134,17 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) EXPORT_SYMBOL(account_page_dirtied); /* + * Helper function for set_page_writeback family. + * NOTE: Unlike account_page_dirtied this does not rely on being atomic + * wrt interrupts. + */ +void account_page_writeback(struct page ...
The kernel already exposes the user desired thresholds in /proc/sys/vm
with dirty_background_ratio and background_ratio. But the kernel may
alter the number requested without giving the user any indication that
is the case.
Knowing the actual ratios the kernel is honoring can help app developers
understand how their buffered IO will be sent to the disk.
$ grep threshold /proc/vmstat
nr_dirty_threshold 409111
nr_dirty_background_threshold 818223
Signed-off-by: Michael Rubin <mrubin@google.com>
---
include/linux/mmzone.h | 2 ++
mm/vmstat.c | 5 +++++
2 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d42f179..ad48963 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -106,6 +106,8 @@ enum zone_stat_item {
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_FILE_PAGES_DIRTIED, /* number of times pages get dirtied */
NR_PAGES_CLEANED, /* number of times pages enter writeback */
+ NR_DIRTY_THRESHOLD, /* writeback threshold */
+ NR_DIRTY_BG_THRESHOLD, /* bg writeback threshold */
#ifdef CONFIG_NUMA
NUMA_HIT, /* allocated in intended node */
NUMA_MISS, /* allocated in non intended node */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8521475..2342010 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -17,6 +17,7 @@
#include <linux/vmstat.h>
#include <linux/sched.h>
#include <linux/math64.h>
+#include <linux/writeback.h>
#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -734,6 +735,8 @@ static const char * const vmstat_text[] = {
"nr_shmem",
"nr_dirtied",
"nr_cleaned",
+ "nr_dirty_threshold",
+ "nr_dirty_background_threshold",
#ifdef CONFIG_NUMA
"numa_hit",
@@ -917,6 +920,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
return ERR_PTR(-ENOMEM);
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
v[i] = ...? afaict, you and wu agreed /debug/bdi/default/stats is enough good. Don't need this even though we add this two fields into /proc/sys/vm. --
On Sun, Aug 29, 2010 at 5:28 PM, KOSAKI Motohiro I commented on this in the 0/4 email of the bug. I think these belong in /proc/vmstat but I saw they exist in /debug/bdi/default/stats. I figure they will probably not be accepted but I thought it was worth attaching for consideration of upgrading from debugfs to /proc. mrubin --
For reviewers view, we are reviewing your patch to merge immediately if all issue are fixed. Then, I'm unhappy if you don't drop merge blocker item even though you merely want asking. At least, you can make separate thread, no? Of cource, wen other user also want to expose via /proc interface, we are resume this discusstion gradly. --
Michael asked promoting the dirty thresholds from debugfs to /proc. As a developer I'd interpret the question as: will there be enough applications/admins using it? If not, we'd better keep it as debugfs. Otherwise it benefits to do the interface promotion now, because it will hurt to accumulate many end user dependencies on debugfs over time.. Thanks, Fengguang --
To help developers and applications gain visibility into writeback
behaviour adding two entries to /proc/vmstat.
# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_cleaned /proc/vmstat
nr_cleaned 3618
In order to track the "cleaned" and "dirtied" counts we added two
vm_stat_items. Per memory node stats have been added also. So we can
see per node granularity:
# cat /sys/devices/system/node/node20/vmstat
Node 20 pages_cleaned: 0 times
Node 20 pages_dirtied: 0 times
Signed-off-by: Michael Rubin <mrubin@google.com>
---
drivers/base/node.c | 14 ++++++++++++++
include/linux/mmzone.h | 2 ++
mm/page-writeback.c | 2 ++
mm/vmstat.c | 3 +++
4 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2872e86..facd920 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -160,6 +160,18 @@ static ssize_t node_read_numastat(struct sys_device * dev,
}
static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
+static ssize_t node_read_vmstat(struct sys_device *dev,
+ struct sysdev_attribute *attr, char *buf)
+{
+ int nid = dev->id;
+ return sprintf(buf,
+ "Node %d pages_cleaned: %lu times\n"
+ "Node %d pages_dirtied: %lu times\n",
+ nid, node_page_state(nid, NR_PAGES_CLEANED),
+ nid, node_page_state(nid, NR_FILE_PAGES_DIRTIED));
+}
+static SYSDEV_ATTR(vmstat, S_IRUGO, node_read_vmstat, NULL);
+
static ssize_t node_read_distance(struct sys_device * dev,
struct sysdev_attribute *attr, char * buf)
{
@@ -243,6 +255,7 @@ int register_node(struct node *node, int num, struct node *parent)
sysdev_create_file(&node->sysdev, &attr_meminfo);
sysdev_create_file(&node->sysdev, &attr_numastat);
sysdev_create_file(&node->sysdev, &attr_distance);
+ sysdev_create_file(&node->sysdev, &attr_vmstat);
scan_unevictable_register_node(node);
@@ -267,6 +280,7 @@ void unregister_node(struct node *node)
...It's silly to have the different names nr_dirtied and pages_cleaned The output format is quite different from /proc/vmstat. Do we really need to "Node X", ":" and "times" decorations? And the "_PAGES" in NR_FILE_PAGES_DIRTIED looks redundant to the "_page" in node_page_state(). It's a bit long to be a pleasant How about the comments /* accumulated number of pages ... */? Note that NR_CLEANED won't match NR_FILE_DIRTIED in long term because it also accounts for anon pages, and does not account for dirty pages that are truncated before they go writeback. Thanks, Fengguang --
OK. May not get patch out today but will incorporate these fixes. Thank you again. mrubin --
Thanks. In the same directory you can find a different style example
/sys/devices/system/node/node0/numastat :) If ever the file was named
Thanks. This is kind of nitpick, however here is another name by
Jan Kara: BDI_WRITTEN. BDI_WRITTEN may not be a lot better than
BDI_CLEANED, but here is a patch based on Jan's code. I'm cooking
more patches that make use of this per-bdi counter to estimate the
bdi's write bandwidth, and to further decide the optimal (large)
writeback chunk size as well as to do IO-less balance_dirty_pages().
Basically BDI_WRITTEN and NR_CLEANED are accounting for the same
thing in different dimensions. So it would be good if we can use
the same naming scheme to avoid confusing users: either to use
BDI_WRITTEN and NR_WRITTEN, or use BDI_CLEANED and NR_CLEANED.
What's your opinion?
Thanks,
Fengguang
---
writeback: account per-bdi accumulated written pages
This steals code from Jan's balance_dirty_pages() patch.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 6 ++++--
mm/page-writeback.c | 1 +
3 files changed, 6 insertions(+), 2 deletions(-)
--- linux-next.orig/include/linux/backing-dev.h 2010-08-29 23:32:46.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2010-08-29 23:52:50.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
+ BDI_WRITTEN,
NR_BDI_STAT_ITEMS
};
--- linux-next.orig/mm/backing-dev.c 2010-08-29 23:32:46.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-08-29 23:52:50.000000000 +0800
@@ -91,6 +91,7 @@ static int bdi_debug_stats_show(struct s
"BdiDirtyThresh: %8lu kB\n"
"DirtyThresh: %8lu kB\n"
"BackgroundThresh: %8lu kB\n"
+ "BdiWritten: %8lu kB\n"
"b_dirty: %8lu\n"
"b_io: %8lu\n"
"b_more_io: %8lu\n"
@@ -98,8 ...With wider use of NUMA, I'm expecting more interests to put /proc/vmstat items into /sys/devices/system/node/node0/. What shall we do then? There are several possible options: - just put the /proc/vmstat items into nodeX/numastat - create nodeX/vmstat and make numastat a symlink to vmstat - create nodeX/vmstat and remove numastat in future I tend to prefer *_WRITTEN now. - *_WRITTEN reminds the users about IO, *_CLEANED is less so obvious. - *_CLEANED seems to be paired with NR_DIRTIED, this could be misleading to the users. The fact is, dirty pages may either be written to disk, or dropped (by truncate). Thanks, Fengguang --
Yeah 4th option: keep numastat while introducing the above interfaces. It's only about naming :) Michael want to do per-zone accounting and I also need to do per-bdi accounting, for basically the same event. So I'm proposing to name it consistently. Thanks, Fengguang --
