[PATCH 2/4] mm: account_page_writeback added

Previous thread: [PATCH 1/4] mm: exporting account_page_dirty by Michael Rubin on Friday, August 27, 2010 - 7:40 pm. (2 messages)

Next thread: [PATCH] usb: allow drivers to use allocated bandwidth until unbound by Thadeu Lima de Souza Cascardo on Friday, August 27, 2010 - 11:06 pm. (1 message)
From: Michael Rubin
Date: Friday, August 27, 2010 - 7:40 pm

Patch #1 sets up some helper functions for account_page_dirty

Patch #2 sets up some helper functions for account_page_writeback

Patch #3 adds writeback visibility in /proc/vmstat

To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.

   # grep nr_dirtied /proc/vmstat
   nr_dirtied 3747
   # grep nr_cleaned /proc/vmstat
   nr_cleaned 3618

These entries allow user apps to understand writeback behaviour over
time and learn how it is impacting their performance. Currently there
is no way to inspect dirty and writeback speed over time. It's not
possible for nr_dirty/nr_writeback.

These entries are necessary to give visibility into writeback
behaviour. We have /proc/diskstats which lets us understand the io in
the block layer. We have blktrace for more in depth understanding. We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing. There is no way to know how active it is over the whole system,
if it's falling behind or to quantify it's efforts. With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data. Comparing the rates of change between the two allow developers
to see when writeback is not able to keep up with incoming traffic and
the rate of dirty memory being sent to the IO back end. This allows
folks to understand their io workloads and track kernel issues. Non
kernel engineers at Google often use these counters to solve puzzling
performance problems.

Patch #4 add writeback thresholds to /proc/vmstat

 # grep threshold /proc/vmstat
 nr_pages_dirty_threshold 409111
 nr_pages_dirty_background_threshold 818223

The files that report the dirty thresholds belong in /proc/vmstat. They
are meant for application writers so should not be in debugfs. But since
they are more related to internals of ...
From: Michael Rubin
Date: Friday, August 27, 2010 - 7:40 pm

This allows code outside of the mm core to safely manipulate page
writeback state and not worry about the other accounting. Not using
these routines means that some code will lose track of the accounting
and we get bugs.

Modified nilfs2 to use interface.

Signed-off-by: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 fs/nilfs2/segment.c |    2 +-
 include/linux/mm.h  |    1 +
 mm/page-writeback.c |   13 ++++++++++++-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 9fd051a..5617f16 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1599,7 +1599,7 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	kunmap_atomic(kaddr, KM_USER0);
 
 	if (!TestSetPageWriteback(clone_page))
-		inc_zone_page_state(clone_page, NR_WRITEBACK);
+		account_page_writeback(clone_page);
 	unlock_page(clone_page);
 
 	return 0;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 709f672..4b2f38b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -856,6 +856,7 @@ int __set_page_dirty_no_writeback(struct page *page);
 int redirty_page_for_writepage(struct writeback_control *wbc,
 				struct page *page);
 void account_page_dirtied(struct page *page, struct address_space *mapping);
+void account_page_writeback(struct page *page);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9d07a8d..ae5f5d5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1134,6 +1134,17 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 EXPORT_SYMBOL(account_page_dirtied);
 
 /*
+ * Helper function for set_page_writeback family.
+ * NOTE: Unlike account_page_dirtied this does not rely on being atomic
+ * wrt interrupts.
+ */
+void account_page_writeback(struct page ...
From: Michael Rubin
Date: Friday, August 27, 2010 - 7:40 pm

The kernel already exposes the user desired thresholds in /proc/sys/vm
with dirty_background_ratio and background_ratio. But the kernel may
alter the number requested without giving the user any indication that
is the case.

Knowing the actual ratios the kernel is honoring can help app developers
understand how their buffered IO will be sent to the disk.

        $ grep threshold /proc/vmstat
        nr_dirty_threshold 409111
        nr_dirty_background_threshold 818223

Signed-off-by: Michael Rubin <mrubin@google.com>
---
 include/linux/mmzone.h |    2 ++
 mm/vmstat.c            |    5 +++++
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d42f179..ad48963 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -106,6 +106,8 @@ enum zone_stat_item {
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_FILE_PAGES_DIRTIED,	/* number of times pages get dirtied */
 	NR_PAGES_CLEANED,	/* number of times pages enter writeback */
+	NR_DIRTY_THRESHOLD,	/* writeback threshold */
+	NR_DIRTY_BG_THRESHOLD,	/* bg writeback threshold */
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8521475..2342010 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -17,6 +17,7 @@
 #include <linux/vmstat.h>
 #include <linux/sched.h>
 #include <linux/math64.h>
+#include <linux/writeback.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -734,6 +735,8 @@ static const char * const vmstat_text[] = {
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_cleaned",
+	"nr_dirty_threshold",
+	"nr_dirty_background_threshold",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",
@@ -917,6 +920,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		v[i] = ...
From: KOSAKI Motohiro
Date: Sunday, August 29, 2010 - 5:28 pm

?
afaict, you and wu agreed /debug/bdi/default/stats is enough good.

Don't need this even though we add this two fields into /proc/sys/vm.



--

From: Michael Rubin
Date: Monday, August 30, 2010 - 9:25 am

On Sun, Aug 29, 2010 at 5:28 PM, KOSAKI Motohiro

I commented on this in the 0/4 email of the bug. I think these belong
in /proc/vmstat but I saw they exist in /debug/bdi/default/stats. I
figure they will probably not be accepted but I thought it was worth
attaching for consideration of upgrading from debugfs to /proc.

mrubin
--

From: KOSAKI Motohiro
Date: Monday, August 30, 2010 - 6:07 pm

For reviewers view, we are reviewing your patch to merge immediately if all issue are fixed.
Then, I'm unhappy if you don't drop merge blocker item even though you merely want asking.
At least, you can make separate thread, no?

Of cource, wen other user also want to expose via /proc interface, we are resume
this discusstion gradly.



--

From: Wu Fengguang
Date: Monday, August 30, 2010 - 6:32 pm

Michael asked promoting the dirty thresholds from debugfs to /proc.
As a developer I'd interpret the question as: will there be enough
applications/admins using it? If not, we'd better keep it as debugfs.
Otherwise it benefits to do the interface promotion now, because it
will hurt to accumulate many end user dependencies on debugfs over
time..

Thanks,
Fengguang
--

From: Michael Rubin
Date: Friday, August 27, 2010 - 7:40 pm

To help developers and applications gain visibility into writeback
behaviour adding two entries to /proc/vmstat.

   # grep nr_dirtied /proc/vmstat
   nr_dirtied 3747
   # grep nr_cleaned /proc/vmstat
   nr_cleaned 3618

In order to track the "cleaned" and "dirtied" counts we added two
vm_stat_items. Per memory node stats have been added also. So we can
see per node granularity:

   # cat /sys/devices/system/node/node20/vmstat
   Node 20 pages_cleaned: 0 times
   Node 20 pages_dirtied: 0 times

Signed-off-by: Michael Rubin <mrubin@google.com>
---
 drivers/base/node.c    |   14 ++++++++++++++
 include/linux/mmzone.h |    2 ++
 mm/page-writeback.c    |    2 ++
 mm/vmstat.c            |    3 +++
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2872e86..facd920 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -160,6 +160,18 @@ static ssize_t node_read_numastat(struct sys_device * dev,
 }
 static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
+static ssize_t node_read_vmstat(struct sys_device *dev,
+				struct sysdev_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	return sprintf(buf,
+		"Node %d pages_cleaned: %lu times\n"
+		"Node %d pages_dirtied: %lu times\n",
+		nid, node_page_state(nid, NR_PAGES_CLEANED),
+		nid, node_page_state(nid, NR_FILE_PAGES_DIRTIED));
+}
+static SYSDEV_ATTR(vmstat, S_IRUGO, node_read_vmstat, NULL);
+
 static ssize_t node_read_distance(struct sys_device * dev,
 			struct sysdev_attribute *attr, char * buf)
 {
@@ -243,6 +255,7 @@ int register_node(struct node *node, int num, struct node *parent)
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
 		sysdev_create_file(&node->sysdev, &attr_numastat);
 		sysdev_create_file(&node->sysdev, &attr_distance);
+		sysdev_create_file(&node->sysdev, &attr_vmstat);
 
 		scan_unevictable_register_node(node);
 
@@ -267,6 +280,7 @@ void unregister_node(struct node *node)
 ...
From: Wu Fengguang
Date: Saturday, August 28, 2010 - 4:50 pm

It's silly to have the different names nr_dirtied and pages_cleaned

The output format is quite different from /proc/vmstat.
Do we really need to "Node X", ":" and "times" decorations?

And the "_PAGES" in NR_FILE_PAGES_DIRTIED looks redundant to
the "_page" in node_page_state(). It's a bit long to be a pleasant

How about the comments /* accumulated number of pages ... */?

Note that NR_CLEANED won't match NR_FILE_DIRTIED in long term because
it also accounts for anon pages, and does not account for dirty pages
that are truncated before they go writeback.

Thanks,
Fengguang
--

From: Michael Rubin
Date: Monday, August 30, 2010 - 11:09 pm

OK.
May not get patch out today but will incorporate these fixes. Thank you again.

mrubin
--

From: Wu Fengguang
Date: Tuesday, August 31, 2010 - 12:48 am

Thanks. In the same directory you can find a different style example
/sys/devices/system/node/node0/numastat :) If ever the file was named

Thanks. This is kind of nitpick, however here is another name by
Jan Kara: BDI_WRITTEN. BDI_WRITTEN may not be a lot better than
BDI_CLEANED, but here is a patch based on Jan's code. I'm cooking
more patches that make use of this per-bdi counter to estimate the
bdi's write bandwidth, and to further decide the optimal (large)
writeback chunk size as well as to do IO-less balance_dirty_pages().

Basically BDI_WRITTEN and NR_CLEANED are accounting for the same
thing in different dimensions. So it would be good if we can use
the same naming scheme to avoid confusing users: either to use
BDI_WRITTEN and NR_WRITTEN, or use BDI_CLEANED and NR_CLEANED.
What's your opinion?

Thanks,
Fengguang

---
writeback: account per-bdi accumulated written pages

This steals code from Jan's balance_dirty_pages() patch.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    6 ++++--
 mm/page-writeback.c         |    1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-08-29 23:32:46.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-08-29 23:52:50.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
 
--- linux-next.orig/mm/backing-dev.c	2010-08-29 23:32:46.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-08-29 23:52:50.000000000 +0800
@@ -91,6 +91,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:   %8lu kB\n"
 		   "DirtyThresh:      %8lu kB\n"
 		   "BackgroundThresh: %8lu kB\n"
+		   "BdiWritten:       %8lu kB\n"
 		   "b_dirty:          %8lu\n"
 		   "b_io:             %8lu\n"
 		   "b_more_io:        %8lu\n"
@@ -98,8 ...
From: Wu Fengguang
Date: Sunday, September 5, 2010 - 7:17 am

With wider use of NUMA, I'm expecting more interests to put
/proc/vmstat items into /sys/devices/system/node/node0/.

What shall we do then? There are several possible options:
- just put the /proc/vmstat items into nodeX/numastat
- create nodeX/vmstat and make numastat a symlink to vmstat
- create nodeX/vmstat and remove numastat in future


I tend to prefer *_WRITTEN now.
- *_WRITTEN reminds the users about IO, *_CLEANED is less so obvious.
- *_CLEANED seems to be paired with NR_DIRTIED, this could be
  misleading to the users. The fact is, dirty pages may either be
  written to disk, or dropped (by truncate).

Thanks,
Fengguang
--

From: Wu Fengguang
Date: Sunday, September 5, 2010 - 6:34 pm

Yeah 4th option: keep numastat while introducing the above interfaces.

It's only about naming :) Michael want to do per-zone accounting and I
also need to do per-bdi accounting, for basically the same event. So
I'm proposing to name it consistently.

Thanks,
Fengguang
--

Previous thread: [PATCH 1/4] mm: exporting account_page_dirty by Michael Rubin on Friday, August 27, 2010 - 7:40 pm. (2 messages)

Next thread: [PATCH] usb: allow drivers to use allocated bandwidth until unbound by Thadeu Lima de Souza Cascardo on Friday, August 27, 2010 - 11:06 pm. (1 message)