Re: [PATCH 4/4] writeback: Reporting dirty thresholds in /proc/vmstat

Previous thread: [PATCH 2/4] mm: account_page_writeback added by Michael Rubin on Friday, August 20, 2010 - 2:31 am. (2 messages)

Next thread: [Patch] makefile: not need to regenerate kernel.release file when make kernelrelease by Amerigo Wang on Friday, August 20, 2010 - 2:36 am. (2 messages)
From: Michael Rubin
Date: Friday, August 20, 2010 - 2:31 am

Patch #1 sets up some helper functions for account_page_dirty

Patch #2 sets up some helper functions for account_page_writeback

Patch #3 adds writeback visibility in /proc/vmstat

Patch #4 add writeback thresholds to /proc/vmstat

To help developers and applications gain visibility into writeback
behaviour this patch adds two counters to /proc/vmstat.

   # grep nr_dirtied /proc/vmstat
   nr_dirtied 3747
   # grep nr_entered_writeback /proc/vmstat
   nr_entered_writeback 3618

These entries allow user apps to understand writeback behaviour over
time and learn how it is impacting their performance. Currently there
is no way to inspect dirty and writeback speed over time. It's not
possible for nr_dirty/nr_writeback.

These entries are necessary to give visibility into writeback
behaviour. We have /proc/diskstats which lets us understand the io in
the block layer. We have blktrace for more in depth understanding. We have
e2fsprogs and debugsfs to give insight into the file systems behaviour,
but we don't offer our users the ability understand what writeback is
doing. There is no way to know how active it is over the whole system,
if it's falling behind or to quantify it's efforts. With these values
exported users can easily see how much data applications are sending
through writeback and also at what rates writeback is processing this
data. Comparing the rates of change between the two allow developers
to see when writeback is not able to keep up with incoming traffic and
the rate of dirty memory being sent to the IO back end. This allows
folks to understand their io workloads and track kernel issues. Non
kernel engineers at Google often use these counters to solve puzzling
performance problems.

Patch #3 adds dirty thresholds to /proc/vmstat.

 # grep threshold /proc/vmstat
 nr_pages_dirty_threshold 409111
 nr_pages_dirty_background_threshold 818223

The files that report the dirty thresholds belong in /proc/vmstat. They
are meant for application writers so should ...
From: Michael Rubin
Date: Friday, August 20, 2010 - 2:31 am

To help developers and applications gain visibility into writeback
behaviour adding two entries to /proc/vmstat.

   # grep nr_dirtied /proc/vmstat
   nr_dirtied 3747
   # grep nr_entered_writeback /proc/vmstat
   nr_entered_writeback 3618

In order to track the "cleaned" and "dirtied" counts we added two
vm_stat_items.  Per memory node stats have been added also. So we can
see per node granularity:

   # cat /sys/devices/system/node/node20/writebackstat
   Node 20 pages_writeback: 0 times
   Node 20 pages_dirtied: 0 times

Signed-off-by: Michael Rubin <mrubin@google.com>
---
 drivers/base/node.c    |   14 ++++++++++++++
 include/linux/mmzone.h |    2 ++
 mm/page-writeback.c    |    2 ++
 mm/vmstat.c            |    3 +++
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 2872e86..2d05421 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -160,6 +160,18 @@ static ssize_t node_read_numastat(struct sys_device * dev,
 }
 static SYSDEV_ATTR(numastat, S_IRUGO, node_read_numastat, NULL);
 
+static ssize_t node_read_writebackstat(struct sys_device *dev,
+				struct sysdev_attribute *attr, char *buf)
+{
+	int nid = dev->id;
+	return sprintf(buf,
+		"Node %d pages_writeback: %lu times\n"
+		"Node %d pages_dirtied: %lu times\n",
+		nid, node_page_state(nid, NR_PAGES_ENTERED_WRITEBACK),
+		nid, node_page_state(nid, NR_FILE_PAGES_DIRTIED));
+}
+static SYSDEV_ATTR(writebackstat, S_IRUGO, node_read_writebackstat, NULL);
+
 static ssize_t node_read_distance(struct sys_device * dev,
 			struct sysdev_attribute *attr, char * buf)
 {
@@ -243,6 +255,7 @@ int register_node(struct node *node, int num, struct node *parent)
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
 		sysdev_create_file(&node->sysdev, &attr_numastat);
 		sysdev_create_file(&node->sysdev, &attr_distance);
+		sysdev_create_file(&node->sysdev, &attr_writebackstat);
 
 		scan_unevictable_register_node(node);
 
@@ -267,6 +280,7 @@ ...
From: KOSAKI Motohiro
Date: Friday, August 20, 2010 - 3:05 am

'nr_entered_writeback' seems ok. but nr_dirtied seems a bit easy confusable
with 'nr_dirty'. Can you please choice more clear meaningful name?

Otherwise looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




--

From: Wu Fengguang
Date: Friday, August 20, 2010 - 3:08 am

How about the names nr_dirty_accumulated and nr_writeback_accumulated?
It seems more consistent, for both the interface and code (see below).

I'd prefer the name "vmstat" over "writebackstat", and propose to
migrate items from /proc/zoneinfo over time. zoneinfo is a terrible
interface for scripting.

Also, are there meaningful usage of per-node writeback stats?
The numbers are naturally per-bdi ones instead. But if we plan to
expose them for each bdi, this patch will need to be implemented

                nid, node_page_state(nid, NR_WRITEBACK_ACCUMULATED),


ditto
  
ditto

   	NR_FILE_DIRTY_ACCUMULATED, /* number of times pages get dirtied */



   	"nr_dirty_accumulated",
--

From: Michael Rubin
Date: Friday, August 20, 2010 - 4:51 pm

Those names don't seem to right to me.
I admit I like "nr_dirtied" and "nr_cleaned" that seems most
understood. These numbers also get very big pretty fast so I don't


For us yes. We use fake numa nodes to implement cgroup memory isolation.

Currently I have no plans to do that.

mrubin
--

From: Wu Fengguang
Date: Friday, August 20, 2010 - 5:48 pm

That's sure convenient for you, for now. But it's special use case.

I wonder if you'll still stick to the fake NUMA scenario two years
later -- when memcg grows powerful enough. What do we do then? "Hey
let's rip these counters, their major consumer has dumped them.."

For per-job nr_dirtied, I suspect the per-process write_bytes and
cancelled_write_bytes in /proc/self/io will serve you well.

For per-job nr_cleaned, I suspect the per-zone nr_writeback will be

Peter? :)

Thanks,
Fengguang
--

From: Michael Rubin
Date: Monday, August 23, 2010 - 10:45 am

I think the counters will still be useful for NUMA also. Is there a
performance hit here I am missing to having the per node counters?
Just want to make sure we are only wondering about whether or not we
are polluting the interface? Also since we plan to change the name to
vmstat instead doesn't that make it more generic in the future?

mrubin
--

From: Michael Rubin
Date: Monday, August 23, 2010 - 8:02 pm

Is this as an alternative to the vmstat counters or the per node
vmstat counters?

In either case I am not sure /proc/pid will be sufficient. What if 20
processes are created, write a lot of data, then quit? How do we know
about those events later? What about jobs that wake up, write data
then quit quickly?

If we are running many many tasks I would think the amount of time it
can take to cycle through all of them might mean we might not capture
all the data.

The goal is to have a complete view of the dirtying and cleaning of
the pages over time. Using /proc/pid feels like it won't achieve that.

mrubin
--

From: Wu Fengguang
Date: Monday, August 23, 2010 - 8:25 pm

According to Documentation/accounting/taskstats.txt, it's possible to
register a task for collecting its stat numbers when it quits. However
I'm not sure if there are reliable ways to do this if tasks come and
go very quickly. It may help to somehow automatically add the numbers


Looks so.

Thanks,
Fengguang
--

From: Michael Rubin
Date: Friday, August 20, 2010 - 2:31 am

The kernel already exposes the user desired thresholds in /proc/sys/vm
with dirty_background_ratio and background_ratio. But the kernel may
alter the number requested without giving the user any indication that
is the case.

Knowing the actual ratios the kernel is honoring can help app developers
understand how their buffered IO will be sent to the disk.

	$ grep threshold /proc/vmstat
	nr_dirty_threshold 409111
	nr_dirty_background_threshold 818223

Signed-off-by: Michael Rubin <mrubin@google.com>
---
 include/linux/mmzone.h |    3 +++
 mm/vmstat.c            |    5 +++++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fe4e6dd..c2243d0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -106,6 +106,9 @@ enum zone_stat_item {
 	NR_SHMEM,		/* shmem pages (included tmpfs/GEM pages) */
 	NR_FILE_PAGES_DIRTIED,	/* number of times pages get dirtied */
 	NR_PAGES_ENTERED_WRITEBACK, /* number of times pages enter writeback */
+	NR_DIRTY_THRESHOLD,	/* writeback threshold */
+	NR_DIRTY_BG_THRESHOLD,	/* bg writeback threshold */
+
 #ifdef CONFIG_NUMA
 	NUMA_HIT,		/* allocated in intended node */
 	NUMA_MISS,		/* allocated in non intended node */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 073a496..c755d71 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -17,6 +17,7 @@
 #include <linux/vmstat.h>
 #include <linux/sched.h>
 #include <linux/math64.h>
+#include <linux/writeback.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -734,6 +735,8 @@ static const char * const vmstat_text[] = {
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_entered_writeback",
+	"nr_dirty_threshold",
+	"nr_dirty_background_threshold",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",
@@ -917,6 +920,8 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
 		v[i] = ...
From: KOSAKI Motohiro
Date: Friday, August 20, 2010 - 3:06 am

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



--

From: KOSAKI Motohiro
Date: Sunday, August 22, 2010 - 3:27 am

sorry, this is mistake. Wu pointed out this patch is unnecessary.





--

From: Wu Fengguang
Date: Friday, August 20, 2010 - 3:12 am

This may cost cacheline.

Thanks,
Fengguang
--

From: Wu Fengguang
Date: Friday, August 20, 2010 - 10:48 pm

I realized that the dirty thresholds has already been exported here:

$ grep Thresh  /debug/bdi/8:0/stats
BdiDirtyThresh:     381000 kB
DirtyThresh:       1719076 kB
BackgroundThresh:   859536 kB

So why not use that interface directly?

Thanks,
Fengguang
--

From: Michael Rubin
Date: Monday, August 23, 2010 - 10:52 am

LOL. I know about these counters. This goes back and forth a lot.
The reason we don't want to use this interface is several fold.

1) It's exporting the implementation of writeback. We are doing bdi
today but one day we may not.
2) We need a non debugfs version since there are many situations where
debugfs requires root to mount and non root users may want this data.
Mounting debugfs all the time is not always an option.
3) Full system counters are easier to handle the juggling of removable
storage where these numbers will appear and disappear due to being
dynamic.

The goal is to get a full view of the system writeback behaviour not a
"kinda got it-oops maybe not" view.

mrubin
--

From: KOSAKI Motohiro
Date: Monday, August 23, 2010 - 6:20 pm

Please don't use LOL if you want to get good discuttion. afaict, Wu have
deep knowledge in this area. However all kernel-developer don't know all

In nowadays, many distro mount debugfs at boot time. so, can you please

I bet nobody oppose this point :)


--

From: Michael Rubin
Date: Monday, August 23, 2010 - 6:41 pm

On Mon, Aug 23, 2010 at 6:20 PM, KOSAKI Motohiro

Apologies. No offense was intended. I was laughing at the situation
and how I too once thought the per bdi counters were enough. Feng has
been very helpful and patient. The discussion has done nothing but

Right now we don't mount all of debugfs at boot time. We have not done
the work to verify its safe in our environment. It's mostly a nit.

Also I was under the impression that debugfs was intended more for
kernel devs while /proc and /sys was intended for application

This is the biggie to me. The idea is to get a complete view of the
system's writeback behaviour over time. With systems with hot plug

Yup.

mrubin
--

From: Wu Fengguang
Date: Monday, August 23, 2010 - 7:11 pm

You work discreetly, that's a good thing. Note that most

I guess the keyword here is "debugging/diagnosing". Think about

Sorry for giving a wrong example. Hope this one is better:

$ cat /debug/bdi/default/stats
[...]
DirtyThresh:       1838904 kB
BackgroundThresh:   919452 kB
[...]

It's a trick to avoid messing with real devices :)

Thanks,
Fengguang
--

From: Michael Rubin
Date: Monday, August 23, 2010 - 7:42 pm

That's cool. And it's the exact code path :-)

mrubin
--

From: Wu Fengguang
Date: Monday, August 23, 2010 - 7:01 pm

> +	global_dirty_limits(v + NR_DIRTY_THRESHOLD, v + NR_DIRTY_BG_THRESHOLD);

Sorry I messed it up. The parameters should be swapped.
--

Previous thread: [PATCH 2/4] mm: account_page_writeback added by Michael Rubin on Friday, August 20, 2010 - 2:31 am. (2 messages)

Next thread: [Patch] makefile: not need to regenerate kernel.release file when make kernelrelease by Amerigo Wang on Friday, August 20, 2010 - 2:36 am. (2 messages)