Re: [PATCH 4/4] per-task-delay-accounting: /proc export for memory reclaim delay

Previous thread: [PATCH 1/2] cpusets: restructure the function update_cpumask() and update_nodemask() by Miao Xie on Tuesday, June 3, 2008 - 10:03 pm. (1 message)

Next thread: per_cpu_counter_sum lockdep warning by Balbir Singh on Tuesday, June 3, 2008 - 11:27 pm. (4 messages)
To: <linux-kernel@...>
Cc: <akpm@...>
Date: Tuesday, June 3, 2008 - 10:38 pm

Sometimes, application responses become bad under heavy memory load.
Applications take a bit time to reclaim memory.
The statistics, how long memory reclaim takes, will be useful to
measure memory usage.

This patch adds accounting memory reclaim to per-task-delay-accounting
for accounting the time of try_to_free_pages().

<i.e>

- When System is under low memory load,
memory reclaim may not occur.

$ free
total used free shared buffers cached
Mem: 8197800 1577300 6620500 0 4808 1516724
-/+ buffers/cache: 55768 8142032
Swap: 16386292 0 16386292

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 5069748 10612 3014060 0 0 0 0 3 26 0 0 100 0
0 0 0 5069748 10612 3014060 0 0 0 0 4 22 0 0 100 0
0 0 0 5069748 10612 3014060 0 0 0 0 3 18 0 0 100 0

Measure the time of tar command.

$ ls -s test.dat
1501472 test.dat

$ time tar cvf test.tar test.dat
real 0m13.388s
user 0m0.116s
sys 0m5.304s

$ ./delayget -d -p <pid>
CPU count real total virtual total delay total
428 5528345500 5477116080 62749891
IO count delay total
338 8078977189
SWAP count delay total
0 0
RECLAIM count delay total
0 0

- When system is under heavy memory load
memory reclaim may occur.

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 7159032 49724 1812 3012 0 0 0 0 3 24 0 0 100 0
0 0 7159032 49724 1812 3012 0 0 0 0 4 24 0 0 10...

To: Keika Kobayashi <kobayashi.kk@...>
Cc: <linux-kernel@...>, <akpm@...>
Date: Wednesday, June 4, 2008 - 2:00 am

On Tue, 3 Jun 2008 19:38:25 -0700
Could you move this to vmscan.c do_try_to_free_pages().
Then, memory reclaim from memory resource controller can be catched as well.

Thanks,
-Kame

--

To: Keika Kobayashi <kobayashi.kk@...>
Cc: <kosaki.motohiro@...>, <linux-kernel@...>, <akpm@...>
Date: Tuesday, June 3, 2008 - 11:15 pm

Why don't you cc'ed delayacct developer nor linux-mm?

--

To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: <linux-kernel@...>, <akpm@...>
Date: Tuesday, June 3, 2008 - 11:40 pm

On Wed, 04 Jun 2008 12:15:48 +0900

Oops. I forgot to add delayacct developer to CC.
I'll do it.

But are you sure about linux-mm?
--

To: Keika Kobayashi <kobayashi.kk@...>
Cc: <kosaki.motohiro@...>, <linux-kernel@...>, <akpm@...>
Date: Tuesday, June 3, 2008 - 11:00 pm

Shouldn't we consider memcgroup reclaim?

--

To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: Keika Kobayashi <kobayashi.kk@...>, <linux-kernel@...>, <akpm@...>
Date: Wednesday, June 4, 2008 - 1:51 pm

thanks for pointing this.
Unfortunately, we're not so familiar to memcgroup.
Will look into this.
Could you tell us pointers to memcgroup?

Thanks,
Hiroshi Shimamoto
--

To: Hiroshi Shimamoto <h-shimamoto@...>
Cc: <kosaki.motohiro@...>, Keika Kobayashi <kobayashi.kk@...>, <linux-kernel@...>, <akpm@...>
Date: Wednesday, June 4, 2008 - 8:50 pm

Documentation/controllers/memory.txt is good guide :)

<abstract>
try_to_free_pages(): global reclaim entry point
try_to_free_mem_cgroup_pages(): memcgroup reclaim entry point
do_try_to_free_pages(): common layer

Unfortunately, I don't know your requirement and
memcgroup reclaim shold be mesured.
I hope you explain your purpose and benefit more.

Thanks.

--

To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: Keika Kobayashi <kobayashi.kk@...>, <linux-kernel@...>, <akpm@...>
Date: Thursday, June 5, 2008 - 2:56 pm

thanks, and playing with this is good too.

I see the point that, on memcgroup enabled system, there are
memcgroup memory reclaim points and try_to_free_mem_cgroup_pages()
is called when the page charge reach the limit.
So if we want to account delay for memory reclaim, we should
account at both of try_to_free_pages() and try_to_free_mem_cgroup_pages().

Ah, sure.
We have an issue that a process which accesses a DB sometime slows down.
We want to know what causes it, without the reproducer.
We have only the report of the problem, memory usage, vmstat...
And, I don't know the memory reclaim really causes the problem.

If we can see what kind of delay is the matter in the test bed, it will
help us to take a next action.

Thanks,
Hiroshi Shimamoto

--

To: <linux-kernel@...>
Cc: <akpm@...>
Date: Tuesday, June 3, 2008 - 10:46 pm

Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
---
fs/proc/array.c | 3 ++-
include/linux/delayacct.h | 10 ++++++++++
kernel/delayacct.c | 11 +++++++++++
3 files changed, 23 insertions(+), 1 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 9e3b8c3..a3e6e86 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -498,7 +498,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,

seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %llu %lu %ld\n",
pid_nr_ns(pid, ns),
tcomm,
state,
@@ -544,6 +544,7 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
task->rt_priority,
task->policy,
(unsigned long long)delayacct_blkio_ticks(task),
+ (unsigned long long)delayacct_freepages_ticks(task),
cputime_to_clock_t(gtime),
cputime_to_clock_t(cgtime));
if (mm)
diff --git a/include/linux/delayacct.h b/include/linux/delayacct.h
index f352f06..226b754 100644
--- a/include/linux/delayacct.h
+++ b/include/linux/delayacct.h
@@ -41,6 +41,7 @@ extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
extern __u64 __delayacct_blkio_ticks(struct task_struct *);
extern void __delayacct_freepages_start(void);
extern void __delayacct_freepages_end(void);
+extern __u64 __delayacct_freepages_ticks(struct task_struct *);

static inline int delayacct_is_task_waiting_on_io(struct task_struct *p)
{
@@ -121,6 +122,13 @@ static inline void delayacct_freepages_end(void)
__delayacct_freepages_end();
}

+static inline __u64 delayacct_freepages_ticks(struct task_struct *tsk)
+{
+ if (tsk->delays)
+ return __delayacct_freepages_ticks(tsk);
+ return 0;
+}
+
#else
static inline void delayacct_set_flag(int flag)
{}
@@ -147,6 +155,8 @@ static inline void ...

To: Keika Kobayashi <kobayashi.kk@...>
Cc: <kosaki.motohiro@...>, <linux-kernel@...>, <akpm@...>
Date: Tuesday, June 3, 2008 - 11:07 pm

userland program oftern below access.
thus, this patch break userland compatibility.

----------------------------------------
#!/usr/bin/perl

$stat = `cat /proc/$pid/stat`;
split
@stat_list = split(/ / , $stat);
print stat_list[$index];
^
|
use array index number

--

To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: <kobayashi.kk@...>, <kosaki.motohiro@...>, <linux-kernel@...>
Date: Wednesday, June 4, 2008 - 6:41 pm

On Wed, 04 Jun 2008 12:07:29 +0900

Any application which breaks if we add more things to /proc/pid/stat
just needs to be fixed.

But we can't add new fields in the middle of the line! We can only add
new items to the end.

But I think it would be good if we just weren't to make this change.
Do we really want to keep reporting the same information in two
separate ways?

If so, a new /proc/pid/delays would be a better way, but it might be a
bit late for doing that now, given that we already put
delayacct_blkio_ticks into /proc/pid/stat.

--

To: Andrew Morton <akpm@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, <kobayashi.kk@...>, <linux-kernel@...>
Date: Wednesday, June 4, 2008 - 10:30 pm

On Wed, 4 Jun 2008 15:41:24 -0700

I agree. There is no strong reason with two method.
I'll drop this one from patch series next time.

--

To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: Keika Kobayashi <kobayashi.kk@...>, <linux-kernel@...>, <akpm@...>
Date: Wednesday, June 4, 2008 - 1:47 pm

It's my fault. I suggested him to group delayacct ticks.
I know this breakage, but was not sure what it affects.
I thought that if it'll be a problem we get a claim such you did.

Thanks,
Hiroshi Shimamoto
--

To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: <linux-kernel@...>, <akpm@...>
Date: Wednesday, June 4, 2008 - 12:11 am

On Wed, 04 Jun 2008 12:07:29 +0900

Certainly, it makes sense.
I thought this measurement value should be put
beside other delay account.
--

To: Keika Kobayashi <kobayashi.kk@...>
Cc: <kosaki.motohiro@...>, <linux-kernel@...>, <akpm@...>
Date: Wednesday, June 4, 2008 - 12:26 am

Yeah, I'm looking forward to your v2 :)

Thanks!

--

To: <linux-kernel@...>
Cc: <akpm@...>
Date: Tuesday, June 3, 2008 - 10:44 pm

Update document and
make getdelays.c show delay accounting of memory reclaim.

For making a distinction between "swapping in pages" and "memory reclaim"
in getdelays.c, MEM is changed to SWAP.

Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
---
Documentation/accounting/delay-accounting.txt | 11 ++++++++---
Documentation/accounting/getdelays.c | 8 ++++++--
2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/Documentation/accounting/delay-accounting.txt b/Documentation/accounting/delay-accounting.txt
index 1443cd7..8a12f07 100644
--- a/Documentation/accounting/delay-accounting.txt
+++ b/Documentation/accounting/delay-accounting.txt
@@ -11,6 +11,7 @@ the delays experienced by a task while
a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages
+d) memory reclaim

and makes these statistics available to userspace through
the taskstats interface.
@@ -41,7 +42,7 @@ this structure. See
include/linux/taskstats.h
for a description of the fields pertaining to delay accounting.
It will generally be in the form of counters returning the cumulative
-delay seen for cpu, sync block I/O, swapin etc.
+delay seen for cpu, sync block I/O, swapin, memory reclaim etc.

Taking the difference of two successive readings of a given
counter (say cpu_delay_total) for a task will give the delay
@@ -94,7 +95,9 @@ CPU count real total virtual total delay total
7876 92005750 100000000 24001500
IO count delay total
0 0
-MEM count delay total
+SWAP count delay total
+ 0 0
+RECLAIM count delay total
0 0

Get delays seen in executing a given simple command
@@ -108,5 +111,7 @@ CPU count real total virtual total delay total
6 4000250 4000000 0
IO count delay total
0 0
-MEM count delay total
+SWAP count delay total
+ 0 0
+RECLAIM count delay total
0 0
diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c
index 4012...

To: <linux-kernel@...>
Cc: <akpm@...>
Date: Tuesday, June 3, 2008 - 10:42 pm

Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
---
include/linux/taskstats.h | 4 ++++
kernel/delayacct.c | 3 +++
2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/taskstats.h b/include/linux/taskstats.h
index 5d69c07..87aae21 100644
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -81,6 +81,10 @@ struct taskstats {
__u64 swapin_count;
__u64 swapin_delay_total;

+ /* Delay waiting for memory reclaim */
+ __u64 freepages_count;
+ __u64 freepages_delay_total;
+
/* cpu "wall-clock" running time
* On some architectures, value will adjust for cpu time stolen
* from the kernel in involuntary waits due to virtualization.
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 84b6782..b3179da 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -145,8 +145,11 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
d->blkio_delay_total = (tmp < d->blkio_delay_total) ? 0 : tmp;
tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
+ tmp = d->freepages_delay_total + tsk->delays->freepages_delay;
+ d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp;
d->blkio_count += tsk->delays->blkio_count;
d->swapin_count += tsk->delays->swapin_count;
+ d->freepages_count += tsk->delays->freepages_count;
spin_unlock_irqrestore(&tsk->delays->lock, flags);

done:
--
1.5.0.6
--

To: Keika Kobayashi <kobayashi.kk@...>
Cc: <kosaki.motohiro@...>, <linux-kernel@...>, <akpm@...>
Date: Tuesday, June 3, 2008 - 10:59 pm

Please write patch desctiption :)

--

To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: <linux-kernel@...>, <akpm@...>
Date: Tuesday, June 3, 2008 - 11:20 pm

Add members for memory reclaim delay to taskstats,
and accumulate them in __delayacct_add_tsk() .

Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
---
On Wed, 04 Jun 2008 11:59:04 +0900

Thank you for your comment.
Here is update version.

include/linux/taskstats.h | 4 ++++
kernel/delayacct.c | 3 +++
2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/taskstats.h b/include/linux/taskstats.h
index 5d69c07..87aae21 100644
--- a/include/linux/taskstats.h
+++ b/include/linux/taskstats.h
@@ -81,6 +81,10 @@ struct taskstats {
__u64 swapin_count;
__u64 swapin_delay_total;

+ /* Delay waiting for memory reclaim */
+ __u64 freepages_count;
+ __u64 freepages_delay_total;
+
/* cpu "wall-clock" running time
* On some architectures, value will adjust for cpu time stolen
* from the kernel in involuntary waits due to virtualization.
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 84b6782..b3179da 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -145,8 +145,11 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
d->blkio_delay_total = (tmp < d->blkio_delay_total) ? 0 : tmp;
tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
+ tmp = d->freepages_delay_total + tsk->delays->freepages_delay;
+ d->freepages_delay_total = (tmp < d->freepages_delay_total) ? 0 : tmp;
d->blkio_count += tsk->delays->blkio_count;
d->swapin_count += tsk->delays->swapin_count;
+ d->freepages_count += tsk->delays->freepages_count;
spin_unlock_irqrestore(&tsk->delays->lock, flags);

done:
--
1.5.0.6
--

Previous thread: [PATCH 1/2] cpusets: restructure the function update_cpumask() and update_nodemask() by Miao Xie on Tuesday, June 3, 2008 - 10:03 pm. (1 message)

Next thread: per_cpu_counter_sum lockdep warning by Balbir Singh on Tuesday, June 3, 2008 - 11:27 pm. (4 messages)