Re: [PATCH] Add all thread stats for TASKSTATS_CMD_ATTR_TGID

Previous thread: [PATCH x86-64] Calgary - Fix mis-handled PCI topology by Muli Ben-Yehuda on Thursday, August 2, 2007 - 6:39 am. (2 messages)

Next thread: Re: [PATCH 2.6.23-rc1] eCryptfs: fix error handling in ecryptfs_init by Michael Halcrow on Thursday, August 2, 2007 - 6:56 am. (1 message)
From: Guillaume Chazarain
Date: Thursday, August 2, 2007 - 6:53 am

[Resent with different recipients as nagar@watson.ibm.com bounced with
a User unknown]

Hi,

This patch adds all thread accounting stats for the global tgid stats.
As a shameless plug, this fixes iotop -P (http://guichaz.free.fr/misc/iotop.py).

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
---

diff -r 22708012ca6e kernel/taskstats.c
--- a/kernel/taskstats.c	Tue Jul 31 21:12:07 2007 -0700
+++ b/kernel/taskstats.c	Wed Aug 01 17:43:54 2007 +0200
@@ -233,6 +233,7 @@ static int fill_tgid(pid_t tgid, struct
 		memset(stats, 0, sizeof(*stats));

 	tsk = first;
+	bacct_add_tsk(stats, first);
 	do {
 		if (tsk->exit_state)
 			continue;
@@ -246,6 +247,7 @@ static int fill_tgid(pid_t tgid, struct

 		stats->nvcsw += tsk->nvcsw;
 		stats->nivcsw += tsk->nivcsw;
+		xacct_add_tsk(stats, tsk);
 	} while_each_thread(first, tsk);

 	unlock_task_sighand(first, &flags);
diff -r 22708012ca6e kernel/tsacct.c
--- a/kernel/tsacct.c	Tue Jul 31 21:12:07 2007 -0700
+++ b/kernel/tsacct.c	Wed Aug 01 17:41:40 2007 +0200
@@ -81,8 +81,8 @@ void xacct_add_tsk(struct taskstats *sta
 	struct mm_struct *mm;

 	/* convert pages-jiffies to Mbyte-usec */
-	stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
-	stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
+	stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB;
+	stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB;
 	mm = get_task_mm(p);
 	if (mm) {
 		/* adjust to KB unit */
@@ -90,18 +90,14 @@ void xacct_add_tsk(struct taskstats *sta
 		stats->hiwater_vm    = mm->hiwater_vm * PAGE_SIZE / KB;
 		mmput(mm);
 	}
-	stats->read_char	= p->rchar;
-	stats->write_char	= p->wchar;
-	stats->read_syscalls	= p->syscr;
-	stats->write_syscalls	= p->syscw;
+	stats->read_char	+= p->rchar;
+	stats->write_char	+= p->wchar;
+	stats->read_syscalls	+= p->syscr;
+	stats->write_syscalls	+= p->syscw;
 #ifdef CONFIG_TASK_IO_ACCOUNTING
-	stats->read_bytes	= ...
From: Andrew Morton
Date: Thursday, August 2, 2007 - 12:04 pm

On Thu, 2 Aug 2007 15:53:13 +0200

uhm, that's a fairly skimpy changelog for a fairly significant functional


Suitable cc's added.  Guys, please have a think about the implications of
this change?  In particular, why didn't we do this on day one?  It's so
obvious that I have a feeling we must have had a reason?

-

From: Guillaume Chazarain
Date: Sunday, August 19, 2007 - 12:34 pm

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats,
not the basic and extended accounting. With this patch,
TASKSTATS_CMD_ATTR_TGID returns also the min, max or sum (depending on
the semantic of the field) of the accounting info for all threads of a
thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
- Update combined stats of exited threads in fill_tgid_exit() as
suggested by Balbir Singh.
- Very light cleanup of fill_tgid_exit() by the way.
- bacct fields are also combined for all threads.
- Instead of assuming memory stats are identical for all threads, we
take the max of all threads (hiwater_rss and hiwater_vm).

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Jonathan Lim <jlim@sgi.com>
---

 kernel/taskstats.c |   15 ++++++++++-----
 kernel/tsacct.c    |   43 ++++++++++++++++++++++---------------------
 2 files changed, 32 insertions(+), 26 deletions(-)

diff -r 73eff559debc kernel/taskstats.c
--- a/kernel/taskstats.c	Sat Aug 18 17:15:17 2007 -0700
+++ b/kernel/taskstats.c	Sun Aug 19 17:20:15 2007 +0200
@@ -246,6 +246,8 @@ static int fill_tgid(pid_t tgid, struct 
 
 		stats->nvcsw += tsk->nvcsw;
 		stats->nivcsw += tsk->nivcsw;
+		bacct_add_tsk(stats, tsk);
+		xacct_add_tsk(stats, tsk);
 	} while_each_thread(first, tsk);
 
 	unlock_task_sighand(first, &flags);
@@ -265,21 +267,24 @@ static void fill_tgid_exit(struct task_s
 static void fill_tgid_exit(struct task_struct *tsk)
 {
 	unsigned long flags;
+	struct taskstats *tg_stats;
 
 	spin_lock_irqsave(&tsk->sighand->siglock, flags);
-	if (!tsk->signal->stats)
+	tg_stats = tsk->signal->stats;
+	if (!tg_stats)
 		goto ret;
 
 	/*
 	 * Each accounting subsystem calls its functions here to
 	 * accumalate its per-task stats for tsk, into ...
From: Balbir Singh
Date: Monday, August 20, 2007 - 10:01 am

I'm afraid this is still not good enough. bacct_add_tsk() will assign
values and do nothing in the loop (HINT: no summation). Ideally,
we should split up bacct_add_tsk(), so that the initialization


I am not sure why we get the max value, ideally we should
be summing up utime, stime into the stats structure.


Could you please write a simple test case and ensure that the
summed up stats are indeed correct for TGID's?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Guillaume Chazarain
Date: Saturday, August 25, 2007 - 8:10 am

Le Mon, 20 Aug 2007 22:31:08 +0530,

Hi Balbir, thank you for your review. I agree with everything you said
and am on my way to do it as time permits, but I have some trouble
understanding this part. You stated that bacct_add_tsk() would overwrite
the stats of each thread in the tgid stats, but the other part of the
patch is the (actually wrong) combination of stats in xxx_add_tsk()
using min/max/sum.

Also, I don't understand why the code to update btime:

        /* calculate task elapsed time in timespec */
        do_posix_clock_monotonic_gettime(&uptime);
        ts = timespec_sub(uptime, tsk->start_time);
	...
        stats->ac_btime = get_seconds() - ts.tv_sec;

does not simply use tsk->start_time or tsk->real_start_time without
comparing it to the current time.

Thanks.

-- 
Guillaume
-

From: Balbir Singh
Date: Saturday, August 25, 2007 - 9:58 pm

Hi, Guillaume,

The CSA code was written by Jay Lan, but I'll see if I can answer your
questions. In the current implementation, CSA does not use TGID exit
notifications. In fill_pid(), both bacct_add_tsk() and xacct_add_tsk()
are called (which is correct), they complement each other.

The code needs refactoring to ensure that they can work together
are taken from the realtime clock. The accounting in CSA seems
to be very similar to the accounting done in do_acct_process()
(kernel/acct.c).

Jay/Jonathan any comments?


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Guillaume Chazarain
Date: Sunday, August 26, 2007 - 2:44 am

Le Sun, 26 Aug 2007 10:28:44 +0530,

Thanks, I see it is used to have a wall time version of btime.
Here is my current version of the refactoring for information.
The testcase is coming.

diff -r 27877ed82082 include/linux/tsacct_kern.h
--- a/include/linux/tsacct_kern.h	Sat Aug 25 22:17:19 2007 +0200
+++ b/include/linux/tsacct_kern.h	Sat Aug 25 21:48:51 2007 +0200
@@ -10,17 +10,23 @@
 #include <linux/taskstats.h>
 
 #ifdef CONFIG_TASKSTATS
-extern void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk);
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task);
 #else
-static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static inline void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
+static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
 {}
 #endif /* CONFIG_TASKSTATS */
 
 #ifdef CONFIG_TASK_XACCT
-extern void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
 extern void acct_update_integrals(struct task_struct *tsk);
 extern void acct_clear_integrals(struct task_struct *tsk);
 #else
+static inline void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
 static inline void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {}
 static inline void acct_update_integrals(struct task_struct *tsk)
diff -r 27877ed82082 kernel/taskstats.c
--- a/kernel/taskstats.c	Sat Aug 25 22:17:19 2007 +0200
+++ b/kernel/taskstats.c	Sat Aug 25 22:11:23 2007 +0200
@@ -168,6 +168,55 @@ static void send_cpu_listeners(struct sk
 	up_write(&listeners->sem);
 }
 
+/*
+ * Common stats for each thread in the thread group, @task is not necessarily
+ * a thread group leader.
+ */
+static void ...
From: Jonathan Lim
Date: Thursday, August 30, 2007 - 8:02 pm

In CSA 3.0 ...

csa_acct_eop(int exitcode, struct task_struct *p)

        csa->ac_btime = boottime +
                        ((p->start_time.tv_nsec < NSEC_PER_SEC/2) ?
                         p->start_time.tv_sec :
                         p->start_time.tv_sec +1);

where 

        do_posix_clock_monotonic_gettime(&uptime);
        boottime = xtime.tv_sec - uptime.tv_sec;

In an upcoming version of CSA ...

csa_acct_eop(struct taskstats *p)

        csa->ac_btime = p->ac_btime;

where

        do_posix_clock_monotonic_gettime(&uptime);
        ts = uptime - tsk->start_time;
        p->ac_btime = get_seconds() - ts.tv_sec;
                    = xtime.tv_sec - (uptime - tsk->start_time);
                    = (xtime.tv_sec - uptime) + tsk->start_time;

So they're basically equivalent.

Jonathan
-

From: Balbir Singh
Date: Friday, August 31, 2007 - 12:24 am

Excellent, so can Guillaume change ac_btime to be just tsk->start_time?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Guillaume Chazarain
Date: Friday, August 31, 2007 - 5:35 am

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats,
not the basic and extended accounting. With this patch,
TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads of a
thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Here is the output of the testcase before the patch:

Name          |      System|        User|  Cached I/O|        Self|       Group|
--------------+------------+------------+------------+------------+------------+
version       |           5|           5|           5|           5|           5|
ac_exitcode   |           0|           0|           0|           0|           0|
ac_flag       |           0|           0|           0|           0|           0|
ac_nice       |           5|          10|          15|           0|           0|
cpu_count     |        2020|        1378|         360|       70645|       74428|
cpu_delay_tota|        2327|    14281152|    14225866|       30475|    44505841|
blkio_delay_to|           0|           0|           0|           0|  1146901308|
swapin_count  |           0|           0|           0|           0|           0|
swapin_delay_t|           0|           0|           0|           0|           0|
cpu_run_real_t|  4794271160|  1269806960|   331949536|    44993160|  6562002424|
cpu_run_virtua|  4792937602|  1247020977|   330330065|    43082258|  6530204192|
ac_comm       |      cached|      cached|      cached|      cached|            |
ac_uid        |         500|         500|         500|         500|           0|
ac_gid        |         500|         500|         500|         500|           0|
ac_pid        |        2513|        2514|        2515|        2512|           0|
ac_ppid       |        2445|        2445|        2445|        2445|           0|
ac_btime      |  1188326011|  1188326012|  1188326013|  1188326011|           0|
ac_etime      |     6619382|     5615437|  ...
From: Andrew Morton
Date: Wednesday, September 12, 2007 - 5:18 pm

On Fri, 31 Aug 2007 14:35:35 +0200

Another C file in the Documentation directory.  Sigh.  At kernel summit I
suggested that we should be putting these things in a place from where we
can actually build and install them.  People said there was no need to do
that because kernel developers can now easily get new stuff into

newline here.  Just because it's userspace doesn't mean that it needs to
look crappy, despite all the code out there which disproves this ;)



Are the get_task_struct/put_task_struct here actually needed?


-

From: Guillaume Chazarain
Date: Saturday, September 15, 2007 - 11:42 am

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
the basic and extended accounting.  With this patch,
TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
of a thread group.  This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Changelog since V3 (http://lkml.org/lkml/2007/8/31/121):
- Removed userspace example, either it gets accepted in util-linux-ng or I'll
maintain it elsewhere.
- Added kerneldoc for fill_threadgroup() and add_tsk().
- Removed useless {get,put}_task_struct(leader) as spotted by Andrew Morton
and Oleg Nesterov.
- Use lock_task_sighand() instead of spin_lock_irqsave(&tsk->sighand->siglock)
for consistency with the locking of task->signal->stats in fill_tgid().
- Removed useless check for a NULL taskstats in fill_tgid_exit(). Thanks Oleg.
- Documented double accounting race seen by Oleg.
- Rephrased the fill_tgid_exit() comment as per Oleg's recommendation.
- Documented the special case for the AFORK ac_flag.
- Use the exit status (code >> 8) instead of the exit code as documented in
Documentation/accounting/taskstats-struct.txt.
- Use signal->group_exit_code if set for stats->ac_exitcode on a TGID as
suggested by Oleg.

Changelog since V2 (http://lkml.org/lkml/2007/8/19/96):
- Added a testcase
- Added an indirection between the stats producer and consumer:
add_task() & fill_threadgroup()
- TGID stats are either summed from all the threads or taken from the leader

Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
- Update combined stats of exited threads in fill_tgid_exit() as
suggested by Balbir Singh.
- Very light cleanup of fill_tgid_exit() by the way.
- bacct fields are also combined for all threads.
- Instead of assuming memory stats are identical for all threads, we
take the max of all threads.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Cc: Balbir Singh ...
From: Guillaume Chazarain
Date: Monday, September 17, 2007 - 3:23 pm

TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not
the basic and extended accounting.  With this patch,
TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads
of a thread group.  This makes TASKSTATS_CMD_ATTR_TGID usable in a similar
fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P
(http://guichaz.free.fr/misc/iotop.py).

Changelog since V4 (http://lkml.org/lkml/2007/9/15/171):
- Revert gratuitous user interface change (returning exit_code >> 8 instead of
exit_code). Thanks Oleg Nesterov.
- Revert useless heavyweight locking (lock_task_sighand() in fill_tgid_exit).
Thanks Oleg.
- Correctly fill the TGID in taskstats_exit(). Thanks Oleg.

Changelog since V3 (http://lkml.org/lkml/2007/8/31/121):
- Removed userspace example, either it gets accepted in util-linux-ng or I'll
maintain it elsewhere.
- Added kerneldoc for fill_threadgroup() and add_tsk().
- Removed useless {get,put}_task_struct(leader) as spotted by Andrew Morton
and Oleg Nesterov.
- Use lock_task_sighand() instead of spin_lock_irqsave(&tsk->sighand->siglock)
for consistency with the locking of task->signal->stats in fill_tgid().
- Removed useless check for a NULL taskstats in fill_tgid_exit(). Thanks Oleg.
- Documented double accounting race seen by Oleg.
- Rephrased the fill_tgid_exit() comment as per Oleg's recommendation.
- Documented the special case for the AFORK ac_flag.
- Use the exit status (code >> 8) instead of the exit code as documented in
Documentation/accounting/taskstats-struct.txt.
- Use signal->group_exit_code if set for stats->ac_exitcode on a TGID as
suggested by Oleg.

Changelog since V2 (http://lkml.org/lkml/2007/8/19/96):
- Added a testcase
- Added an indirection between the stats producer and consumer:
add_task() & fill_threadgroup()
- TGID stats are either summed from all the threads or taken from the leader

Changelog since V1 (http://lkml.org/lkml/2007/8/2/185):
- Update combined stats of exited threads in fill_tgid_exit() ...
From: Oleg Nesterov
Date: Tuesday, September 18, 2007 - 8:29 am

No, no, this is wrong.

tsk->signal->stats contains the accumulated info about the already exited
threads, we shouldn't throw it out.

Also, fill_tgid() doesn't make sense here, current is the last live sub-thread.


Hmm. We have another race here. There is no guarantee that tsk->signal->stats
covers all sub-threads, as it is supposed to be. It is quite possible that
another sub-thread decremented ->signal->live before us, but didn't complete
taskstats_exit()->fill_tgid_exit() yet.

Oleg.

-

From: Andrew Morton
Date: Wednesday, September 19, 2007 - 11:20 pm

This patch conflicts somewhat with
add-scaled-time-to-taskstats-based-process-accounting.patch

I fixed it up like this:

void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
{
	if (task->flags & PF_SUPERPRIV)
		stats->ac_flag |= ASU;
	if (task->flags & PF_DUMPCORE)
		stats->ac_flag |= ACORE;
	if (task->flags & PF_SIGNALED)
		stats->ac_flag |= AXSIG;
	if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
		/*
		 * Threads are created by do_fork() and don't exec but not in
		 * the AFORK sense, as the latter involves fork(2).
		 */
		stats->ac_flag |= AFORK;

	stats->ac_utimescaled +=
		cputime_to_msecs(task->utimescaled) * USEC_PER_MSEC;
	stats->ac_stimescaled +=
		cputime_to_msecs(task->stimescaled) * USEC_PER_MSEC;
	stats->ac_utime	 += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
	stats->ac_stime	 += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
	stats->ac_minflt += task->min_flt;
	stats->ac_majflt += task->maj_flt;
}

(note the s/=/+=/ in there) but it all needs reviewing and checking and
testing please.
-

From: Balbir Singh
Date: Thursday, September 20, 2007 - 1:54 am

Andrew,

Thanks for reviewing the patchset, this patch is on my review and test
queue (which has gotten rather long of late). I'll test it further and
get back.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Oleg Nesterov
Date: Thursday, September 20, 2007 - 5:16 am

I still think this version is very wrong. It makes the ->signal->stats
absolutely meaningless. Quoting myself:

	On 09/18, Guillaume Chazarain wrote:
	>
	> @@ -508,7 +543,7 @@ void taskstats_exit(struct task_struct *
	>    if (!stats)
	>            goto err;
	>
	> -  memcpy(stats, tsk->signal->stats, sizeof(*stats));
	> +  fill_tgid(tsk->pid, tsk, stats);

	No, no, this is wrong.

	tsk->signal->stats contains the accumulated info about the already exited
	threads, we shouldn't throw it out.

	Also, fill_tgid() doesn't make sense here, current is the last live sub-thread.

Oleg.

-

From: Balbir Singh
Date: Thursday, September 20, 2007 - 5:17 am

Hi, Oleg,

Yes, I see, removing the memcpy is definitely wrong. Thanks for catching
it. I did not get a chance to review the patch, it's on my review queue,
but since you've reviewed it, I am very glad that you have and
identified potential issues.

Big Thanks!

-- 
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Jonathan Lim
Date: Friday, September 7, 2007 - 4:37 pm

I don't think so.  Current time (xtime) is relative to the epoch; uptime and
tsk->start_time (jiffies) are both relative to some boot time.  So you need to
subtract uptime from xtime to get the boot time relative to the epoch, then add
tsk->start_time.  The result is what ac_btime should be set to.

I think his recent changes are as follows:

--- a/kernel/tsacct.c   Fri Aug 31 01:42:23 2007 -0700
+++ b/kernel/tsacct.c   Tue Aug 28 20:35:27 2007 +0200
...
-void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static void fill_wall_times(struct taskstats *stats, struct task_struct *task)
...
-       ts = timespec_sub(uptime, tsk->start_time);
+       ts = timespec_sub(uptime, task->start_time);
...
-       stats->ac_btime = get_seconds() - ts.tv_sec;
...
+       stats->ac_btime = get_seconds() - ts.tv_sec;

So really no different from before, which is correct.

Jonathan
-

From: Guillaume Chazarain
Date: Monday, September 10, 2007 - 6:03 am

Le Fri, 7 Sep 2007 16:37:57 -0700 (PDT),

Yes, I just tried to make it clearer that the computations were needed
to get a wall time. Otherwise, as CSA seems to start using taskstats,
do my changes make sense for your usage?

Thanks.

-- 
Guillaume
-

From: Balbir Singh
Date: Wednesday, August 15, 2007 - 12:15 am

Since this changes CSA behaviour, it would be nice to get Jay Lan

using bacct and xacct might not work well in this case. You'll end
up with basic accounting data for the thread group leader (utime,
stime. For delay accounting, we accumulate data of tasks that have
exited the thread group in fill_tgid_exit(), that way when we
sum up data, we continue storing data of all tasks that exited.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

Previous thread: [PATCH x86-64] Calgary - Fix mis-handled PCI topology by Muli Ben-Yehuda on Thursday, August 2, 2007 - 6:39 am. (2 messages)

Next thread: Re: [PATCH 2.6.23-rc1] eCryptfs: fix error handling in ecryptfs_init by Michael Halcrow on Thursday, August 2, 2007 - 6:56 am. (1 message)