[Resent with different recipients as nagar@watson.ibm.com bounced with a User unknown] Hi, This patch adds all thread accounting stats for the global tgid stats. As a shameless plug, this fixes iotop -P (http://guichaz.free.fr/misc/iotop.py). Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> --- diff -r 22708012ca6e kernel/taskstats.c --- a/kernel/taskstats.c Tue Jul 31 21:12:07 2007 -0700 +++ b/kernel/taskstats.c Wed Aug 01 17:43:54 2007 +0200 @@ -233,6 +233,7 @@ static int fill_tgid(pid_t tgid, struct memset(stats, 0, sizeof(*stats)); tsk = first; + bacct_add_tsk(stats, first); do { if (tsk->exit_state) continue; @@ -246,6 +247,7 @@ static int fill_tgid(pid_t tgid, struct stats->nvcsw += tsk->nvcsw; stats->nivcsw += tsk->nivcsw; + xacct_add_tsk(stats, tsk); } while_each_thread(first, tsk); unlock_task_sighand(first, &flags); diff -r 22708012ca6e kernel/tsacct.c --- a/kernel/tsacct.c Tue Jul 31 21:12:07 2007 -0700 +++ b/kernel/tsacct.c Wed Aug 01 17:41:40 2007 +0200 @@ -81,8 +81,8 @@ void xacct_add_tsk(struct taskstats *sta struct mm_struct *mm; /* convert pages-jiffies to Mbyte-usec */ - stats->coremem = jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB; - stats->virtmem = jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB; + stats->coremem += jiffies_to_usecs(p->acct_rss_mem1) * PAGE_SIZE / MB; + stats->virtmem += jiffies_to_usecs(p->acct_vm_mem1) * PAGE_SIZE / MB; mm = get_task_mm(p); if (mm) { /* adjust to KB unit */ @@ -90,18 +90,14 @@ void xacct_add_tsk(struct taskstats *sta stats->hiwater_vm = mm->hiwater_vm * PAGE_SIZE / KB; mmput(mm); } - stats->read_char = p->rchar; - stats->write_char = p->wchar; - stats->read_syscalls = p->syscr; - stats->write_syscalls = p->syscw; + stats->read_char += p->rchar; + stats->write_char += p->wchar; + stats->read_syscalls += p->syscr; + stats->write_syscalls += p->syscw; #ifdef CONFIG_TASK_IO_ACCOUNTING - stats->read_bytes = ...
On Thu, 2 Aug 2007 15:53:13 +0200 uhm, that's a fairly skimpy changelog for a fairly significant functional Suitable cc's added. Guys, please have a think about the implications of this change? In particular, why didn't we do this on day one? It's so obvious that I have a feeling we must have had a reason? -
TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not the basic and extended accounting. With this patch, TASKSTATS_CMD_ATTR_TGID returns also the min, max or sum (depending on the semantic of the field) of the accounting info for all threads of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P (http://guichaz.free.fr/misc/iotop.py). Changelog since V1 (http://lkml.org/lkml/2007/8/2/185): - Update combined stats of exited threads in fill_tgid_exit() as suggested by Balbir Singh. - Very light cleanup of fill_tgid_exit() by the way. - bacct fields are also combined for all threads. - Instead of assuming memory stats are identical for all threads, we take the max of all threads (hiwater_rss and hiwater_vm). Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Jonathan Lim <jlim@sgi.com> --- kernel/taskstats.c | 15 ++++++++++----- kernel/tsacct.c | 43 ++++++++++++++++++++++--------------------- 2 files changed, 32 insertions(+), 26 deletions(-) diff -r 73eff559debc kernel/taskstats.c --- a/kernel/taskstats.c Sat Aug 18 17:15:17 2007 -0700 +++ b/kernel/taskstats.c Sun Aug 19 17:20:15 2007 +0200 @@ -246,6 +246,8 @@ static int fill_tgid(pid_t tgid, struct stats->nvcsw += tsk->nvcsw; stats->nivcsw += tsk->nivcsw; + bacct_add_tsk(stats, tsk); + xacct_add_tsk(stats, tsk); } while_each_thread(first, tsk); unlock_task_sighand(first, &flags); @@ -265,21 +267,24 @@ static void fill_tgid_exit(struct task_s static void fill_tgid_exit(struct task_struct *tsk) { unsigned long flags; + struct taskstats *tg_stats; spin_lock_irqsave(&tsk->sighand->siglock, flags); - if (!tsk->signal->stats) + tg_stats = tsk->signal->stats; + if (!tg_stats) goto ret; /* * Each accounting subsystem calls its functions here to * accumalate its per-task stats for tsk, into ...
I'm afraid this is still not good enough. bacct_add_tsk() will assign values and do nothing in the loop (HINT: no summation). Ideally, we should split up bacct_add_tsk(), so that the initialization I am not sure why we get the max value, ideally we should be summing up utime, stime into the stats structure. Could you please write a simple test case and ensure that the summed up stats are indeed correct for TGID's? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
Le Mon, 20 Aug 2007 22:31:08 +0530,
Hi Balbir, thank you for your review. I agree with everything you said
and am on my way to do it as time permits, but I have some trouble
understanding this part. You stated that bacct_add_tsk() would overwrite
the stats of each thread in the tgid stats, but the other part of the
patch is the (actually wrong) combination of stats in xxx_add_tsk()
using min/max/sum.
Also, I don't understand why the code to update btime:
/* calculate task elapsed time in timespec */
do_posix_clock_monotonic_gettime(&uptime);
ts = timespec_sub(uptime, tsk->start_time);
...
stats->ac_btime = get_seconds() - ts.tv_sec;
does not simply use tsk->start_time or tsk->real_start_time without
comparing it to the current time.
Thanks.
--
Guillaume
-
Hi, Guillaume, The CSA code was written by Jay Lan, but I'll see if I can answer your questions. In the current implementation, CSA does not use TGID exit notifications. In fill_pid(), both bacct_add_tsk() and xacct_add_tsk() are called (which is correct), they complement each other. The code needs refactoring to ensure that they can work together are taken from the realtime clock. The accounting in CSA seems to be very similar to the accounting done in do_acct_process() (kernel/acct.c). Jay/Jonathan any comments? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
Le Sun, 26 Aug 2007 10:28:44 +0530,
Thanks, I see it is used to have a wall time version of btime.
Here is my current version of the refactoring for information.
The testcase is coming.
diff -r 27877ed82082 include/linux/tsacct_kern.h
--- a/include/linux/tsacct_kern.h Sat Aug 25 22:17:19 2007 +0200
+++ b/include/linux/tsacct_kern.h Sat Aug 25 21:48:51 2007 +0200
@@ -10,17 +10,23 @@
#include <linux/taskstats.h>
#ifdef CONFIG_TASKSTATS
-extern void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk);
+void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void bacct_add_tsk(struct taskstats *stats, struct task_struct *task);
#else
-static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk)
+static inline void bacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
+static inline void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
{}
#endif /* CONFIG_TASKSTATS */
#ifdef CONFIG_TASK_XACCT
-extern void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
+void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task);
+void xacct_add_tsk(struct taskstats *stats, struct task_struct *p);
extern void acct_update_integrals(struct task_struct *tsk);
extern void acct_clear_integrals(struct task_struct *tsk);
#else
+static inline void xacct_fill_threadgroup(struct taskstats *stats, struct task_struct *task)
+{}
static inline void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
{}
static inline void acct_update_integrals(struct task_struct *tsk)
diff -r 27877ed82082 kernel/taskstats.c
--- a/kernel/taskstats.c Sat Aug 25 22:17:19 2007 +0200
+++ b/kernel/taskstats.c Sat Aug 25 22:11:23 2007 +0200
@@ -168,6 +168,55 @@ static void send_cpu_listeners(struct sk
up_write(&listeners->sem);
}
+/*
+ * Common stats for each thread in the thread group, @task is not necessarily
+ * a thread group leader.
+ */
+static void ...In CSA 3.0 ...
csa_acct_eop(int exitcode, struct task_struct *p)
csa->ac_btime = boottime +
((p->start_time.tv_nsec < NSEC_PER_SEC/2) ?
p->start_time.tv_sec :
p->start_time.tv_sec +1);
where
do_posix_clock_monotonic_gettime(&uptime);
boottime = xtime.tv_sec - uptime.tv_sec;
In an upcoming version of CSA ...
csa_acct_eop(struct taskstats *p)
csa->ac_btime = p->ac_btime;
where
do_posix_clock_monotonic_gettime(&uptime);
ts = uptime - tsk->start_time;
p->ac_btime = get_seconds() - ts.tv_sec;
= xtime.tv_sec - (uptime - tsk->start_time);
= (xtime.tv_sec - uptime) + tsk->start_time;
So they're basically equivalent.
Jonathan
-
Excellent, so can Guillaume change ac_btime to be just tsk->start_time? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not the basic and extended accounting. With this patch, TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P (http://guichaz.free.fr/misc/iotop.py). Here is the output of the testcase before the patch: Name | System| User| Cached I/O| Self| Group| --------------+------------+------------+------------+------------+------------+ version | 5| 5| 5| 5| 5| ac_exitcode | 0| 0| 0| 0| 0| ac_flag | 0| 0| 0| 0| 0| ac_nice | 5| 10| 15| 0| 0| cpu_count | 2020| 1378| 360| 70645| 74428| cpu_delay_tota| 2327| 14281152| 14225866| 30475| 44505841| blkio_delay_to| 0| 0| 0| 0| 1146901308| swapin_count | 0| 0| 0| 0| 0| swapin_delay_t| 0| 0| 0| 0| 0| cpu_run_real_t| 4794271160| 1269806960| 331949536| 44993160| 6562002424| cpu_run_virtua| 4792937602| 1247020977| 330330065| 43082258| 6530204192| ac_comm | cached| cached| cached| cached| | ac_uid | 500| 500| 500| 500| 0| ac_gid | 500| 500| 500| 500| 0| ac_pid | 2513| 2514| 2515| 2512| 0| ac_ppid | 2445| 2445| 2445| 2445| 0| ac_btime | 1188326011| 1188326012| 1188326013| 1188326011| 0| ac_etime | 6619382| 5615437| ...
On Fri, 31 Aug 2007 14:35:35 +0200 Another C file in the Documentation directory. Sigh. At kernel summit I suggested that we should be putting these things in a place from where we can actually build and install them. People said there was no need to do that because kernel developers can now easily get new stuff into newline here. Just because it's userspace doesn't mean that it needs to look crappy, despite all the code out there which disproves this ;) Are the get_task_struct/put_task_struct here actually needed? -
TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not the basic and extended accounting. With this patch, TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P (http://guichaz.free.fr/misc/iotop.py). Changelog since V3 (http://lkml.org/lkml/2007/8/31/121): - Removed userspace example, either it gets accepted in util-linux-ng or I'll maintain it elsewhere. - Added kerneldoc for fill_threadgroup() and add_tsk(). - Removed useless {get,put}_task_struct(leader) as spotted by Andrew Morton and Oleg Nesterov. - Use lock_task_sighand() instead of spin_lock_irqsave(&tsk->sighand->siglock) for consistency with the locking of task->signal->stats in fill_tgid(). - Removed useless check for a NULL taskstats in fill_tgid_exit(). Thanks Oleg. - Documented double accounting race seen by Oleg. - Rephrased the fill_tgid_exit() comment as per Oleg's recommendation. - Documented the special case for the AFORK ac_flag. - Use the exit status (code >> 8) instead of the exit code as documented in Documentation/accounting/taskstats-struct.txt. - Use signal->group_exit_code if set for stats->ac_exitcode on a TGID as suggested by Oleg. Changelog since V2 (http://lkml.org/lkml/2007/8/19/96): - Added a testcase - Added an indirection between the stats producer and consumer: add_task() & fill_threadgroup() - TGID stats are either summed from all the threads or taken from the leader Changelog since V1 (http://lkml.org/lkml/2007/8/2/185): - Update combined stats of exited threads in fill_tgid_exit() as suggested by Balbir Singh. - Very light cleanup of fill_tgid_exit() by the way. - bacct fields are also combined for all threads. - Instead of assuming memory stats are identical for all threads, we take the max of all threads. Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr> Cc: Balbir Singh ...
TASKSTATS_CMD_ATTR_TGID used to return only the delay accounting stats, not the basic and extended accounting. With this patch, TASKSTATS_CMD_ATTR_TGID also aggregates the accounting info for all threads of a thread group. This makes TASKSTATS_CMD_ATTR_TGID usable in a similar fashion to TASKSTATS_CMD_ATTR_PID, for commands like iotop -P (http://guichaz.free.fr/misc/iotop.py). Changelog since V4 (http://lkml.org/lkml/2007/9/15/171): - Revert gratuitous user interface change (returning exit_code >> 8 instead of exit_code). Thanks Oleg Nesterov. - Revert useless heavyweight locking (lock_task_sighand() in fill_tgid_exit). Thanks Oleg. - Correctly fill the TGID in taskstats_exit(). Thanks Oleg. Changelog since V3 (http://lkml.org/lkml/2007/8/31/121): - Removed userspace example, either it gets accepted in util-linux-ng or I'll maintain it elsewhere. - Added kerneldoc for fill_threadgroup() and add_tsk(). - Removed useless {get,put}_task_struct(leader) as spotted by Andrew Morton and Oleg Nesterov. - Use lock_task_sighand() instead of spin_lock_irqsave(&tsk->sighand->siglock) for consistency with the locking of task->signal->stats in fill_tgid(). - Removed useless check for a NULL taskstats in fill_tgid_exit(). Thanks Oleg. - Documented double accounting race seen by Oleg. - Rephrased the fill_tgid_exit() comment as per Oleg's recommendation. - Documented the special case for the AFORK ac_flag. - Use the exit status (code >> 8) instead of the exit code as documented in Documentation/accounting/taskstats-struct.txt. - Use signal->group_exit_code if set for stats->ac_exitcode on a TGID as suggested by Oleg. Changelog since V2 (http://lkml.org/lkml/2007/8/19/96): - Added a testcase - Added an indirection between the stats producer and consumer: add_task() & fill_threadgroup() - TGID stats are either summed from all the threads or taken from the leader Changelog since V1 (http://lkml.org/lkml/2007/8/2/185): - Update combined stats of exited threads in fill_tgid_exit() ...
No, no, this is wrong. tsk->signal->stats contains the accumulated info about the already exited threads, we shouldn't throw it out. Also, fill_tgid() doesn't make sense here, current is the last live sub-thread. Hmm. We have another race here. There is no guarantee that tsk->signal->stats covers all sub-threads, as it is supposed to be. It is quite possible that another sub-thread decremented ->signal->live before us, but didn't complete taskstats_exit()->fill_tgid_exit() yet. Oleg. -
This patch conflicts somewhat with
add-scaled-time-to-taskstats-based-process-accounting.patch
I fixed it up like this:
void bacct_add_tsk(struct taskstats *stats, struct task_struct *task)
{
if (task->flags & PF_SUPERPRIV)
stats->ac_flag |= ASU;
if (task->flags & PF_DUMPCORE)
stats->ac_flag |= ACORE;
if (task->flags & PF_SIGNALED)
stats->ac_flag |= AXSIG;
if (thread_group_leader(task) && (task->flags & PF_FORKNOEXEC))
/*
* Threads are created by do_fork() and don't exec but not in
* the AFORK sense, as the latter involves fork(2).
*/
stats->ac_flag |= AFORK;
stats->ac_utimescaled +=
cputime_to_msecs(task->utimescaled) * USEC_PER_MSEC;
stats->ac_stimescaled +=
cputime_to_msecs(task->stimescaled) * USEC_PER_MSEC;
stats->ac_utime += cputime_to_msecs(task->utime) * USEC_PER_MSEC;
stats->ac_stime += cputime_to_msecs(task->stime) * USEC_PER_MSEC;
stats->ac_minflt += task->min_flt;
stats->ac_majflt += task->maj_flt;
}
(note the s/=/+=/ in there) but it all needs reviewing and checking and
testing please.
-
Andrew, Thanks for reviewing the patchset, this patch is on my review and test queue (which has gotten rather long of late). I'll test it further and get back. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
I still think this version is very wrong. It makes the ->signal->stats absolutely meaningless. Quoting myself: On 09/18, Guillaume Chazarain wrote: > > @@ -508,7 +543,7 @@ void taskstats_exit(struct task_struct * > if (!stats) > goto err; > > - memcpy(stats, tsk->signal->stats, sizeof(*stats)); > + fill_tgid(tsk->pid, tsk, stats); No, no, this is wrong. tsk->signal->stats contains the accumulated info about the already exited threads, we shouldn't throw it out. Also, fill_tgid() doesn't make sense here, current is the last live sub-thread. Oleg. -
Hi, Oleg, Yes, I see, removing the memcpy is definitely wrong. Thanks for catching it. I did not get a chance to review the patch, it's on my review queue, but since you've reviewed it, I am very glad that you have and identified potential issues. Big Thanks! -- Balbir Singh Linux Technology Center IBM, ISTL -
I don't think so. Current time (xtime) is relative to the epoch; uptime and tsk->start_time (jiffies) are both relative to some boot time. So you need to subtract uptime from xtime to get the boot time relative to the epoch, then add tsk->start_time. The result is what ac_btime should be set to. I think his recent changes are as follows: --- a/kernel/tsacct.c Fri Aug 31 01:42:23 2007 -0700 +++ b/kernel/tsacct.c Tue Aug 28 20:35:27 2007 +0200 ... -void bacct_add_tsk(struct taskstats *stats, struct task_struct *tsk) +static void fill_wall_times(struct taskstats *stats, struct task_struct *task) ... - ts = timespec_sub(uptime, tsk->start_time); + ts = timespec_sub(uptime, task->start_time); ... - stats->ac_btime = get_seconds() - ts.tv_sec; ... + stats->ac_btime = get_seconds() - ts.tv_sec; So really no different from before, which is correct. Jonathan -
Le Fri, 7 Sep 2007 16:37:57 -0700 (PDT), Yes, I just tried to make it clearer that the computations were needed to get a wall time. Otherwise, as CSA seems to start using taskstats, do my changes make sense for your usage? Thanks. -- Guillaume -
Since this changes CSA behaviour, it would be nice to get Jay Lan using bacct and xacct might not work well in this case. You'll end up with basic accounting data for the thread group leader (utime, stime. For delay accounting, we accumulate data of tasks that have exited the thread group in fill_tgid_exit(), that way when we sum up data, we continue storing data of all tasks that exited. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
