Re: [RFC][PATCH v2 5/7] taskstats: Improve cumulative CPU time accounting

Previous thread: [RFC][PATCH v2 6/7] taskstats: Fix accounting for non-leader thread exec by Michael Holzheu on Thursday, November 11, 2010 - 10:03 am. (1 message)

Next thread: [RFC][PATCH v2 0/7] taskstats: Enhancements for precise process accounting (version 2) by Michael Holzheu on Thursday, November 11, 2010 - 10:03 am. (2 messages)
From: Michael Holzheu
Date: Thursday, November 11, 2010 - 10:03 am

From: Michael Holzheu <holzheu@linux.vnet.ibm.com>

CHANGE HISTORY OF THIS PATCH
----------------------------
Version 2
---------
* Add cumulative dead thread CPU times to taskstats
* Introduce parallel accounting process hierarchy (based on discussions
  with Oleg Nesterov and Roland McGrath)

DESCRIPTION
-----------
Currently the cumulative time accounting in Linux has two major drawbacks
for tools like the top command:

* Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted
  to the cumulative time of the parents, if the parents ignore SIGCHLD
  or have set SA_NOCLDWAIT. This behaviour has the major drawback that
  it is not possible to calculate all consumed CPU time of a system by
  looking at the current tasks. CPU time can be lost.

* When a parent process dies, its children get the init process as
  new parent. For accounting this is suboptimal, because then init
  gets the CPU time of the tasks. For accounting it would be better,
  if the CPU time is passed along the relationship tree using the
  cumulative time counters as would have happened if the child had died
  before the parent. The main benefit of that approach is that between
  two tasks snapshots it is always clear which parent got the CPU time
  of dead tasks. Otherwise there are situations, where we can't
  determine if either init or an older relative of a dead task has
  inherited the CPU time.

  Another benefit would be that then e.g. it would be possible to look
  at the login shell process cumulative times to get all CPU time that has
  been consumed by it's children, grandchildren, etc. This would allow
  accounting without the need of exit events for all dead processes.
  Note that this has a small fault because processes can use
  daemonize to detach themselfes from their parents.

This patch adds a new set of cumulative time counters. We then have two
cumulative counter sets:

* cdata_wait: Traditional cumulative time used e.g. by getrusage.
* cdata_acct: Cumulative ...
From: Oleg Nesterov
Date: Saturday, November 13, 2010 - 11:38 am

First of all, let me repeat, I am not going to discuss whether we need
these changes or not, I do not know even if I understand your motivation.

My only concern is correctness, but


Michael, I really think this patch does too many different things.
I forgot the details of our previous discussion, and I got lost
trying to understand the new version.

I already asked you to split these changes, perhaps you can do this?
Say, bacct_add_tsk() looks overcomplicated, the change in copy_process()
shouldn't introduce the new CLONE_THREAD check, not sure I understand
why release_task() was chosen for reparenting, other nits...

But it is not easy to discuss these completely different things
looking at the single patch.

Imho, it would be much better if you make a separate patch which
adds acct_parent/etc and implements the parallel hierarchy. This
also makes sense because it touches the core kernel.

Personally I think that even "struct cdata" + __account_ctime() helper
needs a separate patch, and perhaps this change makes sense by itself
as cleanup. And this way the "trivial" changes (like the changes in
k_getrusage) won't distract from the functional changes.

The final change should introduce cdata_acct and actually implement
the complete cumulative accounting.

At least, please distill parallel accounting process hierarchy.
We do not want bugs in this area ;)

What do you think?

Oleg.

--

From: Martin Schwidefsky
Date: Monday, November 15, 2010 - 8:55 am

On Sat, 13 Nov 2010 19:38:10 +0100

Hmm, is the motivation of whole patch series unclear or just the details
in this one? What we want is a low-overhead tool that precisely shows
where the cpu spent its time (or didn't because of steal time). The
granularity target is tenths of microseconds, something that should be
possible with decent hardware. 
What makes life hard is the corner case of a task that has been reparented

You review effort to improve correctness is very much appreciated. We'll
try to split the patches to make review easier.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Peter Zijlstra
Date: Monday, November 15, 2010 - 9:03 am

To what purpose? 

--

From: Martin Schwidefsky
Date: Monday, November 15, 2010 - 10:49 am

On Mon, 15 Nov 2010 17:03:56 +0100
 
Is that a trick question? Why do we have tools like "top"? Or process
accounting? The point is that the quality of the numbers we get right
now is rather bad, the overhead of scanning /proc is horrendous and
the 10ms granularity is rather coarse.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Peter Zijlstra
Date: Monday, November 15, 2010 - 10:51 am

But you're not just replacing top, you're adding all kinds of new
accounting crap all over the place.
--

From: Martin Schwidefsky
Date: Monday, November 15, 2010 - 11:00 am

On Mon, 15 Nov 2010 18:51:57 +0100

We DO replace top. Patch #7 of 7.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Peter Zijlstra
Date: Monday, November 15, 2010 - 11:10 am

You _also_ replace top, but its not by far the only thing you do. If you
simply replaced top you wouldn't need to add tons of extra accounting,
would you?
--

From: Martin Schwidefsky
Date: Tuesday, November 16, 2010 - 1:54 am

On Mon, 15 Nov 2010 19:10:06 +0100

There are basically two things we want to accomplish:
1) Make top faster by replacing the /proc based data gathering with taskstats.
   To really make it go fast with taskstats we filter with last_depart to
   read only tasks that have changed since the last snapshot.
2) Make top more precise. That is where all of the extra accounting comes
   into play.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Michael Holzheu
Date: Tuesday, November 16, 2010 - 9:57 am

Hello Oleg,


Sorry, if I was not clear enough with my descriptions. Let me try to
describe my motivation again:

The cumulative accounting patch implements an infrastructure that makes
it possible to collect all CPU time usage between two task snapshots
without using task exit events. The main idea is to show cumulative CPU
time fields for each task in the top command that contain the CPU time
of children that died in the last interval.

Example 1 (simple case):
A "make" process forked several gcc processes after snapshot 1 and all
of them exited before snapshot 2. We subtract the cumulative CPU times
of "make" of snapshot 2 from the cumulative times of "make" of snapshot
1. The result will be the consumed CPU time of the dead gcc processes in
the last interval. The value is that we can see in top that "make" is
"responsible" for this CPU time.

Example 2:
We have the gcc processes in snapshot 1 but not in snapshot 2. Then the
top command has to find the nearest relative (e.g. the parent process)
that is still alive in snapshot 2, create the delta of the cumulative
time for this process and subtract the CPU time of the gcc processes of
snapshot 1. This gives you the CPU time that was consumed by the gcc
processes between snapshot 1 and 2.

With your help we identified two problems that make this approach
impossible or at least not exact with the current Linux implementation:
1. Cumulative CPU time counters are not complete (SA_NOCLDWAIT)
2. Because of reparent to init, there are situations where it is
   not clear to which tasks the CPU time of dead tasks between
   two snapshots has been accounted. This is a problem for example 2.

The patch tries to solve the problem by adding a second set of
cumulative data that contains all CPU time of dead children and adds the
parallel hierarchy to make it unambiguous which parent got the CPU time
of dead tasks (needed for example 2).

I hope that you understand now the value and the motivation of fixing
the two problems.

I ...
From: Oleg Nesterov
Date: Thursday, November 18, 2010 - 10:10 am

Sorry for delay.



Yes, I see.

But I must admit, _personally_ I am not sure this problem worth the
trouble. And, I already mentioned daemonize() case, IOW I am not sure
it is _always_ right to choose exiting_parent->parent for accounting.
To me, this can be equally confusing. A user sees the running deamon
with ppid = 1, then this daemon exits and top reports the "unrelated"
process as cpu consumer.


But once again. I am _not_ against this patch. I never know when it
comes to new features. Certainly you know better if this suits top.

What I actually meant is: dear CC list, please look at this change
and comment ;)

Oleg.

--

From: Michael Holzheu
Date: Friday, November 19, 2010 - 12:46 pm

Hello Oleg,


I think you are right...

For that function, the introduced overhead, additional code and
especially the possible confusion with two process hierarchies is not
worth the effort. Maybe we have a chance to solve the reparent problem
by introducing task exit event filters (e.g. give me all exit events for
processes that have init as parent).

So I will send no more patches with the parallel hierarchy:
Good news for you :-)

But for the second problem with the forgotten CPU time, I would like
send you a new patch set, separated as you have requested. Although I
personally think that also there we probably have no good chances to get
them accepted upstream, because the signal_struct size is increased and
some cycles are added to the task exit processing.

It would be nice, if you find some time to look at the patches,
but no hurry!

Michael





--

From: Michael Holzheu
Date: Tuesday, November 16, 2010 - 10:34 am

Hello Oleg,


I want to establish the new hierarchy when a new process is forked and
not for new threads, therefore the check for CLONE_THREAD in
copy_process(). I do the reparenting with reparent_acct() when a process

So you want to have the following three patches:
* Introduce "struct cdata" + __account_ctime() (no functional change)
* Add cdata_acct accounting + parallel accounting hierarchy
* Add taskstats interface to export the data to userspace

Correct?

Michael

--

From: Oleg Nesterov
Date: Tuesday, November 16, 2010 - 10:50 am

Hi Michael,

I can't read this discussion today, will try to do tomorrow and reply.
Just one note,


Something like this (if possible).

But my main concern is parallel hierarchy itself. So, if you ask me,
I'd prefer to see a separate patch which only adds acct_parent/etc
and implements this reparenting, even without cdata_acct accounting.

Again, if you think this is possible. I understand that it is not
easy to maintain this series,

Oleg.

--

From: Oleg Nesterov
Date: Thursday, November 18, 2010 - 9:34 am

Yes, but copy_process() already checks CLONE_THREAD many times. No

And it is not clear to me why release_task() is better than
exit_notify().


That said, perhaps I'll understand this reading the next version.
That is why I asked to split.

Oleg.

--

Previous thread: [RFC][PATCH v2 6/7] taskstats: Fix accounting for non-leader thread exec by Michael Holzheu on Thursday, November 11, 2010 - 10:03 am. (1 message)

Next thread: [RFC][PATCH v2 0/7] taskstats: Enhancements for precise process accounting (version 2) by Michael Holzheu on Thursday, November 11, 2010 - 10:03 am. (2 messages)