From: Michael Holzheu <holzheu@linux.vnet.ibm.com> CHANGE HISTORY OF THIS PATCH ---------------------------- Version 2 --------- * Add cumulative dead thread CPU times to taskstats * Introduce parallel accounting process hierarchy (based on discussions with Oleg Nesterov and Roland McGrath) DESCRIPTION ----------- Currently the cumulative time accounting in Linux has two major drawbacks for tools like the top command: * Due to POSIX POSIX.1-2001, the CPU time of processes is not accounted to the cumulative time of the parents, if the parents ignore SIGCHLD or have set SA_NOCLDWAIT. This behaviour has the major drawback that it is not possible to calculate all consumed CPU time of a system by looking at the current tasks. CPU time can be lost. * When a parent process dies, its children get the init process as new parent. For accounting this is suboptimal, because then init gets the CPU time of the tasks. For accounting it would be better, if the CPU time is passed along the relationship tree using the cumulative time counters as would have happened if the child had died before the parent. The main benefit of that approach is that between two tasks snapshots it is always clear which parent got the CPU time of dead tasks. Otherwise there are situations, where we can't determine if either init or an older relative of a dead task has inherited the CPU time. Another benefit would be that then e.g. it would be possible to look at the login shell process cumulative times to get all CPU time that has been consumed by it's children, grandchildren, etc. This would allow accounting without the need of exit events for all dead processes. Note that this has a small fault because processes can use daemonize to detach themselfes from their parents. This patch adds a new set of cumulative time counters. We then have two cumulative counter sets: * cdata_wait: Traditional cumulative time used e.g. by getrusage. * cdata_acct: Cumulative ...
First of all, let me repeat, I am not going to discuss whether we need these changes or not, I do not know even if I understand your motivation. My only concern is correctness, but Michael, I really think this patch does too many different things. I forgot the details of our previous discussion, and I got lost trying to understand the new version. I already asked you to split these changes, perhaps you can do this? Say, bacct_add_tsk() looks overcomplicated, the change in copy_process() shouldn't introduce the new CLONE_THREAD check, not sure I understand why release_task() was chosen for reparenting, other nits... But it is not easy to discuss these completely different things looking at the single patch. Imho, it would be much better if you make a separate patch which adds acct_parent/etc and implements the parallel hierarchy. This also makes sense because it touches the core kernel. Personally I think that even "struct cdata" + __account_ctime() helper needs a separate patch, and perhaps this change makes sense by itself as cleanup. And this way the "trivial" changes (like the changes in k_getrusage) won't distract from the functional changes. The final change should introduce cdata_acct and actually implement the complete cumulative accounting. At least, please distill parallel accounting process hierarchy. We do not want bugs in this area ;) What do you think? Oleg. --
On Sat, 13 Nov 2010 19:38:10 +0100 Hmm, is the motivation of whole patch series unclear or just the details in this one? What we want is a low-overhead tool that precisely shows where the cpu spent its time (or didn't because of steal time). The granularity target is tenths of microseconds, something that should be possible with decent hardware. What makes life hard is the corner case of a task that has been reparented You review effort to improve correctness is very much appreciated. We'll try to split the patches to make review easier. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
On Mon, 15 Nov 2010 17:03:56 +0100 Is that a trick question? Why do we have tools like "top"? Or process accounting? The point is that the quality of the numbers we get right now is rather bad, the overhead of scanning /proc is horrendous and the 10ms granularity is rather coarse. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
But you're not just replacing top, you're adding all kinds of new accounting crap all over the place. --
On Mon, 15 Nov 2010 18:51:57 +0100 We DO replace top. Patch #7 of 7. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
You _also_ replace top, but its not by far the only thing you do. If you simply replaced top you wouldn't need to add tons of extra accounting, would you? --
On Mon, 15 Nov 2010 19:10:06 +0100 There are basically two things we want to accomplish: 1) Make top faster by replacing the /proc based data gathering with taskstats. To really make it go fast with taskstats we filter with last_depart to read only tasks that have changed since the last snapshot. 2) Make top more precise. That is where all of the extra accounting comes into play. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
Hello Oleg, Sorry, if I was not clear enough with my descriptions. Let me try to describe my motivation again: The cumulative accounting patch implements an infrastructure that makes it possible to collect all CPU time usage between two task snapshots without using task exit events. The main idea is to show cumulative CPU time fields for each task in the top command that contain the CPU time of children that died in the last interval. Example 1 (simple case): A "make" process forked several gcc processes after snapshot 1 and all of them exited before snapshot 2. We subtract the cumulative CPU times of "make" of snapshot 2 from the cumulative times of "make" of snapshot 1. The result will be the consumed CPU time of the dead gcc processes in the last interval. The value is that we can see in top that "make" is "responsible" for this CPU time. Example 2: We have the gcc processes in snapshot 1 but not in snapshot 2. Then the top command has to find the nearest relative (e.g. the parent process) that is still alive in snapshot 2, create the delta of the cumulative time for this process and subtract the CPU time of the gcc processes of snapshot 1. This gives you the CPU time that was consumed by the gcc processes between snapshot 1 and 2. With your help we identified two problems that make this approach impossible or at least not exact with the current Linux implementation: 1. Cumulative CPU time counters are not complete (SA_NOCLDWAIT) 2. Because of reparent to init, there are situations where it is not clear to which tasks the CPU time of dead tasks between two snapshots has been accounted. This is a problem for example 2. The patch tries to solve the problem by adding a second set of cumulative data that contains all CPU time of dead children and adds the parallel hierarchy to make it unambiguous which parent got the CPU time of dead tasks (needed for example 2). I hope that you understand now the value and the motivation of fixing the two problems. I ...
Sorry for delay. Yes, I see. But I must admit, _personally_ I am not sure this problem worth the trouble. And, I already mentioned daemonize() case, IOW I am not sure it is _always_ right to choose exiting_parent->parent for accounting. To me, this can be equally confusing. A user sees the running deamon with ppid = 1, then this daemon exits and top reports the "unrelated" process as cpu consumer. But once again. I am _not_ against this patch. I never know when it comes to new features. Certainly you know better if this suits top. What I actually meant is: dear CC list, please look at this change and comment ;) Oleg. --
Hello Oleg, I think you are right... For that function, the introduced overhead, additional code and especially the possible confusion with two process hierarchies is not worth the effort. Maybe we have a chance to solve the reparent problem by introducing task exit event filters (e.g. give me all exit events for processes that have init as parent). So I will send no more patches with the parallel hierarchy: Good news for you :-) But for the second problem with the forgotten CPU time, I would like send you a new patch set, separated as you have requested. Although I personally think that also there we probably have no good chances to get them accepted upstream, because the signal_struct size is increased and some cycles are added to the task exit processing. It would be nice, if you find some time to look at the patches, but no hurry! Michael --
Hello Oleg, I want to establish the new hierarchy when a new process is forked and not for new threads, therefore the check for CLONE_THREAD in copy_process(). I do the reparenting with reparent_acct() when a process So you want to have the following three patches: * Introduce "struct cdata" + __account_ctime() (no functional change) * Add cdata_acct accounting + parallel accounting hierarchy * Add taskstats interface to export the data to userspace Correct? Michael --
Hi Michael, I can't read this discussion today, will try to do tomorrow and reply. Just one note, Something like this (if possible). But my main concern is parallel hierarchy itself. So, if you ask me, I'd prefer to see a separate patch which only adds acct_parent/etc and implements this reparenting, even without cdata_acct accounting. Again, if you think this is possible. I understand that it is not easy to maintain this series, Oleg. --
Yes, but copy_process() already checks CLONE_THREAD many times. No And it is not clear to me why release_task() is better than exit_notify(). That said, perhaps I'll understand this reading the next version. That is why I asked to split. Oleg. --
