Report CPU usage in CFS Cgroup directories
Adds a cpu.usage file to the CFS cgroup that reports CPU usage in
milliseconds for that cgroup's tasks
This replaces the "example CPU Accounting CGroup subsystem" that
was merged into mainline last week.
Signed-off-by: Paul Menage <menage@google.com>
---
kernel/sched.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
Index: container-2.6.23-mm1/kernel/sched.c
===================================================================
--- container-2.6.23-mm1.orig/kernel/sched.c
+++ container-2.6.23-mm1/kernel/sched.c
@@ -7005,15 +7005,37 @@ static u64 cpu_shares_read_uint(struct c
return (u64) tg->shares;
}
-static struct cftype cpu_shares = {
- .name = "shares",
- .read_uint = cpu_shares_read_uint,
- .write_uint = cpu_shares_write_uint,
+static u64 cpu_usage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ int i;
+ u64 res = 0;
+ for_each_possible_cpu(i) {
+ unsigned long flags;
+ spin_lock_irqsave(&tg->cfs_rq[i]->rq->lock, flags);
+ res += tg->se[i]->sum_exec_runtime;
+ spin_unlock_irqrestore(&tg->cfs_rq[i]->rq->lock, flags);
+ }
+ /* Convert from ns to ms */
+ do_div(res, 1000000);
+ return res;
+}
+
+static struct cftype cpu_files[] = {
+ {
+ .name = "shares",
+ .read_uint = cpu_shares_read_uint,
+ .write_uint = cpu_shares_write_uint,
+ },
+ {
+ .name = "usage",
+ .read_uint = cpu_usage_read,
+ },
};
static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
{
- return cgroup_add_file(cont, ss, &cpu_shares);
+ return cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files));
}
struct cgroup_subsys cpu_cgroup_subsys = {
-
Is the lock absolutely required here? Hmm .. I hope the cgroup code prevents a task group from being destroyed while -- Regards, vatsa -
I'm not sure, I was hoping you or Ingo could comment on this. But some kind of locking seems to required at least on 32-bit platforms, since Good point - cgroups certainly prevents a cgroup itself from being freed while a control file is being read in an RCU section, and prevents a task group from being destroyed when that task group has been read via a task's cgroups pointer and the reader is still in an RCU section, but we need a generic protection for subsystem state objects being accessed via control files too. Using cgroup_mutex is certainly possible for now, although more heavy-weight than I'd like long term. Using css_get isn't the right approach, I think - we shouldn't be able to cause an rmdir to fail due to a concurrent read. Paul -
OK, the obvious solution is to use the same approach for subsystem state objects as we do for the struct cgroup itself - move the calls to the subsystem destroy methods to cgroup_diput. A control file dentry will keep alive the parent dir's dentry, which will keep alive the cgroup and (with this change) the subsystem state objects too. The only potential drawback that I can see is that an open fd on a cgroup directory or a control file will keep more memory alive than it would have done previously. Paul -
Well, most people who care about deletion will use the notify_on_release callback and retry. I think we have a good mechanism in place for deletion, so I would not worry about read delaying deletion. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
I'm not convinced this is true. Certainly the userspace approaches we're developing at Google don't (currently) use notify_on_release. Paul -
Well, without notify_on_release you can never be sure if a new task got added to the control group or if someone acquired a reference to it. I can't think of a safe way of removing control groups/cpusets without using notify_on_release. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
Using notify_on_release doesn't make any guarantees that the notification, or the action based on that notification, will occur before the cgroup becomes busy again. An rmdir() on a busy cgroup is perfectly safe, it just fails with -EBUSY. I'd just like to avoid making active control file fds trigger than failure mode. Paul -
I tend to agree abt 32-bit platforms requiring a lock to read the 64-bit sum_exec_runtime field. Ingo/Dmitry, what do you think? fs/proc/array.c:task_utime() is also buggy in that case. -- Regards, vatsa -
I think we also need the notion of load, like we have in cpu_acct.c Don't we need to do a css_get() on the cgrp to ensure that the cgroup -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
Yes, a notion of load would be good - but the "load" calculated by cpu_acct.c isn't all that useful yet - it's just a total of the CPU cycles used in the 10 second time interval prior to the current interval. Something that's more analagous to the real machine load See my other email. Yes, I think we need more here, but probably something more generic. Paul -
A simpler form of this would be just: spin_lock_irqsave(&cpu_rq(i)->lock, flags); -- Regards, vatsa -
It would be nice to split this into user and sys time at some point. We have also received request to provide idle time for a container/cgroup. I presume these will need to be provided as additional files (in addition to cpu.usage that is)? -- Regards, vatsa -
The semantics of "idle time" for a cgroup on a shared system seem a bit fuzzy. How would you define it? Suppose you have two cgroups that would each want to use, say, 55% of a CPU - technically they should each be regarded as having 45% idle time, but if they run on a the same CPU the chances are that they will both always have some processes on their runqueue due to contention with the other group. So how would you measure the difference between this and a cgroup that really is trying to use 100%? Paul -
No, not for a group. We could extend account_user_time() and I think the percentage of time when it didn't have any runnable task in Good point. I think we need to subtract out the time it was waiting on runqueue when calculating idle time. |------- . . . . . . ---------zzzzzzzzzzzz.......-----------| t0 t1 t2 t3 t4 t5 t6 ---- -> Running time .... -> Waiting time (to get on the cpu) zzzz -> Sleeping time (when it didnt want to run because of lack of tasks) So, in this case, idle time = (t4 - t3) / [ (t6 - t1) - (t2-t1) - (t5-t4) ? This idle time will be a per-cpu stat for every cgroup and needs to be consolidated across cpus into a single idle-stat number, just like how top does it. -- Regards, vatsa -
This would be an idle fraction, not an idle time. (seconds divided by seconds) It doesn't seem quite right to me that a cgroup's idle time metric be affected by the activity of other cgroups on the machine, but it's hard to come up with a way of measuring it that doesn't have this behaviour - which is why, in the absence of hard CPU partitioning, it's not clear to me how much use this would be. What would people be planning to use it for? Paul -
agreed, we need to be reporting idle time in (milli)seconds, although the requirement we had was to report it back in percentage. I guess the percentage figure can be derived from the raw idle time number. How about: idle time = t4-t3 (effectively sleep time) I don't see how the idle time metric defined above (t4-t3) can be affected by other cgroup activity, unless the execution pattern of one cgroup is dependent on the others. However the minute you tranlsate this idle time into a percentage wrt wall-clock time, then yes a cgroup's idle percentage can be affected by others. For idle percentage to be meaningfull, I would imagine that user-space tools will need to calculate it after discarding a group's I think primarily for systems management tools to report back various statistics about a OpenVZ/VServer-like container (just like top reports stats for a OS currently). Let me check if there are other uses envisoned for it. -- Regards, vatsa -
If the other cgroups are busier, and t1-t2 is longer, then the cgroup will get to the point where it's ready to sleep later in wallclock time, and t4-t3 will be shorter in absolute terms. If there were no other cgroups running, then presumably the sleep time would actually be the sum of those three periods. Even so, I guess you're right that t4-t3 is the most appropriate thing Sorry, I didn't mean "how will you report it to users?", I meant "what kinds of useful information will the users be able to get from it?" Paul -
