This patch implements some trivial statistics for the CPU controller.
As per our current requirements, we have decided to keep the stats tick based as opposed to using CFS precise accounting.
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: Dhaval Giani <dhaval@linux.vnet.ibm.com>
diff --git a/kernel/sched.c b/kernel/sched.c
index 8206fda..e2acf06 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -164,10 +164,38 @@ struct cfs_rq;
static LIST_HEAD(task_groups);
+#ifdef CONFIG_CGROUP_SCHED
+enum cpu_cgroup_stat_index {
+ CPU_CGROUP_STAT_UTIME, /* Usertime of the task group */
+ CPU_CGROUP_STAT_STIME, /* Kerneltime of the task group */
+
+ CPU_CGROUP_STAT_NSTATS,
+};
+
+struct cpu_cgroup_stat_cpu {
+ s64 count[CPU_CGROUP_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct cpu_cgroup_stat {
+ struct cpu_cgroup_stat_cpu cpustat[NR_CPUS];
+};
+
+/* Called under irq disable. */
+static void __cpu_cgroup_stat_add(struct cpu_cgroup_stat *stat,
+ enum cpu_cgroup_stat_index idx, int val)
+{
+ int cpu = smp_processor_id();
+
+ BUG_ON(!irqs_disabled());
+ stat->cpustat[cpu].count[idx] += val;
+}
+#endif
+
/* task group related information */
struct task_group {
#ifdef CONFIG_CGROUP_SCHED
struct cgroup_subsys_state css;
+ struct cpu_cgroup_stat stat;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3733,6 +3761,16 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
cpustat->nice = cputime64_add(cpustat->nice, tmp);
else
cpustat->user = cputime64_add(cpustat->user, tmp);
+
+ /* Charge the task's group */
+#ifdef CONFIG_CGROUP_SCHED
+ {
+ struct task_group *tg;
+ tg = task_group(p);
+ __cpu_cgroup_stat_add(&tg->stat, CPU_CGROUP_STAT_UTIME,
+ cputime_to_msecs(cputime));
+ }
+#endif
}
/*
@@ -3796,6 +3834,15 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
cpustat->idle = cputime64_add(cpustat->idle...u64? time does not go negative :) -- regards, Dhaval --
On Sunday 06 April 2008 01:10:41 am Dhaval Giani wrote: Just to keep things clearly separated. If you've not noticed, Right. But these stats are not only going to measure time. We need the same variables for measuring other stats as well. I'm not sure if we would encounter scheduler stats that would count negative. ok. How does 'value' look ? Hmmm.. You are right.This does not prevent concurrent updates on other CPUs from crossing a 32bit boundary. Am not sure how to do this in a safe way. I Thanks for the review. -- regards, Balaji Rao Dept. of Mechanical Engineering, National Institute of Technology Karnataka, India --
So many stats to steal code from,.. but you didn't :-( Look at mm/vmstat.c, that is a rather complete example. The trick to solving the above is to use per cpu deltas instead, the deltas can be machine word size and are thus always read in an atomic manner (provided they are also naturally aligned). --
Hi Peter, This wont work for time based statistics. At nsec granularity, a word can hold a time value of up to ~4s. I propose to solve this problem by using a lock to protect the statistics, but only on 32bit architectures. I'm not sure how good a solution this is, but that's the best I can think of ATM. -- regards, Balaji Rao Dept. of Mechanical Engineering, National Institute of Technology Karnataka, India --
Not needed, keep per cpu word deltas and fold into a global u64 counter while holding a lock. --
This implements a couple of basic statistics for the CPU resource controller.
v2->v3
-------
Proper locking while collecting stats. Thanks to Peter Zijlstra for suggesting
the delta approach.
This applies against 2.6.25-mm1
---
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
diff --git a/kernel/sched.c b/kernel/sched.c
index bbdc32a..5bda75a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -248,10 +248,51 @@ struct cfs_rq;
static LIST_HEAD(task_groups);
+#ifdef CONFIG_CGROUP_SCHED
+
+#define CPU_CGROUP_STAT_THRESHOLD 1 << 30
+
+enum cpu_cgroup_stat_index {
+ CPU_CGROUP_STAT_UTIME, /* Usertime of the task group */
+ CPU_CGROUP_STAT_STIME, /* Kerneltime of the task group */
+
+ CPU_CGROUP_STAT_NSTATS,
+};
+
+struct cpu_cgroup_stat_cpu {
+ s64 count[CPU_CGROUP_STAT_NSTATS];
+ u32 delta[CPU_CGROUP_STAT_NSTATS];
+} ____cacheline_aligned_in_smp;
+
+struct cpu_cgroup_stat {
+ struct cpu_cgroup_stat_cpu cpustat[NR_CPUS];
+ spinlock_t lock;
+};
+
+/* Called under irq disable. */
+static void __cpu_cgroup_stat_add(struct cpu_cgroup_stat *stat,
+ enum cpu_cgroup_stat_index idx, int val)
+{
+ int cpu = smp_processor_id();
+ unsigned long flags;
+
+ BUG_ON(!irqs_disabled());
+ stat->cpustat[cpu].delta[idx] += val;
+
+ if (stat->cpustat[cpu].delta[idx] > CPU_CGROUP_STAT_THRESHOLD) {
+ spin_lock_irqsave(&stat->lock, flags);
+ stat->cpustat[cpu].count[idx] += stat->cpustat[cpu].delta[idx];
+ stat->cpustat[cpu].delta[idx] = 0;
+ spin_unlock_irqrestore(&stat->lock, flags);
+ }
+}
+#endif
+
/* task group related information */
struct task_group {
#ifdef CONFIG_CGROUP_SCHED
struct cgroup_subsys_state css;
+ struct cpu_cgroup_stat stat;
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3837,6 +3878,16 @@ void account_user_time(struct task_struct *p, cputime_t cputime)
cpustat->nice = cputime64_add(cpustat->nice, tmp);
else
cpustat->user = cputime64_add(cpustat->user, tmp);
+
+...On Thu, 1 May 2008 23:11:06 +0530 And with NR_CPUS=1024, we're starting to talk serious pigginess. And there's one of these per task_group! Chances are we just won't be able That loop iterates 1024 times, under spin_lock_irqsave(). On a 2-way. Rethink, please... There are numerous reasons here for implementing the counters as dynamically-allocated, per-online-cpu things, with a cpu-hotplug notifier. All a bit of a hassle, but that's life. Anyway, forget all that. Did you consider using include/linux/percpu_counter.h? If so, what was wrong with it? Because it would be much better to fix per-cpu counters than to invent new stuff. --
On Friday 02 May 2008 02:30:26 am Andrew Morton wrote: <snip> Hi Andrew, No, I hadn't consider using the percpu_counters infrastructure. But today when I tried using it, I got an early exception.I guess its because I tried calling percpu_counter_init from within sched_init, which I perhaps shouldn't do, because percpu_counter_init expects cpu hotplug code to be initialized by then. Right ? Correct me if I'm wrong. How about we start collecting statistics at a later stage i.e, after percpu_counter becomes usable ? -- Warm Regards, Balaji Rao Dept. of Mechanical Engineering NITK --
On Sat, 3 May 2008 01:10:28 +0530 I don't see any reason why we cannot run percpu_counter_init() prior to running percpu_counter_startup(). And it is desirable that we be able to start using the percpu-counters quite early. Can you debug it a bit please? It's probably some silly little thing, It would be better to make the core infrastructure more robust, rather than working around problems it might have. It's rather nice that percpu_counters internally take care of cpu-hotplugging, and use cpu_online_map. I was amazed at how easily that was added. I still expect it to break somehow.. --
percpu_counter_init uses kmalloc to create percpu counters. This raises an early exception as kmem_cache is not initialized that early. It worked for me if we statically allocate memory for the counters. But its not at all a nice thing to do and I don't see another way to make it fit for early use. I'm beginning to run out of ideas! Why not do what I earlier suggested - begin collecting statistics once we are able to safely use percpu_counters ? This now seems to be the best alternative IMHO. -- Warm Regards, Balaji Rao Dept. of Mechanical Engineering NITK --
On Sat, 3 May 2008 04:17:03 +0530 whaa? kmalloc is ready to be used quite early in boot. It's a bit of a concern that the CPU resource controller is doing stuff before even kmalloc is ready to go. What's the call path here? Via cgroup_init_early()? Does it need to run I'd need to see the code. If we end up doing if (counters_are_ready) increment_counter(); all over then place then we need to think harder. Maybe we need a cgroup_init_late(), which can do memory allocations. If nothing actually needs to touch the counters before cgroup_init_late() runs then that might be OK. --
I would prefer to keep the stats logically separate. So something like
struct cpu_cgroup_stat_cpu {
u64 time[];
s64 some_other_stat;
}
and so on. (I am not sure, is there some advantage gained by using
I am going to answer that one when I am awake :-)
--
regards,
Dhaval
--This would break the generic nature of __cpu_cgroup_stat_add. Its not a nice thing in my opinion. -- regards, Balaji Rao Dept. of Mechanical Engineering, National Institute of Technology Karnataka, India --
I prefer keeping stats in the array as Balaji has done, it makes it easier to do batch processing on the stats. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
A quick question? Why? Everywhere in the CFS we use precise accounting, (even within the CPU controller). Doesn't make sense using ticks here then. Will review and respond soon. Thanks, -- regards, Dhaval --
Using CFS precise accounting has a cost (see task_utime and task_stime). Currently our requirements don't call for precise statistics. -- regards, Balaji Rao Dept. of Mechanical Engineering, National Institute of Technology Karnataka, India --
| Netfilter kernel module | 8 hours ago | Linux kernel |
| serial driver xmit problem | 10 hours ago | Linux kernel |
| Why Windows is better than Linux | 10 hours ago | Linux general |
| How can I see my kernel messages in vt12? | 17 hours ago | Linux kernel |
| Grub | 1 day ago | Linux general |
| vmalloc_fault handling in x86_64 | 1 day ago | Linux kernel |
| epoll_wait()ing on epoll FD | 1 day ago | Linux kernel |
| Framebuffer in x86_64 causes problems to multiseat | 1 day ago | Linux kernel |
| Difference between 2.4 and 2.6 regarding thread creation | 1 day ago | Linux general |
| Compiling gfs2 on kernel 2.6.27 | 2 days ago | Linux kernel |
