Earlier version of this patchset here - lkml subject:
"[RFC PATCH 0/4] Finer granularity and task/cgroup irq time accounting"
http://marc.info/?l=linux-kernel&m=127474630527689&w=2

Currently, the softirq and hardirq time reporting is only done at the CPU level. There are use cases where reporting this time against tasks, task groups or cgroups will be useful for the user/administrator in terms of resource planning and utilization charging. Also, as the accounting is already done at the CPU level, reporting the same at the task level does not add any significant computational overhead other than task level storage (patch 1).

The softirq/hardirq statistics are commonly based on tick based sampling, though some archs have CONFIG_VIRT_CPU_ACCOUNTING based fine granularity accounting. Having a similar mechanism to get fine granularity accounting on x86 will be a major challenge, given the state of TSC reliability on various platforms and also the overhead it may add in common paths like syscall entry and exit.

An alternative is to have a generic (sched_clock based) and configurable fine-granularity accounting of si and hi time which can be reported over the /proc/<pid>/stat API (patch 2).

Patches 3 and 4 export this info at the cgroup level.

Changes since the original RFC -
* General code cleanup and documentation for new APIs added.
* Handle the notsc option by having a runtime flag sched_clock_irqtime, along with the original CONFIG_IRQ_TIME_ACCOUNTING option. Peter Zijlstra suggested the use of an alternate-instruction kind of mechanism here. But that is mostly x86 specific and not generic, while the irq time accounting code is mostly generic.
* Did performance runs on various systems with tsc based sched_clock - both with and without sched_clock_stable - running tbench, dbench, SPECjbb and did not notice any measurable slowness when this option is enabled.

Todo -
* Peter Zijlstra suggested modifying scale_rt_power to account for irq time. I have a patch for that ...
Currently, the kernel does not account softirq and hardirq times at the task level. There is irq time info in kstat_cpu which is accumulated at the cpu level. Without the task level information, the non-irq run time of task(s) would have to be guessed based on their exec time and the CPU on which they were running recently, assuming that the CPU irq time reported is spread across all the tasks running there. And this guess can be wildly off the mark.

Sample case, considering just the softirq:
If there are varied workloads running on a CPU, say a CPU bound task (loop) and a network IO bound task (nc) along with the network softirq load, there is no way for the administrator/user to know the non-irq runtime of each of these tasks. The only information available is the total runtime for each of the tasks and the kstat_cpu softirq time for the CPU. In this example, considering a 10 second sample, both loop and nc would have a total run time of ~5s. And the kstat_cpu softirq increase on this cpu was 355 (~3.5s). So, all the information the user gets is that both the tasks are running for roughly the same amount of time and softirq is around 35%. As a result the user may conclude that the irq overhead for both tasks is equal (1.75s) and the non-irq runtime of both the tasks is around ~3.25s.

Yes. There is another factor of system and user time reported for these tasks that I am ignoring as that is tough to correlate with irq time, in cases where the tasks have significant non-irq system time.

This change adds tracking of softirq time on each task and task group. This information is exported in /proc/<pid>/stat.

So, the user can get info like below, looking at exec_time and si_time in the appropriate /proc/<pid>/stat.

(Taken for a 10s interval)
task exec_time softirqtime (in USER_HZ)
    (loop)         (nc)
   505  0        500  359
   502  1        501  363
   503  0        502  354
   504  0        499  359
   503  3        500  360

With this, the user can get the non-irq run time as 5s and ~1.45s for loop and nc, respectively.

Signed-off-by: ...
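The arithmetic in the commit message above can be sketched in user space. This is only an illustration, not code from the patch: the helper names are hypothetical, and it assumes USER_HZ is 100 (one tick = 10ms), which is common but should be checked with sysconf(_SC_CLK_TCK) on a real system.

```c
#include <assert.h>

/* Assumption for this sketch: 100 ticks per second (USER_HZ). */
#define SKETCH_USER_HZ 100

/*
 * Hypothetical helper: non-irq run time of a task in ticks, i.e. the
 * exec_time reported for the task minus the softirq time charged to it.
 */
long non_irq_ticks(long exec_ticks, long si_ticks)
{
	/* softirq time charged to a task is part of its exec_time */
	assert(si_ticks >= 0 && si_ticks <= exec_ticks);
	return exec_ticks - si_ticks;
}

/* Convert USER_HZ ticks to seconds. */
double ticks_to_seconds(long ticks)
{
	return (double)ticks / SKETCH_USER_HZ;
}
```

For the first sample row, non_irq_ticks(505, 0) gives 505 ticks for loop (about 5s) and non_irq_ticks(500, 359) gives 141 ticks for nc (about 1.4s), which is far from the equal split a user would guess from the CPU level numbers alone.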
s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING, which does fine granularity accounting of user, system, hardirq and softirq times. Adding that option on archs like x86 may be challenging however, given the state of TSC reliability on various platforms and also the overhead it may add in syscall entry and exit.

Instead, add an option that only does finer accounting of hardirq and softirq, providing precise irq times (instead of timer tick based samples). This accounting is added with a new config option CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not interested in paying the perf penalty. And this accounting is based on sched_clock, so other archs may find it useful as well.

Note that the kstat_cpu irq times are still based on tick based samples and only the task irq times report this new finer granularity irq time. The reason is that the kstat irq also includes system time, and changing only irq time to have finer granularity can result in inconsistencies like the sum of kstat times adding up to more than 100%.

Continuing with the example from the previous patch, without finer granularity accounting, exec_time and si_time in 10s intervals were (appropriate fields of /proc/<pid>/stat)

    (loop)         (nc)
   505  0        500  359
   502  1        501  363
   503  0        502  354
   504  0        499  359
   503  3        500  360

And with finer granularity accounting they were

    (loop)         (nc)
   503  9        502  301
   502  8        502  303
   502  9        501  302
   502  8        502  302
   503  9        501  302

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
arch/ia64/include/asm/system.h | 4 --
arch/powerpc/include/asm/system.h | 4 --
arch/s390/include/asm/system.h | 1 -
arch/x86/Kconfig | 11 +++++++
arch/x86/kernel/tsc.c | 2 +
fs/proc/array.c | 4 +-
include/linux/hardirq.h | 15 +++++++++-
include/linux/sched.h | 13 ++++++++
kernel/sched.c | 59 +++++++++++++++++++++++++++++++++++-
9 files ...
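The idea of timestamp based irq accounting described above can be sketched as follows. This is a user-space illustration under stated assumptions, not the code from the patch (the diff is truncated here): the struct and function names are hypothetical, timestamps are passed in explicitly where the kernel would read sched_clock(), and nested interrupts are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator for one task; all names are illustrative. */
struct irqtime_sketch {
	uint64_t hi_ns;     /* accumulated hardirq time */
	uint64_t si_ns;     /* accumulated softirq time */
	uint64_t enter_ns;  /* timestamp taken at irq entry */
	int in_irq;         /* nonzero between entry and exit */
};

enum irq_kind_sketch { SKETCH_HARDIRQ, SKETCH_SOFTIRQ };

/* Record the entry timestamp (the kernel would read its clock here). */
void sketch_irq_enter(struct irqtime_sketch *t, uint64_t now_ns)
{
	assert(!t->in_irq);  /* nesting is not modeled in this sketch */
	t->enter_ns = now_ns;
	t->in_irq = 1;
}

/* Charge the elapsed time since entry to the hardirq or softirq bucket. */
void sketch_irq_exit(struct irqtime_sketch *t, enum irq_kind_sketch kind,
		     uint64_t now_ns)
{
	uint64_t delta;

	assert(t->in_irq && now_ns >= t->enter_ns);
	delta = now_ns - t->enter_ns;
	if (kind == SKETCH_HARDIRQ)
		t->hi_ns += delta;
	else
		t->si_ns += delta;
	t->in_irq = 0;
}
```

Unlike tick sampling, which only notices an interrupt if a timer tick happens to land inside it, every entry/exit pair contributes its exact duration, at the cost of two clock reads per interrupt.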
Generalize cpuacct usage, making it easier to add new stats in the following
patch.
Also adds alloc_percpu_array() interface in percpu.h
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
include/linux/percpu.h | 4 ++++
kernel/sched.c | 39 ++++++++++++++++++++++++++-------------
kernel/sched_fair.c | 2 +-
kernel/sched_rt.c | 2 +-
4 files changed, 32 insertions(+), 15 deletions(-)
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index d3a38d6..216f96a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -167,6 +167,10 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
#define alloc_percpu(type) \
(typeof(type) __percpu *)__alloc_percpu(sizeof(type), __alignof__(type))
+#define alloc_percpu_array(type, size) \
+ (typeof(type) __percpu *)__alloc_percpu(sizeof(type) * size, \
+ __alignof__(type))
+
/*
* Optional methods for optimized non-lvalue per-cpu variable access.
*
diff --git a/kernel/sched.c b/kernel/sched.c
index f167fbb..c12c8ea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1396,12 +1396,20 @@ enum cpuacct_stat_index {
CPUACCT_STAT_NSTATS,
};
+enum cpuacct_charge_index {
+ CPUACCT_CHARGE_USAGE, /* ... execution time */
+
+ CPUACCT_CHARGE_NCHARGES,
+};
+
#ifdef CONFIG_CGROUP_CPUACCT
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+static void cpuacct_charge(struct task_struct *tsk,
+ enum cpuacct_charge_index idx, u64 cputime);
static void cpuacct_update_stats(struct task_struct *tsk,
enum cpuacct_stat_index idx, cputime_t val);
#else
-static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+static inline void cpuacct_charge(struct task_struct *tsk,
+ enum cpuacct_charge_index idx, u64 cputime) {}
static inline void cpuacct_update_stats(struct task_struct *tsk,
enum cpuacct_stat_index idx, cputime_t val) {}
#endif
@@ -8661,7 +8669,7 @@ struct cgroup_subsys cpu_cgroup_subsys = {
/* track cpu usage of a ...

Adds hi_time, si_time, hi_time_percpu and si_time_percpu info in cpuacct
cgroup.
The info will be fine granularity timings when either
CONFIG_IRQ_TIME_ACCOUNTING or CONFIG_VIRT_CPU_ACCOUNTING is enabled.
Otherwise the info will be based on tick samples.
Looked at adding this under cpuacct.stat. But, this information is useful
to the administrator in percpu format, so that any hi or si activity
on a particular CPU can be noted and some resource reallocation
(move the irq away, assign a different CPU to this cgroup, etc)
can be done based on that info.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
Documentation/cgroups/cpuacct.txt | 5 +++
kernel/sched.c | 73 +++++++++++++++++++++++++++++++------
2 files changed, 66 insertions(+), 12 deletions(-)
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 8b93094..817435e 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -48,3 +48,8 @@ system times. This has two side effects:
against concurrent writes.
- It is possible to see slightly outdated values for user and system times
due to the batch processing nature of percpu_counter.
+
+cpuacct.hi_time and cpuacct.si_time provide information about the hardirq
+and softirq processing time that was accounted to this cgroup. There are also
+percpu variants of hi_time and si_time that split the info at the percpu level.
+All these times are in USER_HZ units.
diff --git a/kernel/sched.c b/kernel/sched.c
index c12c8ea..7198041 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1398,6 +1398,8 @@ enum cpuacct_stat_index {
enum cpuacct_charge_index {
CPUACCT_CHARGE_USAGE, /* ... execution time */
+ CPUACCT_CHARGE_SI_TIME, /* ... softirq time */
+ CPUACCT_CHARGE_HI_TIME, /* ... hardirq time */
CPUACCT_CHARGE_NCHARGES,
};
@@ -3226,9 +3228,11 @@ void enable_sched_clock_irqtime(void)
#endif
#if defined(CONFIG_VIRT_CPU_ACCOUNTING)
-static void ...

On Mon, 19 Jul 2010 16:57:11 -0700

I never understood why the softirq and hardirq time gets accounted to a task at all. Why is it that the poor task that is running gets charged with the cpu time of an interrupt that has nothing to do with the task? I consider this to be a bug, and now this gets formalized in the

To get fine granular accounting for interrupts you need to do a sched_clock call on irq entry and another one on irq exit. Isn't that too expensive on an x86 system? (I do think this is a good idea but still there is the worry about the overhead).

-- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
On Tue, Jul 20, 2010 at 12:55 AM, Martin Schwidefsky

Agree that this is a bug. I started by looking at resolving that. But it was not exactly easy. Ideally we want irq times to be charged to the right task as much as possible. With things like network rcv softirq for example, there is a task that is going to consume the packet eventually, and that task should be charged. If we can't find a suitable match we may have to charge it to some system thread. Things like threaded interrupts will mitigate this problem a bit. But until we have a good enough solution, this bug will be around with us.

This change takes a small step, giving a hint about this to the user/administrator, who can take some corrective action based on it. Next step is to give CFQ scheduler some info about this and I am working on a patch for that. That will help in load balancing decisions, with an irq heavy CPU not trying to get equal weightage as other CPUs.

I don't think these interfaces are binding in any way. If and when we have tasks not being charged for irq, we can simply report "0" in these interfaces (there is some precedent for this in

On x86: Yes. Overhead is a potential problem. That's the reason I had this inside a CONFIG option. But I have tested this with a few workloads on different systems released in the past two years and I did not see any measurable overhead. Note that this is used only when sched_clock is based off of TSC and not when it is based on jiffies. The sched_clock overhead I measured on different platforms was in the 30-150 cycles range, which probably isn't going to be highly visible in generic workloads.

Archs like s390/powerpc/ia64 already do this kind of accounting with VIRT_CPU_ACCOUNTING. So, this patch will give them task and cgroup level info free of charge (other than potential bugs with this code change :-)).

Thanks, Venki --
On Tue, 20 Jul 2010 09:55:29 -0700

Yes, fixing that behavior will be tough. Just consider a standard page cache I/O that gets merged with other I/O. You would need to "split" the interrupt time for a block I/O among the processes that benefit from it. An added twist is that there can be multiple processes that require the page. Split the time even more to the different requesters of a page? Then the order in which the requests come in suddenly gets important. Or consider the IP packets in a network buffer, split the interrupt time to the recipients? The list goes on and on, my guess is that it will be next to impossible to do it right. If the current situation is wrong because the irq and softirq system time gets misaccounted and the "correct" solution is impossible the only thing left to do is to stop accounting irq and

That makes sense to me, with a working TSC the overhead should be

Well, the task and cgroup information is there but what does it really tell me? As long as the irq & softirq time can be caused by any other process I don't see the value of this incorrect data point.

-- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
On Thu, Jul 22, 2010 at 4:12 AM, Martin Schwidefsky

The data point will be correct. How it gets used is a different question. This interface will be useful for an Alert/Paranoid/Annoyed user/admin who sees that the job exec_time is high but it is not doing any useful work. With this additional info, he can probably choose to move the job off to a different system. The user probably knows more about the job characteristics and whether it is rightly or wrongly being charged. Say, one task in the task group being charged for another task in the same task group is probably OK as well. So, the user can look at this at a different granularity than the kernel can.

Thanks, Venki --
I'm very sympathetic with Martin's POV. irq/softirq times per task don't really make sense. In the case you provide above the solution would be to subtract these times from the task execution time, not break it out. In that case he would see his task not do much, and end up with the same action list. --
cgroup level info does make sense, assuming that tasks that share the costs being mentioned here belong to the same cgroup. Though the data is calculated per task, when accumulated per cgroup, it should be close to being correct. -- Three Cheers, Balbir --
I don't think that's a valid assumption. If its not true for tasks, then its not true for groups of tasks either. It might be slightly less wrong due to the larger number of entities reducing the error bounds, but its still wrong in principle. The whole attribution mess can only be solved by actually splitting out the entries that do work, like per-cgroup workqueue threads and similar things. System wide entities like IRQs are very hard to attribute correctly like Martin already argued, and I don't think its worth doing. --
The point is for containers it is more likely to give the right answer

I see Martin's view point, is the suggestion then that we amortize these costs across all tasks?

-- Three Cheers, Balbir --
I'm still not sure what you want them for, but if its for wanting to know wth the system is up to, simply account them on their own, and not include them in any task stats. That is, keep the existing hi/si interface and improve upon that, but also subtract those times from the task execution times. That way, if a cpu is like 80% hogged by IRQ action, you'll not see a 100% busy task, but only a 20%. At that point you can also feed the IRQ time back into sched_rt_avg_update() (which strictly speaking isn't rt but !fair), and the load-balancer will automagically try and move tasks away from that cpu. If you really want to account (and possibly control) all the work belonging to a particular group you'll have to make sure work does indeed stay within the group -- which is where per-cgroup workqueue threads and per-cgroup softirq threads etc. come into play. Lumping all work together and then trying to extract something again is silly. And hardirq time really is system time, not cgroup or task time. --
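The subtraction suggested above can be sketched as follows. This is a user-space illustration only, with hypothetical names; it does not show the scheduler code or the sched_rt_avg_update() feedback mentioned in the mail, just the arithmetic of keeping irq time out of the task's execution time.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: time to charge to a task for one accounting
 * window, i.e. the elapsed time minus the irq time spent in that window.
 */
uint64_t sketch_task_delta(uint64_t wall_delta_ns, uint64_t irq_delta_ns)
{
	/* irq time in a window cannot exceed the window itself */
	assert(irq_delta_ns <= wall_delta_ns);
	return wall_delta_ns - irq_delta_ns;
}

/*
 * Share of a window, in whole percent. The multiplication is done
 * first, so this assumes part_ns is small enough not to overflow
 * (true for any realistic accounting window).
 */
unsigned int sketch_percent(uint64_t part_ns, uint64_t window_ns)
{
	assert(window_ns > 0);
	return (unsigned int)(part_ns * 100 / window_ns);
}
```

With a 1s window of which 800ms is irq time, sketch_task_delta() leaves 200ms for the task, so it shows up as 20% busy rather than 100%, matching the example in the mail.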
Consider one group heavily dirtying pages, it stuffs the IO queues full and gets blocked on IO completion. Since the CPU is then free to schedule something else we start running things from another group, those IO completions will come in while we run the other group and get accounted to the other group -- FAIL.

s/group/task/ etc..

That just really doesn't work, accounting async work, esp stuff that is not under software control, is very tricky indeed.

So what are you wanting to do, and why. Do you really need accounting madness? --
On Tue, 24 Aug 2010 13:53:55 +0200

Well, I have sent a patch back in 2006 that stops adding the hardirq / softirq time to the currently running process. See http://lkml.org/lkml/2006/8/24/139

It did not get very far, so the answer to the question of whether we need accounting madness seems to be yes ..

-- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
Well, that's only a little accounting, but trying to infer to what task to account IRQ time is going to become madness. --
Yes, we don't have sufficient context to charge the correct context. I think openvz has some technology there, we will too when we have I/O cgroups at a cgroup level, but the instances of such operations are

I think Venki gave the answer in the posting: "There are usecases where reporting this time against task or task groups or cgroups will be useful for user/administrator in terms of resource planning and utilization charging"

I don't have any specific use cases, I was just reviewing the patchset and trying to understand how to solve the problem.

-- Three Cheers, Balbir --
Or confusing, what happens if you attribute the IRQ overhead of a ping-flood to your tasks? By not providing these numbers per task/group people will have to actually think about what it is that is causing these high irq loads and have a chance of actually doing better than random attribution. So no, providing random numbers on the slight chance that they might possibly make sense for your workload doesn't seem like a sound reason to provide them. --
(long email alert)

I have two different answers for why we ended up with this madness: my personal take on why we need this, and the actual flow of how I ended up with this patchset.

- Current /proc/stat hardirq and softirq time reporting is broken for most archs as it does tick sampling. Hardirq time specifically is further broken due to interrupts being disabled during irq - http://kerneltrap.org/mailarchive/linux-kernel/2010/5/25/4574864
OK. Let's fix /proc/stat. But that doesn't seem enough. We should also not account this time to tasks themselves.

- I started looking at not accounting this time to tasks themselves. This was really tricky as things are tightly tied to the scheduler. Multiple issues:

1) A silly case of two tasks on one CPU, one task totally CPU bound and another task doing network recv. This is how task and softirq time looks for this (10s samples)

    (loop)         (nc)
   503  9        502  301
   502  8        502  303
   502  9        501  302
   502  8        502  302
   503  9        501  302

Now, when I did "not account si time to task", the loop task ended up getting a lot less CPU time and doing less work as the nc task doing rcv got more CPU share, which was not the right thing to do. IIRC, I had something like <300 centiseconds for loop after the change (with si activity increasing due to higher runtime of the nc task).

2) Also, a minor problem of breaking the current userspace API, as tasks/cgroup stats assume that irq times are included.

So, even though accounting irq time as "system time" seems the right thing to do, it can break scheduling in many ways. Maybe hardirq can be accounted as system time. But dealing with softirq is tricky as it can be related to the task.

Figuring out si time and accounting it to the right task is a non-starter. There are so many different ways in which si will come into the picture. Finding and accounting it to the right task will be almost impossible.

So, why not do the simple things first. Do not disturb any existing scheduling decisions, account accurate hi and ...
Yeah, architectures without a decent clock are a pain (x86 is still on
that list although nhm/wsm don't suck too bad), but it might be
worthwhile to look at what arch/$foo are strictly tick based.
A quick look suggests:
alpha
arm (some)
avr32
cris (it could remove its implementation, its identical
to the weak function provided by kernel/sched_clock.c)
frv (idem)
h8300
m32r
m68k* (except nommu-coldfire)
mips (except cavium-octeon)
parisc
score
sh
xtensa
which seems to mean too damn many, I bet we can't simply move those to
I'm not exactly sure where that would get complicated, simply treat
interrupts the same as preemptions by other tasks and things should
Is that actually specified or simply assumed because our implementation
always had that bug? I would really call not accounting irq time to
I haven't yet seen any scheduler breakage here, it will divide time
differently, but not in a broken way, if the system consumes 1/3rd of
the time, there's only 2/3rd left to fairly distribute between tasks, so
something like, 1/3-loop 1/3-nc 1/3-softirq makes perfect sense.
You'd get exactly the same kind of thing if you replace (soft)irq with a
FIFO task.
The whole schizo softirq infrastructure (hardirq tails and tasks) is a
pain though, I would really love to rid the kernel of it, but I've got
no idea how to do something like that given that things like the whole
This is where I strongly disagree, providing an interface that cannot
possibly be implemented correctly just so you can fudge something (still
not sure what from userspace) seems a very bad idea indeed.
--
At least the way I tried it, it turned out to be messy. Keep track of time at si and hi and remove it from the update_curr delta. I did it that way as I didn't want to take the rq lock on the hi/si path. Doing it as preemption with put_prev/pick_next would be expensive. No?

But FIFO in that case would be some unrelated task taking away CPU. Here one task can take more than its share due to si. Also, network RFS will try to get softirq to the right CPU that's running this task. So, this will be sort of a common case where the task with softirq will run faster and other non-si tasks will run slower

I don't think correctness is a problem. TSC is pretty good for this purpose on current hardware. I agree that usability is debatable. The use case I mentioned is some management application trying to find interference/slowness for a task/task group because of some other si intensive task or a flood ping on that CPU, getting to know that from the si/hi time for the task and what it "expects it to be". Yes, this is vague. But I think you agree that the problem of si/hi interference on unrelated tasks exists today. And providing this interface was the quick way to give some hint to management apps about such a problem.

But the other alternative of making si and hi time "system time" will help this use case as well, as the user will notice lower exec_time in that case. If you strongly think that the right way is to make both si and hi "system time" and that it will not cause unfairness and slowdown for some unrelated tasks, I can try to clean up the patch I had for that and send it out. I am afraid though, it will cause some regression and we will end up back at square one after a month or so. :(

Thanks, Venki --
On Tue, 24 Aug 2010 19:02:04 -0700 But it is a correctness problem. It is wrong to account the si and hi time to some random process. To base any kind of decision on wrong data is asking for trouble. If we can not correctly attribute the si and hi time to the correct process (which we agree is next to impossible) then the only thing left to do is to report the time on its own. You can still pick a random process in your management application and add the time in user space. As wrong as before but some other application might want to do smarter things with the data point. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
No, removing it from update_curr()'s delta is exactly what I meant. That gives the same end result as if the task were preempted (ie. it ran

No it wouldn't, it would make nc run exactly its fair share. SoftIRQ

No, how is being preempted by an unrelated task (FIFO whatever)

Like already argued, it is a correctness issue, TSC is only an accuracy issue, but since you cannot properly attribute this very accurate time

Sure it will affect some workloads, but it's also more correct. If network workloads suffer we can look at sorting out some of those problems. But really SoftIRQ != task context and thus should not be accounted to said task. --
Right, and this connects to something Frederic sent a few RFC patches for some time ago: finegrained irq/softirq perf stat support. If we do something in this area we need a facility that enables both types of statistics gathering.

Frederic's model is based on exclusion - so you could do a perf stat run that excluded softirq and hardirq execution from a workload's runtime. It's nifty, as it allows the reduction of measurement noise. (IRQ and softirq execution can be regarded as random noise added (or not added) to execution times)

Thanks, Ingo --
That facility is called irq_enter() and irq_exit() etc.. --
Peter, Ping. Does the patchset look sane? Thanks, Venki --
Thanks for the prod, I should rm -rf my inbox and start over,.. its impossible to keep track of things :/ --
