Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting

From: Venkatesh Pallipadi
Date: Monday, July 19, 2010 - 4:57 pm

Earlier version of this patchset here -
lkml subject:
"[RFC PATCH 0/4] Finer granularity and task/cgroup irq time accounting"
http://marc.info/?l=linux-kernel&m=127474630527689&w=2

Currently, softirq and hardirq time is reported only at the CPU level.
There are use cases where reporting this time against tasks, task groups,
or cgroups would be useful to the user/administrator for resource planning
and utilization charging. Also, as the accounting is already done at the
CPU level, reporting the same at the task level does not add any
significant computational overhead other than task level storage (patch 1).

The softirq/hardirq statistics are commonly based on tick-based sampling,
though some archs have CONFIG_VIRT_CPU_ACCOUNTING based fine granularity
accounting. Having a similar mechanism to get fine granularity accounting
on x86 would be a major challenge, given the state of TSC reliability
on various platforms and also the overhead it may add in common paths
like syscall entry and exit.

An alternative is to have a generic (sched_clock based) and configurable
fine-granularity accounting of si and hi time which can be reported
over the /proc/<pid>/stat API (patch 2).

Patches 3 and 4 export this info at the cgroup level.

Changes since the original RFC -
* General code cleanup and documentation for new APIs added.
* Handle notsc option by having a runtime flag sched_clock_irqtime, along
  with the original CONFIG_IRQ_TIME_ACCOUNTING option.
  Peter Zijlstra suggested using an alternative-instruction kind of mechanism
  here. But that is mostly x86 specific and not generic, while the irq time
  accounting code is mostly generic.
* Did performance runs with various systems with tsc based sched_clock -
  both with and without sched_clock_stable - running tbench, dbench, SPECjbb
  and did not notice any measurable slowness when this option is enabled.
Todo -
* Peter Zijlstra suggested modifying scale_rt_power to account for
  irq time. I have a patch for that ...
From: Venkatesh Pallipadi
Date: Monday, July 19, 2010 - 4:57 pm

Currently, the kernel does not account softirq and hardirq times at
the task level. There is irq time info in kstat_cpu, which is
accumulated at the CPU level.

Without the task level information, the non-irq run time of task(s) would
have to be guessed from their exec time and the CPU on which they were
recently running, assuming that the reported CPU irq time is spread
across all the tasks running there. This guess can be wildly off the mark.

Sample case, considering just the softirq:

If there are varied workloads running on a CPU, say a CPU bound task (loop)
and a network IO bound task (nc) along with the network softirq load,
there is no way for the administrator/user to know the non-irq runtime of each
of these tasks. The only information available is the total runtime of each
task and the kstat_cpu softirq time for the CPU.

In this example, considering a 10 second sample, both loop and nc would have
a total run time of ~5s, and the kstat_cpu softirq count on this CPU would
increase by 355 (~3.5s).

So, all the information the user gets is that both tasks run for roughly
the same amount of time and that softirq is around 35%. As a result, the user
may conclude that the irq overhead is equal for both tasks (1.75s each) and
that the non-irq runtime of each task is ~3.25s. Yes, there is another
factor, the system and user time reported for these tasks, that I am ignoring,
as it is tough to correlate with irq time in cases where the tasks have
significant non-irq system time.
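
The naive estimate described above can be sketched as follows. This is only
an illustration of the arithmetic, not code from the patch: the helper name
naive_non_irq_ticks() is hypothetical, and all times are in USER_HZ ticks
(100 ticks per second is assumed).

```c
/* Illustration only: not part of the patch. */

/*
 * Naive estimate: assume the CPU-wide softirq ticks are shared equally by
 * the ntasks tasks that ran on the CPU, and subtract that share from the
 * task's total run time. All values are in USER_HZ ticks.
 */
double naive_non_irq_ticks(double exec_ticks, double cpu_softirq_ticks,
			   int ntasks)
{
	return exec_ticks - cpu_softirq_ticks / ntasks;
}
```

With exec_ticks = 500, cpu_softirq_ticks = 350 and ntasks = 2, this gives
325 ticks (3.25s) for both loop and nc, even though only nc is actually
driving the softirq load.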

This change adds tracking of softirq time on each task and task group.
This information is exported in /proc/<pid>/stat.

So, the user can get info like below, looking at exec_time and si_time in
appropriate /proc/<pid>/stat.
(Taken over 10s intervals; all values in USER_HZ)
       (loop)                    (nc)
exec_time  softirqtime   exec_time  softirqtime
   505          0           500        359
   502          1           501        363
   503          0           502        354
   504          0           499        359
   503          3           500        360

With this, the user can get the non-irq run time as 5s and ~1.45s for
loop and nc, respectively.
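
A minimal sketch of that subtraction, assuming USER_HZ is 100. The helper
names and the SKETCH_USER_HZ constant are hypothetical and only illustrate
how a user could post-process the two /proc/<pid>/stat fields; they are not
part of the patch.

```c
/* Illustration only: not part of the patch. */
#define SKETCH_USER_HZ 100	/* assumed value of USER_HZ */

/* Non-irq run time of a task, in ticks, from its exec_time and si_time. */
unsigned long non_irq_ticks(unsigned long exec_ticks, unsigned long si_ticks)
{
	/* Clamp at zero in case the two counters were read at different times. */
	return exec_ticks > si_ticks ? exec_ticks - si_ticks : 0;
}

/* Convert USER_HZ ticks to seconds. */
double ticks_to_seconds(unsigned long ticks)
{
	return (double)ticks / SKETCH_USER_HZ;
}
```

For the first sample row, non_irq_ticks(505, 0) is 505 ticks for loop and
non_irq_ticks(500, 359) is 141 ticks (about 1.4s) for nc.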

Signed-off-by: ...
From: Venkatesh Pallipadi
Date: Monday, July 19, 2010 - 4:57 pm

s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
the fine granularity accounting of user, system, hardirq, softirq times.
Adding that option on archs like x86 may be challenging however, given the
state of TSC reliability on various platforms and also the overhead it may
add in syscall entry exit.

Instead, add an option that only does finer accounting of hardirq and softirq,
providing precise irq times (instead of timer tick based samples). This
accounting is added with a new config option, CONFIG_IRQ_TIME_ACCOUNTING,
so that there won't be any overhead for users not interested in paying the
perf penalty. And this accounting is based on sched_clock, so other archs
may find it useful as well.
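
The general idea of timestamp based irq accounting can be sketched in user
space as below. This is not the code from the patch: the struct, enum and
function names are hypothetical, the caller passes in timestamps that stand
in for sched_clock() readings, the "enabled" field only models a runtime
switch such as sched_clock_irqtime, and nesting (a hardirq arriving while a
softirq runs) is deliberately ignored to keep the sketch short.

```c
/* Illustration only: not the code from the patch. */
#include <stdbool.h>
#include <stdint.h>

enum irq_sketch_kind { IRQ_SKETCH_HARD, IRQ_SKETCH_SOFT };

struct irqtime_sketch {
	bool enabled;		/* models a runtime on/off flag */
	uint64_t start_ns;	/* timestamp taken at irq entry */
	uint64_t hardirq_ns;	/* accumulated hardirq time */
	uint64_t softirq_ns;	/* accumulated softirq time */
};

/* Record the entry timestamp (stand-in for reading sched_clock()). */
void irqtime_sketch_enter(struct irqtime_sketch *a, uint64_t now_ns)
{
	if (!a->enabled)
		return;
	a->start_ns = now_ns;
}

/* Charge the time since entry to the matching counter. */
void irqtime_sketch_exit(struct irqtime_sketch *a, enum irq_sketch_kind kind,
			 uint64_t now_ns)
{
	uint64_t delta;

	if (!a->enabled)
		return;
	delta = now_ns - a->start_ns;
	if (kind == IRQ_SKETCH_HARD)
		a->hardirq_ns += delta;
	else
		a->softirq_ns += delta;
}
```

The cost of this scheme is two clock reads per interrupt, which is why it
sits behind a config option and a runtime flag.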

Note that the kstat_cpu irq times are still based on tick based samples,
and only the task irq times report this new finer granularity irq time.
The reason is that kstat also includes system time, and changing only
the irq time to finer granularity can result in inconsistencies,
such as the sum of kstat times adding up to more than 100%.

Continuing with the example from previous patch, without finer
granularity accounting, exec_time and si_time in 10s intervals were
(appropriate fields of /proc/<pid>/stat)
       (loop)                 (nc)
exec_time  si_time    exec_time  si_time
   505        0          500       359
   502        1          501       363
   503        0          502       354
   504        0          499       359
   503        3          500       360

And with finer granularity accounting they were
       (loop)                 (nc)
exec_time  si_time    exec_time  si_time
   503        9          502       301
   502        8          502       303
   502        9          501       302
   502        8          502       302
   503        9          501       302
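
One way tick sampling can diverge from precise accounting is sketched below.
This is a toy model, not kernel code: all names are hypothetical, time is in
arbitrary units, and a tick is assumed to charge one full tick period to
softirq whenever softirq happens to be running at the instant the tick fires.

```c
/* Toy model only: not kernel code. */
#include <stddef.h>
#include <stdint.h>

/* A stretch of softirq execution covering [start, end). */
struct busy_interval {
	uint64_t start;
	uint64_t end;
};

/* Precise accounting: sum the real length of every softirq interval. */
uint64_t precise_softirq_time(const struct busy_interval *iv, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += iv[i].end - iv[i].start;
	return sum;
}

/*
 * Tick sampling: at every tick instant in (0, total], charge a whole tick
 * to softirq if some softirq interval covers that instant.
 */
uint64_t sampled_softirq_time(const struct busy_interval *iv, size_t n,
			      uint64_t tick, uint64_t total)
{
	uint64_t charged = 0, t;
	size_t i;

	if (tick == 0)
		return 0;
	for (t = tick; t <= total; t += tick) {
		for (i = 0; i < n; i++) {
			if (iv[i].start <= t && t < iv[i].end) {
				charged += tick;
				break;
			}
		}
	}
	return charged;
}
```

With a tick of 10 over 30 time units, the intervals [9,12) and [19,22) are
charged 20 units by sampling although they only ran for 6, while [1,4) and
[11,14) also ran for 6 but are charged nothing. Sampling can therefore
overestimate or underestimate depending on how the work lines up with ticks.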

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 arch/ia64/include/asm/system.h    |    4 --
 arch/powerpc/include/asm/system.h |    4 --
 arch/s390/include/asm/system.h    |    1 -
 arch/x86/Kconfig                  |   11 +++++++
 arch/x86/kernel/tsc.c             |    2 +
 fs/proc/array.c                   |    4 +-
 include/linux/hardirq.h           |   15 +++++++++-
 include/linux/sched.h             |   13 ++++++++
 kernel/sched.c                    |   59 +++++++++++++++++++++++++++++++++++-
 9 files ...
From: Venkatesh Pallipadi
Date: Monday, July 19, 2010 - 4:57 pm

Generalize cpuacct usage, making it easier to add new stats in the following
patch.

Also adds alloc_percpu_array() interface in percpu.h

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 include/linux/percpu.h |    4 ++++
 kernel/sched.c         |   39 ++++++++++++++++++++++++++-------------
 kernel/sched_fair.c    |    2 +-
 kernel/sched_rt.c      |    2 +-
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index d3a38d6..216f96a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -167,6 +167,10 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 #define alloc_percpu(type)	\
 	(typeof(type) __percpu *)__alloc_percpu(sizeof(type), __alignof__(type))
 
+#define alloc_percpu_array(type, size)	\
+	(typeof(type) __percpu *)__alloc_percpu(sizeof(type) * size, \
+						__alignof__(type))
+
 /*
  * Optional methods for optimized non-lvalue per-cpu variable access.
  *
diff --git a/kernel/sched.c b/kernel/sched.c
index f167fbb..c12c8ea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1396,12 +1396,20 @@ enum cpuacct_stat_index {
 	CPUACCT_STAT_NSTATS,
 };
 
+enum cpuacct_charge_index {
+	CPUACCT_CHARGE_USAGE,	/* ... execution time */
+
+	CPUACCT_CHARGE_NCHARGES,
+};
+
 #ifdef CONFIG_CGROUP_CPUACCT
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+static void cpuacct_charge(struct task_struct *tsk,
+		enum cpuacct_charge_index idx, u64 cputime);
 static void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val);
 #else
-static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
+static inline void cpuacct_charge(struct task_struct *tsk,
+		enum cpuacct_charge_index idx, u64 cputime) {}
 static inline void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val) {}
 #endif
@@ -8661,7 +8669,7 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 /* track cpu usage of a ...
From: Venkatesh Pallipadi
Date: Monday, July 19, 2010 - 4:57 pm

Adds hi_time, si_time, hi_time_percpu and si_time_percpu info in cpuacct
cgroup.

The info will be fine granularity timings when either
CONFIG_IRQ_TIME_ACCOUNTING or CONFIG_VIRT_CPU_ACCOUNTING is enabled.
Otherwise the info will be based on tick samples.

Looked at adding this under cpuacct.stat. But, this information is useful
to the administrator in percpu format, so that any hi or si activity
on a particular CPU can be noted and some resource reallocation
(move the irq away, assign a different CPU to this cgroup, etc)
can be done based on that info.
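
A user-space sketch of per-CPU, per-index charging is given below. It only
illustrates the data layout implied by an index enum plus per-CPU storage;
every name is hypothetical, the fixed SKETCH_NR_CPUS is an assumption, and
the walk up the parent chain is an assumption about hierarchical accounting
rather than something shown in the diff here.

```c
/* Illustration only: not the code from the patch. */
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 4	/* assumed CPU count for the sketch */

enum sketch_charge_index {
	SKETCH_CHARGE_USAGE,	/* execution time */
	SKETCH_CHARGE_SI_TIME,	/* softirq time */
	SKETCH_CHARGE_HI_TIME,	/* hardirq time */
	SKETCH_CHARGE_NCHARGES,
};

struct sketch_group {
	struct sketch_group *parent;
	uint64_t charge[SKETCH_NR_CPUS][SKETCH_CHARGE_NCHARGES];
};

/* Charge ns to the group and to every ancestor, on the given CPU. */
void sketch_charge(struct sketch_group *g, int cpu,
		   enum sketch_charge_index idx, uint64_t ns)
{
	for (; g != NULL; g = g->parent)
		g->charge[cpu][idx] += ns;
}

/* Sum one kind of charge over all CPUs (the non-percpu view). */
uint64_t sketch_total(const struct sketch_group *g,
		      enum sketch_charge_index idx)
{
	uint64_t sum = 0;
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += g->charge[cpu][idx];
	return sum;
}
```

Keeping the per-CPU values around, rather than only the total, is what lets
an administrator spot that the si or hi load comes from one particular CPU.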

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
---
 Documentation/cgroups/cpuacct.txt |    5 +++
 kernel/sched.c                    |   73 +++++++++++++++++++++++++++++++------
 2 files changed, 66 insertions(+), 12 deletions(-)

diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
index 8b93094..817435e 100644
--- a/Documentation/cgroups/cpuacct.txt
+++ b/Documentation/cgroups/cpuacct.txt
@@ -48,3 +48,8 @@ system times. This has two side effects:
   against concurrent writes.
 - It is possible to see slightly outdated values for user and system times
   due to the batch processing nature of percpu_counter.
+
+cpuacct.hi_time and cpuacct.si_time provides the information about hardirq
+and softirq processing time that was accounted to this cgroup. There is also
+percpu variants of hi_time and si_time that splits the info at percpu level.
+All this times are in USER_HZ unit.
diff --git a/kernel/sched.c b/kernel/sched.c
index c12c8ea..7198041 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1398,6 +1398,8 @@ enum cpuacct_stat_index {
 
 enum cpuacct_charge_index {
 	CPUACCT_CHARGE_USAGE,	/* ... execution time */
+	CPUACCT_CHARGE_SI_TIME,	/* ... softirq time */
+	CPUACCT_CHARGE_HI_TIME,	/* ... hardirq time */
 
 	CPUACCT_CHARGE_NCHARGES,
 };
@@ -3226,9 +3228,11 @@ void enable_sched_clock_irqtime(void)
 #endif
 
 #if defined(CONFIG_VIRT_CPU_ACCOUNTING)
-static void ...
From: Martin Schwidefsky
Date: Tuesday, July 20, 2010 - 12:55 am

On Mon, 19 Jul 2010 16:57:11 -0700

I never understood why the softirq and hardirq time gets accounted to a
task at all. Why is it that the poor task that is running gets charged
with the cpu time of an interrupt that has nothing to do with the task?
I consider this to be a bug, and now this gets formalized in the

To get fine granular accounting for interrupts you need to do a
sched_clock call on irq entry and another one on irq exit. Isn't that
too expensive on a x86 system? (I do think this is a good idea but
still there is the worry about the overhead).

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Venkatesh Pallipadi
Date: Tuesday, July 20, 2010 - 9:55 am

On Tue, Jul 20, 2010 at 12:55 AM, Martin Schwidefsky

Agree that this is a bug. I started by looking at resolving that. But,
it was not exactly easy. Ideally we want irq times to be charged to the
right task as much as possible. With things like network rcv softirq,
for example, there is a task that is going to consume the packet
eventually, and that task should be charged. If we can't find a suitable
match we may have to charge it to some system thread. Things like threaded
interrupts will mitigate this problem a bit. But, until we have a good
enough solution, this bug will be around with us.

This change takes a small step by giving a hint about this to the
user/administrator, who can take some corrective action based on it.
The next step is to give the CFS scheduler some info about this, and I am
working on a patch for that. That will help in load balancing
decisions, with an irq heavy CPU not trying to get equal weightage as
other CPUs. I don't think these interfaces are binding in any way. If
and when we have tasks not being charged for irq, we can simply report
"0" in these interfaces (there is some precedent for this in

On x86: Yes. Overhead is a potential problem. That's the reason I had
this inside a CONFIG option. But I have tested this with a few
workloads on different systems released in the past two years
and I did not see any measurable overhead. Note that this is used
only when sched_clock is based off of TSC and not when it is based on
jiffies. The sched_clock overhead I measured on different platforms
was in the 30-150 cycles range, which probably isn't going to be highly
visible in generic workloads.
Archs like s390/powerpc/ia64 already do this kind of accounting with
VIRT_CPU_ACCOUNTING. So, this patch will give them task and cgroup
level info free of charge (other than potential bugs with this code
change :-)).

Thanks,
Venki
--

From: Martin Schwidefsky
Date: Thursday, July 22, 2010 - 4:12 am

On Tue, 20 Jul 2010 09:55:29 -0700

Yes, fixing that behavior will be tough. Just consider a standard page
cache I/O that gets merged with other I/O. You would need to "split" the
interrupt time for a block I/O to the process that benefit from it. An
added twist is that there can be multiple processes that require the
page. Split the time even more to the different requesters of a page?
Then the order when the requests come in suddenly gets important. Or
consider the IP packets in a network buffer, split the interrupt time
to the recipients?
The list goes on and on, my guess is that it will be next to impossible
to do it right. If the current situation is wrong because the irq
and softirq system time gets misaccounted and the "correct" solution is
impossible the only thing left to do is to stop accounting irq and

That makes sense to me, with a working TSC the overhead should be

Well, the task and cgroup information is there but what does it really
tell me? As long as the irq & softirq time can be caused by any other
process I don't see the value of this incorrect data point.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Venkatesh Pallipadi
Date: Thursday, July 22, 2010 - 7:12 pm

On Thu, Jul 22, 2010 at 4:12 AM, Martin Schwidefsky

Data point will be correct. How it gets used is a different question. This
interface will be useful for an Alert/Paranoid/Annoyed user/admin who
sees that the job exec_time is high but it is not doing any useful
work. With this additional info, he can probably choose to move the
job off to a different system. The user probably knows more about the job
characteristics and whether it is rightly or wrongly being charged.
Say, one task in the task group being charged for another task in the
same task group is probably OK as well. So, the user can look at this at a
different granularity than the kernel can.

Thanks,
Venki
--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 12:51 am

I'm very sympathetic with Martin's POV. irq/softirq times per task don't
really make sense. In the case you provide above the solution would be
to subtract these times from the task execution time, not break it out.
In that case he would see his task not do much, and end up with the same
action list.


--

From: Balbir Singh
Date: Tuesday, August 24, 2010 - 1:05 am

cgroup level info does make sense, assuming that tasks that share the
costs being mentioned here belong to the same cgroup. Though the data
is calculated per task, when accumulated per cgroup, it should be
close to being correct.

-- 
	Three Cheers,
	Balbir
--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 2:09 am

I don't think that's a valid assumption.

If it's not true for tasks, then it's not true for groups of tasks either.
It might be slightly less wrong due to the larger number of entities
reducing the error bounds, but it's still wrong in principle.

The whole attribution mess can only be solved by actually splitting out
the entries that do work, like per-cgroup workqueue threads and similar
things.

System wide entities like IRQs are very hard to attribute correctly, as
Martin already argued, and I don't think it's worth doing.
--

From: Balbir Singh
Date: Tuesday, August 24, 2010 - 4:38 am

The point is for containers it is more likely to give the right answer

I see Martin's view point, is the suggestion then that we amortize
these costs across all tasks?


-- 
	Three Cheers,
	Balbir
--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 4:49 am

I'm still not sure what you want them for, but if its for wanting to
know wth the system is up to, simply account them on their own, and not
include them in any task stats.

That is, keep the existing hi/si interface and improve upon that, but
also subtract those times from the task execution times.

That way, if a cpu is like 80% hogged by IRQ action, you'll not see a
100% busy task, but only a 20%.
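
A minimal sketch of that subtraction, under stated assumptions: the struct
and function names are hypothetical, cpu_irq_total_ns stands for a
monotonically increasing per-CPU counter of hardirq plus softirq time, and
this is a user-space model rather than the scheduler's actual update path.

```c
/* User-space model only: not scheduler code. */
#include <stdint.h>

struct task_clock_sketch {
	uint64_t exec_start_ns;		/* when the current slice started */
	uint64_t irq_snapshot_ns;	/* CPU irq total seen at that time */
	uint64_t sum_exec_ns;		/* accumulated non-irq run time */
};

/* Account the slice since exec_start_ns, minus irq time spent in it. */
void update_curr_sketch(struct task_clock_sketch *t, uint64_t now_ns,
			uint64_t cpu_irq_total_ns)
{
	uint64_t delta = now_ns - t->exec_start_ns;
	uint64_t irq_delta = cpu_irq_total_ns - t->irq_snapshot_ns;

	/* Never subtract more than the slice itself. */
	if (irq_delta > delta)
		irq_delta = delta;

	t->sum_exec_ns += delta - irq_delta;
	t->exec_start_ns = now_ns;
	t->irq_snapshot_ns = cpu_irq_total_ns;
}
```

If 800ns of a 1000ns slice were spent in irq context, the task is credited
with 200ns, which is the 20% busy figure from the example above.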

At that point you can also feed the IRQ time back into
sched_rt_avg_update() (which strictly speaking isn't rt but !fair), and
the load-balancer will automagically try and move tasks away from that
cpu.

If you really want to account (and possibly control) all the work
belonging to a particular group you'll have to make sure work does
indeed stay within the group -- which is where per-cgroup workqueue
threads and per-cgroup softirq threads etc. come into play.

Lumping all work together and then trying to extract something again is
silly.

And hardirq time really is system time, not cgroup or task time.
--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 4:53 am

Consider one group heavily dirtying pages, it stuffs the IO queues full
and gets blocked on IO completion. Since the CPU is then free to
schedule something else we start running things from another group,
those IO completions will come in while we run other group and get
accounted to other group -- FAIL.

s/group/task/ etc..

That just really doesn't work, accounting async work, esp stuff that is
not under software control, is very tricky indeed.

So what are you wanting to do, and why. Do you really need accounting
madness?
--

From: Martin Schwidefsky
Date: Tuesday, August 24, 2010 - 5:06 am

On Tue, 24 Aug 2010 13:53:55 +0200

Well, I have sent a patch back in 2006 that stops adding the hardirq /
softirq time to the currently running process. See
http://lkml.org/lkml/2006/8/24/139
It did not get very far, so the answer to the question of whether we need
accounting madness seems to be yes ..

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 5:39 am

Well, that's only a little accounting, but trying to infer to what task
to account IRQ time is going to become madness.
--

From: Balbir Singh
Date: Tuesday, August 24, 2010 - 5:47 am

Yes, we don't have sufficient context to charge the correct context. I
think openvz has some technology there, we will too when we have I/O
cgroups at a cgroup level, but the instances of such operations are

I think Venki gave the answer in the posting

"There are usecases where reporting this time against task
or task groups or cgroups will be useful for user/administrator
in terms of resource planning and utilization charging"

I don't have any specific use cases, I was just reviewing the patchset
and trying to understand how to solve the problem.


-- 
	Three Cheers,
	Balbir
--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 6:08 am

Or confusing, what happens if you attribute the IRQ overhead of a
ping-flood to your tasks?

By not providing these numbers per task/group people will have to
actually think about what it is that is causing these high irq loads and
have a chance of actually doing better than random attribution.

So no, providing random numbers on the slight chance that they might
possibly make sense for your workload doesn't seem like a sound reason
to provide them.


--

From: Venkatesh Pallipadi
Date: Tuesday, August 24, 2010 - 12:20 pm

(long email alert)
I have two different answers for why we ended up with this madness.

My personal take on why we need this and the actual flow why I ended
up with this patchset.

- Current /proc/stat hardirq and softirq time reporting is broken for
most archs as it does tick sampling. Hardirq time specifically is
further broken due to interrupts being disabled during irq -
http://kerneltrap.org/mailarchive/linux-kernel/2010/5/25/4574864

OK. Let's fix /proc/stat. But, that doesn't seem enough. We should also
not account this time to tasks themselves.

- I started looking at not accounting this time to tasks themselves.
This was really tricky, as things are tightly tied to the scheduler,
with multiple issues. 1) A silly case of two tasks on one CPU, one
task totally CPU bound and another task doing network recv. This is
how task and softirq time look for this (10s samples)
(loop)  (nc)
503 9   502 301
502 8   502 303
502 9   501 302
502 8   502 302
503 9   501 302
Now, when I did "not account si time to task", the loop task ended up
getting a lot less CPU time and doing less work, as the nc task doing rcv
got more CPU share, which was not the right thing to do. IIRC, I had
something like <300 centiseconds for loop after the change (with si
activity increasing due to the higher runtime of the nc task).
2) Also, a minor problem of breaking the current userspace API, as
task/cgroup stats assume that irq times are included.

So, even though accounting irq time as "system time" seems
the right thing to do, it can break scheduling in many ways. Maybe
hardirq can be accounted as system time. But dealing with softirq is
tricky, as it can be related to the task.

Figuring out si time and accounting it to the right task is a non-starter.
There are so many different ways in which si will come into the picture.
Finding and accounting it to the right task will be almost impossible.

So, why not do the simple things first. Do not disturb any existing
scheduling decisions, account accurate hi and ...
From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 1:39 pm

Yeah, architectures without a decent clock are a pain (x86 is still on
that list although nhm/wsm don't suck too bad), but it might be
worthwhile to look at what arch/$foo are strictly tick based.

A quick look suggests:

 alpha
 arm (some)
 avr32
 cris (it could remove its implementation, its identical
       to the weak function provided by kernel/sched_clock.c)
 frv  (idem)
 h8300
 m32r
 m68k* (except nommu-coldfire)
 mips (except cavium-octeon)
 parisc
 score
 sh
 xtensa

which seems to mean too damn many, I bet we can't simply move those to


I'm not exactly sure where that would get complicated, simply treat
interrupts the same as preemptions by other tasks and things should


Is that actually specified or simply assumed because our implementation
always had that bug? I would really call not accounting irq time to

I haven't yet seen any scheduler breakage here, it will divide time
differently, but not in a broken way, if the system consumes 1/3rd of
the time, there's only 2/3rd left to fairly distribute between tasks, so
something like, 1/3-loop 1/3-nc 1/3-softirq makes perfect sense.

You'd get exactly the same kind of thing if you replace (soft)irq with a
FIFO task.

The whole schizo softirq infrastructure (hardirq tails and tasks) is a
pain though, I would really love to rid the kernel of it, but I've got
no idea how to do something like that given that things like the whole


This is where I strongly disagree, providing an interface that cannot
possibly be implemented correctly just so you can fudge something (still
not sure what from userspace) seems a very bad idea indeed.


--

From: Venkatesh Pallipadi
Date: Tuesday, August 24, 2010 - 7:02 pm

At least the way I tried it turned out to be messy: keep track of time
at si and hi, and remove it from the update_curr delta. I did it that way
as I didn't want to take the rq lock on the hi/si path. Doing it as
preemption with put_prev/pick_next would be expensive. No?



But, FIFO in that case would be some unrelated task taking away CPU.
Here one task can take more than its share due to si.

Also, network RFS will try to get softirq to the right CPU that's
running this task. So, this will be sort of common case where task
with softirq will run faster and other non-si tasks will run slower

I don't think correctness is a problem. TSC is pretty good for this
purpose on current hardware. I agree that usability is debatable.

The use case I mentioned is some management application trying to find
interference/slowness for a task/task group caused by some other si
intensive task or a flood ping on that CPU, getting to know that from
the si/hi time for the task and what it "expects it to be". Yes, this is
vague. But I think you agree that the problem of si/hi interference on
unrelated tasks exists today. And providing this interface was the quick
way to give some hint to management apps about such a problem. But the other
alternative, making si and hi time "system time", will help this
use case as well, as the user will notice a lower exec_time in that
case.

If you strongly think that the right way is to make both si and hi
"system time" and that will not cause unfairness and slowdown for some
unrelated tasks, I can try to cleanup the patch I had for that and
send it out. I am afraid though, it will cause some regression and we
will end up back at square one after a month or so. :(

Thanks,
Venki
--

From: Martin Schwidefsky
Date: Wednesday, August 25, 2010 - 12:20 am

On Tue, 24 Aug 2010 19:02:04 -0700

But it is a correctness problem. It is wrong to account the si and hi
time to some random process. To base any kind of decision on wrong data
is asking for trouble. If we can not correctly attribute the si and hi
time to the correct process (which we agree is next to impossible) then
the only thing left to do is to report the time on its own. You can
still pick a random process in your management application and add the
time in user space. As wrong as before but some other application might
want to do smarter things with the data point.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

--

From: Peter Zijlstra
Date: Wednesday, September 8, 2010 - 4:12 am

No, removing it from update_curr()'s delta is exactly what I meant. That
gives the same end result as if the task were preempted (ie. it ran

No it wouldn't, it would make nc run exactly its fair share. SoftIRQ

No, how is being preempted by an unrelated task (FIFO whatever)

Like already argued, it is a correctness issue, TSC is only an accuracy
issue, but since you cannot properly attribute this very accurate time

Sure it will affect some workloads, but its also more correct. If
network workloads suffer we can look at sorting out some of those
problems. But really SoftIRQ != task context and thus should not be
accounted to said task.
--

From: Ingo Molnar
Date: Tuesday, August 24, 2010 - 1:14 am

Right, and this connects to something Frederic sent a few RFC patches for
some time ago: finegrained irq/softirq perf stat support. If we do 
something in this area we need a facility that enables both types of 
statistics gathering.

Frederic's model is based on exclusion - so you could do a perf stat run 
that excluded softirq and hardirq execution from a workload's runtime. 
It's nifty, as it allows the reduction of measurement noise. (IRQ and 
softirq execution can be regarded as random noise added (or not added) 
to execution times)

Thanks,

	Ingo
--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 1:49 am

That facility is called irq_enter() and irq_exit() etc..
--

From: Venkatesh Pallipadi
Date: Monday, August 23, 2010 - 5:56 pm

Peter,

Ping.
Does the patchset look sane?

Thanks,
Venki


--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 12:52 am

Thanks for the prod, I should rm -rf my inbox and start over... it's
impossible to keep track of things :/
--
