Impact: remove spinlocks and irq disabling in function return tracer.
I've tried to figure out all of the race condition that could happen when
the tracer pushes or pops a return address trace to/from the current
thread_info.
Theory:
_ One thread can only execute on one cpu at a time. So this code doesn't need
to be SMP-safe. Just drop the spinlock.
_ The only race could happen between the current thread and an interrupt. If an
interrupt is raised, it will increase the index of the return stack storage and
then execute until the end of the tracing to finally free the index it used.
We don't need to disable irqs.
This is theorical. In practice, I've tested it with a two-core SMP and had no
problem at all. Perhaps -tip testing could confirm it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
arch/x86/kernel/ftrace.c | 43 +++++--------------------------------------
1 files changed, 5 insertions(+), 38 deletions(-)
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 16a571d..1db0e12 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -44,62 +44,37 @@ void ftrace_nmi_exit(void)
atomic_dec(&in_nmi);
}
-/*
- * Synchronize accesses to return adresses stack with
- * interrupts.
- */
-static raw_spinlock_t ret_stack_lock;
-
/* Add a function return address to the trace stack on thread info.*/
static int push_return_trace(unsigned long ret, unsigned long long time,
unsigned long func)
{
int index;
- struct thread_info *ti;
- unsigned long flags;
- int err = 0;
-
- raw_local_irq_save(flags);
- __raw_spin_lock(&ret_stack_lock);
+ struct thread_info *ti = current_thread_info();
- ti = current_thread_info();
/* The return trace stack is full */
- if (ti->curr_ret_stack == FTRACE_RET_STACK_SIZE - 1) {
- err = -EBUSY;
- goto out;
- }
+ if (ti->curr_ret_stack == FTRACE_RET_STACK_SIZE - 1)
+ return -EBUSY;
index = ...applied to tip/tracing/function-return-tracer, thanks Frederic! Ingo --
BTW I'm wondering about consistency in time capturing. When I look into kernel/sched_clock.c I see this in introduction: "The clock: sched_clock_cpu() is monotonic per cpu, and should be somewhat consistent between cpus (never more than 2 jiffies difference)." Two Jiffies, that could result in a lot of inconsistency in the way of nanosec capturing. The current task can be preempted between the call time and the return time and I'm doing a cpu_clock(raw_smp_processor_id) on these two times. Should I keep the same processor_id for these two captures? But what would happen if this cpu is shut down between these two times? One other solution would be to plan time capture in usec but I would mostly lose the interest of function cost measuring.... What do you think? Thanks. --
in practice the jitter is much lower - a couple of microseconds - up to a few dozen at most. But it's a possibility, and i think the best solution is something that Steve suggested yesterday: a /debug/tracing/trace_options flag that turns on global ordering for tracing timestamps. Something like: echo global_timestamps > /debug/tracing/trace_options tracers could also change the default of this flag. The function-cost tracer will probably want to default to globally synchronous timestamps, while the preempt and irqsoff tracers want to default to local timestamps only. Would something like this work for you? Ingo --
But I guess this flag would apply on the timestamp inserted by the ring-buffer. Unfortunately I can't use it since I have to capture the clock for two times and not only during insertion in the ring-buffer. --
i think the clock should be a property of the tracer, not of the ring buffer. Hence if a tracer has the option set, it will get coherent timestamps - including ringbuffer insertion timestamps. Ingo --
If so that would be suitable. And for his purpose, the ring-buffer would propose a function for time snapshot adapted to the current flags and that rely on sched_clock? --
yeah - and it would be all means by a global entity in the beginning - i.e. we'd just generalize the code around ring_buffer_time_stamp() to listen to the "globally coherent" flag, and allow it to be used for callgraph cost measurement code too. If the "globally coherent" flag is set, then the implementation would be something like: A simple "last global timestamp" value combined with a "last local timestamp" value, and the global timestamp is only ever moved forward. It is updated via cmpxchg loop. This gives coherency and a monotonic clock. The local timestamp would be taken from cpu_clock(cpu), and a global timestamp would be constructed out of it. Or something like that. Would that work? [ Would you be interested in sending patches? :-) ] Ingo --
Ok, so correct me if I'm wrong.
Global timestamp would be captured by using sched_clock().
That's what is done currently in ring_buffer_time_stamp()
And the global timestamp would be combination of a last global
timestamp and a relative position from now to this last at
each insertion in the ring-buffer (or tracing time capture). Am I right?
I don't really understand why you want to update with a cmpxchg loop...
And the local timestamp would be built through cpu_clock() with
absolute values at each captures? Because we can't consider
relative values since this is loosing sense between cpu. Unless we
have per_cpu relative values.....
Then, depending of the current_tracer, ring_buffer_time_stamp would
act as a wrapper:
if (current_tracer->time_flag == TIME_GLOBAL_COHERENT)
return ring_buffer_global_timestamp();
else
return ring_buffer_local_timestamp();
Or more efficient with a function pointer set at the same we change
the time flag for the current tracer.
--
the cmpxchg loop would be needed to ensure timestamp monotonicity:
every new "global time" is cmpxchg-ed with the "previous global time"
(and is first monotonicity checked).
"prev_global_time" also acts as a global serializer: it ensures that
events are timestamped in a monotonic and ordered way.
i.e. something like this (pseudocode, without the cmpxchg):
u64 prev_global_time;
DEFINE_PER_CPU(prev_local_time);
u64 global_time()
{
u64 now, delta, now_global;
prev_global = prev_global_time;
now = sched_clock();
delta = now - per_cpu(prev_local_time, this_cpu);
per_cpu(prev_local_time, this_cpu) = now;
now_global = prev_global + delta;
prev_global = now_global;
return now_global;
}
note how we build "global time" out of "local time".
The cmpxchg would be used to put the above one into a loop, and
instead of updating the global time in a racy way:
prev_global = now_global;
We'd update it via the cmpxchg:
atomic64_t prev_global_time;
...
while (atomic64_cmpxchg(&prev_global_time,
prev_global, now_global) != prev_global) {
[...]
}
To make sure the global time goes monotonic. (this way we also avoid a
spinlock - locks are fragile for instrumentation)
Ingo
--
This goes down shit-creek real fast if the TSC goes funny and jumps fwd or something. --
yes, it was simplified pseudocode: instead of raw sched_clock() it should use a more reliable source like cpu_clock(this_cpu). the important bit is how to build a global clock out of the local clock. Ingo --
Ok, I understand better.
But consider the following:
u64 global_time()
{
u64 now, delta, now_global;
prev_global = prev_global_time;
while (atomic64_cmpxchg(&prev_global_time,
prev_global, now_global) != prev_global) {
now = sched_clock();
delta = now - per_cpu(prev_local_time, this_cpu);
per_cpu(prev_local_time, this_cpu) = now;
now_global = prev_global + delta;
prev_global = now_global;
}
return now_global;
}
Sarting with prev_global_time = 0
If we have two cpu and the above function is executed 5 times on the first cpu.
We couldl have per_cpu(prev_local_time) = 50 for example. And so
prev_global_time will be equal to 50.
Just after that, almost at the same time, cpu2 calls global_time()
delta will be equal to 50 (sched_clock() - per_cpu(prev_local_time)
which is 0) and prev_global_time will be 50 + 50 = 100.
This is not consistent.
I don't know where but I'm pretty sure I missed something....
--
you are right - it needs a bit more logic.
I think the simplest would be something like this:
atomic64_t global_clock = INIT_ATOMIC64(0);
u64 global_time()
{
u64 now, delta, now_global, prev_global;
do {
prev_global = atomic64_read(&global_clock);
now = cpu_clock(raw_smp_processor_id());
if ((s64)(now - prev_global) < 0) {
now = prev_global;
break;
}
} while (atomic64_cmpxchg(&global_clock,
prev_global, now) != prev_global);
return now;
}
This is the simplest way of implementing monotonic time: we only allow
global_clock to go forwards. If all cpu_clock()s are perfectly in
sync, we've got no problem: then "now - prev_global" will never be
negative and we can return the local clock as the latest global time.
But if one of the CPU clocks is "behind", the function returns the
latest global time up until the local clock catches up. Time wont be
allowed to jump around by going back. If the clock is behind for a
long time, then we get a lot of timestamps with the same value - that
will be very visible in the trace and we'll then work in improving the
cpu_clock() implementation.
So i think we could start with this simplest approach, and see how
often we get the same timestamp for a long time (indication of the
clocks being not perfectly in sync).
Ingo
--
Ok! I understand now this approach. So, if global ordering flag is set, we return this kind of protected value, Ok, good idea. Thanks for the explanations! --
Frederic Weisbecker <fweisbec@gmail.com> writes:
The problem I think is that you assume the ++ is atomic against
interrupts, which is not guaranteed by the C compiler. e.g.
it would be perfectly legal for the compiler to generate code like
local register i
i = index;
write to index'ed array using i
<--------- interrupt here would overwrite data
...
index = i + 1;
You would need to convert the index access to a "local_add_return()" and
possibly also add memory barriers.
-Andi
--
ak@linux.intel.com
--
Yes in the common case that would be a danger. But here, if an interrupt is raised, it will increment the counter and then decrement it at return time without dropping the cpu. So after the interrupt, the value will remain the same... --
The buffer contents will not be necessarily the same. See the scenario above. The interrupt would use the same i as the current function and would overwrite the partially written entry. -Andi --
So the answer to this is: i = index++; barrier(); write to index i (not index); -- Steve --
That was my first thought when I wrote the original email, but the disadvantage is that barrier() is a big hammer that flushes everything and can make the code much worse. That is why I suggested local_add_return() instead. -Andi -- ak@linux.intel.com --
barrier() is a compiler barrier, does nothing with the caches, and is quite cheap. We only need a compiler barrier because we are only protecting ourselves from things that happen on the current CPU. No other devices or other CPUs are involved. -- Steve --
Oh I see the issue now. The value of the index could have been incremented in a register and not yet in the memory... So yes, a barrier() to make these operations flushed in memory before using the index. --
I did not refer to CPU caches, but the compiler's register allocation [ok if you want the registers are the "level 0 cache"]. A memory barrier all messes it up. That is why it is better to only clobber specific memory regions, which is what local_* does. -Andi -- ak@linux.intel.com --
#define barrier() __asm__ __volatile__("": : :"memory")
static inline long local_add_return(long i, local_t *l)
{
long __i;
#ifdef CONFIG_M386
unsigned long flags;
if (unlikely(boot_cpu_data.x86 <= 3))
goto no_xadd;
#endif
/* Modern 486+ processor */
__i = i;
asm volatile(_ASM_XADD "%0, %1;"
: "+r" (i), "+m" (l->a.counter)
: : "memory");
return i + __i;
#ifdef CONFIG_M386
no_xadd: /* Legacy 386 processor */
local_irq_save(flags);
__i = local_read(l);
local_set(l, i + __i);
local_irq_restore(flags);
return i + __i;
#endif
}
Now tell me again how local_* is more efficient than barrier?
Not to mention, if this is ever used on other archs with load-linked and
store-conditional, it gets even worse.
-- Steve
--
> Now tell me again how local_* is more efficient than barrier? Look at the generated assembler, not the source. The M386 check is two instructions with the i386 stuff just moved away somewhere out of line that is about never taken. The rest is short and straight forward in asm too. And a lot of kernels are not compiled with 386 support anyways. -Andi -- ak@linux.intel.com --
OK, lets look at just the asm.
barrier:
__asm__ __volatile__("": : :"memory")
local_add_return:
asm volatile(_ASM_XADD "%0, %1;"
: "+r" (i), "+m" (l->a.counter)
: : "memory");
Your argument was that barrier clobbers memory, but it looks like the
local_add_return does the same.
Lets go look at the code that we are talking about:
index = ++ti->curr_ret_stack;
ti->ret_stack[index].ret = ret;
ti->ret_stack[index].func = func;
ti->ret_stack[index].calltime = time;
Which as you stated was incorrect, and it is. But you suggested that we
should switch to using local_add_return which would cause us to change a
lot of code to handle the type change. When this is ported to other archs,
they will copy that too, and then be using a lesser efficient algorithm.
All that is needed is to change the pointer to always point to the next
item, and then do:
index = ti->curr_ret_stack++;
barrier();
ti->ret_stack[index].ret = ret;
ti->ret_stack[index].func = func;
ti->ret_stack[index].calltime = time;
I don't see any major gain in switching to local_add_return.
-- Steve
--
That's a fair point. I'm not sure why the local_add_* has a memory barrier at all, imho it shouldn't need it. I bet it's just That's true with the current implementation. It would be a good idea to consider dropping that explicit barrier though, then it would be likely a slight win. -Andi -- ak@linux.intel.com --
