Re: [PATCH 1/2] tracing/function-return-tracer: Make the function return tracer lockless

Previous thread: [PATCH 6/6] ide: remove IDE PM hack from do_ide_request() by Bartlomiej Zolnierkiewicz on Wednesday, November 12, 2008 - 2:17 pm. (1 message)

Next thread: [PATCH 2/2] tracing/function-return-tracer: Call prepare_ftrace_return by registers by Frederic Weisbecker on Wednesday, November 12, 2008 - 2:49 pm. (15 messages)
From: Frederic Weisbecker
Date: Wednesday, November 12, 2008 - 2:47 pm

Impact: remove spinlocks and irq disabling in function return tracer.

I've tried to figure out all of the race condition that could happen when
the tracer pushes or pops a return address trace to/from the current
thread_info.

Theory:

_ One thread can only execute on one cpu at a time. So this code doesn't need
  to be SMP-safe. Just drop the spinlock.
_ The only race could happen between the current thread and an interrupt. If an
  interrupt is raised, it will increase the index of the return stack storage and
  then execute until the end of the tracing to finally free the index it used.
  We don't need to disable irqs.

This is theorical. In practice, I've tested it with a two-core SMP and had no
problem at all. Perhaps -tip testing could confirm it.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/ftrace.c |   43 +++++--------------------------------------
 1 files changed, 5 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 16a571d..1db0e12 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -44,62 +44,37 @@ void ftrace_nmi_exit(void)
 	atomic_dec(&in_nmi);
 }
 
-/*
- * Synchronize accesses to return adresses stack with
- * interrupts.
- */
-static raw_spinlock_t ret_stack_lock;
-
 /* Add a function return address to the trace stack on thread info.*/
 static int push_return_trace(unsigned long ret, unsigned long long time,
 				unsigned long func)
 {
 	int index;
-	struct thread_info *ti;
-	unsigned long flags;
-	int err = 0;
-
-	raw_local_irq_save(flags);
-	__raw_spin_lock(&ret_stack_lock);
+	struct thread_info *ti = current_thread_info();
 
-	ti = current_thread_info();
 	/* The return trace stack is full */
-	if (ti->curr_ret_stack == FTRACE_RET_STACK_SIZE - 1) {
-		err = -EBUSY;
-		goto out;
-	}
+	if (ti->curr_ret_stack == FTRACE_RET_STACK_SIZE - 1)
+		return -EBUSY;
 
 	index = ...
From: Ingo Molnar
Date: Wednesday, November 12, 2008 - 3:15 pm

applied to tip/tracing/function-return-tracer, thanks Frederic!

	Ingo
--

From: Frédéric Weisbecker
Date: Thursday, November 13, 2008 - 1:50 am

BTW I'm wondering about consistency in time capturing.
When I look into kernel/sched_clock.c I see this in introduction:

"The clock: sched_clock_cpu() is monotonic per cpu, and should be somewhat
consistent between cpus (never more than 2 jiffies difference)."

Two Jiffies, that could result in a lot of inconsistency in the way of
nanosec capturing. The current task
can be preempted between the call time and the return time and I'm
doing a cpu_clock(raw_smp_processor_id)
on these two times.
Should I keep the same processor_id for these two captures? But what
would happen if this cpu is shut
down between these two times?
One other solution would be to plan time capture in usec but I would
mostly lose the interest of function cost measuring....

What do you think?
Thanks.
--

From: Ingo Molnar
Date: Thursday, November 13, 2008 - 1:55 am

in practice the jitter is much lower - a couple of microseconds - up 
to a few dozen at most.

But it's a possibility, and i think the best solution is something 
that Steve suggested yesterday: a /debug/tracing/trace_options flag 
that turns on global ordering for tracing timestamps. Something like:

  echo global_timestamps > /debug/tracing/trace_options

tracers could also change the default of this flag. The function-cost 
tracer will probably want to default to globally synchronous 
timestamps, while the preempt and irqsoff tracers want to default to 
local timestamps only.

Would something like this work for you?

	Ingo
--

From: Frédéric Weisbecker
Date: Thursday, November 13, 2008 - 2:16 am

But I guess this flag would apply on the timestamp inserted by the
ring-buffer. Unfortunately I can't use it
since I have to capture the clock for two times and not only during
insertion in the ring-buffer.
--

From: Ingo Molnar
Date: Thursday, November 13, 2008 - 2:23 am

i think the clock should be a property of the tracer, not of the ring 
buffer. Hence if a tracer has the option set, it will get coherent 
timestamps - including ringbuffer insertion timestamps.

	Ingo
--

From: Frédéric Weisbecker
Date: Thursday, November 13, 2008 - 2:27 am

If so that would be suitable. And for his purpose, the ring-buffer
would propose a function for time snapshot
adapted to the current flags and that rely on sched_clock?
--

From: Ingo Molnar
Date: Thursday, November 13, 2008 - 2:40 am

yeah - and it would be all means by a global entity in the beginning - 
i.e. we'd just generalize the code around ring_buffer_time_stamp() to 
listen to the "globally coherent" flag, and allow it to be used for 
callgraph cost measurement code too.

If the "globally coherent" flag is set, then the implementation would 
be something like:

A simple "last global timestamp" value combined with a "last local 
timestamp" value, and the global timestamp is only ever moved forward. 
It is updated via cmpxchg loop. This gives coherency and a monotonic 
clock. The local timestamp would be taken from cpu_clock(cpu), and a 
global timestamp would be constructed out of it. Or something like 
that.

Would that work? [ Would you be interested in sending patches? :-) ]

	Ingo
--

From: Frédéric Weisbecker
Date: Thursday, November 13, 2008 - 5:36 am

Ok, so correct me if I'm wrong.
Global timestamp would be captured by using sched_clock().
That's what is done currently in ring_buffer_time_stamp()
And the global timestamp would be combination of a last global
timestamp and a relative position from now to this last at
each insertion in the ring-buffer (or tracing time capture). Am I right?
I don't really understand why you want to update with a cmpxchg loop...

And the local timestamp would be built through cpu_clock() with
absolute values at each captures? Because we can't consider
relative values since this is loosing sense between cpu. Unless we
have per_cpu relative values.....

Then, depending of the current_tracer, ring_buffer_time_stamp would
act as a wrapper:
if (current_tracer->time_flag == TIME_GLOBAL_COHERENT)
    return ring_buffer_global_timestamp();
else
    return ring_buffer_local_timestamp();

Or more efficient with a function pointer set at the same we change
the time flag for the current tracer.
--

From: Ingo Molnar
Date: Thursday, November 13, 2008 - 5:54 am

the cmpxchg loop would be needed to ensure timestamp monotonicity: 
every new "global time" is cmpxchg-ed with the "previous global time" 
(and is first monotonicity checked).

"prev_global_time" also acts as a global serializer: it ensures that 
events are timestamped in a monotonic and ordered way.

i.e. something like this (pseudocode, without the cmpxchg):

 u64 prev_global_time;

 DEFINE_PER_CPU(prev_local_time);

 u64 global_time()
 {
	u64 now, delta, now_global;

	prev_global = prev_global_time;
	now = sched_clock();
	delta = now - per_cpu(prev_local_time, this_cpu);
	per_cpu(prev_local_time, this_cpu) = now;

	now_global = prev_global + delta;
	prev_global = now_global;

	return now_global;
 }

note how we build "global time" out of "local time".

The cmpxchg would be used to put the above one into a loop, and 
instead of updating the global time in a racy way:

	prev_global = now_global;

We'd update it via the cmpxchg:

	atomic64_t prev_global_time;

	...

	while (atomic64_cmpxchg(&prev_global_time,
				 prev_global, now_global) != prev_global) {
		[...]
	}

To make sure the global time goes monotonic. (this way we also avoid a 
spinlock - locks are fragile for instrumentation)

	Ingo
--

From: Peter Zijlstra
Date: Thursday, November 13, 2008 - 5:59 am

This goes down shit-creek real fast if the TSC goes funny and jumps fwd
or something.

--

From: Ingo Molnar
Date: Thursday, November 13, 2008 - 6:02 am

yes, it was simplified pseudocode: instead of raw sched_clock() it 
should use a more reliable source like cpu_clock(this_cpu).

the important bit is how to build a global clock out of the local 
clock.

	Ingo
--

From: Frédéric Weisbecker
Date: Thursday, November 13, 2008 - 10:26 am

Ok, I understand better.
But consider the following:

 u64 global_time()
 {
       u64 now, delta, now_global;
       prev_global = prev_global_time;

       while (atomic64_cmpxchg(&prev_global_time,
                                 prev_global, now_global) != prev_global) {

           now = sched_clock();
           delta = now - per_cpu(prev_local_time, this_cpu);
           per_cpu(prev_local_time, this_cpu) = now;
           now_global = prev_global + delta;
           prev_global = now_global;
       }
       return now_global;
 }

Sarting with prev_global_time = 0
If we have two cpu and the above function is executed 5 times on the first cpu.
We couldl have per_cpu(prev_local_time) = 50 for example. And so
prev_global_time will be equal to 50.

Just after that, almost at the same time, cpu2 calls global_time()

delta will be equal to 50 (sched_clock() - per_cpu(prev_local_time)
which is 0) and prev_global_time will be 50 + 50 = 100.
This is not consistent.
I don't know where but I'm pretty sure I missed something....
--

From: Ingo Molnar
Date: Thursday, November 13, 2008 - 11:57 am

you are right - it needs a bit more logic.

I think the simplest would be something like this:

 atomic64_t global_clock = INIT_ATOMIC64(0);

 u64 global_time()
 {
	u64 now, delta, now_global, prev_global;

	do {
		prev_global = atomic64_read(&global_clock);
		now = cpu_clock(raw_smp_processor_id());

		if ((s64)(now - prev_global) < 0) {
			now = prev_global;
			break;
		}
	} while (atomic64_cmpxchg(&global_clock,
				   prev_global, now) != prev_global);
 
	return now;
 }

This is the simplest way of implementing monotonic time: we only allow 
global_clock to go forwards. If all cpu_clock()s are perfectly in 
sync, we've got no problem: then "now - prev_global" will never be 
negative and we can return the local clock as the latest global time.

But if one of the CPU clocks is "behind", the function returns the 
latest global time up until the local clock catches up. Time wont be 
allowed to jump around by going back. If the clock is behind for a 
long time, then we get a lot of timestamps with the same value - that 
will be very visible in the trace and we'll then work in improving the 
cpu_clock() implementation.

So i think we could start with this simplest approach, and see how 
often we get the same timestamp for a long time (indication of the 
clocks being not perfectly in sync).

	Ingo
--

From: Frédéric Weisbecker
Date: Thursday, November 13, 2008 - 1:06 pm

Ok! I understand now this approach.
So, if global ordering flag is set, we return this kind of protected value,



Ok, good idea.
Thanks for the explanations!
--

From: Andi Kleen
Date: Thursday, November 13, 2008 - 7:53 am

Frederic Weisbecker <fweisbec@gmail.com> writes:

The problem I think is that you assume the ++ is atomic against
interrupts, which is not guaranteed by the C compiler. e.g.
it would be perfectly legal for the compiler to generate code like

local register i
i = index;
write to index'ed array using i
                            <--------- interrupt here would overwrite data
...
index = i + 1;

You would need to convert the index access to a "local_add_return()" and 
possibly also add memory barriers.

-Andi

-- 
ak@linux.intel.com
--

From: Frédéric Weisbecker
Date: Thursday, November 13, 2008 - 10:01 am

Yes in the common case that would be a danger. But here, if an
interrupt is raised, it will increment
the counter and then decrement it at return time without dropping the
cpu. So after the interrupt, the
value will remain the same...
--

From: Andi Kleen
Date: Thursday, November 13, 2008 - 10:12 am

The buffer contents will not be necessarily
the same. See the scenario above. The interrupt would use the 
same i as the current function and would overwrite the
partially written entry.

-Andi
--

From: Steven Rostedt
Date: Thursday, November 13, 2008 - 10:21 am

So the answer to this is:

  i = index++;
  barrier();
  write to index i (not index);

-- Steve


--

From: Andi Kleen
Date: Thursday, November 13, 2008 - 10:34 am

That was my first thought when I wrote the original email,
but the disadvantage is that barrier() is a big hammer
that flushes everything and can make the code much worse.
That is why I suggested local_add_return() instead.

-Andi

-- 
ak@linux.intel.com
--

From: Steven Rostedt
Date: Thursday, November 13, 2008 - 10:32 am

barrier() is a compiler barrier, does nothing with the caches, and is 
quite cheap. We only need a compiler barrier because we are only 
protecting ourselves from things that happen on the current CPU. No other 
devices or other CPUs are involved.

-- Steve

--

From: Frédéric Weisbecker
Date: Thursday, November 13, 2008 - 10:46 am

Oh I see the issue now. The value of the index could have been
incremented in a register and not yet
in the memory...
So yes, a barrier() to make these operations flushed in memory before
using the index.
--

From: Andi Kleen
Date: Thursday, November 13, 2008 - 11:29 am

I did not refer to CPU caches, but the compiler's register allocation
[ok if you want the registers are the "level 0 cache"]. A memory barrier
all messes it up. That is why it is better to only clobber specific
memory regions, which is what local_* does.

-Andi

-- 
ak@linux.intel.com
--

From: Steven Rostedt
Date: Thursday, November 13, 2008 - 11:35 am

#define barrier() __asm__ __volatile__("": : :"memory")

static inline long local_add_return(long i, local_t *l)
{
	long __i;
#ifdef CONFIG_M386
	unsigned long flags;
	if (unlikely(boot_cpu_data.x86 <= 3))
		goto no_xadd;
#endif
	/* Modern 486+ processor */
	__i = i;
	asm volatile(_ASM_XADD "%0, %1;"
		     : "+r" (i), "+m" (l->a.counter)
		     : : "memory");
	return i + __i;

#ifdef CONFIG_M386
no_xadd: /* Legacy 386 processor */
	local_irq_save(flags);
	__i = local_read(l);
	local_set(l, i + __i);
	local_irq_restore(flags);
	return i + __i;
#endif
}



Now tell me again how local_* is more efficient than barrier?

Not to mention, if this is ever used on other archs with load-linked and 
store-conditional, it gets even worse.

-- Steve

--

From: Andi Kleen
Date: Thursday, November 13, 2008 - 11:55 am

> Now tell me again how local_* is more efficient than barrier?

Look at the generated assembler, not the source. The M386 check is two instructions
with the i386 stuff just moved away somewhere out of line that is
about never taken. The rest is short and straight forward in asm too.
And a lot of kernels are not compiled with 386 support anyways.

-Andi

-- 
ak@linux.intel.com
--

From: Steven Rostedt
Date: Thursday, November 13, 2008 - 12:03 pm

OK, lets look at just the asm.

barrier:
	__asm__ __volatile__("": : :"memory")

local_add_return:
	asm volatile(_ASM_XADD "%0, %1;"
		     : "+r" (i), "+m" (l->a.counter)
		     : : "memory");


Your argument was that barrier clobbers memory, but it looks like the 
local_add_return does the same.

Lets go look at the code that we are talking about:

 	index = ++ti->curr_ret_stack;
 	ti->ret_stack[index].ret = ret;
 	ti->ret_stack[index].func = func;
 	ti->ret_stack[index].calltime = time;

Which as you stated was incorrect, and it is. But you suggested that we 
should switch to using local_add_return which would cause us to change a 
lot of code to handle the type change. When this is ported to other archs, 
they will copy that too, and then be using a lesser efficient algorithm.

All that is needed is to change the pointer to always point to the next 
item, and then do:

        index = ti->curr_ret_stack++;
	barrier();
        ti->ret_stack[index].ret = ret;
        ti->ret_stack[index].func = func;
        ti->ret_stack[index].calltime = time;

I don't see any major gain in switching to local_add_return.

-- Steve
--

From: Andi Kleen
Date: Thursday, November 13, 2008 - 12:19 pm

That's a fair point. I'm not sure why the local_add_* has a memory
barrier at all, imho it shouldn't need it. I bet it's just

That's true with the current implementation. It would
be a good idea to consider dropping that explicit 
barrier though, then it would be likely a slight win.

-Andi

-- 
ak@linux.intel.com
--

Previous thread: [PATCH 6/6] ide: remove IDE PM hack from do_ide_request() by Bartlomiej Zolnierkiewicz on Wednesday, November 12, 2008 - 2:17 pm. (1 message)

Next thread: [PATCH 2/2] tracing/function-return-tracer: Call prepare_ftrace_return by registers by Frederic Weisbecker on Wednesday, November 12, 2008 - 2:49 pm. (15 messages)