Re: [PATCH 02/22 -v7] Add basic support for gcc profiler instrumentation

Previous thread: [PATCH 15/22 -v7] trace generic call to schedule switch by Steven Rostedt on Tuesday, January 29, 2008 - 8:15 pm. (1 message)

Next thread: [PATCH 19/22 -v7] trace preempt off critical timings by Steven Rostedt on Tuesday, January 29, 2008 - 8:15 pm. (3 messages)
From: Steven Rostedt
Date: Tuesday, January 29, 2008 - 8:15 pm

If CONFIG_MCOUNT is selected and /proc/sys/kernel/mcount_enabled is set to a
non-zero value the mcount routine will be called everytime we enter a kernel
function that is not marked with the "notrace" attribute.

The mcount routine will then call a registered function if a function
happens to be registered.

[This code has been highly hacked by Steven Rostedt, so don't
 blame Arnaldo for all of this ;-) ]

Update:
  It is now possible to register more than one mcount function.
  If only one mcount function is registered, that will be the
  function that mcount calls directly. If more than one function
  is registered, then mcount will call a function that will loop
  through the functions to call.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 Makefile                   |    3 
 arch/x86/Kconfig           |    1 
 arch/x86/kernel/entry_32.S |   25 +++++++
 arch/x86/kernel/entry_64.S |   36 +++++++++++
 include/linux/linkage.h    |    2 
 include/linux/mcount.h     |   38 ++++++++++++
 kernel/sysctl.c            |   11 +++
 lib/Kconfig.debug          |    1 
 lib/Makefile               |    2 
 lib/tracing/Kconfig        |   10 +++
 lib/tracing/Makefile       |    3 
 lib/tracing/mcount.c       |  141 +++++++++++++++++++++++++++++++++++++++++++++
 12 files changed, 273 insertions(+)

Index: linux-mcount.git/Makefile
===================================================================
--- linux-mcount.git.orig/Makefile	2008-01-29 17:01:56.000000000 -0500
+++ linux-mcount.git/Makefile	2008-01-29 17:26:17.000000000 -0500
@@ -509,6 +509,9 @@ endif
 
 include $(srctree)/arch/$(SRCARCH)/Makefile
 
+ifdef CONFIG_MCOUNT
+KBUILD_CFLAGS	+= -pg
+endif
 ifdef CONFIG_FRAME_POINTER
 KBUILD_CFLAGS	+= -fno-omit-frame-pointer -fno-optimize-sibling-calls
 else
Index: linux-mcount.git/arch/x86/Kconfig
===================================================================
--- ...
From: Peter Zijlstra
Date: Wednesday, January 30, 2008 - 1:46 am

That comment does not explain which race it closes; this is esp


--

From: Steven Rostedt
Date: Wednesday, January 30, 2008 - 6:08 am

OK, fair enough. I'll explain it a bit more.

How's this:

 /*
  * We are entering ops into the mcount_list but another
  * CPU might be walking that list. We need to make sure
  * the ops->next pointer is valid before another CPU sees
  * the ops pointer included into the mcount_list.
  */

--

From: Steven Rostedt
Date: Wednesday, January 30, 2008 - 7:09 am

Paul,

Peter and I are having a discussion on craziness of archs and memory
barriers. You seem to understand crazy archs pretty well, and we would
like some advice. :-)

See below:


The above is my new comment. But Peter says that it's still not good
enough and that all write memory barriers need read barriers. Let me
explain the situation here.

We have a single link list called mcount_list that is walked when more
than one function is registered by mcount. Mcount is called at the start
of all C functions that are not annotated with "notrace". When more than
one function is registered, mcount calls a loop function that does the
following:

notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
{
        struct mcount_ops *op = mcount_list;

        while (op != &mcount_list_end) {
                op->func(ip, parent_ip);
                op = op->next;
        };
}

A registered function must already have a "func" filled, and the mcount
register code takes care of "next".  It is documented that the calling
function should "never" change next and always expect that the func can be
called after it is unregistered. That's not the issue here.

The issue is how to insert the ops into the list. I've done the following,
as you can see in the code this text is inserted between.

   ops->next = mcount_list;
   smp_wmb();
   mcount_list = ops;

The read side pair is the reading of ops to ops->next, which should imply
a smp_rmb() just by the logic. But Peter tells me things like alpha is
crazy enough to do better than that! Thus, I'm asking you.

Can some arch have a reader where it receives ops->next before it received
ops? This seems to me to be a phsyic arch, to know where ops->next is
before it knows ops!

Remember, that the ops that is being registered, is not viewable by any
other CPU until mcount_list = ops. I don't see the need for a read barrier
in this case. But I could very well be wrong.

Help!

--

From: Peter Zijlstra
Date: Wednesday, January 30, 2008 - 7:25 am

When thinking RCU, this would be rcu_dereference and imply a read


--

From: Paul E. McKenney
Date: Friday, February 1, 2008 - 3:34 pm

This is true.  A write barrier ensures that the writes remain ordered,
but unless the reads are also ordered, the reader can still get confused.
For example (assuming all variables are initially zero):

writer:

	a = 1;
	smp_wmb();  /* or smp_mb() */
	b = 1;

reader:

	tb = b;
	ta = a;

The writer will (roughly speaking) execute the assignments in order,
but the reader might not.  If the reader executes the assignment from
"a" first, it might see tb==1&&ta==0.  To prevent this, we do:

reader:

	tb = b;
	smp_rmb();  /* or smp_mb() */
	ta = a;


Specifically:

notrace void mcount_list_func(unsigned long ip, unsigned long parent_ip)
{
        struct mcount_ops *op = rcu_dereference(mcount_list);

        while (op != &mcount_list_end) {
                op->func(ip, parent_ip);
                op = rcu_dereference(op->next);

This assumes that you are using call_rcu(), synchronize_rcu(), or

Peter is correct when he says that Alpha does not necessarily respect data
dependencies.  See the following URL for the official story:

	http://www.openvms.compaq.com/wizard/wiz_2637.html

And I give an example hardware cache design that can result in this
situation here:

	http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

See the discussion starting with the "Why Reorder Memory Accesses?"
heading in the second column of the first page.

Strange, but true.  It took an Alpha architect quite some time to

The trick is that the machine might have a split cache, with (say)
odd-numbered cache lines being processed by one half and even-numbered
lines processed by the other half.  If reading CPU has one half of the
cache extremely busy (e.g., processing invalidation requests from other
CPUs) and the other half idle, memory misordering can happen in the
receiving CPU -- if the pointer is processed by the idle half, and
the pointed-to struct by the busy half, you might see the unitialized
contents of the pointed-to structure.  The reading ...
From: Steven Rostedt
Date: Friday, February 1, 2008 - 6:56 pm

One special part of this is that the ops structure is never to be freed
(this is documented). It should be a static read-mostly structure.
Since it is not to be freed, I did not export the registered functions to
keep modules from using it. I may later add an export that will cause the
module to increment it's usage count so that it may never be freed.

There's no guarantees that prevent the func from being called after it was
unregistered, nor should the users of this, ever touch the "next" pointer.


hmm, I'm still not convinced ;-)

This is a unique situation. We don't need to worry about items being freed
because there's too many races to allow that. The items are only to
register functions and are not to be dynamically allocated or freed. In
this situation we do not need to worry about deletions.

The smp_wmb is only for initialization of something that is about to enter
the list. It is not to protect against freeing.

Specifically:

   ops->next = mcount_list;
   smp_wmb();
   mcount_list = ops;


What this is to prevent is a new item that has next = NULL being viewable
to other CPUS before next is initalized.

On another cpu we have (simplified by removing loop):

  op = mcount_list;
  op->func();
  op = op->next;
  if (op->next != NULL)
     op->func;

What we want to prevent is reading of the new ops before ops->next is set.

What you are saying is that on alpha, even though the write to ops->next
has completed before mcount_list is set, we can still get a reversed
order?

  ops->next = mcount_list;  -- in one cache line
  smp_wmb();
  mcount_list = ops;       -- in another cache line

Even though the ops->next is completed, we can have on another cpu:

   op = mcount_list; (which is the ops from above)
   op = op->next;  -- still see the old ops->next?

I just want to understand this. I already put in the read_barrier_depends
because it doesn't hurt on most archs anyway (nops).

Thanks,

-- Steve


--

From: Paul E. McKenney
Date: Saturday, February 2, 2008 - 2:41 pm

Yep, you have dependencies, so something like the following:

initial state:

	struct foo {
		int a;
	};
	struct foo x = { 0 };
	struct foo y = { 0 };
	struct foo *global_p = &y;
	/* other variables are appropriately declared auto variables */

	/* No kmalloc() or kfree(), hence no RCU grace periods. */
	/* In the terminology of http://lwn.net/Articles/262464/, we */
	/* are doing only publish-subscribe, nothing else. */

writer:

	x.a = 1;
	smp_wmb();  /* or smp_mb() */
	global_p = &x;

reader:

	p = global_p;
	ta = p->a;

Both Alpha and aggressive compiler optimizations can result in the reader
seeing the new value of the pointer (&x) but the old value of the field
(0).  Strange but true.  The fix is as follows:

reader:

	p = global_p;
	smp_read_barrier_depends();  /* or use rcu_dereference() */
	ta = p->a;

So how can this happen?  First note that if smp_read_barrier_depends()
was unnecessary in this case, it would be unnecessary in all cases.

Second, let's start with the compiler.  Suppose that a highly optimizing
compiler notices that in almost all cases, the reader finds p==global_p.
Suppose that this compiler also notices that one of the registers (say
r1) almost always contains this expected value of global_p, and that
cache pressure ensures that an actual load from global_p almost always
generates an expensive cache miss.  Such a compiler would be within its
rights (as defined by the C standard) to generate code assuming that r1
already had the right value, while also generating code to validate this
assumption, perhaps as follows:

	r2 = global_p;  /* high latency, other things complete meanwhile */
	ta == r1->a;
	if (r1 != r2)
		ta = r2->a;

Now consider the following sequence of events on a superscalar CPU:

	reader: r2 = global_p; /* issued, has not yet completed. */
	reader: ta = r1->a; /* which gives zero. */
	writer: x.a = 1;
	writer: smp_wmb();
	writer: global_p = &x;
	reader: r2 = global_p; /* this instruction now completes ...
From: Steven Rostedt
Date: Monday, February 4, 2008 - 10:09 am

Hi Paul,

First I want to say, "Thank you", for taking the time to explain this in
considerable detail. But I still have some minor questions.

 (Even though you already convinced me, but I still want full
  understanding ;-)



I think you missed one step here (causing my confusion). I don't want to
assume so I'll try to put in the missing step:



Thanks, this is starting to clear things up for me (And scare me away from

Since the code doesn't use RCU, I'll keep with the

Paul,

Thanks again for this lengthy email. It took me several readings to absorb
it all in.

I recommend that someone have a pointer to this email because it really
does explain why read_barrier_depends is needed.

Excellent job of explaining this!!! Much appreciated.

-- Steve

--

From: Paul E. McKenney
Date: Monday, February 4, 2008 - 2:40 pm

You lost me on this one...  The writer has only the following three steps:

writer:

	x.a = 1;
	smp_wmb();  /* or smp_mb() */
	global_p = &x;

Where did the "r1 = p" come from?  For that matter, where did "p" come

Ah!  Please note that I am doing something unusual here in that I am
working with global variables, as opposed to the normal RCU practice of
dynamically allocating memory.  So "x" is just a global struct, not a

Of course, fairness would require that it also scare you away from
value-speculation optimizations in compilers.  ;-)


Glad you liked it!!!

						Thanx, Paul
--

From: Steven Rostedt
Date: Monday, February 4, 2008 - 3:03 pm

But lets look at a simple version of my original code anyway ;-)

Writer:

void add_op(struct myops *x) {
	/* x->next may be garbage here */
	x->next = global_p;
	smp_wmb();
	global_p = x;
}

Reader:

void read_op(void)
{
	struct myops *p = global_p;

	while (p != NULL) {
		p->func();
		p = next;
		/* if p->next is garbage we crash */
	}
}


Here, we are missing the read_barrier_depends(). Lets look at the Alpha
cache issue:


reader reads the new version of global_p, and then reads the next
pointer. But since the next pointer is on a different cacheline than
global_p, it may have somehow had that in it's cache still. So it uses the
old next pointer which contains the garbage.

Is that correct?

But I will have to admit, that I can't see how an aggressive compiler
might have screwed this up. Being that x is a parameter, and the function
add_op is not in a header file.

-- Steve

--

From: Mathieu Desnoyers
Date: Monday, February 4, 2008 - 3:41 pm

Tell me if I am mistakened, but applying Paul's explanation to your
example would give (I unroll the loop for clarity) :

Writer:

void add_op(struct myops *x) {
	/* x->next may be garbage here */
	x->next = global_p;
	smp_wmb();
	global_p = x;
}

Reader:

void read_op(void)
{
	struct myops *p = global_p;

  if (p != NULL) {
		p->func();
		p = p->next;
  /*
   * Suppose the compiler expects that p->next is likely to be equal to
   * p + sizeof(struct myops), uses r1 to store previous p, r2 to store the
   * next p and r3 to store the expected value. Let's look at what the
   * compiler could do for the next loop iteration.
   */
  r2 = r1->next   (1)
  r3 = r1 + sizeof(struct myops)
  r4 = r3->func   (2)
  if (r3 == r2 && r3 != NULL)
    call r4

		/* if p->next is garbage we crash */
	} else
    return;

  if (p != NULL) {
		p->func();
		p = p->next;
		/* if p->next is garbage we crash */
	} else
    return;
  .....
}

In this example, we would be reading the expected "r3->func" (2) before
reading the real r1->next (1) value if reads are issued out of order.

Paul, am I correct ? And.. does the specific loop optimization I just
described actually exist ?

Thanks for your enlightenment :)


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Paul E. McKenney
Date: Monday, February 4, 2008 - 11:11 pm

This is indeed another form of value prediction.  Perhaps more common
in scientific applications, but one could imagine it occurring in the
kernel as well.

In some cases, the read from the real r1->next might be deferred until
after the computation so as to overlap the speculative computation with
the memory latency.  Border-line insane, perhaps, but some compiler

;-)

						Thanx, Paul
--

From: Paul E. McKenney
Date: Monday, February 4, 2008 - 10:13 pm

OK, I understand.  You are correct, it would make more sense at the machine
level for the writer to do something like:

writer:

	r1 = &x;
	r1->a = 1;
	smp_wmb();  /* or smp_mb() */


Indeed!  Changing the reader to be as follows should fix it:

Reader:

void read_op(void)
{
	struct myops *p = global_p;

	while (p != NULL) {
		smp_read_barrier_depends();
		p->func();
		p = next;
		/* if p->next is garbage we crash */
	}

Ah...

Suppose that we have a compiler that uses profile-based feedback.
It compiles the kernel with profiling code that tracks the values of
pointers, whether function arguments or global variables.  All it need
do is look for repeated values, and then track the fraction of time
that the value occurs.  If a given value occurs (say) 99.999% of the
time, it might make sense for the compiler to simply guess the value
ahead of time.  Even more to the point, if the compiler determines
that an existing register already has the correct value 99.999% of
the time, we can simply use that register, then check that the value
was correct.

This might appear as follows:

	Reader:

	void read_op(void)
	{
		struct myops *p = global_p;

		while (p != NULL) {
			p->func();
			/* do stuff with r1 assuming r1==p->next. */
			r2 = p->next;
			if (r2 != r1)
				/* compensate somehow for guessing wrong. */
		}
	}

Insane?  Probably so.  But there are compiler guys who swear by it.

						Thanx, Paul
--

From: Jan Kiszka
Date: Wednesday, January 30, 2008 - 6:21 am

(and the corresponding 64-bit version)

Is the impact of this change on the (already expensive) mcount_enabled
case negligible? I worried about use cases where we want to gain some
(relative) worst-case numbers via these instrumentations.

In my personal priority scheme, CONFIG_MCOUNT=y && !mcount_enabled comes
after mcount_enabled.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
--

From: Steven Rostedt
Date: Wednesday, January 30, 2008 - 6:53 am

The goal here was to limit the instruction cache hit that we take when

well, actually, I disagree. I only set mcount_enabled=1 when I'm about to
test something. You're right that we want the impact of the test least
affected, but when we have mcount_enabled=1 we usually also have a
function that's attached and in that case this change is negligible. But
on the normal case where mcount_enabled=0, this change may have a bigger
impact.

Remember CONFIG_MCOUNT=y && mcount_enabled=0 is (15% overhead)
         CONFIG_MCOUNT=y && mcount_enabled=1 dummy func (49% overhead)
         CONFIG_MCOUNT=y && mcount_enabled=1 trace func (500% overhead)

The trace func is the one that will be most likely used when analyzing. It
gives hackbench a 500% overhead, so I'm expecting this change to be
negligible in that case. But after I find what's wrong, I like to rebuild
the kernel without rebooting so I like to have mcount_enabled=0 have the
smallest impact ;-)

I'll put back the original code and run some new numbers.

-- Steve

--

From: Steven Rostedt
Date: Wednesday, January 30, 2008 - 7:28 am

I just ran with the original version of that test (on x86_64, the same box
as the previous tests were done, with the same kernel and config except
for this change)

Here's the numbers with the new design (the one that was used in this
patch):

mcount disabled:
 Avg: 4.8638 (15.934498% overhead)

mcount enabled:
 Avg: 6.2819 (49.736610% overhead)

function tracing:
 Avg: 25.2035 (500.755607% overhead)

Now changing the code to:

ENTRY(mcount)
        /* likely(mcount_enabled) */
        cmpl $0, mcount_enabled
        jz out

        /* taken from glibc */
        subq $0x38, %rsp
        movq %rax, (%rsp)
        movq %rcx, 8(%rsp)
        movq %rdx, 16(%rsp)
        movq %rsi, 24(%rsp)
        movq %rdi, 32(%rsp)
        movq %r8, 40(%rsp)
        movq %r9, 48(%rsp)

        movq 0x38(%rsp), %rsi
        movq 8(%rbp), %rdi

        call   *mcount_trace_function

        movq 48(%rsp), %r9
        movq 40(%rsp), %r8
        movq 32(%rsp), %rdi
        movq 24(%rsp), %rsi
        movq 16(%rsp), %rdx
        movq 8(%rsp), %rcx
        movq (%rsp), %rax
        addq $0x38, %rsp

out:
        retq


mcount disabled:
 Avg: 4.908 (16.988058% overhead)

mcount enabled:
 Avg: 6.244. (48.840369% overhead)

function tracing:
 Avg: 25.1963 (500.583987% overhead)


The change seems to cause a 1% overhead difference. With mcount disabled,
the newer code has a 1% performance benefit. With mcount enabled as well
as with tracing on, the old code has the 1% benefit.

But 1% has a bigger impact on something that is 15% than it does on
something that is 48% or 500%, so I'm keeping the newer version.

-- Steve

--

Previous thread: [PATCH 15/22 -v7] trace generic call to schedule switch by Steven Rostedt on Tuesday, January 29, 2008 - 8:15 pm. (1 message)

Next thread: [PATCH 19/22 -v7] trace preempt off critical timings by Steven Rostedt on Tuesday, January 29, 2008 - 8:15 pm. (3 messages)