Re: Random scheduler/unaligned accesses crashes with perf lock events on sparc 64

Previous thread: [GIT PULL] perf fixes by Ingo Molnar on Sunday, April 4, 2010 - 3:15 am. (1 message)

Next thread: [PATCHv2] virtio-net: move sg off stack by Michael S. Tsirkin on Sunday, April 4, 2010 - 6:07 am. (2 messages)
From: Frederic Weisbecker
Date: Sunday, April 4, 2010 - 5:18 am

Hi,

On tip:master, while turning on lock events with perf through

	perf lock record

I get random kernel crashes, sometimes about weird unaligned
accesses, sometimes about scheduler that complains.

I hope someone has an idea about this.

In three different attempts (got to force reboot each time), I got:

First (task_tgid_nr_ns() is called from perf_event_pid()):

[  565.464201] Kernel unaligned access at TPC[486b74] task_tgid_nr_ns+0x8/0x54
[  565.475801] sun4v_data_access_exception: ADDR[000060f8b13a2004] CTX[0000] TYPE[0009], going.
[  565.488610]               \|/ ____ \|/
1>[  565.492705] Unable to handle kernel NULL pointer dereference
[  565.492719] Unable to handle kernel NULL pointer dereference
[  565.492733] Unable to handle kernel NULL pointer dereference
[  565.492747] Unable to handle kernel NULL pointer dereference
[  565.492761] Unable to handle kernel NULL pointer dereference
[  565.492776] Unable to handle kernel NULL pointer dereference
[  565.492790] Unable to handle kernel NULL pointer dereference
[  565.492804] Unable to handle kernel NULL pointer dereference
[  565.492818] Unable to handle kernel NULL pointer dereference
[  565.492832] Unable to handle kernel NULL pointer dereference
[  565.492847] Unable to handle kernel NULL pointer dereference

Second:

[  250.508047] Kernel unaligned access at TPC[4d1a0c] perf_swevent_ctx_event+0x16c/0x1b0

(this one happened in asm/processor_64.h: prefetch(), probably while
walking to the context's event list)

Third:

[   60.147895] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[  130.637924] sun4v_data_access_exception: ADDR[ffffa010933a6060] CTX[0000] TYPE[0009], going.
[  130.639329] kernel BUG at kernel/sched.c:1166!
[  130.639364]               \|/ ____ \|/
[  130.639370]               "@'/ .. \`@"
[  130.639377]               /_| \__/ |_\
[  130.639382]                  \__U_/
[  130.639394] swapper(0): Kernel bad sw trap 5 [#1]
[  130.639430] TSTATE: 0000000080e01605 ...
From: Frederic Weisbecker
Date: Sunday, April 4, 2010 - 5:21 am

Forgot the config, it's in attachment.

Thanks.
From: David Miller
Date: Sunday, April 4, 2010 - 6:00 pm

From: Frederic Weisbecker <fweisbec@gmail.com>

I get them too when even trying to use the function tracer in some
simple way.  They occur in random places and eventually the entire
machine wedges and is unusuable.

I suspect there is some bug that ends up corrupting memory, so I
started trying to debug this last night when I first saw it.

I'll let you know if I make any progress.
--

From: Frederic Weisbecker
Date: Sunday, April 4, 2010 - 11:57 pm

Yeah. But it's not totally random. I often see the same scheduler
crash.



Ok. I indeed have the same problem with the function tracer when


Thanks!

--

From: David Miller
Date: Monday, April 5, 2010 - 12:22 pm

From: Frederic Weisbecker <fweisbec@gmail.com>

You shouldn't be getting the unaligned fixup in the first place,
especially in the locations where you see them.  I suspect that
once you see first fixup, all of the registers in the cpu have
been corrupted in one way or another.

I suspect something fundamental gets corrupted, for example the
current register window (%cwp) is corrupted and that screws up all of
the registers so every single function starts accessing garbage.

My suspicions lie in three places, the ftrace mcount()
stubs, stack_trace_flush(), or the new perf_arch_save_caller_regs()
since those are the three places offhand that could make
us potentially make us return to function in the wrong
register window.
--

From: Frederic Weisbecker
Date: Monday, April 5, 2010 - 12:40 pm

It happens without CONFIG_FUNCTION_TRACER as well (but it happens
when the function tracer runs). And I hadn't your perf_arch_save_caller_regs()
when I triggered this.

--

From: David Miller
Date: Monday, April 5, 2010 - 1:46 pm

From: Frederic Weisbecker <fweisbec@gmail.com>

Good clues, thanks.
--

From: David Miller
Date: Monday, April 5, 2010 - 7:15 pm

From: Frederic Weisbecker <fweisbec@gmail.com>

I think there's still something wrong with the ring buffer stuff on
architectures like sparc64.

Stephen, I'm looking at the 8-byte alignment fix that was made a few
weeks ago, commit:

commit 2271048d1b3b0aabf83d25b29c20646dcabedc05
Author: Steven Rostedt <srostedt@redhat.com>
Date:   Thu Mar 18 17:54:19 2010 -0400

    ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align

and I'm not so sure it's completely correct.

Originally, the ring buffer entries determine where the entry data
resides (either &event->array[0] or &event->array[1]) based upon the
length.

Beforehand, in all cases:

1) If length could be encoded into event->type_len (ie. <=
   RB_MAX_SMALL_DATA) then event->type_len holds the length
   and the event data starts at &event->array[0]

2) Otherwise (length > RB_MAX_SMALL_DATA) the length is
   encoded into event->array[0] and the event data starts at
   &event->array[1]

But now, there is a new semantic when CONFIG_64BIT is true and
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is false (which isn't the right
test btw, f.e. sparc 32-bit needs this handling just like sparc 64-bit
does since it uses full 64-bit loads and stores to access u64 objects
and thus will crash without proper alignment, the correct test should
be just CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS being false).

This new semantic is:

1) Entries always encode the length in ->array[0] and ->type_len
   is set to zero.

And then there are special cases like events of type
RINGBUF_TYPE_PADDING which, even though ->type_len is non-zero, encode
a length field in ->array[0] which is used by the ring buffer
iterators such as rb_event_length(), but this only applies only if
event->time_delta is non-zero.  (Phew!)

The commit adjusts the code in rb_calculate_event_length() to force 8
byte chunks when RB_FORCE_8BYTE_ALIGNMENT is set.  It also adjusted
the rb_update_event() logic so that it unconditionally ...
From: Steven Rostedt
Date: Tuesday, April 6, 2010 - 6:41 am

David,

It's best to send to my rostedt@goodmis.org account, just like it is
best to send to your davem@davemloft.net ;-)



OK, so the a 64 bit word still needs 64 bit alignment when storing to a
data pointer.

I wonder if we should just have a special copy in this case for the
events and remove this patch in the ring buffer. That is:

	__assign_word(__entry->word, value);

And have in !CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS be:

	#define __assgin_word(dest, src)	\
		memcpy(&(dest), &(src), sizeof(src));


The RINGBUF_TYPE_PADDING is used to either fill the rest of the sub
buffer, where alignment does not matter, or to replace a deleted event

What about removing the logic from the ring buffer and moving it to the
TRACE_EVENT() macros as I suggested above?

We would probably need a way to read the buffers too.

I also know that Mathieu has some strange tricks to force alignment but
I'm still not convinced it would make things any less fragile than what
is already there.

-- Steve


--

From: David Miller
Subject:
Date: Tuesday, April 6, 2010 - 10:46 am

From: Steven Rostedt <srostedt@redhat.com>


Doesn't work, first of all, because if gcc can see the types
it will inline emit aligned loads and stores.

Secondly, the accessors who read in these entries need them to be

No, that makes no sense, just align the data types properly
instead of making every single access site pay the price.

Just fix the CPP test to handle 32-bit platforms with slow
mis-alignment handling, everything else seems to be working properly
after I fixed the NMI issues on sparc64.
--

From: Steven Rostedt
Date: Tuesday, April 6, 2010 - 11:15 am

My SoB is rostedt@goodmis.org, my commits are srostedt@redhat.com just
because I want to give credit to the one that pays me for making those


A little more work will need to be done since on 32bit, the page headers
are 12 bytes, not 16, so we start off without being aligned 8bytes. But
that too is fixable.

-- Steve


--

From: David Miller
Date: Tuesday, April 6, 2010 - 2:17 pm

From: Steven Rostedt <rostedt@goodmis.org>

Oh yeah the buffer_data_page, good point.
--

From: David Miller
Date: Tuesday, April 6, 2010 - 2:50 am

From: Frederic Weisbecker <fweisbec@gmail.com>

I figured out the problem, it's NMIs.  As soon as I disable all of the
NMI watchdog code, the problem goes away.

This is because some parts of the NMI interrupt handling path are not
marked with "notrace" and the various tracer code paths use
local_irq_disable() (either directly or indirectly) which doesn't work
with sparc64's NMI scheme.  These essentially turn NMIs back on in the
NMI handler before the NMI condition has been cleared, and thus we can
re-enter with another NMI interrupt.

We went through this for perf events, and we just made sure that
local_irq_{enable,disable}() never occurs in any of the code paths in
perf events that can be reached via the NMI interrupt handler.  (the
only one we had was sched_clock() and that was easily fixed)

So, the first mcount hit we get is for rcu_nmi_enter() via
nmi_enter().

I can see two ways to handle this:

1) Pepper 'notrace' markers onto rcu_nmi_enter(), rcu_nmi_exit()
   and whatever else I can see getting hit in the NMI interrupt
   handler code paths.

2) Add a hack to __raw_local_irq_save() that keeps it from writing
   anything to the interrupt level register if we have NMI's disabled.
   (this puts the cost on the entire kernel instead of just the NMI
   paths).

#1 seems to be the intent on other platforms, the majority of the NMI
code paths are protected with 'notrace' on x86, I bet nobody noticed
that nmi_enter() when CONFIG_NO_HZ && !CONFIG_TINY_RCU ends up calling
a function that does tracing.

The next one we'll hit is atomic_notifier_call_chain()  (amusingly
notify_die() is marked 'notrace' but the one thing it calls isn't)

For example, the following are the generic notrace annotations I
would need to get sparc64 ftrace functioning again. (Frederic I will
send you the full patch with the sparc specific bits under seperate
cover in so that you can test things...)

--------------------
kernel: Add notrace annotations to common routines invoked via ...
From: Frederic Weisbecker
Date: Tuesday, April 6, 2010 - 3:19 am

Ok, but this as a cause looks weird.
The function tracer handler disables interrupts. I don't remember exactly
why but we also have a no-preempt mode that only disables preemption instead:
(function_trace_call_preempt_only())

It means having such interrupt reentrancy is not a problem. In fact, the
function tracer is not reentrant:

	data = tr->data[cpu];
	disabled = atomic_inc_return(&data->disabled);

	if (likely(disabled == 1))
		trace_function(tr, ip, parent_ip, flags, pc);

	atomic_dec(&data->disabled);

we do this just to prevent from tracing recursion (in case we have
a traceable function in the inner function tracing path).

Nmis are just supposed to be fine with the function tracer.

--

From: David Miller
Date: Tuesday, April 6, 2010 - 3:28 am

From: Frederic Weisbecker <fweisbec@gmail.com>

It's not reentrancy.

It's the fact that local_irq_disable() (read it again, it's the
"disable" that re-renables NMIs on sparc64) turns NMIs back on even in
code where we are still trying to figure out how to service the NMI
still.

It's because we implement NMIs on sparc64 by having the performance
counter interrupt come in on the level 15 interrupt, and we run the
entire kernel at level 14 when IRQs are "disabled".
--

From: Peter Zijlstra
Date: Tuesday, April 6, 2010 - 4:12 am

One thing we can do is make the code WARN about this, how about
something like the below

---
Subject: lockdep: WARN about local_irq_{en,dis}able in NMI context

Some architectures implement NMIs by using IRQ priority levels and
mixing local_irq_{en,dis}able() with NMIs will give unexpected results.

Hence disallow this in general and WARN about it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/lockdep.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 08e6f76..06ec1c7 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2341,6 +2341,11 @@ EXPORT_SYMBOL(trace_hardirqs_on_caller);
 
 void trace_hardirqs_on(void)
 {
+	/*
+	 * Some architectures can't deal with local_irq_{enable,disable}
+	 * from NMI context (SPARC), enforce this.
+	 */
+	WARN_ON_ONCE(in_nmi());
 	trace_hardirqs_on_caller(CALLER_ADDR0);
 }
 EXPORT_SYMBOL(trace_hardirqs_on);
@@ -2375,6 +2380,11 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
 
 void trace_hardirqs_off(void)
 {
+	/*
+	 * Some architectures can't deal with local_irq_{enable,disable}
+	 * from NMI context (SPARC), enforce this.
+	 */
+	WARN_ON_ONCE(in_nmi());
 	trace_hardirqs_off_caller(CALLER_ADDR0);
 }
 EXPORT_SYMBOL(trace_hardirqs_off);


--

From: David Miller
Date: Tuesday, April 6, 2010 - 4:13 am

From: Peter Zijlstra <peterz@infradead.org>

It's going to warn every bootup in cpu_clock() on x86.
--

From: Peter Zijlstra
Date: Tuesday, April 6, 2010 - 4:20 am

*sigh*, yes, we could hack around that I suppose.. it would be nice to
automate this check though, I bet you don't fancy tracking down more
such splats than you have to.

You could of course insert the debug code into your arch routines but
that would limit the coverage checks to whatever you happen to run.

--

From: David Miller
Date: Tuesday, April 6, 2010 - 4:22 am

From: Peter Zijlstra <peterz@infradead.org>

Yes, f.e. you could add local_irq_*_nmi() or similar that don't
complain when called inside of an NMI.

I would certainly welcome this debugging facility, for sure!

It would have saved me two days of work this time, in fact.


--

From: Frederic Weisbecker
Date: Tuesday, April 6, 2010 - 4:38 am

That reminds me we have a new pair of local_irq_disable/enable
in perf_event_task_output(), which path can be taken by hardware
pmu events.

See this patch:

8bb39f9aa068262732fe44b965d7a6eb5a5a7d67
perf: Fix 'perf sched record' deadlock

--

From: Peter Zijlstra
Date: Tuesday, April 6, 2010 - 4:51 am

ARGH.. yes

Also, I guess that should live in perf_output_lock/unlock() not in
perf_event_task_output().

Egads, how to fix that

--

From: Mike Galbraith
Date: Tuesday, April 6, 2010 - 5:54 am

Damn, so deadlock fix isn't a fix.  No idea.

	-Mike

--

From: Peter Zijlstra
Date: Tuesday, April 6, 2010 - 5:57 am

well it is,. but it breaks sparc..

I'm currently compile testing a bunch to fix all that up by doing what
davem suggested: local_irq_save_nmi()/local_irq_restore_nmi().



--

From: Paul E. McKenney
Date: Tuesday, April 6, 2010 - 11:04 am

Assuming that static inline functions don't need the notrace flag:

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

(If static inline functions -do- need notrace, then the definitions
--

Previous thread: [GIT PULL] perf fixes by Ingo Molnar on Sunday, April 4, 2010 - 3:15 am. (1 message)

Next thread: [PATCHv2] virtio-net: move sg off stack by Michael S. Tsirkin on Sunday, April 4, 2010 - 6:07 am. (2 messages)