Hi all,
Yesterday's (and today's) linux-next boot (PowerPC) failed like this:
------------[ cut here ]------------
Badness at kernel/lockdep.c:2301
NIP: c0000000000a35c8 LR: c0000000000084c4 CTR: 0000000000000000
REGS: c000000000bf77e0 TRAP: 0700 Not tainted (2.6.34-rc4-autokern1)
MSR: 8000000000021032 <ME,CE,IR,DR> CR: 24000044 XER: 00000004
TASK = c000000000aa3d30[0] 'swapper' THREAD: c000000000bf4000 CPU: 0
GPR00: 0000000000000001 c000000000bf7a60 c000000000bf32f0 c0000000000084c4
GPR04: 0000000000000000 0000000000000a00 0000000000000000 0000000000000068
GPR08: 0000000000000008 c000000000c4fabe 0000000000000000 7265677368657265
GPR12: 8000000000009032 c000000007691000 0000000001c00000 c000000000770bf8
GPR16: c00000000076f390 0000000000000000 0000000000430000 00000000024876f0
GPR20: c000000000887480 0000000002487480 c0000000008876f0 0000000001b5f8d0
GPR24: c000000000770478 0000000003300000 c000000000c1f1c8 c000000000884610
GPR28: c000000000c1b290 c0000000000084c4 c000000000b45068 c000000000aa3d30
NIP [c0000000000a35c8] .trace_hardirqs_on_caller+0xb0/0x224
LR [c0000000000084c4] system_call_common+0xc4/0x114
Call Trace:
[c000000000bf7a60] [c000000000bf7ba0] init_thread_union+0x3ba0/0x4000 (unreliable)
[c000000000bf7af0] [c0000000000084c4] system_call_common+0xc4/0x114
--- Exception: c01 at .kernel_thread+0x28/0x70
LR = .rest_init+0x34/0xf8
[c000000000bf7de0] [c00000000086916c] .proc_sys_init+0x20/0x64 (unreliable)
[c000000000bf7e50] [c0000000000099c0] .rest_init+0x20/0xf8
[c000000000bf7ee0] [c000000000848af0] .start_kernel+0x484/0x4a8
[c000000000bf7f90] [c0000000000083c0] .start_here_common+0x1c/0x5c
Instruction dump:
409e0188 0fe00000 48000180 801f08d8 2f800000 41be0050 880d01da 2fa00000
41be0028 e93e8538 88090000 68000001 <0b000000> 2fa00000 41be0010 e93e8538
------------[ cut here ]------------
Caused by commit bd6d29c25bb1a24a4c160ec5de43e0004e01f72b ("lockstat:
Make lockstat counting per cpu"). This added a WARN_ON_ONCE to
debug_atomic_inc() ...Ok, we'll fix the warning. Btw., WARN_ON trapping on PowerPC is clearly a PowerPC bug - there's a good reason we have WARN_ON versus BUG_ON - it should be fixed. Ingo --
From: Ingo Molnar <mingo@elte.hu> I disagree, an implementation should be allowed to use the most efficient implementation possible for both interfaces. I would be using traps for both on sparc64 if that were really feasible on sparc64 (and actually with gcc-4.5's "asm goto" it might actually be now) The WARN and BUG macros, when implemented without traps, have serious implications for overall code size and register pressure. --
It trades robustness for slightly better space/code efficiency. Such a trap based mechanism exists on x86 as well and we use it for BUG_ON(). We intentionally dont use it to generate warnings and dont override __WARN(), because it would blow up way too often when a warning triggers in some sensitive codepath that cannot take a trap. Anyway, the warning obviously has to be fixed - but the boot crash itself is PowerPC's own doing. Ingo --
Well, yes and no, as I explained in a separate branch of that thread. We indeed can't cope with a WARN in that spot because it goes recursive. Now the reason we have this double-enable is due afaik to the way I implemented IRQ trace, because things like syscalls basically force-enable IRQs on powerpc and I don't necessarily have tracking informations in the exception return path of what the "old" value was. I need to double check what the exact scenario here is and whether I can fix it but it's one of those cases where what lockdep is warning about isn't actually an error I believe. Cheers, Ben. --
Right, and I don't think the reason why we have WARN_ON vs. BUG_ON ever had anything to do with whether it's implemented with a trap or not :-) It's purely related to whether it's supposed to be fatal or not. Now, there is indeed the potential problem you mention of WARN_ON being called in places where such a trap is unsafe, but so far, this is the first time I can remember we hit that problem so we've been getting away with it for quite a while :-) Now, whether the trap is or is not more efficient than an explicit test is something that is still being debated on powerpc. It has the advantage of not un-leafing functions (and thus not creating stack frames, adding register reloads, etc... when not needed), basically putting the burden of saving/restoring registers to the (hopefully) rare path of actually taking the WARN/BUG. We could probably manufacture something similar with careful use of inline asm and an out of line asm trampoline. The benefit of the trap instruction vs. conditional branches per-se is probably nil. It's really more about the codegen impact, register clobber due to the added function call, etc.. at least for us. Cheers, --
Hi Ingo, Thanks. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/
When a path restore the flags while irqs are already enabled, we
update the per cpu var redundant_hardirqs_on in a racy fashion
and debug_atomic_inc() warns about this situation.
In this particular case, we need to explictly disable the irqs before
updating this stat var in order to update it safely.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
kernel/lockdep.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 78325f8..65d4336 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2298,7 +2298,11 @@ void trace_hardirqs_on_caller(unsigned long ip)
return;
if (unlikely(curr->hardirqs_enabled)) {
+ unsigned long flags;
+
+ raw_local_irq_save(flags);
debug_atomic_inc(redundant_hardirqs_on);
+ raw_local_irq_restore(flags);
return;
}
/* we'll do an OFF -> ON transition: */
--
1.6.2.3
--
Hi Frederic, Ping? Any movement on this? -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/
Ingo, if you have no problems with these two patches, could you please apply them? Thanks. --
Hi Frederic, I applied this patch to linux-next today and it fixes my PowerPC boot problems, so I will keep it in linux-next until it gets applied to the tip tree or something better comes along. Thanks. -- Cheers, Stephen Rothwell sfr@canb.auug.org.au http://www.canb.auug.org.au/~sfr/
In this case, I guess the following fix should be sufficient?
I'm going to test it and provide a sane changelog.
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 78325f8..65d4336 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -2298,7 +2298,11 @@ void trace_hardirqs_on_caller(unsigned long ip)
return;
if (unlikely(curr->hardirqs_enabled)) {
+ unsigned long flags;
+
+ raw_local_irq_save(flags);
debug_atomic_inc(redundant_hardirqs_on);
+ raw_local_irq_restore(flags);
return;
}
/* we'll do an OFF -> ON transition: */
--
that looks rather ugly. Why not do a raw: this_cpu_inc(lockdep_stats.redundant_hardirqs_on); which basically open-codes debug_atomic_inc(), but without the warning? Btw., using the this_cpu() methods might result in faster code for all the debug_atomic_inc() macros as well? Ingo --
Because that would open a race against interrupts that might touch lockdep_stats.redundant_hardirqs_on too. If you think it's not very important (this race must be pretty rare I guess), Indeed, will change that too. Thanks. --
How so, its a pure per-cpu variable right? so either the increment happens before the interrupts hits, or after, in either case there should not be a race with interrupts. --
In x86 yeah, I guess the compiler simply loads the address and does an inc directly, which is atomic wrt interrupts. But what about another arch that would need an intermediate load of the value: load val, reg add reg, 1 <---interrupt here store reg, val --
There is also no guarantee we are in a non-preemptable section. We can then also race against another cpu. I'm not sure what to do. --
it's a statistics counter so worst-case we lose a count. It's not a real issue - but might be worth adding a comment. Ingo --
I think this is because our syscall entry pretty much force-enable irqs. I remember deciding back then that getting lockdep balanced in that area was tricky and I didn't do it to avoid adding more overhead to the syscall path but I suppose I could revisit if necessary. Cheers, Ben. --
