On latest git, I'm seeing "start_kernel(): bug: interrupts were enabled
early" messages on ARM (sample log below).
This appears to be caused by:
start_kernel -> radix_tree_init -> kmem_cache_create (slub) ->
down_write -> __down_write (lib/rwsem-spinlock.c) -> spin_unlock_irq
radix_tree_init was moved earlier by:
commit 773e3eb7b81e5ba13b5155dfb3bb75b8ce37f8f9
Author: Yinghai Lu <yinghai@kernel.org>
Date: Wed Feb 10 01:20:33 2010 -0800
init: Move radix_tree_init() early
Prepare for using radix trees in early_irq_init().
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1265793639-15071-30-git-send-email-yinghai@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Rabin
---
Uncompressing Linux... done, booting the kernel.
Linux version 2.6.34-rc2-00184-g01e7770 (rabin@debian) (gcc version 4.4.1 (Sourcery G++ Lite 2009q3-67) ) #5 Fri Mar 26 00:56:33 IST 2010
CPU: ARMv7 Processor [410fc080] revision 0 (ARMv7), cr=10c03c7f
CPU: VIPT nonaliasing data cache, VIPT nonaliasing instruction cache
Machine: ARM-RealView PB-A8
Ignoring unrecognised tag 0x00000000
bootconsole [earlycon0] enabled
Memory policy: ECC disabled, Data cache writeback
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 32512
Kernel command line: earlyprintk mem=128M console=ttyAMA0 root=/dev/ram0 init=/linuxrc
PID hash table entries: 512 (order: -1, 2048 bytes)
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
Memory: 128MB = 128MB total
Memory: 123916k/123916k available, 7156k reserved, 0K highmem
Virtual kernel memory layout:
vector : 0xffff0000 - 0xffff1000 ( 4 kB)
fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB)
DMA : 0xffc00000 - 0xffe00000 ( 2 MB)
vmalloc : 0xc8800000 - 0xf8000000 ( 760 MB)
lowmem : 0xc0000000 - 0xc8000000 ( 128 MB)
modules : 0xbf000000 - 0xc0000000 ( 16 MB)
.init : ...On Fri, 26 Mar 2010 01:11:00 +0530
That's going to be hard to fix.
Once upon a time, enabling interrupts too early in boot would kill
powerpc boxes stone dead. From the lack of noise I assume that this is
not happening in current kernels for some reason.
We have two checks in start_kernel():
if (!irqs_disabled()) {
printk(KERN_WARNING "start_kernel(): bug: interrupts were "
"enabled *very* early, fixing it\n");
local_irq_disable();
}
rcu_init();
radix_tree_init();
/* init some links before init_ISA_irqs() */
early_irq_init();
init_IRQ();
prio_tree_init();
init_timers();
hrtimers_init();
softirq_init();
timekeeping_init();
time_init();
profile_init();
if (!irqs_disabled())
printk(KERN_CRIT "start_kernel(): bug: interrupts were "
"enabled early\n");
perhaps the second one isn't needed? Perhaps no architecture requires
that local interrupts be disabled across the above initialisations?
I'll ask Rafael and Maciej to track this as a post-2.6.33 regression.
--
spin_unlock_irq from arm is different from other archs? YH --
Not all arches use lib/rwsem-spinlock.c. In particular, x86 doesn't when X86_XADD is set. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
What I note is that lib/rwsem-spinlock.c seems to be rather inconsistent in its use of spin_lock_irqsave/spin_lock_irqrestore versus spin_lock_irq/spin_unlock_irq... in fact, __down_read is the *only* place where we use the latter as opposed to the former. Is that a bug? If so, it would certainly explain this behavior. -hpa --
It's based on down_read() and down_write() not being callable from interrupt context, or with interrupts disabled (since they can sleep). up_read(), up_write(), down_read_trylock(), down_write_trylock(), downgrade_write() can all be called from interrupt context since they cannot sleep. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Do not run the checks while we are in a single threaded context? I thought we had some dynamic code patching thingamy that could change those when we go to smp mode? --
You have to remember that on embedded architectures, such as ARM, where XIP is supported we can't change the text segment at run time - which means dynamic code patching won't work. However, the kernel should still work in such situations. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: --
The question still remains what the incremental cost is of doing irqsave/irqrestore. -hpa --
Compared to irq disable/enable, it wouldn't be much higher; saving can be done by a direct register to register move, so that should be relatively cheap. The restore may be a little bit more depending on the CPU arch version, but not significant. So there shouldn't be a problem from ARM POV to switch to using irqsave/irqrestore. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: --
The only other option is to have local_irq_enable() check a global (system_state ?) before enabling. Almost as gross ... Cheers, Ben. --
On Thu, 01 Apr 2010 09:37:51 +1100 Add an irq-disable-depth counter to the task_struct, fix all the bugs which that exposes.. But these things are all utterly gross. The bottom line is that radix_tree_init() is manifestly unsuited to being called with local interrupts disabled. 773e3eb7b81e5ba13b5155dfb3bb75b8ce37f8f9 was just a wrong patch. --
Except that powerpc (and now it seems x86) both want to use radix trees for interrupt handling... At least on powerpc, we trick and use a linear search until the radix trees are initialized, which we do later during boot, but that somewhat sucks. I believe sherry picking things like not calling radix_tree_init() is going to fix one case today, until we have another one, and another one, and etc... I suspect we're better off fixing the root of the problem in down/up. Cheers, Ben. --
Not by adding overhead to every single down_read()/down_write() just to fix a once-off startup problem - that's taking laziness way too far. We'd be better off hacking a kmem_cache_create() special case to avoid taking the rwsem. Add SLAB_I_SUCK to `flags' perhaps. --
How much overhead is this on non-x86 architectures (keep in mind x86 doesn't use this?) -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
Just a few instructions, I guess. But we can do it with zero. And from a design POV, pretending that down_read()/down_write() can be called with interrupts disabled is daft - they cannot! Why muck up the usual code paths with this startup-specific hack? --
Because we the problem of when interrupts are enabled for the first time is a nasty one, and having entire layer of things not usable at the right level of init because somewhere something might do an irq enable due to calling generic code that down's a semaphore is a PITA. Seriously, Andrew, I don't see a clean solution... Something -somewhere- will have to be ugly. Allocation is a pretty basic service that a lot of stuff expect especially when booting. We went through that discussion before when we moved the SLAB init earlier during boot, because it makes no sense to have tons of code to have to figure out what allocator to call depending on what phase of the moon it's called from (especially when said code can also be called later during boot, say for hotplug reasons). So we moved sl*b init earlier, thus we ought to be able to also kmem_cache_alloc() earlier. We -fixed- that problem already afaik. Cheers, Ben. --
I would like to point out that initialization is a particular subcase of a more general rule: - It is safe to call a semaphore/rwlock down with IRQ disabled *if and only if* the caller can guarantee non-contention. Initialization is an obvious subcase, but there might be others. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
Sure. We can also check if we are early in boot in kmem_cache_create and just not take the semaphores (which are useless if we are single threaded). --
Can we provide a kmem_cache_create_early()? One that takes no locks and gets cleaned up with the other __init stuff? David --
Yuck. I hate having to expose more APIs. Also the problem with that is means callers have to know. So we need to propagate up all call chains etc... (ie, radix_tree_init_early(), etc...) This is pretty much exactly the discussion we had when moving sl*b early, and back then, the final word from Linus (heh, for once he agreed with me :-) was that this made no sense. We can bury logic inside kmem_cache_create() though, it's not -that- a hot path. Cheers, Ben. --
None on powerpc, we use atomic ops and don't disable IRQs. BTW. The same problem used to happen with mutex debug. Was this ever fixed ? Cheers, Ben. --
The problem isn't about checks. Those -will- enable IRQs before it's safe to do so, the checks are perfectly right to point it out, it -is- a bug waiting to happen on some random HW. Cheers, Ben. --
Taking sems in single threaded mode is pretty pointless. Those functions could just return success until the scheduler is actually able to do something with threads. --
We use the standard generic kernel implementation. Is x86 different? ;) -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: --
Yes, these are optimized on x86. -hpa --
On Wed, 31 Mar 2010 13:47:23 -0700 No, spin_unlock_irq() unconditionally enables interrupts on all architectures. --
So I found checkin 60ba96e546da45d9e22bb04b84971a25684e4d46 in the bk-historic git tree: [PATCH] rwsem: Make rwsems use interrupt disabling spinlocks The attached patch makes read/write semaphores use interrupt disabling spinlocks in the slow path, thus rendering the up functions and trylock functions available for use in interrupt context. This matches the regular semaphore behaviour. I've assumed that the normal down functions must be called with interrupts enabled (since they might schedule), and used the irq-disabling spinlock variants that don't save the flags. Signed-Off-By: David Howells <dhowells@redhat.com> Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org> What we have here is a case of this assumption being violated, because the lock is taken with interrupts disabled on a path where contention cannot happen (because the code is single-threaded at this point), but the lock is taken due to reuse of generic code. The obvious way to fix this would be to use spin_lock_irqsave..spin_lock_irqrestore in __down_read as well as in the other locations; I don't have a good feel for what the cost of doing so would be, though. On x86 it's fairly expensive simply because the only way to save the state is to push it on the stack, which the compiler doesn't deal well with, but this code isn't used on x86. -hpa --
On Wed, 31 Mar 2010 14:12:54 -0700 Well, it's all a bit nasty. kmem_cache_create() does a lot of stuff, including calling into the page allocator with GFP_KERNEL - expecting kmem_cache_create() to preserve local_irq_disable() is a bit optimistic. radix_tree_init() calls hotcpu_notifier() which also does mutex_lock(&cpu_add_remove_lock); The easiest fix is to reposition the interrutps-are-now-enabled point in start_kernel(). But I have a feeling that some versions of early_irq_init() won't like that. --
Well, the sl*b allocator -has- been modified to avoid enabling IRQs early, at least I remember we did that when we moved it to be Yeah that won't work. Interrupts must not be enabled before at least init_IRQ() and time_init(). The problem is that until all these guys have gone through their initializations, there may be pending spurrious crap coming from the HW (timers, external IRQs, profile IRQs) due to such HW not yet properly "sanitized" by the kernel. Plenty of archs have those assumptions wired in. I don't think moving the IRQ enable point earlier is the right approach. Cheers, Ben. --
I think that's what we should just do, with a good comment both in the
code and the changelog. I'm not entirely happy with it, because obviously
it's conceptually kind of dubious to take a lock with interrupts disabled
in the first place, but this is not a new issue per se.
The whole bootup code is special, and we already make similar guarantees
about memory allocators and friends - just because it's too dang painful
to have some special code that does GFP_ATOMIC for early bootup when the
same code is often shared and used at run-time too.
So we've accepted that people can do GFP_KERNEL allocations and we won't
care about them if we're in the boot phase (and suspend/resume), and we
have that whole 'gfp_allowed_mask' thing for that.
I think this probably falls under exactly the same heading of "not pretty,
but let's not blow up".
So making the slow-path do the spin_[un]lock_irq{save,restore}() versions
sounds like the right thing. It won't be a performance issue: it _is_ the
slow-path, and we're already doing the expensive part (the spinlock itself
and the irq thing).
So ACK on the idea. Who wants to write the trivial patch and test it?
Preferably somebody who sees the problem in the first place - x86 should
not be impacted, since the irq-disabling slow-path should never be hit
without contention anyway (and contention cannot/mustnot happen for this
case).
Linus
--
Ahh, yes. In this case, that doesn't likely change anything. The save/restore versions of the irq-safe locks shouldn't be appreciably more expensive than the non-saving ones. And architectures that really care should have done their own per-arch optimized version anyway. Maybe we should even document that - so that nobody else makes the mistake x86-64 did of thinking that the "generic spinlock" version of the rwsem's is anything but a hacky and bad fallback case. Linus --
That depends on the CPU. Some CPUs have quite expensive interrupt disablement instructions. FRV does for instance; but fortunately, on the FRV, I can use some of the excessive quantities of conditional registers to pretend that I In some cases, it's actually the best way. On a UP machine, for instance, where they reduce to nothing or where your only atomic instruction is an XCHG equivalent. David --
I think you're missing the part where we're not _adding_ any irq disables: we're just changing the unconditional irq disable to a save-and-disable (and the unconditional irq enable to a restore). So even if irq's are expensive to disable, the change from spin_lock_irq() to spin_lock_irqsave() Again, you seem to think that we used to have just a plain spin_lock. Not so. We currently have a spin_lock_irq(), and it is NOT a no-op even on UP. It does that irq disable. Anyway, I suspect that even with just an atomic xchg, you can do a better job at doing down_read() than using the generic spin-lock version (likely by busy-looping on a special "we're busy" value). But if you do want to use the generic spin-lock version, I doubt any architecture makes that irqsave version noticeable slower than the unconditional irq version. Linus --
OK, I'll bite since I was seeing boot-time hangs on ARM (TI OMAP3) due
to this. Patch below.
Kevin
From 7baff4008353bbfd2a2e2a4da22b87bc4efa4194 Mon Sep 17 00:00:00 2001
From: Kevin Hilman <khilman@deeprootsystems.com>
Date: Wed, 7 Apr 2010 11:52:46 -0700
Subject: [PATCH] rwsem generic spinlock: use IRQ save/restore spinlocks
rwsems can be used with IRQs disabled, particularily in early boot
before IRQs are enabled. Currently the spin_unlock_irq() usage in the
slow-patch will unconditionally enable interrupts and cause problems
since interrupts are not yet initialized or enabled.
This patch uses save/restore versions of IRQ spinlocks in the slowpath
to ensure interrupts are not unintentionally disabled.
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
---
lib/rwsem-spinlock.c | 14 ++++++++------
1 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index ccf95bf..ffc9fc7 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -143,13 +143,14 @@ void __sched __down_read(struct rw_semaphore *sem)
{
struct rwsem_waiter waiter;
struct task_struct *tsk;
+ unsigned long flags;
- spin_lock_irq(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);
if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
/* granted */
sem->activity++;
- spin_unlock_irq(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
goto out;
}
@@ -164,7 +165,7 @@ void __sched __down_read(struct rw_semaphore *sem)
list_add_tail(&waiter.list, &sem->wait_list);
/* we don't need to touch the semaphore struct anymore */
- spin_unlock_irq(&sem->wait_lock);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
/* wait to be given the lock */
for (;;) {
@@ -209,13 +210,14 @@ void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
{
struct rwsem_waiter waiter;
struct task_struct *tsk;
+ unsigned long flags;
...This looks reasonable and fine for me. Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> -- Live like a child, think like the god. --
No, it's not, it will enable IRQs and thats illegal to do so early during boot. We've been over that one again and again, the problem is that people want to keep using that instead of irqsave/restore because it's a nano-optimisation on x86... oh well... Cheers, Ben. --
Well, guess what... the particular user in this case *isn't used at all on x86* so it is a non-issue here. So I take it we have your particular vote to use irqsave/irqrestore in lib/rwsem-spinlock.c? -hpa --
The optimised fast paths used on x86 rwsems don't disable interrupts. David --
Any reason not to use the same technique for all the archs - plus the trick used in arch/armkernel/entry-armv.S:__kuser_cmpxchg for those archs which don't have atomic instructions or ll/sc? If the problem here is _only_ semaphores, and the above might make semaphores faster anyway, perhaps it's a solution. -- Jamie --
Assuming you're talking about the __LINUX_ARM_ARCH__ < 6 + CONFIG_MMU You trade off a bit of overhead in the semaphore path for a bit of overhead in the interrupt path. We probably take more sempahores than we do interrupts, so it's probably worthwhile. Still, cmpxchg() needs to be SMP-safe. Realistically, this isn't something that can be done in generic code. It has to be done in arch-specific code. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
It depends what atomic instructions you have available. David --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List |
