Prevent nested interrupts when the IRQ stack is near overflowing v2
Interrupts can always nest when they don't run with IRQF_DISABLED.
When a lot of interrupts hit the same vector on the same
CPU nested interrupts can overflow the irq stack and cause hangs.
This has been observed with MSI-X & Ethernet on a large system.
This patch automatically forces IRQF_DISABLED when
the interrupt stack runs low. I implemented it using
a "callback" (really just a weak call) from the generic IRQ code
to the architecture code because passing this state down the
normal call chain would have required changing too much code.
The irq checks are currently implemented for x86-(32,64) only,
but other architectures could (and probably should) do the same.
Currently the thresholds are 2K each. This is a fairly
arbitary number. On 4K stack i386 it's about half
the irq stack, on the other configurations it's 1/4-1/16.
This also fixes another minor bug on 32bit: don't dump a backtrace when the
irq stack runs low.
Based on discussions with Suresh B. Siddha and others.
Originally reported by Jesse Brandeburg.
v2: Use more common code on 32bit. Don't dump stack
on low irq stack.
Tested-by: emil.s.tantilov@intel.com
Cc: jesse.brandeburg@intel.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Suresh B. Siddha <suresh.b.siddha@intel.com>
---
arch/x86/kernel/irq_32.c | 43 ++++++++++++++++++++++++++++++++++---------
arch/x86/kernel/irq_64.c | 18 ++++++++++++++++--
include/linux/irq.h | 2 ++
kernel/irq/handle.c | 16 +++++++++++++++-
4 files changed, 67 insertions(+), 12 deletions(-)
Index: linux-2.6.34-rc1-ak/include/linux/irq.h
===================================================================
--- linux-2.6.34-rc1-ak.orig/include/linux/irq.h 2010-03-14 03:58:12.000000000 +0100
+++ linux-2.6.34-rc1-ak/include/linux/irq.h 2010-03-24 19:11:23.000000000 +0100
@@ -520,4 +520,6 @@
}
#endif /* CONFIG_SMP */
+extern int ...In which way are which interrupts nested ? Thanks, tglx --
That's utter nonsense. An interrupt storm on the same vector does not cause irq nesting. The irq code prevents reentering a handler and in case of MSI-X it just disables the IRQ when it comes again while the first irq on that vector is still in progress. So the maximum nesting is two up to handle_edge_irq() where it disables the IRQ and returns right away. Thanks, tglx --
Sorry it's the same CPU, not the same vector. Yes the reference to same vector was misleading. " Multiple vectors on a multi port NIC pointing to the same CPU, all hitting the irq stack until it overflows. Real maximum nesting is all IRQs running with interrupts on pointing to the same CPU. Enough from multiple busy IRQ sources and you go boom. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
"misleading" is an euphemism at best ... So there are several questions: 1) Why are those multiple vectors all hitting the same cpu at the same time ? How many of them are firing at the same time ? 2) What kind of scenario is that ? Massive traffic on the card or some corner case ? 3) Why does the NIC driver code not set IRQF_DISABLED in the first place? AFAICT the network drivers just kick off NAPI, so whats the Which leads to the general question why we have that IRQF_DISABLED shite at all. AFAICT the historical reason were IDE drivers, but we grew other abusers like USB, SCSI and other crap which runs hard irq handlers for hundreds of micro seconds in the worst case. All those offenders need to be fixed (e.g. by converting to threaded irq handlers) so we can run _ALL_ hard irq context handlers with interrupts disabled. lockdep will sort out the nasty ones which enable irqs in the middle of that hard irq handler. Your band aid patch is just disgusting. How do you ensure that none of the handlers on which you enforce IRQ_DISABLED does not enable interrupts itself ? You _CANNOT_. I'm not taking that patch unless you come up with a real convincing story. Thanks, tglx --
I think the idea was to minimize irq latency for other interrupts But yes defaulting to IRQF_DISABLED would fix it too, at some My understanding was that traditionally the irq handlers were allowed to nest and the "fast" non nest case was only added for some Ok glad to give you advertisement time for your pet project... Anyways if such a thing was done it would be a long term project I can't, just as much as I cannot enforce they won't crash or not loop forever or something. But afaik they don't. I did some quick grepping and didn't found a driver who does that at least. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
So what's the point ? Is the irq handler of that card so long running,
that it causes trouble ? If yes, then this needs to be fixed. If no,
What's the cost? Nothing at all. There is no f*cking difference between:
IRQ1 10us
IRQ2 10us
IRQ3 10us
IRQ4 10us
and
IRQ1 2us
IRQ2 2us
IRQ3 2us
IRQ4 10us
IRQ3 8us
IRQ2 8us
IRQ1 8us
The system is neither running a task nor a softirq for 40us in both
cases.
So what's the point of running a well written (short) interrupt
handler with interrupts enabled ? Nothing at all. It just makes us
You just don't get it. Long running interrupt handlers are a
BUG. Period. If they are short they can run with IRQs disabled w/o any
Your patch is not a fix, It's a lousy, horrible and unreliable
workaround. It's not fixing the root cause of the problem at hand.
The real fix is to run the NIC interrupt handlers with IRQs disabled
and be done with it. If you still think that introduces latencies then
prove it with numbers.
Thanks,
tglx
--
I believe it's more like that you have a lot of them (even with interrupt mitigation and NAPI polling) and they have to scale up the work to handle high bandwidths, so it ends up with a lot of time in them anyways, ending with skew in the timers and similar issues. Anyways my goal here was simply to have a least intrusive fix, not Yes it could work out this. Or it could not. I'm not sure, so I chose Ok so you're just proposing to always set IRQF_DISABLED? If you can force everyone to use that that would work I guess. I don't know what problems it would cause. My fear is that any latency problems Sorry you got that wrong. I'm not proposing to change semantics, you are proposing that. So you would need to prove anything if at all. Anyways if you think you can write a better patch to fix that bug please fell free to write one. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Timekeeping is an absolute non issue here. The time keeping code can
Oh it's safe because it solves the problem on that particular machine
and workload?
No, it's not safe at all.
- it does not prevent a driver reenabling interrupts
- it will cause real large latencies when the interrupt which
happens to trigger that check is one of the long running
ones. Are you going to deal with the "oh we see >500us" latencies
bugreports ?
- worst case it'll brick your machine if the interrupt which
happens to trigger that check relies on irq enabled semantics
(e.g. missing jiffies update). Yes, such a driver is broken by
design, but we have such crap lurking. So much for not changing
For any sane written driver, yes. We have crap irq handlers which need
It papers over the problem. We already know that the NIC driver floods
the machine with interrupts, so why are you insisting that we need to
bandaid that problem ?
The minimal intrusive way is a one liner in that very driver code and
if it causes problems for that very driver then we don't fix them with
adding a callback in the generic interrupt code path.
The message which we would send out with applying that band aid would
be simply: Go ahead driver writers and let your handlers run as long
as they want, we'll safe you in 99.9% of the cases and we'll happily
go and debug the 0.1% of completely undebuggable shit which will
result out of that.
No thanks,
tglx
--
Well in this case it's simply because it has 4 ports and they are all active and have a lot of MSI-X vectors for each stream. Even if you had the perfect interrupt handler that ran in one cycle, if you had enough of them in parallel from different ports Well it's simply the current state of affairs today. I'm merely attempting to make the current state slightly safer without breaking I'm not sure I fully understand your suggestion. Is your suggestion to only set IRQF_DISABLED that one driver and ignore the other ones? (let's call that the "ostrich approach") Or is your suggestion to set IRQF_DISABLED by default? Or is it something else? Thanks, -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Well, I'd agree if those stack overflows would be a massive reported problem. Right now they happen with a weird test case which points out a trouble spot. Multi vector NICs under heavy load. So why not go there and change the handful of drivers to run their handlers with irqs disabled? Band aids are the last resort if we can't deal with a problem by other sane means. And this problem falls not into that category, it can be solved in the affected drivers with zero effort. Thanks, tglx --
At least the people who reported it to me thought it was a massive Ok, but afaik it's not that small a number: MSI-X support is getting more and more wide spread. It's pretty universal in the 10+GbitE space for space and starts getting deployed for block too. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
How about - because it can happen anyway ? It's merely a lot less likely and if it is occuring at random in very obscure situations the chances are it doesn't produce a nice neat report that it occurred either. Your assertion that this is the only case it is seen appears unbacked by Thomas - you can't complain about Andi's patches being inadequate and then suggest an ever more incomplete 'solution'. Absolute minimum we need to WARN() in that situation so that kerneloops.org can gather statistics on what occurs and try to find out. The real fix is not zero effort, as you yourself said there are quite a few subsystems re-enabling interrupts in their IRQ paths and all would need fixing. Alan --
So historically _all_ interrupts allow nesting, except for what used to be called "fast" irq handlers. Notably the serial line, which especially on chips without buffering needs really low latencies. So that's where the irq handler code comes from: a "fast" interrupt will not ever enable the CPU interrupts at all, and can run without ever doing the whole "mask and ACK" code. But no, we absolutely MUST NOT make regular interrupt handlers be that kind of fast handlers (ie IRQF_DISABLED in modern parlance), because that would make the whole point moot. So Thomas, you're wrong. We can't fix all irq handlers to be really quick, and we MUST NOT run them with all other irq's disabled if they are not - because that obviates the whole point. [ There's another historical issue, which is that at least the SCSI layer used to depend on timer interrupts still coming in, because there were literally cases where the cdoe would wait for a timer tick while _inside_ the SCSI irq handler. That may or may not be true any more - I certainly hope it isn't. But But there _is_ a big f*cking difference when there is another irq involved that wants to happen immediately. For a network card that's not (much) of an issue, since any sane network card will buffer things anyway. For an original 8250 with a single-byte buffer running at 19200 bps on a slow machine, it really is a _huge_ deal. On that slow machine, the other interrupt handlers won't be 10us anyway. A single PIO op is 1us! There are other cases where this matters, but that 8250 is the historical one. NOTE! Historically, the "fast" handlers also had a much faster irq response because they didn't do that whole MASK/ACK/END thing. So they'd just keep the CPU interrupts disabled, and ACK at the end, and I think we've even used AUTOEIO so that they didn't need any ACK at all, and we never touched the interrupt controller itself for them. Again, that's a big deal when you're trying to push a serial line ...
Btw, it was even more extreme than that. The fast irq handlers got a totally separate kernel entry point, and wouldn't save all registers, only the compiler-clobbered ones. Which is why they then had no "struct pt_regs" etc. And yes, it really mattered. Then later we got so bloated that it wasn't much of an issue - and just made everything more complicated. Linus --
* Linus Torvalds <torvalds@linux-foundation.org> wrote: I think the patch as posted solves a real problem, but also perpetuates a bad situation. At minimum we should print a (one-time) warning that some badness occured. That would push us either in the direction of improving drivers, or towards improving the generic code. Thanks, Ingo --
Furthermore, applying that patch as-is would not just cause us to do nothing about it in the future, it would also add a rather fragile looking piece of logic. I.e. it's a sweep-under-the-rug thing pretty much IMO. So i think Thomas is quite right wrt. ugliness of the patch but missed the other important fact that this can occur in a lot of places with high enough IRQ parallelism and cannot be fixed one by one. Thanks, Ingo --
What should a driver do to prevent that? I don't see what it could do short of castrating itself (like refusing to use multiple ports) As Linus says the driver doesn't know if setting IRQF_DISABLED is safe. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
As an aside this is happening on MSI irqs. They can never be shared. So in fact the driver can know it is safe. Should we perhaps make all MSI irqs automatically set IRQF_DISABLED? Eric --
For the MSI ones it definitely works as they can not be shared. I'm just not sure whether we can just enforce that for MSI. Thanks, tglx --
FWIW lockdep forces IRQF_DISABLED and yells when anybody does local_irq_enable() while in hardirq context, I haven't seen any such splats in a long while, except from the ARM people who did funny things with building their own threaded interrupts or somesuch. So I'm sure there's some devices, like IDE PIO and the curious 8390, and maybe some other drivers that aren't as common, but I don't think the situation is quite as bad as some people paint it. --
The thing is, that won't show the drivers that just simply _expect_ other interrupts to happen. The SCSI situation, iirc, used to be that certain error conditions simply caused a delay loop (reset? I forget) that depended on 'jiffies'. So there was no 'local_irq_enable()' involved, nor did it even happen under any normal load - only in error situations. Now, I think (and sincerely) that the SCSI situation is likely long since fixed, but we have thousands and thousands of drivers, and these kinds of things are very hard to notice automatically. Are there any cases around that still have busy-loop delays based on real-time in their irq handlers? I simply don't know. Linus --
I recently found a few in drivers/net/ there's all kinds of funny stuff in there.. not sure how common the matching hardware is though. One thing we could do is instrument jiffies to yell when its used from hardirq context and fix up these things. But yeah, there's funky hardware around, but I think we should provide more incentives to keep modern hardware drivers saner. --
Also, Arjan mentioned he wanted to sweep the kernel for jiffies users and convert them to timer interfaces.. --
% linux-2.6> gid jiffies | wc -l 7569 Good luck. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Dynticks found most of the loop-for-jiffies-in-handlers cruft: an IRQ handler hitting the idle loop and then spinning for jiffies, with dynticks enabled, would spin forver because there's no timer irq. That is because in dynticks mode we dont reactivate the tick when we go into an irq handler (that has hit the idle task). So any looping on jiffies will loop forever. We only reactivate the tick when we go out of idle. We had a few bugs of that kind in the early days of dynticks - they were not particularly numerous and all were fixed. Thanks, Ingo --
Yes, but that kind of crap should not go unnoticed if it triggers (which may
be rare): in form of a hard lockup. We had that lockdep behavior of disabling
irqs in all handlers forever, ever since it went upstream four years ago.
So we had 3 separate mechanisms in the past 3-4 years:
- lockdep [which disables irqs in all handlers, indiscriminately]
- dynticks [which found in-irq jiffies loops]
- -rt [which found hardirqs-off loops]
That should have weeded out most of that kind of IRQ handler abuse.
In terms of test coverage, beyond upstream kernel testers who enable lockdep,
Fedora rawhide has also been running kernels with lockdep enabled for the past
3-4 years, so there's a fair amount of 'weird hardware' coverage as well. It
certainly wont cover 100% everything, but it should cover everything that
people bothered to test + report.
I'm fairly certain, based on having played with those aspects from many angles
that disabling irqs in all drivers should work just fine today already.
So the patch below should at most trigger bugs in areas that need fixing
anyway, and i'm quite sure that under no circumstance would it cause
unforeseen problems in 'thousands of drivers'.
So i think this would be a clear step forward. (It's also probably the best
thing for performance to batch up IRQs instead of nesting them.)
We could do this carefully, first in the IRQ tree then in linux-next, them
upstream, with plenty of time for people to test. If it causes _any_
widespread problems that make you uncomfortable it would be very easy to
revert. What do you think?
Thanks,
Ingo
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 76d5a67..27e5c69 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -370,9 +370,6 @@ irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
irqreturn_t ret, retval = IRQ_NONE;
unsigned int status = 0;
- if (!(action->flags & IRQF_DISABLED))
- local_irq_enable_in_hardirq();
-
do {
...If we do this, then we should just remove all the IRQF_DISABLED code in kernel/irq/manage.c too, and basically make IRQF_DISABLED a clear no-op (still leave it around as a #define, to not break any users). It won't make it any harder to revert if it causes problems, and that way there will be no crazy dead code (and comments) left around. And I just checked: even the 8250 serial driver doesn't use IRQF_DISABLED any more, so doing that shouldn't cause any latency issues (sure, the serial driver may interrupt another irq, but another irq can also interrupt the serial driver as things stand now, so the original latency issue with fast irq handlers doesn't actually work these days _anyway_). Linus --
From: Linus Torvalds <torvalds@linux-foundation.org> FWIW, I'm currently using IRQF_DISABLED for virtual network device interrupts on sparc64 as a workaround for some stack overflow issues. This change will just force me to work harder to find out a better way to fix the problem, so don't let my issue hold this back, it's just an FYI... --
You missed the point of the patch. It makes _all_ interrupts act like IRQF_DISABLED. So it actually fixes your issue, by making IRQF_DISABLE a no-op - not because it doesn't exist any more, but because the non-IRQF_DISABLED case would not exist any more. Linus --
From: Linus Torvalds <torvalds@linux-foundation.org> Yep, Peter Z. explained this to me under seperate cover. Great! --
I think IRQF_DISABLED is the sanest method to run IRQs: it's the most atomic, thus most cache-efficient, it doesnt nest, and it's also a tiny bit faster to The patch i sent basically hardcodes IRQF_DISABLED - so it should fix your problem automatically. IRQF_DISABLED will become an unconditional thing, and we can then also remove it. Ingo --
Btw, if we really do decide that everybody is IRQF_DISABLED, that really should make the whole mask-and-ack issue for the irq controllers go away. We'd still need it, but only for the very special IDE controller case and others who _explicitly_ re-enable interrupts. Those would really have to cause the irq to be masked so that we don't get issues with recursive irqs of the same kind. Linus --
This case was already broken back when the x86-64 port used to run with NMI watchdog on by default, because any IO exception handling with interrupts off would trigger the NMI watchdog eventually. I remember having to add bandaids to Fusion for this a long time ago. Of course ISA drivers might still do it. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Anyone you've forgotten to offend ? We have IRQF_DISABLED and stuff using that model because for something like ten years the Linux kernel had no real sane notion of handing stuff off to worker threads or threaded irq support and also because there were so many errata around IRQ masking as well as all the evil business with interrupt delivery being asynchronous to the PCI or ISA transactions on some CPUs (eg you could write the irq mask register on the device, read it back to ensure it occurred and *still* get an IRQ delivered after that point because it was on the APIC bus) Dumping everyones code under an insult without any historical context isn't helpful. Pretty much the only 'core' driver today which enables IRQs in the irq handlers and needs it is the old IDE layer. There are also a couple of drivers which play games with disable/enable_irq in the IRQ paths for other reasons (lack of irq threads when written and a hardware model thats totally SMP unfriendly). 8390 is the obvious one here and it at least would be far far saner using threaded IRQs and normal locking with IRQs unmasked. Alan --
Hmm, not sure. Have not measured IRQ handler run times for quite a Right, but that's not the problem here. We talk about a (hopefully) well written interrupt handler which runs for a very short time. What's the point of running it with interrupts enabled ? Nothing, we just run into stack overflow problems. So what's better: an unreliable and ugly hackaround or just avoiding the possible stack overflow in the first place ? Thanks, tglx --
> time. What's the point of running it with interrupts enabled ? The underlying question you posed was why don't we kill off the running with irqs enabled cases. That requires more work but should definitely I think you are trying to have a debate when we are in agreement anyway. I don't btw think the workaround is 'unreliable', ugly as sin yes but the logic appears sound. Alan --
The NIC handlers can do quite some work under high traffic. I don't think that's a accurate description of the patch at all. Besides I believe it's reliable in all cases that matter. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Thanks. I'm tempted to just ignore it in this case, but in theory it might still have troubles if there are a lot of interrupts to the same CPU. I've only had a report on a very large system with a very high interrupt rate on a very fast NIC though, so presumably it's not too common. Anyways this patch will fix the problem for all drivers that do not explicitely enable interrupts, which is the overwhelming majority. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build with CONFIG_SSB_PCIHOST=n |
| Linux Kernel Mailing List | 9p: block-based virtio client |
