synchronize_irq needs at the very least a compiler barrier and a
read barrier on SMP, but there are enough cases around where a
write barrier is also needed and it's not a hot path so I prefer
using a full smp_mb() here.It will degrade to a compiler barrier on !SMP.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---Index: linux-work/kernel/irq/manage.c
===================================================================
--- linux-work.orig/kernel/irq/manage.c 2007-10-18 11:22:16.000000000 +1000
+++ linux-work/kernel/irq/manage.c 2007-10-18 11:22:20.000000000 +1000
@@ -33,6 +33,7 @@ void synchronize_irq(unsigned int irq)
if (irq >= NR_IRQS)
return;+ smp_mb();
while (desc->status & IRQ_INPROGRESS)
cpu_relax();
}-
Hi,
I have read this thread and I concluded few things:
1) It is impossible to know that the card won't send more interrupts:
Even if I do a read from the device, the IRQ can be pending in the bus/APIC
It is even possible (and likely) that the IRQ line will be shared, thus the
handler can be called by non-relevant device.2) the synchronize_irq(); in .suspend is useless:
an IRQ can happen immediately after this synchronize_irq();
and interrupt even the .suspend()
(At least theoretically)Thus I now understand that .suspend() should do:
saa_writel(SAA7134_IRQ1, 0);
saa_writel(SAA7134_IRQ2, 0);
saa_writel(SAA7134_MAIN_CTRL, 0);dev->insuspend = 1;
smp_wmb();/* at that point the _request to disable card's IRQs was issued, we don't know
that there will be no irqs anymore.
the smp_mb(); guaranties that the IRQ handler will bail out in that case. *//* .......*/
pci_save_state(pci_dev);
pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
return 0;and the interrupt handler:
smp_rmb();
if (dev->insuspend)
goto out;Am I right?
Best regards,
Maxim Levitsky
-
It's not totally useless not. It guarantees that by the time your
are out of your suspend(), a simultaneous instance of the IRQ handler
is either finished, or it (or any subsequent occurence) have seenThe above doesn't handle the case where the IRQ handle just passed the
if (insuspend) test. You may end up calling pci_set_power_state() while
in the middle of some further HW accesses in the handler.Not quite :-)
Also not that the work we are doing with synchronize_irq, if it gets
merged, renders it unnecessary for you to add those two memory barriers,
synchronize_irq will pretty much do the ordering for you.Cheers,
Ben.-
Fine, I Agree, and this is why I used it in first place, I just forgot.
To wait for already running IRQ handler, that tested the dev->insuspend.
(And I assumed that interrupts are disabled on the cpu that runs .suspend, but I know now
that this isn't true.)Besides that can I ask few more .suspend/resume questions:
1) some drivers use pci_disable_device(), and pci_enable_device().
should I use it too?2) I accidentally did this:
pci_set_power_state(pci_dev, pci_choose_state(pci_dev, state));
pci_save_state(pci_dev);I somehow thought that this is correct, that I should save the pci config state
after the power-down, but now I know that it isn't correct.
I now need to send a patch for dmfe.c network driver that has the same commands written by me.
(but it works perfectly anyway)
Is it possible to access pci configuration space in D3?And lastly speaking of network drivers, one issue came to my mind:
most network drivars has a packet queue and in case of dmfe it is located in main memory,
and card does dma from it.in .suspend I ignore that some packets may be in that queue, and I want
to ask, whenever there are better ways to do that.this is my dmfe .suspend routine.
/* Disable upper layer interface */
netif_device_detach(dev);/* Disable Tx/Rx */
db->cr6_data &= ~(CR6_RXSC | CR6_TXSC);
update_cr6(db->cr6_data, dev->base_addr);/* Disable Interrupt */
outl(0, dev->base_addr + DCR7);
outl(inl (dev->base_addr + DCR5), dev->base_addr + DCR5);/* Fre RX buffers */
dmfe_free_rxbuffer(db);/* Enable WOL */
pci_read_config_dword(pci_dev, 0x40, &tmp);
tmp &= ~(DMFE_WOL_LINKCHANGE|DMFE_WOL_MAGICPACKET);if (db->wol_mode & WAKE_PHY)
tmp |= DMFE_WOL_LINKCHANGE;
if (db->wol_mode & WAKE_MAGIC)
tmp |= DMFE_WOL_MAGICPACKET;pci_write_config_dword(pci_dev, 0x40, tmp);
pci_enable_wake(pci_dev, PCI_D3hot, 1);
pci_enable_wake(pci_dev, PCI_D3cold, 1);/* Power down device*/
pci_se...
I generally don't do the former, and I would expect the late to be done
by pci_restore_state() for you. pci_disable_device(), last I looked,
only cleared the bus master bit though, which might be a good idea to
do.People in ACPI/x86 land, are there more good reasons to do one or the
other ?That reminds me that I volunteered to write a documentation on how
drivers should do all that stuff at KS and didn't get to actually doingRight, you need to do the save_state before the power down. You need to
avoid pretty much any access to the device after the save state otherIt's only really safe to access the PM register itself, though I suppose
you should be able to walk the capability chain to do that. But INote that the network stack nowadays does a fair bit of cleaning up for
The above -might- not be needed any more, I have to check. I used to do
Looks allright on a quick glance appart from the bits we already
I think the network stack does that nowadays but we'll have to double
check, that's based on what DaveM told me at KS.Ben.
-
Hi,
Thanks a lot.
I fix the order of calls in dmfe.c
and in saa7134-core.c.I probably need to add this synchronize_irq() logic in dmfe.c too, but I probably do it later,
I think I am overestimating this race, since most drivers don't do dev->insuspend checks in IRQ handler.
Maybe even just use free_irq() after all....Best regards,
Maxim Levitsky
-
Most drivers are probably underestimating the race :-)
free_irq() would work provided that you did the masking on chip before
(and unmask only after request_irq on the way back in). But it's a bit
like using a 10 tons truck to crush an ant...Ben.
-
Agreed.
So, I will add synchronize_irq() to both saa7134, and dmfe, the two drivers that their .suspend/.resume
routines were written by me.I already added a synchronize_irq() plus few more fixes to the driver , but those patches are still in v4l tree.
I now has this:
saa_writel(SAA7134_IRQ1, 0);
saa_writel(SAA7134_IRQ2, 0);
saa_writel(SAA7134_MAIN_CTRL, 0);synchronize_irq(pci_dev->irq);
dev->insuspend = 1;and I will probably need (with the synchronize_irq patch applied)
/* Disable interrupts, DMA, and rest of the chip*/
saa_writel(SAA7134_IRQ1, 0);
saa_writel(SAA7134_IRQ2, 0);
saa_writel(SAA7134_MAIN_CTRL, 0);dev->insuspend = 1;
synchronize_irq(pci_dev->irq);/* ACK pending interrupts just in case*/
saa_writel(SAA7134_IRQ_REPORT,saa_readl(SAA7134_IRQ_REPORT));......
Best regards,
Maxim Levitsky
-
Hopefully :-)
Cheers,
Ben.-
If we patch synchronize_irq() as discussed here then you can
use it in place of smp_wmb() and remove the smp_rmb from the
interrupt handler (the latter is the path that matters).Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
Something like that can work, yes. However, you need to make sure that:
- even when you ignore the interrupt (because the driver doesn't care,
it's suspending), you need to make sure the hardware gets shut up by
reading (or writing) the proper interrupt status register.Otherwise, with a level interrupt, the interrupt will continue to be
held active ("screaming") and the CPU will get interrupted over and
over again, until the irq subsystem will just turn the irq off
entirely.- when you resume, make sure that you get the engine going again, with
the understanding that some interrupts may have gotten ignored.Also, in general, these kinds of things shouldn't always even be
neicessary. If you use the suspend_late()/resume_early() hooks, those will
be called with interrupts off, and while they can be harder to debug (and
may be worth avoiding for non-critical drivers), they do allow for simpler
models partly exactly because they don't need to worry about interrupts
etc.Linus
-
His suspend routine wrote to the IRQ mask (or equivalent) register in
his code example, thus the HW should shut up eventually, thus that isn't
strictly necessary, the IRQ in that case is just a "short
interrupt" (noticed by the PIC and delivered but possibly not asserted
anymore at this stage or about to go down).We have many scenario generating such short interrupts (pretty much any
time we use interrupt disable/enable on the chip, for example it's
common with NAPI).This is one of the reasons why we used to have a "timebomb" causing an
IRQ to erroneously get disabled by the IRQ core assuming the IRQ was
stuck, just because we accumulated too many short interrupts on that
line. A recent patch by Alan fixed that to use some more sensible
algorithm checking the time since the last spurrious interrupt, though I
suspect he's being too optimistic on his timeout value and some
scenario, such as NAPI with just the wrong packet rate, can still
trigger the bug. I need to find a piece of HW slow enough inAnd you forgot that he still needs synchronize_irq(), or his IRQ handler
might well be in the middle of doing something on another CPU, past the
point where it tested "insuspend", which will do no good when a
simultaneous pci_set_power_state() puts the chip into D3. On some
machines, this will machine check. Either that or he needs a spinlock inYes, this is a easier way to deal with devices that are on a directly
mapped bus (path to them isn't gone by the time you hit suspend_late)
and don't need to do fancy things involing waiting for a queue to drain
using interrupts etc... (at one stage of HW complexity, having to
re-implement polled versions of everything is just not worth the
benefit). In general, suspend_late should cover the need of most PCI
devices I suppose.Ben.
-
In fact, he -must not- ack it. Because is the HW is really down (in D3),
got knows what accessing the ACK register will do. I can give you
ideas... from nothing on most x86 desktops to machine checks on most
powerpc machines, though I could imagine some cards bad enough to lock
your bus up if you try to access a register while they are in D3 (I've
seen some of those critters and it seems not all bridges timeout on
cards that just keep sending retries).Cheers,
Ben.-
I agree, but while device is powered off, its registers can't be accessed
Thus, if I ack the IRQ every time the handler is called, I will access the
Here it isn't a problem, this is a video capture card, and I suspend it by just stopping dma
on all active buffers even if in the middle of capture, and I send those buffers to card againBest regards,
Maxim Levitsky-
It will actually crash your machine on some platforms. So no, best is to
-not- ack. The masking is enough, the IRQ will go down eventually.Ben.
-
So, what exactly does it protect against? At a minimum, this needs a
comment in the changelog, and probably preferably in the source code too.The thing is, synchronize_irq() can only protect against interrupts that
are *already*running* on another CPU, and the caller must have made sure
that no new interrupts are coming in (or at least that whatever new
interrupts that come in will not pick up a certain piece of data).So I can imagine that the smb_mb() is there so that the caller - who has
cleared some list or done something like that - will have any preceding
writes that it did be serialized with actually checking the old state of
"desc->status".Fair enough - I can see that a smp_mb() is useful. But I think it needs
documenting as such, and preferably with an example of how this actually
happened in the first place (do you have one?)The synchronize_irq() users that are really fundamental (ie the irq
datastructures themselves) all should use the irq descriptor spinlock, and
should *not* be needing any of this, since they would have serialized with
whoever actually set the IRQ_INPROGRESS bit in the first place.So in *that* sense, I think the memory barrier is useless, and I can't
make up my mind whether it's good or bad. Which is why I'd really like to
have an example scenario spelled out where it makes a difference..Ok?
Linus
-
One could argue that it's the caller that should do the mb() but that
would really be asking for trouble. Since most usage scenario require
it, and it's not a hot path, I prefer having it here.Note that some kind of read barrier or compiler barrier should be needed
regardless, or we are just not sync'ing with anything at all (we may
have loaded the value ages ago and thus operate on a totally stale
value). I prefer a full barrier to also ensure all previous stores areThat isn't always the case. You can have very efficient lock-less fast
path and use synchronize_irq as a kind of RCU-like way to have the slowWell, typically, I can clear a pointer and want to make sure my IRQ
handler isn't using it anymore before freeing the memory. Or set a "HW
is off" flag. Having spinlocks all over isn't always the best approach
on things that are really hot path, it's fair enough to use lockless
approaches like that by moving the burden to the slow path that does
synchronize_irq.In general, I tend to think that for this function to make any sense
(that is, to synchronize anything at all), it needs a barrier or you are
just making a decision based on a totally random value of desc->status
since it can have been re-ordered, speculatively loaded, pre-fetched,
whatever'ed... :-).Want a respin with a comment ?
Ben.
-
We already have a compiler barrier there in the form of cpu_relax.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
Isn't it too late ? The barrier should be before the test_bit, to
prevent it from moving up.Ben.
-
Take a real life example:
drivers/message/fusion/mptbase.c
/* Disable interrupts! */
CHIPREG_WRITE32(&ioc->chip->IntMask, 0xFFFFFFFF);ioc->active = 0;
synchronize_irq(pdev->irq);And we aren't in a spinlock here.
That's just a random example grepped.... I think I see a few more. Then,
some drivers like tg3 actually do an smp_mb() before calling
synchronize_irq(). But then, some don't.I think trying to have all drivers be correct here is asking for
trouble, we'd rather have synchronize_irq() be uber-safe. It's not like
it was used in hot path anyway.Ben.
-
I really don't see what the point of the barrier would be here.
Barriers are generally useless unless you have a counter-part
on the other side.The counterpart here is presumably the interrupt handler, which
should be terminated by the IO write above regardless of the
memory barrier.Of course I might have missed your point. If so please give
a description like this for the race that you see:CPU1 CPU2
disable IRQ
whatever the race isWhile in general I'd agree with you about give latitude to
drivers, memory barriers I think is something that we can't
afford to.The reason is that memory barries tend to come in pairs, e.g.,
CPU1 CPU2
write A
wmb
write B
read B
rmb
read ATaking away either barrier would render the other useless.
So whenever we add only one barrier for the benefit of driver
writers who don't bother to think about barriers we may not
be helping them at all.In any case, such writers should use easier tools like spin
locks rather than trying to go lockless.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
The barrier would guarantee that ioc->active (and in fact the write to
the chip too above) are globally visible before the read of the irq
status. There are two different issues not to mixup here. Any previous
store vs. a read (general case) and MMIO store of IRQ mask which hasLet's ignore for a moment the generic problem of loads crossing previous
stores and focus on the pure issue of using MMIO writes to mask
interrupts.Writing to a chip to disable an IRQ is generally never going to provide
any synchronisation guarantee. The MMIO write itself is asynchronous and
not ordered vs. a subsequent read from main storage (ie memory). So
unless you do an MMIO read to flush it, you basically haven't yet
disabled your IRQ don't know when it will be. That's one thing.Another one is that even if you do an MMIO read to flush, your IRQ may
already have been latched by your PIC, or may be an MSI already issued
and still travelling along busses, and thus might well occur in a few
instructions. In that later case, note that ignoring it is perfectly
fine since the IRQ line will eventually go down (for a level IRQ that
is, for an edge IRQ, ignoring is always fine). That's what we cause
short interrupts, they can be common, though linux has historical issues
with them (unrelated to this discussion). But your interrupt handler
_will_ be called, and thus should be aware that your intend is to ignore
it.So for both of the reasons above, the MMIO write doesn't mean you IRQ
won't happen and in fact, synchronize_irq() here is totally useless
since it won't prevent the IRQ from actually still happening just after
the test_bit of IRQ_INPROGRESS.Now, the way to actually do that properly is to _also_ have a flag to
indicate your handler you don't want to deal with that interrupt. That
could be something along the lines of:writel(irq mask);
wmb();
chip->masked = 1;
smp_mb();
synchronize_irq();Now, the IRQ handler can just do
if (chip->masked == 1)
return <what...
No, it doesn't really guarantee that.
The thing is, there is no such thing as "globally visible".
There is a "ordering of visibility wrt CPU's", but it's not global, it's
quite potentially per-CPU. So a barrier on one CPU doesn't guarantee
anything at all without a barrier on the *other* CPU.That said, the interrupt handling itself contains various barriers on the
CPU's that receive interrupts, thanks to the spinlocking. But I do agree
with Herbert that adding a "smb_mb()" is certainly in no way "obviously
correct", because it doesn't talk about what the other side does wrt
barriers and that word in memory.Linus
-
I agree and you can see that in fact, we don't have enough barrier on
the other side since spin_unlock doesn't prevent subsequent loads from
crossing a previous store...I wonder if that's worth trying to address, adding a barrier in
handle_IRQ_event for example, or we can continue ignoring the barrier
and let some drivers do their own fixes in fancy ways.Ben.
-
So how about something like this (untested! not necessarily very well
thought through either!)Basic notion: the only thing that serializes the IRQ_INPROGRESS flag is
the descriptor lock. And we don't have to (or even want to!) hold it while
waiting for the thing, but we want to *have*held*it* in between whatever
we're synchronizing with.The internal irq handler functions already held the lock when they did
whatever they need to serialize - and they are possibly performance
critical too - so they use the "internal" function that doesn't get the
lock unnecessarily again.Hmm?
Linus
---
kernel/irq/manage.c | 22 ++++++++++++++++++----
1 files changed, 18 insertions(+), 4 deletions(-)diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 80eab7a..f3e9575 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -14,6 +14,18 @@#include "internals.h"
+/*
+ * Internally wait for IRQ_INPROGRESS to go away on other CPU's,
+ * after having serialized with the descriptor lock.
+ */
+static inline void do_synchronize_irq(struct irq_desc *desc)
+{
+#ifdef CONFIG_SMP
+ while (desc->status & IRQ_INPROGRESS)
+ cpu_relax();
+#endif
+}
+
#ifdef CONFIG_SMP/**
@@ -28,13 +40,15 @@
*/
void synchronize_irq(unsigned int irq)
{
+ unsigned long flags;
struct irq_desc *desc = irq_desc + irq;if (irq >= NR_IRQS)
return;- while (desc->status & IRQ_INPROGRESS)
- cpu_relax();
+ spin_lock_irqsave(&desc->lock, flags);
+ spin_unlock_irqrestore(&desc->lock, flags);
+ do_synchronize_irq(desc);
}
EXPORT_SYMBOL(synchronize_irq);@@ -129,7 +143,7 @@ void disable_irq(unsigned int irq)
disable_irq_nosync(irq);
if (desc->action)
- synchronize_irq(irq);
+ do_synchronize_irq(desc);
}
EXPORT_SYMBOL(disable_irq);@@ -443,7 +457,7 @@ void free_irq(unsigned int irq, void *dev_id)
unregister_handler_proc(irq, action);/* Make sure it's not being used on another CP...
Thinking about this again and checking code I have come to the
conclusion that1) There is a race (in some drivers) even with the full mb.
2) Linus's patch fixes it.
3) It's difficult to fix it elegantly in the driver.In other words I think this patch is great :)
First of all let's agree on some basic assumptions:
* A pair of spin lock/unlock subsumes the effect of a full mb.
* A spin lock in general only equates to (SS/SL/LL).
* A spin unlock in general only equates to (SS/LS).In particular, a load after a spin unlock may pass before the
spin unlock.Here is the race (with tg3 as the only example that I know of).
The driver attempts to quiesce interrupts such that after the
call to synchronize_irq it is intended that no further IRQ
handler calls for that device will do any work besides acking
the IRQ.Here is how it does it:
CPU0 CPU1
spin lock
load irq_sync
irq_sync = 1
smp_mb
synchronize_irq()
while (IRQ_INPROGRESS)
wait
return
set IRQ_INPROGRESS
spin unlock
tg3_msi
ack IRQ
if (irq_sync)
return
do workThe problem here is that load of irq_sync in the handler has
passed above the setting of IRQ_INPROGRESS.Linus's patch fixes it because this becomes:
CPU0 CPU1
spin lock
load irq_sync
irq_sync = 1
smp_mb
synchronize_irq
set IRQ_INPROGRESS
spin unlock
spin lock
spin unlock
tg3_msi
ack IRQ
if (irq_sync)
return
do work
while (IRQ_INPROGRESS)
wait
spin lock
clear IRQ_INPROGRESS
spin unlock
returnEven though we still do the work on the right we will now notice
the INPROGRESS flag on the left and wait.It's hard to fix this in the drivers because they'd either have
to access the desc lock or add a full mb to the fast path on the
right.Once this goes in we can also remove the smp_mb from tg3.c. BTW,
a lot of drivers (including the fusion example Ben quoted) call
synchronize_irq before free_irq. This is unneces...
Hey, I appreciate it, but I do think that the "spinlock only to
immediately unlock it again" is pretty horrid.I'm convinced that there should be some way to do this without actually
taking the lock. I *think* it should work with something likefor (;;) {
smp_rmb();
if (!spin_is_locked(&desc->lock)) {
smp_rmb();
if (!(desc->status & IRQ_INPROGRESS)
break;
}
cpu_relax();
}instead. Which basically requires that we see the descriptor lock being
not held, before we see the IRQ_INPROGRESS bit being clear. Put another
way: it loops until it sees the lock having been released, and the
IRQ_INPROGRESS bit being clear after that.The above requires no serializing instructions on x86, which is a good
goal (now that smp_rmb() is just a compiler barrier). And it doesn't
actually have to bounce any cachelines.And it doesn't have that ugly "get lock only to release it", which just
makes me go "Eww!".But it's a bit subtler. It basically depends on the fact that
spin_unlock() obviously has to make sure that there is a release barrier
in the unlock, so any writes done (to the IRQ_INPROGRESS bit) within the
locked region *must* be visible before the spinlock itself has been
released.So somebody should:
- use another pair of eyes and brains to back me up on this.
- write up some coherent changelog entry, using the emails that have
passed back and forth.
- actually turn the above into a tested patch with a comment.And I'm pushing for that "somebody" being somebody else than me ;)
Linus
-
I'm starting to doubt this.
One of the issues is that we still need the smp_mb() in front of the loop
(because we want to serialize the loop with any writes in the caller).The other issue is that I don't think it's enough that we saw the
descriptor lock unlocked, and then the IRQ_INPROGRESS bit clear. It might
have been unlocked *while* the IRQ was in progress, but the interrupt
handler is now in its last throes, and re-takes the spinlock and clears
the IRQ_INPROGRESS thing. But we're not actually happy until we've seen
the IRQ_INPROGRESS bit clear and the spinlock has been released *again*.So those two tests should actually be the other way around: we want to see
the IRQ_INPROGRESS bit clear first.It's all just too damn subtle and clever. Something like this should not
need to be that subtle.Maybe the rigth thing to do is to not rely on *any* ordering what-so-ever,
and just make the rule be: "if you look at the IRQ_INPROGRESS bit, you'd
better hold the descriptor spinlock", and not have any subtle ordering
issues at all.But that makes us have a loop with getting/releasing the lock all the
time, and then we get back to horrid issues with cacheline bouncing and
unfairness of cache accesses across cores (ie look at the issues we had
with the runqueue starvation in wait_task_inactive()).Those were fixed by starting out with the non-locked and totally unsafe
versions, but then having one last "check with lock held, and repeat only
if that says things went south".See commit fa490cfd15d7ce0900097cc4e60cfd7a76381138 and ponder. Maybe we
should take the same approach here, and do something likerepeat:
/* Optimistic, no-locking loop */
while (desc->status & IRQ_INPROGRESS)
cpu_relax();/* Ok, that indicated we're done: double-check carefully */
spin_lock_irqsave(&desc->lock, flags);
status = desc->status;
spin_unlock_irqrestore(&desc->lock, flags);/* Oops, that failed? */
if (status &...
Right. I think if we accept the definition of a spin lock/unlock
that Nick outlined earlier, then we can see that on the IRQ path
there simply isn't a memory barrier between the changing of the
status field and the execution of the action:spin_lock
set IRQ_INPROGRESS
spin_unlockaction
spin_lock
clear IRQ_INPROGRESS
spin_unlockWhat may happen is that action can either float upwards to give
spin_lock
action
set IRQ_INPROGRESS
spin_unlockspin_lock
clear IRQ_INPROGRESS
spin_unlockor it can float downwards to give
spin_lock
set IRQ_INPROGRESS
spin_unlockspin_lock
clear IRQ_INPROGRESS
action
spin_unlockAs a result we can add as many barriers as we want on the slow
(synchronize_irq) path and it just won't do any good (not unless
we add some barriers on the IRQ path == fast path).What we do have on the right though is the fact in some ways
action behaves as if it's inside the spin lock even though it's
not. In particular, action cannot float up past the first spinThat's why I think this patch is in fact the only one that
solves all the races in this thread. The case that it solves
which the lock/unlock patch does not is the one where action
flows downwards past the clearing of IRQ_INPROGRESS. I missed
this case earlier.In fact this bug exists elsewhere too. For example, the network
stack does this in net/sched/sch_generic.c:/* Wait for outstanding qdisc_run calls. */
while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
yield();This has the same problem as the current synchronize_irq code.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
OK here is the fix for that case.
[NET]: Fix possible dev_deactivate race condition
The function dev_deactivate is supposed to only return when
all outstanding transmissions have completed. Unfortunately
it is possible for store operations in the driver's transmit
function to only become visible after dev_deactivate returns.This patch fixes this by taking the queue lock after we see
the end of the queue run. This ensures that all effects of
any previous transmit calls are visible.If however we detect that there is another queue run occuring,
then we'll warn about it because this should never happen as
we have pointed dev->qdisc to noop_qdisc within the same queue
lock earlier in the functino.Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e01d576..b3b7420 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -556,6 +556,7 @@ void dev_deactivate(struct net_device *dev)
{
struct Qdisc *qdisc;
struct sk_buff *skb;
+ int running;spin_lock_bh(&dev->queue_lock);
qdisc = dev->qdisc;
@@ -571,12 +572,31 @@ void dev_deactivate(struct net_device *dev)dev_watchdog_down(dev);
- /* Wait for outstanding dev_queue_xmit calls. */
+ /* Wait for outstanding qdisc-less dev_queue_xmit calls. */
synchronize_rcu();/* Wait for outstanding qdisc_run calls. */
- while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
- yield();
+ do {
+ while (test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
+ yield();
+
+ /*
+ * Double-check inside queue lock to ensure that all effects
+ * of the queue run are visible when we return.
+ */
+ spin_lock_bh(&dev->queue_lock);
+ running = t...
Ouch!, is there really no sane locking alternative? Hashed waitqueues
-
Well if we ever moved the transmission to full process context
then we'll gladly accept your patch :)Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
From: Herbert Xu <herbert@gondor.apana.org.au>
Applied, thanks Herbert!
-
OK, here is the patch again with a changelog:
[IRQ]: Fix synchronize_irq races with IRQ handler
As it is some callers of synchronize_irq rely on memory barriers
to provide synchronisation against the IRQ handlers. For example,
the tg3 driver doestp->irq_sync = 1;
smp_mb();
synchronize_irq();and then in the IRQ handler:
if (!tp->irq_sync)
netif_rx_schedule(dev, &tp->napi);Unfortunately memory barriers only work well when they come in
pairs. Because we don't actually have memory barriers on the
IRQ path, the memory barrier before the synchronize_irq() doesn't
actually protect us.In particular, synchronize_irq() may return followed by the
result of netif_rx_schedule being made visible.This patch (mostly written by Linus) fixes this by using spin
locks instead of memory barries on the synchronize_irq() path.Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 80eab7a..1f31422 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -29,12 +29,28 @@
void synchronize_irq(unsigned int irq)
{
struct irq_desc *desc = irq_desc + irq;
+ unsigned int status;if (irq >= NR_IRQS)
return;- while (desc->status & IRQ_INPROGRESS)
- cpu_relax();
+ do {
+ unsigned long flags;
+
+ /*
+ * Wait until we're out of the critical section. This might
+ * give the wrong answer due to the lack of memory barriers.
+ */
+ while (desc->status & IRQ_INPROGRESS)
+ cpu_relax();
+
+ /* Ok, that indicated we're done: double-check carefully. */
+ spin_lock_irqsave(&desc->lock, flags);
+ status = desc->status;
+ spin_unlock_irqrestore(&desc->lock, flags);
+
+ /* Oops, that failed? */
...
Good for me.
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cheers,
Ben.-
Hrm... not on yet. Herbert, care to resend, looks like it fell down the
wrong hole in Linus mailbox :-)Cheers,
Ben.-
Thanks for the reminder Ben. Here it is again:
[IRQ]: Fix synchronize_irq races with IRQ handler
As it is some callers of synchronize_irq rely on memory barriers
to provide synchronisation against the IRQ handlers. For example,
the tg3 driver doestp->irq_sync = 1;
smp_mb();
synchronize_irq();and then in the IRQ handler:
if (!tp->irq_sync)
netif_rx_schedule(dev, &tp->napi);Unfortunately memory barriers only work well when they come in
pairs. Because we don't actually have memory barriers on the
IRQ path, the memory barrier before the synchronize_irq() doesn't
actually protect us.In particular, synchronize_irq() may return followed by the
result of netif_rx_schedule being made visible.This patch (mostly written by Linus) fixes this by using spin
locks instead of memory barries on the synchronize_irq() path.Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 80eab7a..1f31422 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -29,12 +29,28 @@
void synchronize_irq(unsigned int irq)
{
struct irq_desc *desc = irq_desc + irq;
+ unsigned int status;if (irq >= NR_IRQS)
return;- while (desc->status & IRQ_INPROGRESS)
- cpu_relax();
+ do {
+ unsigned long flags;
+
+ /*
+ * Wait until we're out of the critical section. This might
+ * give the wrong answer due to the lack of memory barriers.
+ */
+ while (desc->status & IRQ_INPROGRESS)
+ cpu_relax();
+
+ /* Ok, that indicated we're done: double-check carefully. */
+ spin_lock_irqsave(&desc->lock, flags);
+ status = desc->status;
+ spin_unlock_ir...
Well, we are generally safer here. That is, unless action is a store,
it will not pass spin_lock, at least not on powerpc afaik.In fact, if action, as it is in our case, is something like
if (foo)
return;We cant execute the store inside spin_lock() without having loaded
foo, there is an implicit dependency here.But anyway, Linus patch fixes that too if it was a problem. Now if
we grep for while (test_bit and while (!test_bit I'm sure we'll find
other similar surprises.That's also one of the reasons why I _love_ nick patches that add a
proper lock/unlock API on bits, because at least those who are doing
those hand-made pseudo-locks with bitops to save space will be
getting a proper lock/unlock API, no more excuse.The network stack is doing more fancy things so it's harder (though I
wonder sometimes if it's really worth the risks taken for not using
spinlocks... maybe).Ben.
-
The network stack always made be nervous with it's bitops use as locks.
Any loop of test_bit() alone as a synchronisation method, without locks
or barriers is, imho, inherently buggy.Ben.
-
Paulus and I convinced ourselves that this would work. If we call our
variable that gets set before synchronize_irq and read in the IRQ
handler "foo", we get to:- writing foo can travel down at most to just before the unlock in the
code above- reading foo can travel up out of the IRQ handler at most just after
the lock in the code that sets IRQ_INPROGRESS.The whole lock/set IRQ_INPROGRESS/unlock path can then only happen
before the locked section above, in which case we see and wait nicely
and all is good, or after, in which case the store to foo will be
visible to the IRQ handler as it will be ordered with the unlock in the
code above.Pfiew !
So Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Thanks !
Ben.
-
Note that napi_synchronize needs a slightly different treatement.
Here, the situation boils down to:
one CPU does:
foo = 1;
while(test_bit(bar))
barrier();and the other:
if (!test_and_set_bit(bar)) {
read & use foo
smp_mb__before_clear_bit();
clear_bit(bar);
}The good thing here is that read & use foo is part of the critical
section (I hate hand-made locks ...) defined by bar which makes
things somewhat easier than the synchronize_irq() case.I think a simple smp_mb(); here after foo = 1; is enough, which means
basically just having an smp_mp(); inside napi_synchronize(), before
the test_bit(). Or do I miss something ?Cheers,
Ben.-
Yes I think you're right. In this case we do have barriers
everywhere else so this should work.Although if you want napi_synchronize to have the property that
when it returns all NAPI processing effects are visible then
you'd need another smp_mb() after the loop.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
Not unless you mean a pair of spin lock/unlock as in
2 spin lock/unlock pairs (4 operations).*X = 10;
spin_lock(&lock);
/* *Y speculatively loaded here */
/* store to *X leaves CPU store queue here */
spin_unlock(&lock);I don't use the sparc barriers, so they don't come naturally to
me ;)I think both loads and stores can pass into the critical section
by having the spin_lock pass earlier ops, or by having spin_unlock
-
Good point.
Although in this case we're still safe because in the worst
cases:CPU0 CPU1
irq_sync = 1
synchronize_irq
spin lock
load IRQ_INPROGRESS
irq_sync sync is visible
spin unlock
spin lock
load irq_sync
while (IRQ_INPROGRESS)
wait
return
set IRQ_INPROGRESS
spin unlock
tg3_msi
ack IRQ
if (irq_sync)
return
spin lock
clear IRQ_INPROGRESS
spin unlock------------------------------------------------------------
CPU0 CPU1
spin lock
load irq_sync
irq_sync = 1
synchronize_irq
set IRQ_INPROGRESS
spin unlock
spin lock
load IRQ_INPROGRESS
irq_sync sync is visible
spin unlock
while (IRQ_INPROGRESS)
wait
tg3_msi
ack IRQ
if (irq_sync)
return
do work
spin lock
clear IRQ_INPROGRESS
spin unlock
returnSo because we're using the same lock on both sides, it does
do the right thing even without the memory barrier.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
Yeah, if you've got the lock on both sides there, then I
agree it will be correct.
-
That may do the trick as the read can't cross the spin_lock (it can
cross spin_unlock but not lock). Advantage over adding a barrier to
handle_IRQ_event() is that it keeps the overhead to the slow path
(synchronize_irq).Note that I didn't actually experience a problem here. I just came upon
that by accident while thinking about a similar issue I have with
napi_synchronize().Looks good to me on a first glance (unfortunately a bit ugly but heh).
Ben.
-
On Thu, 18 Oct 2007 11:25:42 +1000
Anyone reading this code is going to ask "wtf is that for". It needs a
comment telling them.mb() is the new lock_kernel(). Sigh.
-
Ugh ?
That sounds fairly obvious to me :-) we are reading a value, that is
totally unordered, nothing to do about lock kernel or whatever, if we
want the above statement to make any sense in any kind of usage
scenario, it needs to be ordered vs. what happens before.For example, take a construct like:
device->my_hw_is_off = 1;
synchronize_irq();
turn_off_hardware();That basically makes sure the irq either sees device->my_hw_is_off
being set to 1, or if an irq handler is already in progress and hasn't
seen it, we wait for it to complete.(You can replace "hw_is_off" with anything that we want to set and make
sure the IRQ handler sees it before proceeding. It could be clearing a
pointer to something and make sure the irq sees it before freeing the
data, etc...).I think pretty much any use of synchronize_irq() I can imagine needs
such kind of ordering... or it simply doesn't synchronize anything :-)Cheers,
Ben.-
| Greg Kroah-Hartman | [PATCH 001/196] Chinese: Add the known_regression URI to the HOWTO |
| Linus Torvalds | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| David Newall | Re: Slow DOWN, please!!! |
| Ian Campbell | Re: [PATCH] x86: Construct 32 bit boot time page tables in native format. |
| Matthias Scheler | Re: HEADS UP: timecounters (branch simonb-timecounters) merged into -current |
| Greg Troxel | Re: Interface to change NFS exports |
| Thor Lancelot Simon | metadata cache and memory fragmentation |
| YAMAMOTO Takashi | amap memory allocation |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 0/37] dccp: Feature negotiation - last call for comments |
| David Miller | [GIT]: Networking |
| Dushan Tcholich | Re: ksoftirqd high cpu load on kernels 2.6.24 to 2.6.27-rc1-mm1 |
