Keep track of disable_irq_nosync() invocations and call enable_irq() the right number of times if work has been cancelled that would include them. Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> --- Now that the call to flush_work_keventd() (problematic because of rtnl_mutex being held) has been replaced by cancel_work_sync() another issue has arisen and been left unresolved. As the MDIO bus cannot be accessed from the interrupt context the PHY interrupt handler uses disable_irq_nosync() to prevent from looping and schedules some work to be done as a softirq, which, apart from handling the state change of the originating PHY, is responsible for reenabling the interrupt. Now if the interrupt line is shared by another device and a call to the softirq handler has been cancelled, that call to enable_irq() never happens and the other device cannot use its interrupt anymore as its stuck disabled. I decided to use a counter rather than a flag because there may be more than one call to phy_change() cancelled in the queue -- a real one and a fake one triggered by free_irq() if DEBUG_SHIRQ is used, if nothing else. Therefore because of its nesting property enable_irq() has to be called the right number of times to match the number disable_irq_nosync() was called and restore the original state. This DEBUG_SHIRQ feature is also the reason why free_irq() has to be called before cancel_work_sync(). While at it I updated the comment about phy_stop_interrupts() being called from `keventd' -- this is no longer relevant as the use of cancel_work_sync() makes such an approach unnecessary. OTOH a similar comment referring to flush_scheduled_work() in phy_stop() still applies as using cancel_work_sync() there would be dangerous. Checked with checkpatch.pl and at the run time (with and without DEBUG_SHIRQ). Please apply. Maciej patch-mips-2.6.23-rc5-20070904-phy-irq-fix-9 diff -up --recursive --new-file ...
On Wed, 19 Sep 2007 15:38:19 +0100 (BST) You always put boring, crappy, insufficient text in the for-the-changelog section and interesting, useful, sufficient text in the not-for-the-changelog section. But you can't fool me! I have an editor and I fix it up. -
I'll swap the sections in the future then. ;-) Frankly I was not sure whether the changelog was happy about being fed with lengthy explanations and it has not spoken out. I have to admit this is a habit carried over from the FSF-style ChangeLog -- where the enforced rule is actually *not* to provide any explanation for why changes are done and only describe what has been modified (with Thank you and sorry for the extra work I caused you -- I shall keep your suggestion in mind in the future. Maciej -
On Fri, 21 Sep 2007 13:51:12 +0100 (BST) I think it's worth putting plenty of details in the changelog: it's compressed on-disk and on-the-wire and is overall pretty cheap. If people don't actually seek the information out, it has close to zero impact on them. But on those occasions when people _do_ seek the information out (and it can be years later) then they want every drop of information they can get. Numerous times I've gone back to the 2.5.x mm/ changelogs to work out what on earth we were thinking when we did something, and it has proved quite useful in explaining the existing code, or in suggesting possible problems which we had forgotten about by 2007. otoh, you can get a lot of handy info by googling for strategic parts of the kernel code, or by googling snippets of the existing-but-short changelog. For example, this patch: google for "Keep track of disable_irq_nosync() invocations" and voila. Perhaps we don't need changelogs at all ;) -
On 19-09-2007 16:38, Maciej W. Rozycki wrote: Hi, Could you explain why cancel_work_sync() is better here than flush_scheduled_work() wrt. rtnl_lock()? Regards, Jarek P. -
Well, this is actually the bit that made cancel_work_sync() be written in the first place. The short story is the netlink lock is most probably held at this point (depending on the usage of phy_disconnect()) and there is also an event waiting in the queue that requires the lock, so if flush_scheduled_work() is called here a deadlock will happen. Let me find a reference for a longer story...: http://www.linux-mips.org/cgi-bin/mesg.cgi?a=linux-mips&i=Pine.LNX.4.64N.061003150... and then discussed again: http://www.uwsg.indiana.edu/hypermail/linux/kernel/0612.0/0593.html Maciej -
Yes, it's all right here. Sorry for bothering - I should've found this by myself. I've still some doubts about this possible enable_irq() after free_irq(). If it's the only handler the status would be changed again and at least some of this code in check_irq_resend() would be run, but I can miss something again or/and this doesn't matter, as well. Thanks, Jarek P. -
Ah, no problem -- even with the right keys you may sometimes get lost in Well, enable_irq() and disable_irq() themselves are nesting, so they are not a problem. OTOH, free_irq() does not seem to maintain the depth count correctly, which looks like a bug to me and which could trigger regardless of whether flush_scheduled_work() or cancel_work_sync() was called. The reason is CONFIG_DEBUG_SHIRQ which makes a simulated interrupt event be sent at the end of free_irq(). It looks like a problem that is complementary to one I signalled here: http://lkml.org/lkml/2007/9/12/82 with respect to request_irq(), where, similarly, such an interrupt event is sent at the beginning. It looks like nobody was concerned back then, but perhaps it is time to do a better investigation now and propose a solution. I'll think about it and thanks for your inquisitiveness that has led to these conclusions. Maciej -
On Tue, Oct 16, 2007 at 06:19:32PM +0100, Maciej W. Rozycki wrote: I'm not sure free_irq() should maintain the depth count - rather warn on not zero. But, IMHO, any activity on freed irq seems suspicious to me (and doesn't look like very common), even if it's safe with current Yes, these DEBUG_SHIRQ checks are suspicious to me too, but they seem to be reasonable only in the case of possible resent irqs (so not for all irqs). On the other hand, it seems, proper irq handler with proper Btw., since, because of this patch, I've had a one more look at phy.c and there are a few more doubts, so this time I'll try to bother you for real: 1) phy_change() checks PHY_HALTED flag without lock; I think it's racy: eg. if it's done during phy_stop() it can check just before the flag is set and reenable interrupts just after phy_stop() ends. 2) phy_change() doesn't reenable irq line after it sees returns with errors; IMHO it should at least write some warning, but maybe try some safety plan, so enable_irq() and try to disable interrupts and free_irq() on the next call (if it happens). (But, I can be very wrong with this - maybe it's OK and official way.) 3) phy_interrupt() checks PHY_HALTED flag without lock too, but I'm not sure now if it could be dangerous after fixing #1; on the other hand even if we know it's not our regular interrupt, with current DEBUG_SHIRQ it could be easier to call schedule_work() anyway since we are sure it's before/in free_irq() yet. 4) phy_interrupt() should check return value from schedule_work() and enable irq on 0. 5) phy_stop_interrupts(): maybe I miss something, but it seems phy_stop() is required before this, so maybe there should be a comment on this? 6) phy_stop_interrupts(): if I'm not wrong with #3 calling phy_disable_interrupts() looks like we are not sure this phy_stop() really works; than maybe a WARN_ON? 7) phy_stop_interrupts(): after above mentioned changes in phy_interrupt(), and phy_changes() (always enable_irq()) I can't ...
On Wed, Oct 17, 2007 at 10:58:09AM +0200, Jarek Poplawski wrote: Should be: 6) phy_stop_interrupts(): if I'm not wrong with #5 calling Jarek P. -
.../... While there... is somebody interested in making the whole PHY lib operate a task level and use mutexes instead of spinlock ? I need that for drivers like EMAC (who use their own PHY layer for now), and I might even give a go at adapting phylib myself, but I'd like to take the temperature about it first. Basically, there is nothing in phylib that is performance critical or such that requires it to run at irq time and/or use locks. On the other hand, it complicates things in various areas. The most obvious one being that it prevents the network driver mii access callbacks from sleeping which can be annoying as MDIO can be slow, and some drivers have fancy muxes in there that are better off mutexed than spinlocked. Ben. -
On Wed, Oct 17, 2007 at 10:58:09AM +0200, Jarek Poplawski wrote: After rethinking, it looks like this last cancel should be useless. So, if phy_interrupt() schedules only if !PHY_HALTED and phy_change() does enable_irq() with no exeptions, it seems phy_interrupt() even without lock must see PHY_HALTED state before this free_irq() with possible DEBUG_SHIRQ call, then maybe only this safety: WARN_ON(work_pending(&phydev->phy_queue)); Btw, I've read this was considered and not liked, but IMHO, if this really has to be like this, creating phy's own workqueue seems to be resonable, at the very least to reduce latencies to other users of this irq. Jarek P. -
After reading this and earlier threads about phylib's way of using workqueue I think such a lighter and safer wrt. locking alternative for flush_scheduled_work should be useful, but maybe it's only my imagination. So, let's ask Oleg Nesterov, whose solutions are here only copy-cut-pasted & possibly abused by myslef. ---------> Subject: flush_work_sync as an alternative for flush_scheduled_work Similar to cancel_work_sync() but will only busy wait & block (without cancel). Signed-off-by: Jarek Poplawski <jarkao2@o2.pl> --- include/linux/workqueue.h | 1 + kernel/workqueue.c | 24 ++++++++++++++++++++++++ 2 files changed, 25 insertions(+) diff -Nurp 2.6.23-mm1-/include/linux/workqueue.h 2.6.23-mm1/include/linux/workqueue.h --- 2.6.23-mm1-/include/linux/workqueue.h 2007-10-12 23:45:24.000000000 +0200 +++ 2.6.23-mm1/include/linux/workqueue.h 2007-10-17 20:55:26.000000000 +0200 @@ -192,6 +192,7 @@ extern void init_workqueues(void); int execute_in_process_context(work_func_t fn, struct execute_work *); extern int cancel_work_sync(struct work_struct *work); +extern void flush_work_sync(struct work_struct *work); /* * Kill off a pending schedule_delayed_work(). Note that the work callback diff -Nurp 2.6.23-mm1-/kernel/workqueue.c 2.6.23-mm1/kernel/workqueue.c --- 2.6.23-mm1-/kernel/workqueue.c 2007-10-12 23:45:25.000000000 +0200 +++ 2.6.23-mm1/kernel/workqueue.c 2007-10-17 20:54:03.000000000 +0200 @@ -539,6 +539,30 @@ int cancel_delayed_work_sync(struct dela } EXPORT_SYMBOL(cancel_delayed_work_sync); +/** + * flush_work_sync - block until a work_struct's callback has terminated + * @work: the work which is to be flushed + * + * Similar to cancel_work_sync() but will only busy wait (without cancel) + * if the work is queued. If the work's callback appears to be running, + * flush_work_sync() will block until it has completed (but doesn't block + * while other callbacks are running, like flush_scheduled_work() does). + * + * It is not ...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Yes, it won't block, but will spin in busy-wait loop until all other works scheduled before this work are finished. Not good. After that it really blocks waiting for this work to complete. And I am a bit confused. We can't use flush_workqueue() because some of the queued work_structs may take rtnl_lock, yes? But in that case we can't use the new flush_work_sync() helper as well, no? If we can't just cancel the work, can't we do something like if (cancel_work_sync(w)) w->func(w); If we really the new helper, perhaps we can make it a bit better? 1. Modify insert_work() to take the "struct list_head *at" parameter instead of "int tail". I think this patch will also cleanup the code a bit, and shrink a couple of bytes from .text 2. flush_work_sync() inserts a barrier right after this work and blocks. We still need some retry logic to handle the queueing is in progress of course, but we won't spin waiting for the other works. What do you think? Oleg. -
We do an equivalent of this -- all that we care about that w->func(w) would do is enable_irq() and the rest we want to skip at this point. We probably do not need the counter in the end though. Maciej -
OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOPS! Of course, we can't!!! I remembered there was this issue long time ago, but then I've had some break in tracking net & workqueue. So, while reading this patch I was alarmed at first, and self-misled later. I think, there is definitely needed some warning about locking (or unlocking) during these flush_ & cancel_ functions. (Btw, I've very much wondered now, why I didn't notice at that 'old' time, that you added such a great feature (wrt. locking) and I even didn't notice this...). So, Maciej (and other readers of this thread) - I withdraw my false opinion from my second message here: it's very wrong to call this sched_work_sync() with rtnl_lock(). It's only less probable to lockup Looks like a very good idea, but I need more time to rethink this. Until monday I should have an opinion on that (today a bit under Since there is no gain wrt. locking with my current proposal, I withdraw this patch of course. It looks like my wrong patch was great idea because we got this very precious Oleg's opinion! (I know I'm a genius sometimes...) Thanks very much, Jarek P. -
On Fri, Oct 19, 2007 at 09:50:14AM +0200, Jarek Poplawski wrote: ...But, not much less... Jarek P. -
OK, I know I'm dumber and dumber everyday, but it seems in a hurry I got it wrong again or miss something (as usual): these all flushes are rtnl lockup vulnerable wrt. other work functions, but cancel_work_sync looks perfectly fine... (Or am I wrong because: ...?) Then, if by any chance I'm right, something like flush_work_sync (or changed flush_scheduled_work, if there is no problem with such a change of implementation) could be safely (if it's called without locks used by flushed work only) done cancel_work_sync() way, by running a work function after try_to_grab_pending() returns 1 (after list_del_init - of course without respecting a queue order). Regards, Jarek P. -
If this work doesn't rearm itself - yes. (otherwise, the same ->func can run twice _at the same time_) But again, in this case wait_on_work() after try_to_grab_pending() == 1 doesn't block, so we can just do if (cancel_work_sync(w)) w->func(); Oleg. -
Of course, I should have written it much shorter by saying flush_scheduled_work could be done internally just like you suggested! My point is to make this all safer and simpler, so we don't have to remember each time about these differences wrt. locking. Then checking of such patches could be much easier. Unless this current behavior of flush_scheduled_work has any real advantages, worth of this potential unsafety. (Btw. is there any reason to use this with rearming works, anyway?) Thanks, Jarek P. -
On Mon, Oct 22, 2007 at 10:02:59PM +0400, Oleg Nesterov wrote: ...but, if it were run just before work_clear_pending()? Jarek P. -
3. Add lockdep annotation like the other API. :) Andrew just sent my patch (used to be two patches by somebody's request but that's fine) titled "workqueue: debug flushing deadlocks with lockdep" to Linus. johannes
What do you mean by "proper irq handler with proper hardware"? Using softirqs (they used to be called bottom-halves) is actually a natural way I remember having a look into it, but it was long ago and I cannot immediately recall the conclusion. Which means it is either broken or deserves a comment as non-obvious. I will have a look into it again, but No way to do this safely -- at this point the device probably still has its interrupt output asserted and the register to clear it is inaccessible, so enabling the line will enter an infinite loop. At this point the system is no longer stable, so it is better to keep at least some functionality, so that it may be attempted to be shut down cleanly, The API is documented in: Documentation/networking/phy.txt -- you are welcome to improve. If you do not want to get into the gory details, just use the cooked interface and phy_disconnect() will do the dirty work for CONFIG_DEBUG_SHIRQ. Barring this option, cancel_work_sync() could have been moved to the front of free_irq(). I think I have seen this in reality, with the interrupt line left stuck disabled afterwards, but I will double check when I have an opportunity. The approach implemented with this patch does work, which is the important bit, and if Well, if there is another handler registered on this line, you'll get I am not sure I know phy.c well enough either, ;-) and your concerns are appreciated as interesting conclusions may develop. If someone disagrees with what I have written here, they are welcome to speak out too. Maciej -
Technically until free_irq returns a handler should respond to calls and with proper hardware it should have no problem with checking if it's its interrupt, even after disabling this hardware, because of possible races. But with a scenario like this: - disable_irq() - disable_my_hadrware_irq() - clear_my_hardware_irq() - free_irq() - enable_irq() it seems the handler should respond even after free_irq because there could be still interrupts to resend, originally generated by its hardware, so such behavior looks very suspicious, at least with some type of interrupts. So, I think, the idea of DEBUG_SHIRQ is generally right and very useful - but, of course, there could be exceptions, which btw. could try some hacks under DEBUG_SHIRQ too. And my opinion about 'properness' was very general (not about phy) too, just like my How? If schedule_work won't succeed because there is a pending one, I've thought that phy_stop() could be needed before phy_stop_interrupt() to set PHY_HALTED, but since it disables and clears interrupts too, then there should be no need to repeat this, maybe only check it's done. But if there is no such dependency, then But, I've enough of other concerns too, so nothing urgent here... Many thanks, Jarek P. -
Well, the hardirq handler can check it, no problem -- it is just it is so slow, the latency would be unacceptable. The problem with the softirq handler is we do really not want it to be called after the driver has already been shut down and its structures freed. It used to happen before this flush/cancel call was added with the usual effect (oops) as by then These are softirqs, not hardware interrupts, so they are as such not related to *_irq() infrastructure. The flaw is the depth count of IRQ lines is not maintained consistently. This is, according to comments around the code in question, to cover up bugs elsewhere. Not a brillant idea, I am afraid -- there should be no need to reset the depth upon request_irq() and likewise with free_irq(), but there you go. I would be happy to investigate a possible solution and rewrite the necessary bits, but right now I am committed to other stuff, overdue already, sorry. The view could change if we supported hot-pluggable interrupt controllers, but it is not the case at the moment right now, so the interrupt lines are there to stay for the duration of the system lifespan The idea is right, no question, but I am not quite sure it has been properly architected into our current design. Actually I am almost sure of the reverse. This is why I was (and still am) interested in feedback Correct and my note is misleading, sorry. The thing is we shouldn't have come here the second time in the first place (which is I think why the check is not there) as handlers for the same line are not allowed to run in parallel (cf. irq_desc->lock and IRQ_INPROGRESS). Perhaps BUG_ON(!schedule_work()) would be appropriate, though I am not sure if we should handle every "impossible" condition we can imagine. Quite a lot of hardirq handlers assume two instances will not run in parallel, so if it ever happened, then a serious breakage would The problem is at the moment I am still probably the only user of this ...
But then... your patch seems to make it possible, because it enables irq to the initial state of the counter. Of course, this could happen on closing only. Jarek P. -
That's because free_irq() does not disable the interrupt in the correct manner. The scenario is more or less like this: phy_interrupt() [depth == 0] disable_irq() depth++; status |= IRQ_DISABLED; ... free_irq() [depth == 1] status |= IRQ_DISABLED; ... phy_change() [depth == 1] enable_irq() depth--; status &= ~IRQ_DISABLED; oops! Now if free_irq() correctly incremented the depth counter, then the last enable_irq() would still decrement it, but with its initial value of 2 it would not change the status to reenable the line. Maciej -
Well, I have now recalled what the issue is -- we just plainly and simply want to avoid a hardirq spinlock for the very reason we do not do all the processing in the hardirq handler. The thing is we make accesses to the MDIO bus with the phydev lock held and it may take ages until these accesses will have completed. And we cannot afford keeping interrupts disabled for so long. So the only way is to make the check for the HALTED state lockless and make sure any race condition is handled gracefully and does not lead to inconsistent behaviour. Which I think as of what we have in the net-2.6.24 tree is the case, but there are never too many eyes to look at a piece of code, so if anybody feels like proving me wrong, then just go ahead! Maciej -
Actually I'm not convinced with this explanation. It seems to me that since there are such serious locking problems (especially with rntl), there could be once more considered a private workqueue. You've written earlier about being a lonely user of this code. But, since Benjamin offered his help with changing to mutexes, which looks like very reasonable idea to me (probably I miss most of the points...), maybe it's very good opportunity to both: make this code better and double the user base! I'm interested in looking for such solution if Benjamin thinks there could be too few problems for him... So, let somebody tell us what could be wrong with this idea? Cheers (till Monday), Jarek P. -
Well, this will change eventually when I submit the patch for Broadcom SiByte platforms so that sb1250-mac.c will be able to request an interrupt for the PHYs. All the infrastructure is in place except from platform code to configure the SOC's GPIO line used for the interrupt input (when I do not object and can certainly cooperate, but cannot make it a high priority work for me at the moment -- sorry. Maciej -
My main problem is time :-) But I need to do the change to mutex if I am to use phylib for emac. I can't tell when I'll have time to do it tho. Ben. -
