I'm encountering a very odd issue which only seems to crop up while I sleep at night. In the morning, I go back to my Linux machine and there's a ridiculously bad latency I immediately notice in Firefox's redraw which I cannot explain. glxgears shows the typical 5K FPS, so it seems like graphics are normal, but there's something causing horrendous redraw times for Firefox. If I exit X11, the problem then extends to my virtual terminals. If I switch a VT, it takes about a second to redraw the whole screen top-to-bottom[1]. I can also get the same issue if I do page-up/page-down in less, or pretty much anything that causes the screen to redraw. I'm using 2.6.27-rc6-tip right now, but this problem occurs on pristine rc5 and rc6 as well. uptime is only 7 hours, and this problem seems to get progressively worse over time. Strangely enough, unloading the 'ath9k' driver causes the latency to outright -vanish- instantaneously (I only discovered this -after- I made the YouTube video cited below, so the video doesn't demonstrate this tidbit, unfortunately). I wish I knew what these symptoms are pointing to, because this makes no sense to me at all. Is there some hardware interrupt going bonkers? The only thing I can think of is that I've got 'irqpoll' on the command line. I have it because I had to reload the ath9k driver at one point, and this was the result (caused ath9k to become a useless driver until I rebooted, which re-enabled IRQ 17): [53997.493777] ath9k: driver unloaded [54001.098532] ath9k: 0.1 [54001.098833] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 [54001.099013] ath9k 0000:03:00.0: setting latency timer to 64 [54001.233825] phy1: Selected rate control algorithm 'ath9k_rate_control' [54001.236980] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17 [54002.136218] irq 17: nobody cared (try booting with the "irqpoll" option) [54002.136229] Pid: 0, comm: swapper Tainted: P 2.6.27-rc6-00036-ga551b98 #2 [54002.136234] [54002.136235] Call ...
irqpoll is a monster of evil and that should make your system crawl to its knees. I would advise instead we work with you fixing the the missed interrupts issue upon rmmod. Luis --
Also, please provide the output of cat /proc/interrupts and also please do not cross post to all these lists, just use linux-wireless or ath9k. Luis --
On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez
Note that the problem necessitating use of irqpoll in the first place
seems to only happen under certain conditions. I am unsure what these
conditions are. Before 'ath9k: connectivity is lost after Group
rekeying is done', I had used rmmod/modprobe as my solution to the
issue, which triggered the IRQ issue.
alcarin steven # cat /proc/interrupts
CPU0 CPU1
0x0: 63227 0 IO-APIC-edge hpet
0x8: 1 0 IO-APIC-edge rtc0
0x9: 13080 0 IO-APIC-fasteoi acpi
0xe: 8195 0 IO-APIC-edge ide0
0xf: 0 0 IO-APIC-edge ide1
0x10: 36 0 IO-APIC-fasteoi uhci_hcd:usb5
0x11: 10645 0 IO-APIC-fasteoi ath
0x12: 42 0 IO-APIC-fasteoi uhci_hcd:usb4
0x17: 919 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
0x13: 32885 0 IO-APIC-fasteoi uhci_hcd:usb3,
ata_piix, ohci1394
0x200100: 1 0 PCI-MSI-edge eth0
0x16: 223 0 IO-APIC-fasteoi HDA Intel
NMI: 0 0 Non-maskable interrupts
LOC: 78087 95718 Local timer interrupts
RES: 11576 16384 Rescheduling interrupts
CAL: 6862 8889 Function call interrupts
TLB: 54 41 TLB shootdowns
TRM: 0 0 Thermal event interrupts
THR: 0 0 Threshold APIC interrupts
SPU: 0 0 Spurious interrupts
Sorry, but in the past I've posted to linux-wireless, ath9k-devel, and
all the maintainers of ath9k and didn't get a single response (except
a 'me too' from a fellow ath9k user). I didn't just want to hear
crickets this time.
- Steven
--
I wonder if this is related: alcarin ~ # dmesg | grep GPE [ 0.166806] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62 [ 685.409077] ACPI: EC: GPE storm detected, disabling EC GPE --
FWIW, you might want to try this pach in this thread: http://www.gossamer-threads.com/lists/linux/kernel/966076 I was receiving a disable irq #23 this patch"knock on wood", seems to help with this issue. -- Justin P. Mattock --
Both the patches there affect EHCI, though. If I was having a USB controller fink out with "IRQ X: nobody cared", then that patch would probably indeed be applicable. --
You mean this patch: [PATCH] ath9k: connectivity is lost after Group rekeying is done http://marc.info/?l=linux-wireless&m=122163541519736&w=2 So let me get this straight -- you applied this new patch, and haven't Understood, but I also have used this before with ath9k and I got exactly the same results you did -- I just refused to use it again and just try to fix the issues present. ath9k issues tons of interrupts, not sure why irqpoll option would cause latency so bad as the interrupts *are* handled. Not sure *exactly* how irqpoll works but its description mentions using it forces each interrupt handler on the IRQ line to check the interrupt is for it. You have to keep in mind that not only are ath9k interrupts then being sent to the devices on its line but it would seem that all other devices on each line would suffer from the interrupts of the other guys. Why ath9k would be the *only* culprit of causing latency But it was this interrupt line which had an interrupt not handled. I'm not sure why this would happen. Can't we rule out ath9k then since Patches speak more than words, but yeah sorry, we should have addressed this there. I've personally have just been busy with tackling aggregation. Luis --
On Thu, Sep 18, 2008 at 1:44 PM, Luis R. Rodriguez Appologize about that, I'm wondering what is or was the main commit that might be causing all of these disable irq #'s in usb, wireless, etc.. just doing a quick search one cant help but to think maybe this is something with tick.(but could be wrong); -- Justin P. Mattock --
On Thu, Sep 18, 2008 at 1:44 PM, Luis R. Rodriguez I hadn't at the time of that writing, no. I saw it as a fix for future ignored IRQs, and hadn't noticed any difference with it on or off. So it seemed like since there was no consequence having it enabled, why not leave it enabled all the time? Now that I'm aware it's the spawn of satan, I'm trying with it off. So far, so good. But I haven't had need to reload ath9k so frequently, and even when I do, I can't reproduce the specific conditions which No, it's in hex. 0x17 = 23, 0x11 = 17. IRQ 17 is the one that pooped If I do, I'll need to know what precisely to do about it. What debug Which is far more important, I agree. It's annoying to get speeds <802.11b on my pre-802.11n capable chipset and network. ;) - Steven --
Sure, OK so in ath9k's core.h we have a macro used to indicate what type of debugging stuff gets print out. By default we only leave enabled the fatal stuff. #define DBG_DEFAULT (ATH_DBG_FATAL ) You want to add to this ATH_DBG_INTERRUPT as follows: #define DBG_DEFAULT (ATH_DBG_FATAL | ATH_DBG_INTERRUPT) Then since the issue seems to come up when you rmmod, simply try to download do some large RXing or TXing or both and then rmmod. Capture Yeah exactly, I hope the work we do will get in for 27. We'll see. Luis --
Oh and you should at least get some 11n rates, just not aggregation :) Luis --
On Thu, Sep 18, 2008 at 3:08 PM, Luis R. Rodriguez I always get "1 megabit/sec". It actually measures (at best) to be a throughput of 8mbits/sec, though the average (due to strange random packet loss) is about 4mbits/sec. -- And that was with aggregation re-enabled via your patch. Without, it never exceeds 1mbit/sec (actual, not just as indicated by iwconfig. - Steven --
Yeah, iwconfig reports a wrong rate, should be easy to fix but I That patch just fixes aggregation for mac80211 after the TX MQ Right. Luis --
the rate patch I posted today will fix it. not for .27 though johannes
This IRQ debugging is lots of fun. My /var/log/messages jumped in size to 241MB. :) My test case was a file transfer that basically copies 50MB from a remote machine's /dev/urandom and writes it to /dev/null locally. That should be a pretty nice large RX. And jeez, I'm seeing tremendous download speeds now and I have no idea what is different about my configuration. I'm getting up to 24 megabit per second transfer rates (actual), which is incredibly unusual for ath9k. I wasn't able to reproduce the IRQ problems on -tip or Linus' current tree (both with the group rekeying fix applied, which may or may not be a good idea for reproducing the issue). Even tried doing rmmod/modprobe in fast succession multiple times. Couldn't get it to throw up. I wonder... My rmmod/modprobe originally was used after the improper group rekeying before in order to get it back to normal. Perhaps the encryption has to go wonky before this IRQ issue surfaces? Another possibility is the earlier mentioned "ACPI: EC: GPE storm detected, disabling EC GPE". I don't know if it's related or not, but I haven't waited for it to show up before testing. Oh well, if the IRQ problem resurfaces (or if I get a ridiculous latency tomorrow morning with irqpoll off), I'll try to get some useful data to help debug it. --
Thanks for testing, and glad to see this is resolved. Luis --
On Thu, Sep 18, 2008 at 8:01 PM, Luis R. Rodriguez Damn. It's back. I was using wireless normally this evening. Browsing the web, that kind of thing, and then the wireless inexplicably dropped (even with the group rekeying patch), so I unloaded/reloaded the module. This popped up in dmesg: [ 3834.375658] vendor=8086 device=27d2 [ 3834.375666] ath9k 0000:03:00.0: PCI INT A disabled [ 3834.375716] ath9k: driver unloaded [ 3838.552419] ath9k: 0.1 [ 3838.552502] vendor=8086 device=27d2 [ 3838.552511] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 [ 3838.552532] ath9k 0000:03:00.0: setting latency timer to 64 [ 3838.688924] phy1: Selected rate control algorithm 'ath9k_rate_control' [ 3838.693652] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17 [ 3839.427125] irq 17: nobody cared (try booting with the "irqpoll" option) [ 3839.427136] Pid: 0, comm: swapper Tainted: P 2.6.27-rc6-tip-00478-g74f1a36 #1 [ 3839.427141] Call Trace: [ 3839.427145] <IRQ> [<ffffffff802219c5>] ? read_hpet+0x9/0x1c [ 3839.427165] [<ffffffff8026af71>] __report_bad_irq+0x3d/0x8c [ 3839.427172] [<ffffffff8026b0c6>] note_interrupt+0x106/0x160 [ 3839.427180] [<ffffffff8026b801>] handle_fasteoi_irq+0xad/0xda [ 3839.427188] [<ffffffff8020f7b0>] do_IRQ+0x10c/0x190 [ 3839.427194] [<ffffffff8020c551>] ret_from_intr+0x0/0xa [ 3839.427198] <EOI> [<ffffffff8026c6f0>] rcu_pending+0x62/0x6e [ 3839.427211] [<ffffffff8025bc11>] ? tick_nohz_stop_sched_tick+0x2e4/0x2f3 [ 3839.427218] [<ffffffff8020ad94>] cpu_idle+0x7b/0xdb [ 3839.427226] [<ffffffff8060c921>] rest_init+0x75/0x77 [ 3839.427231] handlers: [ 3839.427234] [<ffffffffa0240238>] (ath_isr+0x0/0x170 [ath9k]) [ 3839.427263] Disabling IRQ #17 [ 3842.263699] ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 3848.035003] ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 3848.432701] ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 3850.216947] wlan0: authenticate with AP 00:1e:52:79:4d:01 [ 3850.217027] wlan0: authenticate with AP ...
in tip/master there's an ftrace_printk() facility. You can just replace the verbose printk()s with ftrace_printk() and see the result in /debug/tracing/trace. This sort of tracing has very low overhead and can be used in an easy ad-hoc manner with no extra infrastructure. Here's a few quick-start links about how to enable the scheduler tracer: http://redhat.com/~mingo/sched-devel.git/readme-tracer.txt http://redhat.com/~mingo/sched-devel.git/howto-trace-latencies.txt any ftrace_printk() you add in the kernel will show up in that trace. (If the scheduling events are uninteresting and clutter the output then you might want to remove the scheduler tracing entries from kernel/*.c by removing the trace_sched_*() calls or use a less noisy tracer.) to get enough of a trace history you might want to increase the number of trace entries in /debug/tracing/trace_entries from 16K to 128K or so. Ingo --
Thanks for the suggestion, I'll do that. :) --
one more suggestion: you should first check how large the 'scope' of your trace is - i.e. how long of a time span it can cover, in the workload that you are going to trace. if it's too short - say just a few hundred milliseconds then you need a larger trace buffer or fewer trace entries. another thing: you can auto-stop tracing on the failure condition. That way you dont have to worry about whether you notice the badness. Unfortunately there's no easy callback to just anonymously stop all tracing that goes on. There's ftrace_kill_atomic() but i'm not sure it properly stops the scheduler tracer. What is needed is to set kernel/trace/trace.c's tracing_disabled variable to 0. I've Cc:-ed Steve Rostedt - Steve, why is there no easy ad-hoc way to trigger a stop-tracing event without elaborate changes? Ingo --
I hate trips where my email box becomes filled. I skip over a lot of emails that I should not have. I think I added a way to do this. Hmm, I think it was in -rt that I added it, and have not gotten around to sending you the updates. Yes, we need an easy way to disable the tracer. I think we have ftrace_disable() function. Perhaps that only disables tracing that checks the tracing_disabled variable. Anyway, I made sure the new ring_buffer code has an easy way to disable it. -- Steve --
we have kill_ftrace(), but that will permanently zap it. Ingo --
OK, on Monday, as I work on getting ftrace to use the ring buffer, I'll also look to see which features are only in -rt that need to go to tip. -- Steve --
It'd be extremely convenient to have a tracer that did absolutely nothing but display ftrace_printk()s. --
yeah, had the same thought. Could you please contribute one? :) I'd suggest you take the smallest tracer as a template: trace_sched_switch.c and turn that into trace_nop.c - and change all the function names and ASCII and kconfig glue to 'nop'/'NOP'? That can then be used as a basis for additional ftrace plugins, or for ad-hoc temporary tracing. Ingo --
Do pardon me if this is incredibly stupid, but I have some thoughts about ath9k interrupt handling. Let me know if there's any validity here. And if there's any invalid points, please help me understand what's going on. :) First of all, I know for certain the ath9k interrupt handler is registered before the driver prints the line saying the name of the device ("phy1: Atheros 5416"). So any IRQs to IRQ 17 should be seen by the ath9k handler (ath_isr) after that point, right? Shortly after that point is where we're seeing the "nobody cared". This seems to suggest that it registered the handler successfully (because we didn't get an error message), but the handler returned IRQ_NONE. Second of all, I'm looking at the ath9k interrupt handler right now, and there are a few cases where it returns IRQ_NONE. And here's where I'm a bit fuzzy. Since there could be any number of things using IRQ 17 (though in my case, ath9k is on its own dedicated IRQ), it seems odd that you check the value of sc->sc_invalid, since the cookie passed to the handler might not actually be ath9k's cookie if multiple drivers have registered IRQ handlers for that particular IRQ. Who knows if what you're reading is even valid? Heck, what if some driver uses a NULL for their cookie (however unlikely)? You'd get a segmentation fault on the second line of the interrupt handler. Of course, I could be completely wrong: does parent interrupt handler check to see which device driver owns the particular device signaling an IRQ and then call the appropriate handler? - Steven --
I am able to reproduce this IRQ nobody cared issue in my setup and the
following patch seems to be fixing the issue. Please try it out and let
me know if it solves your issue in your setup.
**********
IRQs should be disabled before calling free_irq. Also clear pending IRQs.
Signed-off-by: Senthil Balasubramanian <senthilkumar@atheros.com>
---
drivers/net/wireless/ath9k/core.c | 2 ++
drivers/net/wireless/ath9k/main.c | 8 +++++++-
2 files changed, 9 insertions(+), 1 deletions(-)
diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
index c262ef2..c007dd2 100644
--- a/drivers/net/wireless/ath9k/core.c
+++ b/drivers/net/wireless/ath9k/core.c
@@ -1183,6 +1183,8 @@ void ath_deinit(struct ath_softc *sc)
DPRINTF(sc, ATH_DBG_CONFIG, "%s\n", __func__);
+ tasklet_kill(&sc->intr_tq);
+ tasklet_kill(&sc->bcon_tasklet);
ath_stop(sc);
if (!(sc->sc_flags & SC_OP_INVALID))
ath9k_hw_setpower(sc->sc_ah, ATH9K_PM_AWAKE);
diff --git a/drivers/net/wireless/ath9k/main.c b/drivers/net/wireless/ath9k/main.c
index aca893a..559e0e8 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1781,10 +1781,16 @@ static void ath_pci_remove(struct pci_dev *pdev)
{
struct ieee80211_hw *hw = pci_get_drvdata(pdev);
struct ath_softc *sc = hw->priv;
+ enum ath9k_int status;
- if (pdev->irq)
+ if (pdev->irq) {
+ ath9k_hw_set_interrupts(sc->sc_ah, 0);
+ ath9k_hw_getisr(sc->sc_ah, &status); /* NB: clears ISR too */
+ sc->sc_flags |= SC_OP_INVALID;
free_irq(pdev->irq, sc);
+ }
ath_detach(sc);
+
pci_iounmap(pdev, sc->mem);
pci_release_region(pdev, 0);
pci_disable_device(pdev);
--
1.5.5
--
The patch you prvide doesn't want to apply. What code did you base this on? The first change listed doesn't work because there is no tasklet_kill() in core.c, and the line immediately after ath_stop there has "!sc->sc_invalid" instead of the "!(sc->sc_flags & SC_OP_INVALID)". The second fails because SC_OP_INVALID isn't defined. However, if your patch did apply to my code, I bet it'd solve the issue, based on what it says it does. --
I am on 2.6.27-rc6 and this patch is on top of my earlier patch
titled "[PATCH] ath9k: connectivity is lost after Group rekeying is
done". However this patch can be applied on top of latest wireless testing
too. I could apply this patch succesfully on top of wireless testing
git tree.
My git-describe says v2.6.27-rc6-1378-g34e512f.
There is no sc_invalid flag in "struct ath_softc" today. Where did
you get this variable from? It was removed in the following commit
-----------------------------------------------
commit f2c9705a05ecbc0d94216a3b042d5641e8bf70b1
Author: Sujith <Sujith.Manoharan@atheros.com>
Date: Mon Aug 11 14:05:08 2008 +0530
ath9k: Use bitfields for sc operations
Signed-off-by: Sujith Manoharan <Sujith.Manoharan@atheros.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
-----------------------------------------------
Which code base are you using?
--
Thanks for the patch Senthil. He is on Linus' v2.6.27-rc, you are on
wireless-testing so I've ported it (no changes, it just applies cleanly
now). This is also untested by me:
diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
index f6c4528..9a432e5 100644
--- a/drivers/net/wireless/ath9k/core.c
+++ b/drivers/net/wireless/ath9k/core.c
@@ -1336,6 +1336,8 @@ void ath_deinit(struct ath_softc *sc)
DPRINTF(sc, ATH_DBG_CONFIG, "%s\n", __func__);
+ tasklet_kill(&sc->intr_tq);
+ tasklet_kill(&sc->bcon_tasklet);
ath_stop(sc);
if (!sc->sc_invalid)
ath9k_hw_setpower(sc->sc_ah, ATH9K_PM_AWAKE);
diff --git a/drivers/net/wireless/ath9k/main.c b/drivers/net/wireless/ath9k/main.c
index c5107f2..75acf54 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1400,10 +1400,16 @@ static void ath_pci_remove(struct pci_dev *pdev)
{
struct ieee80211_hw *hw = pci_get_drvdata(pdev);
struct ath_softc *sc = hw->priv;
+ enum ath9k_int status;
- if (pdev->irq)
+ if (pdev->irq) {
+ ath9k_hw_set_interrupts(sc->sc_ah, 0);
+ ath9k_hw_getisr(sc->sc_ah, &status); /* NB: clears ISR too */
+ sc->sc_flags |= SC_OP_INVALID;
free_irq(pdev->irq, sc);
+ }
ath_detach(sc);
+
pci_iounmap(pdev, sc->mem);
pci_release_region(pdev, 0);
pci_disable_device(pdev);
--
I'm seeing the same latency issue I reported originally with the latest -tip kernel and WITHOUT irqpoll. Apparently there's something else to this... - Steven --
Steven, Can you reproduce this? If so can you tell us how? If you can reproduce can you try to test this with wireless-testing as well? That is still based on 2.6.26-rc6 but its just has all the new wireless development stuff queued up for 2.6.28 so far. Everything else is the same. Luis --
On Fri, Sep 19, 2008 at 4:53 PM, Luis R. Rodriguez It's a weird problem, because I'm able to reproduce it by running a kernel with ath9k loaded for a few hours. In this particular case, I only ran this kernel for about 3.5 hours , but the delay when VT switching was getting noticeable. Unload/reload of ath9k and the latency disappeared again. It's definitely ath9k's doing, but I don't know how. I'm finding the delay is most noticeable after 7 hours uptime with ath9k loaded. I'll pull wireless-testing and run that kernel for another 8 hours. What's the URL for the repo? Surely you don't mean the wireless-testing on git.kernel.org that hasn't been touched for almost a week? I kind of wish there was a better debugging facility in ath9k. With ATH_DBG_INTERRUPTS, I see several hundred messages per second, which fills log files up like crazy (I switched to ftrace_printk as Ingo suggested, which has alleviated this somewhat). Is there a way to make it less verbose (perhaps so that it prints data at a rate of no more than 2 or 3 messages per second)? Or maybe the rate at which it prints this stuff is indicative something's wrong? - Steven --
Yes, that's the guy. This week was the Kernel Summit and the Linux Plumbers Conference too so there is understandably some lag in pulls as developers are getting drunk. Patches are welcomed ;) Also, we have a few interrupt levels you can use. In this case we are No, interrupts do happen quite often so that is what you get, perhaps we can enhance this by not printing on some interrupts but the point is to debug interrupts after all. To help us debug in userspace in a nicer way we can use sadc and sar. So you can try this: On a window leave this running: /usr/lib/sysstat/sadc -I 1 irqlog.sar And when you are seeing your latency issues, run this: sar -I 17 -f irqlog.sar | less That is capital "i", this will show you the monitor results of number of interrupts on IRQ 17. You want to see if the rate of change is huge when you have latency issues. If not then its most likely an interrupt storm by ath9k (which is my suspect right now) but something else. Luis --
On Fri, Sep 19, 2008 at 6:13 PM, Luis R. Rodriguez --
On Fri, Sep 19, 2008 at 6:13 PM, Luis R. Rodriguez Thanks for informing me about this useful utility. I wasn't aware of its existence. However, this profiler seems broken. When I check individual or 'ALL' statistics, they're all 0.00 IRQs/sec, but SUM shows the actual number of total IRQs per second. I suspect it's only broken because of some kernel config option not being enabled. Anyway, there definitely is an interrupt storm here. Let me demonstrate... This is when the system is operating normally: [...] 19:14:39 sum 1271.00 19:14:40 sum 1286.14 19:14:41 sum 1373.00 19:14:42 sum 1691.00 19:14:43 sum 1380.00 19:14:44 sum 902.00 19:14:45 sum 772.00 19:14:46 sum 1155.00 19:14:47 sum 804.00 19:14:48 sum 809.00 19:14:49 sum 965.00 19:14:50 sum 1077.23 Average: sum 615.69 Here's while the latency issue is occurring, which can now safely be called an interrupt storm: [...] 20:21:07 sum 18864.08 20:21:08 sum 21152.73 20:21:09 sum 19221.35 20:21:10 sum 20786.14 20:21:11 sum 19891.92 20:21:12 sum 19816.16 20:21:13 sum 19430.69 20:21:14 sum 20432.00 20:21:15 sum 19657.84 20:21:16 sum 19386.00 20:21:17 sum 19386.79 20:21:18 sum 20152.94 20:21:19 sum 19077.88 20:21:20 sum 19852.00 Average: sum 19958.91 And here's the change as I unload and reload ath9k: [...] 20:21:38 sum 20138.83 20:21:40 sum 21576.09 20:21:41 sum 17678.38 20:21:42 sum 19834.34 20:21:43 sum 19845.10 20:21:44 sum 19470.00 20:21:45 sum 9273.00 <--- ath9k unloaded/reloaded 20:21:46 sum 187.00 20:21:47 sum 1066.00 20:21:48 sum 610.00 20:21:49 sum 106.93 20:21:50 sum ...
Worth noting that this change (modified version of the patch by
Senthil from earlier) made it so it didn't give an 'IRQ 17: nobody
cared' when I reloaded ath9k during the IRQ storm...
---
ath9k: flush pending IRQs before freeing IRQ
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
diff --git a/drivers/net/wireless/ath9k/main.c
b/drivers/net/wireless/ath9k/main.c
index 0997196..591eec1 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1411,10 +1411,17 @@ static void ath_pci_remove(struct pci_dev *pdev)
{
struct ieee80211_hw *hw = pci_get_drvdata(pdev);
struct ath_softc *sc = hw->priv;
+ enum ath9k_int status;
if (pdev->irq)
+ {
+ ath9k_hw_set_interrupts(sc->sc_ah, 0);
+ ath9k_hw_getisr(sc->sc_ah, &status); /* NB: clears ISR too */
+ sc->sc_invalid = 1;
free_irq(pdev->irq, sc);
+ }
ath_detach(sc);
+
pci_iounmap(pdev, sc->mem);
pci_release_region(pdev, 0);
pci_disable_device(pdev);
--
Thanks, for letting us know -- John, seems that IRQ patch by Senthil is 27 material as it fixes an IRQ issue (nobody cared complaint, and it disables the IRQ otherwise) upon rmmod during an IRQ storm. Luis --
I think it might help if you can see the .sar file yourself. I've uploaded a gzipped copy (great compression ratio in this case for merely using gzip!). http://www.uplinklabs.net/~tycho/linux/irqlog-storm.sar.gz - Steven --
mcgrof@tesla ~/tmp/irq $ sar -f irqlog-storm.sar -A Invalid system activity file: irqlog-storm.sar Perhaps we its a different format than what my sar can read... So here's what we need to do -- try to zero on in what causes this IRQ storm. Most likely its some strange event that triggers it, perhaps some sort of noise which makes the device try to adapt and hits a path we are not handling correctly yet. Can you provide the initial log of when ath9k gets loaded? This tells us what card it is, like AR5416 or AR9280, etc. Luis --
It changes like every release... Not a very useful thing, that. johannes
Ah -- oh well. OK so yeah you have an interrupt storm, this is evident. Do you think there is a way you can try to trigger the interrupt storm so it happens < than after 7 hours or however long its taking you? If so then it'd be good to get your debug out using the interrupt debug. If that's not useful then we can enable all debug stuff (file will be huge). Try turning on the microwave, turn on all your bluetooth devices, use your 2.4 GHz wireless phone, use your xbox 360 controllers and ps3 controllers all at once :D Luis --
On Sat, Sep 20, 2008 at 2:03 PM, Luis R. Rodriguez It's apparently 7 idle hours and ~1-2 hours when I'm actively browsing the web. I don't know what in particular triggers it, so I guess I'd just need to enable debug logging and then start running through all Heh. I don't have an Xbox 360 or PS3 (sad, I know). Do have several 2.4GHz phones. ;) --
Microwaves are the best to make the signal wet poop BTW, if you have that option available. Not sure if we can trigger the device that way but I'm just brainstorming here. Luis --
On Sat, Sep 20, 2008 at 2:26 PM, Luis R. Rodriguez Well, I've got the thing set to ATH_DBG_ANY (I want this damn bug found!), which will certainly be *interesting*. In 5 minutes, the log has jumped 9 megabytes. I've also got sadc monitoring IRQs so I know for certain when the interrupt storm starts. I'll let you know how this goes. - Steven --
OK, here's the time index at which it b0rked: 18:28:38 sum 1106.00 18:28:39 sum 1037.62 18:28:40 sum 1069.00 18:28:41 sum 1167.00 18:28:42 sum 1155.00 18:28:43 sum 1339.00 18:28:44 sum 18355.00 18:28:45 sum 17845.45 18:28:46 sum 15285.00 18:28:47 sum 17511.00 18:28:48 sum 17568.69 18:28:49 sum 17704.04 18:28:50 sum 18566.67 18:28:51 sum 18913.13 And the entire log of the minute 18:28 is here: http://www.uplinklabs.net/~tycho/linux/18.28.00-59.log.gz It looks pretty crazy. - Steven --
Thanks, I'll take a look when I get a chance (remember aggretion is in the queue too) unless someone beats me to it. Asking in advance -- if you can also provide a log of minutes 43, 44 and 45 that would be great, but this may just be enough. Luis --
Oh wait, those are seconds, nevermind :) Luis --
On Sun, Sep 21, 2008 at 11:54 PM, Luis R. Rodriguez You may want to take a quick look at it now. I certainly don't mean you should push aggregation aside, but the log looks kind of screwy at 18:28:43 (look at line 4972 to see what I mean). It looks as though it did printk()s from multiple threads with no locking mechanism. It's possible I will need to re-capture the error, and I'd like to know if this log will help or not. - Steven --
At first glance I see tons of MIB interrupts being spewed, besides the ackward prints you see too my main concern so far is the MIB interrupts as we don't handle ANI. My initial thought is because don't hanld ANI yet we are letting MIB interrupts to continue to spew out. Luis Please test the following patch. diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c index c262ef2..9a51739 100644 --- a/drivers/net/wireless/ath9k/core.c +++ b/drivers/net/wireless/ath9k/core.c @@ -681,10 +681,12 @@ int ath_open(struct ath_softc *sc, struct ath9k_channel *initial_chan) * Enable MIB interrupts when there are hardware phy counters. * Note we only do this (at the moment) for station mode. */ +#if 0 if (ath9k_hw_phycounters(ah) && ((sc->sc_ah->ah_opmode == ATH9K_M_STA) || (sc->sc_ah->ah_opmode == ATH9K_M_IBSS))) sc->sc_imask |= ATH9K_INT_MIB; +#endif /* * Some hardware processes the TIM IE and fires an * interrupt when the TIM bit is set. For hardware --
On Mon, Sep 22, 2008 at 1:06 AM, Luis R. Rodriguez It didn't apply to -rc7, but I managed to apply it manually (apparently you made this for wireless-testing?) and added the #if 0. After 7 hours uptime with the driver, no apparent interrupt storm. I'll do more thorough testing later, but I have a 3 hour drive today, and some packing to do, so it'll need to wait a bit. - Steven --
Yeah I did it for wireless-testing. Good to hear this so far has solved the issue. I'll port it to 27 and post it once I get your blessings that this fixed it by a Tested-by. Luis --
On Mon, Sep 22, 2008 at 9:01 AM, Luis R. Rodriguez Okay. I'm in Seattle now. Yes, it solved the issue, but surely an #if 0 isn't a proper solution. What's the actual bug here? I'm afraid the meaning of what's going on here isn't exactly intuitive. Also, I'm doing an 8 hour run with this patch tonight. Maybe more, depends on when I wake up. ;) I think if this resolves the issue on two separate ≈8 hour runs, it should be considered the solution to the issue, unless it resurfaces. - Steven
I am only 25 minutes into the run, but I feel this is worth reporting before I forget. No IRQ storm thus far, but the connection keeps dropping. Or rather, it doesn't disassociate, but while I was transferring a large file, it had points of time where it suddenly lost and reacquired the connection (signal strength fluctuates wildly between 70-90% and then 0% for a second, and then back to fluctuating). Perhaps this is related to the lack of aggregation? - Steven
For 2.6.27 it will be unless people are willing to accept a patch to Check linux-wireless for my patch on MIB event comments. But I'll summarize here for you. We get MIB interrupts to update statistics on a number of PHY related elements. We then use this information to process ANI -- some automatic noise immunity work. If you don't process the information for ANI it seems you get the hardware treating this as untreated and hence you get an interrupt storm. So since 2.6.27 doesn't have ANI implemented the MIB interrupts are not useful except to collect internal statistics. Since you are so eager to test this with a "proper" solution I will send you a patch to test ANI for 2.6.27. We'll leave it up to the maintainers and distributions to decide whether or not to pick this up. But for that let's leave that for another thread. Luis --
On Sat, Sep 20, 2008 at 1:57 PM, Luis R. Rodriguez How is the best way to approach this? I assume ATH_DBG_INTERRUPT and pray like hell my /var partition doesn't fill up first? Maybe I'll just symlink /var/log/messages to a file on a substantially larger [10238.347956] ath9k: 0.1 [10238.348028] vendor=8086 device=27d2 [10238.348033] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 [10238.348050] ath9k 0000:03:00.0: setting latency timer to 64 [10238.480781] phy1: Selected rate control algorithm 'ath9k_rate_control' [10238.483320] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17 It claims to be a 5416, but I was under the impression that it was the 5418 that came with the MacBook Pro. --
lspci output is outdated, we have submitted some new updates to it. The output of the driver should be the one you can rely on for chipset information for now. The problem is besides the PCI device ID there is also EEPROM information which further identifies the chipset like devices can be dual band (2 GHz and 5 GHz capable) and some other single band -- using the same PCI device:vendor ID. So for now we've tried to update the PCI id stuff (not sure when it'll be merged) to tell your "family chipset" and in parens the MAC or known model. It can still use some cleanup but yeah, hope this helps understand this a little better. You can see more info on this here: http://wireless.kernel.org/en/users/Drivers/ath9k#supportedchips Luis --
On Thu, Sep 18, 2008 at 3:01 PM, Luis R. Rodriguez Will do! And actually, to be precise, the issue occurred not on the Linus has been fairly strict lately, so I am as uncertain as you are. It really should get merged, because having 2.6.27 with [partially] broken wireless networking is kind of stupid. In the worst case, isn't it possible to do a revert to get the old aggregation crap back in? --
