ath9k: massive unexplained latency in 2.6.27 (rc5, rc6, probably others)

Previous thread: Interrupt Handler Invocation by Singaravelan Nallasellan on Thursday, September 18, 2008 - 10:57 am. (2 messages)

Next thread: [GIT PULL] 2.6.27-rc6 updates for s390 by Martin Schwidefsky on Thursday, September 18, 2008 - 11:23 am. (1 message)
From: Steven Noonan
Date: Thursday, September 18, 2008 - 11:23 am

I'm encountering a very odd issue which only seems to crop up while I
sleep at night. In the morning, I go back to my Linux machine and
there's a ridiculously bad latency I immediately notice in Firefox's
redraw which I cannot explain. glxgears shows the typical 5K FPS, so
it seems like graphics are normal, but there's something causing
horrendous redraw times for Firefox. If I exit X11, the problem then
extends to my virtual terminals. If I switch a VT, it takes about a
second to redraw the whole screen top-to-bottom[1]. I can also get the
same issue if I do page-up/page-down in less, or pretty much anything
that causes the screen to redraw.

I'm using 2.6.27-rc6-tip right now, but this problem occurs on
pristine rc5 and rc6 as well. uptime is only 7 hours, and this problem
seems to get progressively worse over time. Strangely enough,
unloading the 'ath9k' driver causes the latency to outright -vanish-
instantaneously (I only discovered this -after- I made the YouTube
video cited below, so the video doesn't demonstrate this tidbit,
unfortunately). I wish I knew what these symptoms are pointing to,
because this makes no sense to me at all. Is there some hardware
interrupt going bonkers? The only thing I can think of is that I've
got 'irqpoll' on the command line. I have it because I had to reload
the ath9k driver at one point, and this was the result (caused ath9k
to become a useless driver until I rebooted, which re-enabled IRQ 17):

[53997.493777] ath9k: driver unloaded
[54001.098532] ath9k: 0.1
[54001.098833] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[54001.099013] ath9k 0000:03:00.0: setting latency timer to 64
[54001.233825] phy1: Selected rate control algorithm 'ath9k_rate_control'
[54001.236980] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
[54002.136218] irq 17: nobody cared (try booting with the "irqpoll" option)
[54002.136229] Pid: 0, comm: swapper Tainted: P 2.6.27-rc6-00036-ga551b98 #2
[54002.136234]
[54002.136235] Call ...
From: Luis R. Rodriguez
Date: Thursday, September 18, 2008 - 11:34 am

irqpoll is a monster of evil and that should make your system crawl to
its knees. I would advise instead we work with you fixing the the
missed interrupts issue upon rmmod.

  Luis
--

From: Luis R. Rodriguez
Date: Thursday, September 18, 2008 - 11:42 am

Also, please provide the output of

cat /proc/interrupts

and also please do not cross post to all these lists, just use
linux-wireless or ath9k.

  Luis
--

From: Steven Noonan
Date: Thursday, September 18, 2008 - 12:00 pm

On Thu, Sep 18, 2008 at 11:42 AM, Luis R. Rodriguez

Note that the problem necessitating use of irqpoll in the first place
seems to only happen under certain conditions. I am unsure what these
conditions are. Before 'ath9k: connectivity is lost after Group
rekeying is done', I had used rmmod/modprobe as my solution to the
issue, which triggered the IRQ issue.

alcarin steven # cat /proc/interrupts
           CPU0       CPU1
0x0:      63227          0   IO-APIC-edge      hpet
0x8:          1          0   IO-APIC-edge      rtc0
0x9:      13080          0   IO-APIC-fasteoi   acpi
0xe:       8195          0   IO-APIC-edge      ide0
0xf:          0          0   IO-APIC-edge      ide1
0x10:         36          0   IO-APIC-fasteoi   uhci_hcd:usb5
0x11:      10645          0   IO-APIC-fasteoi   ath
0x12:         42          0   IO-APIC-fasteoi   uhci_hcd:usb4
0x17:        919          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
0x13:      32885          0   IO-APIC-fasteoi   uhci_hcd:usb3,
ata_piix, ohci1394
0x200100:          1          0   PCI-MSI-edge      eth0
0x16:        223          0   IO-APIC-fasteoi   HDA Intel
NMI:          0          0   Non-maskable interrupts
LOC:      78087      95718   Local timer interrupts
RES:      11576      16384   Rescheduling interrupts
CAL:       6862       8889   Function call interrupts
TLB:         54         41   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts

Sorry, but in the past I've posted to linux-wireless, ath9k-devel, and
all the maintainers of ath9k and didn't get a single response (except
a 'me too' from a fellow ath9k user). I didn't just want to hear
crickets this time.

- Steven
--

From: Steven Noonan
Date: Thursday, September 18, 2008 - 12:23 pm

I wonder if this is related:

alcarin ~ # dmesg | grep GPE
[    0.166806] ACPI: EC: GPE = 0x17, I/O: command/status = 0x66, data = 0x62
[  685.409077] ACPI: EC: GPE storm detected, disabling EC GPE
--

From: Justin Mattock
Date: Thursday, September 18, 2008 - 1:18 pm

FWIW, you might want to try this pach
in this thread:
http://www.gossamer-threads.com/lists/linux/kernel/966076
I was receiving a disable irq #23 this patch"knock on wood",
seems to help with this issue.


-- 
Justin P. Mattock
--

From: Steven Noonan
Date: Thursday, September 18, 2008 - 1:25 pm

Both the patches there affect EHCI, though. If I was having a USB
controller fink out with "IRQ X: nobody cared", then that patch would
probably indeed be applicable.
--

From: Luis R. Rodriguez
Date: Thursday, September 18, 2008 - 1:44 pm

You mean this patch:

[PATCH] ath9k: connectivity is lost after Group rekeying is done
http://marc.info/?l=linux-wireless&m=122163541519736&w=2

So let me get this straight -- you applied this new patch, and haven't

Understood, but I also have used this before with ath9k and I got
exactly the same results you did -- I just refused to use it again and
just try to fix the issues present.

ath9k issues tons of interrupts, not sure why irqpoll option would
cause latency so bad as the interrupts *are* handled. Not sure
*exactly* how irqpoll works but its description mentions using it
forces each interrupt handler on the IRQ line to check the interrupt
is for it. You have to keep in mind that not only are ath9k interrupts
then being sent to the devices on its line but it would seem that all
other devices on each line would suffer from the interrupts of the
other guys. Why ath9k would be the *only* culprit of causing latency


But it was this interrupt line which had an interrupt not handled.

I'm not sure why this would happen. Can't we rule out ath9k then since


Patches speak more than words, but yeah sorry, we should have
addressed this there. I've personally have just been busy with
tackling aggregation.

 Luis
--

From: Justin Mattock
Date: Thursday, September 18, 2008 - 2:31 pm

On Thu, Sep 18, 2008 at 1:44 PM, Luis R. Rodriguez

Appologize about that, I'm wondering
what is or was the main commit that might be causing
all of these disable irq #'s in usb, wireless, etc..
just doing a quick search one cant help but to think
maybe this is something with tick.(but could be wrong);

-- 
Justin P. Mattock
--

From: Steven Noonan
Date: Thursday, September 18, 2008 - 2:49 pm

On Thu, Sep 18, 2008 at 1:44 PM, Luis R. Rodriguez

I hadn't at the time of that writing, no. I saw it as a fix for future
ignored IRQs, and hadn't noticed any difference with it on or off. So
it seemed like since there was no consequence having it enabled, why
not leave it enabled all the time?

Now that I'm aware it's the spawn of satan, I'm trying with it off. So
far, so good. But I haven't had need to reload ath9k so frequently,
and even when I do, I can't reproduce the specific conditions which


No, it's in hex. 0x17 = 23, 0x11 = 17. IRQ 17 is the one that pooped

If I do, I'll need to know what precisely to do about it. What debug

Which is far more important, I agree. It's annoying to get speeds
<802.11b on my pre-802.11n capable chipset and network. ;)

- Steven
--

From: Luis R. Rodriguez
Date: Thursday, September 18, 2008 - 3:01 pm

Sure, OK so in ath9k's core.h we have a macro used to indicate what
type of debugging stuff gets print out. By default we only leave
enabled the fatal stuff.

#define DBG_DEFAULT (ATH_DBG_FATAL )

You want to add to this ATH_DBG_INTERRUPT as follows:

#define DBG_DEFAULT (ATH_DBG_FATAL | ATH_DBG_INTERRUPT)

Then since the issue seems to come up when you rmmod, simply try to
download do some large RXing or TXing or both and then rmmod. Capture

Yeah exactly, I hope the work we do will get in for 27. We'll see.

 Luis
--

From: Luis R. Rodriguez
Date: Thursday, September 18, 2008 - 3:08 pm

Oh and you should at least get some 11n rates, just not aggregation :)

  Luis
--

From: Steven Noonan
Date: Thursday, September 18, 2008 - 4:04 pm

On Thu, Sep 18, 2008 at 3:08 PM, Luis R. Rodriguez

I always get "1 megabit/sec". It actually measures (at best) to be a
throughput of 8mbits/sec, though the average (due to strange random
packet loss) is about 4mbits/sec. -- And that was with aggregation
re-enabled via your patch. Without, it never exceeds 1mbit/sec
(actual, not just as indicated by iwconfig.

- Steven
--

From: Luis R. Rodriguez
Date: Thursday, September 18, 2008 - 4:10 pm

Yeah, iwconfig reports a wrong rate, should be easy to fix but I

That patch just fixes aggregation for mac80211 after the TX MQ

Right.

  Luis
--

From: Johannes Berg
Date: Thursday, September 18, 2008 - 4:16 pm

the rate patch I posted today will fix it. not for .27 though

johannes
From: Steven Noonan
Date: Thursday, September 18, 2008 - 7:52 pm

This IRQ debugging is lots of fun. My /var/log/messages jumped in size
to 241MB. :)

My test case was a file transfer that basically copies 50MB from a
remote machine's /dev/urandom and writes it to /dev/null locally. That
should be a pretty nice large RX. And jeez, I'm seeing tremendous
download speeds now and I have no idea what is different about my
configuration. I'm getting up to 24 megabit per second transfer rates
(actual), which is incredibly unusual for ath9k.

I wasn't able to reproduce the IRQ problems on -tip or Linus' current
tree (both with the group rekeying fix applied, which may or may not
be a good idea for reproducing the issue). Even tried doing
rmmod/modprobe in fast succession multiple times. Couldn't get it to
throw up.

I wonder... My rmmod/modprobe originally was used after the improper
group rekeying before in order to get it back to normal. Perhaps the
encryption has to go wonky before this IRQ issue surfaces? Another
possibility is the earlier mentioned "ACPI: EC: GPE storm detected,
disabling EC GPE". I don't know if it's related or not, but I haven't
waited for it to show up before testing.

Oh well, if the IRQ problem resurfaces (or if I get a ridiculous
latency tomorrow morning with irqpoll off), I'll try to get some
useful data to help debug it.
--

From: Luis R. Rodriguez
Date: Thursday, September 18, 2008 - 8:01 pm

Thanks for testing, and glad to see this is resolved.

  Luis
--

From: Steven Noonan
Date: Friday, September 19, 2008 - 12:29 am

On Thu, Sep 18, 2008 at 8:01 PM, Luis R. Rodriguez

Damn. It's back. I was using wireless normally this evening. Browsing
the web, that kind of thing, and then the wireless inexplicably
dropped (even with the group rekeying patch), so I unloaded/reloaded
the module. This popped up in dmesg:

[ 3834.375658] vendor=8086 device=27d2
[ 3834.375666] ath9k 0000:03:00.0: PCI INT A disabled
[ 3834.375716] ath9k: driver unloaded
[ 3838.552419] ath9k: 0.1
[ 3838.552502] vendor=8086 device=27d2
[ 3838.552511] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[ 3838.552532] ath9k 0000:03:00.0: setting latency timer to 64
[ 3838.688924] phy1: Selected rate control algorithm 'ath9k_rate_control'
[ 3838.693652] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17
[ 3839.427125] irq 17: nobody cared (try booting with the "irqpoll" option)
[ 3839.427136] Pid: 0, comm: swapper Tainted: P
2.6.27-rc6-tip-00478-g74f1a36 #1
[ 3839.427141] Call Trace:
[ 3839.427145]  <IRQ>  [<ffffffff802219c5>] ? read_hpet+0x9/0x1c
[ 3839.427165]  [<ffffffff8026af71>] __report_bad_irq+0x3d/0x8c
[ 3839.427172]  [<ffffffff8026b0c6>] note_interrupt+0x106/0x160
[ 3839.427180]  [<ffffffff8026b801>] handle_fasteoi_irq+0xad/0xda
[ 3839.427188]  [<ffffffff8020f7b0>] do_IRQ+0x10c/0x190
[ 3839.427194]  [<ffffffff8020c551>] ret_from_intr+0x0/0xa
[ 3839.427198]  <EOI>  [<ffffffff8026c6f0>] rcu_pending+0x62/0x6e
[ 3839.427211]  [<ffffffff8025bc11>] ? tick_nohz_stop_sched_tick+0x2e4/0x2f3
[ 3839.427218]  [<ffffffff8020ad94>] cpu_idle+0x7b/0xdb
[ 3839.427226]  [<ffffffff8060c921>] rest_init+0x75/0x77
[ 3839.427231] handlers:
[ 3839.427234] [<ffffffffa0240238>] (ath_isr+0x0/0x170 [ath9k])
[ 3839.427263] Disabling IRQ #17
[ 3842.263699] ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 3848.035003] ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 3848.432701] ADDRCONF(NETDEV_UP): wlan0: link is not ready
[ 3850.216947] wlan0: authenticate with AP 00:1e:52:79:4d:01
[ 3850.217027] wlan0: authenticate with AP ...
From: Ingo Molnar
Date: Friday, September 19, 2008 - 1:17 am

in tip/master there's an ftrace_printk() facility. You can just replace 
the verbose printk()s with ftrace_printk() and see the result in 
/debug/tracing/trace.

This sort of tracing has very low overhead and can be used in an easy 
ad-hoc manner with no extra infrastructure. Here's a few quick-start 
links about how to enable the scheduler tracer:

 http://redhat.com/~mingo/sched-devel.git/readme-tracer.txt
 http://redhat.com/~mingo/sched-devel.git/howto-trace-latencies.txt

any ftrace_printk() you add in the kernel will show up in that trace. 

(If the scheduling events are uninteresting and clutter the output then 
you might want to remove the scheduler tracing entries from kernel/*.c 
by removing the trace_sched_*() calls or use a less noisy tracer.)

to get enough of a trace history you might want to increase the number 
of trace entries in /debug/tracing/trace_entries from 16K to 128K or so.

	Ingo
--

From: Steven Noonan
Date: Friday, September 19, 2008 - 1:22 am

Thanks for the suggestion, I'll do that. :)
--

From: Ingo Molnar
Date: Friday, September 19, 2008 - 1:32 am

one more suggestion: you should first check how large the 'scope' of 
your trace is - i.e. how long of a time span it can cover, in the 
workload that you are going to trace.

if it's too short - say just a few hundred milliseconds then you need a 
larger trace buffer or fewer trace entries.

another thing: you can auto-stop tracing on the failure condition. That 
way you dont have to worry about whether you notice the badness. 
Unfortunately there's no easy callback to just anonymously stop all 
tracing that goes on. There's ftrace_kill_atomic() but i'm not sure it 
properly stops the scheduler tracer.

What is needed is to set kernel/trace/trace.c's tracing_disabled 
variable to 0. I've Cc:-ed Steve Rostedt - Steve, why is there no easy 
ad-hoc way to trigger a stop-tracing event without elaborate changes?

	Ingo
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 6:23 pm

I hate trips where my email box becomes filled. I skip over a lot of 
emails that I should not have.


I think I added a way to do this. Hmm, I think it was in -rt that I added 
it, and have not gotten around to sending you the updates.

Yes, we need an easy way to disable the tracer. I think we have 
ftrace_disable() function. Perhaps that only disables tracing that checks 
the tracing_disabled variable.

Anyway, I made sure the new ring_buffer code has an easy way to disable 
it.

-- Steve

--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 12:20 pm

we have kill_ftrace(), but that will permanently zap it.

	Ingo
--

From: Steven Rostedt
Date: Saturday, September 27, 2008 - 12:28 pm

OK, on Monday, as I work on getting ftrace to use the ring buffer, I'll 
also look to see which features are only in -rt that need to go to tip.

-- Steve

--

From: Steven Noonan
Date: Friday, September 19, 2008 - 1:45 am

It'd be extremely convenient to have a tracer that did absolutely
nothing but display ftrace_printk()s.
--

From: Ingo Molnar
Date: Friday, September 19, 2008 - 1:51 am

yeah, had the same thought.

Could you please contribute one? :) I'd suggest you take the smallest 
tracer as a template: trace_sched_switch.c and turn that into 
trace_nop.c - and change all the function names and ASCII and kconfig 
glue to 'nop'/'NOP'?

That can then be used as a basis for additional ftrace plugins, or for 
ad-hoc temporary tracing.

	Ingo
--

From: Steven Noonan
Date: Friday, September 19, 2008 - 1:22 am

Do pardon me if this is incredibly stupid, but I have some thoughts
about ath9k interrupt handling. Let me know if there's any validity
here. And if there's any invalid points, please help me understand
what's going on. :)

First of all, I know for certain the ath9k interrupt handler is
registered before the driver prints the line saying the name of the
device ("phy1: Atheros 5416"). So any IRQs to IRQ 17 should be seen by
the ath9k handler (ath_isr) after that point, right? Shortly after
that point is where we're seeing the "nobody cared". This seems to
suggest that it registered the handler successfully (because we didn't
get an error message), but the handler returned IRQ_NONE.

Second of all, I'm looking at the ath9k interrupt handler right now,
and there are a few cases where it returns IRQ_NONE. And here's where
I'm a bit fuzzy. Since there could be any number of things using IRQ
17 (though in my case, ath9k is on its own dedicated IRQ), it seems
odd that you check the value of sc->sc_invalid, since the cookie
passed to the handler might not actually be ath9k's cookie if multiple
drivers have registered IRQ handlers for that particular IRQ. Who
knows if what you're reading is even valid? Heck, what if some driver
uses a NULL for their cookie (however unlikely)? You'd get a
segmentation fault on the second line of the interrupt handler. Of
course, I could be completely wrong: does parent interrupt handler
check to see which device driver owns the particular device signaling
an IRQ and then call the appropriate handler?

- Steven
--

From: Senthil Balasubramanian
Date: Friday, September 19, 2008 - 7:28 am

I am able to reproduce this IRQ nobody cared issue in my setup and the
following patch seems to be fixing the issue. Please try it out and let
me know if it solves your issue in your setup.

**********
IRQs should be disabled before calling free_irq. Also clear pending IRQs.

Signed-off-by: Senthil Balasubramanian <senthilkumar@atheros.com>
---
 drivers/net/wireless/ath9k/core.c |    2 ++
 drivers/net/wireless/ath9k/main.c |    8 +++++++-
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
index c262ef2..c007dd2 100644
--- a/drivers/net/wireless/ath9k/core.c
+++ b/drivers/net/wireless/ath9k/core.c
@@ -1183,6 +1183,8 @@ void ath_deinit(struct ath_softc *sc)
 
 	DPRINTF(sc, ATH_DBG_CONFIG, "%s\n", __func__);
 
+	tasklet_kill(&sc->intr_tq);
+	tasklet_kill(&sc->bcon_tasklet);
 	ath_stop(sc);
 	if (!(sc->sc_flags & SC_OP_INVALID))
 		ath9k_hw_setpower(sc->sc_ah, ATH9K_PM_AWAKE);
diff --git a/drivers/net/wireless/ath9k/main.c b/drivers/net/wireless/ath9k/main.c
index aca893a..559e0e8 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1781,10 +1781,16 @@ static void ath_pci_remove(struct pci_dev *pdev)
 {
 	struct ieee80211_hw *hw = pci_get_drvdata(pdev);
 	struct ath_softc *sc = hw->priv;
+	enum ath9k_int status;
 
-	if (pdev->irq)
+	if (pdev->irq) {
+		ath9k_hw_set_interrupts(sc->sc_ah, 0);
+		ath9k_hw_getisr(sc->sc_ah, &status);	/* NB: clears ISR too */
+		sc->sc_flags |= SC_OP_INVALID;
 		free_irq(pdev->irq, sc);
+	}
 	ath_detach(sc);
+
 	pci_iounmap(pdev, sc->mem);
 	pci_release_region(pdev, 0);
 	pci_disable_device(pdev);
-- 
1.5.5

--

From: Steven Noonan
Date: Friday, September 19, 2008 - 9:42 am

The patch you prvide doesn't want to apply. What code did you base this 
on?

The first change listed doesn't work because there is no tasklet_kill() in 
core.c, and the line immediately after ath_stop there has 
"!sc->sc_invalid" instead of the "!(sc->sc_flags & SC_OP_INVALID)".

The second fails because SC_OP_INVALID isn't defined.

However, if your patch did apply to my code, I bet it'd solve the issue, 
based on what it says it does.
--

From: Senthil Balasubramanian
Date: Friday, September 19, 2008 - 10:58 am

I am on 2.6.27-rc6 and this patch is on top of my earlier patch
titled "[PATCH] ath9k: connectivity is lost after Group rekeying is
done". However this patch can be applied on top of latest wireless testing
too. I could apply this patch succesfully on top of wireless testing
git tree.

My git-describe says v2.6.27-rc6-1378-g34e512f.

There is no sc_invalid flag in "struct ath_softc" today. Where did
you get this variable from? It was removed in the following commit

-----------------------------------------------
commit f2c9705a05ecbc0d94216a3b042d5641e8bf70b1
Author: Sujith <Sujith.Manoharan@atheros.com>
Date:   Mon Aug 11 14:05:08 2008 +0530

    ath9k: Use bitfields for sc operations

    Signed-off-by: Sujith Manoharan <Sujith.Manoharan@atheros.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
-----------------------------------------------

Which code base are you using?
--

From: Luis R. Rodriguez
Date: Friday, September 19, 2008 - 11:23 am

Thanks for the patch Senthil. He is on Linus' v2.6.27-rc, you are on
wireless-testing so I've ported it (no changes, it just applies cleanly
now). This is also untested by me:

diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
index f6c4528..9a432e5 100644
--- a/drivers/net/wireless/ath9k/core.c
+++ b/drivers/net/wireless/ath9k/core.c
@@ -1336,6 +1336,8 @@ void ath_deinit(struct ath_softc *sc)
 
 	DPRINTF(sc, ATH_DBG_CONFIG, "%s\n", __func__);
 
+	tasklet_kill(&sc->intr_tq);
+	tasklet_kill(&sc->bcon_tasklet);
 	ath_stop(sc);
 	if (!sc->sc_invalid)
 		ath9k_hw_setpower(sc->sc_ah, ATH9K_PM_AWAKE);
diff --git a/drivers/net/wireless/ath9k/main.c b/drivers/net/wireless/ath9k/main.c
index c5107f2..75acf54 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1400,10 +1400,16 @@ static void ath_pci_remove(struct pci_dev *pdev)
 {
 	struct ieee80211_hw *hw = pci_get_drvdata(pdev);
 	struct ath_softc *sc = hw->priv;
+	enum ath9k_int status;
 
-	if (pdev->irq)
+	if (pdev->irq) {
+		ath9k_hw_set_interrupts(sc->sc_ah, 0);
+		ath9k_hw_getisr(sc->sc_ah, &status);	/* NB: clears ISR too */
+		sc->sc_flags |= SC_OP_INVALID;
 		free_irq(pdev->irq, sc);
+	}
 	ath_detach(sc);
+
 	pci_iounmap(pdev, sc->mem);
 	pci_release_region(pdev, 0);
 	pci_disable_device(pdev);
--

From: Steven Noonan
Date: Friday, September 19, 2008 - 4:31 pm

I'm seeing the same latency issue I reported originally with the latest 
-tip kernel and WITHOUT irqpoll.

Apparently there's something else to this...

- Steven
--

From: Luis R. Rodriguez
Date: Friday, September 19, 2008 - 4:53 pm

Steven,

Can you reproduce this? If so can you tell us how? If you can reproduce
can you try to test this with wireless-testing as well? That is still
based on 2.6.26-rc6 but its just has all the new wireless development
stuff queued up for 2.6.28 so far. Everything else is the same.

  Luis
--

From: Steven Noonan
Date: Friday, September 19, 2008 - 5:01 pm

On Fri, Sep 19, 2008 at 4:53 PM, Luis R. Rodriguez

It's a weird problem, because I'm able to reproduce it by running a
kernel with ath9k loaded for a few hours. In this particular case, I
only ran this kernel for about 3.5 hours , but the delay when VT
switching was getting noticeable. Unload/reload of ath9k and the
latency disappeared again. It's definitely ath9k's doing, but I don't
know how. I'm finding the delay is most noticeable after 7 hours
uptime with ath9k loaded.

I'll pull wireless-testing and run that kernel for another 8 hours.
What's the URL for the repo? Surely you don't mean the
wireless-testing on git.kernel.org that hasn't been touched for almost
a week?

I kind of wish there was a better debugging facility in ath9k. With
ATH_DBG_INTERRUPTS, I see several hundred messages per second, which
fills log files up like crazy (I switched to ftrace_printk as Ingo
suggested, which has alleviated this somewhat). Is there a way to make
it less verbose (perhaps so that it prints data at a rate of no more
than 2 or 3 messages per second)? Or maybe the rate at which it prints
this stuff is indicative something's wrong?

- Steven
--

From: Luis R. Rodriguez
Date: Friday, September 19, 2008 - 6:13 pm

Yes, that's the guy. This week was the Kernel Summit and the Linux
Plumbers Conference too so there is understandably some lag in pulls
as developers are getting drunk.


Patches are welcomed ;)

Also, we have a few interrupt levels you can use. In this case we are

No, interrupts do happen quite often so that is what you get, perhaps
we can enhance this by not printing on some interrupts but the point
is to debug interrupts after all. To help us debug in userspace in a
nicer way we can use sadc and sar. So you can try this:

On a window leave this running:

/usr/lib/sysstat/sadc -I 1 irqlog.sar

And when you are seeing your latency issues, run this:

sar -I 17 -f irqlog.sar | less

That is capital "i", this will show you the monitor results of number
of interrupts on IRQ 17. You want to see if the rate of change is huge
when you have latency issues. If not then its most likely an interrupt
storm by ath9k (which is my suspect right now) but something else.

  Luis
--

From: Luis R. Rodriguez
Date: Friday, September 19, 2008 - 6:15 pm

On Fri, Sep 19, 2008 at 6:13 PM, Luis R. Rodriguez

--

From: Steven Noonan
Date: Friday, September 19, 2008 - 8:29 pm

On Fri, Sep 19, 2008 at 6:13 PM, Luis R. Rodriguez

Thanks for informing me about this useful utility. I wasn't aware of
its existence.

However, this profiler seems broken. When I check individual or 'ALL'
statistics, they're all 0.00 IRQs/sec, but SUM shows the actual number
of total IRQs per second. I suspect it's only broken because of some
kernel config option not being enabled.

Anyway, there definitely is an interrupt storm here. Let me demonstrate...

This is when the system is operating normally:
[...]
19:14:39          sum   1271.00
19:14:40          sum   1286.14
19:14:41          sum   1373.00
19:14:42          sum   1691.00
19:14:43          sum   1380.00
19:14:44          sum    902.00
19:14:45          sum    772.00
19:14:46          sum   1155.00
19:14:47          sum    804.00
19:14:48          sum    809.00
19:14:49          sum    965.00
19:14:50          sum   1077.23
Average:          sum    615.69


Here's while the latency issue is occurring, which can now safely be
called an interrupt storm:
[...]
20:21:07          sum  18864.08
20:21:08          sum  21152.73
20:21:09          sum  19221.35
20:21:10          sum  20786.14
20:21:11          sum  19891.92
20:21:12          sum  19816.16
20:21:13          sum  19430.69
20:21:14          sum  20432.00
20:21:15          sum  19657.84
20:21:16          sum  19386.00
20:21:17          sum  19386.79
20:21:18          sum  20152.94
20:21:19          sum  19077.88
20:21:20          sum  19852.00
Average:          sum  19958.91


And here's the change as I unload and reload ath9k:

[...]
20:21:38          sum  20138.83
20:21:40          sum  21576.09
20:21:41          sum  17678.38
20:21:42          sum  19834.34
20:21:43          sum  19845.10
20:21:44          sum  19470.00
20:21:45          sum   9273.00  <--- ath9k unloaded/reloaded
20:21:46          sum    187.00
20:21:47          sum   1066.00
20:21:48          sum    610.00
20:21:49          sum    106.93
20:21:50          sum ...
From: Steven Noonan
Date: Friday, September 19, 2008 - 8:40 pm

Worth noting that this change (modified version of the patch by
Senthil from earlier) made it so it didn't give an 'IRQ 17: nobody
cared' when I reloaded ath9k during the IRQ storm...

---

    ath9k: flush pending IRQs before freeing IRQ

    Signed-off-by: Steven Noonan <steven@uplinklabs.net>

diff --git a/drivers/net/wireless/ath9k/main.c
b/drivers/net/wireless/ath9k/main.c
index 0997196..591eec1 100644
--- a/drivers/net/wireless/ath9k/main.c
+++ b/drivers/net/wireless/ath9k/main.c
@@ -1411,10 +1411,17 @@ static void ath_pci_remove(struct pci_dev *pdev)
 {
        struct ieee80211_hw *hw = pci_get_drvdata(pdev);
        struct ath_softc *sc = hw->priv;
+       enum ath9k_int status;

        if (pdev->irq)
+       {
+               ath9k_hw_set_interrupts(sc->sc_ah, 0);
+               ath9k_hw_getisr(sc->sc_ah, &status);    /* NB: clears ISR too */
+               sc->sc_invalid = 1;
                free_irq(pdev->irq, sc);
+       }
        ath_detach(sc);
+
        pci_iounmap(pdev, sc->mem);
        pci_release_region(pdev, 0);
        pci_disable_device(pdev);
--

From: Luis R. Rodriguez
Date: Saturday, September 20, 2008 - 1:56 pm

Thanks, for letting us know -- John, seems that IRQ patch by Senthil is
27 material as it fixes an IRQ issue (nobody cared complaint, and it
disables the IRQ otherwise) upon rmmod during an IRQ storm.

  Luis
--

From: Steven Noonan
Date: Saturday, September 20, 2008 - 8:21 am

I think it might help if you can see the .sar file yourself. I've
uploaded a gzipped copy (great compression ratio in this case for
merely using gzip!).

http://www.uplinklabs.net/~tycho/linux/irqlog-storm.sar.gz

- Steven
--

From: Luis R. Rodriguez
Date: Saturday, September 20, 2008 - 1:57 pm

mcgrof@tesla ~/tmp/irq $ sar -f irqlog-storm.sar -A
Invalid system activity file: irqlog-storm.sar

Perhaps we its a different format than what my sar can read...

So here's what we need to do -- try to zero on in what causes this IRQ
storm. Most likely its some strange event that triggers it, perhaps
some sort of noise which makes the device try to adapt and hits a path
we are not handling correctly yet. Can you provide the initial log of
when ath9k gets loaded? This tells us what card it is, like AR5416 or
AR9280, etc.

  Luis
--

From: Johannes Berg
Date: Saturday, September 20, 2008 - 1:58 pm

It changes like every release... Not a very useful thing, that.

johannes
From: Luis R. Rodriguez
Date: Saturday, September 20, 2008 - 2:03 pm

Ah -- oh well.

OK so yeah you have an interrupt storm, this is evident.

Do you think there is a way you can try to trigger the interrupt storm
so it happens < than after 7 hours or however long its taking you? If
so then it'd be good to get your debug out using the interrupt debug.
If that's not useful then we can enable all debug stuff (file will be
huge).

Try turning on the microwave, turn on all  your bluetooth devices, use
your 2.4 GHz wireless phone, use your xbox 360 controllers and ps3
controllers all at once :D

  Luis
--

From: Steven Noonan
Date: Saturday, September 20, 2008 - 2:22 pm

On Sat, Sep 20, 2008 at 2:03 PM, Luis R. Rodriguez


It's apparently 7 idle hours and ~1-2 hours when I'm actively browsing
the web. I don't know what in particular triggers it, so I guess I'd
just need to enable debug logging and then start running through all

Heh. I don't have an Xbox 360 or PS3 (sad, I know). Do have several
2.4GHz phones. ;)
--

From: Luis R. Rodriguez
Date: Saturday, September 20, 2008 - 2:26 pm

Microwaves are the best to make the signal wet poop BTW, if you have
that option available. Not sure if we can trigger the device that way
but I'm just brainstorming here.

  Luis
--

From: Steven Noonan
Date: Saturday, September 20, 2008 - 2:48 pm

On Sat, Sep 20, 2008 at 2:26 PM, Luis R. Rodriguez

Well, I've got the thing set to ATH_DBG_ANY (I want this damn bug
found!), which will certainly be *interesting*. In 5 minutes, the log
has jumped 9 megabytes.

I've also got sadc monitoring IRQs so I know for certain when the
interrupt storm starts.

I'll let you know how this goes.

- Steven
--

From: Steven Noonan
Date: Saturday, September 20, 2008 - 6:48 pm

OK, here's the time index at which it b0rked:

18:28:38          sum   1106.00
18:28:39          sum   1037.62
18:28:40          sum   1069.00
18:28:41          sum   1167.00
18:28:42          sum   1155.00
18:28:43          sum   1339.00
18:28:44          sum  18355.00
18:28:45          sum  17845.45
18:28:46          sum  15285.00
18:28:47          sum  17511.00
18:28:48          sum  17568.69
18:28:49          sum  17704.04
18:28:50          sum  18566.67
18:28:51          sum  18913.13

And the entire log of the minute 18:28 is here:
http://www.uplinklabs.net/~tycho/linux/18.28.00-59.log.gz

It looks pretty crazy.

- Steven
--

From: Luis R. Rodriguez
Date: Sunday, September 21, 2008 - 11:54 pm

Thanks, I'll take a look when I get a chance (remember aggretion
is in the queue too) unless someone beats me to it. Asking in advance
-- if you can also provide a log of minutes 43, 44 and 45 that would
be great, but this may just be enough.

  Luis
--

From: Luis R. Rodriguez
Date: Monday, September 22, 2008 - 12:14 am

Oh wait, those are seconds, nevermind :)

  Luis
--

From: Steven Noonan
Date: Monday, September 22, 2008 - 12:26 am

On Sun, Sep 21, 2008 at 11:54 PM, Luis R. Rodriguez

You may want to take a quick look at it now. I certainly don't mean
you should push aggregation aside, but the log looks kind of screwy at
18:28:43 (look at line 4972 to see what I mean). It looks as though it
did printk()s from multiple threads with no locking mechanism. It's
possible I will need to re-capture the error, and I'd like to know if
this log will help or not.

- Steven
--

From: Luis R. Rodriguez
Date: Monday, September 22, 2008 - 1:06 am

At first glance I see tons of MIB interrupts being spewed, besides the
ackward prints you see too my main concern so far is the MIB interrupts
as we don't handle ANI. My initial thought is because don't hanld ANI
yet we are letting MIB interrupts to continue to spew out.

  Luis

Please test the following patch.

diff --git a/drivers/net/wireless/ath9k/core.c b/drivers/net/wireless/ath9k/core.c
index c262ef2..9a51739 100644
--- a/drivers/net/wireless/ath9k/core.c
+++ b/drivers/net/wireless/ath9k/core.c
@@ -681,10 +681,12 @@ int ath_open(struct ath_softc *sc, struct ath9k_channel *initial_chan)
 	 * Enable MIB interrupts when there are hardware phy counters.
 	 * Note we only do this (at the moment) for station mode.
 	 */
+#if 0
 	if (ath9k_hw_phycounters(ah) &&
 	    ((sc->sc_ah->ah_opmode == ATH9K_M_STA) ||
 	     (sc->sc_ah->ah_opmode == ATH9K_M_IBSS)))
 		sc->sc_imask |= ATH9K_INT_MIB;
+#endif
 	/*
 	 * Some hardware processes the TIM IE and fires an
 	 * interrupt when the TIM bit is set.  For hardware
--

From: Steven Noonan
Date: Monday, September 22, 2008 - 8:11 am

On Mon, Sep 22, 2008 at 1:06 AM, Luis R. Rodriguez

It didn't apply to -rc7, but I managed to apply it manually
(apparently you made this for wireless-testing?) and added the #if 0.
After 7 hours uptime with the driver, no apparent interrupt storm.

I'll do more thorough testing later, but I have a 3 hour drive today,
and some packing to do, so it'll need to wait a bit.

- Steven
--

From: Luis R. Rodriguez
Date: Monday, September 22, 2008 - 9:01 am

Yeah I did it for wireless-testing. Good to hear this so far has
solved the issue. I'll port it to 27 and post it once I get your
blessings that this fixed it by a Tested-by.

  Luis
--

From: Steven Noonan
Date: Tuesday, September 23, 2008 - 12:22 am

On Mon, Sep 22, 2008 at 9:01 AM, Luis R. Rodriguez

Okay. I'm in Seattle now.

Yes, it solved the issue, but surely an #if 0 isn't a proper solution.
What's the actual bug here? I'm afraid the meaning of what's going on
here isn't exactly intuitive.

Also, I'm doing an 8 hour run with this patch tonight. Maybe more,
depends on when I wake up. ;)

I think if this resolves the issue on two separate ≈8 hour runs, it
should be considered the solution to the issue, unless it resurfaces.

- Steven
From: Steven Noonan
Date: Tuesday, September 23, 2008 - 12:55 am

I am only 25 minutes into the run, but I feel this is worth reporting
before I forget. No IRQ storm thus far, but the connection keeps
dropping. Or rather, it doesn't disassociate, but while I was
transferring a large file, it had points of time where it suddenly
lost and reacquired the connection (signal strength fluctuates wildly
between 70-90% and then 0% for a second, and then back to
fluctuating). Perhaps this is related to the lack of aggregation?

- Steven
From: Steven Noonan
Date: Tuesday, September 23, 2008 - 9:20 am

This 8 hour test passes.

- Steven
From: Luis R. Rodriguez
Date: Tuesday, September 23, 2008 - 12:22 pm

For 2.6.27 it will be unless people are willing to accept a patch to

Check linux-wireless for my patch on MIB event comments. But I'll
summarize here for you. We get MIB interrupts to update statistics on
a number of PHY related elements. We then use this information to
process ANI -- some automatic noise immunity work. If you don't
process the information for ANI it seems you get the hardware treating
this as untreated and hence you get an interrupt storm. So since
2.6.27 doesn't have ANI implemented the MIB interrupts are not useful
except to collect internal statistics.

Since you are so eager to test this with a "proper" solution I will
send you a patch to test ANI for 2.6.27. We'll leave it up to the
maintainers and distributions to decide whether or not to pick this
up. But for that let's leave that for another thread.

  Luis
--

From: Steven Noonan
Date: Saturday, September 20, 2008 - 2:18 pm

On Sat, Sep 20, 2008 at 1:57 PM, Luis R. Rodriguez

How is the best way to approach this? I assume ATH_DBG_INTERRUPT and
pray like hell my /var partition doesn't fill up first? Maybe I'll
just symlink /var/log/messages to a file on a substantially larger

[10238.347956] ath9k: 0.1
[10238.348028] vendor=8086 device=27d2
[10238.348033] ath9k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[10238.348050] ath9k 0000:03:00.0: setting latency timer to 64
[10238.480781] phy1: Selected rate control algorithm 'ath9k_rate_control'
[10238.483320] phy1: Atheros 5416: mem=0xffffc20000060000, irq=17

It claims to be a 5416, but I was under the impression that it was the
5418 that came with the MacBook Pro.
--

From: Luis R. Rodriguez
Date: Saturday, September 20, 2008 - 2:24 pm

lspci output is outdated, we have submitted some new updates to it.
The output of the driver should be the one you can rely on for chipset
information for now. The problem is besides the PCI device ID there is
also EEPROM information which further identifies the chipset like
devices can be dual band (2 GHz and 5 GHz capable) and some other
single band -- using the same PCI device:vendor ID. So for now we've
tried to update the PCI id stuff (not sure when it'll be merged) to
tell your "family chipset" and in parens the MAC or known model. It
can still use some cleanup but yeah, hope this helps understand this a
little better. You can see more info on this here:

http://wireless.kernel.org/en/users/Drivers/ath9k#supportedchips

  Luis
--

From: Steven Noonan
Date: Thursday, September 18, 2008 - 4:08 pm

On Thu, Sep 18, 2008 at 3:01 PM, Luis R. Rodriguez

Will do! And actually, to be precise, the issue occurred not on the

Linus has been fairly strict lately, so I am as uncertain as you are.
It really should get merged, because having 2.6.27 with [partially]
broken wireless networking is kind of stupid. In the worst case, isn't
it possible to do a revert to get the old aggregation crap back in?
--

From: Luis R. Rodriguez
Date: Thursday, September 18, 2008 - 4:12 pm

up anyway.

  Luis
--

Previous thread: Interrupt Handler Invocation by Singaravelan Nallasellan on Thursday, September 18, 2008 - 10:57 am. (2 messages)

Next thread: [GIT PULL] 2.6.27-rc6 updates for s390 by Martin Schwidefsky on Thursday, September 18, 2008 - 11:23 am. (1 message)