Re: Mark IPW2100 as BROKEN: Fatal interrupt. Scheduling firmware restart.

Previous thread: Re: [PATCH] max3100 driver by Ben Pfaff on Sunday, September 21, 2008 - 9:09 am. (1 message)

Next thread: Re: [PATCH] acer-wmi: add error checks in module_init by Carlos Corbacho on Sunday, September 21, 2008 - 10:53 am. (6 messages)
From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 10:23 am

Hi.

The following bug has existed in the ipw2100 driver/firmware for years, and
Intel folks have never responded to the zillions of bugzilla entries and forum
posts on the internet with a patch or firmware update (although they did
request dmesg and debug info, and received them).

ipw2100: Fatal interrupt. Scheduling firmware restart.

I believe it is a firmware bug because after the driver is unloaded and
loaded back again the wireless adapter usually starts working (for a small
amount of time, though). My conspiracy feeling suggests that it may be
a way to force people to buy a new one, or a trivial error in the firmware
where it keeps writing to the same place in the flash until the given cell
is essentially dead, or whatever else.

Intel folks, please fix this problem, I see no other way to force you to do
this than to mark ipw2100 driver as broken, since that is what it is.

The bug exists at least in the .15 up to .24 kernels; just search for the
dmesg line above. I caught it with the 2.6.24-19-386 Ubuntu kernel and
firmware version 1.3. lspci:

02:04.0 Network controller: Intel Corporation PRO/Wireless LAN 2100 3B Mini PCI Adapter (rev 04)
	Subsystem: Intel Corporation Samsung X10/P30 integrated WLAN
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (500ns min, 8500ns max), Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at 90080000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-

dmesg is pretty usual.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
index 9931b5a..c24fc6a 100644
--- a/drivers/net/wireless/Kconfig
+++ b/drivers/net/wireless/Kconfig
@@ -125,7 +125,7 @@ config PCMCIA_RAYCS
 
 config IPW2100
 ...
From: Michael Buesch
Date: Sunday, September 21, 2008 - 10:36 am

You are pretty funny, actually. :)

I think the bug should be fixed, but what makes _you_ think you can _force_



-- 
Greetings Michael.
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 10:38 am

Maybe because I bought that adapter and it stopped working and Intel
knows about this bug and does not fix it for years?

-- 
	Evgeniy Polyakov
--

From: Arjan van de Ven
Date: Sunday, September 21, 2008 - 11:04 am

On Sun, 21 Sep 2008 21:23:17 +0400

so now you go from an occasional burp to having nothing at all.
How about you run with this patch on your own machine only?

or.. since you say a reload of the driver fixes it.. why don't you make
a patch for the driver that does basically the actions of a reload
automatically when the driver detects the issue?
(and stick a WARN_ON in for good measure so that kerneloops.org can
start tracking these burps)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 11:28 am

And how else user can get attention to the problem which is not fixed by
the vendor? We close our eyes and there is no problem, since we do not
see it. I just brought a lamp: no user can see that essentially driver

It stops after several seconds (or packets?). Sometimes (but rarely)
it works several minutes, sometimes it fires above dmesg line and
continues to work, sometimes it fires it for a while and then stops
writing it, although driver does not send or receive anything (at
least ifconfig counters do not change).

Actually, I do not think it is a driver problem, since what it does is
pretty straightforward, but if you tell me how else we can fix
this issue, I will print it and glue it near the window so this trick
can be used for other problems. Or you can (as everyone else does)
just say that this is damn wrong and forget about the problem for the next
several years.

-- 
	Evgeniy Polyakov
--

From: Arjan van de Ven
Date: Sunday, September 21, 2008 - 11:35 am

On Sun, 21 Sep 2008 22:28:38 +0400


again.. so how about you detect this condition and do, in the driver
code, the equivalent of rmmod/insmod to the hardware. I'm sure people
who have the hardware would appreciate that type of patch a lot more
than the one you sent out.
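
The suggestion above (detect the fatal condition and redo the equivalent of
rmmod/insmod from inside the driver) can be sketched in plain userspace C. This
is only a simulation under stated assumptions: struct fake_adapter,
fake_teardown(), fake_bringup() and recover_from_fatal() are hypothetical names
invented for illustration and are not part of the ipw2100 driver. The retry
limit is also an assumption; the thread does not specify one.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the adapter; not the real struct ipw2100_priv. */
struct fake_adapter {
    bool up;             /* true while the simulated device is operational */
    int  failures_left;  /* how many bring-up attempts will still fail */
    int  teardowns;      /* number of full teardowns performed so far */
};

/* Simulated equivalent of module exit: release everything. */
static void fake_teardown(struct fake_adapter *a)
{
    a->up = false;
    a->teardowns++;
}

/* Simulated equivalent of module init: fails a configurable number of times. */
static bool fake_bringup(struct fake_adapter *a)
{
    if (a->failures_left > 0) {
        a->failures_left--;
        return false;
    }
    a->up = true;
    return true;
}

/*
 * On a fatal error, run a full teardown/bring-up cycle, retrying at most
 * max_attempts times. Returns the number of attempts used, or -1 if the
 * device never came back.
 */
static int recover_from_fatal(struct fake_adapter *a, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        fake_teardown(a);
        if (fake_bringup(a)) {
            printf("recovered after %d attempt(s)\n", attempt);
            return attempt;
        }
    }
    fprintf(stderr, "giving up after %d attempts\n", max_attempts);
    return -1;
}
```

Bounding the number of attempts keeps a permanently dead card from turning the
recovery path into an endless loop.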




-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Wei Weng
Date: Sunday, September 21, 2008 - 11:52 am

I guess it is your way of "middle finger" to all the IPW2100 customers who try
to use it on a Linux machine.


Thanks
Wei


--

From: Arjan van de Ven
Date: Sunday, September 21, 2008 - 12:20 pm

On Sun, 21 Sep 2008 14:52:37 -0400

if suggesting a workaround is giving the middle finger in your mind, 
then I don't think it's worth my time to discuss this further with you.



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 12:00 pm

Which does not have access to the firmware... Which IMO is failing and
 
The reset task effectively does ipw2100_up(), so the difference is power
cycles over the PCI bus and enable/disable/request commands. Like this
stuff:
	/* We disable the RETRY_TIMEOUT register (0x41) to keep
	 * PCI Tx retries from interfering with C3 CPU state */
	pci_read_config_dword(pci_dev, 0x40, &val);
	if ((val & 0x0000ff00) != 0)
		pci_write_config_dword(pci_dev, 0x40, val & 0xffff00ff);
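
The masks in the snippet above can be checked in isolation. The following is a
userspace sketch of the bit arithmetic only, assuming the usual little-endian
layout of PCI config space, where byte 0x41 corresponds to bits 8..15 of the
dword read at offset 0x40. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when byte 0x41 (bits 8..15 of the dword at 0x40) is non-zero. */
static bool retry_timeout_enabled(uint32_t dword_at_0x40)
{
    return (dword_at_0x40 & 0x0000ff00u) != 0;
}

/* Clears only byte 0x41 and leaves the other three bytes untouched. */
static uint32_t clear_retry_timeout(uint32_t dword_at_0x40)
{
    return dword_at_0x40 & 0xffff00ffu;
}
```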

I do remember I had a tibet monk course of decoding ipw2100 PCI
config address space, just need to find my kimono.

Do you want me to implement ipw2100 driver as a big work structure
which will run ipw2100_init()/wait/ipw2100_exit() in a loop?
And that will be the fix suggested by Intel? That would explain a lot.

P.S. And some people tell that asking for bug bisection is a hard
pressure on user. Vendor has to ask him to fix bug himself instead,
and that will be a solution!

Given the fact that rmmod/insmod does not always fix the problem (but
most of the time does, for a short period), I again want to point out
that it looks like a firmware problem related to some inner timings. You
ask me to fix the driver and do not even listen to what I said
previously, take it into account, or analyze it.

-- 
	Evgeniy Polyakov
--

From: Johannes Berg
Date: Sunday, September 21, 2008 - 12:14 pm

I think what Arjan is saying is that it would be better to put pressure
on the responsible folks (I don't think Arjan is anywhere near them at
all) if you'd put in a WARN_ON() for this error and that would make the
top entry on kerneloops.org all the time... And additionally put in a
workaround for yourself for now.

And can we keep the flames off this list please? That comment from Wei
Weng was absolutely uncalled for, and inciting a flamewar (as you have
already blogged) was not really productive either.

johannes
From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 12:38 pm

Hi.



As I pointed out, I can rewrite the driver's whole initialization process
so that it looks like an init/wait/exit loop, which can be run at
module load and when the fatal interrupt fires. Is this a fix? This is
not even remotely a workaround. We can just add
rmmod/modprobe/ifdown/ifup to a crontab job. Other users reported in
bugzilla that they needed to reboot a machine to make the card work
again. I'm not sure that user tried to do a rmmod/modprobe though.

If we will keep silence, no one will notice that problem exists.

I do hope this will result in progress. Arjan, do you agree to add
this patch to the current tree?

diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
index 19a401c..9a7b64c 100644
--- a/drivers/net/wireless/ipw2100.c
+++ b/drivers/net/wireless/ipw2100.c
@@ -206,6 +206,8 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
 
 static u32 ipw2100_debug_level = IPW_DL_NONE;
 
+static int ipw2100_max_fatal_ints = 10;
+
 #ifdef CONFIG_IPW2100_DEBUG
 #define IPW_DEBUG(level, message...) \
 do { \
@@ -3174,6 +3176,10 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
 	if (inta & IPW2100_INTA_FATAL_ERROR) {
 		printk(KERN_WARNING DRV_NAME
 		       ": Fatal interrupt. Scheduling firmware restart.\n");
+		WARN_ON(1);
+
+		BUG_ON(ipw2100_max_fatal_ints-- <= 0);
+
 		priv->inta_other++;
 		write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
 


-- 
	Evgeniy Polyakov
--

From: Arjan van de Ven
Date: Sunday, September 21, 2008 - 12:43 pm

On Sun, 21 Sep 2008 23:38:09 +0400


BUG_ON in interrupt context is just extremely hostile, since it means
the box is dead.

also I would suggest using WARN_ON_ONCE() 

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 1:20 pm

Well, I actually wanted to have a bug there because of it, but now I
think that annoying repeated warning is enough to bring attention to the
problem by putting bug information into some magic special place called
kerneloops collection.

Consider it for inclusion in the upcoming kernel to get wider
notifications. Yes, it is not a bugfix, I know.

diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
index 19a401c..6599211 100644
--- a/drivers/net/wireless/ipw2100.c
+++ b/drivers/net/wireless/ipw2100.c
@@ -206,6 +206,9 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
 
 static u32 ipw2100_debug_level = IPW_DL_NONE;
 
+static int ipw2100_max_fatal_ints = 10;
+module_param(ipw2100_max_fatal_ints, int, 0644);
+
 #ifdef CONFIG_IPW2100_DEBUG
 #define IPW_DEBUG(level, message...) \
 do { \
@@ -3174,6 +3177,9 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
 	if (inta & IPW2100_INTA_FATAL_ERROR) {
 		printk(KERN_WARNING DRV_NAME
 		       ": Fatal interrupt. Scheduling firmware restart.\n");
+
+		WARN_ON(ipw2100_max_fatal_ints-- >= 0);
+
 		priv->inta_other++;
 		write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
 


-- 
	Evgeniy Polyakov
--

From: Arjan van de Ven
Date: Sunday, September 21, 2008 - 1:27 pm

On Mon, 22 Sep 2008 00:20:57 +0400

are you more interested in bringing attention than finding something
that makes the driver work ? I sort of am getting that impression and

still more complex than needed; a WARN_ON_ONCE() will be enough.
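
To compare the two approaches discussed here, the following userspace sketch
counts how many warnings each would emit over a run of fatal interrupts. It
only models the conditions: warnings_with_counter() mirrors the shape of the
posted `ipw2100_max_fatal_ints-- >= 0` test, and warnings_once() models a
warn-once flag in the spirit of WARN_ON_ONCE(). Both function names are
hypothetical, and no kernel code is involved.

```c
#include <stdbool.h>

/* Counts how often `budget-- >= 0` is true across `events` interrupts. */
static int warnings_with_counter(int events, int budget)
{
    int fired = 0;

    for (int i = 0; i < events; i++) {
        if (budget-- >= 0)  /* same shape as the condition in the patch */
            fired++;
    }
    return fired;
}

/* Counts warnings when only the first occurrence is reported. */
static int warnings_once(int events)
{
    bool warned = false;    /* plays the role of the warn-once flag */
    int fired = 0;

    for (int i = 0; i < events; i++) {
        if (!warned) {
            warned = true;
            fired++;
        }
    }
    return fired;
}
```

With a starting budget of 10, the counter form is true for the values 10 down
to 0, so it reports eleven times rather than ten before going quiet.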
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 1:57 pm

I do think that it can not be fixed without serious intervention of the
Intel (hardware) folks, since bug exists more than 4 years in two
firmwares and lots of very different driver versions and was reproduced
even on 2.4 kernel.

I will experiment with reloading as Alan suggested, and add or remove
more surgery in the initialization process to be able to
'work around' the issue, since it looks like no one else will.

But that's definitely not a fix and in my personal workaround's 10

That allows to dump whatever number of warnings you want. The more we
have, the louder will be customers scream.

-- 
	Evgeniy Polyakov
--

From: Arjan van de Ven
Date: Sunday, September 21, 2008 - 2:02 pm

On Mon, 22 Sep 2008 00:57:06 +0400

artificially increasing numbers isn't going to do that; it just shows
you're more interested in making a stink than in getting something


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 2:05 pm

As practice shows, I'm the only one who is interested in getting
something improved, and Intel, as we see right now, is not interested in
it at all, since you ask me not only decrease error verbosity, but also
do not work towards fixing the bug by trying to understand where it
lives.

-- 
	Evgeniy Polyakov
--

From: Arjan van de Ven
Date: Sunday, September 21, 2008 - 2:14 pm

On Mon, 22 Sep 2008 01:05:55 +0400

I did no such thing and you know it.

I'm sorry, I'm not going to waste time on this if you keep acting
this dishonest; welcome to my mail filter...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Denys Fedoryshchenko
Date: Sunday, September 21, 2008 - 2:43 pm

You are not right. I had a totally dysfunctional Intel driver on two laptops
and reported the issue to Intel. Yes, it took time; they took all the debug
output and went into coma mode (or so I thought), but suddenly I got mail from
them, and the next kernel/firmware release worked flawlessly for me. So they
did a perfect job.

Don't be negative, and prepare yourself for giving long debug outputs. Patience 
and only patience.
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 3:07 pm

Just to be clear, did it take 4 years? :)

Anyway, I have already drawn my conclusions, as probably others have: I will
experiment with different 'workarounds' for this bug. Maybe I will
succeed, maybe Intel will decide to fix it, maybe the LHC will crash the
world. A verbose warning about the bug was frowned upon, so it's up to users
to make progress here...

-- 
	Evgeniy Polyakov
--

From: Denys Fedoryshchenko
Date: Sunday, September 21, 2008 - 3:15 pm

Any bugzilla entry? I cannot find on 
http://www.intellinuxwireless.org/bugzilla/ anything about this bug.

I submitted two reports in my case, one in the kernel bugzilla, one in the
Intel Linux wireless project.
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 4:46 pm

Lucky you :)

-- 
	Evgeniy Polyakov
--

From: Marcel Holtmann
Date: Sunday, September 21, 2008 - 4:27 pm

as Arjan and Alan pointed out already, WARN_ON_ONCE is enough and I
agree with them. Just to make this perfectly clear, this is with my
community hat on.

Please send a proper patch with a simple WARN_ON_ONCE and I am happy to
sign off on it.

Regards

Marcel


--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 5:00 pm

I really do not care whether there is a warning at all, I just want that
bug to be fixed. And as we can see, something has started to change, and that's
probably a good sign. I am glad there is a result. I will check D3 states
tomorrow. Patch attached, if you think it is still needed.

diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
index 19a401c..637dc05 100644
--- a/drivers/net/wireless/ipw2100.c
+++ b/drivers/net/wireless/ipw2100.c
@@ -3174,16 +3174,18 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
 	if (inta & IPW2100_INTA_FATAL_ERROR) {
 		printk(KERN_WARNING DRV_NAME
 		       ": Fatal interrupt. Scheduling firmware restart.\n");
+
 		priv->inta_other++;
 		write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
 
 		read_nic_dword(dev, IPW_NIC_FATAL_ERROR, &priv->fatal_error);
-		IPW_DEBUG_INFO("%s: Fatal error value: 0x%08X\n",
-			       priv->net_dev->name, priv->fatal_error);
-
 		read_nic_dword(dev, IPW_ERROR_ADDR(priv->fatal_error), &tmp);
-		IPW_DEBUG_INFO("%s: Fatal error address value: 0x%08X\n",
-			       priv->net_dev->name, tmp);
+
+		printk(KERN_WARNING "%s: Fatal error value: 0x%08X, "
+				"address: 0x%08X, inta: 0x%08lX\n",
+			priv->net_dev->name, priv->fatal_error, tmp,
+			(unsigned long)inta & IPW_INTERRUPT_MASK);
+		WARN_ON_ONCE(1);
 
 		/* Wake up any sleeping jobs */
 		schedule_reset(priv);

-- 
	Evgeniy Polyakov
--

From: Alan Cox
Date: Sunday, September 21, 2008 - 3:38 pm

But if Intel don't care then you can scream all you like 8)

A WARN_ON_ONCE is sufficient to capture an idea of how many people it is
affecting and maybe to figure out what the trigger is from their reports;
at that point there is some chance to get it fixed (especially if it's
remotely triggerable ;))

Alan
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 4:44 pm

Well, the Red Hat, SUSE and Ubuntu bugzillas happened to be not enough. Why do
you believe a single warning in a new place will be? Or a couple of tens,
or whatever else? If it cares, it cares. If it does not...

I attracted vendor's attention, vendor told me to fix it myself and to
create a patch to fill an entry in another 'bugzilla', so that vendor
could get results and probably decide to walk down from the cloud and
fix it.

So, if they do not care, I do not care about their care. That's the
deal. I will try to find a workaround, even if it is a real crap,
fortunately other users will not strike this bug too frequently.

-- 
	Evgeniy Polyakov
--

From: David Miller
Date: Sunday, September 21, 2008 - 4:48 pm

Evgeniy, you're bordering on being an asshole, if not actually
being one.

If you behaved this way for a bug I was responsible for, I would
absolutely ignore you until you settled down and started to behave
more reasonably.

You're acting like a bomb which is about to explode, which is probably
why the actual Intel maintainers for this driver don't want to touch
you with a ten foot pole.  You're being volatile and extremely
unpleasant to interact with about this issue.

The Intel folks replying to you right now are general Intel linux
folks who are trying to help you, not the driver maintainers who can
look into the firmware and attack that angle.  So give them A FUCKING
BREAK!

Getting the OOPS to kerneloops.org is the way forward and will help
your cause, whether you believe it or not.
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 5:18 pm

Out of curiosity, what's worse: being an asshole and pretend to be good

That's the main point: 'until you started to behave more reasonably'.
For example filling another bug in rh/suse/ubuntu bugzilla?
Put yourself to the user's place, and suddenly picture changes
dramatically.

We got some progress on this bug, at least there is direct suggestion
from Matthew about power state, if it will fix the issue, I think it is
a good deal: one bug fix for lot of users for the mail in the killfile
and a worsened 'reputation'.

I provided a patch like Arjan wanted, and it can only change something
because of all the talk I started by being an asshole. In my opinion.
Maybe there were some other ways around, but it looks like being
provocative is the only way to get to the cloud. Who knows :)

-- 
	Evgeniy Polyakov
--

From: Bill Davidsen
Date: Monday, September 22, 2008 - 7:22 am

Has it occurred to you that YOU have a problem on YOUR machine, and that your 
patch would kill wireless for all the people who have the hardware on working 
systems? My experience was somewhat like Denys' except I got no notice, I just 
found that after an update the wireless worked solidly, and continued to do so 
until that laptop became obsolete and slow, and went to live with one of the 
It would be good to gather data rather than claim it doesn't work, because for 
My experience with laptops has been that you fiddle with power saving, and more, 
and more, until you find the tricks which make the laptop save power by 
disabling something you need, like network or display. At least that's been both 
my practice and observation, that not every machine responds well to every power 
saving trick.

Have you checked for a BIOS update for the machine? Tried disabling all power 
saving settings and seeing if that changes the problem? I would normally assume 
you have, but you seem convinced that the bug is in the firmware and you're 
going to get it fixed. It may be a firmware bug, but if something in your system 
is triggering it, and most people don't have the problem, you might investigate 
a solution other than beating on Intel.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

--

From: Cyrill Gorcunov
Date: Sunday, September 21, 2008 - 1:05 pm

[Evgeniy Polyakov - Sun, Sep 21, 2008 at 11:38:09PM +0400]
| Hi.
| 
| On Sun, Sep 21, 2008 at 09:14:04PM +0200, Johannes Berg (johannes@sipsolutions.net) wrote:
| > > Do you want me to implement ipw2100 driver as a big work structure
| > > which will run ipw2100_init()/wait/ipw2100_exit() in a loop?
| > > And that will be the fix suggested by Intel? That would explain a lot.
| > 
| > I think what Arjan is saying is that it would be better to put pressure
| > on the responsible folks (I don't think Arjan is anywhere near them at
| 
| Both maintainers were added to the copy list.
| 
| > all) if you'd put in a WARN_ON() for this error and that would make the
| > top entry on kerneloops.org all the time... And additionally put in a
| > workaround for yourself for now.
| 
| As I pointed, I can rewrite the whole driver's initialization process,
| so that it looked like init/wait/exit loop, which can be processed at
| the module load and when fatal interrupt fires. Do this a fix? This is
| not even a remotely workaround. We can just add
| rmmod/modprobe/ifdown/ifup to the crontab job. Another users reported in
| bugzilla that they needed to reboot a machine to make card working
| again. I'm not sure that user tried to do a rmmod/modprobe though.
| 
| > And can we keep the flames off this list please? That comment from Wei
| > Weng was absolutely uncalled for, and inciting a flamewar (as you have
| > already blogged) was not really productive either.
| 
| If we will keep silence, no one will notice that problem exists.
| 
| I do hope this will result in a progress. Arjan, do you aggree to add
| this patch to the current tree?
| 
| diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
| index 19a401c..9a7b64c 100644
| --- a/drivers/net/wireless/ipw2100.c
| +++ b/drivers/net/wireless/ipw2100.c
| @@ -206,6 +206,8 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
|  
|  static u32 ipw2100_debug_level = IPW_DL_NONE;
|  
| ...
From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 1:26 pm

The only reason for this change is to make a mark at the kerneloops.
I.e. users know, there is a bug. Developers know, there is a bug.
Everyone knows that there is a bug, but until it is at the special place
we look to each other just like there is no bug.

Here are dumps for example:
http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=245

Bug existed even with 1.2 firmware and .11 kernel.
Intel, that's a great marketing slogan: stability everywhere!

-- 
	Evgeniy Polyakov
--

From: Cyrill Gorcunov
Date: Sunday, September 21, 2008 - 1:35 pm

[Evgeniy Polyakov - Mon, Sep 22, 2008 at 12:26:56AM +0400]
| On Mon, Sep 22, 2008 at 12:05:18AM +0400, Cyrill Gorcunov (gorcunov@gmail.com) wrote:
| > Since it's that serious maybe we should change
| > 
| > 		IPW_DEBUG_INFO("%s: Fatal error value: 0x%08X\n",
| > 			       priv->net_dev->name, priv->fatal_error);
| > 
| > to printk(KERN_WARN)? And here is why - as I see now we can't say what
| > exactly is wrong - Evgeniy said he has a suspicious about firmware so
| > this WARNS will be collected by Arjan thru kerneloops and we could not
| > ask users to change debug level and repost problem - oops will have it
| > by default - and if it really firmware problem - firmware engineers could
| > find this _additional_ info usefull and resolve it (probably).
| 
| The only reason for this change is to make a mark at the kerneloops.
| I.e. users know, there is a bug. Developers know, there is a bug.
| Everyone knows that there is a bug, but until it is at the special place
| we look to each other just like there is no bug.
| 
| Here are dumps for example:
| http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=245
| 
| Bug existed even with 1.2 firmware and .11 kernel.
| Intel, that's a great marketing slogan: stability everywhere!
| 
| -- 
| 	Evgeniy Polyakov
| 


yes Evgeniy - all could know that but this register info could help
firmware engineers to distinguish problems (without additional efforts
like ask users to pass debug argument - kerneloops will have it
by default) if there not only one exist. I mean I don't think anyone
would reject additional info about problem ever :)

		- Cyrill -
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 2:06 pm

Agreed.

diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
index 19a401c..36cdd57 100644
--- a/drivers/net/wireless/ipw2100.c
+++ b/drivers/net/wireless/ipw2100.c
@@ -206,6 +206,9 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
 
 static u32 ipw2100_debug_level = IPW_DL_NONE;
 
+static int ipw2100_max_fatal_ints = 10;
+module_param(ipw2100_max_fatal_ints, int, 0644);
+
 #ifdef CONFIG_IPW2100_DEBUG
 #define IPW_DEBUG(level, message...) \
 do { \
@@ -3174,16 +3177,21 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
 	if (inta & IPW2100_INTA_FATAL_ERROR) {
 		printk(KERN_WARNING DRV_NAME
 		       ": Fatal interrupt. Scheduling firmware restart.\n");
+
+		printk(KERN_WARNING DRV_NAME ": INTA: 0x%08lX\n",
+			(unsigned long)inta & IPW_INTERRUPT_MASK);
+
 		priv->inta_other++;
 		write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
 
 		read_nic_dword(dev, IPW_NIC_FATAL_ERROR, &priv->fatal_error);
-		IPW_DEBUG_INFO("%s: Fatal error value: 0x%08X\n",
+		printk(KERN_WARNING "%s: Fatal error value: 0x%08X\n",
 			       priv->net_dev->name, priv->fatal_error);
 
 		read_nic_dword(dev, IPW_ERROR_ADDR(priv->fatal_error), &tmp);
-		IPW_DEBUG_INFO("%s: Fatal error address value: 0x%08X\n",
+		printk(KERN_WARNING "%s: Fatal error address value: 0x%08X\n",
 			       priv->net_dev->name, tmp);
+		WARN_ON(ipw2100_max_fatal_ints-- >= 0);
 
 		/* Wake up any sleeping jobs */
 		schedule_reset(priv);


-- 
	Evgeniy Polyakov
--

From: Alan Cox
Date: Sunday, September 21, 2008 - 12:57 pm

Try putting it into D3 counting to 10 and powering it back up. Thats
about as close as you can get to pulling the plug when it hangs.

Alan
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 2:10 pm

I will experiment with this, thanks Alan.
Unfortunately my machine builds this only
updated driver for about 10 minutes, so
results will appear not too quickly.
I will start tests tomorrow.

-- 
	Evgeniy Polyakov
--

From: Evgeniy Polyakov
Date: Thursday, September 25, 2008 - 10:56 pm

I made several experiments with power states in the reset handler,
like putting the device into D3 (hot), disabling it, and saving/restoring
state. Fatal interrupts continue to fire at essentially the same rate.

The same error address does not always contain the same error value, but
frequently it is one of a small finite set.

Here are some data:
[41773.200686] ipw2100: Fatal interrupt. Scheduling firmware restart.
[41773.200707] eth1: Fatal error value: 0x500185B8, address: 0x08004501,
	inta: 0x40000000
[41773.200810] ipw2100 0000:02:04.0: PCI INT A disabled
[41773.203110] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.224446] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.245781] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.249360] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[41773.249384] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11
	(level, low) -> IRQ 11
[41773.249426] ipw2100 0000:02:04.0: restoring config space at
	offset 0x1 (was 0x2900002, writing 0x2900006)

That is quite harmless, since the interrupt handler just sees that the device
is disappearing. This led me to think more about interrupt processing
(the irq handler and related tasklet), and I found races between the interrupt
tasklet, the ipw2100_wx_event_work() handler, the reset task and probably
others. Register access is in some cases protected by a lock (interrupt
handler), and in other cases is not (all others). Every user
first disables interrupts, but an interrupt can be in the middle of being
handled at that moment, with the tasklet already scheduled. Also, the
priv->status field is frequently accessed and modified both with and without
locks. This may be harmless, but it is still a red flag.
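
One generic way to keep a shared status word consistent when several contexts
update it is to make every read-modify-write atomic. The sketch below shows
the idea in userspace C11. It is not taken from the driver: struct sketch_priv,
the SKETCH_STATUS_* bits and the helper names are hypothetical, and in kernel
code the analogous tools would be set_bit()/clear_bit()/test_bit() or one lock
taken consistently on every path.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical status bits; the real driver defines its own values. */
#define SKETCH_STATUS_RUNNING       (1u << 0)
#define SKETCH_STATUS_RESET_PENDING (1u << 1)

/* Hypothetical private data holding only the shared status word. */
struct sketch_priv {
    atomic_uint status;
};

/* Sets bits without losing a concurrent update to other bits. */
static void status_set(struct sketch_priv *p, unsigned int bits)
{
    atomic_fetch_or(&p->status, bits);
}

/* Clears bits without losing a concurrent update to other bits. */
static void status_clear(struct sketch_priv *p, unsigned int bits)
{
    atomic_fetch_and(&p->status, ~bits);
}

/* True only when all of the requested bits are currently set. */
static bool status_test(struct sketch_priv *p, unsigned int bits)
{
    return (atomic_load(&p->status) & bits) == bits;
}
```

Atomic bit updates only protect the status word itself. Sequences that touch
device registers and the status together would still need a common lock.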

Another data about the same failed address:
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, ...
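
Log lines in the format shown above can be tallied with a small userspace
helper, which makes it easier to see how few distinct error values occur. This
is a sketch that assumes the exact "Fatal error value: 0x..., address: 0x..."
wording of the printk in the posted patch. parse_fatal_error() and
count_value() are hypothetical names.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Extracts the error value and address from one log line. Any prefix
 * (timestamp, interface name) before "Fatal error value:" is skipped.
 * Returns false for lines that do not carry both fields.
 */
static bool parse_fatal_error(const char *line, uint32_t *value,
                              uint32_t *address)
{
    const char *p = strstr(line, "Fatal error value:");

    if (p == NULL)
        return false;
    return sscanf(p, "Fatal error value: 0x%" SCNx32 ", address: 0x%" SCNx32,
                  value, address) == 2;
}

/* Counts how many of the given lines report the wanted error value. */
static int count_value(const char *const lines[], int n, uint32_t wanted)
{
    int hits = 0;

    for (int i = 0; i < n; i++) {
        uint32_t value, address;

        if (parse_fatal_error(lines[i], &value, &address) && value == wanted)
            hits++;
    }
    return hits;
}
```
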
From: Marcel Holtmann
Date: Sunday, September 21, 2008 - 12:35 pm

I don't know if it is for this bug or a different one, but Matthew
Garrett seem to have some pending patches. At least that is what he told
me at PlumbersConf. Lets see if these patches do help. And please follow
up with Arjan's suggestion and put a WARN_ON in the upstream code
instead of waving CONFIG_BROKEN around.

Regards

Marcel


--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 2:12 pm

Hi Marcel.


I expect it is something new, since this bug exists at least from the
1.2 firmware version and .11 kernel. It was also reproduced (long ago
though) on 2.4.

-- 
	Evgeniy Polyakov
--

From: Matthew Garrett
Date: Sunday, September 21, 2008 - 3:45 pm

The fix I had for this was actually for ipw2200, but it ought to be 
applicable for 2100 as well. The ideal fix is probably to ensure that 
ipw*_down D3s the card and *_up D0s it, which brings enhanced runtime 
power saving and also has the nice side effect of actually resetting the 
damned POS in error cases.

-- 
Matthew Garrett | mjg59@srcf.ucam.org
--

From: Matthew Garrett
Date: Sunday, September 21, 2008 - 3:42 pm

Try D3ing the chip in the firmware restart code. Yes, it's retarded.

-- 
Matthew Garrett | mjg59@srcf.ucam.org
--

From: Evgeniy Polyakov
Date: Sunday, September 21, 2008 - 4:45 pm

Thank you, I will start tests tomorrow.

-- 
	Evgeniy Polyakov
--

From: Kenneth Crudup
Date: Monday, September 22, 2008 - 9:21 am

I gotta admit, those "firmware restarts" were pretty annoying, and I'd
always wondered why Intel themselves couldn't be bothered to fix 'em.

	-Kenny

-- 
Kenneth R. Crudup  Sr. SW Engineer, Scott County Consulting, Los Angeles
O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809      (888) 454-8181
--
