e1000e NVM corruption

From: jbi
Date: Sunday, September 28, 2008 - 2:06 am

I am not a member of this list (I read occasionally via the public 
archives) or an experienced kernel hacker but I do have a few 
semi-informed thoughts on the 8256x NVM corruption issue.  Take with 
salt as necessary:

According to Intel's datasheet, these interfaces expose *writable* PCI 
ID registers at NVM words 0x0A-0x0E[1].  Fill the NVM with FFs and the 
interface will probably respond with vendor ID : device ID FFFF:FFFF 
during bus enumeration.  The potential for these devices to disappear 
off the PCI bus after NVM corruption means that unbricking damaged 
devices could be nontrivial.
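
To illustrate why an erased NVM could make the device vanish: PCI enumeration conventionally treats a vendor ID of 0xFFFF as "nothing present at this function". The sketch below assumes the standard PCI layout of the first configuration dword (vendor ID in the low 16 bits, device ID in the high 16 bits). The helper names are made up for illustration and do not come from e1000e or any other driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Split the first configuration dword (offset 0x00) into its two IDs.
 * Standard PCI layout: vendor ID in bits 15:0, device ID in bits 31:16. */
static inline uint16_t cfg_vendor_id(uint32_t id_dword)
{
    return (uint16_t)(id_dword & 0xFFFFu);
}

static inline uint16_t cfg_device_id(uint32_t id_dword)
{
    return (uint16_t)(id_dword >> 16);
}

/* Hypothetical helper: true if a conventional enumerator would skip
 * this function because the vendor ID reads back as 0xFFFF. */
static inline bool cfg_looks_absent(uint32_t id_dword)
{
    return cfg_vendor_id(id_dword) == 0xFFFFu;
}
```

A device whose ID words were overwritten with FFs would report FFFF:FFFF and be indistinguishable, by this test, from an empty slot.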

One possible recovery option would be to see if a bricked interface still 
responds to commands at the appropriate hardware address in the PCI 
configuration space.  Even if the device escapes enumeration because its 
vendor ID and device ID are invalid, the hardware might be alive enough 
to work with an NVM-reflash driver that ignored the PCI ID.  In the best 
case, getting a dead device back could just be a matter of setting up 
MBARB by writing to the configuration space for bus 0 dev 25 fn 0 (for 
ICH9) and reloading the NVM with a reasonable set of defaults.
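
As a sketch of what addressing the device directly could look like, the function below builds the value written to the CONFIG_ADDRESS port (0xCF8) under legacy PCI configuration mechanism #1, which selects a function by bus, device and function number regardless of what its ID registers contain. This only shows the address arithmetic. The port I/O itself, the register offset of MBARB, and whether a corrupted device still answers at all are left open; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build a CONFIG_ADDRESS value for PCI configuration mechanism #1.
 * Bit 31 enables the cycle; bus, device and function select the target;
 * the register offset is rounded down to a dword boundary. */
static inline uint32_t pci_cfg_address(uint8_t bus, uint8_t dev,
                                       uint8_t fn, uint8_t reg)
{
    assert(dev < 32 && fn < 8);   /* 5-bit device, 3-bit function */
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)dev << 11)
         | ((uint32_t)fn  << 8)
         | (reg & 0xFCu);
}
```

For bus 0, device 25, function 0, the ID dword at offset 0x00 would be selected with `pci_cfg_address(0, 25, 0, 0x00)`, which evaluates to 0x8000C800. A recovery tool would write that value to port 0xCF8 and then read or write the data port at 0xCFC, which requires I/O privileges.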

Another recovery possibility would be to rewrite the NVM using the ICH's 
SPI interface.  The same flash chip serves both the BIOS and NIC.  The 
NVM might be rewritable through the ICH even if the NVM is too deeply 
corrupted for the NIC to respond to reflash commands.  This fix would be 
complex, risky, and possibly motherboard specific, but should be able to 
put most bricked 8256x NICs back together unless NVM corruption has 
caused deeper damage.

Finally, given that one write to the wrong part of the NVM will turn one 
of these NICs into a brick, memory mapping the NVM--even for the 
briefest period--seems deeply imprudent.  As long as the NVM is memory 
mapped, all it takes to turn one of these NICs into a brick is for one 
kernel mode bug to make one dword write to any of those registers.  With 
the current cost of ...