Re: Linux 2.6.27-rc8

Previous thread: [GIT pull] timer fixes for .27 by Thomas Gleixner on Monday, September 29, 2008 - 3:15 pm. (7 messages)

Next thread: Re: questions about x86: mtrr cleanup for converting continuous to discrete layout by Dylan Taft on Monday, September 29, 2008 - 3:46 pm. (2 messages)
From: Linus Torvalds
Date: Monday, September 29, 2008 - 3:39 pm

So yet another week, another -rc. This one should be the last one: we're 
certainly not running out of regressions, but at the same time, at some 
point I just have to pick some point, and on the whole the regressions 
don't look _too_ scary. And -rc8 obviously does fix more of them.

Most of the changes since -rc7 are pretty small, and there aren't even a 
whole lot of them. The shortlog (appended) is just a couple of pages, and 
the diffstat is even smaller, but since the dirstat is a dense overview, 
I'll just put that here instead:

   4.6% arch/m32r/kernel/
   5.7% arch/m32r/
   9.5% arch/mips/pci/
  10.4% arch/mips/
   4.2% arch/x86/kernel/
   4.4% arch/x86/
  26.0% arch/
   3.5% drivers/usb/storage/
  10.4% drivers/usb/
   3.6% drivers/watchdog/
  23.8% drivers/
  11.5% fs/xfs/
  13.5% fs/
   3.7% kernel/
   9.8% net/9p/
  10.6% net/
   5.4% scripts/kconfig/
   5.9% scripts/
   7.4% sound/soc/codecs/
   8.4% sound/soc/
  10.1% sound/

and it's actually more spread out than usual. Arch and drivers are just 
half of the patch even when combined. 

Give it a try,

		Linus

---
Adrian Bunk (5):
      m32r: remove the unused NOHIGHMEM option
      m32r: don't offer CONFIG_ISA
      m32r: export empty_zero_page
      m32r: export __ndelay
      m32r/kernel/: cleanups

Adrian Hunter (2):
      UBIFS: TNC / GC race fixes
      UBIFS: remove incorrect assert

Akinobu Mita (2):
      [WATCHDOG] ibmasr: remove unnecessary spin_unlock()
      ibmasr: remove unnecessary spin_unlock()

Alan Cox (1):
      pcmcia: Fix broken abuse of dev->driver_data

Alan Stern (2):
      USB: unusual_devs addition for RockChip MP3 player
      USB: revert recovery from transient errors

Alex Chiang (1):
      [IA64] Ski simulator doesn't need check_sal_cache_flush

Alexander Beregalov (1):
      UBIFS: fix printk format warnings

Alexander Duyck (1):
      netdev: simple_tx_hash shouldn't hash inside fragments

Andrea Righi (1):
      x86, oprofile: BUG ...
From: david
Date: Monday, September 29, 2008 - 4:09 pm

unless there is news that I missed, the E1000 bricking bug is still out 
there. that is a particularly nasty one.

--

From: Jiri Kosina
Date: Monday, September 29, 2008 - 4:33 pm

If 2.6.27 is released with e1000e driver corrupting EEPROM contents on 
many systems out there, rendering the cards unusable for most of the 
i-am-not-a-hacker users (and remember, even Dave Airlie bricked his laptop 
completely to death, when trying to restore eeprom contents), well, I 
personally find that very scary.

Intel is working with us on tracking down and resolving the issue, but 
this is not going as well as one would like to see (one attempt, one card 
with completely hosed EEPROM contents ... and restoring the contents is 
not *that* trivial).

Intel has some patches to mitigate the symptoms (even though we still 
don't know who is causing the breakage, but Xorg is the biggest suspect in 
my eyes), but they are neither in your tree nor in any other maintainer's 
queue yet, as far as I know.

-- 
Jiri Kosina
SUSE Labs

--

From: Linus Torvalds
Date: Monday, September 29, 2008 - 6:56 pm

What's the magic to trigger it? I've got a laptop with that e1000e chip in 
it, and am obviously running a recent kernel on it. Do people have a 
handle on it? Is it actually verified to be kernel-related, and not 
related to the X server etc?

		Linus
--

From: Dave Airlie
Date: Monday, September 29, 2008 - 6:59 pm

On Tue, Sep 30, 2008 at 11:56 AM, Linus Torvalds

If we had the magic we'd have fixed it by now. The current working
theory is that it's X server related. This hasn't been proven, though
my ATI GPU laptop with e1000e seems fine, so it may have some legs.

If it is X related then it's both a kernel + X server issue: the e1000e
driver opens the barn door, the X server drives the horses through it.

Of course until someone produces a way to fix the hw after it breaks,
reproducing this isn't something for the faint hearted. I'm hoping my
laptop comes back today with a brand new motherboard in it.

Dave.
--

From: Arjan van de Ven
Date: Monday, September 29, 2008 - 7:06 pm

On Tue, 30 Sep 2008 11:59:58 +1000

we have a patch to save/restore now, in final testing stages
(obviously we want to be really careful with this)

Note that so far it seems to mostly hit with "new" distros, so both
new kernel and new X... ;(


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Linus Torvalds
Date: Monday, September 29, 2008 - 7:23 pm

Btw, the _real_ bug is clearly in the hardware design that allows you to 
brick those things without apparently even having a lock bit.

I'm hoping Intel doesn't treat this as just a software bug. Some hw 
designer should be thinking hard about which orifice they put their head 
up in.

It used to be that you could fry some monitors by feeding them 
out-of-range signals. The _monitors_ got fixed. 

		Linus
--

From: Linus Torvalds
Date: Monday, September 29, 2008 - 7:24 pm

Mostly. I think you can still do bad things to internal LCD's on at least 
some laptops. Although I hope I'm wrong. 

			Linus
--

From: Alan Cox
Date: Tuesday, September 30, 2008 - 5:06 am

You still can in some cases. You can also erase many video card
firmwares, trash disks, brick DVD drives and the like fairly easily too
but you do tend to have to try to be evil in these cases, not just get an
address wrong.

Alan
--

From: Brandeburg, Jesse
Date: Monday, September 29, 2008 - 8:42 pm

The hardware has a lock bit, and we're trying to figure out why the BIOS 
writers guide doesn't say to set it.  Probably because of the MAC address, 

We will post a patch to e1000e tomorrow that sets a lock bit that prevents 
the registers memory mapped by 0:19.0 BAR1 from causing flash write 
cycles.

The patches I've just posted don't quite do that yet.
--

From: Alan Cox
Date: Tuesday, September 30, 2008 - 5:05 am

I am confident they will, because right now some more malicious virus
writers will be thinking 'whoopeee party time'.
--

From: Linus Torvalds
Date: Monday, September 29, 2008 - 7:21 pm

Are you sure? There was a mandriva report about NVM corruption on an e100 
too (that one apparently just caused PXE failure, the networking worked 
fine).

So I wonder if it's _purely_ X-server-related, and the reason people blame 
2.6.27-rc1 is just timing of some X update and then people just look at 
the kernel because the 'network card failed' looks so kernel-related.

The reason I mention that is right now it looks like the distros are just 
running around disabling the e1000e module, or perhaps downgrading it. 
Which may not even work!

The discussions in some of the bug-trackers seem to be full of people who 
have no actual information, but are perfectly willing to flail around 
wildly saying obviously crazy things.

The Ubuntu people are some of the crazier ones (should I be surprised?), 
but that one also has Ben Collins claiming they use the same e1000e driver 
for the 2.6.26/27 kernels (from Intel's sf.net project). That may be bogus, 
but if true it would indicate that it's possibly not so kernel-related, or 
at least not so e1000e-driver-related.

		Linus
--

From: Dave Airlie
Date: Monday, September 29, 2008 - 7:39 pm

On Tue, Sep 30, 2008 at 12:21 PM, Linus Torvalds

Well from a purely empirical standpoint, I've been running new X
against that laptop for a long time, and others have the same laptop,
so I think it's a problem with the e1000e driver putting the card into
a state which allows X to do bad things. I think X may be causing issues
on other hw, like e100 and some realtek. Also when we say X I think it
looks like Intel driver interaction issues; as I said I'm running the
same stuff on my ATI gpu laptop with e1000e and haven't had any problems.

But I'm leaving this up to Intel, I don't think HP will take it too
kindly if I keep returning my laptop.

Dave.
--

From: Arjan van de Ven
Date: Monday, September 29, 2008 - 8:19 pm

On Mon, 29 Sep 2008 19:21:02 -0700 (PDT)


btw, we're also working on making some parts of the kernel more robust
against certain types of bugs; for example the ioremap checks and sysfs
resource checks. There's a set of checks and API changes we can do to
make it less likely that drivers end up doing bad stuff; but that's
obviously more for 2.6.28 than for .27



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Jiri Kosina
Date: Tuesday, September 30, 2008 - 12:11 am

That is very probably a completely separate issue, and should have been 
fixed already by 78566fecb.

I think that not many people are suspecting a bug in e1000e directly. 
Rather a combination of X bug, kernel allowing X to do bad things (for 
example the missing check in drivers/pci/pci-sysfs.c:pci_mmap_resource() 
looks particularly suspicious) and a "bug-friendly" hardware behavior.
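
The kind of range check Jiri is pointing at can be sketched as follows. This is only an illustrative userspace model, not the kernel code: the function name, its parameters and the 4 KiB page size are assumptions made for the example, and the real pci_mmap_resource() works on kernel structures that are not modeled here.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/*
 * Hypothetical helper: returns 1 if a mapping of `npages` pages that
 * begins `pgoff` pages into a BAR of `bar_len` bytes stays inside the
 * BAR, 0 otherwise. Written so that the arithmetic cannot overflow.
 */
int mmap_fits_bar(uint64_t bar_len, uint64_t pgoff, uint64_t npages)
{
    uint64_t page_mask = (1ULL << SKETCH_PAGE_SHIFT) - 1;

    /* Round the BAR length up to whole pages. */
    uint64_t bar_pages = (bar_len >> SKETCH_PAGE_SHIFT) +
                         ((bar_len & page_mask) ? 1 : 0);

    /* Reject empty mappings and offsets at or past the end of the BAR. */
    if (npages == 0 || pgoff >= bar_pages)
        return 0;

    /* Compare against the remaining pages instead of adding pgoff + npages. */
    return npages <= bar_pages - pgoff;
}
```

Without a check of this shape, a caller that asks for more pages than the BAR holds could end up with a mapping that reaches into whatever sits next to it in the physical address space.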

-- 
Jiri Kosina
SUSE Labs
--

From: Eric Piel
Date: Tuesday, September 30, 2008 - 12:58 am

Likely not, you are mentioning a patch for e1000, while the Mandriva bug 
report is about e100:
https://qa.mandriva.com/show_bug.cgi?id=44192

See you,
Eric

--

From: Luiz Fernando N. Capitulino
Date: Tuesday, September 30, 2008 - 9:28 am

On Tue, 30 Sep 2008 09:58:56 +0200
Eric Piel <eric.piel@tremplin-utc.net> wrote:

| Jiri Kosina wrote:
| > On Mon, 29 Sep 2008, Linus Torvalds wrote:
| > 
| >>> If it is X related then it's both a kernel + X server issue, the e1000e 
| >>> driver opens the barn door, the X server drives the horses through it.
| >> Are you sure? There was a mandriva report about NVM corruption on an e100 
| >> too (that one apparently just caused PXE failure, the networking worked 
| >> fine).
| > 
| > That is very probably a completely separate issue, and should have been 
| > fixed already by 78566fecb.
| Likely not, you are mentioning a patch for e1000, while the Mandriva bug 
| report is about e100:
| https://qa.mandriva.com/show_bug.cgi?id=44192

 Yes, also the reporter has said that he has got the problem with -rc7 and
this fix is available since -rc6.

 Jiri, doesn't e100 need that fix as well?

 Anyway, it is not clear for us whether this is a kernel problem. We
could not reproduce it here and the reporter is now checking his network.

-- 
Luiz Fernando N. Capitulino
--

From: Herton Ronaldo Krzesinski
Date: Tuesday, September 30, 2008 - 11:27 am

He finished his checks and discovered that the e100 issue was in reality a 
hardware problem in the switch being used, which started to fail just now, 
coincidentally with this e1000e issue getting more attention. After swapping 
the switch the problem stopped, so it was just a false alarm. I closed 
https://qa.mandriva.com/show_bug.cgi?id=44192, which was the original report.

--
[]'s
Herton
--

From: Brandeburg, Jesse
Date: Monday, September 29, 2008 - 7:30 pm

my current status mail was posted earlier today to lkml from this
address, since then we've had a local reproduction and are going for
number two.  The reproduction seems racy, i.e. it doesn't happen every
time, so we put it in a loop doing detect, check eeprom, detect, etc,
and we'll see if it fails.
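
For readers wondering what a "check eeprom" step might look like: a minimal sketch, assuming the check is the word-wise checksum used by Intel's e1000-family drivers, where the first 64 16-bit words of the NVM image must sum to 0xBABA. The function name and the in-memory `words` buffer are hypothetical, and the thread does not say which check the test loop actually performs.

```c
#include <stddef.h>
#include <stdint.h>

#define NVM_WORDS_CHECKED 64      /* words 0x00..0x3F, checksum word included */
#define NVM_EXPECTED_SUM  0xBABA  /* value the 16-bit sum must equal */

/*
 * Hypothetical helper: returns 1 if a dumped NVM image passes the
 * checksum, 0 if the image is too short or the sum does not match.
 */
int nvm_image_checksum_ok(const uint16_t *words, size_t nwords)
{
    uint16_t sum = 0;
    size_t i;

    if (nwords < NVM_WORDS_CHECKED)
        return 0;

    for (i = 0; i < NVM_WORDS_CHECKED; i++)
        sum = (uint16_t)(sum + words[i]);   /* wraps modulo 2^16 */

    return sum == NVM_EXPECTED_SUM;
}
```

A checksum like this only detects corruption after the fact. It does nothing to prevent the stray writes, which is what the lock-down patches mentioned below are for.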

Reproduction seems to consistently be around X probing time, no firm
leads yet.  As for Intel we have keithp and jbarnes as well as arjan,
auke, myself and a few others involved.

We have some patches to lock the nvm down, we'll be posting those
tonight and tomorrow, I also have some debug logic (and fixes) to help
prove that we don't think it's a race in e1000e.
--
Jesse
--

From: Thomas Gleixner
Date: Tuesday, September 30, 2008 - 3:07 pm

Can we get the simple debug patches including the fixes which resulted
from them pushed upstream ASAP ?

Thanks,

	tglx
--

From: Jiri Kosina
Date: Tuesday, September 30, 2008 - 12:06 am

So far it seems to be that you need 1) something close to xorg 7.4 and 
2) 2.6.27-rcX kernel to trigger it. Not every system having e1000e is 
affected.

Apparently it is some kind of race, as it usually takes multiple cycles to 
trigger (on one of our testing machines this took three attempts to 
trigger for the first time, and then after unbricking the machine and 
restarting testing, the reproduction tests have been running for several 
hours).

It always seems to happen when X is probing/initializing the graphics 
card. So it really seems to be some badness in Xorg intel driver 
initialization code, and kernel/hardware allows bad things to happen.

Last time I heard, our X developers are suspecting vbeinit initialization 
code in Intel driver and are looking into it.

Also, we are going to release next opensuse/SLES beta with patches that 
should mitigate the problem (Jesse has posted a new version of them), so 
hopefully we will then receive some stacktraces from the users who are 
able to trigger the problem more easily.

-- 
Jiri Kosina
SUSE Labs
--

From: Krzysztof Halasa
Date: Tuesday, September 30, 2008 - 7:09 am

And this e1000e must be ICH*, right? I.e. not a separate e1000e
chip/card?
-- 
Krzysztof Halasa
--

From: Jiri Kosina
Date: Tuesday, September 30, 2008 - 7:11 am

So far all the affected systems I am aware of were ICH.

-- 
Jiri Kosina
SUSE Labs
--

From: Allan, Bruce W
Date: Tuesday, September 30, 2008 - 8:48 am

Ditto here, i.e. we have no similar reports on other parts.

--

From: Jiri Kosina
Date: Wednesday, October 1, 2008 - 8:37 am

[Empty message]
From: J.A.
Date: Tuesday, September 30, 2008 - 12:43 am

Hi....



Dealing with my Aspire One setup, I found this (so obvious I don't send a patch:)


arch/x86/kernel/cpu/mtrr/main.c:

static int __init disable_mtrr_cleanup_setup(char *str)
{
    if (enable_mtrr_cleanup != -1)
        enable_mtrr_cleanup = 0;
    return 0;
}
early_param("disable_mtrr_cleanup", disable_mtrr_cleanup_setup);

static int __init enable_mtrr_cleanup_setup(char *str)
{
    if (enable_mtrr_cleanup != -1)
        enable_mtrr_cleanup = 1;
    return 0;
}
early_param("enble_mtrr_cleanup", enable_mtrr_cleanup_setup);
             ^^^^^^

Nice ;)
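
Why the typo matters can be shown with a small userspace model. This is a sketch under assumptions, not the kernel's early_param() machinery: it assumes option names are matched by exact string comparison, and the struct, table and function names are invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative model only: a boot-option name paired with its handler. */
struct boot_param {
    const char *name;
    int (*setup)(char *arg);
};

int cleanup_enabled = 0;   /* stands in for enable_mtrr_cleanup */

static int enable_cleanup_setup(char *arg)
{
    (void)arg;             /* the option takes no value */
    cleanup_enabled = 1;
    return 0;
}

/* The name is registered with the typo from the snippet above. */
static const struct boot_param params[] = {
    { "enble_mtrr_cleanup", enable_cleanup_setup },
};

/* Returns 1 if some handler matched `word` and was called, else 0. */
int handle_boot_word(const char *word)
{
    size_t i;

    for (i = 0; i < sizeof(params) / sizeof(params[0]); i++) {
        if (strcmp(word, params[i].name) == 0) {
            params[i].setup(NULL);
            return 1;
        }
    }
    return 0;
}
```

In this model a user who passes the correctly spelled "enable_mtrr_cleanup" gets no match and the flag stays at 0; only the misspelled word reaches the handler.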

-- 
J.A. Magallon <jamagallon()ono!com>     \               Software is like sex:
                                         \         It's better when it's free
Mandriva Linux release 2009.0 (Cooker) for i586
Linux 2.6.25-jam18 (gcc 4.3.1 20080626 (GCC) #1 SMP
--

From: Ingo Molnar
Date: Tuesday, September 30, 2008 - 12:55 am

heh. Could you send a patch with a changelog please?

	Ingo
--

From: J.A.
Date: Tuesday, September 30, 2008 - 1:02 am

Here it goes...I hope it's right.

==================

Correct typo for 'enable_mtrr_cleanup' early boot param name.

Signed-off-by: J.A. Magallon <jamagallon@ono.com>

diff -p -up linux/arch/x86/kernel/cpu/mtrr/main.c.orig linux/arch/x86/kernel/cpu/mtrr/main.c
--- linux/arch/x86/kernel/cpu/mtrr/main.c.orig	2008-09-30 09:57:46.000000000 +0200
+++ linux/arch/x86/kernel/cpu/mtrr/main.c	2008-09-30 09:57:55.000000000 +0200
@@ -834,7 +834,7 @@ static int __init enable_mtrr_cleanup_se
 		enable_mtrr_cleanup = 1;
 	return 0;
 }
-early_param("enble_mtrr_cleanup", enable_mtrr_cleanup_setup);
+early_param("enable_mtrr_cleanup", enable_mtrr_cleanup_setup);
 
 struct var_mtrr_state {
 	unsigned long	range_startk;

-- 
J.A. Magallon <jamagallon()ono!com>     \               Software is like sex:
                                         \         It's better when it's free
Mandriva Linux release 2009.0 (Cooker) for i586
Linux 2.6.25-jam18 (gcc 4.3.1 20080626 (GCC) #1 SMP
--

From: Ingo Molnar
Date: Tuesday, September 30, 2008 - 1:05 am

applied to tip/x86/urgent, thanks!

	Ingo
--

From: Domenico Andreoli
Date: Wednesday, October 1, 2008 - 2:33 pm

Ingo, why did you require a patch? Wouldn't it really have been simpler and
easier for everyone if you had written it yourself? Since I am sure it was
not only a matter of laziness (really?), I am very curious to know the reason.

Thank you,
Domenico

-----[ Domenico Andreoli, aka cavok
 --[ http://www.dandreoli.com/gpgkey.asc
   ---[ 3A0F 2F80 F79C 678A 8936  4FEE 0677 9033 A20E BC50
--

From: Willy Tarreau
Date: Wednesday, October 1, 2008 - 10:27 pm

I see two things :
  - preserve authorship of the code
  - "laziness" as you call it, is the only way to scale for a maintainer.

Willy

--

From: Ingo Molnar
Date: Thursday, October 2, 2008 - 2:26 am

yeah, correct. Also, i asked (not required) J.A. Magallón whether he 
could send a patch - if he didn't (no time, etc.) i'd have fixed it 
myself (crediting him in the changelog).

But it's also a general principle: maintainers don't 'own' the code in 
any way and there should be no asymmetry in the ability to modify the 
code. So if people are willing to fix bugs they notice, i prefer that 
far more than me doing it.

	Ingo
--

From: Domenico Andreoli
Date: Thursday, October 2, 2008 - 2:45 am

I think I got the lesson although the asymmetry matter is still not that
clear to me. Anyway I also know that when you talk about code you prefer
patches to plain English so I expect you'd like others to do the same ;)

Thank you,
Domenico

-----[ Domenico Andreoli, aka cavok
 --[ http://www.dandreoli.com/gpgkey.asc
   ---[ 3A0F 2F80 F79C 678A 8936  4FEE 0677 9033 A20E BC50
--

From: H. Peter Anvin
Date: Tuesday, September 30, 2008 - 11:47 am

These options are also named inconsistently with all other options.

The standard way to name a boolean option is "foo" versus "nofoo", in 
this case, "mtrrcleanup" vs "nomtrrcleanup".
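
The "foo" versus "nofoo" convention can be sketched as a single parser that handles both spellings. This is a hypothetical helper written for illustration, not an existing kernel function.

```c
#include <string.h>

/*
 * Hypothetical helper: returns 1 if `word` switches option `name` on,
 * 0 if it switches it off (the same name prefixed with "no"), and -1
 * if `word` does not refer to this option at all.
 */
int parse_bool_option(const char *word, const char *name)
{
    if (strcmp(word, name) == 0)
        return 1;

    if (strncmp(word, "no", 2) == 0 && strcmp(word + 2, name) == 0)
        return 0;

    return -1;
}
```

With one base name there is only one string to misspell, and the "on" and "off" forms cannot drift apart the way two separately registered names can.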

	-hpa
--

From: Yinghai Lu
Date: Tuesday, September 30, 2008 - 12:30 pm

ok, we could change it...

YH
--

From: H. Peter Anvin
Date: Tuesday, September 30, 2008 - 12:59 pm

If we're fixing a typo anyway I'd suggest so.  We know we're not 
breaking anyone's working setup...

	-hpa

--

From: Yinghai Lu
Date: Tuesday, September 30, 2008 - 2:35 pm

mtrr_cleanup and no_mtrr_cleanup?

YH
--

From: H. Peter Anvin
Date: Tuesday, September 30, 2008 - 2:37 pm

Dashes seem to be used more than underscores, so it probably should be 
"mtrr-cleanup" and "nomtrr-cleanup" if you want a separator.

	-hpa
--

From: Yinghai Lu
Date: Tuesday, September 30, 2008 - 2:42 pm

i need to document the mtrr_cleanup_debug too...change it to
mtrrcleanup_debug ? just like initcall_debug?

YH
--

From: H. Peter Anvin
Date: Tuesday, September 30, 2008 - 3:02 pm

I would prefer "mtrr-cleanup-debug" if the main one is "mtrr-cleanup"; 
mixing dashes and underscores is a bit sick.  Unfortunately we have had 
very few attempts at consistency with command line options... some in 
the early days were even StudlyCaps (yuck...)

	-hpa
--
