This message contains a list of some regressions from 2.6.26, for which there are no fixes in the mainline I know of. If any of them have been fixed already, please let me know.

If you know of any other unresolved regressions from 2.6.26, please let me know as well and I'll add them to the list. Also, please let me know if any of the entries below are invalid.

Each entry from the list will be sent additionally in an automatic reply to this message with CCs to the people involved in reporting and handling the issue.

Listed regressions statistics:

  Date          Total  Pending  Unresolved
  ----------------------------------------
  2008-09-21      169       45          36

Unresolved regressions
----------------------

Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11611
Subject		: Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
Submitter	: Rafael J. Wysocki <rjw@sisk.pl>
Date		: 2008-09-20 23:24 (2 days old)
References	: http://marc.info/?l=linux-kernel&m=122195277606974&w=4

Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11610
Subject		: Problem with kernel commit 664d080c41463570b95717b5ad86e79dc1be0877
Submitter	: Michal 'vorner' Vaner <vorner@ucw.cz>
Date		: 2008-09-21 17:35 (1 days old)
References	: http://marc.info/?l=linux-acpi&m=122201853409501&w=4

Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11609
Subject		: oops in find_get_page
Submitter	: Marcin Slusarz <marcin.slusarz@gmail.com>
Date		: 2008-09-20 14:53 (2 days old)
References	: http://marc.info/?l=linux-kernel&m=122192251101892&w=4

Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11608
Subject		: 2.6.27-rc6 BUG: unable to handle kernel paging request
Submitter	: John Daiker <daikerjohn@gmail.com>
Date		: 2008-09-16 23:00 (6 days old)
References	: http://marc.info/?l=linux-kernel&m=122160611517267&w=4

Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11607
Subject		: 2.6.27-rc6 Bug in ...
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11207 Subject : VolanoMark regression with 2.6.27-rc1 Submitter : Zhang, Yanmin <yanmin_zhang@linux.intel.com> Date : 2008-07-31 3:20 (53 days old) References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4 Handled-By : Zhang, Yanmin <yanmin_zhang@linux.intel.com> Peter Zijlstra <a.p.zijlstra@chello.nl> Dhaval Giani <dhaval@linux.vnet.ibm.com> Miao Xie <miaox@cn.fujitsu.com> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11230 Subject : Kconfig no longer outputs a .config with freshly updated defconfigs Submitter : Josh Boyer <jwboyer@linux.vnet.ibm.com> Date : 2008-08-02 16:03 (51 days old) References : http://marc.info/?l=linux-kernel&m=121769306319391&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11220 Subject : Screen stays black after resume Submitter : Nico Schottelius <nico@schottelius.org> Date : 2008-07-31 21:05 (53 days old) References : http://marc.info/?l=linux-kernel&m=121753882422899&w=4 --
This is actually three problems in one :-(. If you try to suspend with a minimum config, will resume still take 30 seconds? Is the problem still there in 2.6.27-rc7? Is there a chance to bisect it? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11215 Subject : INFO: possible recursive locking detected ps2_command Submitter : Zdenek Kabelac <zdenek.kabelac@gmail.com> Date : 2008-07-31 9:41 (53 days old) References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4 Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11224 Subject : Only three cores found on quad-core machine. Submitter : Dave Jones <davej@redhat.com> Date : 2008-08-01 18:15 (52 days old) References : http://marc.info/?l=linux-kernel&m=121761475224719&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11271 Subject : BUG: fealnx in 2.6.27-rc1 Submitter : Jaswinder Singh <jaswinderlinux@gmail.com> Date : 2008-08-05 14:58 (48 days old) References : http://marc.info/?l=linux-netdev&m=121794762016830&w=4 http://lkml.org/lkml/2008/8/10/98 Handled-By : Francois Romieu <romieu@fr.zoreil.com> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11237 Subject : corrupt PMD after resume Submitter : Alan Jenkins <alan-jenkins@tuffmail.co.uk> Date : 2008-08-02 9:51 (51 days old) References : http://marc.info/?l=linux-kernel&m=121767073424952&w=4 Handled-By : Hugh Dickins <hugh@veritas.com> Jeremy Fitzhardinge <jeremy@goop.org> Patch : http://marc.info/?l=linux-kernel&m=122001615314700&w=2 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11210 Subject : libata badness Submitter : Kumar Gala <galak@kernel.crashing.org> Date : 2008-07-31 18:53 (53 days old) References : http://marc.info/?l=linux-ide&m=121753059307310&w=4 Handled-By : Kumar Gala <galak@kernel.crashing.org> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11264 Subject : Invalid op opcode in kernel/workqueue Submitter : Jean-Luc Coulon <jean.luc.coulon@gmail.com> Date : 2008-08-07 04:18 (46 days old) --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11272 Subject : BUG: parport_serial in 2.6.27-rc1 for NetMos Technology PCI 9835 Submitter : Jaswinder Singh <jaswinderlinux@gmail.com> Date : 2008-08-05 15:12 (48 days old) References : http://marc.info/?l=linux-kernel&m=121794900319776&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11340 Subject : LTP overnight run resulted in unusable box Submitter : Alexey Dobriyan <adobriyan@gmail.com> Date : 2008-08-13 9:24 (40 days old) References : http://marc.info/?l=linux-kernel&m=121861951902949&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11357 Subject : Can not boot up with zd1211rw USB-Wlan Stick Submitter : uwe <kender@freenet.de> Date : 2008-08-16 14:17 (37 days old) --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11335 Subject : 2.6.27-rc2-git5 BUG: unable to handle kernel paging request Submitter : Randy Dunlap <randy.dunlap@oracle.com> Date : 2008-08-12 4:18 (41 days old) References : http://marc.info/?l=linux-kernel&m=121851477201960&w=4 http://lkml.org/lkml/2008/8/16/274 Handled-By : Hugh Dickins <hugh@veritas.com> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404 Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr Submitter : rdunlap <randy.dunlap@oracle.com> Date : 2008-08-21 5:52 (32 days old) References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4 http://marc.info/?l=linux-kernel&m=121932889105368&w=4 Handled-By : Miller, Mike (OS Dev) <Mike.Miller@hp.com> James Bottomley <James.Bottomley@hansenpartnership.com> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308 Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28 Submitter : Christoph Lameter <cl@linux-foundation.org> Date : 2008-08-11 18:36 (42 days old) References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4 http://marc.info/?l=linux-kernel&m=122125737421332&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11380 Subject : lockdep warning: cpu_add_remove_lock at:cpu_maps_update_begin+0x14/0x16 Submitter : Ingo Molnar <mingo@elte.hu> Date : 2008-08-20 6:44 (33 days old) References : http://marc.info/?l=linux-kernel&m=121921480931970&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11382 Subject : e1000e: 2.6.27-rc1 corrupts EEPROM/NVM Submitter : David Vrabel <david.vrabel@csr.com> Date : 2008-08-08 10:47 (45 days old) References : http://marc.info/?l=linux-kernel&m=121819267211679&w=4 Handled-By : Christopher Li <chrisl@vmware.com> --
From: "Rafael J. Wysocki" <rjw@sisk.pl>
Fixed by:
commit 78566fecbb12a7616ae9a88b2ffbc8062c4a89e3
Author: Christopher Li <chrisl@vmware.com>
Date: Fri Sep 5 14:04:05 2008 -0700
e1000: prevent corruption of EEPROM/NVM
Andrey reports e1000 corruption, and that a patch in vmware's ESX fixed
it.
The EEPROM corruption is triggered by concurrent access of the EEPROM
read/write. Putting a lock around it solve the problem.
[akpm@linux-foundation.org: use DEFINE_SPINLOCK to avoid confusing lockdep]
Signed-off-by: Christopher Li <chrisl@vmware.com>
Reported-by: Andrey Borzenkov <arvidjaar@mail.ru>
Cc: Zach Amsden <zach@vmware.com>
Cc: Pratap Subrahmanyam <pratap@vmware.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: Bruce Allan <bruce.w.allan@intel.com>
Cc: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
--
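The idea in the commit message above (serialize every EEPROM read and write behind one lock) can be sketched in userspace as follows. This is only an illustration, not the actual e1000 patch: the array fake_nvm, the size NVM_WORDS and the function names are hypothetical, and a pthread mutex stands in for the kernel spinlock that the commit declares with DEFINE_SPINLOCK.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NVM_WORDS 64 /* hypothetical size, for illustration only */

/* Stand-in for the device EEPROM; a real driver talks to hardware. */
static uint16_t fake_nvm[NVM_WORDS];

/* One lock shared by all readers and writers. A pthread mutex plays
 * the role of the kernel spinlock in this userspace sketch. */
static pthread_mutex_t nvm_lock = PTHREAD_MUTEX_INITIALIZER;

/* Write one word; the whole access sequence runs under the lock. */
void nvm_write_word(size_t offset, uint16_t value)
{
    assert(offset < NVM_WORDS);
    pthread_mutex_lock(&nvm_lock);
    fake_nvm[offset] = value;
    pthread_mutex_unlock(&nvm_lock);
}

/* Read one word under the same lock, so a read can never interleave
 * with a write that is still in progress. */
uint16_t nvm_read_word(size_t offset)
{
    uint16_t value;

    assert(offset < NVM_WORDS);
    pthread_mutex_lock(&nvm_lock);
    value = fake_nvm[offset];
    pthread_mutex_unlock(&nvm_lock);
    return value;
}
```

The point is that both paths take the same lock; protecting only the write path would still let a concurrent read disturb a multi-step write sequence on real hardware.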
Just noticed I replied to davem and not to everyone, so I did some further hunting.

Okay, so e1000e seems to have a problem in this area that this *DOESN'T* fix.

I've reconstructed my boot timeline from message logs:

Sep 3rd: I booted rawhide kernel 2.6.27-0.290.rc5.fc10.i686. I suspended/resumed a few times in between with no issues.

Sep 8th: I booted my own 2.6.27-rc5 kernel based on ec0c15afb41fd9ad45b53468b60db50170e22346. This got a corrupted e1000e checksum, and every kernel since has.

Dave. --
From: "Dave Airlie" <airlied@gmail.com> Ok. --
Have you restored the EEPROM contents after they got corrupted for the first time? Once the EEPROM contents get corrupted, the card will then be broken forever, even on a kernel that gets this fixed one day. This is a pretty serious bug in fact, as it renders the hardware of poor users unusable, and just patching the kernel is then not enough to put things back in shape. -- Jiri Kosina SUSE Labs --
From: Jiri Kosina <jkosina@suse.cz> The top priority is to root cause this, so that we can stop the problem from happening as fast as possible, and I'm still waiting for the SHA1 ID that was used for the last kernel Dave booted before the problem occurred which is pretty damn critical for making forward progress here. It could even be some PCI or x86 layer change that caused the corruption, we don't even know yet. --
It was exactly 2.6.27-rc5 + Fedora at the time but we rarely touch these areas, most of the extra code is in other places, and since people are seeing it on !Fedora also I would assume it wasn't these. I think people have seen it on earlier kernels maybe, but I'm not sure. Really, Intel needs to get a fix of some sort out so we can repair the hw so we can root cause the problem. --
From: "Dave Airlie" <airlied@gmail.com>
So I went through the changes from 2.6.27-rc5 until the SHA1
ID ec0c15afb41fd9ad45b53468b60db50170e22346 and there were
definitely no E1000 or E1000E changes during that time.
Included in there is the HPET revert and other similarly themed
changes.
commit b4609472116bb806a95e98d04767189406c74c70
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri Aug 29 14:38:03 2008 -0700
Revert "x86: fix HPET regression in 2.6.26 versus 2.6.25, check hpet against BAR, v3"
This reverts commit a2bd7274b47124d2fc4dfdb8c0591f545ba749dd.
Some power management related changes stand out slightly:
commit 9d3593574702ae1899e23a1535da1ac71f928042
Author: John Kacur <jkacur@gmail.com>
Date: Tue Sep 2 14:36:13 2008 -0700
pm_qos_requirement might sleep
and
commit 74c4633da7994eddcfcd2762a448c6889cc2b5bd
Author: Rafael J. Wysocki <rjw@sisk.pl>
Date: Tue Sep 2 14:36:11 2008 -0700
rtc-cmos: wake again from S5
The rest of the changes in that range look completely benign.
--
Some recent comments on [1] seem to indicate that this is somehow coupled to prior problems/panics with Intel graphics. David, was this also your case, or did the EEPROM get garbled all of a sudden? [1] https://bugzilla.novell.com/show_bug.cgi?id=425480 -- Jiri Kosina SUSE Labs --
And... <http://lwn.net/Articles/299787> <http://bugzilla.kernel.org/show_bug.cgi?id=11382> <https://bugzilla.redhat.com/show_bug.cgi?id=459202> <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555> Best regards, Renato --
I have no evidence in my logs of a graphics panic, but I do do a lot of graphics devel, so it might be a possibility but I'd hate to handwave it away at that. --
From: "Dave Airlie" <airlied@gmail.com> Let's not handwave, but rather try to figure out if that is part of the pattern. Right now we don't have any real leads, so data acquisition is really important at this phase. --
Isn't this reliably reproducible? Assuming yes, Intel are such swell guys that you might ask them to ship a few dozen cards to you to break until you've tracked down the problem. I mean, it's a lot easier to find this sort of fault when you can see it first hand than trying to guess from third parties' reports, isn't it? For some reasonable value of "you", that so. --
From: Jiri Kosina <jkosina@suse.cz> My current suspicion in all of this is either the GEM kernel patches or recent X server. However, the eeprom/nvram programming sequence seems non-trivial on the e1000e. You have to execute a set of precise register writes and register polls to successfully write things out to the nvram. This makes something like a random scribble out to MMIO space less likely to cause this problem. Is there some linear mapping of the nvram that could be written to on these cards? --
I don't think OpenSUSE was shipping any of the GEM bits. --
From: "Dave Airlie" <airlied@gmail.com> Good data point, can someone confirm this? Also, what X server version is the affected OpenSUSE shipping? --
OpenSuSE 11 ships x server version 7.3. -- Cheers, Jeff --
Opensuse 11 is fine. The problem can be reproduced [not only] on opensuse 11.1 beta1, which has xorg-x11-7.4-1.6.x86_64.rpm -- Jiri Kosina --
From: Jiri Kosina <jkosina@suse.cz>

I did some snooping around, and while doing so I noticed that the PCI mmap code for x86 doesn't do one bit of range checking on the size, or any other aspect of the request, wrt. the MMIO regions actually mapped in the BARs of the PCI device. Yikes! It just does a reserve_memtype() on the address range, and says "ok".

So if, for example, the X server tries to mmap() more than an MMIO bar actually maps, the kernel lets the user do this.

It would be very interesting to add the appropriate checks to pci_mmap_page_range() in arch/x86/pci/i386.c. Anyone who wants to do this can use the code in arch/sparc64/kernel/pci.c: __pci_mmap_make_offset() as a guide, and see what happens.

If the MMIO space regions of the video cards sit right before the E1000E ones on the affected systems, that would pretty much convince me that this is the kind of problem we are having here.

This also reminds me that there was that whole set of issues that had to get worked out wrt. write-caching of mappings on x86. --
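The kind of check being asked for above could look roughly like the sketch below. It is not the x86 or sparc64 kernel code: struct mmio_region and mmap_request_fits_bar() are hypothetical names, and the function only decides whether a requested physical range lies entirely inside one BAR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one BAR: base address and length. */
struct mmio_region {
    uint64_t start;
    uint64_t len;
};

/* Return true only if [req_start, req_start + req_len) lies entirely
 * inside the BAR. Written to avoid overflow in start + len. */
bool mmap_request_fits_bar(const struct mmio_region *bar,
                           uint64_t req_start, uint64_t req_len)
{
    uint64_t offset;

    assert(bar != NULL);
    if (req_len == 0 || bar->len == 0)
        return false;
    if (req_start < bar->start)
        return false;

    offset = req_start - bar->start;
    if (offset >= bar->len)
        return false;

    /* Whatever is left of the BAR after 'offset' must cover the request. */
    return req_len <= bar->len - offset;
}
```

With a check of this shape, a request that starts inside the video BAR but runs past its end would be refused instead of silently reaching whatever is mapped next.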
I'm still dubious about this, wouldn't we see other weird-ass side effects if X was trashing the BARs on other devices? I think tglx is on the right path, same problem as e1000, code is stupid, it can reenter the nvram read/write code from irq context, and pwn itself. Dave. --
From: "Dave Airlie" <airlied@gmail.com>

Sure. My theory is that it's a recent xorg change causing this, so I've been going through GIT history for xserver, libpciaccess, and the intel driver for the past year looking for clues.

If there is usually a gap after the video device, there would just be no response from the PCI bus, and the way that's handled is chipset specific. At least a while back, most x86 systems would silently ignore writes and return all 1's in such a case, but they may be generating bus error events these days. I simply don't

The e1000e side here is reproducible way too easily for it to be the same case, as far as I see it. The e1000 driver has probably had this problem for years and we've only recently had some concrete cases of it triggering.

Also, what utility are you running on your system that is even accessing the NVRAM on the e1000e card? Knowing that might help us understand why this problem has appeared now. Maybe there is some diagnostic or monitoring tool that is now becoming prevalent in these distributions where it triggers.

This problem started happening seemingly "all of a sudden", even to people who have been keeping sort-of recent with their kernels, such as yourself. Yet we can't get any sense yet what range of kernel versions are in use when the problem triggers.

I'm about to leave for a week or so in Paris for the netfilter workshop, so I hope that someone other than myself will do some data mining like I have instead of (merely) tossing theories around and finger pointing. --
The only thing I can think of then is either the pciaccess conversion of the intel Xorg driver, The driver seems quite happy to access the NVRAM, I think Thomas has some backtraces that show I've seen it reported at least at 2.6.27-rc1 and maybe even one of Fedora's -rc0 kernels. --
From: "Dave Airlie" <airlied@gmail.com> I don't dispute that the locking is dodgy and likely needs to be fixed like e1000. I'm asking what userland tool or kernel event is triggering the nvram access. It shouldn't even touch the thing after probing and initializing the card. --
Hopefully tglx can supply some traces, I think getting an interrupt during device startup can possibly access the nvram http://www.tglx.de/~tglx/wtf2.txt seems to suggest bad things could happen. Dave. --
Actually another user has just reported [1] that his e1000e card got screwed up exactly at the point when the installer was probing the X configuration. So this really seems a lot like some lethal interaction between intel graphics and the network card. Dave (Airlie, too many Daves on CC here really), do you by any chance see any recent change in the kernel intel graphics parts of DRM that could be causing this breakage? [1] https://bugzilla.novell.com/show_bug.cgi?id=425480#c69 -- Jiri Kosina SUSE Labs --
BTW, why is the PAT fix implemented in commit 242e3df80 needed only for radeons? -- Jiri Kosina SUSE Labs --
Further important observation -- as far as I can see, all machines affected by this bug whatsoever (and the number of reports is increasing) were using i915 DRM. -- Jiri Kosina SUSE Labs --
Good question, mainly because only radeons showed the illegal mapping crash, which was mapping via sysfs _wc files and then doing a UC mapping in the kernel over the same address space would fail. However this was VRAM related and these things don't have VRAM. --
Okay some from the kernel if this isn't in 2.6.26, the drm has introduced no patches I can even remotely claim might affect this. So it's either userspace or PAT related. Dave. --
Another data point in the support of this theory - I've been running all various 2.6.27-rc releases (including rc7) on my HP machine which has an embedded 82566 and Radeon x1650 graphics - and so far I have not seen any problems. Parag --
On Wed, 24 Sep 2008 00:36:38 -0700 (PDT) A data point, just in case it helps... I've not had time to update my desktop system, so this all-Intel, ICH9, e1000e-based box has been stuck at 2.6.27-rc3. It has rawhide as of shortly after the floodgates reopened (but with my own kernel); that means X server 1.5.0 and i810 2.4.2-3. It's happy as a clam. I'm not sure how often this problem bites, but it hasn't gotten me. jon --
Thanks for the information. Seems like it quite often triggers during the very first probing of graphics card during the initial X startup. Karsten is currently writing a tool that will safely restore the EEPROM contents to the card. When this gets done, testing will get much easier and hopefully we'll be able to isolate whether it is e1000e driver (I currently don't think so), DRM kernel code, or xorg 7.4 causing this. Thanks, -- Jiri Kosina SUSE Labs --
I'm running a 2.6.26-rc6 kernel on an X61s laptop, which is an all-Intel ICH8, using the e1000e driver, and I haven't been bitten by the problem either. I'm using an Ubuntu Hardy userspace, which means I'm using a 1.4.0.90 X Server with an i915 drm version 1.6.0 20060119, and my e1000 EEPROM hasn't been blasted to oblivion yet! Personally, I don't plan on upgrading to a newer userspace until we figure out what the heck is going on. :-) - Ted --
If any of you guys has Lenovo thinkpad (T60p ideally) with 8086:104b revision 3 card, could you please send me the respective "ethtool -e" dump? Thanks, -- Jiri Kosina SUSE Labs --
I've been working on a patch to detect (using a timer and checking at
up/down) whether or not the flash has been corrupted, and, if it is
rewrite it with the saved good copy (which obviously only helps if
it's the same boot.)
Unfortunately, I don't have enough time to finish it before I go away
for the weekend, so I'll toss it over the wall and see if it sticks to
anything.
At a glance, one would need to add support for rewriting
adapter->hw.flash from ethtool if someone reprograms the good firmware
back, and writing the good flash back on down/remove if it detects
a change.
Bear in mind, super quick hack, and I haven't even run-tested it yet.
If nobody decides to run with it, I'll probably give it another poke
late tonight.
Definitely-not-signed-off-by-or-tested-by: Kyle
At the very least, if someone pokes in a hexdump of the firmware, at
least we might be able to see some of the method to the madness of the
corruption pattern.
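The check-and-restore idea described above might be sketched like this in plain C, outside the driver. The names flash_is_erased() and flash_check_and_restore() are made up for this sketch; only the notion of a live flash image, a saved good copy and a length comes from the patch below, and how the real patch wires this into the timer is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* True if every byte reads back as 0xff, the pattern reported for
 * corrupted cards in this thread. */
bool flash_is_erased(const uint8_t *flash, size_t len)
{
    assert(flash != NULL);
    for (size_t i = 0; i < len; i++) {
        if (flash[i] != 0xff)
            return false;
    }
    return len > 0;
}

/* Compare the live image against the saved good copy and write the
 * good copy back if they differ. Returns true if a restore happened.
 * A periodic timer (or the up/down paths) would call this. */
bool flash_check_and_restore(uint8_t *flash, const uint8_t *backup,
                             size_t len)
{
    assert(flash != NULL && backup != NULL);
    if (memcmp(flash, backup, len) == 0)
        return false;

    memcpy(flash, backup, len);
    return true;
}
```

As noted in the mail, this only helps while the good copy is still in memory, i.e. within the same boot.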
diff --git a/drivers/net/e1000e/e1000.h b/drivers/net/e1000e/e1000.h
index ac4e506..08cce8c 100644
--- a/drivers/net/e1000e/e1000.h
+++ b/drivers/net/e1000e/e1000.h
@@ -168,6 +168,7 @@ struct e1000_adapter {
struct timer_list watchdog_timer;
struct timer_list phy_info_timer;
struct timer_list blink_timer;
+ struct timer_list flash_timer;
struct work_struct reset_task;
struct work_struct watchdog_task;
diff --git a/drivers/net/e1000e/hw.h b/drivers/net/e1000e/hw.h
index 74f263a..ca3f645 100644
--- a/drivers/net/e1000e/hw.h
+++ b/drivers/net/e1000e/hw.h
@@ -863,6 +863,11 @@ struct e1000_hw {
u8 __iomem *hw_addr;
u8 __iomem *flash_address;
+ int flash_len;
+
+ u8 *flash;
+ u8 *flash_backup;
+ spinlock_t flashlock;
struct e1000_mac_info mac;
struct e1000_fc_info fc;
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index d266510..13f05f8 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -2535,6 +2535,7 @@ void e1000e_down(struct e1000_adapter ...
Thanks Kyle! Attached is a patch to dump the eeprom to dmesg (first 64 bytes) at boot for e1000e, which kind of goes along with the AWOOGA part of your patch.
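For readers without the attachment, a dump of the first 64 bytes could be formatted along these lines. This is a userspace sketch, not the attached patch: format_hex_row() and dump_eeprom_head() are hypothetical helpers, and puts() stands in for the printk() a driver would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format up to 16 bytes as "OFFS: xx xx ..." into 'out'.
 * Returns the number of characters the full row needs. */
int format_hex_row(char *out, size_t out_len, size_t offset,
                   const uint8_t *data, size_t count)
{
    int n;

    assert(out != NULL && data != NULL && count <= 16);
    n = snprintf(out, out_len, "%04zx:", offset);
    for (size_t i = 0; i < count && n >= 0 && (size_t)n < out_len; i++)
        n += snprintf(out + n, out_len - (size_t)n, " %02x", data[i]);
    return n;
}

/* Print the first 64 bytes (or fewer, if the image is shorter),
 * 16 bytes per row. */
void dump_eeprom_head(const uint8_t *eeprom, size_t len)
{
    char row[80];
    size_t limit = len < 64 ? len : 64;

    for (size_t off = 0; off < limit; off += 16) {
        size_t count = limit - off < 16 ? limit - off : 16;

        format_hex_row(row, sizeof(row), off, eeprom + off, count);
        puts(row);
    }
}
```

A healthy image and an all-0xff one are easy to tell apart in such a dump, which is the point of collecting it at boot.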
From: Kyle McMartin <kyle@mcmartin.ca> Looks interesting, I hope someone runs with it :-) If the flash is seen as corrupt, we should print the current process that is running at the time, and perhaps a pt_regs dump, as these might provide the most important clues to diagnosing this. --
Thanks, looks like an interesting e1000e hack that might possibly be of some help. BUT! please have a look at http://lkml.org/lkml/2008/9/24/133 Looks like this device got a lot of 0xff written somewhere in its config space, right? But it isn't an Intel card at all. -- Jiri Kosina SUSE Labs --
That looks like the device disappeared completely. -hpa --
I asked that person and he said reverting to an older kernel made it work again, also there is a fix out as romieu pointed out as well, so this seems to be a different bug for now. Auke --
Absolutely. Or we can even do some dirty hackery in userspace, like LD_PRELOADing X server and checking mmaps() that are close to MMIO regions Unfortunately, looking at the lspci outputs that are in https://bugzilla.novell.com/show_bug.cgi?id=425480 it seems to me that the MMIO regions are quite far away from each other. -- Jiri Kosina SUSE Labs --
Yup on my laptop these were far away and I wondered what could mangle things that badly. Well I'm out of the race, my attempts to re-write my eeprom using an eeprom from an equivalent laptop have totally failed and my BIOS won't boot anymore - so my laptop is == a brick. Dave. --
Uh oh. Shouldn't we put something like the patch below in Linus' tree unless we get this sorted out? Otherwise more and more people who use -rc kernels will run into this, and will get their hardware [hopefully temporarily, but not all users are able to re-flash their network card EEPROMs, right] bricked. I know that it is quite aggressive and is going to disable wired networking on a lot of systems that have been functioning properly, therefore RFC ... From: Jiri Kosina <jkosina@suse.cz> Subject: [PATCH] [RFC] E1000E: temporarily disable e1000e driver E1000E: temporarily disable e1000e driver There is a serious bug somewhere, that renders e1000e network cards unusable on certain hardware configurations by rewriting EEPROM with 0xff all over. Debugging this is not trivial, because: - it is not yet even clear whether the bug is caused by userspace (new version of xorg drivers, bad interaction with PAT, ...) or some bug in kernel code; it's even not yet certain at which exact combination of software versions and hardware configuration this started to trigger - you have only one attempt to test potential fix. If the fix doesn't work, the eeprom of the card is hosed and therefore fixing this has potential to take some time. The tool that will safely restore the previous contents of EEPROM is currently being written, but even this is not trivial (Dave Airlie has turned his notebook into brick while trying to restore the EEPROM contents). Let's therefore mark this driver as broken (though it is very well possible that this particular driver is not at fault at all) until this gets resolved, so that users of -rc kernels don't get their network cards totally unusable. References (information about sw/hw configurations of affected systems might be found in the ...
Something else to worry about is bisections. People seeing an unrelated issue with .27 after release may well be asked to do a bisection and could then run into the issue even if it is fixed before the release. Guess we'll need to wait and see what the root cause is to know if that's Extra datapoint. As far as I've seen this problem has not yet been reported by any people running Debian. This could point to X.Org as Debian currently has 7.3 while I think the reports so far have been with 7.4. I have been running .27-rc kernels myself on a HP 2510p laptop running Debian/lenny which does have the "bad" NIC (ICH9), but it's still working for me. I do have some vague resume from suspend problems, but for now I'm assuming those are unrelated. I have been running the kernels both with and without PAT enabled. Cheers, FJP --
Yes, I think that xorg/xorg i915 driver/libdrm/GEM/whatever are the biggest suspects currently, according to the data that has been gathered so far. Still, what confuses me a little bit -- the EEPROM of the card is set to all 0xff, once the corruption happens. Isn't that quite a coincidence, that bytes representing "nothing" in this context are used? If it were being set to 0 (it's so easy to call memset(0) on a bogus pointer, there are usually lots of them in the code) or to random garbage, it would seem to be much more understandable than 0xff. -- Jiri Kosina SUSE Labs --
Typical card EEPROMs are serial - either I2C or SPI. I believe the Intel cards use SPI EEPROMs, but I'm not sure. [Disclaimer: I don't actually know SPI all that well; I know I2C better. However, I'm pretty sure the following argument does apply to both.] Consider a corruption which turns a read command into a write command -- often just a single bit difference. Now, the EEPROM will expect data in to write, but nothing will be driving the data line, so it will typically be a 1. As the host tries to read, it will therefore fill the EEPROM with all ones. -hpa --
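The argument above can be made concrete with a toy model. The opcodes below are the READ (0x03) and WRITE (0x02) instructions of common 25-series SPI EEPROMs; whether the affected Intel parts use this command set is an assumption, not something established in this thread. The model also ignores details such as the write-enable step and page boundaries, and eeprom_transfer() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Instruction opcodes of common 25-series SPI EEPROMs (assumed here). */
#define SPI_EEPROM_READ  0x03u
#define SPI_EEPROM_WRITE 0x02u

/* Number of bit positions in which two opcodes differ. */
unsigned bit_difference(uint8_t a, uint8_t b)
{
    uint8_t x = (uint8_t)(a ^ b);
    unsigned n = 0;

    while (x) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

/* Toy transfer: the host believes it is reading 'len' bytes. If the
 * opcode the device sees is WRITE instead, nobody drives the data
 * line, so the device samples ones and stores 0xff in every cell. */
void eeprom_transfer(uint8_t opcode_seen, uint8_t *cells,
                     uint8_t *host_buf, size_t len)
{
    assert(cells != NULL && host_buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (opcode_seen == SPI_EEPROM_WRITE)
            cells[i] = 0xff;        /* idle-high line sampled as data */
        else if (opcode_seen == SPI_EEPROM_READ)
            host_buf[i] = cells[i]; /* normal read */
    }
}
```

Under these assumptions a single flipped bit in the opcode is enough to turn a harmless read into a write of all ones, which would match the 0xff pattern people are reporting.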
We have confirmation that this isn't GEM related; according to the Novell bug at https://bugzilla.novell.com/show_bug.cgi?id=425480 people have hit the problem with kernels w/o GEM. That doesn't rule out i915 (though I don't think any changes have gone in since 2.6.26 that would have caused this) or xf86-video-intel. It's possible that X is getting confused about BAR mappings somehow, resulting in a clobbered e1000e NVRAM, but why would the kernel version matter in that case? The only thing that comes to mind would be PAT... Recent versions of the X drivers (using recent libpciaccess code) will try to map the resourceN_wc file in sysfs. It's possible that the map size we end up using is wrong, leading to the situation Dave described earlier where we Presumably one has to write all ones to the EEPROM BAR of the e1000 device to see that pattern? Or is there some way of configuring the EEPROM such that it'll fail to respond to read cycles resulting in all ones for every read back (i.e. target abort)? Jesse --
But the xorg intel driver shipped with xorg 7.4 already has support for GEM, right? So there could still be some bug in the GEM-aware driver Yes, booting with 'nopat' is on my list to try immediately after we are This we could catch easily even with strace, right? -- Jiri Kosina SUSE Labs --
X.Org 7.4 came with xf86-video-intel 2.4.2 right? That doesn't have any GEM bits in it either. However, the "Factory" log at #425480 *does* indicate that a GEM aware 2D driver was loaded (the "[drm:i915_getparam] *ERROR* Unknown parameter 5" message indicates as much), but the kernel was definitely not GEM aware otherwise the call would have succeeded. So that rules out GEM proper, but it could still be a bug in one of the non-GEM paths in the experimental Yep, that one's easy to catch. -- Jesse Barnes, Intel Open Source Technology Center --
That was exactly the point I was trying to make, that these error paths will probably also need auditing, once we rule out the possibility of NVRAM being overwritten from kernelspace. Thanks, -- Jiri Kosina SUSE Labs --
Well the non-GEM paths are really the old codepaths we used in the older drivers.. So unless we do something really dumb... I'd target three areas PAT, pciaccess and e1000e itself. Dave. --
ubuntu has CONFIG_X86_PAT disabled for at least i386 arch, maybe that is relevant. --
On Fri, Sep 26, 2008 at 7:42 AM, Jesse Brandeburg It rules out PAT I suppose, they have seen the issue as well. Dave. --
I wasn't able to rule out PAT from the suse bugreports POV, as we have PAT enabled both for 32bit and 64bit x86. If Ubuntu has it disabled also for 64bit x86 (where do these guys have .config files to check?), I think we can definitely rule out PAT, as there has been at least one report from Ubuntu user on this very issue. -- Jiri Kosina SUSE Labs --
I'm also testing Ubuntu Intrepid on an "affected system" (i945G + ICH7) that uses the e1000e driver, but the card is not affected. As far as I can see, Ubuntu does not currently use PAT on 32-bit systems. The config is attached.
Okay, I just had a scary and hopefully stupid thought. Especially Intel often has backchannels between the chipset and the Ethernet controller for management functions -- anything from WoL to IPMI -- generally over some kind of low-speed serial bus. We're not in a situation where the EEPROM can be touched from the chipset via the SMBus or some other non-CPU channel? -hpa --
I know next to nothing about SMBus and especially those other backchannels, but the 82566 product brief :-) lists support for: - Intel Active Management Technology (AMT) with "System Defence" (whatever that means) - ASF 2.0 I think ASF (Alert Standard Format) is somehow related to IPMI and uses I^2C or something similar (SMBus). 8254x manual says that the EEPROM is divided into 4 parts: one for E1000 hw initialization, one for ASF (Ethernet in ASF mode?), one for external BMC (TCO) (loaded by external BMC from the SMBus) and one for software only (not used by hardware). Some chips only support #1 (and #4 of course). I understand the driver reads the EEPROM using EERD register (which, according to the manual, requires no additional locking) or drives the EEPROM directly, with a lock/unlock protocol (using EECD register). Now some devices lack the lock/unlock bits, but they lack ASF/BMC as well. I imagine chips other than 8254x may be different here. Do we have some "master" bugzilla entry or something like that for these problems? -- Krzysztof Halasa --
We have had historic problems where a very non standard EEPROM setup on some ancient thinkpads ended up with bad stuff happening due to smbus probing and the like. You would then however expect to see the bug occur without loading the e1000* drivers (unless you needed an interaction between the two) --
Perhaps the entire chip has been erased with the "ERAL" (erase all) command. Requires previously issued EWEN (erase/write enable). Each command seems to require several writes to the EEPROM control register. -- Krzysztof Halasa --
From: Jiri Kosina <jkosina@suse.cz> Setting framebuffer bytes to 0xff is pretty common, for example for color keys and anti-aliasing pixel values. --
That seems a bit drastic, particularly when the debugging was beginning to point to another culprit. We have equal case at this point to disable r8169 and i915_drm, no? Jeff --
No, we are actually more likely unable to do anything from the kernel if it's happening from userspace. First, we need a reflash utility that is safe; otherwise people who have the issue can't reproduce it, and people who don't have the issue don't want to play with it. I think e1000e may enable a BAR or something that causes the issue to break this hw; I haven't yet seen it broken on any machine where e1000e wasn't loaded. Again, the r8169 might be the same issue, but it may be because the BAR was enabled. Dave. --
From: "Dave Airlie" <airlied@gmail.com> All PCI device drivers in the kernel first do pci_enable_device() which essentially enables all BARs. The flash lives in BAR 1 of the E1000E, for example. --
on my ich9 based system the e1000e BAR1 regions are back to back with both the vga memory map and the audio mem, either of which could be the mangler, but more likely vga device (say X maybe) since it is mapped I'm really sorry to hear that, I wonder if the laptop has an "emergency bios update" mode like many PCs used to through a jumper. Dave A., let us know if you make any recovery progress. I plan to try some random writes tomorrow to my BAR1 space and see if my flash gets erased. Jesse --
I guess it's more about the E1000's serial configuration EEPROM, the registers seem to live in BAR0 (EECD and for reading perhaps EERD). Corrupted EEPROM (and thus PCI config registers) can easily result in a dead machine. I will be writing a tool for writing 82541PI EEPROMs on a custom board soon (unless there is one available, for Linux, of course), I'm not sure it's the flash that is corrupted. Anyway booting the laptop should be quite easy (physically disabling the EEPROM on boot should do the trick), though it would require taking the machine apart. -- Krzysztof Halasa --
Moreover, we don't actually do any writing (that I know of) of the ROM image from the X drivers or the kernel. In fact, in many cases X should be accessing the RAM copy of the ROM at 0xc0000 rather than via the ROM BAR. That said, adding a check to the x86 code would be a good thing to do; I'll hack up a patch tomorrow unless someone beats me to it. -- Jesse Barnes, Intel Open Source Technology Center --
The problem here is that what we desperately need first is a method to restore the original EEPROM contents after it gets corrupted (David Airlie has, sadly, apparently bricked his notebook while trying to do so). Without this, we can put a lot of debugging/protecting patches into the kernel, but we won't be able to successfully verify anything, because testing wouldn't be possible. Added Jesse and Karsten to CC, as they are working on such a tool right now, as far as I know. -- Jiri Kosina SUSE Labs --
I should be able to test the mmap fix independently of the e1000 breakage at least... lemme try it out now... -- Jesse Barnes, Intel Open Source Technology Center --
Here's a patch that adds range checking to the sysfs mappings at least. This patch should catch the case where X (or some other process) tries to map beyond the specific BAR it's (supposedly) trying to access, making things safer in general. FWIW both my F9 and development versions of X start up fine with this patch applied.

DaveM, will this work for you on sparc? It looked like your code was allowing bridge window mappings, but that behavior should be preserved as long as your bridge devices reflect their window sizes correctly in their pdev->resources?

If we add similar code to the procfs stuff we wouldn't need to do any checking in the arches.

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 9c71858..f4e8b4e 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -502,6 +502,8 @@ pci_mmap_resource(struct kobject *kobj, struct bin_attribute *attr,
 	struct resource *res = (struct resource *)attr->private;
 	enum pci_mmap_state mmap_type;
 	resource_size_t start, end;
+	unsigned long map_len = vma->vm_end - vma->vm_start;
+	unsigned long map_offset = vma->vm_pgoff << PAGE_SHIFT;
 	int i;
 
 	for (i = 0; i < PCI_ROM_RESOURCE; i++)
@@ -510,6 +512,13 @@ pci_mmap_resource(struct kobject *kobj, struct bin_attribute *attr,
 	if (i >= PCI_ROM_RESOURCE)
 		return -ENODEV;
 
+	/*
+	 * Make sure the range the user is trying to map falls within
+	 * the resource
+	 */
+	if (map_offset + map_len > pci_resource_len(pdev, i))
+		return -EINVAL;
+
 	/* pci_mmap_page_range() expects the same kind of entry as coming
 	 * from /proc/bus/pci/ which is a "user visible" value. If this is
 	 * different from the resource itself, arch will do necessary fixup.
--
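For anyone who wants to reason about the check outside the kernel, the range test in the patch above can be modelled in plain userspace C. This is only a sketch: `map_range_fits` and its parameter names are made up for illustration and are not kernel symbols. The comparison is rearranged so that the offset/length sum cannot wrap around, which is an addition in this sketch and not something the patch itself does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the sysfs mmap range check.
 * offset  - byte offset into the BAR (vm_pgoff << PAGE_SHIFT in the patch)
 * len     - length of the requested mapping (vm_end - vm_start in the patch)
 * res_len - size of the BAR (pci_resource_len() in the patch)
 *
 * Returns true when [offset, offset + len) lies entirely inside the BAR.
 * Written as a subtraction so that offset + len can never overflow.
 */
static bool map_range_fits(uint64_t offset, uint64_t len, uint64_t res_len)
{
	if (offset > res_len)
		return false;
	return len <= res_len - offset;
}
```

A mapping of one 4096-byte page at offset 0 of a 4096-byte BAR fits; the same mapping at offset 4096 does not, which corresponds to the -EINVAL case in the patch.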
Good. We will use this on affected machines after we start some real At least for debugging purposes I'd propose to put a printk() there with process name, and the range it tries to map. Thanks, -- Jiri Kosina SUSE Labs --
Actually there is no way of not shipping GEM when shipping xorg 7.4, is there? So GEM could definitely be a potential cause here, I think. -- Jiri Kosina SUSE Labs --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11407
Subject    : suspend: unable to handle kernel paging request
Submitter  : Vegard Nossum <vegard.nossum@gmail.com>
Date       : 2008-08-21 17:28 (32 days old)
References : http://marc.info/?l=linux-kernel&m=121933974928881&w=4
Handled-By : Rafael J. Wysocki <rjw@sisk.pl>
             Pekka Enberg <penberg@cs.helsinki.fi>
             Pavel Machek <pavel@suse.cz>
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11439
Subject    : [2.6.27-rc4-git4] compilation warnings
Submitter  : Rufus & Azrael <rufus-azrael@numericable.fr>
Date       : 2008-08-26 9:37 (27 days old)
References : http://marc.info/?l=linux-kernel&m=121974353815440&w=4
Handled-By : Greg KH <gregkh@suse.de>
Patch      : http://marc.info/?l=linux-kernel&m=121976424221858&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11442
Subject    : btusb hibernation/suspend breakage in current -git
Submitter  : Rafael J. Wysocki <rjw@sisk.pl>
Date       : 2008-08-25 11:37 (28 days old)
References : http://marc.info/?l=linux-bluetooth&m=121966402012074&w=4
Handled-By : Oliver Neukum <oliver@neukum.org>
Patch      : http://marc.info/?l=linux-bluetooth&m=121967226027323&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11459
Subject    : kernel crash after wifi connection established
Submitter  : Alexey Kuznetsov <ak@axet.ru>
Date       : 2008-08-30 03:08 (23 days old)
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11501
Subject    : Failed to open destination file: Permission deniedihex2fw
Submitter  : Andrew Morton <akpm@linux-foundation.org>
Date       : 2008-09-04 18:34 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122055342419068&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11505
Subject    : oltp ~10% regression with 2.6.27-rc5 on stoakley machine
Submitter  : Lin Ming <ming.m.lin@intel.com>
Date       : 2008-09-04 7:06 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122051202202373&w=4
             http://marc.info/?t=122089704700005&r=1&w=4
Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl>
             Gregory Haskins <ghaskins@novell.com>
             Ingo Molnar <mingo@elte.hu>
Patch      : http://marc.info/?l=linux-kernel&m=122194673932703&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11512
Subject    : sort-of regression due to "kconfig: speed up all*config + randconfig"
Submitter  : Alexey Dobriyan <adobriyan@gmail.com>
Date       : 2008-09-05 22:50 (17 days old)
References : http://marc.info/?l=linux-kernel&m=122065498013858&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11507
Subject    : usb: sometimes dead keyboard after boot
Submitter  : Frans Pop <elendil@planet.nl>
Date       : 2008-08-26 21:03 (27 days old)
References : http://marc.info/?l=linux-kernel&m=121977815018224&w=2
Handled-By : Alan Stern <stern@rowland.harvard.edu>
Patch      : http://www.spinics.net/lists/linux-usb/msg09735.html
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11476
Subject    : failure to associate after resume from suspend to ram
Submitter  : Michael S. Tsirkin <m.s.tsirkin@gmail.com>
Date       : 2008-09-01 13:33 (21 days old)
References : http://marc.info/?l=linux-kernel&m=122028529415108&w=4
Handled-By : Zhu Yi <yi.zhu@intel.com>
             Dan Williams <dcbw@redhat.com>
             Jouni Malinen <j@w1.fi>
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11465
Subject    : Linux-2.6.27-rc5, drm errors in log
Submitter  : Gene Heskett <gene.heskett@verizon.net>
Date       : 2008-08-30 18:52 (23 days old)
References : http://marc.info/?l=linux-kernel&m=122012238925775&w=4
Handled-By : Dave Airlie <airlied@gmail.com>
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11506
Subject    : oops during unmount - ext3? (2.6.27-rc5)
Submitter  : Marcin Slusarz <marcin.slusarz@gmail.com>
Date       : 2008-09-04 19:14 (18 days old)
References : http://marc.info/?l=linux-kernel&m=122055573123449&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11516
Subject    : severe performance degradation on x86_64 going from 2.6.26-rc9 -> 2.6.27-rc5
Submitter  : Jason Vas Dias <jason.vas.dias@gmail.com>
Date       : 2008-09-07 13:59 (15 days old)
--
Hi -

Yes, this bug is still a problem with both the latest 2.6.27-rc6 kernel (from Linus' tree 2008-09-21) and with the latest Fedora 10 kernel.

CPU frequency switching is completely disabled both when powernow-k8 (the correct cpufreq module for my x86_64 AMD TL-64x2 2.2GHz CPU) is installed as a module or is built-in, and the CPU frequency remains at its lowest setting; attempts to modify /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq and /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed are not honored, even though /sys/devices/system/cpu/cpu0/cpufreq/governor is "userspace" and scaling_min_freq < scaling_setspeed > scaling_max_freq.

I see no messages from powernow-k8 indicating that it is aware it was unable to set the speed, though I do see a message if I attempt to set an invalid speed (e.g. 600000).

With 2.6.26-rc9, I get a default CPU clock frequency of 2200000; with 2.6.27-rc6, it becomes 800000 and is not switchable. For some reason, powernow-k8 does not autoload with UDEV; but I don't really need it if the speed is already set to its highest level.

On 2.6.27-rc6, after it manages to boot, any low-latency drivers time out (e.g. USB, Terminal, Keyboard, Network) and the machine does not get through the boot-up sequence without becoming overloaded by the kernel's debugging log messages - neither the network, the terminal nor the keyboard work usably. Building a kernel with USB completely disabled and turning off debug log messages allows the machine to boot (after @ 15 minutes) but the speed is still at its lowest setting and cannot be changed.

Also, 2.6.27-rc6 is unable to reboot the machine: it can put the machine into the "HALT" state, with nothing displayed on the screen, but the machine does not power off until manual reset with the power button. Then, after the machine has powered down, it cannot be powered up until the power-on button is depressed for at least two seconds and released ...
I have to admit that I'm confused.
The dmesg output
[ 0.000000] Linux version 2.6.27-rc6.jvd ...
....
[ 26.204477] hub 3-0:1.0: state 7 ports 2 chg 0000 evt 0000
says 26 seconds up to the point where user space should start. Also
USB is active in that log and I don't see timeout messages at all.
I have a hard time to connect this to your problem description
(timeouts, USB off, 15 minutes)
Len, any opinion on this:
[ 0.000000] ACPI Error (tbfadt-0453): 32/64X address mismatch in "Pm2ControlBlock":
[00008800] [0000000000008100], using 64X [20080609]
Thanks,
tglx
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject    : kernel panic: softlockup in tick_periodic() ???
Submitter  : Joshua Hoblitt <j_kernel@hoblitt.com>
Date       : 2008-09-11 16:46 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By : Thomas Gleixner <tglx@linutronix.de>
             Cyrill Gorcunov <gorcunov@gmail.com>
             Ingo Molnar <mingo@elte.hu>
--
[Rafael J. Wysocki - Sun, Sep 21, 2008 at 08:54:19PM +0200]
| This message has been generated automatically as a part of a report
| of recent regressions.
|
| The following bug entry is on the current list of known regressions
| from 2.6.26. Please verify if it still should be listed and let me know
| (either way).
|
|
| Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11543
| Subject    : kernel panic: softlockup in tick_periodic() ???
| Submitter  : Joshua Hoblitt <j_kernel@hoblitt.com>
| Date       : 2008-09-11 16:46 (11 days old)
| References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
| Handled-By : Thomas Gleixner <tglx@linutronix.de>
|              Cyrill Gorcunov <gorcunov@gmail.com>
|              Ingo Molnar <mingo@elte.hu>
|
|

There are really multiple issues touched on in the report: nmi_watchdog hangs, rtc device creation, a NULL deref... I've asked Joshua for more information. Since he has to use the netdev tree for a while, maybe we could wait till the next merge window is closed and check whether nmi_watchdog works. So this is a work in progress.

- Cyrill -
--
The softlockup issue itself is fixed, but there are issues with nmi_watchdog. I think we should remove the regression and keep the bug alive to chase the other issues. Thanks, tglx --
Well, for the sake of documentation I'd prefer to close this bug and create a new non-regression one for the other issues if that's not a problem. Thanks, Rafael --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11549
Subject    : 2.6.27-rc5 acpi: EC Storm error message on bootup
Submitter  : <jmerkey@wolfmountaingroup.com>
Date       : 2008-09-02 21:27 (20 days old)
References : http://marc.info/?l=linux-kernel&m=122039255517586&w=4
Handled-By : Alexey Starikovskiy <astarikovskiy@suse.de>
Patch      : http://marc.info/?l=linux-kernel&m=122098180019264&w=4
--
This bug is corrected by Alexey's patch and has passed all regression tests. Jeff --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11548
Subject    : kernel BUG at drivers/pci/intel-iommu.c:1373!
Submitter  : Chris Mason <chris.mason@oracle.com>
Date       : 2008-09-08 14:26 (14 days old)
References : http://marc.info/?l=linux-kernel&m=122088566310440&w=4
--
I'm unable to reproduce this on 2.6.27-rc7. I don't think it has been fixed, but I'm having a hard time finding a reliable way to trigger it on newer kernels. -chris --
Thanks for the update. For now, I'll close it as 'unreproducible'. Please reopen if it happens again. Thanks, Rafael --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11550
Subject    : pnp: Huge number of "io resource overlap" messages
Submitter  : Frans Pop <elendil@planet.nl>
Date       : 2008-09-09 10:50 (13 days old)
References : http://marc.info/?l=linux-kernel&m=122095745403793&w=4
Handled-By : Rene Herman <rene.herman@keyaccess.nl>
             Bjorn Helgaas <bjorn.helgaas@hp.com>
Patch      : http://marc.info/?l=linux-kernel&m=122098498125536&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11551
Subject    : Semi-repeatable hard lockup on 2.6.27-rc6
Submitter  : Steven Noonan <steven@uplinklabs.net>
Date       : 2008-09-10 18:07 (12 days old)
References : http://marc.info/?l=linux-kernel&m=122107007407994&w=4
--
The machine with these symptoms was sent in for service on Friday. I suspect there may have been dodgy hardware involved on this one. I think this bug should be closed for the time being. Once I get the machine back, I'll reopen the bug if I can still reproduce it. - Steven --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11552
Subject    : Disabling IRQ #23
Submitter  : Justin Mattock <justinmattock@gmail.com>
Date       : 2008-09-09 19:08 (13 days old)
References : http://marc.info/?l=linux-kernel&m=122098735230906&w=4
             http://marc.info/?l=linux-kernel&m=122107367715361&w=4
Handled-By : David Brownell <david-b@pacbell.net>
             Alan Stern <stern@rowland.harvard.edu>
Patch      : http://marc.info/?l=linux-kernel&m=122187222705195&w=4
--
Not sure if it should be. Over here, I did a bad install of isight-firmware-tools, causing hal and udev to clash. After making sure the package was using either hal or udev, there is no message about disabling IRQ #23. If it's not too much trouble, is there a way to verify that this was the case? E.g., if udev creates a dev and then hal creates the same device, will this cause ehci_hcd to emit messages of this kind? If so, then that's what happened; if not, then there's something else causing this. -- Justin P. Mattock --
You didn't read what I wrote earlier, did you? The "HC died" message should NEVER occur! It doesn't matter what games you play with hal and udev -- it should NEVER occur. Not ever. And since the "HC died" is what causes IRQ #23 to be disabled, that shouldn't happen either. Alan Stern --
I apologize for not fully understanding; I'm just getting confused about why and what is causing this to occur. The only reason for playing with hal and udev is to make this message appear; if I leave them out of the picture the system runs fine. Anyway, I'm up for trying anything at this point, and again I apologize for causing any heat. -- Justin P. Mattock --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11568
Subject    : spontaneous reboot on resume with 2.6.27
Submitter  : Andy Wettstein <ajw1980@gmail.com>
Date       : 2008-09-14 20:00 (8 days old)
--
Just verified it is still a problem with 2.6.27-rc7. --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject    : Don't complain about disabled irqs when the system has paniced
Submitter  : Andi Kleen <andi@firstfloor.org>
Date       : 2008-09-02 13:49 (20 days old)
References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11590
Subject    : Nokia 5310 Xpress usb-storage not mounting
Submitter  : David Almaroad <dalmaroad@gmail.com>
Date       : 2008-09-18 21:35 (4 days old)
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11608
Subject    : 2.6.27-rc6 BUG: unable to handle kernel paging request
Submitter  : John Daiker <daikerjohn@gmail.com>
Date       : 2008-09-16 23:00 (6 days old)
References : http://marc.info/?l=linux-kernel&m=122160611517267&w=4
--
On Sun, 21 Sep 2008 20:54:23 +0200 (CEST) As I said in the bugzilla entry: Oops: 000b Bit 3 is set -- the processor detected 1's in reserved bits of the page directory. That can't be good... --
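To make the "Oops: 000b" reading above concrete, here is a minimal userspace sketch that decodes an x86 page-fault error code into its architectural bits (bit 0 present/protection, bit 1 write, bit 2 user mode, bit 3 reserved bit set in a paging entry, bit 4 instruction fetch). The struct and function names are invented for this illustration and are not kernel symbols.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded x86 page-fault error code. */
struct pf_info {
	bool present;  /* bit 0: fault on a present page, not a missing one */
	bool write;    /* bit 1: access was a write */
	bool user;     /* bit 2: access came from user mode */
	bool rsvd;     /* bit 3: a reserved bit was set in a paging entry */
	bool instr;    /* bit 4: access was an instruction fetch */
};

/* Split the low five bits of the error code into named flags. */
static struct pf_info pf_decode(uint32_t error_code)
{
	struct pf_info info;

	info.present = (error_code >> 0) & 1;
	info.write   = (error_code >> 1) & 1;
	info.user    = (error_code >> 2) & 1;
	info.rsvd    = (error_code >> 3) & 1;
	info.instr   = (error_code >> 4) & 1;
	return info;
}
```

For 0x000b this yields present, write and rsvd set, with user and instr clear: a kernel-mode write that hit a paging entry with a reserved bit set.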
[54384.988151] BUG: unable to handle kernel paging request at ffff8800601dd000
[54384.992095] IP: [<ffffffff80375457>] clear_page_c+0x7/0x10
[54384.992095] PGD 202063 PUD 8067 PMD 65d54163 PTE 80002020601dd163
[54384.992095] Oops: 000b [1] SMP DEBUG_PAGEALLOC
I initially suspect PAT (maybe via DEBUG_PAGEALLOC)... but let's see if the
3rd line here is useful.
     xRRRRRRRRRRRRRRRRRRRRRRR|40b|<--MAXPHYS     PHYS-->|...RR.actuwp
PGD:                                         001000000010000001100011
     xRRRRRRRRRRRRRRRRRRRRRRR|40b|<--MAXPHYS     PHYS-->|...RR.actuwp
PUD:                                                 1000000001100111
     xRRRRRRRRRRRRRRRRRRRRRRR|40b|<--MAXPHYS     PHYS-->|...Rs.actuwp
PMD:                                 01100101110101010100000101100011
     xRRRRRRRRRRRRRRRRRRRRRRR|40b|<--MAXPHYS     PHYS-->|...gP.actuwp
PTE: 1000000000000000001000000010000001100000000111011101000101100011
     3210987654321098765432109876543210987654321098765432109876543210
Is this a 36-bit physical address CPU? In which case you have 2 bits in
the pte that are outside "maxphys". Or if it is a 40-bit CPU, then you
have just 1 bit outside maxphys, in which case I'd say it is memory
corruption (maybe a hardware bug, maybe a scribble from elsewhere). So
I'm wrong about PAT.
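The bit counting above can be checked mechanically. The sketch below assumes the usual 64-bit x86 layout, where bits from the CPU's physical address width up to bit 51 of a page-table entry must be zero (bit 63 is NX and bits 52-62 are not counted here). The function names are made up for this illustration, and maxphyaddr is assumed to be between 32 and 52.

```c
#include <stdint.h>

/*
 * Mask of PTE bits [maxphyaddr, 51], which must be zero on a CPU with
 * maxphyaddr physical address bits. Assumes 32 <= maxphyaddr <= 52.
 */
static uint64_t pte_reserved_mask(unsigned int maxphyaddr)
{
	uint64_t below_52 = (UINT64_C(1) << 52) - 1;
	uint64_t below_max = (UINT64_C(1) << maxphyaddr) - 1;

	return below_52 & ~below_max;
}

/* Count how many of those reserved address bits are set in pte. */
static unsigned int pte_reserved_bits_set(uint64_t pte, unsigned int maxphyaddr)
{
	uint64_t v = pte & pte_reserved_mask(maxphyaddr);
	unsigned int n = 0;

	while (v) {
		v &= v - 1;  /* clear the lowest set bit */
		n++;
	}
	return n;
}
```

With the PTE value from the oops, 0x80002020601dd163, this gives 2 for a 36-bit CPU (bits 37 and 45) and 1 for a 40-bit CPU (bit 45), matching the counts in the text.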
Interestingly, the PMD also has a 1 set in a reserved bit (page global),
but according to the Intel docs, the CPU doesn't check that bit, so it
is not faulting there.
Does the machine survive memtest? Is the bug reproducible? If the
answer is no to either of these, I think we can take it off the
regression list. Otherwise, is it possible to track down to a specific
commit?
Thanks,
Nick
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11610
Subject    : Problem with kernel commit 664d080c41463570b95717b5ad86e79dc1be0877
Submitter  : Michal 'vorner' Vaner <vorner@ucw.cz>
Date       : 2008-09-21 17:35 (1 days old)
References : http://marc.info/?l=linux-acpi&m=122201853409501&w=4
--
Hello, Yes, it still does this with the newest kernel (9824b8f11373b0df806c135a342da9319ef1d893). At least for me. With regards -- Please enter password: Michal 'vorner' Vaner --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11609
Subject    : oops in find_get_page
Submitter  : Marcin Slusarz <marcin.slusarz@gmail.com>
Date       : 2008-09-20 14:53 (2 days old)
References : http://marc.info/?l=linux-kernel&m=122192251101892&w=4
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way).

Bug-Entry  : http://bugzilla.kernel.org/show_bug.cgi?id=11611
Subject    : Commit 2344abbcbdb82140050e8be29d3d55e4f6fe860b breaks resume on nx6325
Submitter  : Rafael J. Wysocki <rjw@sisk.pl>
Date       : 2008-09-20 23:24 (2 days old)
References : http://marc.info/?l=linux-kernel&m=122195277606974&w=4
--
Hi Rafael, Correct patch is the one attached to bugzilla entry, not the one you mention. Regards, Alex. --
