Re: 2.6.20-rc6: known unfixed regressions (v2) (part 2)

To: Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, January 24, 2007 - 10:58 pm

It's been more than a week since -rc5, but I blame everybody (including 
me) being away for Linux.conf.au and then me waiting for a few days 
afterwards to let everybody sync up.

So there it is, -rc6, hopefully the last -rc of the series.

I'd like everybody to take a really good look at any regressions that 
Adrian has been pointing out, and that very much includes the people who 
reported them too, so that we can confirm whether they are still active 
and relevant.

As to -rc6 itself: the bulk of it are the MTD updates (including a few new 
drivers), and the POWER update (and the bulk of _that_ in terms of patch 
size being defconfig updates ;)

But there's various random fixes in infiniband, DVB, network drivers, 
scsi, usb, some filesystems (cifs, jffs2, nfs, ntfs, ocfs2) as well as 
core networking too.

Oh, and KVM, of course.

And stuff I probably have already forgotten.

ShortLog appended.

		Linus

---
Adrian Bunk (7):
      [MTD] SSFDC must depend on BLOCK
      [MTD] [NAND] rtc_from4.c: use lib/bitrev.c
      [MTD] make drivers/mtd/cmdlinepart.c:mtdpart_setup() static
      [SCSI] qla2xxx: make qla2x00_reg_remote_port() static
      more ftape removal
      [IRDA] vlsi_ir.{h,c}: remove kernel 2.4 code
      [NET]: Process include/linux/if_{addr,link}.h with unifdef

Adrian Friedli (1):
      HID: GEYSER4_ISO needs quirk

Adrian Hunter (2):
      [MTD] OneNAND: Implement read-while-load
      [MTD] OneNAND: Handle DDP chip boundary during read-while-load

Akinobu Mita (2):
      [JFFS2] Use rb_first() and rb_last() cleanup
      [SCSI] iscsi: fix crypto_alloc_hash() error check

Al Viro (5):
      funsoft: ktermios fix
      horizon.c: missing __devinit
      s2io bogus memset
      fix prototype of csum_ipv6_magic() (ia64)
      s2io bogus memset

Alan Cox (1):
      [MTD] MAPS: esb2rom: use hotplug safe interfaces

Alexey Dobriyan (2):
      [MTD] JEDEC probe: fix comment typo (devic)
      [MIPS] There is no __GNUC_MAJOR__

Amit Choudha...
To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, Alan Cox <alan@...>
Date: Saturday, January 27, 2007 - 6:11 pm

ata_piix survives exactly one suspend/resume cycle. After resuming the
second time the disk is no longer usable.

After the first resume a simple "emacs -nw bla.txt" already takes ~45 sec
to launch, but there are no kernel messages.

During the second resume the ATA interrupt gets disabled due to an
unhandled interrupt.

This is 100% reproducible. So I can provide as much info as needed.

	tglx

Boot:

SCSI subsystem initialized
libata version 2.00 loaded.
ata_piix 0000:00:1f.2: version 2.00ac7
ata_piix 0000:00:1f.2: MAP [ P0 P2 XX XX ]
ata_piix 0000:00:1f.2: invalid MAP value 0
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 22 (level, low) -> IRQ 21
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ata1: SATA max UDMA/133 cmd 0x18D0 ctl 0x18C6 bmdma 0x18B0 irq 21
ata2: SATA max UDMA/133 cmd 0x18C8 ctl 0x18C2 bmdma 0x18B8 irq 21
scsi0 : ata_piix
PM: Adding info for No Bus:host0
ata1.00: ATA-7, max UDMA/133, 195371568 sectors: LBA48 NCQ (depth 0/32)
ata1.00: ata1: dev 0 multi count 16
ata1.00: configured for UDMA/133
scsi1 : ata_piix
PM: Adding info for No Bus:host1
scsi 0:0:0:0: Direct-Access     ATA      ST9100824AS      3.14 PQ: 0 ANSI: 5
PM: Adding info for scsi:0:0:0:0
SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sda: sda1 sda2 < sda5 > sda3
sd 0:0:0:0: Attached scsi disk sda

1st Suspend:

ata_piix 0000:00:1f.2: suspend
ACPI: PCI interrupt for device 0000:00:1f.2 disabled
PIIX_IDE 0000:00:1f.1: suspend
....
PIIX_IDE 0000:00:1f.1: LATE suspend

1st Resume:

ata1.00: configured for UDMA/133
SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
sda: Write Protect is off
sda: Mod...
To: <tglx@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Alan Cox <alan@...>
Date: Saturday, January 27, 2007 - 6:40 pm

Is this a regression, or behavior that's always been present?

If it's a regression, what changeset caused the problem?

	Jeff



-
To: Jeff Garzik <jeff@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Alan Cox <alan@...>
Date: Saturday, January 27, 2007 - 6:44 pm

Hey. I just discovered that crap. I'm going to bisect tomorrow. Bed time
here in good old Europe. :)

	tglx


-
To: Jeff Garzik <jeff@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Alan Cox <alan@...>
Date: Sunday, January 28, 2007 - 6:05 pm

It seems to be there in 2.6.18 already, although it takes more
suspend/resume cycles to show up. So it's just the surfacing of some
longer standing problem. Just went unnoticed.

	tglx


-
To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Stephen Hemminger <shemminger@...>, Jeff Garzik <jeff@...>
Date: Saturday, January 27, 2007 - 4:55 pm

Reverting commit 44ade178249fe53d055fd92113eaa271e06acddd
(sky2: power management/MSI workaround) makes the problem go away.

With the commit it breaks sky2 resume on my laptop:

1. request_irq in early resume is triggering:
BUG: sleeping function called from invalid context
at /home/tglx/work/kernel/vanilla/linux-2.6/mm/slab.c:3034

This is easily resolvable by moving the request_irq into the normal resume
path. There is no need to have this in early resume.

2. The network device is unusable after resume. The only way to resurrect
it is: rmmod/insmod.

The reason is, that the driver grabs the normal PCI irq on resume, but
the pci express resume routes it away. All we get is an unhandled
spurious interrupt on the irq line which was used by the net device
before suspend:

irq 219, desc: c045bb80, depth: 0, count: 9607, unhandled: 0
->handle_irq():  c0155c20, handle_bad_irq+0x0/0x1f0
->chip(): c0418920, no_irq_chip+0x0/0x40
-&gt;action(): 00000000
  IRQ_DISABLED set
unexpected IRQ trap at vector db

	tglx


-
To: <tglx@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Stephen Hemminger <shemminger@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 3:31 pm

Does this fix it?
---
 drivers/net/sky2.c |   43 ++++++++++++++++++-------------------------
 1 file changed, 18 insertions(+), 25 deletions(-)

--- sky2-2.6.orig/drivers/net/sky2.c	2007-01-29 10:05:12.000000000 -0800
+++ sky2-2.6/drivers/net/sky2.c	2007-01-29 10:29:56.000000000 -0800
@@ -3675,6 +3675,12 @@
 	sky2_write32(hw, B0_IMSK, 0);
 	sky2_power_aux(hw);
 
+	/* Turn off IRQ to avoid power management bug (see resume) */
+	if (hw->msi) {
+		free_irq(pdev->irq, hw);
+		pci_disable_msi(pdev);
+	}
+
 	pci_save_state(pdev);
 	pci_enable_wake(pdev, pci_choose_state(pdev, state), wol);
 	pci_set_power_state(pdev, pci_choose_state(pdev, state));
@@ -3700,6 +3706,18 @@
 
 	sky2_write32(hw, B0_IMSK, Y2_IS_BASE);
 
+	/* Can't re-enable MSI because kernel resume ordering is broken
+	 * and calls device resume before ACPI (BIOS) is called.
+	 * BIOS then resets device to INTx!
+	 */
+	if (hw->msi) {
+		err = request_irq(pdev->irq, sky2_intr, IRQF_SHARED,
+				  hw->dev[0]->name, hw);
+		if (err)
+			goto out;
+		hw->msi = 0;
+	}
+
 	for (i = 0; i < hw->ports; i++) {
 		struct net_device *dev = hw->dev[i];
 		if (netif_running(dev)) {
@@ -3721,29 +3739,6 @@
 	pci_disable_device(pdev);
 	return err;
 }
-
-/* BIOS resume runs after device (it's a bug in PM)
- * as a temporary workaround on suspend/resume leave MSI disabled
- */
-static int sky2_suspend_late(struct pci_dev *pdev, pm_message_t state)
-{
-	struct sky2_hw *hw = pci_get_drvdata(pdev);
-
-	free_irq(pdev->irq, hw);
-	if (hw->msi) {
-		pci_disable_msi(pdev);
-		hw->msi = 0;
-	}
-	return 0;
-}
-
-static int sky2_resume_early(struct pci_dev *pdev)
-{
-	struct sky2_hw *hw = pci_get_drvdata(pdev);
-	struct net_device *dev = hw->dev[0];
-
-	return request_irq(pdev->irq, sky2_intr, IRQF_SHARED, dev->name, hw);
-}
 #endif
 
 static void sky2_shutdown(struct pci_dev *pdev)
@@ -3783,8 +3778,6 @@
 #ifdef CONFIG_PM
 	.suspend = sky2_suspend,
 	.resume = sky2...
To: Stephen Hemminger <shemminger@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Stephen Hemminger <shemminger@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 4:10 pm

patching file drivers/net/sky2.c
Hunk #1 FAILED at 3675.
Hunk #2 succeeded at 3625 (offset -81 lines).
Hunk #3 succeeded at 3738 with fuzz 1 (offset -1 lines).
Hunk #4 succeeded at 3668 with fuzz 2 (offset -110 lines).
1 out of 4 hunks FAILED -- saving rejects to file drivers/net/sky2.c.rej

# grep -c sky2_power_aux drivers/net/sky2.c
0

Shrug.

	tglx


-
To: <tglx@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 5:38 pm

On Mon, 29 Jan 2007 21:10:30 +0100

Sorry it was against the last patch I sent to Jeff for netdev.
Here is against 2.6.20-rc6

---
 drivers/net/sky2.c |   43 ++++++++++++++++++-------------------------
 1 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index a2e804d..d85de63 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -3598,6 +3598,12 @@ static int sky2_suspend(struct pci_dev *
 		}
 	}
 
+	/* Turn off IRQ to avoid power management bug (see resume) */
+	if (hw->msi) {
+		free_irq(pdev->irq, hw);
+		pci_disable_msi(pdev);
+	}
+
 	sky2_write32(hw, B0_IMSK, 0);
 	pci_save_state(pdev);
 	sky2_set_power_state(hw, pstate);
@@ -3619,6 +3625,18 @@ static int sky2_resume(struct pci_dev *p
 
 	sky2_write32(hw, B0_IMSK, Y2_IS_BASE);
 
+	/* Can't re-enable MSI because kernel resume ordering is broken
+	 * and calls device resume before ACPI (BIOS) is called.
+	 * BIOS then resets device to INTx!
+	 */
+	if (hw->msi) {
+		err = request_irq(pdev->irq, sky2_intr, IRQF_SHARED,
+				  hw->dev[0]->name, hw);
+		if (err)
+			goto out;
+		hw->msi = 0;
+	}
+
 	for (i = 0; i < hw->ports; i++) {
 		struct net_device *dev = hw->dev[i];
 		if (netif_running(dev)) {
@@ -3639,29 +3657,6 @@ static int sky2_resume(struct pci_dev *p
 out:
 	return err;
 }
-
-/* BIOS resume runs after device (it's a bug in PM)
- * as a temporary workaround on suspend/resume leave MSI disabled
- */
-static int sky2_suspend_late(struct pci_dev *pdev, pm_message_t state)
-{
-	struct sky2_hw *hw = pci_get_drvdata(pdev);
-
-	free_irq(pdev->irq, hw);
-	if (hw->msi) {
-		pci_disable_msi(pdev);
-		hw->msi = 0;
-	}
-	return 0;
-}
-
-static int sky2_resume_early(struct pci_dev *pdev)
-{
-	struct sky2_hw *hw = pci_get_drvdata(pdev);
-	struct net_device *dev = hw->dev[0];
-
-	return request_irq(pdev->irq, sky2_intr, IRQF_SHARED, dev->name, hw);
-}
 #endif
 
 static struct pci_driver...
To: Stephen Hemminger <shemminger@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:23 pm

Still the same problem. The only difference of this patch to the
previous version is, that the unhandled interrupt message is gone.

As I said before:

Reverting commit 44ade178249fe53d055fd92113eaa271e06acddd, which added
this hackery in the first place, makes the device survive
suspend/resume.

	tglx


-
To: <tglx@...>
Cc: Stephen Hemminger <shemminger@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:38 pm

I see the same symptoms on my Intel Mac Mini, and reverting the commit
also allows the driver to seemingly resume correctly. 

However after coming out of sleep I need to reconfigure the network
interface. No need to rmmod/insmod, just ifdown/ifup is sufficient (but
of course shouldn't be necessary, should it?). If I don't reconfigure
it, ping from/to the box will work, but nothing more complicated like
ssh will go through.

Fred.

-
To: Frédéric <frederic.riss@...>
Cc: Stephen Hemminger <shemminger@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:45 pm

That's probably a userspace problem. Are you using DHCP ?

	tglx


	

-
To: <tglx@...>
Cc: Stephen Hemminger <shemminger@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:50 pm

Yep DHCP. Is that a known issue? I never had to reconfigure with older
kernels.

Fred.

-
To: Frédéric <frederic.riss@...>
Cc: Stephen Hemminger <shemminger@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:57 pm

Is dhclient running after resume ? What's the output of ifconfig (before
you do ifdown/up) ? Have you checked the syslog ?

	tglx


-
To: <tglx@...>
Cc: Stephen Hemminger <shemminger@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 7:26 pm

The process is of course in the process list, if that's what you mean by

The output is always the same modulo the transmitted packet numbers:

eth0      Link encap:Ethernet  HWaddr 00:16:CB:A2:E4:43  
          inet addr:192.168.0.101  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::216:cbff:fea2:e443/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:269 errors:0 dropped:0 overruns:0 frame:0
          TX packets:57 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:72528 (70.8 KiB)  TX bytes:7900 (7.7 KiB)
          Interrupt:17 


Yes of course. Nothing interesting.

Fred.

-
To: Frédéric <frederic.riss@...>
Cc: Stephen Hemminger <shemminger@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 7:37 pm

Just got the same issue on one of my test boxen. Different network card
though. The interface comes up fine, but DNS is not working. ifdown/up
resolves it.

/me keeps an eye on that.

	tglx




-
To: <unlisted-recipients@...>
Cc: Frédéric <frederic.riss@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, <netdev@...>
Date: Monday, January 29, 2007 - 7:50 pm

The Sony VAIO BIOS resets to INTx on resume. This happens
after device resume, so device irq's get misrouted.

This hack turns off MSI on this laptop, until power management
initialization order is fixed.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>

---
 drivers/pci/quirks.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index ef882a8..9a64179 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -21,6 +21,7 @@ #include <linux/pci.h>
 #include <linux/init.h>
 #include <linux/delay.h>
 #include <linux/acpi.h>
+#include <linux/dmi.h>
 #include "pci.h"
 
 /* The Mellanox Tavor device gives false positive parity errors
@@ -1779,6 +1780,37 @@ static void __devinit quirk_nvidia_ck804
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_DEVICE_ID_NVIDIA_CK804_PCIE,
 			quirk_nvidia_ck804_msi_ht_cap);
+
+/* On Sony VAIO laptop, BIOS resets MSI during resume. */
+static __initdata struct dmi_system_id sony_dmi_table[] = {
+	{
+		.ident = "Sony Vaio",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Sony Corporation"),
+			DMI_MATCH(DMI_PRODUCT_NAME, "PCG-"),
+		},
+	},
+	{
+		.ident = "Sony Vaio",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "Sony Corporation"),
+			DMI_MATCH(DMI_PRODUCT_NAME, "VGN-"),
+		},
+	},
+	{ }
+};
+
+static void __init quirk_sony_msi(struct pci_dev *dev)
+{
+	if (!dmi_check_system(sony_dmi_table))
+		return;
+
+	pci_msi_quirk = 1;
+	printk(KERN_WARNING "PCI: MSI sony quirk detected. pci_msi_quirk set.\n");
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82801BA_6,
+			quirk_sony_msi);
+
 #endif /* CONFIG_PCI_MSI */
 
 EXPORT_SYMBOL(pcie_mch_quirk);
-- 
1.4.1
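
A minimal user-space sketch of how the table in the patch above would be matched. It assumes that DMI_MATCH compares by substring (strstr-style) rather than by exact string, and that the reported model names are what DMI_PRODUCT_NAME contains. The struct and function names here are hypothetical stand-ins, not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one dmi_system_id entry: both substrings
 * must occur in the corresponding DMI field for the entry to match. */
struct dmi_entry_sketch {
	const char *vendor_substr;
	const char *product_substr;
};

/* Mirrors the two entries of sony_dmi_table in the patch above. */
static const struct dmi_entry_sketch sony_table_sketch[] = {
	{ "Sony Corporation", "PCG-" },
	{ "Sony Corporation", "VGN-" },
	{ NULL, NULL }	/* terminator, like the empty { } entry */
};

/* Returns 1 if any table entry matches, 0 otherwise.
 * Assumption: matching is by substring, not exact comparison. */
int sony_quirk_would_apply(const char *sys_vendor, const char *product_name)
{
	const struct dmi_entry_sketch *e;

	assert(sys_vendor != NULL && product_name != NULL);
	for (e = sony_table_sketch; e->vendor_substr != NULL; e++) {
		if (strstr(sys_vendor, e->vendor_substr) &&
		    strstr(product_name, e->product_substr))
			return 1;
	}
	return 0;
}
```

Under these assumptions, any product name containing "VGN-" matches, so both VAIO models named in this thread (VGN-SZ2XP_C and VGN-N170G) would get the quirk regardless of their BIOS version.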

-
To: Stephen Hemminger <shemminger@...>
Cc: Frédéric <frederic.riss@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, <netdev@...>
Date: Monday, January 29, 2007 - 8:22 pm

Err? My Sony VAIO does _NOT_ do that. It works fine without that. 
It's just the sky2 hackery which fucked up things.

	tglx


	

-
To: Stephen Hemminger <shemminger@...>
Cc: Frédéric <frederic.riss@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, <netdev@...>
Date: Monday, January 29, 2007 - 8:26 pm

Still it stands: 

Your sky2 patch #44ade178249fe53d055fd92113eaa271e06acddd is broken.

Just get it.

	tglx


-
To: <tglx@...>
Cc: Frédéric <frederic.riss@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, <netdev@...>
Date: Monday, January 29, 2007 - 8:21 pm

On Tue, 30 Jan 2007 01:22:54 +0100

What machine and BIOS version?


-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-
To: Stephen Hemminger <shemminger@...>
Cc: Frédéric <frederic.riss@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, <netdev@...>
Date: Monday, January 29, 2007 - 8:31 pm

VGN-SZ2XP_C
BIOS: R0081N0

	tglx


-
To: <tglx@...>
Cc: Frédéric <frederic.riss@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, <netdev@...>
Date: Monday, January 29, 2007 - 8:31 pm

On Tue, 30 Jan 2007 01:31:33 +0100

Mine is:
	VGN-N170G
	BIOS: R0020J4

It might be BIOS bug that has been fixed, but updating the
BIOS requires Windows. It checks for some ID so even Wine
won't work.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-
To: Thomas Gleixner <tglx@...>
Cc: Stephen Hemminger <shemminger@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:37 pm

I suspect some BIOSes do *not* screw up the MSI thing on resume, and 
others do.

I would suggest that the real fix is to not do that kind of hackery at 
suspend/resume time (because we can't know what the heck the BIOS does), 
and instead just do one of two cases:

 - since MSI is known to be broken for the sky2 driver due to firmware 
   bugs, just disable it by default if CONFIG_PM is enabled. The 
   advantages of MSI just aren't all that compelling. Possibly add a 
   command line option to force MSI to be enabled regardless.

   Simple, direct, and should work for everybody.

 - Just add a command line to disable MSI for people that it breaks for. 

   I don't actually like this one. It defaults to the unsafe behaviour, 
   and while that makes sense in a "well, your machine is broken anyway" 
   kind of way, the thing is, the advantages of MSI just aren't big enough 
   to warrant defaulting to a known-unsafe thing, even if only a small 
   percentage of machines are affected.

With _eventually_ maybe having a third possible situation:

 - some way of figuring it out dynamically.

The third case doesn't seem to be very likely in the short term, though, 
which is why I'd suggest one of the first two (the first one being 
probably the best one).

Comments?

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Stephen Hemminger <shemminger@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 7:42 pm

Commit 44ade178249fe53d055fd92113eaa271e06acddd breaks sane
MSI/ACPI/BIOS combinations. It's impossible to keep broken and sane
MSI/ACPI/BIOSes happy at the same time.

Revert the patch and disable MSI for sky2 when CONFIG_PM is enabled.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index a2e804d..420fef7 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -91,7 +91,11 @@ static int copybreak __read_mostly = 128;
 module_param(copybreak, int, 0);
 MODULE_PARM_DESC(copybreak, "Receive copy threshold");
 
+#ifdef CONFIG_PM
+static int disable_msi = 1;
+#else
 static int disable_msi = 0;
+#endif
 module_param(disable_msi, int, 0);
 MODULE_PARM_DESC(disable_msi, "Disable Message Signaled Interrupt (MSI)");
 
@@ -3601,6 +3605,7 @@ static int sky2_suspend(struct pci_dev *pdev, pm_message_t state)
 	sky2_write32(hw, B0_IMSK, 0);
 	pci_save_state(pdev);
 	sky2_set_power_state(hw, pstate);
+
 	return 0;
 }
 
@@ -3640,28 +3645,6 @@ out:
 	return err;
 }
 
-/* BIOS resume runs after device (it's a bug in PM)
- * as a temporary workaround on suspend/resume leave MSI disabled
- */
-static int sky2_suspend_late(struct pci_dev *pdev, pm_message_t state)
-{
-	struct sky2_hw *hw = pci_get_drvdata(pdev);
-
-	free_irq(pdev->irq, hw);
-	if (hw->msi) {
-		pci_disable_msi(pdev);
-		hw->msi = 0;
-	}
-	return 0;
-}
-
-static int sky2_resume_early(struct pci_dev *pdev)
-{
-	struct sky2_hw *hw = pci_get_drvdata(pdev);
-	struct net_device *dev = hw->dev[0];
-
-	return request_irq(pdev->irq, sky2_intr, IRQF_SHARED, dev->name, hw);
-}
 #endif
 
 static struct pci_driver sky2_driver = {
@@ -3672,8 +3655,6 @@ static struct pci_driver sky2_driver = {
 #ifdef CONFIG_PM
 	.suspend = sky2_suspend,
 	.resume = sky2_resume,
-	.suspend_late = sky2_suspend_late,
-	.resume_early = sky2_resume_early,
 #endif
 };
 


-
To: Linus Torvalds <torvalds@...>
Cc: Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:40 pm

On Mon, 29 Jan 2007 14:37:23 -0800 (PST)

MSI works fine for almost all systems (except AMD systems where

Module option out already exists.

-
To: Stephen Hemminger <shemminger@...>
Cc: Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 7:04 pm

Why do you ignore reality?

MSI does *not* work fine, exactly because the firmware screws it up.

The fact that on a "hardware level" it may work is totally irrelevant. The 
*only* thing that matters is what people actually see.

"Positivism" may not be a hot philosophy these days any more, but dang, it 
certainly is better than what you seem to espouse: "in theory things work 
fine".

And if you don't like positivism, how about just simple scientific method: 
a theory is *proven*wrong* by a single observation to the opposite. And we 
have several people standing up saying that your theory is wrong.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 7:45 pm

On Mon, 29 Jan 2007 15:04:06 -0800 (PST)

Why do you insist on maintaining the wrong initialization order
on resume? When I raised the issue, Len brought up that the resume
order did not match spec, but then there has been slow progress
in fixing it (it's buried in -mm tree).





-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-
To: Stephen Hemminger <shemminger@...>
Cc: Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 8:12 pm

It's not getting merged, SINCE IT DOESN'T WORK. It causes all sorts of 
problems, because ACPI requires all kinds of things to be up and running 
in order to actually work, and that in turn breaks all the devices that 
have different ordering constraints.

ACPI is a piece of sh*t. It asks the OS to do impossible things, like 
running it early in the config sequence when it then at the same time 
wants to depend on stuff that are there *late* in the sequence. It's not 
the first time this insane situation has happened, either.

But we'll try to merge the patch that totally switches around the whole 
initialization order hopefully early after 2.6.20. But no way in hell do 
we do it now, and I personally suspect we'll end reverting it when we do 
try it just because it will probably break other things. But we'll see.

In the meantime, sky2 doesn't work with MSI.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, Rafael J. Wysocki <rjw@...>
Date: Tuesday, January 30, 2007 - 4:57 am

And it will not be the last:-)

There are really two cases, one is easy, one hard:

1. The ACPI spec, our knowledge of the HW, and talking to our own BIOS
    folks tell us quite a bit about how things are supposed to work.

2. "Windows Bug Compatibility" (tm)
    When OEMs build systems and test them only with Windows, then
    the implementation quirks of Windows get ingrained in the platforms.
    Linux then tries to run on the same platform and wonders why
    the BIOS does "unusual" things.  The answer is because it has been
    only tested on Windows and BIOS quirks slip through Windows testing.

    To be fair, the exact same thing would happen in reverse to Windows
    if vendors only tested with Linux.

    http://www.linuxfirmwarekit.org/ is intended to help mitigate some of this
    problem.  So at least vendors that care about Linux can make sure that
    they minimize the curve balls they throw us.

An example of a recent curve ball is when the BIOS supplies two APIC (MADT)
tables.  Well, the spec says there should be only one...  We have proof
that Windows doesn't use the 1st for enumerating processors because
Windows works on a box with a garbled 1st table.
If we prove that Windows doesn't use the second either then it means
they enumerate processors  via the DSDT -- which means bringing up
the ACPI interpreter before bringing up SMP -- and that would require

I agree with this plan, and I concur with your outlook.

I think Rafael is holding the ball here as we wait for an SMP-safe freezer:
http://lists.osdl.org/pipermail/linux-pm/2006-December/004233.html

cheers,
-Len
-
To: Len Brown <lenb@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, February 1, 2007 - 8:49 am

Well, as we can do cpu hotplug these days... we could do this. Just
boot up with single cpu, then bring up additional cpus at runtime...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To: Len Brown <lenb@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Tuesday, January 30, 2007 - 12:01 pm

Hi,


Well, no longer. :-)

The freezer in 2.6.20-rc6 should be SMP-safe and the patches to change
the suspend-resume code ordering are in -mm:

pm-change-code-ordering-in-mainc.patch
swsusp-change-code-ordering-in-diskc.patch
swsusp-change-code-order-in-diskc-fix.patch
swsusp-change-code-ordering-in-userc.patch
swsusp-change-code-ordering-in-userc-sanity.patch
swsusp-change-pm_ops-handling-by-userland-interface.patch

I have no problems whatsoever with these patches on SMP boxes and if anyone
has, please let me know.

Greetings,
Rafael


-- 
If you don't have the time to read,
you don't have the time or the tools to write.
		- Stephen King
-
To: Rafael J. Wysocki <rjw@...>
Cc: Len Brown <lenb@...>, Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Tuesday, January 30, 2007 - 5:28 pm

Hi.


I've been running an SMP box here with the matching changes for
Suspend2, with no problems. I believe the algorithm looks good.

Regards,

Nigel

-
To: Linus Torvalds <torvalds@...>
Cc: Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 8:16 pm

On Mon, 29 Jan 2007 16:12:27 -0800 (PST)


On one and only one platform. It works fine on others. Don't blame the
driver, stop it in PCI.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-
To: Stephen Hemminger <shemminger@...>
Cc: Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 8:25 pm

How sure are you that it's only those Sony laptops?

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Tuesday, January 30, 2007 - 2:54 am

i'm wondering, could we go with Thomas' temporary patch that disables 
sky2 MSI if CONFIG_PM is enabled - we could revert that after 2.6.20. 
It's not like MSI is a life and death feature. On IO-APIC systems 
vectors are abundant and in any case we share irqs just fine. The true 
advantage of MSI is minimal. (MSI-X has the potential to be better by 
being message based, but in reality it still goes through the full IRQ 
layer.) MSI might be useful on really, really large systems - but i 
really hope those really large systems dont rely on CONFIG_PM. Meanwhile 
Thomas' patch maximizes the amount of working hardware (it has the 
chance to produce working systems in 100% of the cases) - which is a few 
orders of magnitude more important than IRQ management micro-costs. Am i 
missing anything?

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, January 30, 2007 - 3:39 am

Sharing irqs /sucks/.  I routinely have to fight a USB device dying, 
because the ATA device is causing an interrupt storm, or vice versa. 
/Very/ common headache.

Other than that, they use a tiny bit fewer CPU cycles, and allow 
simplification of the interrupt handler (saving another few CPU cycles).

The biggest benefit is (a) for hardware designers, where MSI means a 
cleaner h/w design, and (b) preparation of drivers and the kernel 
systems for MSI-only hardware.

At present only high end hardware is MSI-only (like infiniband), but 
that's the future direction.

	Jeff


-
To: Jeff Garzik <jeff@...>, Alan <alan@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 1, 2007 - 2:15 am

Hi,
   TEST_UNIT_READY in get_capabilities (drivers/scsi/sr.c line 743, or
see below) always returns error.

  ---------------- code begin -----------------------------
    retries = 0;
    do {
        memset((void *)cmd, 0, MAX_COMMAND_SIZE);
        cmd[0] = TEST_UNIT_READY;

        the_result = scsi_execute_req (cd->device, cmd, DMA_NONE, NULL,
                           0, &sshdr, SR_TIMEOUT,
                           MAX_RETRIES);

        retries++;
    } while (retries < 5 &&
         (!scsi_status_is_good(the_result) ||
          (scsi_sense_valid(&sshdr) &&
           sshdr.sense_key == UNIT_ATTENTION)));
  ---------------- code end -----------------------------

  I debugged all kernel versions from 2.6.17 to 2.6.20 on several AMD
and other vendors' PATA/IDE controllers, and I get the_result==0x8000002
and retries==5; on Silicon Image 3132, I get the_result=0x2eb.
  Does 0x8000002 mean ((DRIVER_SENSE << 24) | SAM_STAT_CHECK_CONDITION)?
What's wrong?
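
As a minimal sketch of how such a result word splits apart, assuming the layout used by the SCSI midlayer of this era (status in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31): the struct and function names below are hypothetical, and the two constants only mirror the values of `DRIVER_SENSE` (0x08) and `SAM_STAT_CHECK_CONDITION` (0x02).

```c
#include <stdint.h>

/* Values mirror the kernel's DRIVER_SENSE and SAM_STAT_CHECK_CONDITION. */
#define EX_DRIVER_SENSE             0x08
#define EX_SAM_STAT_CHECK_CONDITION 0x02

/* Hypothetical helper type: the four bytes packed into a SCSI result. */
struct scsi_result_bytes {
    uint8_t status;  /* bits 0-7:   SAM status returned by the device */
    uint8_t msg;     /* bits 8-15:  message byte */
    uint8_t host;    /* bits 16-23: host adapter code */
    uint8_t driver;  /* bits 24-31: driver code */
};

/* Split a 32-bit result word into its four component bytes. */
static struct scsi_result_bytes decode_scsi_result(uint32_t result)
{
    struct scsi_result_bytes b;

    b.status = (uint8_t)(result & 0xff);
    b.msg    = (uint8_t)((result >> 8) & 0xff);
    b.host   = (uint8_t)((result >> 16) & 0xff);
    b.driver = (uint8_t)((result >> 24) & 0xff);
    return b;
}
```

Under that assumed layout, 0x8000002 is 0x08 in the driver byte and 0x02 in the status byte, with the host and message bytes both zero, which matches the decomposition proposed in the question.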


Conke 



-
To: <conke.hu@...>
Cc: Alan <alan@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, February 7, 2007 - 8:40 am

What does the sense data returned in the sense buffer say is wrong?

	Jeff



-
To: Jeff Garzik <jeff@...>
Cc: Alan <alan@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 2, 2007 - 1:48 am

I dump scsi_sense_hdr as follows:
sshdr.response_code = 0x70
sshdr.sense_key = 0x2
sshdr.asc = 0x3a
sshdr.ascq = 0x1
sshdr.additional_length = 0x0

the sense_key is 0x2 (NOT_READY), not the expected UNIT_ATTENTION :(

BTW, I am sorry for a mistake: Sil3132 also returns 0x8000002, not 0x2eb
as I said in the first mail. In short, all cases return "the_result" as
0x8000002.
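
A small model of the quoted do/while condition shows why `retries` ends at 5 whenever every attempt comes back with CHECK CONDITION, whatever the sense key is. This is only a sketch: the function names are hypothetical, the outcome of each attempt is assumed to be identical, and the sense key values (0x02 for NOT READY, 0x06 for UNIT ATTENTION) are the standard SCSI ones.

```c
#include <stdbool.h>

#define EX_NOT_READY      0x02
#define EX_UNIT_ATTENTION 0x06

/* Mirrors the loop condition quoted from sr.c: keep retrying while the
 * status is not good, or while valid sense data reports UNIT ATTENTION. */
static bool should_retry(int retries, bool status_good,
                         bool sense_valid, int sense_key)
{
    return retries < 5 &&
           (!status_good ||
            (sense_valid && sense_key == EX_UNIT_ATTENTION));
}

/* Counts how often the command is issued when every attempt returns the
 * same (assumed) outcome. */
static int count_attempts(bool status_good, bool sense_valid, int sense_key)
{
    int retries = 0;

    do {
        retries++;
    } while (should_retry(retries, status_good, sense_valid, sense_key));
    return retries;
}
```

With a bad status and a NOT READY sense key, `count_attempts(false, true, EX_NOT_READY)` runs all five attempts, because the `!status_good` term alone keeps the loop going. ASC 0x3a is the standard code for "medium not present", so this pattern is at least consistent with a drive that has no disc loaded; the thread does not establish whether that was the case here.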

Conke



-
To: <conke.hu@...>
Cc: Jeff Garzik <jeff@...>, Alan <alan@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, February 13, 2007 - 3:30 am

the bytes 0 ~ 13 in sense buffer are:
70 00 02 00 00 00 00 0a 00 00 00 00 3a
other bytes are all 0x00;
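
For readers decoding the dump by hand, here is a minimal sketch of a fixed-format sense parser. It assumes the standard fixed-format layout (response code in byte 0, sense key in the low nibble of byte 2, additional length in byte 7, ASC in byte 12, ASCQ in byte 13); the struct and function names are hypothetical and this is not the kernel's own parser.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the fields of interest. */
struct fixed_sense {
    uint8_t response_code;
    uint8_t sense_key;
    uint8_t additional_length;
    uint8_t asc;
    uint8_t ascq;
};

/* Returns 0 on success, -1 if the buffer is too short to hold ASC/ASCQ
 * or is not in fixed format (response code 0x70 or 0x71). */
static int parse_fixed_sense(const uint8_t *buf, size_t len,
                             struct fixed_sense *out)
{
    uint8_t code;

    if (len < 14)
        return -1;
    code = buf[0] & 0x7f;            /* bit 7 is the VALID flag */
    if (code != 0x70 && code != 0x71)
        return -1;

    out->response_code     = code;
    out->sense_key         = buf[2] & 0x0f;
    out->additional_length = buf[7];
    out->asc               = buf[12];
    out->ascq              = buf[13];
    return 0;
}
```

Applied to the bytes above, this yields sense key 0x02 and ASC 0x3a. Byte 13 is not shown in the dump, so the ASCQ depends on whether it is taken as 0x00 (as "other bytes are all 0x00" suggests) or as the 0x1 reported in the earlier mail.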

in fact this issue can be reproduced in any libata driver, either sata or pata.

Conke
-
To: Jeff Garzik <jeff@...>, Alan <alan@...>, Tejun Heo <htejun@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 15, 2007 - 2:30 am

[resend]
any suggestion ?
-
To: Jeff Garzik <jeff@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, January 30, 2007 - 4:03 am

btw., MSI is not really needed to avoid the sharing of irqs: x86 has 224 
IRQ vectors which is abundant for all but the largest boxes. Even the 
smallest laptop tends to have an IO-APIC with at least 24 pins - which 
is enough to never have to share irqs. How system designers can still 
end up with mapping so many devices to the same pin is really their 
fault.

so MSI's only true accomplishment AFAICS is that it now says on the 
hardware level that "you must not share IRQs". Well, doh...

	Ingo
-
To: Jeff Garzik <jeff@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, January 30, 2007 - 3:53 am

Yeah. Admittedly, ATA is very special because it is still edge-triggered 
most of the time (for legacy reasons):

 14:     389907          0   IO-APIC-edge      ide0

so if it shares an irq with a device that has level-triggered 
assumptions, those two dont intermix very well. That's why i have the 
delayed-disable patches (see the two patches below), which will unify 
the two methods, and the irq flow handling method will be mostly a 
'performance hint' not a correctness issue. This has been in -rt for 
quite a few weeks now and it works well.

btw., it would be great if you could help us here: could you perhaps, 
from a past example, outline a specific case of such an ATA/USB IRQ 
storm and how it occurred (precisely) - and what the fix was? I'd like to 
analyze a specific case to make sure the genirq layer recovers from such 
cases more gracefully. In general, i think the IRQ subsystem needs to 
become more failure-resilient and needs to become more auto-learning 
(and these two dont stand in the way of good performance). This problem 
of shared IRQs will be with us for at least another 10 years, if not 
more. (for example ISA is /still/ not dead everywhere and it was already 
legacy technology 15 years ago when Linux was started.)

	Ingo

------------------->
Subject: irq: do not mask interrupts by default
From: Ingo Molnar <mingo@elte.hu>

never mask interrupts immediately upon request. Disabling interrupts in 
high-performance codepaths is rare, and on the other hand this change 
could recover lost edges (or even other types of lost interrupts) by 
conservatively only masking interrupts after they happen. (NOTE: with 
this change the highlevel irq-disable code still soft-disables this IRQ 
line - and if such an interrupt happens then the IRQ flow handler keeps 
the IRQ masked.)

mark i8259A controllers as 'never loses an edge'.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/i386/kernel/i8259.c   |    1 +
 arch/x86_64/kernel/i8259.c |    1 +
...
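
The behaviour described in the patch text can be modeled in a few lines. This is a sketch under stated assumptions, not the genirq code: all names are hypothetical, and the replay of a pending interrupt on enable is an assumption added for illustration rather than something the quoted description states.

```c
#include <stdbool.h>

/* Hypothetical model of one interrupt line. */
struct ex_irq_line {
    bool soft_disabled; /* software asked for the line to be disabled */
    bool hw_masked;     /* the line is actually masked in hardware */
    bool pending;       /* an interrupt arrived while soft-disabled */
    int  handled;       /* number of times the handler ran */
};

/* Disabling only sets a flag; the hardware is left unmasked. */
static void ex_disable_irq(struct ex_irq_line *l)
{
    l->soft_disabled = true;
}

/* Called when the device raises the line. */
static void ex_irq_arrives(struct ex_irq_line *l)
{
    if (l->hw_masked)
        return;                 /* a masked line delivers nothing */
    if (l->soft_disabled) {
        l->pending = true;      /* remember the event ... */
        l->hw_masked = true;    /* ... and only now mask the line */
        return;
    }
    l->handled++;
}

/* Re-enabling unmasks and (by assumption) replays a remembered event. */
static void ex_enable_irq(struct ex_irq_line *l)
{
    l->soft_disabled = false;
    l->hw_masked = false;
    if (l->pending) {
        l->pending = false;
        l->handled++;
    }
}
```

In this model an interrupt that fires during a disabled window is recorded rather than dropped, which is the sense in which lazy masking could recover lost edges.
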
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, January 30, 2007 - 4:02 am

Easy to name an example, as they are pretty generic.  When sharing irqs 
-- usually ATA is configured to PCI native (IO-APIC-fasteoi) -- any 
interrupt storm causes the other devices sharing that irq to crap 
themselves (kernel turns off irq, suggests irqpoll, etc.)
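
The "kernel turns off irq" step can be sketched as a simple counter. This is a simplified model with hypothetical names; the window of 100,000 interrupts and the limit of 99,900 unhandled ones are quoted from memory of the kernel's spurious-interrupt check and should be treated as assumptions.

```c
#include <stdbool.h>

#define EX_WINDOW          100000u /* interrupts per evaluation window (assumed) */
#define EX_UNHANDLED_LIMIT 99900u  /* unhandled count that trips the check (assumed) */

/* Hypothetical per-line statistics. */
struct ex_irq_stats {
    unsigned int count;     /* interrupts seen in the current window */
    unsigned int unhandled; /* of those, how many no handler claimed */
    bool disabled;          /* the whole line has been shut off */
};

/* Record one interrupt; `handled` says whether any handler claimed it. */
static void ex_note_interrupt(struct ex_irq_stats *s, bool handled)
{
    if (s->disabled)
        return;
    if (!handled)
        s->unhandled++;
    s->count++;
    if (s->count < EX_WINDOW)
        return;

    /* End of window: shut the line off if almost nothing was claimed. */
    s->count = 0;
    if (s->unhandled > EX_UNHANDLED_LIMIT)
        s->disabled = true;
    s->unhandled = 0;
}
```

The point for shared lines is that the decision is per IRQ line, not per device: once the line is disabled, every device sharing it stops receiving interrupts, including the ones that were behaving.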

It is unfortunately easier to cause interrupt storms with ATA than with 
most hardware, because the standard PCI IDE definition has __no__ possible 
way to indicate certain interrupt conditions are pending.  You have to 
/know/ that you are expecting an interrupt, which causes problems if the 
hardware decides to send the interrupt early or late, rather than when 
it's expected.  Most modern hardware has a read/write/clear interrupt 
status register that gives you an immediate summary of the pending 
interrupt conditions, and an easy way to ack the pending events.  ATA 
does not have any such capability.

That said, stuff like AHCI or sata_sil or sata_sil24 do have modern 
designs with the expected interrupt status register(s), so they do not 
suffer from the problems suffered by the more legacy-like hardware 
(ata_piix, sata_via, pata_*)
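
The difference between the two hardware styles can be sketched as two handlers. This is an illustrative model only: the device structs, field names and return codes are hypothetical, and the legacy handler is reduced to the single decision "was I expecting an interrupt?".

```c
#include <stdbool.h>
#include <stdint.h>

enum ex_irqreturn { EX_IRQ_NONE = 0, EX_IRQ_HANDLED = 1 };

/* Modern style: a status register summarizes the pending events. */
struct ex_modern_dev {
    uint32_t irq_status; /* one bit per pending event */
};

static enum ex_irqreturn ex_modern_handler(struct ex_modern_dev *d)
{
    uint32_t pending = d->irq_status;

    if (pending == 0)
        return EX_IRQ_NONE;     /* not ours; safe on a shared line */
    d->irq_status &= ~pending;  /* models acking the pending events */
    return EX_IRQ_HANDLED;
}

/* Legacy style: no summary register, only driver-side expectation. */
struct ex_legacy_dev {
    bool line_asserted;     /* device is driving its interrupt line */
    bool command_in_flight; /* driver believes an interrupt is due */
};

static enum ex_irqreturn ex_legacy_handler(struct ex_legacy_dev *d)
{
    if (!d->command_in_flight)
        return EX_IRQ_NONE;     /* unexpected: line stays asserted */
    d->command_in_flight = false;
    d->line_asserted = false;   /* models the status read that acks */
    return EX_IRQ_HANDLED;
}
```

When the legacy device raises its line early or late, the handler declines the interrupt and nothing clears it, so on a level-triggered shared line the interrupt keeps firing for every device on that line.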

	Jeff


-
To: Jeff Garzik <jeff@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, January 30, 2007 - 4:08 am

ok. Can you suggest any way for me to reproduce such a bug artificially 
on a test system? [i have both old and new systems, so if you can think 
of a way for me to trigger this i'd be happy to try]

I /think/ my two patches should automatically avoid the 'cap themselves' 
effect you outlined: the absolutely worst case should be that we'll have 
twice the IRQ rate of the optimal one - but no irq storm nor lost 
interrupts should happen due to irq trigger type mismatches, ever - as 
long as the basic mapping of device to IRQ is correct. [ I tried to push 
to include this in v2.6.20 but i lost that argument ;-) ]

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, January 31, 2007 - 11:27 am

Should be pretty easy.  With either the old-IDE driver or libata, 
complete a command without acknowledging an interrupt.  For libata, that 
means poking around in ata_host_intr() and avoiding well-built hardware 
like AHCI.  Anything that uses ata_piix driver, basically all Intel 
machines, should be applicable in the "not well built" category... :)

	Jeff



-
To: Jeff Garzik <jeff@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, January 31, 2007 - 1:38 pm

ok, here's one victi^H^H^H^H testbox that seems to match your 
description:

 18:          3          0   IO-APIC-fasteoi   uhci_hcd:usb3, ohci1394
 19:    2413090          0   IO-APIC-fasteoi   uhci_hcd:usb2, libata
 22:        168          0   IO-APIC-fasteoi   HDA Intel
 23:          0          0   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb5

so i should try to generate some missing ACK [this meaning a missing 
driver-level ack, right?] on IRQ#19's libata handler - and i should 
expect a screaming interrupt? Or non-working USB? Or both?

[ i can hunt for other hardware if this doesnt look broken enough to you
  :-) ]

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, January 31, 2007 - 1:52 pm

Yep, that's a good candidate for such experiments :)

	Jeff



-
To: Jeff Garzik <jeff@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, January 31, 2007 - 4:13 pm

Happens to be the same thing, which causes a stale interrupt on the
second suspend/resume cycle.

	tglx


-
To: Jeff Garzik <jeff@...>
Cc: Linus Torvalds <torvalds@...>, Stephen Hemminger <shemminger@...>, Thomas Gleixner <tglx@...>, Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, January 30, 2007 - 4:13 am

-
To: Linus Torvalds <torvalds@...>
Cc: Thomas Gleixner <tglx@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 8:26 pm

On Mon, 29 Jan 2007 16:25:48 -0800 (PST)

I do not underestimate the ability of BIOS writers to
screw things up.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-
To: <tglx@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:23 pm

On Mon, 29 Jan 2007 23:23:21 +0100

But the fix is necessary on laptops where ACPI messes with MSI/INTx
on resume.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-
To: Stephen Hemminger <shemminger@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Monday, January 29, 2007 - 6:31 pm

And the fix is unnecessary and counter productive on laptops, where ACPI
does the right thing.

	tglx


-
To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Ingo Molnar <mingo@...>, Arjan van de Ven <arjan@...>
Date: Saturday, January 27, 2007 - 4:47 pm

2.6.20-rc6-git (today) on a dual core laptop:

PM: Preparing system for mem sleep
Disabling non-boot CPUs ...

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.20-rc6 #3
-------------------------------------------------------
pm-suspend/3601 is trying to acquire lock:
 (cpu_bitmask_lock){--..}, at: [<c032cd2b>] mutex_lock+0x1c/0x1f

but task is already holding lock:
 (workqueue_mutex){--..}, at: [<c032cd2b>] mutex_lock+0x1c/0x1f

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (workqueue_mutex){--..}:
       [<c0140880>] __lock_acquire+0x8dd/0xa04
       [<c0140c90>] lock_acquire+0x56/0x6f
       [<c032cb80>] __mutex_lock_slowpath+0xe5/0x274
       [<c032cd2b>] mutex_lock+0x1c/0x1f
       [<c0136d14>] __create_workqueue+0x61/0x136
       [<f8bfe62e>] cpufreq_governor_dbs+0xa1/0x30e [cpufreq_ondemand]
       [<c02b2c3c>] __cpufreq_governor+0x9e/0xd2
       [<c02b2df7>] __cpufreq_set_policy+0x187/0x209
       [<c02b3056>] store_scaling_governor+0x164/0x1b1
       [<c02b24f9>] store+0x37/0x48
       [<c01aeb8d>] sysfs_write_file+0xb3/0xdb
       [<c0175e0f>] vfs_write+0xaf/0x163
       [<c017645d>] sys_write+0x3d/0x61
       [<c0103f8c>] sysenter_past_esp+0x5d/0x99
       [<ffffffff>] 0xffffffff

-> #2 (dbs_mutex){--..}:
       [<c0140880>] __lock_acquire+0x8dd/0xa04
       [<c0140c90>] lock_acquire+0x56/0x6f
       [<c032cb80>] __mutex_lock_slowpath+0xe5/0x274
       [<c032cd2b>] mutex_lock+0x1c/0x1f
       [<f8bfe612>] cpufreq_governor_dbs+0x85/0x30e [cpufreq_ondemand]
       [<c02b2c3c>] __cpufreq_governor+0x9e/0xd2
       [<c02b2df7>] __cpufreq_set_policy+0x187/0x209
       [<c02b3056>] store_scaling_governor+0x164/0x1b1
       [<c02b24f9>] store+0x37/0x48
       [<c01aeb8d>] sy...
To: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Jan Altenberg <jan@...>, Ralf Baechle <ralf@...>, <linux-mips@...>, Andrew Clayton <andrew@...>, Trond Myklebust <trond.myklebust@...>
Date: Saturday, January 27, 2007 - 1:44 pm

This email lists some known regressions in 2.6.20-rc6 compared to 2.6.19
with patches available.

If you find your name in the Cc header, you are either the submitter of one
of the bugs, the maintainer of an affected subsystem or driver, a patch
of yours caused a breakage, or I'm considering you in some other way possibly
involved with one or more of these issues.

Due to the huge number of recipients, please trim the Cc when answering.


Subject    : MIPS Malta: CONFIG_MTD=n compile error
References : http://lkml.org/lkml/2007/1/25/122
Submitter  : Jan Altenberg <jan@linutronix.de>
Caused-By  : Ralf Baechle <ralf@linux-mips.org>
             commit b228f4c54df37b53c6f364aa7f3efa4280bcc4f0
Handled-By : Jan Altenberg <jan@linutronix.de>
Patch      : http://lkml.org/lkml/2007/1/25/122
Status     : patch available


Subject    : NFS triggers WARN_ON() in invalidate_inode_pages2_range()
References : http://bugzilla.kernel.org/show_bug.cgi?id=7826
Submitter  : Andrew Clayton <andrew@digital-domain.net>
Caused-By  : Andrew Morton <akpm@osdl.org>
             commit 8258d4a574d3a8c01f0ef68aa26b969398a0e140
Handled-By : Trond Myklebust <trond.myklebust@fys.uio.no>
Patch      : http://lkml.org/lkml/2007/1/24/323
Status     : patch available



-
To: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Uwe Bugla <uwe.bugla@...>, <B.Zolnierkiewicz@...>, <linux-ide@...>, <alan@...>, Gerhard Dirschl <gd@...>, Christoph Hellwig <hch@...>, <petero2@...>, Livio Soares <livio@...>, Paul Mackerras <paulus@...>, <anton@...>, <linuxppc-dev@...>, Cijoml Cijomlovic Cijomlov <cijoml@...>, Nick Piggin <nickpiggin@...>, <ttb@...>, <rml@...>
Date: Saturday, January 27, 2007 - 1:42 pm

This email lists some known regressions in 2.6.20-rc6 compared to 2.6.19
that are not yet fixed in Linus' tree.

If you find your name in the Cc header, you are either the submitter of one
of the bugs, the maintainer of an affected subsystem or driver, a patch
of yours caused a breakage, or I'm considering you in some other way possibly
involved with one or more of these issues.

Due to the huge number of recipients, please trim the Cc when answering.


Subject    : problems with CD burning
References : http://www.spinics.net/lists/linux-ide/msg06545.html
Submitter  : Uwe Bugla <uwe.bugla@gmx.de>
Status     : unknown


Subject    : pktcdvd fails with pata_amd
References : http://bugzilla.kernel.org/show_bug.cgi?id=7810
             http://lkml.org/lkml/2007/1/25/128
Submitter  : Gerhard Dirschl <gd@spherenet.de>
Caused-By  : Christoph Hellwig <hch@lst.de>
             commit 3b00315799d78f76531b71435fbc2643cd71ae4c
             commit 406c9b605cbc45151c03ac9a3f95e9acf050808c
Status     : problem is being debugged


Subject    : powerpc64: performance monitor exception
References : http://ozlabs.org/pipermail/linuxppc-dev/2007-January/030045.html
Submitter  : Livio Soares <livio@eecg.toronto.edu>
Caused-By  : Paul Mackerras <paulus@samba.org>
             commit d04c56f73c30a5e593202ecfcf25ed43d42363a2
Status     : problem is being discussed


Subject    : BUG: at fs/inotify.c:172 set_dentry_child_flags()
References : http://bugzilla.kernel.org/show_bug.cgi?id=7785
Submitter  : Cijoml Cijomlovic Cijomlov <cijoml@volny.cz>
Handled-By : Nick Piggin <nickpiggin@yahoo.com.au>
Status     : problem is being debugged


-
To: Adrian Bunk <bunk@...>, <akpm@...>, <torvalds@...>
Cc: <rml@...>, <ttb@...>, <nickpiggin@...>, <cijoml@...>, <linuxppc-dev@...>, <anton@...>, <paulus@...>, <livio@...>, <petero2@...>, <hch@...>, <gd@...>, <alan@...>, <linux-ide@...>, <B.Zolnierkiewicz@...>, <linux-kernel@...>
Date: Sunday, January 28, 2007 - 9:33 am

-------- Original Message --------
Date: Sat, 27 Jan 2007 18:42:30 +0100
From: Adrian Bunk <bunk@stusta.de>
To: Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@osdl.org>

Hi everybody,
the problem I already reported for earlier release candidates of kernel 2.6.20
(rc1 to rc5) unfortunately still persists.

The regression has become more extreme. In earlier release candidates, nerolinux recognized my burning devices at least on the first start, and then never again on any following start. In rc6 the situation is different:

The CD and DVD burning devices aren't recognized even once, and the drive seek errors I already reported are still there.

nerolinux runs excellently with kernel 2.6.19.2, but only shows an “image recorder” (i.e. no burning device at all) in kernel 2.6.20-rc6.

I still hope that this terrible bug will not be part of the final version of 2.6.20!

Regards

Uwe

P. S.: I already reported that 2.6.20-rc4-mm1 is not bootable at all.

-
To: Uwe Bugla <uwe.bugla@...>
Cc: Adrian Bunk <bunk@...>, <akpm@...>, <torvalds@...>, <rml@...>, <ttb@...>, <nickpiggin@...>, <cijoml@...>, <linuxppc-dev@...>, <anton@...>, <paulus@...>, <livio@...>, <petero2@...>, <hch@...>, <gd@...>, <alan@...>, <linux-ide@...>, <B.Zolnierkiewicz@...>, <linux-kernel@...>
Date: Monday, January 29, 2007 - 2:26 am

FWIW, I just tried it with 2.6.20-rc6, and can confirm.  Once nero is
run, the kernel never gives up retrying whatever command failed, so I
get...

[ 4362.972995] hdd: status error: status=0x58 { DriveReady SeekComplete
DataRequest }
[ 4362.981475] ide: failed opcode was: unknown
[ 4362.986183] hdd: drive not ready for command

endlessly.

	-Mike

-
To: Mike Galbraith <efault@...>
Cc: Uwe Bugla <uwe.bugla@...>, Adrian Bunk <bunk@...>, <torvalds@...>, <rml@...>, <ttb@...>, <nickpiggin@...>, <cijoml@...>, <linuxppc-dev@...>, <anton@...>, <paulus@...>, <livio@...>, <petero2@...>, <hch@...>, <gd@...>, <alan@...>, <linux-ide@...>, <B.Zolnierkiewicz@...>, <linux-kernel@...>
Date: Monday, January 29, 2007 - 2:48 am

On Mon, 29 Jan 2007 07:26:03 +0100

Do you have time to bisect it?
-
To: Andrew Morton <akpm@...>
Cc: Uwe Bugla <uwe.bugla@...>, Adrian Bunk <bunk@...>, <torvalds@...>, <rml@...>, <ttb@...>, <nickpiggin@...>, <cijoml@...>, <linuxppc-dev@...>, <anton@...>, <paulus@...>, <livio@...>, <petero2@...>, <hch@...>, <gd@...>, <alan@...>, <linux-ide@...>, <B.Zolnierkiewicz@...>, <linux-kernel@...>
Date: Monday, January 29, 2007 - 3:08 am

Unfortunately, I'm git impaired.  I am rummaging as we speak though.

	-Mike

-
To: Mike Galbraith <efault@...>
Cc: Andrew Morton <akpm@...>, Uwe Bugla <uwe.bugla@...>, Adrian Bunk <bunk@...>, <rml@...>, <ttb@...>, <nickpiggin@...>, <cijoml@...>, <linuxppc-dev@...>, <anton@...>, <paulus@...>, <livio@...>, <petero2@...>, <hch@...>, <gd@...>, <alan@...>, <linux-ide@...>, <B.Zolnierkiewicz@...>, <linux-kernel@...>
Date: Monday, January 29, 2007 - 3:13 am

Ok, I'm personally heading to bed, but it really should be as simple as

 - get the git tree in the first place
 - do

	git bisect good v2.6.19
	git bisect bad v2.6.20-rc2
	.. it will pick a point for you to try ..
	.. compile, boot, test ..

	"git bisect {good|bad}" depending on results

 - until (found)

(Of course, you should check that -rc2 really is bad to make sure. I think 
that's what Uwe reported, though. And I don't think we've done anything 
after -rc2 that could impact this, so I don't doubt it).
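
The procedure above is a binary search over history. A minimal sketch of the idea, assuming a strictly linear history where commits are numbered oldest to newest and every commit after the first bad one is also bad (real git history is a graph, and `git bisect` handles that; this model does not). All names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical test callback: build, boot and test the given commit,
 * then report whether it shows the regression. */
typedef bool (*ex_is_bad_fn)(int commit, void *ctx);

/* `good` is a commit known to work, `bad` one known to fail, good < bad.
 * Returns the index of the first bad commit; if `steps` is non-NULL it
 * receives the number of commits that had to be tested. */
static int ex_bisect(int good, int bad, ex_is_bad_fn is_bad, void *ctx,
                     int *steps)
{
    int tested = 0;

    while (bad - good > 1) {
        int mid = good + (bad - good) / 2; /* the point picked to try */

        tested++;
        if (is_bad(mid, ctx))
            bad = mid;   /* corresponds to "git bisect bad" */
        else
            good = mid;  /* corresponds to "git bisect good" */
    }
    if (steps)
        *steps = tested;
    return bad;
}

/* Example predicate: every commit from index *ctx onward is broken. */
static bool ex_bad_from(int commit, void *ctx)
{
    return commit >= *(const int *)ctx;
}
```

Each test halves the remaining range, so a window of about a thousand commits needs roughly ten compile, boot and test cycles.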

		Linus
-
To: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Andrew Vasquez <andrew.vasquez@...>, Jens Axboe <jens.axboe@...>, Berthold Cogel <cogel@...>, François <francois.valenduc@...>, Alan Stern <stern@...>, <greg@...>, <linux-usb-devel@...>, Prakash Punnoor <prakash@...>, Oliver Neukum <oliver@...>, Lennart Sorensen <lsorense@...>, takada <takada@...>, Jordan Crouse <jordan.crouse@...>
Date: Saturday, January 27, 2007 - 1:32 pm

This email lists some known regressions in 2.6.20-rc6 compared to 2.6.19
that are not yet fixed in Linus' tree.

If you find your name in the Cc header, you are either the submitter of one
of the bugs, the maintainer of an affected subsystem or driver, a patch
of yours caused a breakage, or I'm considering you in some other way possibly
involved with one or more of these issues.

Due to the huge number of recipients, please trim the Cc when answering.


Subject    : NULL pointer dereference at as_move_to_dispatch()
References : http://lkml.org/lkml/2007/1/22/141
Submitter  : Andrew Vasquez <andrew.vasquez@qlogic.com>
Status     : unknown


Subject    : reboot instead of powerdown  (CONFIG_USB_SUSPEND)
References : http://lkml.org/lkml/2006/12/25/40
             http://bugzilla.kernel.org/show_bug.cgi?id=7828
Submitter  : Berthold Cogel <cogel@rrz.uni-koeln.de>
             François Valenduc <francois.valenduc@skynet.be>
Handled-By : Alan Stern <stern@rowland.harvard.edu>
Status     : problem is being debugged


Subject    : usb somehow broken  (CONFIG_USB_SUSPEND)
References : http://lkml.org/lkml/2007/1/11/146
Submitter  : Prakash Punnoor <prakash@punnoor.de>
Handled-By : Oliver Neukum <oliver@neukum.org>
             Alan Stern <stern@rowland.harvard.edu>
Status     : problem is being debugged


Subject    : fix geode_configure()
References : http://lkml.org/lkml/2007/1/9/216
Submitter  : Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Caused-By  : takada <takada@mbf.nifty.com>
             commit e4f0ae0ea63caceff37a13f281a72652b7ea71ba
Handled-By : takada <takada@mbf.nifty.com>
             Lennart Sorensen <lsorense@csclub.uwaterloo.ca>
Status     : patches are being discussed

-
To: lkml <linux-kernel@...>
Cc: Linus Torvalds <torvalds@...>
Date: Thursday, January 25, 2007 - 6:09 am

It booted nicely; I have really enjoyed this.

I have one question which is still open (it seems to have been ignored or missed by you guys).

migration_cost=33 for 2.6.20-rc5
migration_cost=159 for 2.6.20-rc6

~Akula2
-
To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Venkat Yekkirala <vyekkirala@...>, David Miller <davem@...>
Date: Thursday, January 25, 2007 - 5:05 pm

Hi,


It doesn't build for me.

make O=/dir
[..]
security/built-in.o: In function `security_set_bools':
(.text+0x12471): undefined reference to `flow_cache_genid'
security/built-in.o: In function `security_load_policy':
(.text+0x128b3): undefined reference to `flow_cache_genid'
make[1]: *** [.tmp_vmlinux1] Error 1
make: *** [_all] Error 2

334c85569b8adeaa820c0f2fab3c8f0a9dc8b92e is first bad commit
commit 334c85569b8adeaa820c0f2fab3c8f0a9dc8b92e
Author: Venkat Yekkirala <vyekkirala@TrustedCS.com>
Date:   Mon Jan 15 16:38:45 2007 -0800

    [SELINUX]: increment flow cache genid

    Currently, old flow cache entries remain valid even after
    a reload of SELinux policy.

    This patch increments the flow cache generation id
    on policy (re)loads so that flow cache entries are
    revalidated as needed.

    Thanks to Herbert Xu for pointing this out. See:
    http://marc.theaimsgroup.com/?l=linux-netdev&m=116841378704536&w=2

    There's also a general issue as well as a solution proposed
    by David Miller for when flow_cache_genid wraps. I might be
    submitting a separate patch for that later.

    I request that this be applied to 2.6.20 since it's
    a security relevant fix.

    Signed-off-by: Venkat Yekkirala <vyekkirala@TrustedCS.com>

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/)
-