What should PCI core do during suspend-resume? (was: Re: 2.6.29-rc3: tg3 dead after resume)

Previous thread: Re: [PANIC] lro + iscsi or lro + skb text search causes panic by Herbert Xu on Wednesday, January 28, 2009 - 2:25 pm. (2 messages)

Next thread: [PATCH 3/5] ebtables: remove unneeded initializations by Stephen Hemminger on Wednesday, January 28, 2009 - 11:25 pm. (1 message)
From: Parag Warudkar
Date: Wednesday, January 28, 2009 - 5:14 pm

[Empty message]
From: Linus Torvalds
Date: Wednesday, January 28, 2009 - 6:09 pm

In fact, the new PCI suspend/restore code should have made that 
unnecessary, since the PCI layer now makes sure that a save/restore is 
done even if the driver hadn't done it.

But at the same time, still having the driver do it certainly shouldn't 
have _hurt_ anything either. But it's quite possible that the tg3 thing is 
very sensitive to the exact order things happen in - there's a lot of 

That seems to imply that even the reset failed, which is interesting. 

But it also possibly means that the problem is not necessarily the driver 
itself, but some cached state that we keep around in "struct pci_dev" even 
across a module load/unload. 

For example, if we get the "dev->current_state" cache wrong, then we may 
not actually end up changing it when we should, because we think we 
already match the target state. I don't _think_ that is it, but that's the 
kind of thing that could happen.

Can you do a

	lspci -vvxxx -s [tg3-device]

before-and-after suspend? Is there some state that looks like it got 
corrupted?

			Linus
--

From: Parag Warudkar
Date: Wednesday, January 28, 2009 - 6:49 pm

Sure, diff -u below. There are differences but not sure if they are 
abnormal or expected.

Also, BTW, reverting the only tg3 specific commit - 
commit 9e9fd12dc0679643c191fc9795a3021807e77de4
Author: Matt Carlson <mcarlson@broadcom.com>
Date:   Mon Jan 19 16:57:45 2009 -0800

    tg3: Fix firmware loading

did not help.

parag@parag-desktop:~$ diff -u lspci-pre-suspend lspci-post-suspend
--- lspci-pre-suspend   2009-01-28 20:35:37.070584068 -0500
+++ lspci-post-suspend  2009-01-28 20:36:56.922471408 -0500
@@ -12,7 +12,7 @@
        Capabilities: [50] Vital Product Data <?>
        Capabilities: [58] Vendor Specific Information <?>
        Capabilities: [e8] Message Signalled Interrupts: Mask- 64bit+ 
Queue=0/0 Enable+
-               Address: 00000000fee0f00c  Data: 41c9
+               Address: 00000000fee0f00c  Data: 41d1
        Capabilities: [d0] Express (v1) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
<4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
@@ -36,15 +36,15 @@
 20: 00 00 00 00 00 00 00 00 00 00 00 00 3c 10 07 13
 30: 00 00 04 20 48 00 00 00 00 00 00 00 03 01 00 00
 40: 00 00 00 00 00 00 00 00 01 50 03 c0 08 20 00 64
-50: 03 58 fc 00 00 00 00 78 09 e8 78 00 7d c9 08 78
-60: 00 00 00 00 00 00 00 00 98 02 02 a0 00 00 18 76
-70: f2 10 00 00 c0 00 00 00 2c 00 00 00 00 00 00 00
-80: 3c 10 07 13 00 00 00 00 34 00 13 04 82 70 08 fc
-90: 19 be 00 01 00 00 00 b7 00 00 00 00 14 00 00 00
-a0: 00 00 00 00 4c 01 00 00 00 00 00 00 3e 01 00 00
-b0: 00 00 00 00 00 00 00 36 00 00 00 00 00 00 00 00
+50: 03 58 fc 00 00 00 00 78 09 e8 78 00 7e cb 08 a8
+60: 00 00 00 00 00 00 00 00 9a 02 02 a0 00 00 00 10
+70: 72 10 00 00 c0 00 00 00 2c 00 00 00 00 00 00 00
+80: 3c 10 07 13 00 00 00 00 00 00 00 00 fe 70 08 fc
+90: 11 be 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 c0: 00 00 00 00 ...
From: Linus Torvalds
Date: Wednesday, January 28, 2009 - 7:10 pm

Well, they're all in the "extended set", ie not the basic registers that 
the PCI layer saves. The PCI layer normally just saves the low 16 dwords, 
along with the PCI[EX] capability thing.

None of the PCI save/restore routines have ever saved the extended state 
(well, "ever" is a strong word - I think we long ago used to pass in how 
many bytes we wanted saved, but got rid of it), and it certainly didn't 
change with the recent PCI suspend/resume changes.

I get the feeling that it's some odd tg3 issue. That tg3 driver does have 
that special

        /* Make sure register accesses (indirect or otherwise)
         * will function correctly.
         */
        pci_write_config_dword(tp->pdev,
                               TG3PCI_MISC_HOST_CTRL,
                               tp->misc_host_ctrl);

in its own version of setting the power state, and maybe that really 
_must_ happen before we actually set the state back to PCI_D0. That sounds 
very odd, but hey..

I added Matt Carlson to the cc, since he seems to be the main tg3 
authority here.

Matt: the whole discussion is on netdev and the kernel mailing list, but 
the short version is that -rc3 suspends and resumes for Parag again 
(unlike -rc2), but tg3 doesn't appear to resume properly. The generic PCI 
layer now does more at resume time (very early, when interrupts are still 
off), see

 - pci_pm_resume_noirq ->
     pci_pm_default_resume_noirq() ->
       pci_restore_standard_config()

for more of the details (basically it always does that 
"pci_restore_state()" and tries to bring the device back to PCI_D0).

			Linus
--

From: Matt Carlson
Date: Wednesday, January 28, 2009 - 7:19 pm

Thanks Linus.  I'm looking over the diffs Parag sent and I already see
some suspicious register settings.  Let me think about this some more
and then I'll jump into the discussion.

--

From: Rafael J. Wysocki
Date: Thursday, January 29, 2009 - 3:22 pm

FWIW, I can't reproduce the problem with tg3 on my testbox.  Suspend to RAM
and resume seem to work correctly on it.

Thanks,
Rafael
--

From: Matt Carlson
Date: Thursday, January 29, 2009 - 11:42 am

O.K.  These differences can probably be attributed to the driver's chip
reset failure.  For some reason, the driver has lost communication with
the firmware through the device's shared memory.  A cascading series of
errors will probably be the consequence.

Can you apply the following test patch and see if it helps?  The patch
does two things.  First, it enables a bit which should restore firmware
communication.  If that fixes the problem, then let me know and I'll
spin a proper patch.

In the event that it doesn't work, the patch goes on to test the memory
mapping by simply printing the register value at offset 0x0.  The value
should be the device's vendor ID and device ID.  Please post the
results so that I can verify it.


diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 8b3f846..39fce42 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -7227,6 +7227,11 @@ static int tg3_init_hw(struct tg3 *tp, int reset_phy)
 {
 	tg3_switch_clocks(tp);
 
+	printk( KERN_NOTICE "%s: Reg value at offset 0x0 is 0x%x\n",
+		tp->dev->name, tr32(0x0) );
+
+	tw32(MEMARB_MODE, tr32(MEMARB_MODE) | MEMARB_MODE_ENABLE);
+
 	tw32(TG3PCI_MEM_WIN_BASE_ADDR, 0);
 
 	return tg3_reset_hw(tp, reset_phy);

--

From: Parag Warudkar
Date: Thursday, January 29, 2009 - 3:06 pm

Hi Matt,

Thanks for the patch. It didn't help with resume - but below is the 
output after patching, let me know if you need more details.

( Looks like 0xffffffff is invalid/corrupted device id /vendor id? )

[  163.856001] tg3 0000:0e:00.0: restoring config space at offset 0xc (was 0x0, writing 0x20040000)                                                                                                            
[  163.856001] tg3 0000:0e:00.0: restoring config space at offset 0x3 (was 0x0, writing 0x10)                                                                                                                  
[  163.856001] tg3 0000:0e:00.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100006)                                                                                                         

[snip]

[  164.450277] pcieport-driver 0000:1e:00.0: setting latency timer to 64
[  164.450415] pcieport-driver 0000:1e:01.0: setting latency timer to 64
[  164.450493] tg3 0000:0e:00.0: restoring config space at offset 0xc (was 0x0, writing 0x20040000)
[  164.451110] serial 00:08: activated

[snip]

[  168.913863] Restarting tasks ... done.
[  170.332953] tg3 0000:0e:00.0: wake-up capability disabled by ACPI
[  170.332960] tg3 0000:0e:00.0: PME# disabled
[  170.333047] tg3 0000:0e:00.0: irq 54 for MSI/MSI-X
[  170.333250] eth0: Reg value at offset 0x0 is 0xffffffff
[  170.394281] [drm] Loading R500 Microcode
[  170.394330] [drm] Num pipes: 1
[  171.726650] tg3: eth0: No firmware running.
[  183.119745] ADDRCONF(NETDEV_UP): eth0: link is not ready


Parag
--

From: Matt Carlson
Date: Thursday, January 29, 2009 - 3:22 pm

^^^^^^^^^^^^^^^^^
So here is our problem.  For some reason the memory mapped IO is
failing.  I'll have to think about how and why that might happen.

FWIW, I can suspend and resume using the latest linux-2.6 kernel
on a machine with a similar chip here.  The problem doesn't seem to

--

From: Parag Warudkar
Date: Thursday, January 29, 2009 - 3:35 pm

It is failing for me on HP xw6600 workstation, if that helps in any way.

Parag
--

From: Rafael J. Wysocki
Date: Thursday, January 29, 2009 - 4:10 pm

Hm, I have an xw4600 nearby, will try tomorrow.

Thanks,
Rafael
--

From: Matt Carlson
Date: Friday, January 30, 2009 - 11:40 am

O.K.  Let's test some more assumptions.  Can you apply the following
patch and observe the system logs when the device is first loaded and
again after resume.  The patch looks at the pci command register to
verify that memory space IO is indeed enabled.  (It should be.)  This is
all that should be needed for MMIO to work.

If the PCI_COMMAND message doesn't match, then it means that the
PCI_COMMAND register isn't getting restored for some reason.  If they do
match, then something else in the system is not getting restored
correctly.


diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 8b3f846..67bb29f 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -7225,8 +7225,17 @@ static int tg3_reset_hw(struct tg3 *tp, int reset_phy)
  */
 static int tg3_init_hw(struct tg3 *tp, int reset_phy)
 {
+	u16 cmd;
+
 	tg3_switch_clocks(tp);
 
+	pci_read_config_word(tp->pdev, PCI_COMMAND, &cmd);
+
+	printk(KERN_NOTICE "%s: PCI_COMMAND reg = 0x%x (bit 1 is %s)\n",
+	       tp->dev->name, cmd, (cmd & PCI_COMMAND_MEMORY) ? "on" : "off");
+	printk(KERN_NOTICE "%s: Reg value at offset 0x0 is 0x%x\n",
+	       tp->dev->name, tr32(0x0));
+
 	tw32(TG3PCI_MEM_WIN_BASE_ADDR, 0);
 
 	return tg3_reset_hw(tp, reset_phy);

--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 3:50 pm

Here is the output after applying the patch (fresh boot btw) -

[   29.698877] eth0: PCI_COMMAND reg = 0x406 (bit 1 is on)
[   29.698880] eth0: Reg value at offset 0x0 is 0x167b14e4
[   29.758169] ADDRCONF(NETDEV_UP): eth0: link is not ready
[   31.295087] tg3: eth0: Link is up at 100 Mbps, full duplex.
[   31.295090] tg3: eth0: Flow control is off for TX and off for RX.
[   31.297574] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   41.872007] eth0: no IPv6 routers present
^^^ Pre-Suspend

[  245.924484] eth0: PCI_COMMAND reg = 0x406 (bit 1 is on)
[  245.924487] eth0: Reg value at offset 0x0 is 0xffffffff
[  247.317971] tg3: eth0: No firmware running.
[  258.710634] ADDRCONF(NETDEV_UP): eth0: link is not ready
^^^ Post-Suspend

So it looks like the memory space IO is enabled before and after suspend.
The device/vendor id goes 0xffffffff after resume - just like before.
--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 4:06 pm

One thing strikes me - are there any bridges between the host (CPU) and 
that tg3 device?

Because we obviously have two people who say that their tg3 suspend/resume 
works fine, so the tg3 driver is obviously not _totally_ broken. So I'm 
wondering if there is something funny in between the CPU and the tg3, like 
a hotplug bridge that needs magic to wake up properly.

Because clearly the PCI config space addresses are working fine, but the 
thing is, while PCI config space accesses are routed by the device number 
(and the bridges notion of secondary bridging), the PCI memory space 
routing is based on address. So a PCI bridge can easily get one right (in 
fact, it's really hard to get config space accesses wrong without the 
bridges being _totally_ screwed up), while not routing the other at all.

So just do that "lspci -vvxxx" for the whole box, before and after, and 
send us the "before" and the "diff -u before after" thing, and maybe that 
shows something interesting. Because some bridge chip being confused would 
also explain why a total re-init of the whole tg3 chip by a driver unload 
and reload doesn't seem to help.

		Linus
--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 4:33 pm

It might also be instructive to see the same thing for a working kernel. I 
assume plain 2.6.28 works for you?

			Linus
--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 4:45 pm

Totally worth having this problem from a "getting an opportunity to 
understand" standpoint. This confirms my long standing suspicion that bugs 
in Linux kernel are merely a handiwork of few clever people to get more 
people to understand and contribute :)

Any how  here is the pre-suspend lspci -vvxxx output followed by diff -u -

00:00.0 Host bridge: Intel Corporation 5400 Chipset Memory Controller Hub (rev 20)
	Subsystem: Hewlett-Packard Company Device 1307
	Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 255
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Message Signalled Interrupts: Mask- 64bit- Queue=0/1 Enable-
		Address: fee00000  Data: 0000
	Capabilities: [6c] Express (v2) Root Port (Slot-), MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag- RBE+ FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x4, ASPM L0s L1, Latency L0 <1us, L1 <4us
			ClockPM- Suprise+ LLActRep+ BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
	Capabilities: [100] Advanced Error Reporting <?>
00: 86 80 03 40 04 01 10 00 20 00 00 06 10 00 00 00
10: 00 00 00 00 ...
From: Linus Torvalds
Date: Friday, January 30, 2009 - 4:57 pm

We've disabled IO and MEM behind this bridge - the one that bridges to 
secondary bus 0x0e.

And your tg3 device? It's at 0000:0e:00.0. Yeah. Exactly the bus that 
we've disabled IO and MEM for.

In other words, it was never your tg3 suspend/resume that was buggy. It 
was the suspend/resume for the PCI-E port driver. In fact, I think it's 
_exactly_ the issue I just emailed out about.

I bet Rafael can whip up a patch in a minute. I have too much of a 
headache right now to look at my screen for a while, so I'll take a break.

			Linus
--

From: Rafael J. Wysocki
Date: Friday, January 30, 2009 - 4:59 pm

[--snip--]


and the PCIe port driver may be at fault.

Can you try to remove the pci_save_state(dev) from pcie_port_suspend_late()
and see if that helps?

Rafael
--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 5:28 pm

I assume you meant pcie_portdrv_suspend_late in 
drivers/pci/pcie/portdrv_pci.c - that one did not go well.

With that change machine refuses to suspend (comes back after attempting) 
and my keyboard goes dead.

Parag
--

From: Rafael J. Wysocki
Date: Friday, January 30, 2009 - 5:38 pm

This gets more and more interesting.

Can you test the patch below, please?

Rafael

---
Subject: PCI PCIe portdrv: Implement pm object
From: Rafael J. Wysocki <rjw@sisk.pl>

Implement pm object for the PCI Express port driver in order to use
the new power management framework.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/pci/hotplug/pciehp_core.c |    4 +--
 drivers/pci/pcie/aer/aerdrv.c     |    6 -----
 drivers/pci/pcie/portdrv.h        |    4 +--
 drivers/pci/pcie/portdrv_core.c   |   14 +++++-------
 drivers/pci/pcie/portdrv_pci.c    |   43 ++++++++++++--------------------------
 include/linux/pcieport_if.h       |    2 -
 6 files changed, 25 insertions(+), 48 deletions(-)

Index: linux-2.6/drivers/pci/pcie/portdrv_pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/portdrv_pci.c
+++ linux-2.6/drivers/pci/pcie/portdrv_pci.c
@@ -49,33 +49,21 @@ static int pcie_portdrv_restore_config(s
 }
 
 #ifdef CONFIG_PM
-static int pcie_portdrv_suspend(struct pci_dev *dev, pm_message_t state)
-{
-	return pcie_port_device_suspend(dev, state);
-
-}
+static struct dev_pm_ops pcie_portdrv_pm_ops = {
+	.suspend = pcie_port_device_suspend,
+	.resume = pcie_port_device_resume,
+	.freeze = pcie_port_device_suspend,
+	.thaw = pcie_port_device_resume,
+	.poweroff = pcie_port_device_suspend,
+	.restore = pcie_port_device_resume,
+};
 
-static int pcie_portdrv_suspend_late(struct pci_dev *dev, pm_message_t state)
-{
-	return pci_save_state(dev);
-}
+#define PCIE_PORTDRV_PM_OPS	(&pcie_portdrv_pm_ops)
 
-static int pcie_portdrv_resume_early(struct pci_dev *dev)
-{
-	return pci_restore_state(dev);
-}
+#else /* !PM */
 
-static int pcie_portdrv_resume(struct pci_dev *dev)
-{
-	pcie_portdrv_restore_config(dev);
-	return pcie_port_device_resume(dev);
-}
-#else
-#define pcie_portdrv_suspend NULL
-#define pcie_portdrv_suspend_late NULL
-#define pcie_portdrv_resume_early NULL
-#define ...
From: Ingo Molnar
Date: Friday, January 30, 2009 - 5:44 pm

pet peeve: could we please use vertical spaces wherever they improve the 
code?

Something like:

static struct dev_pm_ops pcie_portdrv_pm_ops = {
	.suspend	= pcie_port_device_suspend,
	.resume		= pcie_port_device_resume,
	.freeze		= pcie_port_device_suspend,
	.thaw		= pcie_port_device_resume,
	.poweroff	= pcie_port_device_suspend,
	.restore	= pcie_port_device_resume,
};

... and it all becomes clear at a glance. Please?

	Ingo
--

From: Rafael J. Wysocki
Date: Friday, January 30, 2009 - 5:47 pm

OK

Rafael
--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 6:21 pm

Excellent! This patch works - tg3 comes back and gets link after resume.

Thank you!

Are the below differences worth worrying about - especially since post 
suspend some DevID/VendorID and some capabilities seem to be changed?

parag@parag-desktop:~$ diff -u lspci-pre-suspend lspci-post-fix 
--- lspci-pre-suspend   2009-01-30 18:19:50.752275695 -0500     
+++ lspci-post-fix      2009-01-30 20:14:22.605607870 -0500     
@@ -286,9 +286,9 @@                                             
 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00            
 20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 86 80            
 30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00            
-40: 00 00 00 00 00 00 0d 00 00 00 00 00 00 00 00 00            
+40: 00 00 00 00 00 00 0f 00 00 00 00 00 00 00 00 00            
 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00            
-60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00            
+60: 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00            
 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00            
 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00            
 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00            
@@ -353,7 +353,7 @@                                             
 40: 00 00 00 80 14 14 b0 b0 ff 01 00 30 ff 01 00 30            
 50: 00 00 00 07 00 00 00 00 ff 3f ff 3f 00 40 00 40            
 60: 00 40 00 40 01 80 01 80 00 00 00 00 00 00 00 00            
-70: 00 00 00 00 6b e0 6b e0 00 00 0b a9 00 00 0b a9            
+70: 00 00 00 00 00 c0 00 c0 01 1b 4c b8 01 1b 4c b8            
 80: 24 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00            
 90: 01 01 08 00 01 01 08 00 01 01 08 00 01 01 08 00            
 a0: 01 01 08 00 00 00 00 00 00 00 00 00 00 00 00 00            
@@ -395,7 +395,7 @@                                             
 40: 00 00 00 80 14 14 b0 b0 ff 01 00 30 ff 01 00 30            
 50: 00 00 00 07 00 00 00 00 ff 3f ff 3f 00 40 00 40            
 60: 00 40 00 40 01 ...
From: Rafael J. Wysocki
Date: Friday, January 30, 2009 - 6:37 pm

Great, thanks for testing.

It's in the Jesse's linux-next branch, so it should be easy to push

I can't tell you right now, I'm too tired. :-)

Anyway, they seem to be worth investigating.

Can you attach one of the files?  That will make it easier to look at the
differences.

Thanks,
--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 6:42 pm

Sure - here is the lspci-pre-suspend file.

Parag
From: Rafael J. Wysocki
Date: Tuesday, February 3, 2009 - 2:29 am

Hi Parag,

Did you have a chance to check if the patch from Linus (reproduced
below) on top of current -git fixed the problem for you too?

Thanks,
Rafael

---
 drivers/pci/pcie/portdrv_pci.c |   14 --------------
 1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 99a914a..08a8e3c 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -55,16 +55,6 @@ static int pcie_portdrv_suspend(struct pci_dev *dev, pm_message_t state)
 
 }
 
-static int pcie_portdrv_suspend_late(struct pci_dev *dev, pm_message_t state)
-{
-	return pci_save_state(dev);
-}
-
-static int pcie_portdrv_resume_early(struct pci_dev *dev)
-{
-	return pci_restore_state(dev);
-}
-
 static int pcie_portdrv_resume(struct pci_dev *dev)
 {
 	pcie_portdrv_restore_config(dev);
@@ -72,8 +62,6 @@ static int pcie_portdrv_resume(struct pci_dev *dev)
 }
 #else
 #define pcie_portdrv_suspend NULL
-#define pcie_portdrv_suspend_late NULL
-#define pcie_portdrv_resume_early NULL
 #define pcie_portdrv_resume NULL
 #endif
 
@@ -292,8 +280,6 @@ static struct pci_driver pcie_portdriver = {
 	.remove		= pcie_portdrv_remove,
 
 	.suspend	= pcie_portdrv_suspend,
-	.suspend_late	= pcie_portdrv_suspend_late,
-	.resume_early	= pcie_portdrv_resume_early,
 	.resume		= pcie_portdrv_resume,
 
 	.err_handler 	= &pcie_portdrv_err_handler,

--

From: Parag Warudkar
Date: Tuesday, February 3, 2009 - 2:27 pm

Hi Rafael,

Yes, the patch below (atop today's git) does work for me.

--

From: Rafael J. Wysocki
Date: Tuesday, February 3, 2009 - 3:15 pm

Good.

Can you also test this one instead and tell me if it still works?

Rafael

---
 drivers/pci/pcie/portdrv_pci.c |   16 +---------------
 1 file changed, 1 insertion(+), 15 deletions(-)

Index: linux-2.6/drivers/pci/pcie/portdrv_pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pcie/portdrv_pci.c
+++ linux-2.6/drivers/pci/pcie/portdrv_pci.c
@@ -55,25 +55,13 @@ static int pcie_portdrv_suspend(struct p
 
 }
 
-static int pcie_portdrv_suspend_late(struct pci_dev *dev, pm_message_t state)
-{
-	return pci_save_state(dev);
-}
-
-static int pcie_portdrv_resume_early(struct pci_dev *dev)
-{
-	return pci_restore_state(dev);
-}
-
 static int pcie_portdrv_resume(struct pci_dev *dev)
 {
-	pcie_portdrv_restore_config(dev);
+	pci_set_master(dev);
 	return pcie_port_device_resume(dev);
 }
 #else
 #define pcie_portdrv_suspend NULL
-#define pcie_portdrv_suspend_late NULL
-#define pcie_portdrv_resume_early NULL
 #define pcie_portdrv_resume NULL
 #endif
 
@@ -292,8 +280,6 @@ static struct pci_driver pcie_portdriver
 	.remove		= pcie_portdrv_remove,
 
 	.suspend	= pcie_portdrv_suspend,
-	.suspend_late	= pcie_portdrv_suspend_late,
-	.resume_early	= pcie_portdrv_resume_early,
 	.resume		= pcie_portdrv_resume,
 
 	.err_handler 	= &pcie_portdrv_err_handler,

--

From: Parag Warudkar
Date: Tuesday, February 3, 2009 - 5:38 pm

Removing Linus' patch and applying the below one works as well.

--

From: Rafael J. Wysocki
Date: Tuesday, February 3, 2009 - 5:41 pm

Great, thanks for testing.

I'm going to push this one to Jesse.

Best,
--

From: Linus Torvalds
Date: Friday, February 6, 2009 - 8:00 pm

Jesse, should I take this directly (and presumably the related series by 
Rafael)?

I was going to release -rc4 this weekend (it's already more than a week - 
I'm bad), and I'd hate for pci-e bridges to not resume correctly. So I'd 
either like a pull request from you with the pending suspend fixes, or 
I'll just apply the patch(es) directly.

		Linus
--

From: Jesse Barnes
Date: Saturday, February 7, 2009 - 11:03 am

Ok, you can just pull from me then.  /me uses request-pull wow that's easy.


The following changes since commit eda58a85ec3fc05855a26654d97a2b53f0e715b9: 
  Linus Torvalds (1):                                                        
        Merge branch 'for-linus' of git://git.kernel.org/.../cooloney/blackfin-2.6                                                                              

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6 for-linus

Alex Chiang (1):
      PCI: properly clean up ASPM link state on device remove

Rafael J. Wysocki (7):
      PCI PM: Fix handling of devices without drivers
      PCI PM: Check if the state has been saved before trying to restore it
      PCI PM: Fix saving of device state in pci_legacy_suspend
      PCI: PCIe portdrv: Simplify suspend and resume
      PCI PM: Do not disable and enable bridges during suspend-resume
      PCI PM: Read power state from device after trying to change it on resume
      PCI PM: make the PM core more careful with drivers using the new PM framework

Timothy S. Nelson (1):
      PCI: return error on failure to read PCI ROMs

 Documentation/filesystems/sysfs-pci.txt |   13 +++-
 arch/ia64/sn/kernel/io_acpi_init.c      |    2 +-
 arch/ia64/sn/kernel/io_init.c           |    2 +-
 drivers/pci/pci-driver.c                |  164 ++++++++++++++++++++-----------
 drivers/pci/pci-sysfs.c                 |    4 +-
 drivers/pci/pci.c                       |    4 +-
 drivers/pci/pcie/aspm.c                 |    4 +-
 drivers/pci/pcie/portdrv_pci.c          |   16 +---
 drivers/pci/rom.c                       |    8 +-
 include/linux/pci.h                     |    2 +-
 10 files changed, 132 insertions(+), 87 deletions(-)

-- 
Jesse Barnes, Intel Open Source Technology Center
--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 6:46 pm

I still think the patch isn't very good. See my previous email. 

The fact that your machine works again is good, though. But before we let 
this lie, I'd _really_ like to know what was broken in the legacy PM path, 
rather than "let's leave it behind". Because a broken legacy path will end 
up biting us for other drivers, and I think the new PM path will need more 

That's not a devid/vendorid, it's an extended "subsystem" capability that 
we didn't save. We could try to save/restore all capabilities, but right 
now we only do the ones we care about (pcie/pcix/msi, iirc).

I do suspect we might be better off saving everything we can, rather than 
deciding piece-meal to save specific capabilities we know about.

			Linus
--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 6:54 pm

Ok - I will run with it temporarily and will do what I can to track down 
what was wrong with the legacy PM.

Thanks!
Parag

--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 7:25 pm

If you can try the patch I just sent out, and use that as a base for 
trying to track down why the heck the legacy code doesn't work, that would 
be great. It might fix it (assuming my guess about "restore_state while in 
PCI_D3 doesn't work" was correct), but quite frankly, it's equally 
possible that it just makes things worse. But it would be really 
interesting to hear..

Your machine does seem to be interesting, in that not only does it have a 
PCI-E bridge in it (the eeepc I was playing around with at LCA does not), 
but judging by the lost config state I also suspect that it actually loses 
power during STR.

Which is not at all necessarily a given - I suspect it depends on just how 
the power rails are set up on the motherboard. The fact that PCI-E bridges 
have apparently worked for others implies that your problems don't happen 
for everybody, and may relate to that issue.

		Linus
--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 7:40 pm

Not sure what the significance of eeepc is in this case - mine being a 
standard Intel 5400 chipset I would have thought that's the last 

Do we know this for sure that PCI-E bridges + Suspend have worked for 
others - In this thread at least I think people reported tg3 worked but not 
necessarily with a PCI-E bridge.

Parag
--

From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 11:51 am

Technically, they are PCIe root ports and many systems have network adapters
connected through them (I have two such systems here, but none of them with
tg3-compatible hardware - my tg3 is a PCI device).

Thanks,
Rafael
--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 7:19 pm

Ho humm. I have a feeling..

The legacy resume basically ends up doing just

	pci_restore_standard_config(dev)

in the resume_early path (in pci_pm_default_resume_noirq, which is shared 
with both the legacy and the new PM model).

The _new_ PM layer does that too (it's shared), but then in the regular 
resume sequence it _also_ does the pci_pm_reenable_device(), and I think 
this is key.

Why?

Look at pci_restore_standard_config(): it restores the PCI config space, 
but it does so _before_ actually turning the device into PCI_D0. And 
that's not actually guaranteed to work at all - if a device is in D3, you 
can still read from config space, but writing to it may or may not 
actually work.

This may explain why your PCIE bridge works when moving over to the new 
PM: because of the whole pci_pm_reenable_device() thing, we end up 
doing the pci_enable_resources() thing later, and now it's in PCI_D0, so 
now the device actually reacts to it.

This also explains why we don't care if we save the wrong state or not: 
even if we save state with IO/MEM disabled, we'll re-enable it at 
->resume() time.

HOWEVER, that's still buggy, since it's potentially too late, since any 
interrupts that come in before that will see the device without the IO/MEM 
set, leading to the whole "hung interrupt" issue again even if the device 
is now on.

So we really do want to restore the state to whatever saves state in the 
_early_ resume phase, both for legacy and new PM rules, I think. And we 
want to make sure that we restore state while the device is in D0, because 
otherwise I really think that it possibly could lose our writes.

(Somebody should check me on that - maybe I remember wrong, and config 
space writes are guaranteed to take effect even when something is in 
D3cold).

So I think we should:

 - make sure that the generic PCI layer saves the right state, for the new 
   PM model too. Don't save it if the driver already did (because the 
   driver may have turned off ...
From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 1:45 pm

Well, please remember that the PCIe port driver has ->resume() too, which is
pcie_portdrv_resume(), which calls pcie_portdrv_restore_config() and that
calls pci_enable_device() (not reenable, just plain enable).

However, that sees the device has been already enabled and doesn't enable


Yes, I sent a patch for it to Jesse, but he hasn't pushed it yet:
http://git.kernel.org/?p=linux/kernel/git/jbarnes/pci-2.6.git;a=commit;h=48f67f54a53bb...


Nice theory, but please also look at the original 2.6.28 code in
drivers/pci/pcie/portdrv_pci.c .  It does:
(suspend):
- suspend port services
- pci_save_state

(resume):
- pci_restore_state()
- pci_enable_device() -> this doesn't enable the device, because it sees the
  non-zero refrence count end doesn't do anything.

So, according to your theory, the 2.6.28 code shouldn't work for Parag, but it
does.

Moreover, why oh why the state we save would contain IO/MEM disabled?




Only in D3hot, but MSI-X tables reside in memory space and we need to restore
them too.  So, we need to put the device into D0 before restoring the state, if
possible, anyway.

Still, we can't bring any device from D3cold into D0 without ACPI and we can't


In the new PM model the idea is that drivers won't do that.  The core is

With the above-mentioned patch from the Jesse's tree, it should work this way.

Still, as I said before, I don't think this is relevant in this particular

I think something different from what you told is happening here.  I'm not sure
what it is, I need more information.

I'm very much interested in whether your patch below makes any difference for
Parag.

Thanks,
--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 6:41 pm

Rafael, you're making this _way_ too difficult.

Don't make it use the new PM infrastructure, because that one is certainly 
broken: pci_pm_default_suspend_generic() is total crap.

It's saving the disabled state. No WAY is that correct.

That "pci_disable_enabled_device()" should be removed, but even then 
that's wrong, because if the driver suspend disabled it, you're now 
(again) saving the disabled state.

But all of that is only called if you use the new PM infrastructure. So 
the thing is, when you're trying to move the PCI-E drive to the new pm 
infrastructure, you're making things _worse_.

The legacy PM infrastructure at least does the whole

	pci_dev->state_saved = false;
	i = drv->suspend(pci_dev, state);
	..
	if (pci_dev->state_saved)
		goto Fixup;

thing, which will avoid overwriting the state if it was already saved.

HOWEVER. The problem here (I think) is that PCI-E actually does the state 
save late, so it won't ever see the "state_saved" in the early ->suspend. 
I think a patch like the one below at least simplifies this all, and lets 
the PCI layer itself do all the core restore stuff.

The new PM infrastructure gets this totally wrong, and
 (a) disables the device before saving state
and
 (b) overwrites the (now corrupted) saved state that the driver perhaps 
     already saved, after the driver may even have put it to sleep.

So let's not use the new PM infrastructure - I don't think it's ready yet.

Let's start simplifying first. Start off by getting rid of the 
suspend_early/resume_late, since the PCI layer now does it for us.

I don't see why we don't resume with IO/MEM on, though. The legacy suspend 
sequence shouldn't disable them, afaik.

		Linus

---
 drivers/pci/pcie/portdrv_pci.c |   14 --------------
 1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 99a914a..08a8e3c 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ ...
From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 2:08 pm

pci_disable_device() does really only one thing: it clears the bus master bit.
Yes, it also calls pcibios_disable_device(), but on x86 this is a NOP.


The driver using the new model is not supposed to save the state and power

No, it shouldn't.

However, the patch below actually may help and it really is not too different
from my "new infrastructure" patch.  It leaves the pcie_portdrv_restore_config()
in the PCIe port driver's ->resume(), but that shouldn't change things,
pci_enable_device() in there shouldn't do anything and the bus master bit
should already be set.

The "new infrastracture" patch makes pci_disable_enabled_device() be called in
the suspend code path, but that only disables the bus master bit, and
pci_reenable_device() be called in the resume code path, but that only sets the
bus master bit back again.  So, they are almost the same and I'd be surprised
if your patch didn't help.

I had that "new infrastracture" patch ready yesterday and I thought it might
help, so I sent it.  I was too tired to prepare a new patch and that would
probably look like the one below (I'd remove the pcie_portdrv_restore_config()
from ->resume too, but that's only a detail).

Thanks,
--

From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 2:42 pm

I should have said "in this particular case", because it actually makes a
difference for devices using interrupt pins.  Namely, pcibios_enable_device()
additionally enables PCI resources (that may make a difference, but everything
should have been restored already) and sets up an interrupt link for the
device.  If the link has been set up already, which often is the case, it
increases a reference count and exits.

Now, this reference count is not used for anything, but it was supposed to be
used for disabling the interrupt links that are no longer needed.  This
mechanism is currently disabled, but if we enable it at one point (which may
be necessary to fix suspend-resume on some boxes), it won't work if
the "enable" calls are not balanced with "disable" ones.  IOW, for every
pcibios_enable_device() call there should be a complementary
pcibios_disable_device() call.  That's why I decided to put the
disable/enable things into the PCI core's suspend/resume code paths (also,
because pci_reenable_device() has been already there for driverless devices).
This might not be the right choice, but I don't really think it does break
things.

Anyway, from what I can tell reading your messages in this thread so far,
you seem to want the PCI core to:
(1) save the state of devices during suspend (avoid doing that if the driver has
    already saved the state),
(2) put devices into D0 during resume (early),
(3) restore the state of devices during resume (early).
Still, you don't want the core to disable devices during suspend and to enable
(or reenable) them during resume.

What about putting devices into low power states?  [Note that ACPI may be
necessary for this purpose.]

What about devices with no drivers and/or without suspend/resume support?
Do you want them to be disabled during suspend and enabled during resume by
the core?  [I guess you do, they have no interrupt handlers that may break
after all.]

Thanks,
Rafael
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 2:59 pm

[Empty message]
From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 4:08 pm

(As I wrote in the other message just sent) I wonder if it's a good idea to
introduce an "override default PCI suspend/resume" flag for this purpose.

The drivers that want to handle the PCI stuff themselves would be able to use
this flag to tell the core not to try to power manage the device etc.

Alternatively, the core may check the state_saved bit and look at current_state
to see if the device need not be put into a low power state.  Still, putting



I know.  Still, all of the drivers that implement suspend-resume put the
devices into low power states in ->suspend and it's never been observed to
be a source of problems.

Unfortunately, on many systems we are supposed to use ACPI for putting PCI
devices into low power states and right now we can't do that with interrupts
off due to the limitations of our AML interpreter.

The same applies to resume, BTW, but resume is easier, because devices tend
to already be in D0 from the start.  Admittedly, though, we may have a problem
with devices that are not in D0 at that point and _require_ ACPI to power them
up (ie. don't support the native PCI PM).  Not that I know of any, but still.

ISTR that some devices will not wake up the system if they are not put into
the low power state by ACPI during suspend.

So, I wonder.  Do we need to make the AML interpreter allow us to run code

Yes, the bridges with drivers are somewhat special.

Thanks,
Rafael
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 4:27 pm

I'd be surprised if we didn't have devices that cannot act as wakeup 
devices in D3, for example. Yes, I'm sure it _should_ work, and I realize 
that the whole "platform_pci_choose_state()" is supposed to do this all 
right, but let me just take a wild stab at guessing that it's not always 
going to work.

So I would not be surprised if some devices will want to not be put in D3.

There's also the issue of debugging: we may well want to simply skip 
suspending the console device (and all bridges leading up to it). We don't 
do that right now, and we basically depend on "no_suspend_console" just 
working _despite_ that (because our legacy PCI power management never put 
things into sleep states), but that's another example of something where a 
driver might simply decide that it doesn't want the default PCI layer 
decisions: not because the device cannot do it, but because _we_ don't 

I do agree that problems at suspend time are probably somewhat less likely 
than at resume time. The devices are mostly in known states (rather than 
whatever random state they come up in and the BIOS early programming may 
do), and hopefully quiescent (because we told user space to freeze).

But I wouldn't bet on it in general. Think network cards on a shared 
interrupt, where the network card gets suspended _after_ the device that 
it shares interrupts with. Is the network quiescent? On a lot of desktop 
machines it probably is.

But how many people test STR while doing a "ping -f" from another machine?

It _should_ work. Do you guarantee that it does?

I think we should aim for "yes, we do guarantee that it does".

			Linus
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 4:39 pm

Btw, this really only is interesting if there's a shared interrupt. 

I'm sure that there are network drivers that will crash even on their own 
with _just_ the right timing (imagine having a delayed interrupt pending, 
then doing the "pci_set_power_state(PCI_D3hot)" thing, and then get the 
interrupt handler invoked on another CPU _just_ afterwards), but it's 
probably really hard to trigger, and a bug in that specific driver anyway.

But what's much more interesting (and not necessarily a driver bug, but a 
general PM infrastructure problem) is if we have that shared interrupt 
case, and the network driver gets lots of interrupts just as "driver X" is 
shutting down with that interrupt shared. Then, "driver X" will get 
interrupts after the PM layer has put its device to sleep, and now "driver 
X" is quite understandably confused - it didn't even do the "put to sleep" 
itself, but now its device is no longer responding.

And now it's not a really unlikely race condition any more. 

			Linus
--

From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 5:36 pm

All this leads to the conclusion that we should put devices into low power
states with interrupts off and this seems to imply that we'll need to make the
AML interpreter allow us to run AML with interrupts off.

Still, what about the following rule:
- If the device is supposed to wake up the system, the driver should prepare it
  and put it into a low power state using the existing PCI callbacks, in
  ->suspend().  In that case, the driver is also required to save the state of
  the device before putting it into the low power state.  It is also required
  to make sure that its interrupt handler will not get confused in case of
  shared interrupts.
- If the state of the device hasn't been saved by the driver, the core is
  required to save its state (with interrupts off, I suppose?).
- If the state of the device hasn't been saved by the driver, the core will
  attempt to put the device into a low power state, using the native PCI PM and
  with interrupts off, unless PCI_DEV_FLAGS_NO_D3 is set in dev->flags.

Thanks,
Rafael
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 6:06 pm

How many devices actually have the _PS3 method (or whatever it is that we 
end up executing)? We might be able to simply flag it, and say "ok, if we 
have a _PS3 method, we'll have to suspend early, otherwise we can leave it 
for a late suspend".

Definitely not perfect, but perhaps a way to get the safe thing on 99% of 
all cases, and have to live with the horrid ACPI rules on some things.

I thought the _DSW thing is common for setting up wakeup, but _PSx is not. 
But I have not looked at many ACPI tables in my life. I try to active 
avoid it if I at all humanly can.

			Linus
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 6:13 pm

Doing a quick grep on my laptop seems to confirm that. No actual _PSx 
things found in any acpi tables at all that I can see.

And on a mac mini, there are _PS3 entries that _look_ like they are 
connected to the IDE controller and the realtime clock, but it looks like 
they don't happen for the PCI devices.

But again - I may have screwed that up. ACPI tables are not my favourite 
data structure.

		Linus
--

From: Arjan van de Ven
Date: Saturday, January 31, 2009 - 6:20 pm

On Sat, 31 Jan 2009 17:06:47 -0800 (PST)

in this area there's a pet pieve of mine, or rather something that
shows up on kerneloops.org quite a bit:
There are several PCI quirks that get run with irqs off, but they are
just the "normal" boot time quirks, and when they get run there, they
can sleep. Some of them do things like call ioremap() and the like....
from memory some of the asus and via quirks come to mind...



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 6:24 pm

That seems doable at first sight, although I think we should take D1 and D2
into account too (the ACPI rules may be that for S3, ie. suspend to RAM, given
device should be put into D2, for example).  We have a function for checking
if device is power-manageable by ACPI.

Still, in that case, should the rule be that if the device is power-manageable
by ACPI, the PCI core is supposed to put it into a low power state (using ACPI)
with interrupts on and if the device is not power-manageable by ACPI, the

They are not that uncommon AFAICS.  On all of my boxes there are devices
power-manageable by ACPI.  Usually they are USB controllers and network

Usually, we need to use ACPI to set up wake-up and then use ACPI to put the
device into a low power state.  Otherwise, the wake-up may not work.

Thanks,
Rafael
--

From: Pavel Machek
Date: Saturday, February 7, 2009 - 2:21 am

Device with driver but with no suspend/resume support can still use
interrupts. But I don't think there's much we can do in that
case... except perhaps failing the suspend.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 2:47 pm

Imagine that you're a bridge. Imagine what this does to any device 
_behind_ you. Any device that plans to still do something at suspend_late 

And I suspect it simply doesn't. And since almost nobody uses the new PM 
state, you just don't see it.

But as to why it fixes Parag's case - I think that's because the new PM 
resume does more than the legacy resume does, so it ends up re-enabling 
things anyway. It does it too late, but it doesn't matter in this case (no 


It's bad. It means that DMA won't work across such a bridge. Yes, it is 
probably bridge-dependent, and I know for a fact that at least some Intel 
bridges just hard-code the busmaster bit to 1 (at a minimum the host 
bridges do, I'm not sure about others), but I also know for a fact that 
some other bridges _will_ stop DMA to devices behind them if the BM bit is 
clear.

But more importantly: Why do you do it? What's the upside? I don't see it. 
There's a known downside - you're saving state that is something else than 
what the driver really expects. 

So I think clearing bus-master is a huge bug on a bridge, but I think that 

How about devices that have magic power-down sequences? For example, a 
quick grep shows that USB on a PPC PMAC has a special "disable ASIC clocks 
for USB" thing after it puts the USB controller to sleep.

That was literally the _first_ driver I looked at. Admittedly because I 
knew that USB host controllers tend to be more aware of all the issues 
than most random drivers, but still...

I agree that the new model should turn off devices by default, but the 
thing is, it should also allow drivers that really know better to do magic 
things.

Of course, we can say that such devices should just continue to use the 
legacy model, but I thought that the long-term plan was to just replace 
it. And if you do, you need to allow for drivers that do special things 
due to known motherboard- or chip-specific "issues" (aka "PCI extensions" 
aka "hardware bugs").

And yes, I suspect ...
From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 3:46 pm

But many drivers have an analogous code sequence in their PM callbacks and

Still, the 2.6.28 resume didn't do the "reenable device" thing and it worked.



DMA will only not work until the ->resume sets the bus master bit, which
happes before the ->resume of any device behind the bridge runs.  There only
is a small window where something (theoretically) may go wrong and I really
don't expect any driver to start DMA from its ->resume_realy or an interrupt




Well, it's your patch after all, isn't it? ;-)

Rafael
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 4:01 pm

Rafael!

Read what I write. Twice.

Here it is again: "Imagine that you're a bridge."

Stop the idiocy of just ignoring what I write, and talking about something 
else.


It's possible. Still, it really shouldn't matter.

Or at least, it shouldn't matter, as long as what you restore is sane. 
You're just going to rewrite the same data, after all.

That's why I was trying to see if the IO/MEM bits got cleared in the save 

The problem is that you're looking at some individual devices, and saying 
"it works for them, so it must work for everybody". Add to that the fact 
that you apparently _still_ haven't figured out the difference between 
bridges and regular devices, and that most most motherboards probably keep 
the PCI bridges powered anyway, and...

And yes, I realize that this is how PM under Linux has worked for a long 
time. But it's what I think we should get away from. It's why I pushed so 
hard to get the whole interrupt handling sane and stable. 

The argument that "it works for a lot of machines" IS NOT AN ARGUMENT, 
Rafael! Stop using it as such.

We know that 2.6.28 suspend/resume works for a lot of laptops. Even 
possibly _most_ laptops. But it was still broken. We want to get _away_ 

Read my emails. THIS ISN'T EVEN A RESUME-TIME PROBLEM!

The problems happen on purely the suspend path. How the f*ck do you know 
that the drivers behind the bridge don't do everything at 'suspend_late' 
time, and expect to be working up until that time?

Here's a big hint: YOU DO NOT KNOW. YOU MUST NOT TURN OFF THE BRIDGE AT 
SUSPEND TIME!

I'm getting really fed up with you here. You're not even listening. And 

So?

Irrelevant. We want to handle the exceptional case too. And we generally 

.. why? Wouldn't it be a hell of a lot nicer if the PCI layer just did 
things right automatically. 

Which the legacy layer already does. It sees "ok, the driver did it's own 
pci_save_state(), I'm not going to do it for it".

THAT is robust. And simple. Wouldn't ...
From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 5:11 pm

[Confused] Yes, they are.

And yes, I agreed that the bus master bit shouldn't be cleared for them.  Which
does not necessarily imply that clearing that bit will always lead to problems

Yes, but I suspect the device is misbehaving due to some timing issues.

With devices that behave totally correctly the current code doesn't appear to


Look.  I have only a limited set of data from devices that have been tested.
On the other hand I have some pieces of documentation that are sometimes not
very clear.  I try to use the data and the docs I have to figure out how things
actually work.

That's why the cases like the Parag's one are so important.  They allow me to
verify assumptions.

So no, I don't say "it must work for everybody", but I have to assume
_something_ if I don't know that _for_ _sure_.  I sometimes make wrong
assumptions, which is a consequence of the fact that I can only use a limited
set of data to verify my observations.

And please note, we still don't know for sure why the Parag's box actually


Working on many (different) systems is a good indication of what is and what
is not going to work in general.  It obviously doesn't imply correctness,


Why are you shouting at me?  Did you read my reply to your other message?
How many times do I have to tell you I AGREE THAT I SHOULD NOT TURN OFF THE

On suspend we have two things to do:
- save the state
- put the device into a low power state

The state_saved flag can be used to see if the driver has saved the state.

What about powering off?  Are we going to assume that if the driver has saved
the state of the device, the core should not put the device into a low power
state?

Rafael
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 5:32 pm

Rafael. Stop being a total idiot.

Read what I wrote.

I'm saying that the driver may not do anything at all at suspend() time, 
and leaves everything until suspend_late. Then, at suspend_late(), it 
finally really shuts down.

That's actually a very reasonable thing to do in some circumstances. It 
simplifies everything, in particular all interrupt handling, since the 
device is now fully live all the way while interrupts can happen.

For a USB host controller, for example, it really could make sense to do 
that - just leave all the core host controller stuff running, and the only 
thing the "suspend()" callback does is to send the commands to the actual 
devices, it doesn't necessarily touch the host controller itself at all.

Then, at suspend_late time, you just clear the "running" bit in the 
controller (and perhaps not even that - because you want to still push 
things out for debugging). End result: you never actually had to shut 
anything down at all, and you could (for example) still run a USB serial 
port console all the way to shutdown.

And yes, I wanted to do basically exactly that when I was debugging some 
issues a year or two ago.

See? The device and driver may be totally alive over a ->suspend() call. 
And that means that the bridge CANNOT KNOW that it's ok to shut down DMA. 
Because DMA may be the only way the device communicates (again: USB 
actually works that way).

So dammit, just admit that you were wrong, instead of continually sending 
these _idiotic_ replies. That "turn off bus mastering" was bogus and 
idiotic, and had no real cause. Ok to just admit it already?

		Linus
--

From: Rafael J. Wysocki
Date: Saturday, January 31, 2009 - 5:41 pm

I said I was.

Thanks,
Rafael
--

From: Linus Torvalds
Date: Saturday, January 31, 2009 - 5:51 pm

Same is quite likely true of things like video graphics adapters. Again, 
for all the same reasons. Think about all those fbcon drivers. They will 
use DMA for things. And again, there are very compelling debugging reasons 
to not suspend them for real until suspend_late (if even then).

		Linus
--

From: Benjamin Herrenschmidt
Date: Friday, February 6, 2009 - 8:27 pm

Actually, there are things like service processors etc... that only know
to communicate via DMA and little else via a little mailbox in some
reserved area of system memory...

Ben.


--

From: Benjamin Herrenschmidt
Date: Friday, February 6, 2009 - 8:26 pm

A quick catch up on old emails ....

A good reason not to turn bridges off is that behind bridges, you'll
find more than your "normal" PCI devices. You'll find system devices,
things like your motherboard GPIO & clock control, i2c controllers,
service processors, your PIC, etc... 

And your BIOS will need them. Your ACPI stuff will indirectly access
these things, for example to get the machine into actual suspend state
etc...

Disabling bridges is bad :-)

If they have to be disabled for power management purposes, then it's
likely that ACPI/BIOS itself will do it at the very last minute when
going into S3, and bring them back itself on the way back.

For embedded platforms who want to override that, they can always stick
a driver there or do it manually from a sysdev or platform code.

IE. I very very much doubt windows does anything to bridges, appart
-maybe- saving/restoring the standard window settings -just-in-case-
they got wiped out and the BIOS didn't restore them.

Ben.


--

From: Rafael J. Wysocki
Date: Thursday, January 29, 2009 - 4:03 pm

[--snip--]

In the meantime I tried to rework tg3 suspend/resume so that it uses the new
PCI core capability of handling the PCI-specific parts of both operations.

The patch is appended, please see if it makes any difference.

Thanks,
Rafael

---
 drivers/net/tg3.c |   70 +++++++++++++++++++-----------------------------------
 1 file changed, 25 insertions(+), 45 deletions(-)

Index: linux-2.6/drivers/net/tg3.c
===================================================================
--- linux-2.6.orig/drivers/net/tg3.c
+++ linux-2.6/drivers/net/tg3.c
@@ -13330,18 +13330,13 @@ static void __devexit tg3_remove_one(str
 	}
 }
 
-static int tg3_suspend(struct pci_dev *pdev, pm_message_t state)
+#ifdef CONFIG_PM
+
+static int tg3_suspend(struct device *device)
 {
+	struct pci_dev *pdev = to_pci_dev(device);
 	struct net_device *dev = pci_get_drvdata(pdev);
 	struct tg3 *tp = netdev_priv(dev);
-	pci_power_t target_state;
-	int err;
-
-	/* PCI register 4 needs to be saved whether netif_running() or not.
-	 * MSI address and data need to be saved if using MSI and
-	 * netif_running().
-	 */
-	pci_save_state(pdev);
 
 	if (!netif_running(dev))
 		return 0;
@@ -13363,50 +13358,19 @@ static int tg3_suspend(struct pci_dev *p
 	tp->tg3_flags &= ~TG3_FLAG_INIT_COMPLETE;
 	tg3_full_unlock(tp);
 
-	target_state = pdev->pm_cap ? pci_target_state(pdev) : PCI_D3hot;
-
-	err = tg3_set_power_state(tp, target_state);
-	if (err) {
-		int err2;
-
-		tg3_full_lock(tp, 0);
-
-		tp->tg3_flags |= TG3_FLAG_INIT_COMPLETE;
-		err2 = tg3_restart_hw(tp, 1);
-		if (err2)
-			goto out;
-
-		tp->timer.expires = jiffies + tp->timer_offset;
-		add_timer(&tp->timer);
-
-		netif_device_attach(dev);
-		tg3_netif_start(tp);
-
-out:
-		tg3_full_unlock(tp);
-
-		if (!err2)
-			tg3_phy_start(tp);
-	}
-
-	return err;
+	return 0;
 }
 
-static int tg3_resume(struct pci_dev *pdev)
+static int tg3_resume(struct device *device)
 {
+	struct pci_dev *pdev = to_pci_dev(device);
 	struct ...
From: Matt Carlson
Date: Thursday, January 29, 2009 - 4:41 pm

tg3_set_power_state() does way more than configuring the power
management registers to the desired state though.  It sets up WOL,

...and here tg3_set_power_state() restores our ability to communicate
with the chip via MMIO.  Also, after restoring the power state to D0,

--

From: Rafael J. Wysocki
Date: Thursday, January 29, 2009 - 5:10 pm

OK, so it requires more care to be taken.

However, suspend-resume seems to work on my test box with this patch applied,
although admittedly I haven't tested WoL.


If that were the case, it wouldn't work after a resume with the patch applied.

Are you sure this actually happens?  On my box the chip is in D0 already

Anyway, I'd very much prefer to separate the generic PCI operations from the
device-specific code.

Thanks,
Rafael
--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 3:31 pm

No difference - tg3 is still dead after resume.

BTW, I applied the patch on top of the one Matt gave earlier.
Machine booted with original tg3 which I rmmod'ed and then insmod'ed the 
new one (with Matt's and Rafael's patch) and then attempted a 
suspend-resume.

Is there a reason to try fresh boot with patched tg3 and without 
loading old module - I guess I will try that one later as well.

Thanks
Parag
--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 3:36 pm

As long as you didn't suspend/resume with the old module loaded, it really 
shouldn't matter.

			Linus
--

From: Rafael J. Wysocki
Date: Friday, January 30, 2009 - 3:54 pm

Thanks for testing.

Well, I'm not sure if tg3 is at fault, really.

What happens if you unload tg3 before suspend and load it back after the
resume?

Rafael
--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 4:07 pm

Yes, good thing to test. See my previous email - maybe it's a bridge.

		Linus
--

From: Parag Warudkar
Date: Friday, January 30, 2009 - 4:13 pm

This time it fails with different error on loading after suspend/resume 
cycle -

 1196.873608] tg3 0000:0e:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 1196.873620] tg3 0000:0e:00.0: setting latency timer to 64
[ 1196.880017] tg3 0000:0e:00.0: PME# disabled
[ 1196.996270] tg3: (0000:0e:00.0) phy probe failed, err -19
[ 1197.508033] tg3: Problem fetching invariants of chip, aborting.
[ 1197.508048] tg3 0000:0e:00.0: PCI INT A disabled
 
Parag
--

From: Rafael J. Wysocki
Date: Friday, January 30, 2009 - 4:31 pm

It seems like something between the tg3 chip and the host CPU doesn't work
correctly after resume, Linus is right.

I wonder if this change makes any difference:

--- linux-2.6.orig/drivers/pci/pci-driver.c
+++ linux-2.6/drivers/pci/pci-driver.c
@@ -501,6 +501,9 @@ static int pci_pm_suspend(struct device
 	if (pci_has_legacy_pm_support(pci_dev))
 		return pci_legacy_suspend(dev, PMSG_SUSPEND);
 
+	if (!drv || !drv->pm)
+		return 0;
+
 	if (drv && drv->pm && drv->pm->suspend) {
 		error = drv->pm->suspend(dev);
 		suspend_report_result(drv->pm->suspend, error);


Rafael
--

From: Linus Torvalds
Date: Friday, January 30, 2009 - 4:51 pm

I don't think that's right. Now you don't end up calling 
pci_pm_default_suspend_generic() at all, and this no pci_save_state().

But I think it could easily be the call to pci_disable_enabled_device(). 
It does that

	if (atomic_read(&dev->enable_cnt))
		do_pci_disable_device(dev);

and that ends up disabling PCI_COMMAND_MASTER and then calling 
pcibios_disable_device().

Any device we have ever done pci_enable_device() on would trigger this, 
which includes PCIE bridges, for example. And while the pcie driver does 
that

	pcie_portdrv_restore_config ->
		pci_enable_device(dev);

thing to re-enable it, that's a no-op since the enable_count is already 
non-zero.

And we do try to restore it (pci_restore_standard_config() will call 
pci_restore_state()), but since we've done the 
pci_disable_enabled_device() _before_ we did the pci_save_state(), we now 
restore a non-working setup. 

I think. The rules are too damn subtle there.  Rafael, can you look around 
a bit?

		Linus
--

From: Rafael J. Wysocki
Date: Friday, January 30, 2009 - 5:07 pm

Sure, I'm looking at it right now.

Rafael
--

From: Rafael J. Wysocki
Date: Friday, January 30, 2009 - 5:34 pm

pci_disable_enabled_device() is not called for the PCIe port driver, because
it has the legacy PM support.

What happens is

pci_pm_suspend(port) ->
 		pci_legacy_suspend(port) ->
 		 		pcie_portdrv_suspend(port) [this doesn't save the state]
 		 		pci_save_state(port)

and then, with interrupts off

pci_pm_suspend_noirq(port) ->
 		pci_legacy_suspend_late(port) ->
 		 		pcie_portdrv_suspend_late(port) ->
 		 		 		pci_save_state(port)

and I suspect this last pci_save_state() breaks things.  I'm not sure why,
though.

Thanks,
Rafael
--

Previous thread: Re: [PANIC] lro + iscsi or lro + skb text search causes panic by Herbert Xu on Wednesday, January 28, 2009 - 2:25 pm. (2 messages)

Next thread: [PATCH 3/5] ebtables: remove unneeded initializations by Stephen Hemminger on Wednesday, January 28, 2009 - 11:25 pm. (1 message)