Re: 2.6.21-rc1: known regressions (v2) (part 2)

Previous thread: [patch 1/6] mm: debug check for the fault vs invalidate race by Nick Piggin on Tuesday, February 20, 2007 - 9:49 pm. (99 messages)

Next thread: [PATCH] devpts: add fsnotify create event by Florin Malita on Tuesday, February 20, 2007 - 10:15 pm. (2 messages)
From: Linus Torvalds
Date: Tuesday, February 20, 2007 - 9:53 pm

Ok, the merge window for 2.6.21 has closed, and -rc1 is out there.

There's a lot of changes, as is usual for an -rc1 thing, but at least so 
far it would seem that 2.6.20 has been a good base, and I don't think we 
have anything *really* scary here.

The most interesting core change may be the dyntick/nohz one, where timer 
ticks will only happen when needed. It's been brewing for a _loong_ time, 
but it's in the standard kernel now as an option. 

But there's a ton of architecture updates (arm, mips, powerpc, x86, you 
name it), ACPI updates, and lots of driver work. And just a lot of 
cleanups.

Have fun,

			Linus
-

From: Thomas Gleixner
Date: Wednesday, February 21, 2007 - 6:32 am

Yup. Fun starts in drivers/net/e1000

e1000 is not working anymore. ifup fails permanentely.
 ADDRCONF(NETDEV_UP): eth0: link is not ready
nothing else

bisect identifies:

d2ed16356ff4fb9de23fbc5e5d582ce580390106 is first bad commit
commit d2ed16356ff4fb9de23fbc5e5d582ce580390106
Author: Kok, Auke <auke-jan.h.kok@intel.com>
Date:   Fri Feb 16 14:39:26 2007 -0800

    e1000: fix shared interrupt warning message

    Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
    Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Jeff Garzik <jeff@garzik.org>

Reverting this patch on top of -rc1 helps as well.

	tglx


lspci output:

04:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
        Subsystem: Intel Corporation Unknown device 30a5
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 218
        Region 0: Memory at 90100000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 2000 [size=32]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
                Address: 00000000fee0100c  Data: 414a
        Capabilities: [e0] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s <512ns, L1 <64us
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
                Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
      ...
From: Kok, Auke
Date: Wednesday, February 21, 2007 - 8:58 am

> commit d2ed16356ff4fb9de23fbc5e5d582ce580390106

I think we need to drop this now. The report that says that this *fixes* 
something might have been on regular interrupts only. I currently suspect that 
it breaks all MSI interrupts, which would make sense if I look a the code. Very 
bad indeed.

I'll try to come up with something else or send a patch that reverts it.

Auke

-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 12:24 pm

I'm going to be off-line for a couple of days, so I just reverted it.

		Linus

---
commit b5bf28cde894b3bb3bd25c13a7647020562f9ea0
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date:   Wed Feb 21 11:21:44 2007 -0800

    Revert "e1000: fix shared interrupt warning message"
    
    This reverts commit d2ed16356ff4fb9de23fbc5e5d582ce580390106.
    
    As Thomas Gleixner reports:
      "e1000 is not working anymore. ifup fails permanentely.
        ADDRCONF(NETDEV_UP): eth0: link is not ready
       nothing else"
    
    The broken commit was identified with "git bisect".
    
    Auke Kok says:
      "I think we need to drop this now.  The report that says that this
       *fixes* something might have been on regular interrupts only.  I
       currently suspect that it breaks all MSI interrupts, which would make
       sense if I look a the code.  Very bad indeed."
    
    Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
    Acked-by: Auke Kok <auke-jan.h.kok@intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Jeff Garzik <jeff@garzik.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index a710237..98215fd 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1417,6 +1417,10 @@ e1000_open(struct net_device *netdev)
 	if ((err = e1000_setup_all_rx_resources(adapter)))
 		goto err_setup_rx;
 
+	err = e1000_request_irq(adapter);
+	if (err)
+		goto err_req_irq;
+
 	e1000_power_up_phy(adapter);
 
 	if ((err = e1000_up(adapter)))
@@ -1427,10 +1431,6 @@ e1000_open(struct net_device *netdev)
 		e1000_update_mng_vlan(adapter);
 	}
 
-	err = e1000_request_irq(adapter);
-	if (err)
-		goto err_req_irq;
-
 	/* If AMT is enabled, let the firmware know that the network
 	 * interface is now open */
 	if (adapter->hw.mac_type == e1000_82573 &&
@@ -1439,10 +1439,10 @@ e1000_open(struct net_device ...
From: Andrew Morton
Date: Wednesday, February 21, 2007 - 12:45 pm

On Wed, 21 Feb 2007 11:24:33 -0800 (PST)


OK, but this change was needed because the new IRQ-debugging code reliably
causes e1000 to explode.

So perhaps until e1000 gets sorted out we should disable the debug code:

--- a/lib/Kconfig.debug~a
+++ a/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
 
 config DEBUG_SHIRQ
 	bool "Debug shared IRQ handlers"
-	depends on DEBUG_KERNEL && GENERIC_HARDIRQS
+	depends on DEBUG_KERNEL && GENERIC_HARDIRQS && BROKEN
 	help
 	  Enable this to generate a spurious interrupt as soon as a shared
 	  interrupt handler is registered, and just before one is deregistered.
_

-

From: Daniel Walker
Date: Wednesday, February 21, 2007 - 9:24 am

On i386 I get the following,

TCP cubic registered                                                            
NET: Registered protocol family 1                                               
NET: Registered protocol family 17                                              
Testing NMI watchdog ... CPU#0: NMI appears to be stuck (24->24)!               
CPU#1: NMI appears to be stuck (0->0)!                                          
CPU#2: NMI appears to be stuck (0->0)!                                          
CPU#3: NMI appears to be stuck (0->0)!              

when I add nmi_watchdog=1 to my boot args which worked on prior kernels.
On closer inspection it looks like arch/i386/kernel/io_apic.c :
check_timer() --> timer_irq_works() depends on IRQ0 incrementing jiffies
which is no longer the case AFAIK.

I'm not sure exactly how that relates to the NMI, but the check_timer()
function disabled the NMI through the io-apic if it can't get the
"timer" working through the io-apic.

Daniel

-

From: Thomas Gleixner
Date: Wednesday, February 21, 2007 - 10:07 am

At this point the PIT / HPET _is_ active and incrementing jiffies. The

Boot log please.

	tglx


-

From: Daniel Walker
Date: Wednesday, February 21, 2007 - 10:19 am

From: Thomas Gleixner
Date: Wednesday, February 21, 2007 - 10:41 am

Why? The switch just stops the PIT/HPET. It does not fiddle with IO_APIC

Nothing obvious. Bisect time :(

	tglx


-

From: Daniel Walker
Date: Wednesday, February 21, 2007 - 10:38 am

I'm not an expert on the io-apic, but the check_timer() function seemed

Well, I'm pretty sure it's HRT, cause in prior versions this only
happened when HRT is enabled. Then you guys went to the lapic all the
time, and now this is happening all the time ..

You can't reproduce this?

Daniel

-

From: Thomas Gleixner
Date: Wednesday, February 21, 2007 - 11:18 am

Again: 

check_timer() is called _BEFORE_ we even touch the local APIC timers. At

The NMI is stuck:

                if (nmi_count(cpu) - prev_nmi_count[cpu] <= 5) {
                        printk("CPU#%d: NMI appears to be stuck (%d->%d)!\n",
                                cpu,
                                prev_nmi_count[cpu],
                                nmi_count(cpu));

This has nothing to do with jiffies.


Nope.

Also all my machines emit something like:
"ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])"

In your boot log nothing to see.

	tglx


-

From: Daniel Walker
Date: Wednesday, February 21, 2007 - 11:23 am

Right, but eventually there isn't a regular timer interrupt through the
io-apic. I don't think in the past IRQ0 stops without the system
crashing, so check_timer() could assume the timer (IRQ0) is _always_
regular.

do you know what the requirement are for routing the NMI through the

I think it has to do with IRQ0 . Did I mention this doesn't happen in

Do you use nmi_watchdog=1 ?

Daniel

-

From: Thomas Gleixner
Date: Wednesday, February 21, 2007 - 12:23 pm

Sorry. I checked. switching PIT off really breaks nmi_watchdog=1, as
this just mirrors IRQ#0 to the NMI. No IRQ#0 from PIT, no NMI

We could keep PIT running with an empty interrupt handler when
nmi_watchdog=1 is set, but this interferes nicely with broadcasting.

Does nmi_watchdog=2 work ? We might switch to that, when a local APIC is
available. 

	tglx



-

From: Daniel Walker
Date: Wednesday, February 21, 2007 - 12:24 pm

Oddly, nmi_watchdog=2 doesn't work in 2.6.21-rc1, but it works in
2.6.20-rt8 however I'm not sure of the config could have been PREEMPT_RT
was on.

Daniel

-

From: Daniel Walker
Date: Wednesday, February 21, 2007 - 1:00 pm

There's a compile failure during my bisect.

distcc[3863] ERROR: compile /tmp//hrtimer.tmp.dwalker1.3795.i on dwalker3/120 failed
kernel/hrtimer.c: In function 'hrtimer_cpu_notify':
kernel/hrtimer.c:884: warning: implicit declaration of function 'clockevents_notify'
kernel/hrtimer.c:884: error: 'CLOCK_EVT_NOTIFY_CPU_DEAD' undeclared (first use in this function)
kernel/hrtimer.c:884: error: (Each undeclared identifier is reported only once 
kernel/hrtimer.c:884: error: for each function it appears in.)
drivers/ide/setup-pci.c: In function 'ide_scan_pcibus':
drivers/ide/setup-pci.c:866: warning: ignoring return value of '__pci_register_driver', declared with attribute warn_unused_result
make[1]: *** [kernel/hrtimer.o] Error 1


from this commit,

commit f8381cba04ba8173fd5a2b8e5cd8b3290ee13a98
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri Feb 16 01:28:02 2007 -0800

    [PATCH] tick-management: broadcast functionality
    
    With Ingo Molnar <mingo@elte.hu>
    
    Add broadcast functionality, so per cpu clock event devices can be registere
    as dummy devices or switched from/to broadcast on demand.  The broadcast
    function distributes the events via the broadcast function of the clock even
    device.  This is primarily designed to replace the switch apic timer to / fr
    IPI in power states, where the apic stops.
    


-

From: Thomas Gleixner
Date: Wednesday, February 21, 2007 - 1:43 pm

hrmpf. we made it bisectable at some point.

	tglx


-

From: Daniel Walker
Date: Wednesday, February 21, 2007 - 1:49 pm

It was related some code under a hot plug ifdef ..

Here's the final commit from the bisect which caused it . It says "No
changes to existing functionality" ?

e9e2cdb412412326c4827fc78ba27f410d837e6e is first bad commit
commit e9e2cdb412412326c4827fc78ba27f410d837e6e
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri Feb 16 01:28:04 2007 -0800

    [PATCH] clockevents: i386 drivers

    Add clockevent drivers for i386: lapic (local) and PIT/HPET (global).  Update
    the timer IRQ to call into the PIT/HPET driver's event handler and the
    lapic-timer IRQ to call into the lapic clockevent driver.  The assignement of
    timer functionality is delegated to the core framework code and replaces the
    compile and runtime evalution in do_timer_interrupt_hook()

    Use the clockevents broadcast support and implement the lapic_broadcast
    function for ACPI.

    No changes to existing functionality.



-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 2:06 pm

Ok, it wouldn't be the first time some change that is supposed to change 
nothing does actually change something.

That said, one thing to worry about when doing bisection: the kernel 
configuration.

If you always just do "make oldconfig" or something, the kernel config for 
the thing you test will depend on the _previous_ kernel you compiled, and 
that is not always what you want. I've once had a failing kernel, did 
bisection, and it turned out that since I had gone back in time to before 
the option that caused the failure even existed, I had (by mistake) then 
compiled some of the later kernels without that option enabled, and called 
them "good". 

The end result: "git bisect" didn't actually end up pointing to the right 
commit, just because I had effectively lied to it.

That said, considering that you did get a commit that doesn't look 
entirely unlikely (and that clearly changes things that are relevant), I 
suspect you did actually find the right one.

		Linus
-

From: Thomas Gleixner
Date: Wednesday, February 21, 2007 - 2:21 pm

Yup, thats the one which switches off PIT after we have the local APIC
timers up and running. Which turns out to cause the nmi_watchdog not
working anymore.

	tglx


-

From: Daniel Walker
Date: Wednesday, February 21, 2007 - 2:23 pm

In this case I don't think anything was specifically turned on, beyond
SMP. For instance HRT/dynamic tick was off.

I didn't run "make oldconfig", but just running "make" asked for options

I think if it's not that exact commit it's still one in that set. I
mainly wanted to confirm that it was an hrt/dynamic tick issue , and not
some left field patches..

Daniel

-

From: Pete Harlan
Date: Wednesday, February 21, 2007 - 3:05 pm

This bit me badly the one time I did a git bisect; it kept ping-
ponging around a big change (sata? xtables?) that required me to
answer the same two dozen questions about ten different times.

If git-bisect itself could manage .config as part of the process,
reverting to the one used with the most-recently-in-the-past try when
it backs up to a revision, that would be, you know, wonderful.

--Pete

----------------------------------
Pete Harlan
ArtSelect, Inc.
harlan@artselect.com
http://www.artselect.com
ArtSelect is a subsidiary of a21, Inc.
-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 1:18 pm

When that happens, you need to pick another commit to try than the one git 
selected for you automatically. You can do that by doing

	git bisect visualize

and select another commit somewhere fairly close to a mid-point, and try 
that with

	git reset --hard <commit-sha1>

instead.

		Linus
-

From: Andrew Morton
Date: Friday, February 23, 2007 - 3:08 am

I already bisected this on my old pIII, which has the same problem:
clockevents-i386-drivers.patch
-

From: Ingo Molnar
Date: Friday, February 23, 2007 - 4:35 am

yes - we know what the problem is (and will fix it): the stopping of the 
PIT - nmi_watchdog=1 is hack to use the IO-APIC's PIT pin to also signal 
NMIs.

Just to clarify, this problem does not occur if HIGH_RES_TIMERS is off, 
correct?

	Ingo
-

From: Thomas Gleixner
Date: Friday, February 23, 2007 - 4:47 am

It does, as we switch off PIT when lapic is available in any case.

	tglx


-

From: Ingo Molnar
Date: Friday, February 23, 2007 - 4:39 am

s/does not occur/does occur/

we switch off the PIT whenever we can.

	Ingo
-

From: Faik Uygur
Date: Wednesday, February 21, 2007 - 6:26 am

Hi,


  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  CHK     include/linux/compile.h
  CC [M]  drivers/char/ip2/ip2main.o
In file included from drivers/char/ip2/ip2main.c:285:
drivers/char/ip2/i2lib.c: In function `iiSendPendingMail_t':
drivers/char/ip2/i2lib.c:83: sorry, unimplemented: inlining failed in call 
to 'iiSendPendingMail': function body not available
drivers/char/ip2/i2lib.c:157: sorry, unimplemented: called from here
make[3]: *** [drivers/char/ip2/ip2main.o] Error 1
make[2]: *** [drivers/char/ip2] Error 2
make[1]: *** [drivers/char] Error 2
make: *** [drivers] Error 2

With cleanup changes in commit 40565f1962c5be9b9e285e05af01ab7771534868 
compilation fails.

Regards,
- Faik
-

From: Jiri Slaby
Date: Wednesday, February 21, 2007 - 11:41 am

What compiler?

thanks,
-- 
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E
-

From: Jiri Slaby
Date: Wednesday, February 21, 2007 - 11:51 am

Oh, I can reproduce with gcc 3.4. Going to fix it.

thanks,
-- 
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E
-

From: Linus Torvalds
Date: Wednesday, February 21, 2007 - 12:19 pm

Yeah, that thing was crud.

		Linus

---
commit 5fc7e655a50b0a19229a6b4a8a5e23bfedf700a4
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date:   Wed Feb 21 11:18:26 2007 -0800

    Fix bogus 'inline' in drivers/char/ip2/i2lib.c
    
    Not only was the function way too big to be inlined in the first place,
    it was used before it was even defined.
    
    Noted-by: Faik Uygur <faik@pardus.org.tr>
    Cc: Jiri Slaby <jirislaby@gmail.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/drivers/char/ip2/i2lib.c b/drivers/char/ip2/i2lib.c
index f86fa0c..e46120d 100644
--- a/drivers/char/ip2/i2lib.c
+++ b/drivers/char/ip2/i2lib.c
@@ -80,7 +80,7 @@ static int i2RetryFlushOutput(i2ChanStrPtr);
 // Not a documented part of the library routines (careful...) but the Diagnostic
 // i2diag.c finds them useful to help the throughput in certain limited
 // single-threaded operations.
-static inline void iiSendPendingMail(i2eBordStrPtr);
+static void iiSendPendingMail(i2eBordStrPtr);
 static void serviceOutgoingFifo(i2eBordStrPtr);
 
 // Functions defined in ip2.c as part of interrupt handling
@@ -166,7 +166,7 @@ static void iiSendPendingMail_t(unsigned long data)
 // If any outgoing mail bits are set and there is outgoing mailbox is empty,
 // send the mail and clear the bits.
 //******************************************************************************
-static inline void
+static void
 iiSendPendingMail(i2eBordStrPtr pB)
 {
 	if (pB->i2eOutMailWaiting && (!pB->i2eWaitingForEmptyFifo) )
-

From: Luca Tettamanti
Date: Wednesday, February 21, 2007 - 4:04 pm

Hi Thomas,
I'm testing NO_HZ on my machines. On the laptop I see that the timer
interrupt counter is incremented (though slower than HZ). This machine
is running UP kernel.

On my desktop I see this:

           CPU0       CPU1       
  0:        114          0   IO-APIC-edge      timer
  1:       1624      10771   IO-APIC-edge      i8042
  6:          3          0   IO-APIC-edge      floppy
  7:          0          0   IO-APIC-edge      parport0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:      40111     184047   IO-APIC-edge      i8042
 16:      75624     998858   IO-APIC-fasteoi   radeon@pci:0000:01:00.0, uhci_hcd:usb1
 17:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 18:        711       5487   IO-APIC-fasteoi   ide1, libata, ehci_hcd:usb7, uhci_hcd:usb3
 19:        617       2254   IO-APIC-fasteoi   libata, uhci_hcd:usb2
 20:          0          0   IO-APIC-fasteoi   ehci_hcd:usb6, uhci_hcd:usb5
 21:    2483869          0   IO-APIC-fasteoi   eth0
 22:          2          0   IO-APIC-fasteoi   ohci1394
218:      28872     360643   PCI-MSI-edge      HDA Intel
219:      32932     138196   PCI-MSI-edge      libata
NMI:          0          0 
LOC:    2761191    2827539 
ERR:          0
MIS:          0

Interrupt 0 is stuck at 114 (the number is consistent across reboots). I
don't experience any problem, time is running fine. Still it's strange
that the timer is doing nothing; maybe something other than the PIT is
used for time keeping?

This is the dmesg of the "abnormal" machine (dual core, SMP kernel):

Linux version 2.6.20-ge696268a-dirty (kronos@dreamland.darkstar.lan) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #33 SMP PREEMPT Tue Feb 20 23:24:24 CET 2007
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: 0000000000000000 size: 000000000009c800 end: 000000000009c800 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000000009c800 size: 0000000000003800 end: 00000000000a0000 ...
From: Thomas Gleixner
Date: Wednesday, February 21, 2007 - 4:17 pm

Yes, we switch away from PIT and use the local APIC timer. (LOC)

	tglx


-

From: Luca Tettamanti
Date: Wednesday, February 21, 2007 - 4:19 pm

Ok, thanks for the clarification.

Luca
-

From: Andi Kleen
Date: Friday, February 23, 2007 - 8:48 am

Before this becomes a FAQ. Would anybody mind if I just renamed "timer"
to "pit" to make this clear? 

-Andi
-

From: Jan Engelhardt
Date: Thursday, February 22, 2007 - 5:36 am

What's the benefit of doing so - and why has not it been done before?
I mean, I run a regular 2.6.18.6,
/sys/devices/system/clocksource/clocksource0/current_clocksource (is this
related?) shows "acpi_pm", but the IRQ0 counter increases at HZ. Maybe I
am confusing things, but why the need for PIT when clocksource is acpi_pm
anyway?


Jan
-- 
-

From: Arjan van de Ven
Date: Thursday, February 22, 2007 - 6:25 am

you're mixing up 2 concepts:
1) clocksource
2) eventsource

1) is for "what time is it now", and acpi_pm is useful for that, as are
several other things such as rdtsc
2) is for "I need THIS to happen X miliseconds from now". acpi_pm is not
useful for that, nor is rdtsc.

some can be used for both (PIT), but on a concept level the uses are
independent. The advantage of local apic over PIT is that local apic is
cheap to do "one shot" future events with, while the PIT will tick
periodic at a fixed frequency. With tickless idle.. that's not what you
want.


-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-

From: Pierre Ossman
Date: Thursday, February 22, 2007 - 7:10 am

So with a local apic, and acpi_pm as clocksource, I shouldn't be getting timer
interrupts? Yet I do. Which I assume means that the kernel will still get woken
up very often.

Rgds
-- 
     -- Pierre Ossman

  Linux kernel, MMC maintainer        http://www.kernel.org
  PulseAudio, core developer          http://pulseaudio.org
  rdesktop, core developer          http://www.rdesktop.org
-

From: Arjan van de Ven
Date: Thursday, February 22, 2007 - 7:20 am

timer interrupts as in "irq0"?

you shouldn't if you use the hrtimers/tickless stuff...


if irq0 keeps increasing at 100Hz or 1000Hz or so.. then yes
-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-

From: Pierre Ossman
Date: Thursday, February 22, 2007 - 7:51 am

Yes:


CONFIG_HIGH_RES_TIMERS=y

Sure. My dmesg is full of mmc debug crud right now, but I'll just reboot
and I'll have a clean one for you.

Rgds

-- 
     -- Pierre Ossman

  Linux kernel, MMC maintainer        http://www.kernel.org
  PulseAudio, core developer          http://pulseaudio.org
  rdesktop, core developer          http://www.rdesktop.org

-

From: Pierre Ossman
Date: Thursday, February 22, 2007 - 8:13 am

Here we go.

-- 
     -- Pierre Ossman

  Linux kernel, MMC maintainer        http://www.kernel.org
  PulseAudio, core developer          http://pulseaudio.org
  rdesktop, core developer          http://www.rdesktop.org
From: Thomas Gleixner
Date: Thursday, February 22, 2007 - 9:00 am

---------------------------------------------------------^^^^^

Here is the reason. The local APIC stops working in C3 state and we fall
back to the PIT in that case. Not really exciting for dynticks, but the
only way to keep the system alive. There is a patch coming up from
Intel, which finds out how to use HPET even if it is not enabled by the
BIOS. This will still end up on IRQ#0, but will give way longer idle
sleeps than the PIT.

	tglx





-

From: Pierre Ossman
Date: Thursday, February 22, 2007 - 9:27 am

So then the next two questions are; is it possible to disable C3 and is
it a net power gain to get rid of the wakeups in favor of having C3.

Rgds

-- 
     -- Pierre Ossman

  Linux kernel, MMC maintainer        http://www.kernel.org
  PulseAudio, core developer          http://pulseaudio.org
  rdesktop, core developer          http://www.rdesktop.org

-

From: Arjan van de Ven
Date: Thursday, February 22, 2007 - 9:42 am

no; c3 saves a TON more power. 

you can try enabling HPET in your BIOS...

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-

From: Pierre Ossman
Date: Thursday, February 22, 2007 - 2:07 pm

Hah, I wish! This is a laptop, so the BIOS is as brain dead and broken
as is humanly possible.

Can I determine if I have the required hardware? So I can tell if I'm
permanently screwed, or just temporarily.

Rgds

-- 
     -- Pierre Ossman

  Linux kernel, MMC maintainer        http://www.kernel.org
  PulseAudio, core developer          http://pulseaudio.org
  rdesktop, core developer          http://www.rdesktop.org

-

From: Arjan van de Ven
Date: Thursday, February 22, 2007 - 3:21 pm

if it's something built in the last year or two you have the hw.

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

-

From: Pierre Ossman
Date: Thursday, February 22, 2007 - 11:55 pm

I have an ICH4-M, and from Intel's datasheets it looks like I got the
short straw..

-- 
     -- Pierre Ossman

  Linux kernel, MMC maintainer        http://www.kernel.org
  PulseAudio, core developer          http://pulseaudio.org
  rdesktop, core developer          http://www.rdesktop.org

-

From: Andreas Mohr
Date: Thursday, February 22, 2007 - 2:25 pm

Hi,


http://lkml.org/lkml/2006/11/14/153

and related posts in this older thread
("CONFIG_NO_HZ: missed ticks, stall (keyb IRQ required) [2.6.18-rc4-mm1]")
should help.

Andreas Mohr
-

From: Pallipadi, Venkatesh
Date: Thursday, February 22, 2007 - 12:58 pm

Thomas,

I have the patchset for this HPET part ready to roll out. But, looks
like 
NO_HZ and lapic eventsource support is only in i386 and not in x86-64 in
2.6.21-rc1.
Do you know the state of NO_HZ x86-64 support?

Will it get into git soon? In which case I can rebase my patches against
git
and send it out for review/testing.

Otherwise, I will send out my patches, which are against rt (and my
initial
implementation of this HPET part is only for x86-64).

Thanks,
Venki
-

From: Thomas Gleixner
Date: Thursday, February 22, 2007 - 8:51 am

acpi_pm is only a readout device to keep track of current time.

PIT and local APIC timer are used to provide either periodic or one shot
programmable timer events.

Up to now the kernel started PIT and local APIC timer in parallel with
the same period where PIT incremented jiffies and local APIC timer
called update_process_times() and profile_tick. We changed this to let
the boot cpu increment jiffies inside the local apic timer interrupt
after PIT has been stopped. A whole interrupt for jiffies64++ is waste.

Also when we switch to nohz / high resolution mode, we want to use the
local APIC, as it is much faster to access. The maximum delta to program
is 27ms for the pit and ~1sec for the local APIC. This is important for
dynticks, as we can achieve longer idle sleeps w/o reprogramming the
timer.

	tglx



-

From: David Miller
Date: Thursday, February 22, 2007 - 10:26 am

From: Thomas Gleixner <tglx@linutronix.de>

BTW, I'm adding support for sparc64, and before I get much further
will the code handle a oneshot-only device?  That's basically what I
have (sparc64 basically has a TSC and a "comparison" register, you
write the "comparison" register with the TSC value at which you'd like
the timer interrupt to occur, so it's one-shot and you have to write
it again to get the next timer).

I see logic in the generic code to do periodic events when the
timer only provides one-shot ticks.  But as far as I can tell
both the HPET and the local APIC support periodic timers so
I can't tell how much testing that code has gotten :-)

I, in fact, don't see any clockevent device in the current tree that
does not set CLOCK_EVT_FEAT_PERIODIC.

I guess you could clear that bit just to test those generic code
paths. :-)

BTW, sparc64 always did the trick where the do_timer() work was done
by one of the per-cpu local timer interrupts, I'm glad the idea gained
traction generically. :-)))
-

From: Thomas Gleixner
Date: Thursday, February 22, 2007 - 10:39 am

Yes, all you need is to omit the CLOCK_EVT_FEAT_PERIODIC flag when you




	tglx


-

From: David Miller
Date: Friday, February 23, 2007 - 2:25 am

From: Thomas Gleixner <tglx@linutronix.de>

Thanks a lot Thomas.

I noticed while doing this work that the generic clock code is
incompatible with the time interpolator, since both provide a
do_{get,set}timeofday() implementation.  Just FYI...
-

From: Thomas Gleixner
Date: Friday, February 23, 2007 - 2:56 am

David,

<trimmed CC due to sparcness, added John instaed >


John, can you have a look at this ?

	tglx


-

From: David Miller
Date: Friday, February 23, 2007 - 2:55 am

From: Thomas Gleixner <tglx@linutronix.de>

Note we may not care :-)  All the interpolator does is provide
a generic tick based gettimeofday implementation, and the generic
clock code does essentially the same thing.

And the generic clock code must in fact provide the implementations
in order to deal with dyntick issues.

The only other platform using the time interpolator is IA64,
and after my dyntick implementation on sparc64 the one and
only user will be IA64 :-)
-

From: john stultz
Date: Friday, February 23, 2007 - 12:51 pm

Yea. I actually have some in-progress patches from Peter Keilty that
convert ia64 and sparc64 time_interpolators to clocksources, then
removes the time_interpolator code.

The ia64 conversion is more complicated due to the fsyscall asm, but I
think the sparc64 conversion (below) is pretty straight forward. I've
only built tested this, so I have no clue if it actually works. 

Any thoughts?

thanks
-john


From: Peter Keilty <peter.keilty@hp.com>

Initial sparc64 conversion to generic clocksource/timekeeping code.

Signed-off-by: Peter Keilty <peter.keilty@hp.com>
Signed-off-by: John Stultz <johnstul@us.ibm.com>


 Kconfig       |    2 +-
 defconfig     |    2 +-
 kernel/time.c |   34 +++++++++++++++++++++++-----------
 3 files changed, 25 insertions(+), 13 deletions(-)

linux-2.6.21-rc1_timeofday-arch-sparc64_C7.patch
============================================
diff --git a/arch/sparc64/Kconfig b/arch/sparc64/Kconfig
index f75a686..13a4547 100644
--- a/arch/sparc64/Kconfig
+++ b/arch/sparc64/Kconfig
@@ -34,7 +34,7 @@ config LOCKDEP_SUPPORT
 	bool
 	default y
 
-config TIME_INTERPOLATION
+config GENERIC_TIME
 	bool
 	default y
 
diff --git a/arch/sparc64/defconfig b/arch/sparc64/defconfig
index 0f44a6a..33e061a 100644
--- a/arch/sparc64/defconfig
+++ b/arch/sparc64/defconfig
@@ -9,7 +9,7 @@ CONFIG_64BIT=y
 CONFIG_MMU=y
 CONFIG_STACKTRACE_SUPPORT=y
 CONFIG_LOCKDEP_SUPPORT=y
-CONFIG_TIME_INTERPOLATION=y
+CONFIG_GENERIC_TIME=y
 CONFIG_ARCH_MAY_HAVE_PC_FDC=y
 # CONFIG_ARCH_HAS_ILOG2_U32 is not set
 # CONFIG_ARCH_HAS_ILOG2_U64 is not set
diff --git a/arch/sparc64/kernel/time.c b/arch/sparc64/kernel/time.c
index f84da4f..497b29b 100644
--- a/arch/sparc64/kernel/time.c
+++ b/arch/sparc64/kernel/time.c
@@ -31,6 +31,7 @@ #include <linux/percpu.h>
 #include <linux/profile.h>
 #include <linux/miscdevice.h>
 #include <linux/rtc.h>
+#include <linux/clocksource.h>
 
 #include <asm/oplib.h>
 #include <asm/mostek.h>
@@ -620,7 +621,7 @@ #endif
 	if (!mregs ...
From: David Miller
Date: Friday, February 23, 2007 - 5:34 pm

From: john stultz <johnstul@us.ibm.com>

Hey John, I had to do this already in order to do the dynticks
port to sparc64, but nice to see another attempt :-)

Two things I did on my side:

+	.mask           = 0xffffffffffffffffLL,

I used CLOCKSOURCE_MASK(64) here.

+static cycle_t read_sparc64_cpuclock(void)
+{
+        return (cycle_t)get_cycles();
+}
 ...
+	.read           = read_sparc64_cpuclock,

You can just directly assign tick_ops->get_tick to .read at run-time
to avoid a stack frame and function call/return.

+	.shift          = 16,

These shift selections all seem rather arbitrary.

If it's not an arbitrary selection, it would be nice to have some
comments about how to go about choosing an appropriate shift.
I imagine the selections has to do with the possible range of
the frequencies the clocksource supports, and how much
accuracy you get for certain shift selections given that range.
-

From: john stultz
Date: Friday, February 23, 2007 - 5:53 pm

Correct. The higher the shift value, the more precise NTP multiplier
adjustment we can make.  However, too large w/ a high frequency
clocksource and you'll risk overflowing 64bits on the mult.

Although that's not super critical anymore since Roman implemented the
high-res error accounting, the net effect should be the same over the
long term(we just have to work harder w/ courser adjustments - resulting
in very small clock frequency oscillations).

I do need to get some documentation going. I had some back when I first
started pushing the changes, but so much was revised and rewritten it
stopped being correct.

-john

-

From: David Miller
Date: Friday, February 23, 2007 - 10:52 pm

From: john stultz <johnstul@us.ibm.com>

While we were discussing this I was debugging my clocksource sparc64
implementation, a shift of 32 didn't work whereas one of 16 (as
in your version of sparc64 support) did. :-)

There is also a very unfortunate inconsistency between how the shift
and multiplier are used in clock sources vs. clock events.

You can initialize your clocksource multiplier using:

	mult = clocksource_hz2mult(ticks_per_hz, SHIFT);

but WOE BE TO HE who tries to use that for clockevents (like I
did initially) :-)  You have to instead use something like:

	mult = div_sc(ticks_per_hz, NSEC_PER_SEC, SHIFT);

I would have been done with the sparc64 dynticks support yesterday if
I didn't trip over all this stuff.

It also isn't clear from the comments that the first argument
to clocksource_hz2mult() and div_sc() is indeed "ticks_per_hz".
I had to gleen over the hpet and LAPIC drivers to kind of figure
this out.  Another potential improvement for the comments :-)
-

From: David Miller
Date: Saturday, February 24, 2007 - 10:13 pm

As I suspected, the one-shot code wasn't very well tested and I'd be
the one to debug this thing on sparc64 :-)

When a timer exceeds the timer period, the one-shot handling code does
the following loop:

	for (;;) {
		ktime_t next = ktime_add(dev->next_event, tick_period);

		if (!clockevents_program_event(dev, next, ktime_get()))
			return;
		tick_periodic(cpu);
	}

So it just keeps running tick_periodic() until we "catch up".

Problem is, if clockevents_program_event() gets a "next" time in the
past, the very case where we'll loop, it DOES NOT update
dev->next_event.  It returns the error before doing so.

As a result of this, we'll loop forever here, the softlockup watchdog
will trigger, and the system will wedge completely.

I was getting a softlockup and immediate system hang, so to debug this
I kept a history of the last 8 TSC values when tick_periodic() was
invoked.  At softlockup trigger, I'd dump the log.  And what I saw
were TSC deltas that we so tiny as to be just enough to indicate
tick_periodic() was running in a tight loop :-)

I propose the following fix, which I'm about to test.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 4500e34..0986a2b 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -77,6 +77,7 @@ static void tick_periodic(int cpu)
 void tick_handle_periodic(struct clock_event_device *dev)
 {
 	int cpu = smp_processor_id();
+	ktime_t next;
 
 	tick_periodic(cpu);
 
@@ -86,12 +87,12 @@ void tick_handle_periodic(struct clock_event_device *dev)
 	 * Setup the next period for devices, which do not have
 	 * periodic mode:
 	 */
+	next = ktime_add(dev->next_event, tick_period);
 	for (;;) {
-		ktime_t next = ktime_add(dev->next_event, tick_period);
-
 		if (!clockevents_program_event(dev, next, ktime_get()))
 			return;
 		tick_periodic(cpu);
+		next = ktime_add(next, tick_period);
 	}
 }
 
-

From: Thomas Gleixner
Date: Sunday, February 25, 2007 - 1:34 am

Yep, that's caused by a late change for the *!&#ed broadcast stuff for



-

From: Peter Keilty
Date: Friday, February 23, 2007 - 3:15 pm

From: Andi Kleen
Date: Friday, February 23, 2007 - 8:50 am

It was already implemented before for x86-64, but disabled 
by default because it ran into some platform issues. The new code
hopefully knows now how to work around them.

-Andi
-

From: YOSHIFUJI Hideaki /
Date: Wednesday, February 21, 2007 - 9:23 am

Hello.


I cannot boot 2.6.21-rc1; it falls into OOM-Killer.

Interesting error message I can see is:
   request_module: runaway loop modprobe net-pf-1

After bisecting, the commit
  Driver core: let request_module() send a /sys/modules/kmod/-uevent
(id c353c3fb0700a3c17ea2b0237710a184232ccd7f) is to blame.

Reverting it fixes the issue to me.

Regards,

--yoshfuji
-

From: Pavel Machek
Date: Thursday, February 22, 2007 - 10:25 pm

And lot of acpi/suspend changes, which seem to break my machine in
weird and not really reproducible way. I'm looking onto that.

(Yep, that should teach me to test -mm a bit more).
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Adrian Bunk
Date: Sunday, February 25, 2007 - 10:52 am

This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20
that are not yet fixed in Linus' tree.

If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.

Due to the huge amount of recipients, please trim the Cc when answering.


Subject    : resume: slab error in verify_redzone_free(): cache `size-512':
                     memory outside object was overwritten
References : http://lkml.org/lkml/2007/2/24/41
Submitter  : Pavel Machek <pavel@ucw.cz>
Handled-By : Marcel Holtmann <marcel@holtmann.org>
Status     : unknown


Subject    : ThinkPad T60: no screen after suspend to RAM
References : http://lkml.org/lkml/2007/2/22/391
Submitter  : Michael S. Tsirkin <mst@mellanox.co.il>
Handled-By : Ingo Molnar <mingo@elte.hu>
Status     : unknown


Subject    : HP nx6325 notebook: usb mouse stops working after suspend to ram
References : http://lkml.org/lkml/2007/2/21/413
Submitter  : Arkadiusz Miskiewicz <arekm@maven.pl>
Caused-By  : Konstantin Karasyov <konstantin.a.karasyov@intel.com>
             commit 0a6139027f3986162233adc17285151e78b39cac
Status     : unknown


Subject    : MacBook: AE_NOT_FOUND ACPI messages
References : http://bugzilla.kernel.org/show_bug.cgi?id=8066
Submitter  : Thomas Meyer <thomas.mey@web.de>
Status     : unknown


Subject    : Asus A8N-VM motherboard: boot failure (ACPI related)
References : http://lkml.org/lkml/2007/2/23/132
Submitter  : Andrew <andrew@nelless.net>
Status     : unknown


Subject    : Asus M6Ne notebook: ACPI: First battery is not detected
References : http://bugzilla.kernel.org/show_bug.cgi?id=8080
Submitter  : Janosch Machowinski <jmachowinski@gmx.de>
Status     : unknown


Subject    : Asus M6Ne/M6A notebooks: SATA_ACPI errors during kernel boot
References : ...
From: Ingo Molnar
Date: Tuesday, February 27, 2007 - 1:39 am

just found a new one: forcedeth.c broke since v2.6.20. It generates a 
few irqs after ifup and then stops dead (no in/out packets). Undoing all 
of these commits fixes the problem:

 commit 21828163b2b31e0675cb3e66a2a4a231dec06236
 commit 57fff6986b6daae13947c565786757c05303f0f6
 commit 4e16ed1b0e17a3832310031e2ddaeb0914eb837d
 commit f0734ab658390380079369f7090dcf7aa226f394
 commit b01867cbd1853995946c8c838eff93a0885d8bc6
 commit 445583b89d71b48cf8c64e26acc5a710248feed7
 commit aaa37d2d099f97ced415546e285ac901e47a2437
 commit 86b22b0dfbf462e6ed75e54fc83575dae01e3c69
 commit 0d63fb32b2b8c3464d9c1afc3ce3fd3ceec025b6
 commit 164a86e40e6c74ec5a91c364ccf7b1a2295b0a52
 commit 761fcd9e3e0aa2a02231d1631f31409be5e890d2
 commit d2f7841277d8613a780ab28d04d8f31a31816153
 commit 1d39ed565cfcc7c4fe586de621aef495c4f94ffb

that's 75K of delta - i'll bisect it some more. Hardware in question:

 00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)

config options:

 CONFIG_FORCEDETH=y
 CONFIG_FORCEDETH_NAPI=y

	Ingo
-

From: Ingo Molnar
Date: Tuesday, February 27, 2007 - 2:01 am

git bisection gives:

| 0d63fb32b2b8c3464d9c1afc3ce3fd3ceec025b6 is first bad commit
| commit 0d63fb32b2b8c3464d9c1afc3ce3fd3ceec025b6
| Author: Ayaz Abdulla <aabdulla@nvidia.com>
| Date:   Tue Jan 9 13:30:13 2007 -0500
|
|    forcedeth: rx skb recycle
|
|    This patch removes the code that recycled the skb on error. This will
|    help in reducing the branches in the main data paths.
|
|    Signed-Off-By: Ayaz Abdulla <aabdulla@nvidia.com>
|
|    Signed-off-by: Jeff Garzik <jeff@garzik.org>
|
| :040000 040000 0e4cb0aa12ba1df6a7ca84da697c75082b449ece b70edcf390d63ac9e0b35d2aea7b59efe2bfeaea M      drivers

but i think this is simply the first commit that makes the driver 
essentially non-bisectable - the failures in interim iterations changed 
from 'interface unusable' to boot hangs that were either forcedeth tx 
ring timeouts, or hangs after nv_sata initialization. I've attached the 
full .config below.

	Ingo

-------------------------->
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21-rc1-syslet-v5
# Tue Feb 27 09:46:29 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_ASYNC_SUPPORT=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# ...
From: Ingo Molnar
Date: Tuesday, February 27, 2007 - 2:38 am

here's the boot log of the tx_ring timeout that happens at the following 
commit in the first step of bisection:

 commit aaa37d2d099f97ced415546e285ac901e47a2437
 Author: Ayaz Abdulla <aabdulla@nvidia.com>
 Date:   Sun Jan 21 18:10:42 2007 -0500

     forcedeth: tx limiting

(eth0/eth1 gets flipped during network startup, so eth1 is the forcedeth 
driver)

	Ingo

---------------------->
Linux version 2.6.21-rc1-syslet-v5 (mingo@dione) (gcc version 4.0.2) #8 SMP PREEMPT Tue Feb 27 10:33:37 CET 2007
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: 0000000000000000 size: 000000000009f800 end: 000000000009f800 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000000009f800 size: 0000000000000800 end: 00000000000a0000 type: 2
copy_e820_map() start: 00000000000f0000 size: 0000000000010000 end: 0000000000100000 type: 2
copy_e820_map() start: 0000000000100000 size: 000000003fef0000 end: 000000003fff0000 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000003fff0000 size: 0000000000003000 end: 000000003fff3000 type: 4
copy_e820_map() start: 000000003fff3000 size: 000000000000d000 end: 0000000040000000 type: 3
copy_e820_map() start: 00000000e0000000 size: 0000000010000000 end: 00000000f0000000 type: 2
copy_e820_map() start: 00000000fec00000 size: 0000000001400000 end: 0000000100000000 type: 2
 BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
 BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
 BIOS-e820: 000000003fff0000 - 000000003fff3000 (ACPI NVS)
 BIOS-e820: 000000003fff3000 - 0000000040000000 (ACPI data)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
127MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000f5680
Entering add_active_range(0, 0, 262128) 0 entries of 256 used
Zone PFN ...
From: Ingo Molnar
Date: Tuesday, February 27, 2007 - 4:25 am

update: Jeff sent me 3 pending forcedeth.c fixes, and they indeed fix 
the regression. The one that fixed it is:

----------------->
The napi poll routine was missing the call to the optimized rx process 
routine. This patch adds the missing call for the optimized path.

Signed-Off-By: Ayaz Abdulla <aabdulla@nvidia.com>

--- orig/drivers/net/forcedeth.c	2007-02-20 03:17:21.000000000 -0500
+++ new/drivers/net/forcedeth.c	2007-02-20 03:28:31.000000000 -0500
@@ -3104,13 +3104,17 @@
 	struct fe_priv *np = netdev_priv(dev);
 	u8 __iomem *base = get_hwbase(dev);
 	unsigned long flags;
+	int retcode;
 
-	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2)
+	if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) {
 		pkts = nv_rx_process(dev, limit);
-	else
+		retcode = nv_alloc_rx(dev);
+	} else {
 		pkts = nv_rx_process_optimized(dev, limit);
+		retcode = nv_alloc_rx_optimized(dev);
+	}
 
-	if (nv_alloc_rx(dev)) {
+	if (retcode) {
 		spin_lock_irqsave(&np->lock, flags);
 		if (!np->in_shutdown)
 			mod_timer(&np->oom_kick, jiffies + OOM_REFILL);

-

From: Linus Torvalds
Date: Tuesday, February 27, 2007 - 8:42 am

Ok, that was included in the bunch I just pulled and pushed out, so 
hopefully this one is resolved. Pls verify.

		Linus
-

From: Ingo Molnar
Date: Wednesday, February 28, 2007 - 12:36 am

yep, can confirm that it's now fixed upstream too.

	Ingo
-

From: Karasyov, Konstantin A
Date: Wednesday, February 28, 2007 - 11:16 am

Arkadiusz,

Could try the attached patch to see if it solves the problem?
If not, please send the output of acpidump command and log.

Regards.

From: Adrian Bunk
Date: Sunday, February 25, 2007 - 10:55 am

This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20
that are not yet fixed in Linus' tree.

If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.

Due to the huge amount of recipients, please trim the Cc when answering.


Subject    : ThinkPad T60: system doesn't come out of suspend to RAM
             (CONFIG_NO_HZ)
References : http://lkml.org/lkml/2007/2/22/391
Submitter  : Michael S. Tsirkin <mst@mellanox.co.il>
             Thomas Gleixner <tglx@linutronix.de>
Handled-By : Ingo Molnar <mingo@elte.hu>
Status     : unknown


Subject    : kernel BUG at kernel/time/tick-sched.c:168  (CONFIG_NO_HZ)
References : http://lkml.org/lkml/2007/2/16/346
Submitter  : Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Status     : problem is being debugged


Subject    : BUG: soft lockup detected on CPU#0
             NOHZ: local_softirq_pending 20
References : http://lkml.org/lkml/2007/2/20/257
Submitter  : Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Handled-By : Thomas Gleixner <tglx@linutronix.de>
             Ingo Molnar <mingo@elte.hu>
Status     : problem is being debugged


Subject    : i386: no boot with nmi_watchdog=1  (clockevents)
References : http://lkml.org/lkml/2007/2/21/208
Submitter  : Daniel Walker <dwalker@mvista.com>
Caused-By  : Thomas Gleixner <tglx@linutronix.de>
             commit e9e2cdb412412326c4827fc78ba27f410d837e6e
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Status     : problem is being debugged


-

From: Jens Axboe
Date: Tuesday, February 27, 2007 - 3:02 am

x60 doesn't resume from S2R either, it doesn't matter if CONFIG_NO_HZ is
set or not though. 2.6.20 worked fine.

-- 
Jens Axboe

-

From: Pavel Machek
Date: Tuesday, February 27, 2007 - 3:21 am

It somehow works for me. As long as I do not play with bluetooth and
suspend to disk...
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Jens Axboe
Date: Tuesday, February 27, 2007 - 3:30 am

It locks solid here on resume, going back to 2.6.20 makes it work
perfectly again. In between 2.6.20 and 2.6.21-rc1 some ACPI change broke
resume, but that got fixed. Some other change later snuck in that broke
it AGAIN for me, sigh.

I don't use bluetooth nor suspend to disk.

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Tuesday, February 27, 2007 - 3:34 am

resume is stock on my T60 too. So you mean v2.6.21-rc1 vanilla works 
fine? Do you know a commit ID that works for sure? I'd like to bisect 
this, but this way i might just find that ACPI change that got already 
fixed later on (and then got re-broken).

	Ingo
-

From: Jens Axboe
Date: Tuesday, February 27, 2007 - 3:59 am

Nope, 2.6.21-rc1 vanilla does not work. 2.6.20 works. 2.6.20-gitX worked
until some acpi change broke it, the below patch fixed that for me. That
got merged in a later 2.6.20-gitY, but then some other patch broke it
again so that 2.6.21-rc1 is broken. Not much luck there :-)

So it looks like:

- c5a7156959e89b32260ad6072bbf5077bcdfbeee broke 2.6.20-git
- f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 should fix that.
- Something later than f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 broke it

Yeah, it gets trickier. I'll try
f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 now and see if that works, then
bisect to 2.6.21-rc1 to find the other offender. I hope the other
offender didn't get added before
f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38, we'll see :-)

-- 
Jens Axboe

-

From: Jens Axboe
Date: Tuesday, February 27, 2007 - 4:15 am

f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 works for me, starting bisect.

-- 
Jens Axboe

-

From: Jens Axboe
Date: Tuesday, February 27, 2007 - 6:09 am

Which got me nowhere, after bisecting down from 1213 revisions to
nothing. f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 certainly works, just
verified again. Trying only acpi related changes now...

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Thursday, March 1, 2007 - 2:34 am

update: f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 works for me too, and 
01363220f5d23ef68276db8974e46a502e43d01d is broken. I too will attempt 
to bisect this.

	Ingo
-

From: Ingo Molnar
Date: Thursday, March 1, 2007 - 3:41 am

hm. There's some weird bisection artifact here. Here are the commits i 
tested, in git-log order:

#1 commit 01363220f5d23ef68276db8974e46a502e43d01d bad
#2 commit ee404566f97f9254433399fbbcfa05390c7c55f7 bad
#3 commit f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 good
#4 commit c827ba4cb49a30ce581201fd0ba2be77cde412c7 bad

if i tell git-bisect that #1 is bad and #3 is good, then it offers me #2 
- that's OK. But when i tell it that #2 is bad, it offers #4 - which is 
out of order! The bisection goes off into la-la land after that and 
never gets back to a commit that is /after/ the good commit. How is this 
possible? (I upgraded from git-1.4.4 to 1.5.0 to make sure this isnt 
some git bug that's already fixed.)

i'll try to straighten this out manually, perhaps #3 is in some merge 
branch that confuses bisection. Or maybe i misunderstood how git-bisect 
works.

	Ingo
-

From: Ingo Molnar
Date: Thursday, March 1, 2007 - 7:52 am

git-bisect gets royally confused on those ACPI merge branches around 
commit c0cd79d11412969b6b8fa1624cdc1277db82e2fe. Here are my test 
results so far:

 commit 01363220f5d23ef68276db8974e46a502e43d01d: bad
 commit 255f0385c8e0d6b9005c0e09fffb5bd852f3b506: bad
 commit c0cd79d11412969b6b8fa1624cdc1277db82e2fe: bad
 commit c24e912b61b1ab2301c59777134194066b06465c: good
 commit e9e2cdb412412326c4827fc78ba27f410d837e6e: bad
 commit 79bf2bb335b85db25d27421c798595a2fa2a0e82: bad
 commit fc955f670c0a66aca965605dae797e747b2bef7d: good
 commit 70c0846e430881967776582e13aefb81407919f1: good
 commit 414f827c46973ba39320cfb43feb55a0eeb9b4e8: bad
 commit f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38: good
 commit 5f0b1437e0708772b6fecae5900c01c3b5f9b512: bad
 commit b878ca5d37953ad1c4578b225a13a3c3e7e743b7: bad
 commit c2902c8ae06762d941fab64198467f78cab6f8cd: bad
 commit 12e74f7d430655f541b85018ea62bcd669094bd7: bad
 commit 3388c37e04ec0e35ebc1b4c732fdefc9ea938f3b: bad
 commit 9f4bd5dde81b5cb94e4f52f2f05825aa0422f1ff: bad

the results are totally reproducible (i re-tried a few of both the good 
and the bad commits), i.e. it's not a sporadic condition. Also, a number 
of the 'bad' commits have no dynticks stuff in them at all, so i'd 
exclude dynticks.

could someone suggest a sane way to go with this? Perhaps suggest 
specific commit IDs to test?

	Ingo
-

From: Rafael J. Wysocki
Date: Thursday, March 1, 2007 - 9:12 am

Hm, does 2.6.20-mm2 work?  If not, you can bisect the broken-out sereis
with quilt.

Rafael
-

From: Linus Torvalds
Date: Thursday, March 1, 2007 - 5:26 pm

Looks like git bisect worked for you, and wasn't confused at all. You 
started out with 2931 commits between your first known-bad and known-good 
commits, which means that you usually end up having to check "log2(n)+1" 
kernels, ie I'd have expected you to have to do 12-13 bisection attempts 
to cut it down to one.

You seem to have done 14 (you list 16 commits, two of which are the 
starting points), which is right in that range. The reason you sometimes 
get more is:

 - you "help" git bisect by choosing other commits than the optimal ones. 

 - with bad luck, it can be hard to get really close to "half the commits" 
   in the reachability analysis, especially if you have lots of merges 
   (and *especially* if you have octopus merges that merge more than two 
   branches of development). For example, say that you have something like

	           a
	           |
	   +---+---+---+---+
	   |   |   |   |   |
	   b   c   d   e   f

   where you have six commits - you can't test any "combinations" at all, 
   since they are all independent, so "git bisect" cannot test them three 
   and three to cut down the time, so if you don't know which one is bad, 
   you'll basically end up testing them all.

The bad luck case never really happens to that extreme in practice, and 
even when it does you can sometimes be lucky and just hit on the bug early 
(so "bad luck" may end up being "good luck" after all), but it explains 
why you can get more - or less - than log2(n)+1 attempts. More commonly 
one more.

A much *bigger* problem is if you mark something good or bad that isn't 
really. Ie if the bug comes and goes (it might be timing-dependent, for 
example), the problem will be that you'll always narrow things down 
(that's what bisection does), but you may not narrow it down to the right 
thing!

We've had that happen several times. If the bug (for example) means that 
suspend *often* breaks, but sometimes works just by luck, you might mark a 
kernel "good" when it ...
From: Ingo Molnar
Date: Friday, March 2, 2007 - 12:14 am

no - i simply picked them by hand, based on looking at gittk output, 
because bisection did not appear to find anything useful:

  9f4bd5dde81b5cb94e4f52f2f05825aa0422f1ff is first bad commit

And via that method i found a couple of more 'good' points - which 
git-bisect never picked up by itself. (and i did 3-4 separate git-bisect 
sessions, one of them was a "git-bisect start drivers/acpi/" - which is 
the main area of suspicion). I looked at git-bisect visualize more than 
once, and i've attached one of the bisection logs below.

i also think i know what happens. Firstly, my testing is reliable, as i 
mentioned it in the other mail i frequently re-visited commits to make 
sure that none of my bad/good decisions is spurios - but no, the test 
results are extremely reproducable: either the laptop resumes properly 
after flashing its disk light or it does not.

the problem i think is that i simply took git-bisect's behavior for 
granted (i used it many times already) but forgot about a very basic 
precondition: git-bisect will find only a /single/ good->bad transition.

If there is a bad->good transition combined with a good->bad transition 
then git-bisect will think it's the same 'badness', while it's a 
/former/ badness that it is honing in on - totally sending the bisection 
off into la-la-land.

so as i mentioned it in the first mail: i /know/ that this commit is a 
bad->good transition point:

  f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38

/and i only want to test commits that include this commit/ - because i 
know that without this commit git-bisect confuses the /other/ breakage 
with the new breakage. In the bisection log below, this choice of 
git-bisect:

  ee404566f97f9254433399fbbcfa05390c7c55f7

is 'bad' according to testing, but that's 'another' badness - and i 
missed it.

Now, having slept on it, the solution is very simple: whenever 
git-bisect picks a commit for which the following command comes up 
empty:
 
  git-log | grep ...
From: Ingo Molnar
Date: Friday, March 2, 2007 - 12:21 am

the situation is simpler than that: there is a /known/ bug, and i marked 
the bugfix commit as 'good'. I never met such a multiple-bugs scenario 
before and forgot that git-bisect could easily pick a tree without this 
essential bugfix and would not be able to make a distinction between the 
two types of badness.

I'll try what i've described in the previous mail: mark all bisection 
points that do not include f3ccb06f as 'good' - thus 'merging' the 
known-bad area with the first known-good commit, and thus eliminating it 
from the bisection space.

(but it might also be useful to have a "git-bisect must-include" kind of 
command that would allow this to be automated: mark a particular tree as 
an essential component of the search space.)

	Ingo
-

From: Ingo Molnar
Date: Friday, March 2, 2007 - 1:04 am

this got me quite a bit further:

 git-bisect start
 git-bisect bad       01363220f5d23ef68276db8974e46a502e43d01d
 git-bisect good      f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 
 git-bisect fake-good ee404566f97f9254433399fbbcfa05390c7c55f7
 git-bisect bad       d43a338e395371733a80ec473b40baac5f74d768
 git-bisect bad       255f0385c8e0d6b9005c0e09fffb5bd852f3b506
 git-bisect fake-good f99c6bb6e2e9c35bd3dc0b1d0faa28bd6970930d
 git-bisect fake-good 0187f221e96e3436d552c0c7143f183eb82fb658
 git-bisect bad       81450b73dde07f473a4a7208b209b4c8b7251d90
 git-bisect fake-good ef29498655b18d2bfd69048e20835d19333981ab
 git-bisect fake-good 8a03d9a498eaf02c8a118752050a5154852c13bf
 git-bisect good      5c95d3f5783ab184f64b7848f0a871352c35c3cf
 git-bisect good      ecb5f7521a309cb9c5fc0832b9705cd2a03d7d45
 git-bisect good      0539771d7236b425f285652f6f297cc7939c8f9a

 81450b73dde07f473a4a7208b209b4c8b7251d90 is first bad commit

[ note: by having the "git-bisect must-have-bugfix f3ccb06f" 
  functionality i mentioned in the previous mail git-bisect could
  have eliminated the fake-good steps. ]

it's not a resolution of this regression yet, because this commit is a 
merge with upstream:

|  commit 81450b73dde07f473a4a7208b209b4c8b7251d90
|  Merge: 8a03d9a... 0539771...
|  Author: Len Brown <len.brown@intel.com>
|  Date:   Fri Feb 16 18:52:41 2007 -0500
|
|      Pull misc-for-upstream into release branch

which means that the fix in Len's tree got broken by merging with 
upstream. Note: this 'upstream' in isolation is broken too, due to not 
having that essential fix from Len's tree!

So we quite likely have /two/ bugs, any of which breaks resume (which 
breakage looks the same, so no way to isolate them via testing).

I'll now try the following: i'll try to manually apply Len's fix to 
every tree that git-bisect offers me, in the hope of being able to 
isolate the /other/ bug.

[ But really, i'm not expecting any miracles because this is way out of
  league for ...
From: Ingo Molnar
Date: Friday, March 2, 2007 - 3:20 am

this method pointed out the real bug that we are interested in:

| 774c47f1d78e373a6bd2964f4e278d1ce26c21cb is first bad commit
| commit 774c47f1d78e373a6bd2964f4e278d1ce26c21cb
| Author: Avi Kivity <avi@qumranet.com>
| Date:   Mon Feb 12 00:54:47 2007 -0800
|     [PATCH] KVM: cpu hotplug support

undoing the 774c47f1 patch from HEAD gave me a working resume. I'll send 
a fix for this KVM breakage in a separate mail.

here's how the bisection went:

 bad:   [01363220f5d23ef68276db8974e46a502e43d01d] [PARISC] clocksource:
 good:  [fa285a3d7924a0e3782926e51f16865c5129a2f7] Linux 2.6.20
 +good: [bcdb81ae29091f6a66369aabfd8324e4a53d05dc] knfsd: SUNRPC: add a
 bad:   [920841d8d1d61bc12b43f95a579a5374f6d98f81] Merge branch 'for-linus'
 +bad:  [86a71dbd3e81e8870d0f0e56b87875f57e58222b] sysctl: hide the sysctl
 +bad:  [719c91ccadd3ed26570dbb29d54166914832eee9] [POWERPC] Use udbg_early
 +bad:  [ebaf0c6032f525ddb0158fb59848d41899dce8cd] Merge branch 'for-linus'
 +good: [8cd133073f9b5cd335c0b2e4740aceb025d50ca9] kvm: Fix mismatch between
 +bad:  [5b8e8ee6c65a34d8aafaeb8e2eaa97e496c2567c] ps3: add shutdown to
 +bad:  [a524d946bdced73c5fbe60170fb33611491c4211] tgafb: sync-on-green
 +bad:  [a268422de8bf1b4c0cb97987b6c329c9f6a3da4b] fbdev driver for S3
 +good: [8d0be2b3bf4a55606967d7d84e56c52521e94333] KVM: VMX: add vcpu_clear()
 +bad:  [59ae6c6b87711ceb2d1ea5f9e08bb13aee947a29] KVM: Host suspend/resume
 +bad:  [774c47f1d78e373a6bd2964f4e278d1ce26c21cb] KVM: cpu hotplug support

the commits prefixed with '+' were the ones where i had to hand-merge 
the f3ccb06f commit to. Near the end of the bisection it nicely honed in 
on the group of KVM changes that included the bad commit.

but the conclusion is clear: if multiple bugs are present in the search 
area then it gets quite difficult to sort it out via git-bisect - but 
it's not impossible either. The following git-bisect enhancement could 
have made things easier for me:

   git-bisect mark-must-have <tree>

which would mark ...
From: Ingo Molnar
Date: Friday, March 2, 2007 - 3:22 am

Subject: [patch] KVM: T60 resume fix
From: Ingo Molnar <mingo@elte.hu>

my T60 laptop does not resume correctly due to KVM attempting to send an 
IPI to a CPU that might be down (or not up yet). (Doing so also triggers 
the send_IPI_mask_bitmask() warning in arch/i386/kernel/smp.c, line 
732.)

with this fix applied my laptop does not hang during resume.

[ KVM will have to disable/enable virtualization on the CPU itself that
  goes down / comes up, not via an IPI sent from the requesting CPU. ]

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 drivers/kvm/kvm_main.c |    6 ------
 1 file changed, 6 deletions(-)

Index: linux/drivers/kvm/kvm_main.c
===================================================================
--- linux.orig/drivers/kvm/kvm_main.c
+++ linux/drivers/kvm/kvm_main.c
@@ -2083,12 +2083,6 @@ static int kvm_cpu_hotplug(struct notifi
 	case CPU_DEAD:
 	case CPU_UP_CANCELED:
 		decache_vcpus_on_cpu(cpu);
-		smp_call_function_single(cpu, kvm_arch_ops->hardware_disable,
-					 NULL, 0, 1);
-		break;
-	case CPU_UP_PREPARE:
-		smp_call_function_single(cpu, kvm_arch_ops->hardware_enable,
-					 NULL, 0, 1);
 		break;
 	}
 	return NOTIFY_OK;
-

From: Michael S. Tsirkin
Date: Friday, March 2, 2007 - 4:39 am

Since I don't normally have kvm loaded, this patch is unlikely
to help me, is that right?

-- 
MST
-

From: Avi Kivity
Date: Saturday, March 3, 2007 - 1:22 am

That is correct.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

-

From: Michael S. Tsirkin
Date: Monday, March 5, 2007 - 3:23 am

> with this fix applied my laptop does not hang during resume.

On the off chance that this is relevant, could you post
your .config please?

-- 
MST
-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 3:29 am

sure - find it below.

	Ingo

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.21
# Mon Mar  5 11:15:42 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_ASYNC_SUPPORT=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
# CONFIG_UTS_NS is not set
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
# CONFIG_IKCONFIG is not set
CONFIG_CPUSETS=y
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block ...
From: Avi Kivity
Date: Saturday, March 3, 2007 - 1:21 am

That is already CPU_ONLINE in my tree (and in the pull request sent to 
Linus a couple of days ago).



-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

-

From: Andrew Morton
Date: Saturday, March 3, 2007 - 4:57 am

I noticed ;)


Resend, please.
-

From: Junio C Hamano
Date: Saturday, March 3, 2007 - 5:07 am

Could you un-CC: git@vger.kernel.org from this thread, please?

-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 1:22 am

that solves the resume problem - but doesnt solve the CPU_DEAD issue of 
sending an IPI to an already offline CPU. Might be a better idea to do 
it in CPU_DOWN_PREPARE? (and then to also add a CPU_DOWN_FAILED branch?)

	Ingo
-

From: Avi Kivity
Date: Monday, March 5, 2007 - 1:50 am

Mainline now has DOWN_PREPARE and UP_CANCELED calling 
->hardware_disable(), and ONLINE calling ->hardware_enabled().  What 
tree are you looking at?

[but I do see the need for DOWN_FAILED now. Off to find a resumable 
machine...]


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 1:44 am

yeah, both DOWN_FAILED and UP_FAILED might trigger when some other bug 
prevents a resume.

	Ingo
-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 1:57 am

suspend/resume works fine now and there are no warning messages 
whatsoever (with suspend simulation). Thanks Avi!

	Ingo
-

From: Michael S. Tsirkin
Date: Monday, March 5, 2007 - 5:54 am

I just tried Ingo's .config and it hangs on resume for me (with suspend to memory).

-- 
MST
-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 5:50 am

could you try 's2ram' from http://suspend.sf.net ? It's basically 
equivalent to 'echo mem > /sys/power/state', but you can usually also 
see the reason why it doesnt resume, on the vga console. (by default the 
kernel doesnt resume the t60's vga properly)

on my box resume works, but it's incredibly slow. Takes a few minutes to 
have total effect.

	Ingo
-

From: Michael S. Tsirkin
Date: Monday, March 5, 2007 - 6:26 am

OK, I'll give it a try. Do I have to shut down X or is it enough to

I guess it is possible that what I mistook for a hang was actually an incredibly
slow system, and that it would resume, eventually. I waited for 5 min and gave

I have same here with my old .config (2-3 min).

However, after resume was completed, system would hang on the *next* suspend to
ram.

Ingo, do you see this behaviour, or can you suspend/resume any number of times?

-- 
MST
-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 6:32 am

havent tried that yet - first trying to figure out why the first resume 
misbehaves ;)

	Ingo
-

From: Avi Kivity
Date: Monday, March 5, 2007 - 2:27 am

Where do I find this suspend simulation?  Sounds like a great suspend 
debugging tool.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 3:05 am

it's just the simple hack below.

	Ingo

------------------------>
Subject: [patch] suspend debugging: simulate suspend-to-RAM
From: Ingo Molnar <mingo@elte.hu>

most resume bugs are due to the kernel hanging or crashing while trying 
to resume a specific device. It is extremely hard to debug such hangs 
because often when the hang happens there's no console available yet.

Make debugging such bugs easier by offering a resume mode that does not 
actually call into the BIOS to turn off the machine. This, in 
combination with earlyprintk=serial,ttyS0,115200,keep , 
CONFIG_PM_DEBUG=y, CONFIG_DISABLE_CONSOLE_SUSPEND=y and 
/sys/power/filter enables me to have a fully functional serial console 
during the full suspend/resume cycle.

I debugged two suspend bugs in the -rt kernel this way already, so it's 
pretty useful.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 Documentation/kernel-parameters.txt |    4 ++++
 drivers/acpi/sleep/main.c           |   15 ++++++++++++++-
 include/linux/acpi.h                |    1 +
 kernel/sysctl.c                     |    8 ++++++++
 4 files changed, 27 insertions(+), 1 deletion(-)

Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -163,6 +163,10 @@ and is between 256 and 4096 characters. 
 
 	acpi_osi=	[HW,ACPI] empty param disables _OSI
 
+	acpi_simulate_suspend_to_ram
+			[KNL] Do not call into BIOS to do suspend-to-RAM
+			Format: <0/1>
+
 	acpi_serialize	[HW,ACPI] force serialization of AML methods
 
 	acpi_skip_timer_override [HW,ACPI]
Index: linux/drivers/acpi/sleep/main.c
===================================================================
--- linux.orig/drivers/acpi/sleep/main.c
+++ linux/drivers/acpi/sleep/main.c
@@ -35,6 +35,14 @@ static u32 acpi_suspend_states[] = {
 
 static int init_8259A_after_S1;
 
+/*
+ * simulate entry into the BIOS - this ...
From: Avi Kivity
Date: Monday, March 5, 2007 - 3:33 am

Thanks, I'll put it to good use.  Are you planning to upstream it?  Not 
for 2.6.21, presumably ;)



-- 
error compiling committee.c: too many arguments to function

-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 3:33 am

it needs an essential cleanup: instead of modifying 'mem' behavior, it 
should be a 'testmem' thing and no acpi_ flag.

	Ingo
-

From: Michael S. Tsirkin
Date: Monday, March 5, 2007 - 3:40 am

I think Pavel wants it triggerable through /sys/power/state:
http://lkml.org/lkml/2007/1/25/99

-- 
MST
-

From: Michael S. Tsirkin
Date: Monday, March 5, 2007 - 8:44 am

I just confirmed that 0539771d7236b425f285652f6f297cc7939c8f9a

Going to test that now.

-- 
MST
-

From: Michael S. Tsirkin
Date: Monday, March 5, 2007 - 9:14 am

I have confirmed these two on my system.

-- 
MST
-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 9:41 am

you could probably get quite a bit further in bisecting the other 
breakage, by using the following method:

manully apply the patch below to 81450b73dde and retest. It will most 
likely work. Then FIRST unapply the patch and mark the tree via 
'git-bisect good' and continue the bisection. Then try to apply the 
patch again. If it's already included - ignore the rejected patch. 
Whenever git-bisect offers you a new commit, just try to apply the 
patch. Ok? This way you'll be able to avoid the known ACPI breakage, and 
zoom in on the unknown breakage.

	Ingo

---------------->
commit f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38
Author: Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>
Date:   Tue Feb 13 02:35:50 2007 -0500

    ACPI: Disable wake GPEs only once.
    
    fixes Suspend/Resume regressions due to recent ACPICA update.
    
    Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>
    Signed-off-by: Len Brown <len.brown@intel.com>

diff --git a/drivers/acpi/events/evgpe.c b/drivers/acpi/events/evgpe.c
index dfac3ec..635ba44 100644
--- a/drivers/acpi/events/evgpe.c
+++ b/drivers/acpi/events/evgpe.c
@@ -636,17 +636,6 @@ acpi_ev_gpe_dispatch(struct acpi_gpe_event_info *gpe_event_info, u32 gpe_number)
 		}
 	}
 
-	if (!acpi_gbl_system_awake_and_running) {
-		/*
-		 * We just woke up because of a wake GPE. Disable any further GPEs
-		 * until we are fully up and running (Only wake GPEs should be enabled
-		 * at this time, but we just brute-force disable them all.)
-		 * 1) We must disable this particular wake GPE so it won't fire again
-		 * 2) We want to disable all wake GPEs, since we are now awake
-		 */
-		(void)acpi_hw_disable_all_gpes();
-	}
-
 	/*
 	 * Dispatch the GPE to either an installed handler, or the control method
 	 * associated with this GPE (_Lxx or _Exx). If a handler exists, we invoke
-

From: Jens Axboe
Date: Monday, March 5, 2007 - 11:16 am

BTW, the key here seems to be CONFIG_CC_OPTIMIZE_FOR_SIZE, at least that
is my current guess looking at the config options I still need to test
whether they make a difference. Either that, or the vmsplit option got
broken again. I'm attaching my x60 config that works for me.

-- 
Jens Axboe

From: Linus Torvalds
Date: Thursday, March 1, 2007 - 5:41 pm

Side note: it's still usually fairly easy. Especially if you have a known 
fix for the other bug, you can usually just do the equivalent of

	git cherry-pick <fixcommit>

at each point during this bisect (or just have a known patch that you keep 
applying and un-applying), and you're largely done.

Of course, if the area with the fix keeps changing, or if the fix is 
really intrusive and nasty, this gets hairy, but at least in this case the 
patch is fairly trivial and it shouldn't cause any trouble at all to do 
this.

The only real down-side is just the mindless extra work, and the possible 
added confusion you get from modifying the history at the points you're 
testing. "git bisect" is not necessarily happy about auto-picking a new 
bisection point with a dirty tree, for example, so before you mark 
something "good" or "bad", you should generally try to do so with a clean 
git tree (ie if you apply a patch at each stage, do "git reset --hard" to 
remove the patch before you do the "git bisect bad/good" stage).

Similarly, especially at the end of the bisection run, if you actually use 
"git cherry-pick" to *add* a commit, the bisection will now take that 
added commit into account when trying to pick the next commit to check, 
which is not what you really want. It probably doesn't matter that much 
during the early stages (when bisection is really jumping around wildly 
anyway, and one commit more or less doesn't really matter), but again, it 
might be a good idea to make a habit of undoing the cherry-pick, the same 
way you'd undo a patch (eg "git reset --hard HEAD^" would do it, if you 
have exactly one cherry-pick that you tested).

		Linus
-

From: Linus Torvalds
Date: Thursday, March 1, 2007 - 4:36 pm

No it's not. "git bisect" does exactly the right thing. There is no simple 
ordering in a complex branch-merge schenario, you can't just put the 
commits in some "ordering" and test things in time order. That would be 
totally broken, and idiotic. It doesn't give the right results.

What git bisect does is to find the commit that most closely *bisects* the 
history of commits, so that if it is marked good/bad, it will leave you 
with about 50% of the commits left. But if you are looking at date order, 
you're entirely confused.

For example, let's take a really simple case

	    a <- bad
	   / \
          b   c
	  |   |
	  d   e
	  |   |
	  f   g
	   \ /
	    h
            |
	    * <-good

and if you are looking to find something "in the middle", you might thing 
that "d" or "e" are the best choices, since time-wise, they are in the 
middle.

But that's not true AT ALL.

If you actually want to bisect that kind of history, you need to choose 
"b" or "c", even though they may both be *much* more "recent" than the 
others. Why? Because if you pick "d", you're really only testing three 
commits ('d' 'f' and 'h') out of the 8 commits you have to test.

In contrast, if you pick 'b', you are testing the effects of *four* 
commits ('b', 'd', 'f' and 'h') and you have thus neatly bisected the 
commits into two equal groups for testing (one group _with_ those four 
commits, and one group _without_) instead of having partitioned them as 3 
commits vs 5 commits.

So please realize that non-linear history very much means that you MUST 
NOT think that you just pick a commit "in the middle". No, git bisect is a 
LOT smarter than that - it picks a commit that *reaches* about half the 

It's possible because git knows what it is doing, and you didn't think 
things through.

The commits that "git bisect" picked out are the right ones. Quite often, 
there may be two or more "equally good" commits (in my example above, you 
can choose either "b" or "c", and it will ...
From: Pavel Machek
Date: Friday, March 2, 2007 - 3:07 am

Strange; on my x60, suspend to ram works okay.

(Resume is very slow, because disks are not spinned up properly; and
there's something wrong with timers; console beeps take way too long).

dmesg attached.

That's with

commit 7b965e0884cee430ffe5dc81cdb117b9316b0549
tree 754dce6432258e0a8c3a758e13a34eb3a1d22ee1
parent 5a39e8c6d655b4fe8305ef8cc2d9bbe782bfee5f
author Trond Myklebust <Trond.Myklebust@netapp.com> Wed, 28 Feb 2007
20:13:55 -0800
committer Linus Torvalds <torvalds@woody.linux-foundation.org> Thu, 01
Mar 2007 14:53:39 -0800

    [PATCH] VM: invalidate_inode_pages2_range() should not exit early

    Fix invalidate_inode_pages2_range() so that it does not
immediately exit
    just because a single page in the specified range could not be
removed.

    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Michael S. Tsirkin
Date: Monday, March 5, 2007 - 1:42 am

Pavel, I tried with your .config, and indeed the system came back to life after
2-3 minutes after I press Fn/F4, indeed the issue seems to be with the disk.
It could be that the same takes place with my original .config - maybe
I just wasn't patient enough. I'll need to re-test that.

However, I noticed that, after resume, when the system is presumably functional,
if I try to suspend to ram again, this second suspend hangs, displaying
the following on screen:

[   17.170000] ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 20
[   17.170000] PCI: Setting latency timer of device 0000:02:00.0 to 64
[   17.250000] e1000: 0000:02:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:16:41:5
4:6c:47
[   17.330000] e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection

the crescent LED starts blinking and does not seem to stop for at lest 10 min,
I've run out of patience after that. It could be that it's just very slow again.

Pavel, did you try suspend to RAM after a successfull resume from RAM?

Under 2.6.20, the system suspends/resumes to memory within about 20 sec
any number of times.

-- 
MST
-

From: Ingo Molnar
Date: Monday, March 5, 2007 - 3:11 am

the spin-up takes a few seconds here under suspend/resume simulation:

 | ata1: waiting for device to spin up (7 secs)
 | Restarting tasks ... done.

 [5-10 seconds pass]

 | ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
 | ata1.00: configured for UDMA/100
 | SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
 | sda: Write Protect is off
 | sda: Mode Sense: 00 3a 00 00
 | SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

with real resume it takes even longer time - but i dont see where the 
delays come from in that case - i suspect it's SATA.

i'm also getting this WARN_ON() from e1000:

BUG: at drivers/pci/msi.c:611 pci_enable_msi()
 [<c01061bd>] show_trace_log_lvl+0x19/0x2e
 [<c01062b6>] show_trace+0x12/0x14
 [<c01062cc>] dump_stack+0x14/0x16
 [<c024fcc4>] pci_enable_msi+0x6d/0x203
 [<c02b709e>] e1000_request_irq+0x2e/0xe2
 [<c02bb742>] e1000_resume+0x7f/0xef
 [<c0249a68>] pci_device_resume+0x1a/0x44
 [<c02b39ec>] resume_device+0xf7/0x16f
 [<c02b3adb>] dpm_resume+0x77/0xcb
 [<c02b3b69>] device_resume+0x3a/0x51
 [<c014e669>] enter_state+0x193/0x1bb
 [<c014e712>] state_store+0x81/0x97
 [<c01b68bc>] subsys_attr_store+0x20/0x25
 [<c01b6feb>] sysfs_write_file+0xce/0xf6
 [<c017e16b>] vfs_write+0xb1/0x13a
 [<c017e899>] sys_write+0x3d/0x61
 [<c0105220>] syscall_call+0x7/0xb

seems harmless because it seems to work fine.

	Ingo
-

From: Jeff Garzik
Date: Monday, March 5, 2007 - 10:30 pm

I would poke Eric Biederman(sp?) about this one.  Maybe its even solved 
by the MSI-enable-related patch he posted in the past 24-48 hours.

	Jeff


-

From: Kok, Auke
Date: Monday, March 5, 2007 - 11:35 pm

Eric, Linus,

I tried the 3-patch series "[PATCH 0/3] Basic msi bug fixes.." and they fix this 
problem for me. Were you expecting the OOPS in the first place? In any case, it 
survived several suspend/resume cycles on both enabled (irq alloc'd and enabled) 
and disabled devices (only initialized).

Jens Axboe was seeing the same problem, perhaps he can confirm the fix as well.

In any case, the patches have my blessing :)

Please add my:

   Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>


Cheers,

Auke
-

From: Ingo Molnar
Date: Tuesday, March 6, 2007 - 2:04 am

the bug was the warning message (a WARN_ON()) above - not an oops. So 
that warning message is gone in your testing?

	Ingo
-

From: Kok, Auke
Date: Tuesday, March 6, 2007 - 8:34 am

yes.

Auke
-

From: Eric W. Biederman
Date: Tuesday, March 6, 2007 - 9:15 pm

Sorry for the slow delay.  I was out of town for my brothers wedding the last few
days.

I wasn't exactly expecting the WARN_ON to trigger.  What I fixed was
an inconsistency in handling our state bits.  Fixing that
inconsistency appears to have fixed the e1000 usage scenario mostly by
accident.

The basic issue is that pci_save_state saves the current msi state
along with other registers, and then the e1000 driver goes and
disables the msi irq after we have saved the irq state as on.

My code notices that the msi irq was disabled before restore time, so
it skips the restore.  However we now have a leak of the msi saved cap
because we are not freeing it. 

This leaves with some basic questions.
- Does it make sense for suspend/resume methods to request/free irqs?
- Does it make sense for suspend/resume methods to allocate/free msi irqs?
- Do we want pci_save/restore_cap to save/restore msi state?

The path of least resistance is to just free the extra state and we
are good.  I'm just not quite certain that is sane and it has been a
long day.

Eric
-

From: Kok, Auke
Date: Wednesday, March 7, 2007 - 9:31 am

we used to have a lengthy e1000_pci_save|restore_state in our code, which is now 
gone, so I'm all for that. A separate pci_save_pxie|msi(x)_state for every 
driver seems completely unnecessary. I can't think of a use case where 
saving+restoring everything hurts. That's what you want I presume.

We currently free all irq's and msi before going into suspend in e1000, and I 
think that is probably a good thing, somehow I can think of bad things happening 
if we dont, but I admit that I haven't tried it without alloc/free. We do this 
in e100 as well and it works.

Another motivation would be to leave this up to the driver: if the driver 
chooses to free/alloc interrupts because it makes sense, you probably would want 
to keep that choice available. Devices that don't need this can skip the 
alloc/free, but leave the choice open for others.

hth

Auke
-

From: Kok, Auke
Date: Wednesday, March 7, 2007 - 9:45 am

ah, looking at the code in e1000 we do:

_suspend:
	pci_save_state();
	free_irq()

_resume:
	pci_restore_state();
	alloc_irq();

I suppose that's not good either, and the major cause of the warning in the 
first place.

Maybe I can rollback your latest patches and try to fix that mess by postponing 
the pci_save_state until after we free'd the irq's.

Auke
-

From: Eric W. Biederman
Date: Wednesday, March 7, 2007 - 12:28 pm

I just want to understand why we have issues and to see if how we have
organized the suspend/resume path for dealing with msi irqs makes sense.

That is I haven't looked much at the suspend/resume path so I don't know it
well and I am afraid that your problem might be a symptom of a deeper

Currently the irq code supports operation without the
free_irq/request_irq.  Since the numbers given are pure linux
abstractions things should it is really a matter of just


Below is an additional set of warnings that should help debug this.
The old code just got lucky that it triggered a warning when this happens.

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 01869b1..5113913 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -613,6 +613,7 @@ int pci_enable_msi(struct pci_dev* dev)
 		return -EINVAL;
 
 	WARN_ON(!!dev->msi_enabled);
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
 
 	/* Check whether driver already requested for MSI-X irqs */
 	if (dev->msix_enabled) {
@@ -638,6 +639,8 @@ void pci_disable_msi(struct pci_dev* dev)
 	if (!dev->msi_enabled)
 		return;
 
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
+
 	msi_set_enable(dev, 0);
 	pci_intx(dev, 1);		/* enable intx */
 	dev->msi_enabled = 0;
@@ -739,6 +742,7 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 		}
 	}
 	WARN_ON(!!dev->msix_enabled);
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
 
 	/* Check whether driver already requested for MSI irq */
    	if (dev->msi_enabled) {
@@ -763,6 +767,8 @@ void pci_disable_msix(struct pci_dev* dev)
 	if (!dev->msix_enabled)
 		return;
 
+	WARN_ON(!hlist_empty(&dev->saved_cap_space));
+
 	msix_set_enable(dev, 0);
 	pci_intx(dev, 1);		/* enable intx */
 	dev->msix_enabled = 0;
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bd44a48..4418839 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -677,6 +677,7 @@ pci_restore_state(struct pci_dev *dev)
 	}
 	pci_restore_pcix_state(dev);
 ...
From: Andrew Morton
Date: Wednesday, March 7, 2007 - 7:53 pm

Got a hit on a powerpc g5:

PM: Writing back config space on device 0001:05:04.0 at offset 1 (was 2b00000, writing 2b00006)
------------[ cut here ]------------
Badness at drivers/pci/pci.c:679
Call Trace:
[C0000000080F7410] [C000000000011EFC] .show_stack+0x50/0x1cc (unreliable)
[C0000000080F74C0] [C0000000001AD610] .report_bug+0xa0/0x110
[C0000000080F7550] [C0000000000256E4] .program_check_exception+0xb4/0x670
[C0000000080F7630] [C0000000000046F4] program_check_common+0xf4/0x100
--- Exception: 700 at .pci_restore_state+0x310/0x340
    LR = .pci_restore_state+0x2e0/0x340
[C0000000080F79D0] [C00000000026A174] .tg3_chip_reset+0x19c/0xa04
[C0000000080F7A90] [C00000000026D948] .tg3_reset_hw+0xa4/0x2718
[C0000000080F7BA0] [C000000000270030] .tg3_init_hw+0x74/0x94
[C0000000080F7C30] [C000000000270BE0] .tg3_open+0x4c8/0x854
[C0000000080F7CF0] [C0000000003A74A4] .dev_open+0x100/0x12c
[C0000000080F7D90] [C0000000003BAEA8] .netpoll_setup+0x2dc/0x3ec
[C0000000080F7E40] [C000000000283450] .init_netconsole+0x64/0x8c
[C0000000080F7EC0] [C0000000005C0BE4] .init+0x1d0/0x390
[C0000000080F7F90] [C0000000000271F8] .kernel_thread+0x4c/0x68
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is on for TX and on for RX.
Scheduler bitmap error - bitmap being reconstructed..
netconsole: network logging started
Calling initcall 0xc0000000006bd180: .macio_module_init+0x0/0x3c()

That's:

        pci_restore_pcix_state(dev);
        pci_restore_msi_state(dev);
        WARN_ON(!hlist_empty(&dev->saved_cap_space));

        return 0;

-

From: Eric W. Biederman
Date: Wednesday, March 7, 2007 - 11:58 pm

Hmm.  Either I am confused of I just found an unanticipated leak.

pci_restore_msi_state should be out of the picture as we don't yet
have ppc msi support and I don't think the g5 generation hardware
supported it either.

The only case I can see which might trigger this is if we saved
pci-X state and then didn't restore it because we could not find
the capability on restore.

Any chance you could walk that list and find the cap_nr of the remaining
element?  

Something like:
{
	struct pci_cap_saved_state *tmp;
	struct hlist_node *pos;

	hlist_for_each_entry(tmp, pos, &pci_dev->saved_cap_space, next)
        	printk(KERN_INFO "saved_cap: 0x%02x\n", tmp->cap_nr);
}

Until I get the best scenario I can come up with is a tg3 hardware bug
that doesn't renable the pci-X capability after a restore of power state.

Getting that cap_nr will at least allow me to be certain if I am dealing
with msi, pci-X or pci-e.

Unanticipated bugs aren't supposed to be this easy to find!

Eric

-

From: Jeff Garzik
Date: Thursday, March 8, 2007 - 2:55 am

Speaking of tg3, make sure you are aware that the number of calls to 
save-state functions may not match the number of calls to the 
restore-state functions.  ISTR seeing some memleak bugs in PCI related 
to this.

	Jeff


-

From: Eric W. Biederman
Date: Thursday, March 8, 2007 - 10:27 am

Thanks that looks like the problem, multiple calls to save before one
call to restore when you have a pci-x capability would easily trigger
this problem.  I just surveyed a bunch of the pci_save_state and
pci_restore_state users and this appears to be a common idiom not just
a tg3 thing....

It looks like when code was added to save/restore the msi capability
was added to pci_save/restore_state that an assumption was added that
pci_save_state and pci_restore state were only used for suspend and
only used in pairs.  There is even a partial bug fix that removed the
worst of the symptoms of that assumption from the msi code but failed
to recognize the core problem.

Now that we have code to work with pcie and pcix capabilities as well
as msi this problem is much easier to hit.

All of pci_save_state and pci_restore_state is going to have to be
restructured to fix this, and it is late in the game.  Ugh.
Oh well, better to fix it now 

At least I get my answer about if what pci_save_state is doing
is reasonable. It is not.  pci_save_state no longer supports being
used in conjunction with hardware reset and has become a
suspend/resume specific function.

Now I'm off to wite some patches to fix this.

Eric
-

From: Eric W. Biederman
Date: Thursday, March 8, 2007 - 12:58 pm

Well this is clearly a weird tangent from the topic of this thread but
it looks to have found some real bugs even if they aren't the ones we
are looking for.

In short pci_save_state and pci_restore_state are used to two primary
was:  As a pair called from the suspend and restore routines.  One save
to multiple restores usually present in hardware reset routines.

The additions to save/restore msi, pci-e and pci-x state failed to take
this second usage scenario into account.

The next two patches address this problem.

This should directly fix the issue observed with the tg3, with multiple
saves and a single restore.

It happens that to handle the reset case cleanly we also need to support
the odd usage model of the current e1000 driver.  

I have never have figured out how to get suspend/resume actually
working on any of my machines so the code path is untested.   But the
patches are trivial and pretty much obviously correct.

So if a couple people could do a quick suspend/resume test on these
patches to confirm I'm not blind that would be great.

Eric
-

From: Eric W. Biederman
Date: Thursday, March 8, 2007 - 1:04 pm

There are two ways pci_save_state and pci_restore_state are used.  As
helper functions during suspend/resume, and as helper functions around
a hardware reset event.  When used as helper functions around a hardware
reset event there is no reason to believe the calls will be paired, nor
is there a good reason to believe that if we restore the msi state from
before the reset that it will match the current msi state.  Since arch
code may change the msi message without going through the driver, drivers
currently do not have enough information to even know when to call
pci_save_state to ensure they will have msi state in sync with the other
kernel irq reception data structures.

It turns out the solution is straight forward, cache the state in the
existing msi data structures (not the magic pci saved things) and
have the msi code update the cached state each time we write to the hardware.
This means we never need to read the hardware to figure out what the hardware
state should be.

By modifying the caching in this manner we get to remove our save_state
routines and only need to provide restore_state routines.

The only fields that were at all tricky to regenerate were the msi and msi-x
control registers and the way we regenerate them currently is a bit dependent
upon assumptions on how we use the allow msi registers to be configured and used
making the code a little bit brittle.  If we ever change what cases we allow
or how we configure the msi bits we can address the fragility then.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c        |  150 ++++++++--------------------------------------
 drivers/pci/pci.c        |    2 -
 drivers/pci/pci.h        |    2 -
 include/linux/msi.h      |    8 +--
 include/linux/pci_regs.h |    1 +
 5 files changed, 29 insertions(+), 134 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 01869b1..ad33e01 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -100,6 +100,7 @@ static void ...
From: gregkh
Date: Saturday, March 10, 2007 - 12:49 am

This is a note to let you know that I've just added the patch titled

     Subject: msi: Safer state caching.

to my gregkh-2.6 tree.  Its filename is

     msi-safer-state-caching.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/
From: Ingo Molnar
Date: Thursday, March 8, 2007 - 1:08 pm

ouch! Are you interested in getting it work?

	Ingo
-

From: Eric W. Biederman
Date: Thursday, March 8, 2007 - 1:26 pm

I haven't even seriously tried. I don't yet have a laptop and I like
staying logged.  The first time I rolled the jiffie counter it was a
happened after I had realized that if I kept my machine up for another
couple of months my jiffie counter would roll and I said why not  So
basically suspend hasn't been a feature I'm been interested in using.
I get more annoyed at my UPS failing..

I will probably get around to figuring out the whole suspend/resume
thing one of these days and it will probably take me a day or two.
But that doesn't result in bug fixes being delivered in a timely manner.

Eric
-

From: Michael S. Tsirkin
Date: Thursday, March 8, 2007 - 3:23 am

Hmm. pci_save_pcix_state/pci_restore_pcix_state seem to only handle
regular devices and seem to ignore the fact that for bridge PCI-X
capability has a different structure.

Is this intentional? If not, here's a patch to fix this.
Warning: completely untested.


PCI: restore bridge PCI-X capability registers after PM event

Restore PCI-X bridge up/downstream capability registers
after PM event.  This includes maxumum split transaction
commitment limit which might be vital for PCI X.

Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index df49530..4b788ef 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -597,14 +597,19 @@ static int pci_save_pcix_state(struct pci_dev *dev)
 	if (pos <= 0)
 		return 0;
 
-	save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
+	save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 2, GFP_KERNEL);
 	if (!save_state) {
-		dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
+		dev_err(&dev->dev, "Out of memory in pci_save_pcix_state\n");
 		return -ENOMEM;
 	}
 	cap = (u16 *)&save_state->data[0];
 
-	pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		pci_read_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, &cap[i++]);
+		pci_read_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, &cap[i++]);
+	} else
+		pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
+
 	pci_add_saved_cap(dev, save_state);
 	return 0;
 }
@@ -621,7 +626,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
 		return;
 	cap = (u16 *)&save_state->data[0];
 
-	pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		pci_write_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
+		pci_write_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);
+	} else
+		pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
 ...
From: Eric W. Biederman
Date: Sunday, March 11, 2007 - 4:11 am

Probably not a such.  I don't think we have any drivers for bridge
devices so I don't think it matters.  It likely wouldn't hurt to fix
it just in case though.


If you fix the offsets and diff this against my last fix (to never



Unless I am completely misreading the spec. While you have picked the
right register to save the offsets should be 0x08 and 0x0c or 8 and 12....

Eric
-

From: Michael S. Tsirkin
Date: Sunday, March 11, 2007 - 4:24 am

Yes but they do their own thing wrt saving/restoring registers.


No, the spec is written in terms of dwords (32 bit), we are storing words (16 bits).
The data at offsets 8 and 12 is read-only split transaction capacity.
Split transaction limit starts at bit 16 so you need to add 2 to byte offset.

Right?


-- 
MST
-

From: Eric W. Biederman
Date: Sunday, March 11, 2007 - 10:37 am

thinking the code works.

The read-only and the read-write part are all defined as part of the
same register so I didn't expect them to be separate.  And I hadn't
paid attention enough to see that the code was only saving 16bit
values.

Rumor has it that some pci devices can't tolerate < 32bit accesses.
Although I have never met one.  The two factors together suggest that
for generic code it probably makes sense to operate on 32bit
quantities, and just to ignore the read-only portion.

Eric
-

From: Michael S. Tsirkin
Date: Sunday, March 11, 2007 - 11:03 am

The code for regular devices seems to use 16-bit accesses, so
I think it's best to stay consistent. Or do you want to change this too?

-- 
MST
-

From: Eric W. Biederman
Date: Sunday, March 11, 2007 - 11:27 am

If we are stomping rare probabilities we might as well change that too.
The code to save pci-x state is relatively recent.  So it probably just
hasn't met a problem device yet (assuming they exist).

Eric
-

From: Michael S. Tsirkin
Date: Sunday, March 11, 2007 - 11:37 am

OK I guess. I gather we assume writing read-only registers has no side effects?
Are there rumors circulating wrt to these?

-- 
MST
-

From: Eric W. Biederman
Date: Sunday, March 11, 2007 - 12:50 pm

I haven't heard anything about that, and if we are writing the same value back
it should be pretty safe.

I have heard it asserted that at least one version of the pci spec
only required 32bit accesses to be supported by the hardware.  One of
these days I will have to look that and see if it is true.  I do know
it can be weird for hardware developers to support multiple kinds of
decode.   As I recall for pci and pci-x at the hardware level the only
difference in between 32bit transactions and smaller ones is the state
of the byte-enable lines.

Eric
-

From: Michael S. Tsirkin
Date: Sunday, March 11, 2007 - 9:35 pm

Is this the only place where Linux uses pci_read_config_word/pci_read_config_dword?
I think such hardware will be pretty much DOA on all OS-es.  Why don't we wait

True, and same holds for PCI-Express.

So let's assume hardware implements RO correctly but ignores the BE bits -
nothing bad happens then, right?

-- 
MST
-

From: Michael S. Tsirkin
Date: Monday, April 16, 2007 - 12:56 pm

So, Eric, what are your thoughts on this patch (probably wrt 2.6.22)?
Since the rest of the save/restore code uses 16 bit transactions,
it should be safe here too?

-- 
MST
-

From: Kok, Auke
Date: Friday, March 9, 2007 - 4:06 pm

Eric W. Biederman wrote:



I'm trying this patch together with the other 2 that you sent out a few days 
ago. I'm seeing some minor issues with this and lots of bogus warnings as far as 
I can see.

If I suspend/resume and unload e1000, then reinsert e1000.ko, I immediately hit 
the WARN_ON at `msi.c:516: WARN_ON(!hlist_empty(&dev->saved_cap_space));`

I'm not sure that's useful debugging information. even though saved state exists 
the module has been removed, so you might want to purge the state table when the 
driver gets removed?


anyway, back to testing.

-

From: Eric W. Biederman
Date: Friday, March 9, 2007 - 8:41 pm

Sorry I should have been clear.

With the last two patches I sent out:
pci: Repair pci_save/restore_state so we can restore one save many times.
msi: Safer state caching.

Those WARN_ON's are now totally bogus.  

In essence the WARN_ON's were testing to ensure pci_save_state and
pci_restore_state were paired.

The assumptions was that pci_restore_state would remove everything
from the list that pci_save_state placed on it.

When I reworked the code and removed the bogus pairing requirement it meant
that if we ever save state and have a pci-e or pci-x capability we will have
a state structure on the list until the kernel reboots. 

In summary:
pci_save_state and pci_restore_state were making a wrong assumption
about the world.  The WARN_ON patch tested to ensure the world matched
those functions.  When I fixed those functions to match the world the
WARN_ON's became completely bogus.

Eric
-

From: Ingo Molnar
Date: Tuesday, March 6, 2007 - 2:06 am

update: Thomas' PIT/HPET resume-fix patch fixed the delay for me.

	Ingo
-

From: Thomas Gleixner
Date: Tuesday, March 6, 2007 - 9:26 am

SATA has another nice feature. Somehow there is an interrupt pending on
the SATA controller, which comes in somewhere in the middle of resume.
If it happens before the SATA code resumed, the SATA code ignores the
interrupt and the interrupt is disabled due to "nobody cared", which in
turn prevents SATA to ever become functional again.

Any idea on that one ?

	tglx


-

From: Linus Torvalds
Date: Tuesday, March 6, 2007 - 9:52 am

Jeff - that sounds like a SATA bug.

If you have an interrupt handler registered, you'd better handle the 
interrupt regardless of whether you think the hardware might be gone or 
not.

It's generally *not* ok to do

	if (device_offline())
		return IRQ_NONE;

at the top of an interrupt handler. 

Of course, if you think the hardware is supposed to be quiescent, then the 
only thing you should do is generally just do the "shut up" operation (ie 
read status, write it back or whatever). You must generally *not* try to 
pass any data upwards (ie if the higher layers have told you to shut up, 
you may need to handle the hardware, but you must not involve the higher 
layers themselves any more, because they expect you to be quiet).

And if you cannot do that because you need to resume in order to have the 
status register mapped, then you need to have an "early_resume()" function 
which gets called *before* interrupts are enabled. That's what 
early-resume (and late-suspend) are designed for: doing things that happen 
very early in the resume sequence before everything is up.

And if you don't want to do any of these things (or are unable to, because 
of some ordering constraint or bad design), then you simply need to 

Jeff, Auke, does this ring any bells?

		Linus
-

From: Kok, Auke
Date: Tuesday, March 6, 2007 - 10:09 am

For the e1000 issue, the problem is solved with Eric Biederman's 3-patch msi 
cleanups. You should have another message in your mailbox confirming that I 
tested his patches and the MSI warning for e1000 suspend-resume is gone with them.

Cheers,

Auke
-

From: Pavel Machek
Date: Thursday, March 8, 2007 - 11:44 pm

Seems to work ok in -rc3... as long as I do not mix s2ram with s2disk.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Michael S. Tsirkin
Date: Monday, March 5, 2007 - 8:34 am

f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 works for me, too.

-- 
MST
-

From: Adrian Bunk
Date: Tuesday, February 27, 2007 - 3:09 pm

Is this

Subject    : ThinkPad T60: no screen after suspend to RAM
References : http://lkml.org/lkml/2007/2/22/391
Submitter  : Michael S. Tsirkin <mst@mellanox.co.il>
             Ingo Molnar <mingo@elte.hu>
Handled-By : Ingo Molnar <mingo@elte.hu>
Status     : unknown


cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

-

From: Jens Axboe
Date: Wednesday, February 28, 2007 - 12:41 am

It doesn't resume at all.

-- 
Jens Axboe

-

From: Adrian Bunk
Date: Sunday, February 25, 2007 - 11:02 am

This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20
that are not yet fixed in Linus' tree.

If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.

Due to the huge amount of recipients, please trim the Cc when answering.


Subject    : forcedeth: skb_over_panic
References : http://bugzilla.kernel.org/show_bug.cgi?id=8058
Submitter  : Albert Hopkins <kernel@marduk.letterboxes.org>
Status     : unknown


Subject    : natsemi ethernet card not detected correctly
References : http://lkml.org/lkml/2007/2/23/4
             http://lkml.org/lkml/2007/2/23/7
Submitter  : Bob Tracy <rct@gherkin.frus.com>
Caused-By  : Mark Brown <broonie@sirena.org.uk>
Handled-By : Mark Brown <broonie@sirena.org.uk>
Patch      : http://lkml.org/lkml/2007/2/23/142
Status     : patch available


Subject    : request_module: runaway loop modprobe net-pf-1
References : http://lkml.org/lkml/2007/2/21/206
Submitter  : YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>
Caused-By  : Kay Sievers <kay.sievers@vrfy.org>
             commit c353c3fb0700a3c17ea2b0237710a184232ccd7f
Handled-By : Greg KH <greg@kroah.com>
Status     : problem is being discussed


Subject    : IPV6=m, SUNRPC=y compile error
References : http://bugzilla.kernel.org/show_bug.cgi?id=8050
             http://lkml.org/lkml/2007/2/12/442
             http://lkml.org/lkml/2007/2/20/384
Submitter  : Michael-Luke Jones <mlj28@cam.ac.uk>
             Pete Clements <clem@clem.clem-digital.net>
             Sid Boyce <g3vbv@blueyonder.co.uk>
Caused-By  : Chuck Lever <chuck.lever@oracle.com>
Handled-By : YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org>
Status     : patch available


Subject    : WARNING: "compat_agp_ioctl" undefined!
References : http://lkml.org/lkml/2007/2/21/272
Submitter  : Andreas Schwab ...
From: Greg KH
Date: Sunday, February 25, 2007 - 1:59 pm

Patch has been reverted and submitted to Linus to pull, but he's out of
town right now...

thanks,

greg k-h
-

From: Adrian Bunk
Date: Monday, February 26, 2007 - 3:01 pm

This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20
that are not yet fixed in Linus' tree.

If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.

Due to the huge amount of recipients, please trim the Cc when answering.


Subject    : resume: slab error in verify_redzone_free(): cache `size-512':
                     memory outside object was overwritten
References : http://lkml.org/lkml/2007/2/24/41
Submitter  : Pavel Machek <pavel@ucw.cz>
Handled-By : Marcel Holtmann <marcel@holtmann.org>
Status     : unknown


Subject    : ThinkPad T60: no screen after suspend to RAM
References : http://lkml.org/lkml/2007/2/22/391
Submitter  : Michael S. Tsirkin <mst@mellanox.co.il>
Handled-By : Ingo Molnar <mingo@elte.hu>
Status     : unknown


Subject    : HP nx6325 notebook: usb mouse stops working after suspend to ram
References : http://lkml.org/lkml/2007/2/21/413
Submitter  : Arkadiusz Miskiewicz <arekm@maven.pl>
Caused-By  : Konstantin Karasyov <konstantin.a.karasyov@intel.com>
             commit 0a6139027f3986162233adc17285151e78b39cac
Status     : unknown


Subject    : ACPI update breaks kpowersave
References : http://lkml.org/lkml/2007/2/10/7
Submitter  : Ismail Dönmez <ismail@pardus.org.tr>
             Fabio Comolli <fabio.comolli@gmail.com>
Status     : unknown


Subject    : MacBook: AE_NOT_FOUND ACPI messages
References : http://bugzilla.kernel.org/show_bug.cgi?id=8066
Submitter  : Thomas Meyer <thomas.mey@web.de>
Status     : unknown


Subject    : Asus A8N-VM motherboard:
             framebuffer/console boot failure boot failure (ACPI related)
References : http://lkml.org/lkml/2007/2/23/132
Submitter  : Andrew Nelless <andrew@nelless.net>
Handled-By : Antonino A. Daplas <adaplas@gmail.com>
Status     : problem is being ...
From: Michael S. Tsirkin
Date: Wednesday, February 28, 2007 - 2:13 pm

Just reproduced this in -rc2.
Another thing I noticed:
with 2.6.20, pressing Fn/F4 generates an ACPI event and triggers suspend to RAM.

On 2.6.21-rc2, after resume (when the box is accessible from network),
pressing Fn/F4 again does not seem to have any effect.


-- 
MST
-

From: Thomas Gleixner
Date: Wednesday, February 28, 2007 - 2:27 pm

Can you please get the dmesg output after resume via the network ?

	tglx


-

From: Michael S. Tsirkin
Date: Wednesday, February 28, 2007 - 2:40 pm

The link above has it.

-- 
MST
-

From: Jeff Chua
Date: Wednesday, February 28, 2007 - 8:45 pm

I have the same problem on my IBM X60s on rc1 and rc2. Can't resume
from RAM, can't suspend to disk. It is possible to revert all the
changes to ACPI and test it?

Jeff.
-

From: Pavel Machek
Date: Friday, March 2, 2007 - 5:26 am

As I said elsewhere in the thread, suspend/resume to RAM works ok on
my thinkpad x60. I posted my .config there, perhaps difference is in
it? Ingo identified KVM as possible culprit.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Jens Axboe
Date: Saturday, March 3, 2007 - 4:17 am

I'll try your .config for kicks, the problem that Ingo pin pointed is
not what is affecting me.

-- 
Jens Axboe

-

From: Adrian Bunk
Date: Sunday, March 4, 2007 - 5:04 pm

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

-

From: Jeff Chua
Date: Monday, March 5, 2007 - 6:32 pm

Yes.
-

From: Jeff Chua
Date: Tuesday, March 6, 2007 - 5:03 am

I've tried with CONFIG_KVM=n and CONFIG_KVM=y and both does not suspend.
-

From: Michael S. Tsirkin
Date: Tuesday, March 6, 2007 - 5:08 am

Do you mean that they "do not resume after suspend"?

-- 
MST
-

From: Jeff Chua
Date: Tuesday, March 6, 2007 - 5:12 am

I can't even suspend to disk/ram. It just hangs and the lights just
blink and everything else hangs. With 2.6.20, it works fine.

Jeff.
-

From: Pavel Machek
Date: Monday, March 19, 2007 - 8:32 am

Turn up console loglevel, and see where it hangs...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Rafael J. Wysocki
Date: Monday, March 19, 2007 - 2:23 pm

I think CONFIG_DISABLE_CONSOLE_SUSPEND would have to be set for this purpose
too.

Greetings,
Rafael
-

From: Meelis Roos
Date: Tuesday, February 27, 2007 - 6:00 am

Still appears, but this does not seem to be 40/80 pin cable problem to 
be but rather ata-piix calling some acpi methods and this rulsts in acpi 
errors.

-- 
Meelis Roos (mroos@linux.ee)
-

From: Alan
Date: Tuesday, February 27, 2007 - 7:16 am

On Tue, 27 Feb 2007 15:00:29 +0200 (EET)

There are two separate problems showing up in the one trace - broken ACPI
spew and wrong cable detect. I don't think they are related
-

From: Adrian Bunk
Date: Monday, February 26, 2007 - 3:05 pm

This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20
that are not yet fixed in Linus' tree.

If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.

Due to the huge amount of recipients, please trim the Cc when answering.


Subject    : forcedeth no longer works
References : http://bugzilla.kernel.org/show_bug.cgi?id=8090
Submitter  : David P. Reed <dpreed@reed.com>
Caused-By  : Ayaz Abdulla <aabdulla@nvidia.com>
Status     : unknown


Subject    : forcedeth: skb_over_panic
References : http://bugzilla.kernel.org/show_bug.cgi?id=8058
Submitter  : Albert Hopkins <kernel@marduk.letterboxes.org>
Status     : unknown


Subject    : natsemi ethernet card not detected correctly
References : http://lkml.org/lkml/2007/2/23/4
             http://lkml.org/lkml/2007/2/23/7
Submitter  : Bob Tracy <rct@gherkin.frus.com>
Caused-By  : Mark Brown <broonie@sirena.org.uk>
Handled-By : Mark Brown <broonie@sirena.org.uk>
Patch      : http://lkml.org/lkml/2007/2/23/142
Status     : patch available


Subject    : ThinkPad T60: system doesn't come out of suspend to RAM
             (CONFIG_NO_HZ)
References : http://lkml.org/lkml/2007/2/22/391
Submitter  : Michael S. Tsirkin <mst@mellanox.co.il>
Handled-By : Thomas Gleixner <tglx@linutronix.de>
             Ingo Molnar <mingo@elte.hu>
Status     : unknown


Subject    : kernel BUG at kernel/time/tick-sched.c:168  (CONFIG_NO_HZ)
References : http://lkml.org/lkml/2007/2/16/346
Submitter  : Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Status     : problem is being debugged


Subject    : BUG: soft lockup detected on CPU#0
             NOHZ: local_softirq_pending 20  (SMT scheduler)
References : http://lkml.org/lkml/2007/2/20/257
Submitter  : Michal ...
From: Thomas Gleixner
Date: Tuesday, February 27, 2007 - 1:21 am

Adrian,


The BUG_ON() was replaced by a warning printk(). The BUG_ON() exposed a

Patch available, not confirmed yet.

	tglx


-

From: Michal Piotrowski
Date: Tuesday, February 27, 2007 - 1:33 am

I can confirm that the bug is fixed (over 20 hours of testing should be enough).

Huge thanks!

Regards,
Michal

-- 
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)
-

From: Ingo Molnar
Date: Tuesday, February 27, 2007 - 1:33 am

thanks alot! I think this thing was a long-term performance/latency 
regression in HT scheduling as well.

	Ingo
-

From: Mike Galbraith
Date: Tuesday, February 27, 2007 - 1:54 am

Agreed.

I was recently looking at that spot because I found that niced tasks
were taking latency hits, and disabled it, which helped a bunch.  I also
can't understand why it would be OK to interleave a normal task with an
RT task sometimes, but not others.. that's meaningless to the RT task.

IMHO, SMT scheduling should be a buyer beware thing.  Maximizing your
core utilization comes at a price, but so does disabling it, so I think
letting the user decide what he wants is the right thing to do.

	-Mike

-

From: Con Kolivas
Date: Tuesday, February 27, 2007 - 4:07 pm

Apologies for the resend, lkml address got mangled...


Ingo I'm going to have to partially disagree with you on this. 

This has only become a problem because of what happens with dynticks now when 
rq->curr == rq->idle. Prior to this, that particular SMT code only leads to 
relative delays in scheduling for lower priority tasks. Whether or not that 
task is ksoftirqd should not matter because it is not like they are starved 
indefinitely, it is only that nice 19 tasks are relatively delayed, which by 
definition is implied with the usage of nice as a scheduler hint wouldn't you 
say? I know it has been discussed many times before as to whether 'nice' 
means less cpu and/or more latency, but in our current implementation, nice 
means both less cpu and more latency. So to me, the kernels without dynticks 
do not have a regression. This seems to only be a problem in the setting of 
the new dynticks code IMHO. That's not to say it isn't a bug! Nor am I saying 
that dynticks is a problem! Please don't misinterpret that.

The second issue is that this is a problem because of the fuzzy definition of 
what idle is for a runqueue in the setting of this SMT code. Normally, 
rq->curr==rq->idle means the runqueue is idle, but not with this code since 
there are still rq->nr_running on that runqueue. What dynticks in this 
implementation is doing is trying to idle a hyperthread sibling on a cpu 
whose logical partner is busy. I did not find that added any power saving on 
my earlier dynticks implementation, and found it easier to keep that sibling 
ticking at the same rate as its partner. Of course you may have found 
something different, and I definitely agree with what you are likely to say 
in response to this- we shouldn't have to special case logical siblings as 
having a different definition of idle than any other smp case. Ultimately, 
that leaves us with your simple patch as a reasonable solution for the 
dynticks case even though it does change the behaviour dramatically for a ...
From: Mike Galbraith
Date: Tuesday, February 27, 2007 - 9:21 pm

(hrmph.  having to copy/paste/try again.  evolution seems to be broken..
RCPT TO <linux-kernel@vg.kolivas.org> failed: Cannot resolve your domain {mp049}
..caused me to be unable to send despite receipts being disabled)


No I'm not, but let's go further in that direction just for the sake of
argument.  You're then saying that you prefer realtime priorities to not
work in the HT setting, given that realtime tasks don't participate in
the 'single stream me' program.
 
I'm saying only that we're defeating the purpose of HT, and overriding a


So?  User asked for HT.  That's hardware multiplexing. It ain't free.

I don't think it does actually. Let your RT task sleep regularly, and
ever so briefly.  We don't evict lower priority tasks from siblings upon

To me, the reason for interleaving is solely about keeping the core

Re-read this paragraph with realtime task priorities in mind, or for
that matter, dynamic priorities.  If you carry your priority/throughput
argument to it's logical conclusion, only instruction streams of
absolutely equal priority should be able to share the core at any given
time.  You may as well just disable HT and be done with it.

To me, siblings are logically separate units, and should be treated as
such (as they mostly are).  They share an important resource, but so do
physically discrete units.

	-Mike


-

From: Con Kolivas
Date: Wednesday, February 28, 2007 - 3:01 pm

Where do I say that? I do not presume to manage realtime priorities in any 
way. You're turning my argument about nice levels around and somehow saying 
that because hyperthreading breaks the single stream me semantics by 
parallelising them that I would want to stop that happening. Nowhere have I 
argued that realtime semantics should be changed to somehow work around 
hyperthreading. SMT nice is about managing nice only, and not realtime 

But the buyer is not aware. You are aware because you tinker, but the vast 
majority of users who enable hyperthreading in their shiny pcs are not aware. 
The only thing they know is that if they enable hyperthreading their programs 
run slower in multitasking environments no matter how much they nice the 
other processes. Buyers do not buy hardware knowing that the internal design 
breaks something as fundamental as 'nice'. You seem to presume that most 
people who get hyperthreading are happy to compromise 'nice' in order to get 
their second core working and I put it to you that they do not make that 

Well you know as well as I do that you're selecting out the exception rather 

Well that's certainly taking my logic for a ride. This is about managing 
_nice_ and _only_ nice. Nice specifies fixed interval static priorities where 
(in the risk of repeating myself) you are specifying that higher nice values 
tasks you wish to receive less cpu and more latency. Dynamic priorities have 
absolutely no effect on what the discrepancies are between the static 
priorities of differing nice values. As for realtime priorities, again, I do 
not presume to be managing them with SMT nice. They are unique entities 
unrelated to nice values. The only thing they have in common with nice levels 
is that if something is running without a realtime priority, it should be 
preempted by the realtime task as you have specified that the realtime task 
should receive all the cpu over the non-realtime task. I don't pretend that 
there is some cpu percentage ...
From: Mike Galbraith
Date: Wednesday, February 28, 2007 - 5:02 pm

I see no real difference between the two assertions.  Nice is just a
mechanism to set priority, so I applied your assertion to a different
range of priorities than nice covers, and returned it to show that the
code contradicts itself.  It can't be bad for a nice 1 task to run with
a nice 0 task, but OK for a minimum RT task to run with a maximum RT
task.  Iff HT without corrective measures breaks nice, then it breaks


To me it's pretty much black and white.  Either you want to split your
cpu into logical units, which means each has less to offer than the

I don't agree that it's the exception, and if you look at this HT thing
from the split cpu perspective, I'm not sure there's even a problem.

Scrolling down, I see that this is getting too long, and we aren't

The above will have to do.

	-Mike

-

From: Ingo Molnar
Date: Thursday, March 1, 2007 - 1:46 am

i'm starting to lean towards your view that we should not artificially 
keep tasks from running, when there's a free CPU available. We should 
still keep the 'other half' of SMT scheduling: the immediate pushing of 
tasks to a related core, but this bit of 'do not run tasks on this CPU' 
dependent-sleeper logic is i think a bit fragile. Plus these days SMT 
siblings do not tend to influence each other in such a negative way as 
older P4 ones where a HT sibling would slow down the other sibling 
significantly.

plus with an increasing number of siblings (which seems like an 
inevitable thing on the hardware side), the dependent-sleeper logic 
becomes less and less scalable. We'd have to cross-check every other 
'related' CPU's current priority to decide what to run.

if then there should be a mechanism /in the hardware/ to set the 
priority of a CPU - and then the hardware could decide how to prioritize 
between siblings. Doing this in software is really hard.

	Ingo
-

From: Con Kolivas
Date: Thursday, March 1, 2007 - 4:13 am

Well it is meant to be tuned to the cpu type in per_cpu_gain. So it should be 
easy to be set to the appropriate scaling. It was never meant to be a one 

Yes even I've commented before that this current system is unworkable come 
multiple shared power threads. This I do see as a real problem with it - in 

And that's the depressing part because of course I was interested in that as 
the original approach to the problem (and it was a big problem). When I spoke 
to Intel and AMD (of course to date no SMT AMD chip exists) at kernel summit 
they said it was too hard to implement hardware priorities well. Which is 
real odd since IBM have already done it with Power...

Still I think it has been working fine in software till now, but now it has to 
deal with the added confusion of dynticks, so I already know what will happen 
to it.

Hrm it's been a good time for my code all round... I think I'll just swap 
prefetch myself up the staircase to some pluggable scheduler that would 
hyperthread me to sleep as an idle priority task.

-- 
-ck
-

From: Thomas Gleixner
Date: Thursday, March 1, 2007 - 4:33 am

Well, it's not a dyntick problem in the first place. Even w/o dynticks
we go idle with local_softirq_pending(). Dynticks contains an explicit
check for that, which makes it visible.

	tglx


	

-

From: Con Kolivas
Date: Thursday, March 1, 2007 - 5:05 am

Oops I'm sorry if I made it sound like there's a dynticks problem. That was 
not my intent and I said as much in an earlier email. Even though I'm finding 
myself defending code that has already been softly tagged for redundancy, 
let's be clear here; we're talking about at most a further 70ms delay in 
scheduling a niced task in the presence of a nice 0 task, which is a 
reasonable delay for ksoftirqd which we nice the eyeballs out of in mainline. 
Considering under load our scheduler has been known to cause scheduling 
delays of 10 seconds I still don't see this as a bug. Dynticks just "points 
it out to us".

-- 
-ck
-

From: Thomas Gleixner
Date: Thursday, March 1, 2007 - 5:20 am

Well, dyntick might end up to delay it for X seconds as well, which _is_
observable and that's why the check was put there in the first place.

	tglx


-

From: Ingo Molnar
Date: Thursday, March 1, 2007 - 6:30 am

well, not running softirqs when we could is a bug. It's not a big bug, 
but it's a bug nevertheless. It doesnt matter that softirqs could be 
delayed even worse under high load - there was no 'high load' here.

	Ingo
-

From: Con Kolivas
Date: Thursday, March 1, 2007 - 2:51 pm

Gotcha. I'll prepare a smt-nice removal patch shortly. 

-- 
-ck
-

From: Con Kolivas
Date: Thursday, March 1, 2007 - 3:33 pm

Remove the SMT-nice feature which idles sibling cpus on SMT cpus to
facilitiate nice working properly where cpu power is shared. The idling
of cpus in the presence of runnable tasks is considered too fragile, easy
to break with outside code, and the complexity of managing this system if an
architecture comes along with many logical cores sharing cpu power will be
unworkable.

Remove the associated per_cpu_gain variable in sched_domains used only by
this code.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

---
 include/asm-i386/topology.h           |    1 
 include/asm-ia64/topology.h           |    2 
 include/asm-mips/mach-ip27/topology.h |    1 
 include/asm-powerpc/topology.h        |    1 
 include/asm-x86_64/topology.h         |    1 
 include/linux/sched.h                 |    1 
 include/linux/topology.h              |    4 
 kernel/sched.c                        |  155 ----------------------------------
 8 files changed, 1 insertion(+), 165 deletions(-)

Index: linux-2.6.21-rc2/kernel/sched.c
===================================================================
--- linux-2.6.21-rc2.orig/kernel/sched.c	2007-03-02 08:56:45.000000000 +1100
+++ linux-2.6.21-rc2/kernel/sched.c	2007-03-02 08:58:40.000000000 +1100
@@ -3006,23 +3006,6 @@ static inline void idle_balance(int cpu,
 }
 #endif
 
-static inline void wake_priority_sleeper(struct rq *rq)
-{
-#ifdef CONFIG_SCHED_SMT
-	if (!rq->nr_running)
-		return;
-
-	spin_lock(&rq->lock);
-	/*
-	 * If an SMT sibling task has been put to sleep for priority
-	 * reasons reschedule the idle task to see if it can now run.
-	 */
-	if (rq->nr_running)
-		resched_task(rq->idle);
-	spin_unlock(&rq->lock);
-#endif
-}
-
 DEFINE_PER_CPU(struct kernel_stat, kstat);
 
 EXPORT_PER_CPU_SYMBOL(kstat);
@@ -3239,10 +3222,7 @@ void scheduler_tick(void)
 
 	update_cpu_clock(p, rq, now);
 
-	if (p == rq->idle)
-		/* Task on the idle queue */
-		wake_priority_sleeper(rq);
-	else
+	if (p != rq->idle)
 		task_running_tick(rq, ...
From: Michal Piotrowski
Date: Tuesday, February 27, 2007 - 1:35 am

^^^^ almost ;)

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group (PL)
(http://www.stardust.webpages.pl/ltg/)
LTG - Linux Testers Group (EN)
(http://www.stardust.webpages.pl/linux_testers_group_en/)
-

From: Andreas Schwab
Date: Wednesday, February 21, 2007 - 11:34 am

I'm getting an undefined symbol with CONFIG_AGP=m:

WARNING: "compat_agp_ioctl" [drivers/char/agp/agpgart.ko] undefined!

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
-

From: Dave Jones
Date: Wednesday, February 21, 2007 - 11:40 am

On Wed, Feb 21, 2007 at 07:34:01PM +0100, Andreas Schwab wrote:
 > I'm getting an undefined symbol with CONFIG_AGP=m:
 > 
 > WARNING: "compat_agp_ioctl" [drivers/char/agp/agpgart.ko] undefined!

Fix went to Linus an hour ago.
It's been in -mm for a week, and agpgart.git for a day or so.

		Dave

-- 
http://www.codemonkey.org.uk
-

Previous thread: [patch 1/6] mm: debug check for the fault vs invalidate race by Nick Piggin on Tuesday, February 20, 2007 - 9:49 pm. (99 messages)

Next thread: [PATCH] devpts: add fsnotify create event by Florin Malita on Tuesday, February 20, 2007 - 10:15 pm. (2 messages)