Ok, the merge window for 2.6.21 has closed, and -rc1 is out there. There's a lot of changes, as is usual for an -rc1 thing, but at least so far it would seem that 2.6.20 has been a good base, and I don't think we have anything *really* scary here. The most interesting core change may be the dyntick/nohz one, where timer ticks will only happen when needed. It's been brewing for a _loong_ time, but it's in the standard kernel now as an option. But there's a ton of architecture updates (arm, mips, powerpc, x86, you name it), ACPI updates, and lots of driver work. And just a lot of cleanups. Have fun, Linus -
Yup. Fun starts in drivers/net/e1000
e1000 is not working anymore. ifup fails permanentely.
ADDRCONF(NETDEV_UP): eth0: link is not ready
nothing else
bisect identifies:
d2ed16356ff4fb9de23fbc5e5d582ce580390106 is first bad commit
commit d2ed16356ff4fb9de23fbc5e5d582ce580390106
Author: Kok, Auke <auke-jan.h.kok@intel.com>
Date: Fri Feb 16 14:39:26 2007 -0800
e1000: fix shared interrupt warning message
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Reverting this patch on top of -rc1 helps as well.
tglx
lspci output:
04:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Intel Corporation Unknown device 30a5
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 218
Region 0: Memory at 90100000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 2000 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
Address: 00000000fee0100c Data: 414a
Capabilities: [e0] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
Device: Latency L0s <512ns, L1 <64us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
...> commit d2ed16356ff4fb9de23fbc5e5d582ce580390106 I think we need to drop this now. The report that says that this *fixes* something might have been on regular interrupts only. I currently suspect that it breaks all MSI interrupts, which would make sense if I look a the code. Very bad indeed. I'll try to come up with something else or send a patch that reverts it. Auke -
I'm going to be off-line for a couple of days, so I just reverted it.
Linus
---
commit b5bf28cde894b3bb3bd25c13a7647020562f9ea0
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Wed Feb 21 11:21:44 2007 -0800
Revert "e1000: fix shared interrupt warning message"
This reverts commit d2ed16356ff4fb9de23fbc5e5d582ce580390106.
As Thomas Gleixner reports:
"e1000 is not working anymore. ifup fails permanentely.
ADDRCONF(NETDEV_UP): eth0: link is not ready
nothing else"
The broken commit was identified with "git bisect".
Auke Kok says:
"I think we need to drop this now. The report that says that this
*fixes* something might have been on regular interrupts only. I
currently suspect that it breaks all MSI interrupts, which would make
sense if I look a the code. Very bad indeed."
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: Auke Kok <auke-jan.h.kok@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index a710237..98215fd 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1417,6 +1417,10 @@ e1000_open(struct net_device *netdev)
if ((err = e1000_setup_all_rx_resources(adapter)))
goto err_setup_rx;
+ err = e1000_request_irq(adapter);
+ if (err)
+ goto err_req_irq;
+
e1000_power_up_phy(adapter);
if ((err = e1000_up(adapter)))
@@ -1427,10 +1431,6 @@ e1000_open(struct net_device *netdev)
e1000_update_mng_vlan(adapter);
}
- err = e1000_request_irq(adapter);
- if (err)
- goto err_req_irq;
-
/* If AMT is enabled, let the firmware know that the network
* interface is now open */
if (adapter->hw.mac_type == e1000_82573 &&
@@ -1439,10 +1439,10 @@ e1000_open(struct net_device ...On Wed, 21 Feb 2007 11:24:33 -0800 (PST) OK, but this change was needed because the new IRQ-debugging code reliably causes e1000 to explode. So perhaps until e1000 gets sorted out we should disable the debug code: --- a/lib/Kconfig.debug~a +++ a/lib/Kconfig.debug @@ -79,7 +79,7 @@ config DEBUG_KERNEL config DEBUG_SHIRQ bool "Debug shared IRQ handlers" - depends on DEBUG_KERNEL && GENERIC_HARDIRQS + depends on DEBUG_KERNEL && GENERIC_HARDIRQS && BROKEN help Enable this to generate a spurious interrupt as soon as a shared interrupt handler is registered, and just before one is deregistered. _ -
On i386 I get the following, TCP cubic registered NET: Registered protocol family 1 NET: Registered protocol family 17 Testing NMI watchdog ... CPU#0: NMI appears to be stuck (24->24)! CPU#1: NMI appears to be stuck (0->0)! CPU#2: NMI appears to be stuck (0->0)! CPU#3: NMI appears to be stuck (0->0)! when I add nmi_watchdog=1 to my boot args which worked on prior kernels. On closer inspection it looks like arch/i386/kernel/io_apic.c : check_timer() --> timer_irq_works() depends on IRQ0 incrementing jiffies which is no longer the case AFAIK. I'm not sure exactly how that relates to the NMI, but the check_timer() function disabled the NMI through the io-apic if it can't get the "timer" working through the io-apic. Daniel -
Why? The switch just stops the PIT/HPET. It does not fiddle with IO_APIC Nothing obvious. Bisect time :( tglx -
I'm not an expert on the io-apic, but the check_timer() function seemed Well, I'm pretty sure it's HRT, cause in prior versions this only happened when HRT is enabled. Then you guys went to the lapic all the time, and now this is happening all the time .. You can't reproduce this? Daniel -
Again:
check_timer() is called _BEFORE_ we even touch the local APIC timers. At
The NMI is stuck:
if (nmi_count(cpu) - prev_nmi_count[cpu] <= 5) {
printk("CPU#%d: NMI appears to be stuck (%d->%d)!\n",
cpu,
prev_nmi_count[cpu],
nmi_count(cpu));
This has nothing to do with jiffies.
Nope.
Also all my machines emit something like:
"ACPI: LAPIC_NMI (acpi_id[0x00] dfl dfl lint[0x1])"
In your boot log nothing to see.
tglx
-
Right, but eventually there isn't a regular timer interrupt through the io-apic. I don't think in the past IRQ0 stops without the system crashing, so check_timer() could assume the timer (IRQ0) is _always_ regular. do you know what the requirement are for routing the NMI through the I think it has to do with IRQ0 . Did I mention this doesn't happen in Do you use nmi_watchdog=1 ? Daniel -
Sorry. I checked. switching PIT off really breaks nmi_watchdog=1, as this just mirrors IRQ#0 to the NMI. No IRQ#0 from PIT, no NMI We could keep PIT running with an empty interrupt handler when nmi_watchdog=1 is set, but this interferes nicely with broadcasting. Does nmi_watchdog=2 work ? We might switch to that, when a local APIC is available. tglx -
Oddly, nmi_watchdog=2 doesn't work in 2.6.21-rc1, but it works in 2.6.20-rt8 however I'm not sure of the config could have been PREEMPT_RT was on. Daniel -
There's a compile failure during my bisect.
distcc[3863] ERROR: compile /tmp//hrtimer.tmp.dwalker1.3795.i on dwalker3/120 failed
kernel/hrtimer.c: In function 'hrtimer_cpu_notify':
kernel/hrtimer.c:884: warning: implicit declaration of function 'clockevents_notify'
kernel/hrtimer.c:884: error: 'CLOCK_EVT_NOTIFY_CPU_DEAD' undeclared (first use in this function)
kernel/hrtimer.c:884: error: (Each undeclared identifier is reported only once
kernel/hrtimer.c:884: error: for each function it appears in.)
drivers/ide/setup-pci.c: In function 'ide_scan_pcibus':
drivers/ide/setup-pci.c:866: warning: ignoring return value of '__pci_register_driver', declared with attribute warn_unused_result
make[1]: *** [kernel/hrtimer.o] Error 1
from this commit,
commit f8381cba04ba8173fd5a2b8e5cd8b3290ee13a98
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Fri Feb 16 01:28:02 2007 -0800
[PATCH] tick-management: broadcast functionality
With Ingo Molnar <mingo@elte.hu>
Add broadcast functionality, so per cpu clock event devices can be registere
as dummy devices or switched from/to broadcast on demand. The broadcast
function distributes the events via the broadcast function of the clock even
device. This is primarily designed to replace the switch apic timer to / fr
IPI in power states, where the apic stops.
-
It was related some code under a hot plug ifdef ..
Here's the final commit from the bisect which caused it . It says "No
changes to existing functionality" ?
e9e2cdb412412326c4827fc78ba27f410d837e6e is first bad commit
commit e9e2cdb412412326c4827fc78ba27f410d837e6e
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Fri Feb 16 01:28:04 2007 -0800
[PATCH] clockevents: i386 drivers
Add clockevent drivers for i386: lapic (local) and PIT/HPET (global). Update
the timer IRQ to call into the PIT/HPET driver's event handler and the
lapic-timer IRQ to call into the lapic clockevent driver. The assignement of
timer functionality is delegated to the core framework code and replaces the
compile and runtime evalution in do_timer_interrupt_hook()
Use the clockevents broadcast support and implement the lapic_broadcast
function for ACPI.
No changes to existing functionality.
-
Ok, it wouldn't be the first time some change that is supposed to change nothing does actually change something. That said, one thing to worry about when doing bisection: the kernel configuration. If you always just do "make oldconfig" or something, the kernel config for the thing you test will depend on the _previous_ kernel you compiled, and that is not always what you want. I've once had a failing kernel, did bisection, and it turned out that since I had gone back in time to before the option that caused the failure even existed, I had (by mistake) then compiled some of the later kernels without that option enabled, and called them "good". The end result: "git bisect" didn't actually end up pointing to the right commit, just because I had effectively lied to it. That said, considering that you did get a commit that doesn't look entirely unlikely (and that clearly changes things that are relevant), I suspect you did actually find the right one. Linus -
Yup, thats the one which switches off PIT after we have the local APIC timers up and running. Which turns out to cause the nmi_watchdog not working anymore. tglx -
In this case I don't think anything was specifically turned on, beyond SMP. For instance HRT/dynamic tick was off. I didn't run "make oldconfig", but just running "make" asked for options I think if it's not that exact commit it's still one in that set. I mainly wanted to confirm that it was an hrt/dynamic tick issue , and not some left field patches.. Daniel -
This bit me badly the one time I did a git bisect; it kept ping- ponging around a big change (sata? xtables?) that required me to answer the same two dozen questions about ten different times. If git-bisect itself could manage .config as part of the process, reverting to the one used with the most-recently-in-the-past try when it backs up to a revision, that would be, you know, wonderful. --Pete ---------------------------------- Pete Harlan ArtSelect, Inc. harlan@artselect.com http://www.artselect.com ArtSelect is a subsidiary of a21, Inc. -
When that happens, you need to pick another commit to try than the one git selected for you automatically. You can do that by doing git bisect visualize and select another commit somewhere fairly close to a mid-point, and try that with git reset --hard <commit-sha1> instead. Linus -
I already bisected this on my old pIII, which has the same problem: clockevents-i386-drivers.patch -
yes - we know what the problem is (and will fix it): the stopping of the PIT - nmi_watchdog=1 is hack to use the IO-APIC's PIT pin to also signal NMIs. Just to clarify, this problem does not occur if HIGH_RES_TIMERS is off, correct? Ingo -
Hi, CHK include/linux/version.h CHK include/linux/utsrelease.h CHK include/linux/compile.h CC [M] drivers/char/ip2/ip2main.o In file included from drivers/char/ip2/ip2main.c:285: drivers/char/ip2/i2lib.c: In function `iiSendPendingMail_t': drivers/char/ip2/i2lib.c:83: sorry, unimplemented: inlining failed in call to 'iiSendPendingMail': function body not available drivers/char/ip2/i2lib.c:157: sorry, unimplemented: called from here make[3]: *** [drivers/char/ip2/ip2main.o] Error 1 make[2]: *** [drivers/char/ip2] Error 2 make[1]: *** [drivers/char] Error 2 make: *** [drivers] Error 2 With cleanup changes in commit 40565f1962c5be9b9e285e05af01ab7771534868 compilation fails. Regards, - Faik -
What compiler? thanks, -- http://www.fi.muni.cz/~xslaby/ Jiri Slaby faculty of informatics, masaryk university, brno, cz e-mail: jirislaby gmail com, gpg pubkey fingerprint: B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E -
Oh, I can reproduce with gcc 3.4. Going to fix it. thanks, -- http://www.fi.muni.cz/~xslaby/ Jiri Slaby faculty of informatics, masaryk university, brno, cz e-mail: jirislaby gmail com, gpg pubkey fingerprint: B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E -
Yeah, that thing was crud.
Linus
---
commit 5fc7e655a50b0a19229a6b4a8a5e23bfedf700a4
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Wed Feb 21 11:18:26 2007 -0800
Fix bogus 'inline' in drivers/char/ip2/i2lib.c
Not only was the function way too big to be inlined in the first place,
it was used before it was even defined.
Noted-by: Faik Uygur <faik@pardus.org.tr>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/drivers/char/ip2/i2lib.c b/drivers/char/ip2/i2lib.c
index f86fa0c..e46120d 100644
--- a/drivers/char/ip2/i2lib.c
+++ b/drivers/char/ip2/i2lib.c
@@ -80,7 +80,7 @@ static int i2RetryFlushOutput(i2ChanStrPtr);
// Not a documented part of the library routines (careful...) but the Diagnostic
// i2diag.c finds them useful to help the throughput in certain limited
// single-threaded operations.
-static inline void iiSendPendingMail(i2eBordStrPtr);
+static void iiSendPendingMail(i2eBordStrPtr);
static void serviceOutgoingFifo(i2eBordStrPtr);
// Functions defined in ip2.c as part of interrupt handling
@@ -166,7 +166,7 @@ static void iiSendPendingMail_t(unsigned long data)
// If any outgoing mail bits are set and there is outgoing mailbox is empty,
// send the mail and clear the bits.
//******************************************************************************
-static inline void
+static void
iiSendPendingMail(i2eBordStrPtr pB)
{
if (pB->i2eOutMailWaiting && (!pB->i2eWaitingForEmptyFifo) )
-
Hi Thomas,
I'm testing NO_HZ on my machines. On the laptop I see that the timer
interrupt counter is incremented (though slower than HZ). This machine
is running UP kernel.
On my desktop I see this:
CPU0 CPU1
0: 114 0 IO-APIC-edge timer
1: 1624 10771 IO-APIC-edge i8042
6: 3 0 IO-APIC-edge floppy
7: 0 0 IO-APIC-edge parport0
9: 0 0 IO-APIC-fasteoi acpi
12: 40111 184047 IO-APIC-edge i8042
16: 75624 998858 IO-APIC-fasteoi radeon@pci:0000:01:00.0, uhci_hcd:usb1
17: 0 0 IO-APIC-fasteoi uhci_hcd:usb4
18: 711 5487 IO-APIC-fasteoi ide1, libata, ehci_hcd:usb7, uhci_hcd:usb3
19: 617 2254 IO-APIC-fasteoi libata, uhci_hcd:usb2
20: 0 0 IO-APIC-fasteoi ehci_hcd:usb6, uhci_hcd:usb5
21: 2483869 0 IO-APIC-fasteoi eth0
22: 2 0 IO-APIC-fasteoi ohci1394
218: 28872 360643 PCI-MSI-edge HDA Intel
219: 32932 138196 PCI-MSI-edge libata
NMI: 0 0
LOC: 2761191 2827539
ERR: 0
MIS: 0
Interrupt 0 is stuck at 114 (the number is consistent across reboots). I
don't experience any problem, time is running fine. Still it's strange
that the timer is doing nothing; maybe something other than the PIT is
used for time keeping?
This is the dmesg of the "abnormal" machine (dual core, SMP kernel):
Linux version 2.6.20-ge696268a-dirty (kronos@dreamland.darkstar.lan) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #33 SMP PREEMPT Tue Feb 20 23:24:24 CET 2007
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: 0000000000000000 size: 000000000009c800 end: 000000000009c800 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000000009c800 size: 0000000000003800 end: 00000000000a0000 ...Yes, we switch away from PIT and use the local APIC timer. (LOC) tglx -
Ok, thanks for the clarification. Luca -
Before this becomes a FAQ. Would anybody mind if I just renamed "timer" to "pit" to make this clear? -Andi -
What's the benefit of doing so - and why has not it been done before? I mean, I run a regular 2.6.18.6, /sys/devices/system/clocksource/clocksource0/current_clocksource (is this related?) shows "acpi_pm", but the IRQ0 counter increases at HZ. Maybe I am confusing things, but why the need for PIT when clocksource is acpi_pm anyway? Jan -- -
you're mixing up 2 concepts: 1) clocksource 2) eventsource 1) is for "what time is it now", and acpi_pm is useful for that, as are several other things such as rdtsc 2) is for "I need THIS to happen X miliseconds from now". acpi_pm is not useful for that, nor is rdtsc. some can be used for both (PIT), but on a concept level the uses are independent. The advantage of local apic over PIT is that local apic is cheap to do "one shot" future events with, while the PIT will tick periodic at a fixed frequency. With tickless idle.. that's not what you want. -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org -
So with a local apic, and acpi_pm as clocksource, I shouldn't be getting timer
interrupts? Yet I do. Which I assume means that the kernel will still get woken
up very often.
Rgds
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
PulseAudio, core developer http://pulseaudio.org
rdesktop, core developer http://www.rdesktop.org
-
timer interrupts as in "irq0"? you shouldn't if you use the hrtimers/tickless stuff... if irq0 keeps increasing at 100Hz or 1000Hz or so.. then yes -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org -
Yes:
CONFIG_HIGH_RES_TIMERS=y
Sure. My dmesg is full of mmc debug crud right now, but I'll just reboot
and I'll have a clean one for you.
Rgds
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
PulseAudio, core developer http://pulseaudio.org
rdesktop, core developer http://www.rdesktop.org
-
Here we go.
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
PulseAudio, core developer http://pulseaudio.org
rdesktop, core developer http://www.rdesktop.org
---------------------------------------------------------^^^^^ Here is the reason. The local APIC stops working in C3 state and we fall back to the PIT in that case. Not really exciting for dynticks, but the only way to keep the system alive. There is a patch coming up from Intel, which finds out how to use HPET even if it is not enabled by the BIOS. This will still end up on IRQ#0, but will give way longer idle sleeps than the PIT. tglx -
So then the next two questions are; is it possible to disable C3 and is
it a net power gain to get rid of the wakeups in favor of having C3.
Rgds
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
PulseAudio, core developer http://pulseaudio.org
rdesktop, core developer http://www.rdesktop.org
-
no; c3 saves a TON more power. you can try enabling HPET in your BIOS... -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org -
Hah, I wish! This is a laptop, so the BIOS is as brain dead and broken
as is humanly possible.
Can I determine if I have the required hardware? So I can tell if I'm
permanently screwed, or just temporarily.
Rgds
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
PulseAudio, core developer http://pulseaudio.org
rdesktop, core developer http://www.rdesktop.org
-
if it's something built in the last year or two you have the hw. -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org -
I have an ICH4-M, and from Intel's datasheets it looks like I got the
short straw..
--
-- Pierre Ossman
Linux kernel, MMC maintainer http://www.kernel.org
PulseAudio, core developer http://pulseaudio.org
rdesktop, core developer http://www.rdesktop.org
-
Hi, http://lkml.org/lkml/2006/11/14/153 and related posts in this older thread ("CONFIG_NO_HZ: missed ticks, stall (keyb IRQ required) [2.6.18-rc4-mm1]") should help. Andreas Mohr -
Thomas, I have the patchset for this HPET part ready to roll out. But, looks like NO_HZ and lapic eventsource support is only in i386 and not in x86-64 in 2.6.21-rc1. Do you know the state of NO_HZ x86-64 support? Will it get into git soon? In which case I can rebase my patches against git and send it out for review/testing. Otherwise, I will send out my patches, which are against rt (and my initial implementation of this HPET part is only for x86-64). Thanks, Venki -
acpi_pm is only a readout device to keep track of current time. PIT and local APIC timer are used to provide either periodic or one shot programmable timer events. Up to now the kernel started PIT and local APIC timer in parallel with the same period where PIT incremented jiffies and local APIC timer called update_process_times() and profile_tick. We changed this to let the boot cpu increment jiffies inside the local apic timer interrupt after PIT has been stopped. A whole interrupt for jiffies64++ is waste. Also when we switch to nohz / high resolution mode, we want to use the local APIC, as it is much faster to access. The maximum delta to program is 27ms for the pit and ~1sec for the local APIC. This is important for dynticks, as we can achieve longer idle sleeps w/o reprogramming the timer. tglx -
From: Thomas Gleixner <tglx@linutronix.de> BTW, I'm adding support for sparc64, and before I get much further will the code handle a oneshot-only device? That's basically what I have (sparc64 basically has a TSC and a "comparison" register, you write the "comparison" register with the TSC value at which you'd like the timer interrupt to occur, so it's one-shot and you have to write it again to get the next timer). I see logic in the generic code to do periodic events when the timer only provides one-shot ticks. But as far as I can tell both the HPET and the local APIC support periodic timers so I can't tell how much testing that code has gotten :-) I, in fact, don't see any clockevent device in the current tree that does not set CLOCK_EVT_FEAT_PERIODIC. I guess you could clear that bit just to test those generic code paths. :-) BTW, sparc64 always did the trick where the do_timer() work was done by one of the per-cpu local timer interrupts, I'm glad the idea gained traction generically. :-))) -
Yes, all you need is to omit the CLOCK_EVT_FEAT_PERIODIC flag when you tglx -
From: Thomas Gleixner <tglx@linutronix.de>
Thanks a lot Thomas.
I noticed while doing this work that the generic clock code is
incompatible with the time interpolator, since both provide a
do_{get,set}timeofday() implementation. Just FYI...
-
David, <trimmed CC due to sparcness, added John instaed > John, can you have a look at this ? tglx -
From: Thomas Gleixner <tglx@linutronix.de> Note we may not care :-) All the interpolator does is provide a generic tick based gettimeofday implementation, and the generic clock code does essentially the same thing. And the generic clock code must in fact provide the implementations in order to deal with dyntick issues. The only other platform using the time interpolator is IA64, and after my dyntick implementation on sparc64 the one and only user will be IA64 :-) -
Yea. I actually have some in-progress patches from Peter Keilty that convert ia64 and sparc64 time_interpolators to clocksources, then removes the time_interpolator code. The ia64 conversion is more complicated due to the fsyscall asm, but I think the sparc64 conversion (below) is pretty straight forward. I've only built tested this, so I have no clue if it actually works. Any thoughts? thanks -john From: Peter Keilty <peter.keilty@hp.com> Initial sparc64 conversion to generic clocksource/timekeeping code. Signed-off-by: Peter Keilty <peter.keilty@hp.com> Signed-off-by: John Stultz <johnstul@us.ibm.com> Kconfig | 2 +- defconfig | 2 +- kernel/time.c | 34 +++++++++++++++++++++++----------- 3 files changed, 25 insertions(+), 13 deletions(-) linux-2.6.21-rc1_timeofday-arch-sparc64_C7.patch ============================================ diff --git a/arch/sparc64/Kconfig b/arch/sparc64/Kconfig index f75a686..13a4547 100644 --- a/arch/sparc64/Kconfig +++ b/arch/sparc64/Kconfig @@ -34,7 +34,7 @@ config LOCKDEP_SUPPORT bool default y -config TIME_INTERPOLATION +config GENERIC_TIME bool default y diff --git a/arch/sparc64/defconfig b/arch/sparc64/defconfig index 0f44a6a..33e061a 100644 --- a/arch/sparc64/defconfig +++ b/arch/sparc64/defconfig @@ -9,7 +9,7 @@ CONFIG_64BIT=y CONFIG_MMU=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_LOCKDEP_SUPPORT=y -CONFIG_TIME_INTERPOLATION=y +CONFIG_GENERIC_TIME=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set diff --git a/arch/sparc64/kernel/time.c b/arch/sparc64/kernel/time.c index f84da4f..497b29b 100644 --- a/arch/sparc64/kernel/time.c +++ b/arch/sparc64/kernel/time.c @@ -31,6 +31,7 @@ #include <linux/percpu.h> #include <linux/profile.h> #include <linux/miscdevice.h> #include <linux/rtc.h> +#include <linux/clocksource.h> #include <asm/oplib.h> #include <asm/mostek.h> @@ -620,7 +621,7 @@ #endif if (!mregs ...
From: john stultz <johnstul@us.ibm.com>
Hey John, I had to do this already in order to do the dynticks
port to sparc64, but nice to see another attempt :-)
Two things I did on my side:
+ .mask = 0xffffffffffffffffLL,
I used CLOCKSOURCE_MASK(64) here.
+static cycle_t read_sparc64_cpuclock(void)
+{
+ return (cycle_t)get_cycles();
+}
...
+ .read = read_sparc64_cpuclock,
You can just directly assign tick_ops->get_tick to .read at run-time
to avoid a stack frame and function call/return.
+ .shift = 16,
These shift selections all seem rather arbitrary.
If it's not an arbitrary selection, it would be nice to have some
comments about how to go about choosing an appropriate shift.
I imagine the selections has to do with the possible range of
the frequencies the clocksource supports, and how much
accuracy you get for certain shift selections given that range.
-
Correct. The higher the shift value, the more precise NTP multiplier adjustment we can make. However, too large w/ a high frequency clocksource and you'll risk overflowing 64bits on the mult. Although that's not super critical anymore since Roman implemented the high-res error accounting, the net effect should be the same over the long term(we just have to work harder w/ courser adjustments - resulting in very small clock frequency oscillations). I do need to get some documentation going. I had some back when I first started pushing the changes, but so much was revised and rewritten it stopped being correct. -john -
From: john stultz <johnstul@us.ibm.com> While we were discussing this I was debugging my clocksource sparc64 implementation, a shift of 32 didn't work whereas one of 16 (as in your version of sparc64 support) did. :-) There is also a very unfortunate inconsistency between how the shift and multiplier are used in clock sources vs. clock events. You can initialize your clocksource multiplier using: mult = clocksource_hz2mult(ticks_per_hz, SHIFT); but WOE BE TO HE who tries to use that for clockevents (like I did initially) :-) You have to instead use something like: mult = div_sc(ticks_per_hz, NSEC_PER_SEC, SHIFT); I would have been done with the sparc64 dynticks support yesterday if I didn't trip over all this stuff. It also isn't clear from the comments that the first argument to clocksource_hz2mult() and div_sc() is indeed "ticks_per_hz". I had to gleen over the hpet and LAPIC drivers to kind of figure this out. Another potential improvement for the comments :-) -
As I suspected, the one-shot code wasn't very well tested and I'd be
the one to debug this thing on sparc64 :-)
When a timer exceeds the timer period, the one-shot handling code does
the following loop:
for (;;) {
ktime_t next = ktime_add(dev->next_event, tick_period);
if (!clockevents_program_event(dev, next, ktime_get()))
return;
tick_periodic(cpu);
}
So it just keeps running tick_periodic() until we "catch up".
Problem is, if clockevents_program_event() gets a "next" time in the
past, the very case where we'll loop, it DOES NOT update
dev->next_event. It returns the error before doing so.
As a result of this, we'll loop forever here, the softlockup watchdog
will trigger, and the system will wedge completely.
I was getting a softlockup and immediate system hang, so to debug this
I kept a history of the last 8 TSC values when tick_periodic() was
invoked. At softlockup trigger, I'd dump the log. And what I saw
were TSC deltas that we so tiny as to be just enough to indicate
tick_periodic() was running in a tight loop :-)
I propose the following fix, which I'm about to test.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 4500e34..0986a2b 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -77,6 +77,7 @@ static void tick_periodic(int cpu)
void tick_handle_periodic(struct clock_event_device *dev)
{
int cpu = smp_processor_id();
+ ktime_t next;
tick_periodic(cpu);
@@ -86,12 +87,12 @@ void tick_handle_periodic(struct clock_event_device *dev)
* Setup the next period for devices, which do not have
* periodic mode:
*/
+ next = ktime_add(dev->next_event, tick_period);
for (;;) {
- ktime_t next = ktime_add(dev->next_event, tick_period);
-
if (!clockevents_program_event(dev, next, ktime_get()))
return;
tick_periodic(cpu);
+ next = ktime_add(next, tick_period);
}
}
-
Yep, that's caused by a late change for the *!&#ed broadcast stuff for -
-
It was already implemented before for x86-64, but disabled by default because it ran into some platform issues. The new code hopefully knows now how to work around them. -Andi -
Hello. I cannot boot 2.6.21-rc1; it falls into OOM-Killer. Interesting error message I can see is: request_module: runaway loop modprobe net-pf-1 After bisecting, the commit Driver core: let request_module() send a /sys/modules/kmod/-uevent (id c353c3fb0700a3c17ea2b0237710a184232ccd7f) is to blame. Reverting it fixes the issue to me. Regards, --yoshfuji -
And lot of acpi/suspend changes, which seem to break my machine in weird and not really reproducible way. I'm looking onto that. (Yep, that should teach me to test -mm a bit more). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20
that are not yet fixed in Linus' tree.
If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.
Due to the huge amount of recipients, please trim the Cc when answering.
Subject : resume: slab error in verify_redzone_free(): cache `size-512':
memory outside object was overwritten
References : http://lkml.org/lkml/2007/2/24/41
Submitter : Pavel Machek <pavel@ucw.cz>
Handled-By : Marcel Holtmann <marcel@holtmann.org>
Status : unknown
Subject : ThinkPad T60: no screen after suspend to RAM
References : http://lkml.org/lkml/2007/2/22/391
Submitter : Michael S. Tsirkin <mst@mellanox.co.il>
Handled-By : Ingo Molnar <mingo@elte.hu>
Status : unknown
Subject : HP nx6325 notebook: usb mouse stops working after suspend to ram
References : http://lkml.org/lkml/2007/2/21/413
Submitter : Arkadiusz Miskiewicz <arekm@maven.pl>
Caused-By : Konstantin Karasyov <konstantin.a.karasyov@intel.com>
commit 0a6139027f3986162233adc17285151e78b39cac
Status : unknown
Subject : MacBook: AE_NOT_FOUND ACPI messages
References : http://bugzilla.kernel.org/show_bug.cgi?id=8066
Submitter : Thomas Meyer <thomas.mey@web.de>
Status : unknown
Subject : Asus A8N-VM motherboard: boot failure (ACPI related)
References : http://lkml.org/lkml/2007/2/23/132
Submitter : Andrew <andrew@nelless.net>
Status : unknown
Subject : Asus M6Ne notebook: ACPI: First battery is not detected
References : http://bugzilla.kernel.org/show_bug.cgi?id=8080
Submitter : Janosch Machowinski <jmachowinski@gmx.de>
Status : unknown
Subject : Asus M6Ne/M6A notebooks: SATA_ACPI errors during kernel boot
References : ...just found a new one: forcedeth.c broke since v2.6.20. It generates a few irqs after ifup and then stops dead (no in/out packets). Undoing all of these commits fixes the problem: commit 21828163b2b31e0675cb3e66a2a4a231dec06236 commit 57fff6986b6daae13947c565786757c05303f0f6 commit 4e16ed1b0e17a3832310031e2ddaeb0914eb837d commit f0734ab658390380079369f7090dcf7aa226f394 commit b01867cbd1853995946c8c838eff93a0885d8bc6 commit 445583b89d71b48cf8c64e26acc5a710248feed7 commit aaa37d2d099f97ced415546e285ac901e47a2437 commit 86b22b0dfbf462e6ed75e54fc83575dae01e3c69 commit 0d63fb32b2b8c3464d9c1afc3ce3fd3ceec025b6 commit 164a86e40e6c74ec5a91c364ccf7b1a2295b0a52 commit 761fcd9e3e0aa2a02231d1631f31409be5e890d2 commit d2f7841277d8613a780ab28d04d8f31a31816153 commit 1d39ed565cfcc7c4fe586de621aef495c4f94ffb that's 75K of delta - i'll bisect it some more. Hardware in question: 00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3) config options: CONFIG_FORCEDETH=y CONFIG_FORCEDETH_NAPI=y Ingo -
git bisection gives: | 0d63fb32b2b8c3464d9c1afc3ce3fd3ceec025b6 is first bad commit | commit 0d63fb32b2b8c3464d9c1afc3ce3fd3ceec025b6 | Author: Ayaz Abdulla <aabdulla@nvidia.com> | Date: Tue Jan 9 13:30:13 2007 -0500 | | forcedeth: rx skb recycle | | This patch removes the code that recycled the skb on error. This will | help in reducing the branches in the main data paths. | | Signed-Off-By: Ayaz Abdulla <aabdulla@nvidia.com> | | Signed-off-by: Jeff Garzik <jeff@garzik.org> | | :040000 040000 0e4cb0aa12ba1df6a7ca84da697c75082b449ece b70edcf390d63ac9e0b35d2aea7b59efe2bfeaea M drivers but i think this is simply the first commit that makes the driver essentially non-bisectable - the failures in interim iterations changed from 'interface unusable' to boot hangs that were either forcedeth tx ring timeouts, or hangs after nv_sata initialization. I've attached the full .config below. Ingo --------------------------> # # Automatically generated make config: don't edit # Linux kernel version: 2.6.21-rc1-syslet-v5 # Tue Feb 27 09:46:29 2007 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_ASYNC_SUPPORT=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y # ...
here's the boot log of the tx_ring timeout that happens at the following
commit in the first step of bisection:
commit aaa37d2d099f97ced415546e285ac901e47a2437
Author: Ayaz Abdulla <aabdulla@nvidia.com>
Date: Sun Jan 21 18:10:42 2007 -0500
forcedeth: tx limiting
(eth0/eth1 gets flipped during network startup, so eth1 is the forcedeth
driver)
Ingo
---------------------->
Linux version 2.6.21-rc1-syslet-v5 (mingo@dione) (gcc version 4.0.2) #8 SMP PREEMPT Tue Feb 27 10:33:37 CET 2007
BIOS-provided physical RAM map:
sanitize start
sanitize end
copy_e820_map() start: 0000000000000000 size: 000000000009f800 end: 000000000009f800 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000000009f800 size: 0000000000000800 end: 00000000000a0000 type: 2
copy_e820_map() start: 00000000000f0000 size: 0000000000010000 end: 0000000000100000 type: 2
copy_e820_map() start: 0000000000100000 size: 000000003fef0000 end: 000000003fff0000 type: 1
copy_e820_map() type is E820_RAM
copy_e820_map() start: 000000003fff0000 size: 0000000000003000 end: 000000003fff3000 type: 4
copy_e820_map() start: 000000003fff3000 size: 000000000000d000 end: 0000000040000000 type: 3
copy_e820_map() start: 00000000e0000000 size: 0000000010000000 end: 00000000f0000000 type: 2
copy_e820_map() start: 00000000fec00000 size: 0000000001400000 end: 0000000100000000 type: 2
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
BIOS-e820: 000000003fff0000 - 000000003fff3000 (ACPI NVS)
BIOS-e820: 000000003fff3000 - 0000000040000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
127MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000f5680
Entering add_active_range(0, 0, 262128) 0 entries of 256 used
Zone PFN ...update: Jeff sent me 3 pending forcedeth.c fixes, and they indeed fix
the regression. The one that fixed it is:
----------------->
The napi poll routine was missing the call to the optimized rx process
routine. This patch adds the missing call for the optimized path.
Signed-Off-By: Ayaz Abdulla <aabdulla@nvidia.com>
--- orig/drivers/net/forcedeth.c 2007-02-20 03:17:21.000000000 -0500
+++ new/drivers/net/forcedeth.c 2007-02-20 03:28:31.000000000 -0500
@@ -3104,13 +3104,17 @@
struct fe_priv *np = netdev_priv(dev);
u8 __iomem *base = get_hwbase(dev);
unsigned long flags;
+ int retcode;
- if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2)
+ if (np->desc_ver == DESC_VER_1 || np->desc_ver == DESC_VER_2) {
pkts = nv_rx_process(dev, limit);
- else
+ retcode = nv_alloc_rx(dev);
+ } else {
pkts = nv_rx_process_optimized(dev, limit);
+ retcode = nv_alloc_rx_optimized(dev);
+ }
- if (nv_alloc_rx(dev)) {
+ if (retcode) {
spin_lock_irqsave(&np->lock, flags);
if (!np->in_shutdown)
mod_timer(&np->oom_kick, jiffies + OOM_REFILL);
-
Ok, that was included in the bunch I just pulled and pushed out, so hopefully this one is resolved. Pls verify. Linus -
yep, can confirm that it's now fixed upstream too. Ingo -
Arkadiusz, Could try the attached patch to see if it solves the problem? If not, please send the output of acpidump command and log. Regards.
This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20
that are not yet fixed in Linus' tree.
If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.
Due to the huge amount of recipients, please trim the Cc when answering.
Subject : ThinkPad T60: system doesn't come out of suspend to RAM
(CONFIG_NO_HZ)
References : http://lkml.org/lkml/2007/2/22/391
Submitter : Michael S. Tsirkin <mst@mellanox.co.il>
Thomas Gleixner <tglx@linutronix.de>
Handled-By : Ingo Molnar <mingo@elte.hu>
Status : unknown
Subject : kernel BUG at kernel/time/tick-sched.c:168 (CONFIG_NO_HZ)
References : http://lkml.org/lkml/2007/2/16/346
Submitter : Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Status : problem is being debugged
Subject : BUG: soft lockup detected on CPU#0
NOHZ: local_softirq_pending 20
References : http://lkml.org/lkml/2007/2/20/257
Submitter : Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Ingo Molnar <mingo@elte.hu>
Status : problem is being debugged
Subject : i386: no boot with nmi_watchdog=1 (clockevents)
References : http://lkml.org/lkml/2007/2/21/208
Submitter : Daniel Walker <dwalker@mvista.com>
Caused-By : Thomas Gleixner <tglx@linutronix.de>
commit e9e2cdb412412326c4827fc78ba27f410d837e6e
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Status : problem is being debugged
-
x60 doesn't resume from S2R either, it doesn't matter if CONFIG_NO_HZ is set or not though. 2.6.20 worked fine. -- Jens Axboe -
It somehow works for me. As long as I do not play with bluetooth and suspend to disk... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
It locks solid here on resume, going back to 2.6.20 makes it work perfectly again. In between 2.6.20 and 2.6.21-rc1 some ACPI change broke resume, but that got fixed. Some other change later snuck in that broke it AGAIN for me, sigh. I don't use bluetooth nor suspend to disk. -- Jens Axboe -
resume is stock on my T60 too. So you mean v2.6.21-rc1 vanilla works fine? Do you know a commit ID that works for sure? I'd like to bisect this, but this way i might just find that ACPI change that got already fixed later on (and then got re-broken). Ingo -
Nope, 2.6.21-rc1 vanilla does not work. 2.6.20 works. 2.6.20-gitX worked until some acpi change broke it, the below patch fixed that for me. That got merged in a later 2.6.20-gitY, but then some other patch broke it again so that 2.6.21-rc1 is broken. Not much luck there :-) So it looks like: - c5a7156959e89b32260ad6072bbf5077bcdfbeee broke 2.6.20-git - f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 should fix that. - Something later than f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 broke it Yeah, it gets trickier. I'll try f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 now and see if that works, then bisect to 2.6.21-rc1 to find the other offender. I hope the other offender didn't get added before f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38, we'll see :-) -- Jens Axboe -
f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 works for me, starting bisect. -- Jens Axboe -
Which got me nowhere, after bisecting down from 1213 revisions to nothing. f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 certainly works, just verified again. Trying only acpi related changes now... -- Jens Axboe -
update: f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 works for me too, and 01363220f5d23ef68276db8974e46a502e43d01d is broken. I too will attempt to bisect this. Ingo -
hm. There's some weird bisection artifact here. Here are the commits i tested, in git-log order: #1 commit 01363220f5d23ef68276db8974e46a502e43d01d bad #2 commit ee404566f97f9254433399fbbcfa05390c7c55f7 bad #3 commit f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 good #4 commit c827ba4cb49a30ce581201fd0ba2be77cde412c7 bad if i tell git-bisect that #1 is bad and #3 is good, then it offers me #2 - that's OK. But when i tell it that #2 is bad, it offers #4 - which is out of order! The bisection goes off into la-la land after that and never gets back to a commit that is /after/ the good commit. How is this possible? (I upgraded from git-1.4.4 to 1.5.0 to make sure this isnt some git bug that's already fixed.) i'll try to straighten this out manually, perhaps #3 is in some merge branch that confuses bisection. Or maybe i misunderstood how git-bisect works. Ingo -
git-bisect gets royally confused on those ACPI merge branches around commit c0cd79d11412969b6b8fa1624cdc1277db82e2fe. Here are my test results so far: commit 01363220f5d23ef68276db8974e46a502e43d01d: bad commit 255f0385c8e0d6b9005c0e09fffb5bd852f3b506: bad commit c0cd79d11412969b6b8fa1624cdc1277db82e2fe: bad commit c24e912b61b1ab2301c59777134194066b06465c: good commit e9e2cdb412412326c4827fc78ba27f410d837e6e: bad commit 79bf2bb335b85db25d27421c798595a2fa2a0e82: bad commit fc955f670c0a66aca965605dae797e747b2bef7d: good commit 70c0846e430881967776582e13aefb81407919f1: good commit 414f827c46973ba39320cfb43feb55a0eeb9b4e8: bad commit f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38: good commit 5f0b1437e0708772b6fecae5900c01c3b5f9b512: bad commit b878ca5d37953ad1c4578b225a13a3c3e7e743b7: bad commit c2902c8ae06762d941fab64198467f78cab6f8cd: bad commit 12e74f7d430655f541b85018ea62bcd669094bd7: bad commit 3388c37e04ec0e35ebc1b4c732fdefc9ea938f3b: bad commit 9f4bd5dde81b5cb94e4f52f2f05825aa0422f1ff: bad the results are totally reproducible (i re-tried a few of both the good and the bad commits), i.e. it's not a sporadic condition. Also, a number of the 'bad' commits have no dynticks stuff in them at all, so i'd exclude dynticks. could someone suggest a sane way to go with this? Perhaps suggest specific commit IDs to test? Ingo -
Hm, does 2.6.20-mm2 work? If not, you can bisect the broken-out sereis with quilt. Rafael -
Looks like git bisect worked for you, and wasn't confused at all. You started out with 2931 commits between your first known-bad and known-good commits, which means that you usually end up having to check "log2(n)+1" kernels, ie I'd have expected you to have to do 12-13 bisection attempts to cut it down to one. You seem to have done 14 (you list 16 commits, two of which are the starting points), which is right in that range. The reason you sometimes get more is: - you "help" git bisect by choosing other commits than the optimal ones. - with bad luck, it can be hard to get really close to "half the commits" in the reachability analysis, especially if you have lots of merges (and *especially* if you have octopus merges that merge more than two branches of development). For example, say that you have something like a | +---+---+---+---+ | | | | | b c d e f where you have six commits - you can't test any "combinations" at all, since they are all independent, so "git bisect" cannot test them three and three to cut down the time, so if you don't know which one is bad, you'll basically end up testing them all. The bad luck case never really happens to that extreme in practice, and even when it does you can sometimes be lucky and just hit on the bug early (so "bad luck" may end up being "good luck" after all), but it explains why you can get more - or less - than log2(n)+1 attempts. More commonly one more. A much *bigger* problem is if you mark something good or bad that isn't really. Ie if the bug comes and goes (it might be timing-dependent, for example), the problem will be that you'll always narrow things down (that's what bisection does), but you may not narrow it down to the right thing! We've had that happen several times. If the bug (for example) means that suspend *often* breaks, but sometimes works just by luck, you might mark a kernel "good" when it ...
no - i simply picked them by hand, based on looking at gittk output, because bisection did not appear to find anything useful: 9f4bd5dde81b5cb94e4f52f2f05825aa0422f1ff is first bad commit And via that method i found a couple of more 'good' points - which git-bisect never picked up by itself. (and i did 3-4 separate git-bisect sessions, one of them was a "git-bisect start drivers/acpi/" - which is the main area of suspicion). I looked at git-bisect visualize more than once, and i've attached one of the bisection logs below. i also think i know what happens. Firstly, my testing is reliable, as i mentioned it in the other mail i frequently re-visited commits to make sure that none of my bad/good decisions is spurios - but no, the test results are extremely reproducable: either the laptop resumes properly after flashing its disk light or it does not. the problem i think is that i simply took git-bisect's behavior for granted (i used it many times already) but forgot about a very basic precondition: git-bisect will find only a /single/ good->bad transition. If there is a bad->good transition combined with a good->bad transition then git-bisect will think it's the same 'badness', while it's a /former/ badness that it is honing in on - totally sending the bisection off into la-la-land. so as i mentioned it in the first mail: i /know/ that this commit is a bad->good transition point: f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 /and i only want to test commits that include this commit/ - because i know that without this commit git-bisect confuses the /other/ breakage with the new breakage. In the bisection log below, this choice of git-bisect: ee404566f97f9254433399fbbcfa05390c7c55f7 is 'bad' according to testing, but that's 'another' badness - and i missed it. Now, having slept on it, the solution is very simple: whenever git-bisect picks a commit for which the following command comes up empty: git-log | grep ...
the situation is simpler than that: there is a /known/ bug, and i marked the bugfix commit as 'good'. I never met such a multiple-bugs scenario before and forgot that git-bisect could easily pick a tree without this essential bugfix and would not be able to make a distinction between the two types of badness. I'll try what i've described in the previous mail: mark all bisection points that do not include f3ccb06f as 'good' - thus 'merging' the known-bad area with the first known-good commit, and thus eliminating it from the bisection space. (but it might also be useful to have a "git-bisect must-include" kind of command that would allow this to be automated: mark a particular tree as an essential component of the search space.) Ingo -
this got me quite a bit further: git-bisect start git-bisect bad 01363220f5d23ef68276db8974e46a502e43d01d git-bisect good f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 git-bisect fake-good ee404566f97f9254433399fbbcfa05390c7c55f7 git-bisect bad d43a338e395371733a80ec473b40baac5f74d768 git-bisect bad 255f0385c8e0d6b9005c0e09fffb5bd852f3b506 git-bisect fake-good f99c6bb6e2e9c35bd3dc0b1d0faa28bd6970930d git-bisect fake-good 0187f221e96e3436d552c0c7143f183eb82fb658 git-bisect bad 81450b73dde07f473a4a7208b209b4c8b7251d90 git-bisect fake-good ef29498655b18d2bfd69048e20835d19333981ab git-bisect fake-good 8a03d9a498eaf02c8a118752050a5154852c13bf git-bisect good 5c95d3f5783ab184f64b7848f0a871352c35c3cf git-bisect good ecb5f7521a309cb9c5fc0832b9705cd2a03d7d45 git-bisect good 0539771d7236b425f285652f6f297cc7939c8f9a 81450b73dde07f473a4a7208b209b4c8b7251d90 is first bad commit [ note: by having the "git-bisect must-have-bugfix f3ccb06f" functionality i mentioned in the previous mail git-bisect could have eliminated the fake-good steps. ] it's not a resolution of this regression yet, because this commit is a merge with upstream: | commit 81450b73dde07f473a4a7208b209b4c8b7251d90 | Merge: 8a03d9a... 0539771... | Author: Len Brown <len.brown@intel.com> | Date: Fri Feb 16 18:52:41 2007 -0500 | | Pull misc-for-upstream into release branch which means that the fix in Len's tree got broken by merging with upstream. Note: this 'upstream' in isolation is broken too, due to not having that essential fix from Len's tree! So we quite likely have /two/ bugs, any of which breaks resume (which breakage looks the same, so no way to isolate them via testing). I'll now try the following: i'll try to manually apply Len's fix to every tree that git-bisect offers me, in the hope of being able to isolate the /other/ bug. [ But really, i'm not expecting any miracles because this is way out of league for ...
this method pointed out the real bug that we are interested in: | 774c47f1d78e373a6bd2964f4e278d1ce26c21cb is first bad commit | commit 774c47f1d78e373a6bd2964f4e278d1ce26c21cb | Author: Avi Kivity <avi@qumranet.com> | Date: Mon Feb 12 00:54:47 2007 -0800 | [PATCH] KVM: cpu hotplug support undoing the 774c47f1 patch from HEAD gave me a working resume. I'll send a fix for this KVM breakage in a separate mail. here's how the bisection went: bad: [01363220f5d23ef68276db8974e46a502e43d01d] [PARISC] clocksource: good: [fa285a3d7924a0e3782926e51f16865c5129a2f7] Linux 2.6.20 +good: [bcdb81ae29091f6a66369aabfd8324e4a53d05dc] knfsd: SUNRPC: add a bad: [920841d8d1d61bc12b43f95a579a5374f6d98f81] Merge branch 'for-linus' +bad: [86a71dbd3e81e8870d0f0e56b87875f57e58222b] sysctl: hide the sysctl +bad: [719c91ccadd3ed26570dbb29d54166914832eee9] [POWERPC] Use udbg_early +bad: [ebaf0c6032f525ddb0158fb59848d41899dce8cd] Merge branch 'for-linus' +good: [8cd133073f9b5cd335c0b2e4740aceb025d50ca9] kvm: Fix mismatch between +bad: [5b8e8ee6c65a34d8aafaeb8e2eaa97e496c2567c] ps3: add shutdown to +bad: [a524d946bdced73c5fbe60170fb33611491c4211] tgafb: sync-on-green +bad: [a268422de8bf1b4c0cb97987b6c329c9f6a3da4b] fbdev driver for S3 +good: [8d0be2b3bf4a55606967d7d84e56c52521e94333] KVM: VMX: add vcpu_clear() +bad: [59ae6c6b87711ceb2d1ea5f9e08bb13aee947a29] KVM: Host suspend/resume +bad: [774c47f1d78e373a6bd2964f4e278d1ce26c21cb] KVM: cpu hotplug support the commits prefixed with '+' were the ones where i had to hand-merge the f3ccb06f commit to. Near the end of the bisection it nicely honed in on the group of KVM changes that included the bad commit. but the conclusion is clear: if multiple bugs are present in the search area then it gets quite difficult to sort it out via git-bisect - but it's not impossible either. The following git-bisect enhancement could have made things easier for me: git-bisect mark-must-have <tree> which would mark ...
Subject: [patch] KVM: T60 resume fix From: Ingo Molnar <mingo@elte.hu> my T60 laptop does not resume correctly due to KVM attempting to send an IPI to a CPU that might be down (or not up yet). (Doing so also triggers the send_IPI_mask_bitmask() warning in arch/i386/kernel/smp.c, line 732.) with this fix applied my laptop does not hang during resume. [ KVM will have to disable/enable virtualization on the CPU itself that goes down / comes up, not via an IPI sent from the requesting CPU. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> --- drivers/kvm/kvm_main.c | 6 ------ 1 file changed, 6 deletions(-) Index: linux/drivers/kvm/kvm_main.c =================================================================== --- linux.orig/drivers/kvm/kvm_main.c +++ linux/drivers/kvm/kvm_main.c @@ -2083,12 +2083,6 @@ static int kvm_cpu_hotplug(struct notifi case CPU_DEAD: case CPU_UP_CANCELED: decache_vcpus_on_cpu(cpu); - smp_call_function_single(cpu, kvm_arch_ops->hardware_disable, - NULL, 0, 1); - break; - case CPU_UP_PREPARE: - smp_call_function_single(cpu, kvm_arch_ops->hardware_enable, - NULL, 0, 1); break; } return NOTIFY_OK; -
Since I don't normally have kvm loaded, this patch is unlikely to help me, is that right? -- MST -
That is correct. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -
> with this fix applied my laptop does not hang during resume. On the off chance that this is relevant, could you post your .config please? -- MST -
sure - find it below. Ingo # # Automatically generated make config: don't edit # Linux kernel version: 2.6.21 # Mon Mar 5 11:15:42 2007 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_ASYNC_SUPPORT=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y # CONFIG_TASK_XACCT is not set # CONFIG_UTS_NS is not set CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y # CONFIG_IKCONFIG is not set CONFIG_CPUSETS=y CONFIG_SYSFS_DEPRECATED=y CONFIG_RELAY=y CONFIG_INITRAMFS_SOURCE="" CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y CONFIG_KALLSYMS_EXTRA_PASS=y CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y # CONFIG_MODULE_FORCE_UNLOAD is not set CONFIG_MODVERSIONS=y CONFIG_MODULE_SRCVERSION_ALL=y CONFIG_KMOD=y CONFIG_STOP_MACHINE=y # # Block ...
That is already CPU_ONLINE in my tree (and in the pull request sent to Linus a couple of days ago). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -
I noticed ;) Resend, please. -
Could you un-CC: git@vger.kernel.org from this thread, please? -
that solves the resume problem - but doesnt solve the CPU_DEAD issue of sending an IPI to an already offline CPU. Might be a better idea to do it in CPU_DOWN_PREPARE? (and then to also add a CPU_DOWN_FAILED branch?) Ingo -
Mainline now has DOWN_PREPARE and UP_CANCELED calling ->hardware_disable(), and ONLINE calling ->hardware_enabled(). What tree are you looking at? [but I do see the need for DOWN_FAILED now. Off to find a resumable machine...] -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -
suspend/resume works fine now and there are no warning messages whatsoever (with suspend simulation). Thanks Avi! Ingo -
I just tried Ingo's .config and it hangs on resume for me (with suspend to memory). -- MST -
could you try 's2ram' from http://suspend.sf.net ? It's basically equivalent to 'echo mem > /sys/power/state', but you can usually also see the reason why it doesnt resume, on the vga console. (by default the kernel doesnt resume the t60's vga properly) on my box resume works, but it's incredibly slow. Takes a few minutes to have total effect. Ingo -
OK, I'll give it a try. Do I have to shut down X or is it enough to I guess it is possible that what I mistook for a hang was actually an incredibly slow system, and that it would resume, eventually. I waited for 5 min and gave I have same here with my old .config (2-3 min). However, after resume was completed, system would hang on the *next* suspend to ram. Ingo, do you see this behaviour, or can you suspend/resume any number of times? -- MST -
Where do I find this suspend simulation? Sounds like a great suspend debugging tool. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -
it's just the simple hack below.
Ingo
------------------------>
Subject: [patch] suspend debugging: simulate suspend-to-RAM
From: Ingo Molnar <mingo@elte.hu>
most resume bugs are due to the kernel hanging or crashing while trying
to resume a specific device. It is extremely hard to debug such hangs
because often when the hang happens there's no console available yet.
Make debugging such bugs easier by offering a resume mode that does not
actually call into the BIOS to turn off the machine. This, in
combination with earlyprintk=serial,ttyS0,115200,keep ,
CONFIG_PM_DEBUG=y, CONFIG_DISABLE_CONSOLE_SUSPEND=y and
/sys/power/filter enables me to have a fully functional serial console
during the full suspend/resume cycle.
I debugged two suspend bugs in the -rt kernel this way already, so it's
pretty useful.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
Documentation/kernel-parameters.txt | 4 ++++
drivers/acpi/sleep/main.c | 15 ++++++++++++++-
include/linux/acpi.h | 1 +
kernel/sysctl.c | 8 ++++++++
4 files changed, 27 insertions(+), 1 deletion(-)
Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -163,6 +163,10 @@ and is between 256 and 4096 characters.
acpi_osi= [HW,ACPI] empty param disables _OSI
+ acpi_simulate_suspend_to_ram
+ [KNL] Do not call into BIOS to do suspend-to-RAM
+ Format: <0/1>
+
acpi_serialize [HW,ACPI] force serialization of AML methods
acpi_skip_timer_override [HW,ACPI]
Index: linux/drivers/acpi/sleep/main.c
===================================================================
--- linux.orig/drivers/acpi/sleep/main.c
+++ linux/drivers/acpi/sleep/main.c
@@ -35,6 +35,14 @@ static u32 acpi_suspend_states[] = {
static int init_8259A_after_S1;
+/*
+ * simulate entry into the BIOS - this ...Thanks, I'll put it to good use. Are you planning to upstream it? Not for 2.6.21, presumably ;) -- error compiling committee.c: too many arguments to function -
it needs an essential cleanup: instead of modifying 'mem' behavior, it should be a 'testmem' thing and no acpi_ flag. Ingo -
I think Pavel wants it triggerable through /sys/power/state: http://lkml.org/lkml/2007/1/25/99 -- MST -
I just confirmed that 0539771d7236b425f285652f6f297cc7939c8f9a Going to test that now. -- MST -
I have confirmed these two on my system. -- MST -
you could probably get quite a bit further in bisecting the other
breakage, by using the following method:
manully apply the patch below to 81450b73dde and retest. It will most
likely work. Then FIRST unapply the patch and mark the tree via
'git-bisect good' and continue the bisection. Then try to apply the
patch again. If it's already included - ignore the rejected patch.
Whenever git-bisect offers you a new commit, just try to apply the
patch. Ok? This way you'll be able to avoid the known ACPI breakage, and
zoom in on the unknown breakage.
Ingo
---------------->
commit f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38
Author: Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>
Date: Tue Feb 13 02:35:50 2007 -0500
ACPI: Disable wake GPEs only once.
fixes Suspend/Resume regressions due to recent ACPICA update.
Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
diff --git a/drivers/acpi/events/evgpe.c b/drivers/acpi/events/evgpe.c
index dfac3ec..635ba44 100644
--- a/drivers/acpi/events/evgpe.c
+++ b/drivers/acpi/events/evgpe.c
@@ -636,17 +636,6 @@ acpi_ev_gpe_dispatch(struct acpi_gpe_event_info *gpe_event_info, u32 gpe_number)
}
}
- if (!acpi_gbl_system_awake_and_running) {
- /*
- * We just woke up because of a wake GPE. Disable any further GPEs
- * until we are fully up and running (Only wake GPEs should be enabled
- * at this time, but we just brute-force disable them all.)
- * 1) We must disable this particular wake GPE so it won't fire again
- * 2) We want to disable all wake GPEs, since we are now awake
- */
- (void)acpi_hw_disable_all_gpes();
- }
-
/*
* Dispatch the GPE to either an installed handler, or the control method
* associated with this GPE (_Lxx or _Exx). If a handler exists, we invoke
-
BTW, the key here seems to be CONFIG_CC_OPTIMIZE_FOR_SIZE, at least that is my current guess looking at the config options I still need to test whether they make a difference. Either that, or the vmsplit option got broken again. I'm attaching my x60 config that works for me. -- Jens Axboe
Side note: it's still usually fairly easy. Especially if you have a known fix for the other bug, you can usually just do the equivalent of git cherry-pick <fixcommit> at each point during this bisect (or just have a known patch that you keep applying and un-applying), and you're largely done. Of course, if the area with the fix keeps changing, or if the fix is really intrusive and nasty, this gets hairy, but at least in this case the patch is fairly trivial and it shouldn't cause any trouble at all to do this. The only real down-side is just the mindless extra work, and the possible added confusion you get from modifying the history at the points you're testing. "git bisect" is not necessarily happy about auto-picking a new bisection point with a dirty tree, for example, so before you mark something "good" or "bad", you should generally try to do so with a clean git tree (ie if you apply a patch at each stage, do "git reset --hard" to remove the patch before you do the "git bisect bad/good" stage). Similarly, especially at the end of the bisection run, if you actually use "git cherry-pick" to *add* a commit, the bisection will now take that added commit into account when trying to pick the next commit to check, which is not what you really want. It probably doesn't matter that much during the early stages (when bisection is really jumping around wildly anyway, and one commit more or less doesn't really matter), but again, it might be a good idea to make a habit of undoing the cherry-pick, the same way you'd undo a patch (eg "git reset --hard HEAD^" would do it, if you have exactly one cherry-pick that you tested). Linus -
No it's not. "git bisect" does exactly the right thing. There is no simple
ordering in a complex branch-merge schenario, you can't just put the
commits in some "ordering" and test things in time order. That would be
totally broken, and idiotic. It doesn't give the right results.
What git bisect does is to find the commit that most closely *bisects* the
history of commits, so that if it is marked good/bad, it will leave you
with about 50% of the commits left. But if you are looking at date order,
you're entirely confused.
For example, let's take a really simple case
a <- bad
/ \
b c
| |
d e
| |
f g
\ /
h
|
* <-good
and if you are looking to find something "in the middle", you might thing
that "d" or "e" are the best choices, since time-wise, they are in the
middle.
But that's not true AT ALL.
If you actually want to bisect that kind of history, you need to choose
"b" or "c", even though they may both be *much* more "recent" than the
others. Why? Because if you pick "d", you're really only testing three
commits ('d' 'f' and 'h') out of the 8 commits you have to test.
In contrast, if you pick 'b', you are testing the effects of *four*
commits ('b', 'd', 'f' and 'h') and you have thus neatly bisected the
commits into two equal groups for testing (one group _with_ those four
commits, and one group _without_) instead of having partitioned them as 3
commits vs 5 commits.
So please realize that non-linear history very much means that you MUST
NOT think that you just pick a commit "in the middle". No, git bisect is a
LOT smarter than that - it picks a commit that *reaches* about half the
It's possible because git knows what it is doing, and you didn't think
things through.
The commits that "git bisect" picked out are the right ones. Quite often,
there may be two or more "equally good" commits (in my example above, you
can choose either "b" or "c", and it will ...Strange; on my x60, suspend to ram works okay.
(Resume is very slow, because disks are not spinned up properly; and
there's something wrong with timers; console beeps take way too long).
dmesg attached.
That's with
commit 7b965e0884cee430ffe5dc81cdb117b9316b0549
tree 754dce6432258e0a8c3a758e13a34eb3a1d22ee1
parent 5a39e8c6d655b4fe8305ef8cc2d9bbe782bfee5f
author Trond Myklebust <Trond.Myklebust@netapp.com> Wed, 28 Feb 2007
20:13:55 -0800
committer Linus Torvalds <torvalds@woody.linux-foundation.org> Thu, 01
Mar 2007 14:53:39 -0800
[PATCH] VM: invalidate_inode_pages2_range() should not exit early
Fix invalidate_inode_pages2_range() so that it does not
immediately exit
just because a single page in the specified range could not be
removed.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel, I tried with your .config, and indeed the system came back to life after 2-3 minutes after I press Fn/F4, indeed the issue seems to be with the disk. It could be that the same takes place with my original .config - maybe I just wasn't patient enough. I'll need to re-test that. However, I noticed that, after resume, when the system is presumably functional, if I try to suspend to ram again, this second suspend hangs, displaying the following on screen: [ 17.170000] ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 20 [ 17.170000] PCI: Setting latency timer of device 0000:02:00.0 to 64 [ 17.250000] e1000: 0000:02:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:16:41:5 4:6c:47 [ 17.330000] e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection the crescent LED starts blinking and does not seem to stop for at lest 10 min, I've run out of patience after that. It could be that it's just very slow again. Pavel, did you try suspend to RAM after a successfull resume from RAM? Under 2.6.20, the system suspends/resumes to memory within about 20 sec any number of times. -- MST -
the spin-up takes a few seconds here under suspend/resume simulation: | ata1: waiting for device to spin up (7 secs) | Restarting tasks ... done. [5-10 seconds pass] | ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300) | ata1.00: configured for UDMA/100 | SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB) | sda: Write Protect is off | sda: Mode Sense: 00 3a 00 00 | SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA with real resume it takes even longer time - but i dont see where the delays come from in that case - i suspect it's SATA. i'm also getting this WARN_ON() from e1000: BUG: at drivers/pci/msi.c:611 pci_enable_msi() [<c01061bd>] show_trace_log_lvl+0x19/0x2e [<c01062b6>] show_trace+0x12/0x14 [<c01062cc>] dump_stack+0x14/0x16 [<c024fcc4>] pci_enable_msi+0x6d/0x203 [<c02b709e>] e1000_request_irq+0x2e/0xe2 [<c02bb742>] e1000_resume+0x7f/0xef [<c0249a68>] pci_device_resume+0x1a/0x44 [<c02b39ec>] resume_device+0xf7/0x16f [<c02b3adb>] dpm_resume+0x77/0xcb [<c02b3b69>] device_resume+0x3a/0x51 [<c014e669>] enter_state+0x193/0x1bb [<c014e712>] state_store+0x81/0x97 [<c01b68bc>] subsys_attr_store+0x20/0x25 [<c01b6feb>] sysfs_write_file+0xce/0xf6 [<c017e16b>] vfs_write+0xb1/0x13a [<c017e899>] sys_write+0x3d/0x61 [<c0105220>] syscall_call+0x7/0xb seems harmless because it seems to work fine. Ingo -
I would poke Eric Biederman(sp?) about this one. Maybe its even solved by the MSI-enable-related patch he posted in the past 24-48 hours. Jeff -
Eric, Linus, I tried the 3-patch series "[PATCH 0/3] Basic msi bug fixes.." and they fix this problem for me. Were you expecting the OOPS in the first place? In any case, it survived several suspend/resume cycles on both enabled (irq alloc'd and enabled) and disabled devices (only initialized). Jens Axboe was seeing the same problem, perhaps he can confirm the fix as well. In any case, the patches have my blessing :) Please add my: Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com> Cheers, Auke -
the bug was the warning message (a WARN_ON()) above - not an oops. So that warning message is gone in your testing? Ingo -
Sorry for the slow delay. I was out of town for my brothers wedding the last few days. I wasn't exactly expecting the WARN_ON to trigger. What I fixed was an inconsistency in handling our state bits. Fixing that inconsistency appears to have fixed the e1000 usage scenario mostly by accident. The basic issue is that pci_save_state saves the current msi state along with other registers, and then the e1000 driver goes and disables the msi irq after we have saved the irq state as on. My code notices that the msi irq was disabled before restore time, so it skips the restore. However we now have a leak of the msi saved cap because we are not freeing it. This leaves with some basic questions. - Does it make sense for suspend/resume methods to request/free irqs? - Does it make sense for suspend/resume methods to allocate/free msi irqs? - Do we want pci_save/restore_cap to save/restore msi state? The path of least resistance is to just free the extra state and we are good. I'm just not quite certain that is sane and it has been a long day. Eric -
we used to have a lengthy e1000_pci_save|restore_state in our code, which is now gone, so I'm all for that. A separate pci_save_pxie|msi(x)_state for every driver seems completely unnecessary. I can't think of a use case where saving+restoring everything hurts. That's what you want I presume. We currently free all irq's and msi before going into suspend in e1000, and I think that is probably a good thing, somehow I can think of bad things happening if we dont, but I admit that I haven't tried it without alloc/free. We do this in e100 as well and it works. Another motivation would be to leave this up to the driver: if the driver chooses to free/alloc interrupts because it makes sense, you probably would want to keep that choice available. Devices that don't need this can skip the alloc/free, but leave the choice open for others. hth Auke -
ah, looking at the code in e1000 we do: _suspend: pci_save_state(); free_irq() _resume: pci_restore_state(); alloc_irq(); I suppose that's not good either, and the major cause of the warning in the first place. Maybe I can rollback your latest patches and try to fix that mess by postponing the pci_save_state until after we free'd the irq's. Auke -
I just want to understand why we have issues and to see if how we have
organized the suspend/resume path for dealing with msi irqs makes sense.
That is I haven't looked much at the suspend/resume path so I don't know it
well and I am afraid that your problem might be a symptom of a deeper
Currently the irq code supports operation without the
free_irq/request_irq. Since the numbers given are pure linux
abstractions things should it is really a matter of just
Below is an additional set of warnings that should help debug this.
The old code just got lucky that it triggered a warning when this happens.
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 01869b1..5113913 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -613,6 +613,7 @@ int pci_enable_msi(struct pci_dev* dev)
return -EINVAL;
WARN_ON(!!dev->msi_enabled);
+ WARN_ON(!hlist_empty(&dev->saved_cap_space));
/* Check whether driver already requested for MSI-X irqs */
if (dev->msix_enabled) {
@@ -638,6 +639,8 @@ void pci_disable_msi(struct pci_dev* dev)
if (!dev->msi_enabled)
return;
+ WARN_ON(!hlist_empty(&dev->saved_cap_space));
+
msi_set_enable(dev, 0);
pci_intx(dev, 1); /* enable intx */
dev->msi_enabled = 0;
@@ -739,6 +742,7 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
}
}
WARN_ON(!!dev->msix_enabled);
+ WARN_ON(!hlist_empty(&dev->saved_cap_space));
/* Check whether driver already requested for MSI irq */
if (dev->msi_enabled) {
@@ -763,6 +767,8 @@ void pci_disable_msix(struct pci_dev* dev)
if (!dev->msix_enabled)
return;
+ WARN_ON(!hlist_empty(&dev->saved_cap_space));
+
msix_set_enable(dev, 0);
pci_intx(dev, 1); /* enable intx */
dev->msix_enabled = 0;
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index bd44a48..4418839 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -677,6 +677,7 @@ pci_restore_state(struct pci_dev *dev)
}
pci_restore_pcix_state(dev);
...Got a hit on a powerpc g5:
PM: Writing back config space on device 0001:05:04.0 at offset 1 (was 2b00000, writing 2b00006)
------------[ cut here ]------------
Badness at drivers/pci/pci.c:679
Call Trace:
[C0000000080F7410] [C000000000011EFC] .show_stack+0x50/0x1cc (unreliable)
[C0000000080F74C0] [C0000000001AD610] .report_bug+0xa0/0x110
[C0000000080F7550] [C0000000000256E4] .program_check_exception+0xb4/0x670
[C0000000080F7630] [C0000000000046F4] program_check_common+0xf4/0x100
--- Exception: 700 at .pci_restore_state+0x310/0x340
LR = .pci_restore_state+0x2e0/0x340
[C0000000080F79D0] [C00000000026A174] .tg3_chip_reset+0x19c/0xa04
[C0000000080F7A90] [C00000000026D948] .tg3_reset_hw+0xa4/0x2718
[C0000000080F7BA0] [C000000000270030] .tg3_init_hw+0x74/0x94
[C0000000080F7C30] [C000000000270BE0] .tg3_open+0x4c8/0x854
[C0000000080F7CF0] [C0000000003A74A4] .dev_open+0x100/0x12c
[C0000000080F7D90] [C0000000003BAEA8] .netpoll_setup+0x2dc/0x3ec
[C0000000080F7E40] [C000000000283450] .init_netconsole+0x64/0x8c
[C0000000080F7EC0] [C0000000005C0BE4] .init+0x1d0/0x390
[C0000000080F7F90] [C0000000000271F8] .kernel_thread+0x4c/0x68
tg3: eth0: Link is up at 1000 Mbps, full duplex.
tg3: eth0: Flow control is on for TX and on for RX.
Scheduler bitmap error - bitmap being reconstructed..
netconsole: network logging started
Calling initcall 0xc0000000006bd180: .macio_module_init+0x0/0x3c()
That's:
pci_restore_pcix_state(dev);
pci_restore_msi_state(dev);
WARN_ON(!hlist_empty(&dev->saved_cap_space));
return 0;
-
Hmm. Either I am confused of I just found an unanticipated leak.
pci_restore_msi_state should be out of the picture as we don't yet
have ppc msi support and I don't think the g5 generation hardware
supported it either.
The only case I can see which might trigger this is if we saved
pci-X state and then didn't restore it because we could not find
the capability on restore.
Any chance you could walk that list and find the cap_nr of the remaining
element?
Something like:
{
struct pci_cap_saved_state *tmp;
struct hlist_node *pos;
hlist_for_each_entry(tmp, pos, &pci_dev->saved_cap_space, next)
printk(KERN_INFO "saved_cap: 0x%02x\n", tmp->cap_nr);
}
Until I get the best scenario I can come up with is a tg3 hardware bug
that doesn't renable the pci-X capability after a restore of power state.
Getting that cap_nr will at least allow me to be certain if I am dealing
with msi, pci-X or pci-e.
Unanticipated bugs aren't supposed to be this easy to find!
Eric
-
Speaking of tg3, make sure you are aware that the number of calls to save-state functions may not match the number of calls to the restore-state functions. ISTR seeing some memleak bugs in PCI related to this. Jeff -
Thanks that looks like the problem, multiple calls to save before one call to restore when you have a pci-x capability would easily trigger this problem. I just surveyed a bunch of the pci_save_state and pci_restore_state users and this appears to be a common idiom not just a tg3 thing.... It looks like when code was added to save/restore the msi capability was added to pci_save/restore_state that an assumption was added that pci_save_state and pci_restore state were only used for suspend and only used in pairs. There is even a partial bug fix that removed the worst of the symptoms of that assumption from the msi code but failed to recognize the core problem. Now that we have code to work with pcie and pcix capabilities as well as msi this problem is much easier to hit. All of pci_save_state and pci_restore_state is going to have to be restructured to fix this, and it is late in the game. Ugh. Oh well, better to fix it now At least I get my answer about if what pci_save_state is doing is reasonable. It is not. pci_save_state no longer supports being used in conjunction with hardware reset and has become a suspend/resume specific function. Now I'm off to wite some patches to fix this. Eric -
Well this is clearly a weird tangent from the topic of this thread but it looks to have found some real bugs even if they aren't the ones we are looking for. In short pci_save_state and pci_restore_state are used to two primary was: As a pair called from the suspend and restore routines. One save to multiple restores usually present in hardware reset routines. The additions to save/restore msi, pci-e and pci-x state failed to take this second usage scenario into account. The next two patches address this problem. This should directly fix the issue observed with the tg3, with multiple saves and a single restore. It happens that to handle the reset case cleanly we also need to support the odd usage model of the current e1000 driver. I have never have figured out how to get suspend/resume actually working on any of my machines so the code path is untested. But the patches are trivial and pretty much obviously correct. So if a couple people could do a quick suspend/resume test on these patches to confirm I'm not blind that would be great. Eric -
There are two ways pci_save_state and pci_restore_state are used. As helper functions during suspend/resume, and as helper functions around a hardware reset event. When used as helper functions around a hardware reset event there is no reason to believe the calls will be paired, nor is there a good reason to believe that if we restore the msi state from before the reset that it will match the current msi state. Since arch code may change the msi message without going through the driver, drivers currently do not have enough information to even know when to call pci_save_state to ensure they will have msi state in sync with the other kernel irq reception data structures. It turns out the solution is straight forward, cache the state in the existing msi data structures (not the magic pci saved things) and have the msi code update the cached state each time we write to the hardware. This means we never need to read the hardware to figure out what the hardware state should be. By modifying the caching in this manner we get to remove our save_state routines and only need to provide restore_state routines. The only fields that were at all tricky to regenerate were the msi and msi-x control registers and the way we regenerate them currently is a bit dependent upon assumptions on how we use the allow msi registers to be configured and used making the code a little bit brittle. If we ever change what cases we allow or how we configure the msi bits we can address the fragility then. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> --- drivers/pci/msi.c | 150 ++++++++-------------------------------------- drivers/pci/pci.c | 2 - drivers/pci/pci.h | 2 - include/linux/msi.h | 8 +-- include/linux/pci_regs.h | 1 + 5 files changed, 29 insertions(+), 134 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 01869b1..ad33e01 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -100,6 +100,7 @@ static void ...
This is a note to let you know that I've just added the patch titled
Subject: msi: Safer state caching.
to my gregkh-2.6 tree. Its filename is
msi-safer-state-caching.patch
This tree can be found at
http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/
ouch! Are you interested in getting it work? Ingo -
I haven't even seriously tried. I don't yet have a laptop and I like staying logged. The first time I rolled the jiffie counter it was a happened after I had realized that if I kept my machine up for another couple of months my jiffie counter would roll and I said why not So basically suspend hasn't been a feature I'm been interested in using. I get more annoyed at my UPS failing.. I will probably get around to figuring out the whole suspend/resume thing one of these days and it will probably take me a day or two. But that doesn't result in bug fixes being delivered in a timely manner. Eric -
Hmm. pci_save_pcix_state/pci_restore_pcix_state seem to only handle
regular devices and seem to ignore the fact that for bridge PCI-X
capability has a different structure.
Is this intentional? If not, here's a patch to fix this.
Warning: completely untested.
PCI: restore bridge PCI-X capability registers after PM event
Restore PCI-X bridge up/downstream capability registers
after PM event. This includes maxumum split transaction
commitment limit which might be vital for PCI X.
Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il>
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index df49530..4b788ef 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -597,14 +597,19 @@ static int pci_save_pcix_state(struct pci_dev *dev)
if (pos <= 0)
return 0;
- save_state = kzalloc(sizeof(*save_state) + sizeof(u16), GFP_KERNEL);
+ save_state = kzalloc(sizeof(*save_state) + sizeof(u16) * 2, GFP_KERNEL);
if (!save_state) {
- dev_err(&dev->dev, "Out of memory in pci_save_pcie_state\n");
+ dev_err(&dev->dev, "Out of memory in pci_save_pcix_state\n");
return -ENOMEM;
}
cap = (u16 *)&save_state->data[0];
- pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
+ if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+ pci_read_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, &cap[i++]);
+ pci_read_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, &cap[i++]);
+ } else
+ pci_read_config_word(dev, pos + PCI_X_CMD, &cap[i++]);
+
pci_add_saved_cap(dev, save_state);
return 0;
}
@@ -621,7 +626,11 @@ static void pci_restore_pcix_state(struct pci_dev *dev)
return;
cap = (u16 *)&save_state->data[0];
- pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
+ if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+ pci_write_config_word(dev, pos + PCI_X_BRIDGE_UP_SPL_CTL, cap[i++]);
+ pci_write_config_word(dev, pos + PCI_X_BRIDGE_DN_SPL_CTL, cap[i++]);
+ } else
+ pci_write_config_word(dev, pos + PCI_X_CMD, cap[i++]);
...Probably not a such. I don't think we have any drivers for bridge devices so I don't think it matters. It likely wouldn't hurt to fix it just in case though. If you fix the offsets and diff this against my last fix (to never Unless I am completely misreading the spec. While you have picked the right register to save the offsets should be 0x08 and 0x0c or 8 and 12.... Eric -
Yes but they do their own thing wrt saving/restoring registers. No, the spec is written in terms of dwords (32 bit), we are storing words (16 bits). The data at offsets 8 and 12 is read-only split transaction capacity. Split transaction limit starts at bit 16 so you need to add 2 to byte offset. Right? -- MST -
thinking the code works. The read-only and the read-write part are all defined as part of the same register so I didn't expect them to be separate. And I hadn't paid attention enough to see that the code was only saving 16bit values. Rumor has it that some pci devices can't tolerate < 32bit accesses. Although I have never met one. The two factors together suggest that for generic code it probably makes sense to operate on 32bit quantities, and just to ignore the read-only portion. Eric -
The code for regular devices seems to use 16-bit accesses, so I think it's best to stay consistent. Or do you want to change this too? -- MST -
If we are stomping rare probabilities we might as well change that too. The code to save pci-x state is relatively recent. So it probably just hasn't met a problem device yet (assuming they exist). Eric -
OK I guess. I gather we assume writing read-only registers has no side effects? Are there rumors circulating wrt to these? -- MST -
I haven't heard anything about that, and if we are writing the same value back it should be pretty safe. I have heard it asserted that at least one version of the pci spec only required 32bit accesses to be supported by the hardware. One of these days I will have to look that and see if it is true. I do know it can be weird for hardware developers to support multiple kinds of decode. As I recall for pci and pci-x at the hardware level the only difference in between 32bit transactions and smaller ones is the state of the byte-enable lines. Eric -
Is this the only place where Linux uses pci_read_config_word/pci_read_config_dword? I think such hardware will be pretty much DOA on all OS-es. Why don't we wait True, and same holds for PCI-Express. So let's assume hardware implements RO correctly but ignores the BE bits - nothing bad happens then, right? -- MST -
So, Eric, what are your thoughts on this patch (probably wrt 2.6.22)? Since the rest of the save/restore code uses 16 bit transactions, it should be safe here too? -- MST -
Eric W. Biederman wrote: I'm trying this patch together with the other 2 that you sent out a few days ago. I'm seeing some minor issues with this and lots of bogus warnings as far as I can see. If I suspend/resume and unload e1000, then reinsert e1000.ko, I immediately hit the WARN_ON at `msi.c:516: WARN_ON(!hlist_empty(&dev->saved_cap_space));` I'm not sure that's useful debugging information. even though saved state exists the module has been removed, so you might want to purge the state table when the driver gets removed? anyway, back to testing. -
Sorry I should have been clear. With the last two patches I sent out: pci: Repair pci_save/restore_state so we can restore one save many times. msi: Safer state caching. Those WARN_ON's are now totally bogus. In essence the WARN_ON's were testing to ensure pci_save_state and pci_restore_state were paired. The assumptions was that pci_restore_state would remove everything from the list that pci_save_state placed on it. When I reworked the code and removed the bogus pairing requirement it meant that if we ever save state and have a pci-e or pci-x capability we will have a state structure on the list until the kernel reboots. In summary: pci_save_state and pci_restore_state were making a wrong assumption about the world. The WARN_ON patch tested to ensure the world matched those functions. When I fixed those functions to match the world the WARN_ON's became completely bogus. Eric -
update: Thomas' PIT/HPET resume-fix patch fixed the delay for me. Ingo -
SATA has another nice feature. Somehow there is an interrupt pending on the SATA controller, which comes in somewhere in the middle of resume. If it happens before the SATA code resumed, the SATA code ignores the interrupt and the interrupt is disabled due to "nobody cared", which in turn prevents SATA to ever become functional again. Any idea on that one ? tglx -
Jeff - that sounds like a SATA bug. If you have an interrupt handler registered, you'd better handle the interrupt regardless of whether you think the hardware might be gone or not. It's generally *not* ok to do if (device_offline()) return IRQ_NONE; at the top of an interrupt handler. Of course, if you think the hardware is supposed to be quiescent, then the only thing you should do is generally just do the "shut up" operation (ie read status, write it back or whatever). You must generally *not* try to pass any data upwards (ie if the higher layers have told you to shut up, you may need to handle the hardware, but you must not involve the higher layers themselves any more, because they expect you to be quiet). And if you cannot do that because you need to resume in order to have the status register mapped, then you need to have an "early_resume()" function which gets called *before* interrupts are enabled. That's what early-resume (and late-suspend) are designed for: doing things that happen very early in the resume sequence before everything is up. And if you don't want to do any of these things (or are unable to, because of some ordering constraint or bad design), then you simply need to Jeff, Auke, does this ring any bells? Linus -
For the e1000 issue, the problem is solved with Eric Biederman's 3-patch msi cleanups. You should have another message in your mailbox confirming that I tested his patches and the MSI warning for e1000 suspend-resume is gone with them. Cheers, Auke -
Seems to work ok in -rc3... as long as I do not mix s2ram with s2disk. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
f3ccb06f3b8e0cf42b579db21f3ca7f17fcc3f38 works for me, too. -- MST -
Is this Subject : ThinkPad T60: no screen after suspend to RAM References : http://lkml.org/lkml/2007/2/22/391 Submitter : Michael S. Tsirkin <mst@mellanox.co.il> Ingo Molnar <mingo@elte.hu> Handled-By : Ingo Molnar <mingo@elte.hu> Status : unknown cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed -
It doesn't resume at all. -- Jens Axboe -
This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20 that are not yet fixed in Linus' tree. If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affectected subsystem or driver, a patch of you caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues. Due to the huge amount of recipients, please trim the Cc when answering. Subject : forcedeth: skb_over_panic References : http://bugzilla.kernel.org/show_bug.cgi?id=8058 Submitter : Albert Hopkins <kernel@marduk.letterboxes.org> Status : unknown Subject : natsemi ethernet card not detected correctly References : http://lkml.org/lkml/2007/2/23/4 http://lkml.org/lkml/2007/2/23/7 Submitter : Bob Tracy <rct@gherkin.frus.com> Caused-By : Mark Brown <broonie@sirena.org.uk> Handled-By : Mark Brown <broonie@sirena.org.uk> Patch : http://lkml.org/lkml/2007/2/23/142 Status : patch available Subject : request_module: runaway loop modprobe net-pf-1 References : http://lkml.org/lkml/2007/2/21/206 Submitter : YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> Caused-By : Kay Sievers <kay.sievers@vrfy.org> commit c353c3fb0700a3c17ea2b0237710a184232ccd7f Handled-By : Greg KH <greg@kroah.com> Status : problem is being discussed Subject : IPV6=m, SUNRPC=y compile error References : http://bugzilla.kernel.org/show_bug.cgi?id=8050 http://lkml.org/lkml/2007/2/12/442 http://lkml.org/lkml/2007/2/20/384 Submitter : Michael-Luke Jones <mlj28@cam.ac.uk> Pete Clements <clem@clem.clem-digital.net> Sid Boyce <g3vbv@blueyonder.co.uk> Caused-By : Chuck Lever <chuck.lever@oracle.com> Handled-By : YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> Status : patch available Subject : WARNING: "compat_agp_ioctl" undefined! References : http://lkml.org/lkml/2007/2/21/272 Submitter : Andreas Schwab ...
Patch has been reverted and submitted to Linus to pull, but he's out of town right now... thanks, greg k-h -
This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20
that are not yet fixed in Linus' tree.
If you find your name in the Cc header, you are either submitter of one
of the bugs, maintainer of an affectected subsystem or driver, a patch
of you caused a breakage or I'm considering you in any other way possibly
involved with one or more of these issues.
Due to the huge amount of recipients, please trim the Cc when answering.
Subject : resume: slab error in verify_redzone_free(): cache `size-512':
memory outside object was overwritten
References : http://lkml.org/lkml/2007/2/24/41
Submitter : Pavel Machek <pavel@ucw.cz>
Handled-By : Marcel Holtmann <marcel@holtmann.org>
Status : unknown
Subject : ThinkPad T60: no screen after suspend to RAM
References : http://lkml.org/lkml/2007/2/22/391
Submitter : Michael S. Tsirkin <mst@mellanox.co.il>
Handled-By : Ingo Molnar <mingo@elte.hu>
Status : unknown
Subject : HP nx6325 notebook: usb mouse stops working after suspend to ram
References : http://lkml.org/lkml/2007/2/21/413
Submitter : Arkadiusz Miskiewicz <arekm@maven.pl>
Caused-By : Konstantin Karasyov <konstantin.a.karasyov@intel.com>
commit 0a6139027f3986162233adc17285151e78b39cac
Status : unknown
Subject : ACPI update breaks kpowersave
References : http://lkml.org/lkml/2007/2/10/7
Submitter : Ismail Dönmez <ismail@pardus.org.tr>
Fabio Comolli <fabio.comolli@gmail.com>
Status : unknown
Subject : MacBook: AE_NOT_FOUND ACPI messages
References : http://bugzilla.kernel.org/show_bug.cgi?id=8066
Submitter : Thomas Meyer <thomas.mey@web.de>
Status : unknown
Subject : Asus A8N-VM motherboard:
framebuffer/console boot failure boot failure (ACPI related)
References : http://lkml.org/lkml/2007/2/23/132
Submitter : Andrew Nelless <andrew@nelless.net>
Handled-By : Antonino A. Daplas <adaplas@gmail.com>
Status : problem is being ...Just reproduced this in -rc2. Another thing I noticed: with 2.6.20, pressing Fn/F4 generates an ACPI event and triggers suspend to RAM. On 2.6.21-rc2, after resume (when the box is accessible from network), pressing Fn/F4 again does not seem to have any effect. -- MST -
Can you please get the dmesg output after resume via the network ? tglx -
The link above has it. -- MST -
I have the same problem on my IBM X60s on rc1 and rc2. Can't resume from RAM, can't suspend to disk. It is possible to revert all the changes to ACPI and test it? Jeff. -
As I said elsewhere in the thread, suspend/resume to RAM works ok on my thinkpad x60. I posted my .config there, perhaps difference is in it? Ingo identified KVM as possible culprit. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
I'll try your .config for kicks, the problem that Ingo pin pointed is not what is affecting me. -- Jens Axboe -
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
-
I've tried with CONFIG_KVM=n and CONFIG_KVM=y and both does not suspend. -
Do you mean that they "do not resume after suspend"? -- MST -
I can't even suspend to disk/ram. It just hangs and the lights just blink and everything else hangs. With 2.6.20, it works fine. Jeff. -
Turn up console loglevel, and see where it hangs... -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
I think CONFIG_DISABLE_CONSOLE_SUSPEND would have to be set for this purpose too. Greetings, Rafael -
Still appears, but this does not seem to be 40/80 pin cable problem to be but rather ata-piix calling some acpi methods and this rulsts in acpi errors. -- Meelis Roos (mroos@linux.ee) -
On Tue, 27 Feb 2007 15:00:29 +0200 (EET) There are two separate problems showing up in the one trace - broken ACPI spew and wrong cable detect. I don't think they are related -
This email lists some known regressions in 2.6.21-rc1 compared to 2.6.20 that are not yet fixed in Linus' tree. If you find your name in the Cc header, you are either submitter of one of the bugs, maintainer of an affectected subsystem or driver, a patch of you caused a breakage or I'm considering you in any other way possibly involved with one or more of these issues. Due to the huge amount of recipients, please trim the Cc when answering. Subject : forcedeth no longer works References : http://bugzilla.kernel.org/show_bug.cgi?id=8090 Submitter : David P. Reed <dpreed@reed.com> Caused-By : Ayaz Abdulla <aabdulla@nvidia.com> Status : unknown Subject : forcedeth: skb_over_panic References : http://bugzilla.kernel.org/show_bug.cgi?id=8058 Submitter : Albert Hopkins <kernel@marduk.letterboxes.org> Status : unknown Subject : natsemi ethernet card not detected correctly References : http://lkml.org/lkml/2007/2/23/4 http://lkml.org/lkml/2007/2/23/7 Submitter : Bob Tracy <rct@gherkin.frus.com> Caused-By : Mark Brown <broonie@sirena.org.uk> Handled-By : Mark Brown <broonie@sirena.org.uk> Patch : http://lkml.org/lkml/2007/2/23/142 Status : patch available Subject : ThinkPad T60: system doesn't come out of suspend to RAM (CONFIG_NO_HZ) References : http://lkml.org/lkml/2007/2/22/391 Submitter : Michael S. Tsirkin <mst@mellanox.co.il> Handled-By : Thomas Gleixner <tglx@linutronix.de> Ingo Molnar <mingo@elte.hu> Status : unknown Subject : kernel BUG at kernel/time/tick-sched.c:168 (CONFIG_NO_HZ) References : http://lkml.org/lkml/2007/2/16/346 Submitter : Michal Piotrowski <michal.k.k.piotrowski@gmail.com> Handled-By : Thomas Gleixner <tglx@linutronix.de> Status : problem is being debugged Subject : BUG: soft lockup detected on CPU#0 NOHZ: local_softirq_pending 20 (SMT scheduler) References : http://lkml.org/lkml/2007/2/20/257 Submitter : Michal ...
Adrian, The BUG_ON() was replaced by a warning printk(). The BUG_ON() exposed a Patch available, not confirmed yet. tglx -
I can confirm that the bug is fixed (over 20 hours of testing should be enough). Huge thanks! Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (PL) (http://www.stardust.webpages.pl/ltg/) LTG - Linux Testers Group (EN) (http://www.stardust.webpages.pl/linux_testers_group_en/) -
thanks alot! I think this thing was a long-term performance/latency regression in HT scheduling as well. Ingo -
Agreed. I was recently looking at that spot because I found that niced tasks were taking latency hits, and disabled it, which helped a bunch. I also can't understand why it would be OK to interleave a normal task with an RT task sometimes, but not others.. that's meaningless to the RT task. IMHO, SMT scheduling should be a buyer beware thing. Maximizing your core utilization comes at a price, but so does disabling it, so I think letting the user decide what he wants is the right thing to do. -Mike -
Apologies for the resend, lkml address got mangled... Ingo I'm going to have to partially disagree with you on this. This has only become a problem because of what happens with dynticks now when rq->curr == rq->idle. Prior to this, that particular SMT code only leads to relative delays in scheduling for lower priority tasks. Whether or not that task is ksoftirqd should not matter because it is not like they are starved indefinitely, it is only that nice 19 tasks are relatively delayed, which by definition is implied with the usage of nice as a scheduler hint wouldn't you say? I know it has been discussed many times before as to whether 'nice' means less cpu and/or more latency, but in our current implementation, nice means both less cpu and more latency. So to me, the kernels without dynticks do not have a regression. This seems to only be a problem in the setting of the new dynticks code IMHO. That's not to say it isn't a bug! Nor am I saying that dynticks is a problem! Please don't misinterpret that. The second issue is that this is a problem because of the fuzzy definition of what idle is for a runqueue in the setting of this SMT code. Normally, rq->curr==rq->idle means the runqueue is idle, but not with this code since there are still rq->nr_running on that runqueue. What dynticks in this implementation is doing is trying to idle a hyperthread sibling on a cpu whose logical partner is busy. I did not find that added any power saving on my earlier dynticks implementation, and found it easier to keep that sibling ticking at the same rate as its partner. Of course you may have found something different, and I definitely agree with what you are likely to say in response to this- we shouldn't have to special case logical siblings as having a different definition of idle than any other smp case. Ultimately, that leaves us with your simple patch as a reasonable solution for the dynticks case even though it does change the behaviour dramatically for a ...
(hrmph. having to copy/paste/try again. evolution seems to be broken..
RCPT TO <linux-kernel@vg.kolivas.org> failed: Cannot resolve your domain {mp049}
..caused me to be unable to send despite receipts being disabled)
No I'm not, but let's go further in that direction just for the sake of
argument. You're then saying that you prefer realtime priorities to not
work in the HT setting, given that realtime tasks don't participate in
the 'single stream me' program.
I'm saying only that we're defeating the purpose of HT, and overriding a
So? User asked for HT. That's hardware multiplexing. It ain't free.
I don't think it does actually. Let your RT task sleep regularly, and
ever so briefly. We don't evict lower priority tasks from siblings upon
To me, the reason for interleaving is solely about keeping the core
Re-read this paragraph with realtime task priorities in mind, or for
that matter, dynamic priorities. If you carry your priority/throughput
argument to it's logical conclusion, only instruction streams of
absolutely equal priority should be able to share the core at any given
time. You may as well just disable HT and be done with it.
To me, siblings are logically separate units, and should be treated as
such (as they mostly are). They share an important resource, but so do
physically discrete units.
-Mike
-
Where do I say that? I do not presume to manage realtime priorities in any way. You're turning my argument about nice levels around and somehow saying that because hyperthreading breaks the single stream me semantics by parallelising them that I would want to stop that happening. Nowhere have I argued that realtime semantics should be changed to somehow work around hyperthreading. SMT nice is about managing nice only, and not realtime But the buyer is not aware. You are aware because you tinker, but the vast majority of users who enable hyperthreading in their shiny pcs are not aware. The only thing they know is that if they enable hyperthreading their programs run slower in multitasking environments no matter how much they nice the other processes. Buyers do not buy hardware knowing that the internal design breaks something as fundamental as 'nice'. You seem to presume that most people who get hyperthreading are happy to compromise 'nice' in order to get their second core working and I put it to you that they do not make that Well you know as well as I do that you're selecting out the exception rather Well that's certainly taking my logic for a ride. This is about managing _nice_ and _only_ nice. Nice specifies fixed interval static priorities where (in the risk of repeating myself) you are specifying that higher nice values tasks you wish to receive less cpu and more latency. Dynamic priorities have absolutely no effect on what the discrepancies are between the static priorities of differing nice values. As for realtime priorities, again, I do not presume to be managing them with SMT nice. They are unique entities unrelated to nice values. The only thing they have in common with nice levels is that if something is running without a realtime priority, it should be preempted by the realtime task as you have specified that the realtime task should receive all the cpu over the non-realtime task. I don't pretend that there is some cpu percentage ...
I see no real difference between the two assertions. Nice is just a mechanism to set priority, so I applied your assertion to a different range of priorities than nice covers, and returned it to show that the code contradicts itself. It can't be bad for a nice 1 task to run with a nice 0 task, but OK for a minimum RT task to run with a maximum RT task. Iff HT without corrective measures breaks nice, then it breaks To me it's pretty much black and white. Either you want to split your cpu into logical units, which means each has less to offer than the I don't agree that it's the exception, and if you look at this HT thing from the split cpu perspective, I'm not sure there's even a problem. Scrolling down, I see that this is getting too long, and we aren't The above will have to do. -Mike -
i'm starting to lean towards your view that we should not artificially keep tasks from running, when there's a free CPU available. We should still keep the 'other half' of SMT scheduling: the immediate pushing of tasks to a related core, but this bit of 'do not run tasks on this CPU' dependent-sleeper logic is i think a bit fragile. Plus these days SMT siblings do not tend to influence each other in such a negative way as older P4 ones where a HT sibling would slow down the other sibling significantly. plus with an increasing number of siblings (which seems like an inevitable thing on the hardware side), the dependent-sleeper logic becomes less and less scalable. We'd have to cross-check every other 'related' CPU's current priority to decide what to run. if then there should be a mechanism /in the hardware/ to set the priority of a CPU - and then the hardware could decide how to prioritize between siblings. Doing this in software is really hard. Ingo -
Well it is meant to be tuned to the cpu type in per_cpu_gain. So it should be easy to be set to the appropriate scaling. It was never meant to be a one Yes even I've commented before that this current system is unworkable come multiple shared power threads. This I do see as a real problem with it - in And that's the depressing part because of course I was interested in that as the original approach to the problem (and it was a big problem). When I spoke to Intel and AMD (of course to date no SMT AMD chip exists) at kernel summit they said it was too hard to implement hardware priorities well. Which is real odd since IBM have already done it with Power... Still I think it has been working fine in software till now, but now it has to deal with the added confusion of dynticks, so I already know what will happen to it. Hrm it's been a good time for my code all round... I think I'll just swap prefetch myself up the staircase to some pluggable scheduler that would hyperthread me to sleep as an idle priority task. -- -ck -
Well, it's not a dyntick problem in the first place. Even w/o dynticks we go idle with local_softirq_pending(). Dynticks contains an explicit check for that, which makes it visible. tglx -
Oops I'm sorry if I made it sound like there's a dynticks problem. That was not my intent and I said as much in an earlier email. Even though I'm finding myself defending code that has already been softly tagged for redundancy, let's be clear here; we're talking about at most a further 70ms delay in scheduling a niced task in the presence of a nice 0 task, which is a reasonable delay for ksoftirqd which we nice the eyeballs out of in mainline. Considering under load our scheduler has been known to cause scheduling delays of 10 seconds I still don't see this as a bug. Dynticks just "points it out to us". -- -ck -
Well, dyntick might end up to delay it for X seconds as well, which _is_ observable and that's why the check was put there in the first place. tglx -
well, not running softirqs when we could is a bug. It's not a big bug, but it's a bug nevertheless. It doesnt matter that softirqs could be delayed even worse under high load - there was no 'high load' here. Ingo -
Gotcha. I'll prepare a smt-nice removal patch shortly. -- -ck -
Remove the SMT-nice feature which idles sibling cpus on SMT cpus to
facilitiate nice working properly where cpu power is shared. The idling
of cpus in the presence of runnable tasks is considered too fragile, easy
to break with outside code, and the complexity of managing this system if an
architecture comes along with many logical cores sharing cpu power will be
unworkable.
Remove the associated per_cpu_gain variable in sched_domains used only by
this code.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
---
include/asm-i386/topology.h | 1
include/asm-ia64/topology.h | 2
include/asm-mips/mach-ip27/topology.h | 1
include/asm-powerpc/topology.h | 1
include/asm-x86_64/topology.h | 1
include/linux/sched.h | 1
include/linux/topology.h | 4
kernel/sched.c | 155 ----------------------------------
8 files changed, 1 insertion(+), 165 deletions(-)
Index: linux-2.6.21-rc2/kernel/sched.c
===================================================================
--- linux-2.6.21-rc2.orig/kernel/sched.c 2007-03-02 08:56:45.000000000 +1100
+++ linux-2.6.21-rc2/kernel/sched.c 2007-03-02 08:58:40.000000000 +1100
@@ -3006,23 +3006,6 @@ static inline void idle_balance(int cpu,
}
#endif
-static inline void wake_priority_sleeper(struct rq *rq)
-{
-#ifdef CONFIG_SCHED_SMT
- if (!rq->nr_running)
- return;
-
- spin_lock(&rq->lock);
- /*
- * If an SMT sibling task has been put to sleep for priority
- * reasons reschedule the idle task to see if it can now run.
- */
- if (rq->nr_running)
- resched_task(rq->idle);
- spin_unlock(&rq->lock);
-#endif
-}
-
DEFINE_PER_CPU(struct kernel_stat, kstat);
EXPORT_PER_CPU_SYMBOL(kstat);
@@ -3239,10 +3222,7 @@ void scheduler_tick(void)
update_cpu_clock(p, rq, now);
- if (p == rq->idle)
- /* Task on the idle queue */
- wake_priority_sleeper(rq);
- else
+ if (p != rq->idle)
task_running_tick(rq, ...^^^^ almost ;) Regards, Michal -- Michal K. K. Piotrowski LTG - Linux Testers Group (PL) (http://www.stardust.webpages.pl/ltg/) LTG - Linux Testers Group (EN) (http://www.stardust.webpages.pl/linux_testers_group_en/) -
I'm getting an undefined symbol with CONFIG_AGP=m: WARNING: "compat_agp_ioctl" [drivers/char/agp/agpgart.ko] undefined! Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." -
On Wed, Feb 21, 2007 at 07:34:01PM +0100, Andreas Schwab wrote: > I'm getting an undefined symbol with CONFIG_AGP=m: > > WARNING: "compat_agp_ioctl" [drivers/char/agp/agpgart.ko] undefined! Fix went to Linus an hour ago. It's been in -mm for a week, and agpgart.git for a day or so. Dave -- http://www.codemonkey.org.uk -
