Hi,

With latest -git (1fca25427482387689fa27594c992a961d98768f), I got
this on reading from /dev/cpu/*/* while hot-unplugging cpu1.

------------[ cut here ]------------
WARNING: at /uio/arkimedes/s29/vegardno/git-working/linux-2.6/arch/x86/kernel/ipi.c:123 send_IPI_mask_bitmask+0xc3/0xe0()
Pid: 3881, comm: cat Not tainted 2.6.27-rc3-00464-g1fca254 #12
 [<c013591f>] warn_on_slowpath+0x4f/0x80
 [<c010a300>] ? native_sched_clock+0x80/0x110
 [<c010a335>] ? native_sched_clock+0xb5/0x110
 [<c015ae5a>] ? __lock_acquire+0x27a/0xa00
 [<c015635b>] ? trace_hardirqs_off+0xb/0x10
 [<c010a335>] ? native_sched_clock+0xb5/0x110
 [<c01563bd>] ? put_lock_stats+0xd/0x30
 [<c0118a43>] send_IPI_mask_bitmask+0xc3/0xe0
 [<c01017c8>] send_IPI_mask+0x8/0x10
 [<c0118307>] native_send_call_func_single_ipi+0x27/0x30
 [<c0160a2b>] generic_exec_single+0x7b/0x80
 [<c0160adf>] smp_call_function_single+0x5f/0x110
 [<c037a440>] ? __rdmsr_safe_on_cpu+0x0/0x60
 [<c037a440>] ? __rdmsr_safe_on_cpu+0x0/0x60
 [<c037a597>] _rdmsr_on_cpu+0x27/0x60
 [<c037a5ea>] rdmsr_safe_on_cpu+0x1a/0x20
 [<c011733e>] msr_read+0x6e/0xa0
 [<c01a87b4>] vfs_read+0x94/0x130
 [<c01172d0>] ? msr_read+0x0/0xa0
 [<c01a8b5d>] sys_read+0x3d/0x70
 [<c01040db>] sysenter_do_call+0x12/0x3f
 =======================
---[ end trace fe4338948cb73be2 ]---
BUG: soft lockup - CPU#0 stuck for 61s! [cat:3881]
irq event stamp: 14632440
hardirqs last enabled at (14632439): [<c015968b>] trace_hardirqs_on+0xb/0x10
hardirqs last disabled at (14632440): [<c015635b>] trace_hardirqs_off+0xb/0x10
softirqs last enabled at (14632434): [<c013a4d1>] __do_softirq+0xe1/0x100
softirqs last disabled at (14632427): [<c013a595>] do_softirq+0xa5/0xb0

Pid: 3881, comm: cat Tainted: G W (2.6.27-rc3-00464-g1fca254 #12)
EIP: 0060:[<c0160952>] EFLAGS: 00200202 CPU: 0
EIP is at csd_flag_wait+0x12/0x20
EAX: f5f31ef0 EBX: c215dc60 ECX: ffffb300 EDX: 000008fa
ESI: 00200292 EDI: c215dc68 EBP: f5f31ec0 ESP: f5f31ec0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: ...
It's generally known that oprofile doesn't support CPU hotplug well.
Someone needs to make a project out of fixing it properly. Right now
it's just a "don't do that when it hurts".

-Andi
--
Hm. What you say is true, but this one in particular has nothing to do
with oprofile! It has something to do with reading /dev/cpu/*/msr
while hot-unplugging cpu1:

 [<c011733e>] msr_read+0x6e/0xa0
 [<c01a87b4>] vfs_read+0x94/0x130

I wasn't using oprofile when this happened. So I think it should also
be considered a separate issue. Though yes -- CPU hotplug in general
tends to break a lot of things.

Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--
On Wed, Aug 20, 2008 at 08:26:19AM +0200, Vegard Nossum wrote:
> On Wed, Aug 20, 2008 at 3:39 AM, Andi Kleen <andi@firstfloor.org> wrote:
> > On Tue, Aug 19, 2008 at 09:51:44PM +0200, Vegard Nossum wrote:
> >> Hi,
> >>
> >> With latest -git (1fca25427482387689fa27594c992a961d98768f), I got
> >> this on reading from /dev/cpu/*/* while hot-unplugging cpu1.
> >
> > It's generally known that oprofile doesn't support CPU hotplug well.
> > Someone needs to make a project out of fixing it properly. Right now
> > it's just a "don't do that when it hurts".
>
> Hm. What you say is true, but this one in particular has nothing to do
> with oprofile! It has something to do with reading /dev/cpu/*/msr
> while hot-unplugging cpu1:
>
> [<c011733e>] msr_read+0x6e/0xa0
> [<c01a87b4>] vfs_read+0x94/0x130
>
> I wasn't using oprofile when this happened. So I think it should also
> be considered a separate issue. Though yes -- CPU hotplug in general
> tends to break a lot of things.

From my reading of the msr code, we check that the cpu is online in ->open,
but we never check it again. We also make no guarantee that it won't go
away before we ->read or even ->close it.

Would adding a get_cpu/put_cpu across the open/close solve this?

Peter?

Dave

--
http://www.codemonkey.org.uk
--
A get_cpu/put_cpu across the whole open..close sequence would seem to
be, ahem, rude, since userspace could hold it for an arbitrary amount of
time (plus, there is no guarantee that they are invoked on the same CPU.)
The cpuid driver has the same problem, obviously.
get_online_cpus() and put_online_cpus() around the call to
{rd,wr}msr_safe_on_cpu() should work; and the CPU hotplug documentation
seems to claim that we can just disable preemption around those calls,
which is exactly what get_cpu()..put_cpu() does, so I guess
get_cpu()..put_cpu() here is fine. Now, the big question is: should
this really be done in the MSR/CPUID drivers, or should it be done in
smp_call_function_single(), which is the generic code invoked by this?
It seems to me that doing it in smp_call_function_single() would be more
correct as it's already protected by get_cpu()..put_cpu() and a
cpu_online() test in there should not be expensive in comparison to the
whole rest of the code.
You may want to see if this patch fixes the problem; it does *NOT* have
the correct error behaviour (some of the intervening layers don't
propagate errors), but it should make the fault go away.
-hpa
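
The test patch itself is not reproduced in this thread. Purely as an
illustration of the idea discussed above, here is a small user-space C
model of an online test inside the cross-call path. Everything prefixed
model_ is hypothetical, and the -ENXIO return value is an assumption,
not something stated in the thread.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Hypothetical stand-in for the online mask: bit n set => CPU n is online. */
static unsigned long model_online_mask = 0x3;	/* CPUs 0 and 1 online */

static bool model_cpu_online(unsigned int cpu)
{
	return cpu < MODEL_NR_CPUS && (model_online_mask & (1UL << cpu)) != 0;
}

/*
 * Model of a single-CPU cross-call. In the kernel the online test would
 * sit inside the existing get_cpu()..put_cpu() region; here we simply
 * refuse to "send the IPI" when the target is not online.
 */
static int model_call_function_single(unsigned int cpu,
				      void (*func)(void *), void *info)
{
	if (!model_cpu_online(cpu))
		return -ENXIO;	/* assumed error code */

	func(info);		/* stands in for running func on the target CPU */
	return 0;
}

/* Trivial callback used to observe whether the call happened. */
static void model_count_call(void *info)
{
	++*(int *)info;
}
```

With a check like this, a read of /dev/cpu/1/msr racing with cpu1 going
offline would get an error back instead of waiting forever for a CPU
that never answers, provided the intervening layers pass the error up.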
The alternative would be to just take out those msr_on_cpu() interfaces again. Right now they are useless in the kernel, but still cause problems. They were only added for OpenVZ's vCPUs which they back then promised me would hit mainline soon. But that was some time ago and there wasn't much progress on this. -Andi --
We still need the equivalent functionality, though. The midlayer (msr_on_cpu) may be pointless, but that doesn't change the fact that putting this functionality in the lower layer (smp_call_function_single) makes more sense. -hpa --
Assuming you can actually have interrupts enabled at this point and are
otherwise ready to do call_function_single (e.g. cpu hotplug locking
etc.). For a lot of MSR accesses in more complicated subsystems like
cpufreq, that requires complications. I would think that in many
circumstances it's better to simply set the affinity of the thread
beforehand at a higher level.

In hindsight I think it was my mistake to ever merge that. I admit I
never liked it, but just merged it because I wasn't able to come up with
a strong enough counter argument back then.

-Andi
--
Well, smp_call_function_single already does all necessary locking; it
makes more sense for it to check that what it's about to call still
exists while inside the lock, instead of requiring the higher layers to
guarantee that it cannot go away. This is simply a matter of the cost of
checking at this point being quite low.

-hpa
--
It does already, doesn't it? Hm, smp_call_function_mask() ANDs the
provided mask with the online mask, but it doesn't look like
smp_call_function_single() does the equivalent.
J
--
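
To make the difference concrete, here is a minimal user-space sketch of
the filtering described above: the requested mask is ANDed with the
online mask, and the single-CPU case is just the same test applied to
one bit. The type and function names are hypothetical and only model
the behaviour; they are not the kernel's cpumask API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU. */
typedef uint32_t model_cpumask_t;

/* CPUs that would actually be targeted: requested AND online. */
static model_cpumask_t model_filter_targets(model_cpumask_t requested,
					    model_cpumask_t online)
{
	return requested & online;
}

/* The equivalent test for a single target CPU. */
static bool model_single_target_ok(unsigned int cpu, model_cpumask_t online)
{
	if (cpu >= 32)
		return false;	/* outside the modelled mask width */
	return model_filter_targets((model_cpumask_t)1 << cpu, online) != 0;
}
```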
It doesn't, and that's how this bug was introduced. It's a trivial add (see test patch already posted) and should hardly matter in terms of execution time. I'll write up a clean patch with all the error propagation tomorrow or Sunday. -hpa --
Hm. Kernel fails to detect cpu1 at all. I am currently unsure of
whether it's your patch or not. But it's the same config that I've
been booting for ages (and I copy it over for each new kernel version
I check out).

Processor #0 (Bootup-CPU)
I/O APIC #2 Version 32 at 0xFEC00000.
Enabling APIC mode: Flat. Using 1 I/O APICs
Processors: 1
SMP: Allowing 1 CPUs, 0 hotplug CPUs
mapped APIC to ffffb000 (fee00000)
mapped IOAPIC to ffffa000 (fec00000)
Allocating PCI resources starting at 50000000 (gap: 40000000:bee00000)
PERCPU: Allocating 1221764 bytes of per cpu data
NR_CPUS: 7, nr_cpu_ids: 1, nr_node_ids 1

I really don't get it. Is this something that can be caused by your
patch _at all_?

Vegard
--
Well, if smp_call_function_single() is called during the CPU up sequence, without the CPU having been added to the online mask, then yes, it could. The most likely place would be from a notifier. That makes it ugly. Need to track down the reason. -hpa --
Could you try this patch? It should (hopefully) tell us if there are
any such invocations and what the call trace looks like.

-hpa
I'm sorry, I _just_ reverted your patch and tested the bare kernel...
but it still only detects cpu0 :-(

Apart from that, it's also incredibly slow and I get some
"end_request: I/O error, dev fd0, sector 0" messages. Start-up (init 3
on a F7) takes closer to 10 minutes. Will now take a closer look at my
config.

Oh. I _just_ noticed a completely different change -- I added acpi=off
to my boot line *blush*

Will now remove it and retry your original patch.

Vegard
--
Removing acpi=off helps with the CPU detection problem. The kernel is
still really slow, though. From /proc/cpuinfo:

processor  : 1
vendor_id  : GenuineIntel
cpu family : 15
model      : 6
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping   : 5
cpu MHz    : 375.000
cache size : 2048 KB

Why is MHz at 375!? I tried cpufreq-selector, but nothing changed. Maybe

calling acpi_cpufreq_init+0x0/0x90
initcall acpi_cpufreq_init+0x0/0x90 returned -19 after 0 msecs

There's also this:

SMP: Allowing 2 CPUs, 0 hotplug CPUs

(but CPU hotplug still works; is the line above about something
different, like physical hotplug?)

Apart from that, with your patch applied, hotplug seems to work OK (no
warnings).

Okay, now I used cpufreq-selector to change to the "ondemand" governor,
and MHz goes back to 3000. Weird. Why would the "performance" governor
put my machine at a constant 375?

Thanks,

Vegard
--
That would be a problem... I presume this problem is independent of the patch, though? -hpa --
On Sun, Aug 24, 2008 at 07:45:48PM +0200, Vegard Nossum wrote:
> Removing acpi=off helps with the CPU detection problem. The kernel is
> still really slow, though. From /proc/cpuinfo:
>
> processor  : 1
> vendor_id  : GenuineIntel
> cpu family : 15
> model      : 6
> model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
> stepping   : 5
> cpu MHz    : 375.000
> cache size : 2048 KB
>
> Why is MHz on 375!? I tried cpufreq-selector, but nothing changed. Maybe
>
> calling acpi_cpufreq_init+0x0/0x90
> initcall acpi_cpufreq_init+0x0/0x90 returned -19 after 0 msecs

-ENODEV. Because you don't have a frequency-scaling-capable CPU.

> Okay, now I used cpufreq-selector to change to "ondemand" governor,
> and MHz goes back to 3000. Weird. Why would "performance" governor put
> my machine to a constant 375?

Probably because you're using p4-clockmod, and it's crap.

Dave

--
http://www.codemonkey.org.uk
--
I sorted it -- thanks!

It turned out to be pretty obscure; my tty setting for the receiving
end of the serial console was set to echo. So when the machine booted,
it was echoing lots of characters into the Fedora 7 init, which would
prompt for the starting of cpuspeed initscript. Turning off echo for
the tty was what triggered the slowness; removing cpuspeed from the
runlevel entirely solved the problem.

Don't know why cpuspeed would select a governor which runs the CPU at
a constant 300 MHz, though.

Vegard
--
On Mon, Aug 25, 2008 at 08:31:04PM +0200, Vegard Nossum wrote:
> Fedora 7 init, which would prompt for the starting of cpuspeed
> initscript. Turning off echo for the tty was what triggered the
> slowness; removing cpuspeed from the runlevel entirely solved the
> problem.
>
> Don't know why cpuspeed would select a governor which runs the CPU at
> a constant 300 MHz, though.

p4-clockmod is the only cpufreq driver that can run on your hardware.
There's nothing better.

A while back, Fedora stopped loading (and even building) p4-clockmod,
because it sucks so bad. I can't remember when we made that change,
but it sounds like it must have been a post F7 thing.

Dave

--
http://www.codemonkey.org.uk
--
> Probably because you're using p4-clockmod, and it's crap.

We should really bite the bullet and just remove it. People
run into this all the time and I bet you can count the people who
actually use it consciously and usefully on one hand.

Or at least only make it run when the user sets an
"I_REALLY_KNOW_WHAT_I_AM_DOING" option explicitly.

-Andi
--
On Mon, Aug 25, 2008 at 08:36:11PM +0200, Andi Kleen wrote:
> > Probably because you're using p4-clockmod, and it's crap.
>
> We should really bite the bullet and just remove it. People
> run into this all the time and I bet you can count the people who
> actually use it consciously and usefully on one hand.
>
> Or at least only make it run when the user sets an
> "I_REALLY_KNOW_WHAT_I_AM_DOING" option explicitly.

We can't really remove it until the ACPI processor driver has a better
response than 'thermal event, argh!, shut down'.

When that happens, I'll be glad to see it go.

Dave

--
http://www.codemonkey.org.uk
--
It only does that when the critical trip point is reached (which basically means that the BIOS tells it -- "I'm on fire"). What else should it do in your opinion when this happens? -Andi --
On Mon, Aug 25, 2008 at 09:39:26PM +0200, Andi Kleen wrote:
> On Mon, Aug 25, 2008 at 02:54:51PM -0400, Dave Jones wrote:
> > On Mon, Aug 25, 2008 at 08:36:11PM +0200, Andi Kleen wrote:
> > > > Probably because you're using p4-clockmod, and it's crap.
> > >
> > > We should really bite the bullet and just remove it. People
> > > run into this all the time and I bet you can count the people who
> > > actually use it consciously and usefully on one hand.
> > >
> > > Or at least only make it run when the user sets an
> > > "I_REALLY_KNOW_WHAT_I_AM_DOING" option explicitly.
> >
> > We can't really remove it until the ACPI processor driver has a better
> > response than 'thermal event, argh!, shut down'.
>
> It only does that when the critical trip point is reached (which
> basically means that the BIOS tells it -- "I'm on fire"). What else should
> it do in your opinion when this happens?

On some systems (for which there aren't BIOS updates) the trip points
are set too low. If we get a thermal event that was caused by a
temporarily increased workload, the temperature will drop off again when
that workload is complete.

For sustained workloads we'd get additional thermal events, at which
time we make a decision "ok, we've throttled as far as we can, and
things are still going badly, power off".

In the event of a failed fan or similar, shutting down is obviously the
right thing to do, and we'd get further thermal events after throttling
which would allow us to do so.

Dave

--
http://www.codemonkey.org.uk
--
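
The policy described in that message can be sketched as a tiny state
machine: each thermal event throttles one step further, and only once
there is nothing left to throttle does the next event mean power off.
This is a user-space illustration under that reading only; the names
and the notion of discrete throttle levels are assumptions, not the
behaviour of the ACPI processor driver or of p4-clockmod.

```c
/* Hypothetical actions a thermal handler could take. */
enum model_thermal_action {
	MODEL_THROTTLE,
	MODEL_POWER_OFF,
};

/* Hypothetical throttle state: 0 means unthrottled. */
struct model_throttle {
	int level;	/* current throttle step */
	int max_level;	/* deepest step available */
};

/* Called for every thermal event: throttle first, power off last. */
static enum model_thermal_action
model_on_thermal_event(struct model_throttle *st)
{
	if (st->level < st->max_level) {
		st->level++;
		return MODEL_THROTTLE;
	}
	/* Already throttled as far as we can and still too hot. */
	return MODEL_POWER_OFF;
}

/* Called when the temperature has dropped again: back off one step. */
static void model_on_cooldown(struct model_throttle *st)
{
	if (st->level > 0)
		st->level--;
}
```

A short-lived load spike would cost one or two throttle steps and then
recover through the cooldown path, while a failed fan would keep
producing events until the power-off branch is reached.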
There were patches floating around to make this configurable. I was always

But none of the cpufreq governors do this. They only care about
load, not about temperature.

So you're saying processor_thermal should let the system cook for some
time first before really taking action?

-Andi
--
On Mon, Aug 25, 2008 at 10:36:49PM +0200, Andi Kleen wrote:
> > If we get a thermal event that was caused by temporary
> > increased workload, temperature will drop off again when that workload
> > is complete.
>
> But none of the cpufreq governors do this. They only care about
> load, not about temperature.

Which is good enough to stop p4 laptops from shutting down as soon as
they've finished booting up.

> > For sustained workloads we'd get additional thermal events, at which
> > time we make a decision "ok, we've throttled as far as we can, and
> > things are still going badly, power off".
>
> That is what the ACPI driver does when the trip point is reached.

Yes, except for that "we've throttled" part.

Dave

--
http://www.codemonkey.org.uk
--
On Mon, 25 Aug 2008 16:47:02 -0400

that's such an enormous gamble it's not funny.

really; if your bios has broken trip points we should use the kernel
command line to disable them (and a dmi blacklist if the number of
bioses that have it wrong is low.. maybe combined with a date based
threshold).

Just praying that p4clockmod keeps it kinda low enough is not the
answer.

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
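
As a rough sketch of what such a scheme could look like: a command-line
override wins outright, otherwise a small table of known-bad machines is
consulted, and an entry only applies to BIOS builds older than a cutoff
date. This is a user-space model; the structure names, the table entry
and the reading of "date based threshold" as a per-entry BIOS date
cutoff are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of the running machine. */
struct model_bios_id {
	const char *vendor;
	const char *product;
	int bios_date;		/* YYYYMMDD */
};

/* Hypothetical blacklist entry. */
struct model_quirk {
	const char *vendor;
	const char *product;
	int fixed_since;	/* BIOS dates >= this are assumed fixed */
};

/* Placeholder entry, not a real machine. */
static const struct model_quirk model_quirks[] = {
	{ "ExampleVendor", "ExampleBook 100", 20070101 },
};

/* Should the BIOS-provided trip points be ignored on this machine? */
static bool model_ignore_trip_points(const struct model_bios_id *id,
				     bool cmdline_override)
{
	size_t i;

	if (cmdline_override)
		return true;	/* the user asked for it explicitly */

	for (i = 0; i < sizeof(model_quirks) / sizeof(model_quirks[0]); i++) {
		const struct model_quirk *q = &model_quirks[i];

		if (strcmp(id->vendor, q->vendor) == 0 &&
		    strcmp(id->product, q->product) == 0 &&
		    id->bios_date < q->fixed_since)
			return true;
	}
	return false;
}
```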
On Mon, Aug 25, 2008 at 12:08:23PM -0700, H. Peter Anvin wrote:
> Andi Kleen wrote:
> >> Probably because you're using p4-clockmod, and it's crap.
> >
> > We should really bite the bullet and just remove it. People
> > run into this all the time and I bet you can count the people who
> > actually use it consciously and usefully on one hand.
> >
> > Or at least only make it run when the user sets an
> > "I_REALLY_KNOW_WHAT_I_AM_DOING" option explicitly.
>
> CONFIG_BROKEN?

It's not really broken (at least in the CONFIG_BROKEN sense), it just
sucks when used in the wrong situations. (Which is 99% of the use cases
people try it in.)

Dave

--
http://www.codemonkey.org.uk
--
