On some strange workload involving strace and fuse I get ocasional long periods (10-100s) of total unresponsiveness, not even SysRq-* working. Then the machine continues as normal. Nothing in dmesg, absolutely no indication about what is happening. Tried nmi_watchdog=1, but then the machine locks up hard shortly after boot. Tried nmi_watchdog=1 acpi=off, then I can't reproduce the problem. Tried 2.6.22-rc2 and 2.6.21, both with the same result. .config and dmesg attached. Any ideas? Possibly something ACPI related? Thanks, Miklos Linux version 2.6.22-rc2 (mszeredi@tucsk) (gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)) #3 SMP Tue May 22 17:55:42 CEST 2007 Command line: root=/dev/sda2 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009f000 (usable) BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved) BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved) BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000003fed0000 (usable) BIOS-e820: 000000003fed0000 - 000000003fedf000 (ACPI data) BIOS-e820: 000000003fedf000 - 000000003ff00000 (ACPI NVS) BIOS-e820: 000000003ff00000 - 0000000040000000 (reserved) BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved) BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved) BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved) BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved) Entering add_active_range(0, 0, 159) 0 entries of 256 used Entering add_active_range(0, 256, 261840) 1 entries of 256 used end_pfn_map = 1048576 DMI present. ACPI: RSDP 000F67D0, 0024 (r2 LENOVO) ACPI: XSDT 3FED1308, 008C (r1 LENOVO TP-79 2110 LTP 0) ACPI: FACP 3FED1400, 00F4 (r3 LENOVO TP-79 2110 LNVO 1) ACPI Warning (tbfadt-0434): Optional field "G...
NMIs in some thinkpads are bad trouble, they lock up the blasted IBM/Lenovo SMBIOS if they happen to hit it while it is servicing a SMI, and thinkpads do SMIs like crazy. -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh -
there's also a L1 ASPM issue with the e1000 chipset in the way (for T60/X60 only). Make sure you are using 2.6.21 or newer. See netdev archives for more on that. Auke -
Miklos is using latest -git and he has nmi_watchdog disabled - still the long pauses persist. (i've never seen that on my t60) Ingo -
how reproducable are these lockups - could you possibly trace it? If yes then please apply: http://www.tglx.de/private/tglx/ht-debug/tracer.diff and run the attached trace-it-1sec.c thing in a loop: echo 1 > /proc/sys/kernel/mcount_enabled while true; do ./trace-it-1sec > trace-`date`.txt done and wait for the lockup. Once it happens, please upload the trace*.txt file that contains the lockup, i guess we'll be able to tell you more about the nature of the lockup. (Perhaps increase the sleep(1) to sleep(5) to capture longer periods and to increase the odds that you catch the lockup while the utility is tracing.) Ingo
With this patch boot stops at segfaulting fsck. I enabled all the new
config options, is that not a good idea? Which one exactly do I need?
Thanks,
Miklos
PS. tracer.diff needed some hacking to make it apply/compile.
Index: linux-2.6.22-rc2/include/linux/linkage.h
===================================================================
--- linux-2.6.22-rc2.orig/include/linux/linkage.h 2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/include/linux/linkage.h 2007-05-24 15:57:43.000000000 +0200
@@ -3,6 +3,8 @@
#include <asm/linkage.h>
+#define notrace __attribute ((no_instrument_function))
+
#ifdef __cplusplus
#define CPP_ASMLINKAGE extern "C"
#else
Index: linux-2.6.22-rc2/Documentation/stable_api_nonsense.txt
===================================================================
--- linux-2.6.22-rc2.orig/Documentation/stable_api_nonsense.txt 2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/Documentation/stable_api_nonsense.txt 2007-05-24 15:57:42.000000000 +0200
@@ -62,6 +62,9 @@ consider the following facts about the L
- different structures can contain different fields
- Some functions may not be implemented at all, (i.e. some locks
compile away to nothing for non-SMP builds.)
+ - Parameter passing of variables from function to function can be
+ done in different ways (the CONFIG_REGPARM option controls
+ this.)
- Memory within the kernel can be aligned in different ways,
depending on the build options.
- Linux runs on a wide range of different processor architectures.
Index: linux-2.6.22-rc2/arch/i386/Kconfig
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/Kconfig 2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/Kconfig 2007-05-24 15:57:42.000000000 +0200
@@ -764,6 +764,14 @@ config BOOT_IOREMAP
depends on (((X86_SUMMIT || X86_GENERICARCH) && NUMA) || (X86 && EFI))
default y
+#
+# function tracing might turn this ...hm, you should only need these: CONFIG_EVENT_TRACE=y CONFIG_FUNCTION_TRACE=y # CONFIG_WAKEUP_TIMING is not set # CONFIG_CRITICAL_IRQSOFF_TIMING is not set CONFIG_MCOUNT=y does it boot with these? Ingo -
Nope. Same segfault. If I try to continue manually with 'init 5', then init segfaults as well :( Miklos -
does it go away if you turn off CONFIG_FUNCTION_TRACE? (that will make the trace a lot less verbose, but still informative) Ingo -
could you just try v2.6.21 plus the -rt patch, which has the tracer built-in? That's a combination that should work well. You can pick it up from: http://people.redhat.com/mingo/realtime-preempt/ same config options as above. If you dont turn on PREEMPT_RT you'll get an almost-vanilla kernel. Ingo -
2.6.22-rc2, only EVENT_TRACE - boots, can't rerpoduce 2.6.21-vanila - can reproduce 2.6.21-rt7, trace options off - can reproduce 2.6.21-rt7, trace options on - can't reproduce Possibly something timing related, that's altered by the trace code. I tried the trace kernel without starting the trace app, but still no bug. Any other ideas? Thanks, Miklos -
perhaps try 2.6.21-rt7 with EVENT_TRACE on (the other trace options off) - does that hide the bug too? Ingo -
The option itself doesn't hide the bug this time, I got one freeze pretty quickly. But after starting the trace-it-1sec loop, I couldn't get it any more. Normally I get a freeze after 3-5 minutes of testing, but with trace-it-1sec there's still nothing after 30 minutes. If I stop trace-it, and do "echo 0 > /proc/sys/kernel/trace_enabled", I get the freeze again. It's a perfect heisenbug. This issue came up when I was testing a userspace fuse bug, and now the reporter of that bug (added to CC) says that he also sometimes experienced a hard lockup during testing but ignored it up to now. So we may yet get some info from him. It could be something fuse related, but it's very hard to imagine how fuse could trigger such a low-level problem. Miklos -
I've got some more info about this bug. It is gathered with nmi_watchdog=2 and a modified nmi_watchdog_tick(), which instead of calling die_nmi() just prints a line and calls show_registers(). This makes the machine actually survive the NMI tracing. The attached traces are gathered over about an hour of stressing. An mp3 player is also going on continually, and I can hear a couple of seconds of "looping" quite often, but it gets as far as the NMI trace only rarely. AFAICS only the last pair shows a trace for both CPUs during the same "freeze". I've put some effort into understanding what's going on, but I'm not familiar with how interrupts work and that sort of thing. The pattern that emerges is that on CPU0 we have an interrupt, which is trying to acquire the rq lock, but can't. On CPU1 we have strace which is doing wait_task_inactive(), which sort of spins acquiring and releasing the rq lock. I've checked some of the traces and it is just before acquiring the rq lock, or just after releasing it, but is not actually holding it. So is it possible that wait_task_inactive() could be starving the other waiters of the rq spinlock? Any ideas? Thanks, Miklos NMI Watchdog detected LOCKUP on CPU 1 CPU 1 Modules linked in: fuse e1000 Pid: 4625, comm: strace Not tainted 2.6.22-rc4 #10 RIP: 0010:[<ffffffff80227ce5>] [<ffffffff80227ce5>] task_rq_lock+0x14/0x6f RSP: 0018:ffff81001cf17ed8 EFLAGS: 00000046 RAX: 0000000000000246 RBX: ffff81001c9da540 RCX: ffff81003fd1e5e8 RDX: 0000000000000001 RSI: ffff81001cf17f10 RDI: ffff81001c9da540 RBP: ffff81001cf17ef8 R08: 0000000000000003 R09: 0000000000000000 R10: 00007fffbcd7c018 R11: 0000000000000246 R12: ffff81001c9da540 R13: ffff81001cf17f10 R14: ffff81001c9da540 R15: 00007fffbcd7d44c FS: 00002b05ee28a6f0(0000) GS:ffff810001fd8ec0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002aaaac02f000 CR3: 000000001ce52000 CR4: 00000000000006e0 Process strace (pid: 4625, threadinfo ff...
hm, this is really interesting, and indeed a smoking gun. The T60 has a Core2Duo and i've _never_ seen MESI starvation happen on dual-core single-socket CPUs! (The only known serious MESI starvation i know about is on multi-socket Opterons: there the trylock loop of spinlock debugging is known to starve some CPUs out of those locks that are being polled, so we had to turn off that aspect of spinlock debugging.) wait_task_inactive(), although it busy-loops, is pretty robust: it does a proper spin-lock/spin-unlock sequence and has a cpu_relax() inbetween. Furthermore, the rep_nop() that cpu_relax() is based on is unconditional, so it's not like we could somehow end up not having the REP; NOP sequence there (which should make the lock polling even more fair) could you try the quick hack below, ontop of cfs-v17? It adds two things to wait_task_inactive(): - a cond_resched() [in case you are running !PREEMPT] - use MONITOR+MWAIT to monitor memory transactions to the rq->curr cacheline. This should make the polling loop definitely fair. If this solves the problem on your box then i'll do a proper fix and introduce a cpu_relax_memory_change(*addr) type of API to around monitor/mwait. This patch boots fine on my T60 - but i never saw your problem. [ btw., utrace IIRC fixes ptrace to get rid of wait_task_interactive(). ] Ingo Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -834,6 +834,16 @@ repeat: cpu_relax(); if (preempted) yield(); + else + cond_resched(); + /* + * Wait for "curr" to change: + */ + __monitor((void *)&rq->curr, 0, 0); + smp_mb(); + if (rq->curr != p) + __mwait(0, 0); + goto repeat; } task_rq_unlock(rq, &flags); -
Yes, the patch does make the pauses go away. In fact just the I looked at the utrace patch, and it still has wait_task_inactive(), and I can still reproduce the freezes with the utrace patch applied. Miklos -
cool! Could you send me the smallest patch you tried that still made the hangs go away? Ingo -
My previous attempt was just commenting out parts of your patch. But maybe it's more logical to move the barrier to immediately after the unlock. With this patch I can't reproduce the problem, which may not mean very much, since it was rather a "fragile" bug anyway. But at least the fix looks pretty harmless. Thanks, Miklos Index: linux-2.6.22-rc4/kernel/sched.c =================================================================== --- linux-2.6.22-rc4.orig/kernel/sched.c 2007-06-18 08:59:17.000000000 +0200 +++ linux-2.6.22-rc4/kernel/sched.c 2007-06-18 09:04:13.000000000 +0200 @@ -1168,6 +1168,11 @@ repeat: /* If it's preempted, we yield. It could be a while. */ preempted = !task_running(rq, p); task_rq_unlock(rq, &flags); + /* + * Without this barrier, wait_task_inactive() can starve + * waiters of rq->lock (observed on Core2Duo) + */ + smp_mb(); cpu_relax(); if (preempted) yield(); -
yeah. The problem is, that the open-coded loop there is totally fine, and we have similar loops elsewhere, so this problem could hit us again, in an even harder to debug place! Since this affects our basic SMP primitives, quite some caution is warranted i think. So ... to inquire about the scope of the problem, another possibility would be for the _spin loop_ being 'too nice', not wait_task_inactive() being too agressive! To test this theory, could you try the patch below, does this fix your hangs too? This change causes the memory access of the "easy" spin-loop portion to be more agressive: after the REP; NOP we'd not do the 'easy-loop' with a simple CMPB, but we'd re-attempt the atomic op. (in theory the non-LOCK-ed memory accesses should have a similar effect, but maybe the Core2Duo has some special optimization for non-LOCK-ed cacheline accesses that causes cacheline starvation?) Ingo ----------------------------------------------------> Subject: [patch] x86: fix spin-loop starvation bug From: Ingo Molnar <mingo@elte.hu> Miklos Szeredi reported very long pauses (several seconds, sometimes more) on his T60 (with a Core2Duo) which he managed to track down to wait_task_inactive()'s open-coded busy-loop. He observed that an interrupt on one core tries to acquire the runqueue-lock but does not succeed in doing so for a very long time - while wait_task_inactive() on the other core loops waiting for the first core to deschedule a task (which it wont do while spinning in an interrupt handler). The problem is: both the spin_lock() code and the wait_task_inactive() loop uses cpu_relax()/rep_nop(), so in theory the CPU should have guaranteed MESI-fairness to the two cores - but that didnt happen: one of the cores was able to monopolize the cacheline that holds the runqueue lock, for extended periods of time. This patch changes the spin-loop to assert an atomic op after every REP NOP instance - this will cause the CPU to express its "MESI in...
I really think this the the wrong approach, although *testing* it makes
sense.
I think we need to handle loops that take, release, and then immediately
re-take differently.
Such loops are _usually_ of the form where they really just release the
lock for some latency reason, but in this case I think it's actually just
a bug.
That code does:
if (unlikely(p->array || task_running(rq, p))) {
to decide if it needs to just unlock and repeat, but then to decide if it
need to *yield* it only uses *one* of those tests (namely
preempted = !task_running(rq, p);
..
if (preempted)
yield();
and I think that's just broken. It basically says:
- if the task is running, I will busy-loop on getting/releasing the
task_rq_lock
and that is the _real_ bug here.
Trying to make the spinlocks do somethign else than what they do is just
papering over the real bug. The real bug is that anybody who just
busy-loops getting a lock is wasting resources so much that we should not
be at all surprised that some multi-core or NUMA situations will get
starvation.
Blaming some random Core 2 hardware implementation issue that just makes
it show up is wrong. It's a software bug, plain and simple.
So how about this diff? The diff looks big, but the *code* is actually
simpler and shorter, I just added tons of comments, which is what blows it
up.
The new *code* looks like this:
repeat:
/* Unlocked, optimistic looping! */
rq = task_rq(p);
while (task_running(rq, p))
cpu_relax();
/* Get the *real* values */
rq = task_rq_lock(p, &flags);
running = task_running(rq, p);
array = p->array;
task_rq_unlock(rq, &flags);
/* Check them.. */
if (unlikely(running)) {
cpu_relax();
goto repeat;
}
if (unlikely(array)) {
yield();
goto repeat;
}
and while I haven't tes...I don't agree with this (+ I know it doesn't matter). The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair". And here they are simply lawlessly not fair. I cannot see any reason why any of tasks doing simultaneously "busy-loop on getting/releasing" a spinlock should starve almost to death another one doing the same (or simply waiting to get this lock) even without cpu_relax. Of course, lawfulness of such On the other hand it seems spinlocks should be at least a little more immune to such bugs: slowdown is OK but not freezing. Current behavior could suggest this unfairness could harm some tasks even without any loops present - but it's not visible enough. So, I'm surprised this thread seems to stop after this patch, and there is no try to make the most of this ideal testing case to improve spinlock design btw. Regards, Jarek P. -
Well, that's certainly a valid standpoint. I wouldn't claim you're _wrong_. At least in theory. In *practice*, fair spinlocks are simply not possible. Not as in "it's hard to do", but as in "you simply cannot do it". The thing is, most spinlocks get held for a rather short time (if that wasn't the case, you'd use something like a mutex), and it's acually a short time not on a "human scale", but on a "CPU core scale". IOW, the cost of doing the cacheline bouncing is often equal to, or bigger than, the actual cost of the operation that happens inside the spinlock! What does this result in? It automatically means that software simply *cannot* do fair spinlocks, because all the timing costs are in parts that software doesn't even have any visibility into, or control over! Yeah, you could add artificial delays, by doing things like cycle counting around the spinlock operation to see if you got the lock when it was _contended_, or whether you got a lock that wasn't, and then adding some statistics etc, but at that point, you don't have a spinlock any more, you have something else. I don't know what to call it. And you could add flags like "this spinlock is getting a lot of contention", try some other algorithm etc. But it's all complicated and against the whole *point* of a spinlock, which is to get in and out as fast as humanly possible. So in practice, spinlock fairness is inherently tied to the hardware behaviour of the cache coherency algorithm. Which gets us to the next level: we can consider hardware that isn't "fair" in its cache coherency to be buggy hardware. That's also a perfectly fine standpoint, and it's actually one that I have a lot of sympathy for. I think it's much easier to some degree to do fairness in the cache coherency than at any other level, because it's something where the hardware really *could* do things like counting bouncing etc. However, while it's a perfectly fine standpoint, it's also totally unrealistic. Fi...
yeah, and if there's no easy solution, change it to a mutex. Fastpath performance of spinlocks and mutexes is essentially the same, and if there's any measurable contention then the scheduler is pretty good at sorting things out. Say if the average contention is longer than 10-20 microseconds then likely we could already win by scheduling away to some other task. (the best is of course to have no contention at all - but there are causes where it is real hard, and there are cases where it's outright unmaintainable.) Hw makers are currently producing transistors disproportionatly faster than humans are producing parallel code, as a result of which we've got more CPU cache than ever, even taking natural application bloat into account. (it just makes no sense to spend those transistors on parallelism when applications are just not making use of it yet. Plus caches are a lot less power intense than functional units of the CPU, and the limit these days is power input.) So scheduling more frequently and more agressively makes more sense than what worries me a bit though is that my patch that made spinlocks equally agressive to that loop didnt solve the hangs! So there is some issue we dont understand yet - why was the wait_inactive_task() open-coded spin-trylock loop starving the other core which had ... an open-coded spin-trylock loop coded up in assembly? And we've got a handful of other open-coded loops in the kernel (networking for example) so this issue could come back and haunt us in a situation where we dont have a gifted hacker like Miklos being able to spend _weeks_ to track down the problem... Ingo -
Your parch kept doing "spin_trylock()", didn't it? That's a read-modify-write thing, and keeps bouncing the cacheline back and forth, and together with the fact that even *after* you get the spinlock the "wait_for_inactive()" would actually end up looping back, releasing it, and re-getting it. So the problem was that "wait_for_inactive()" kept the lock (because it actually *got* it), and looped over getting it, and because it was an exclusive cacheline ownership, that implies that somebody else is not getting it, and is kept from ever getting it. So trying to use "trylock" doesn't help. It still has all the same bad sides - it still gets the lock (getting the lock wasn't the problem: _holding_ the lock was the problem), and it still kept the cache line for the lock on one core. The only way to avoid lock contention is to avoid any exclusive use at all. Linus -
yeah - it changed spin_lock()'s assembly to do a "LOCK BTRL", which is a trylock which tries to dirty the cacheline. There was a "REP NOP" after ok, it's not completely clear where exactly the other core was spinning, but i took it from Miklos' observations that the other core was hanging in the _very same_ task_rq_lock() - which is a true spinlock as well that acquires it. So on one core the spin_lock() was starving, on so the problem was not the trylock based spin_lock() itself (no matter how it's structured in the assembly), the problem was actually modifying the lock and re-modifying it again and again in a very tight yeah - i'm not at all arguing in favor of the BTRL patch i did: i always liked the 'nicer' inner loop of spinlocks, which could btw also easily use MONITOR/MWAIT. (my patch is also quite close to what we did in spinlocks many years ago, so it's more of a step backwards than real progress.) So it seems the problem was that if a core kept _truly_ modifying a cacheline via atomics in a high enough frequency, it could artificially starve the other core. (which would keep waiting for the cacheline to be released one day, and which kept the first core from ever making any progress) To me that looks like a real problem on the hardware side - shouldnt cacheline ownership be arbitrated a bit better than that? Up to the point where some external event (perhaps a periodic SMM related to thermal management) broke the deadlock/livelock scenario? Ingo -
The "nice" inner loop is necessary or else it would generate huge amounts A while ago I showed that spinlocks were a lot more fair when doing unlock with the xchg instruction on x86. Probably the arbitration is all screwed up because we use a mov instruction, which while atomic is not locked. -
No, the cache line arbitration doesn't know anything about "locked" vs "unlocked" instructions (it could, but there really is no point). The real issue is that locked instructions on x86 are serializing, which makes them extremely slow (compared to just a write), and then by being slow they effectively just make the window for another core bigger. IOW, it doesn't "fix" anything, it just hides the bug with timing. You can hide the problem other ways by just increasing the delay between the unlock and the lock (and adding one or more serializing instruction in between is generally the best way of doing that, simply because otherwise micro- architecture may just be re-ordering things on you, so that your "delay" isn't actually in between any more!). But adding delays doesn't really fix anything, of course. It makes things "fairer" by making *both* sides suck more, but especially if both sides are actually the same exact thing, I could well imagine that they'd both just suck equally, and get into some pattern where they are now both slower, but still see exactly the same problem! Of course, as long as interrupts are on, or things like DMA happen etc, it's really *really* hard to be totally unlucky, and after a while you're likely to break out of the steplock on your own, just because the CPU's get interrupted by something else. It's in fact entirely possible that the long freezes have always been there, but the NOHZ option meant that we had much longer stretches of time without things like timer interrupts to jumble up the timing! So maybe the freezes existed before, but with timer interrupts happening hundreds of times a second, they weren't noticeable to humans. (Btw, that's just _one_ theory. Don't take it _too_ seriously, but it could be one of the reasons why this showed up as a "new" problem, even though I don't think the "wait_for_inactive()" thing has changed lately.) Linus -
yeah. I think Linux is i think the only OS on the planet that is using the movb trick for unlock, it even triggered a hardware erratum ;) So it might surprise some hw makers who might rely on the heuristics that each critical section lock and unlock is a LOCK-ed instruction. Ingo -
I'm pretty sure others do it too.
Maybe not on an OS level (but I actually doubt that - I'd be surprised if
Windows doesn't do the exact same thing), but I know for a fact that a lot
of people in threaded libraries end up depending very much on the "simple
store closes a locked section".
Just googling for "xchg" "mov" "spinlock" "-linux" shows discussion boards
for Windows developers with open-coded spinlocks like
int ResourceFlag = 0; // 0=Free, 1=Inuse
...
// Wait until we get the resource
while(InterlockedExchange(&ResourceFlag, 1) != 0) {
Sleep(0); } // Wait a tad
// Have the resource
... // do your thing
ResourceFlag = 0; // Release the resource
and that's definitely Windows code, not some Linux person doing it.
And this is from an OS2 forum
unsigned owned=0;
void request() {
while(LockedExchanged(&owned,1)!=0)
;
}
void release() {
owned = 0;
}
so it's not even something unusual.
So while arguably these people don't know (and don't care) about subtle
issues like memory ordering, I can *guarantee* that a lot of programs
depend on them, even if that dependency may often come from a lack of
knowledge, rather than actively understanding what we do like in the Linux
kernel community.
(And yes, they rely on compilers not reordering either. Tough.)
Linus
-the freezes that Miklos was seeing were hardirq contexts blocking in task_rq_lock() - that is done with interrupts disabled. (Miklos i think also tried !NOHZ kernels and older kernels, with a similar result.) plus on the ptrace side, the wait_task_inactive() code had most of its overhead in the atomic op, so if any timer IRQ hit _that_ core, it was likely while we were still holding the runqueue lock! i think the only thing that eventually got Miklos' laptop out of the wedge were timer irqs hitting the ptrace CPU in exactly those instructions where it was not holding the runqueue lock. (or perhaps an asynchronous SMM event delaying it for a long time) Ingo -
even considering that the 'LOCK'-ed intruction was the heaviest in the busy-loop, the numbers still just dont add up to 'tens of seconds of lockups', so there must be something else happening too. So here's an addition to the existing theories: the Core2Duo is a 4-issue CPU architecture. Now, why does this matter? It matters to the timing of the delivery of interrupts. For example, on a 3-issue architecture, the instruction level profile of well-cached workloads often looks like this: c05a3b71: 710 89 d6 mov %edx,%esi c05a3b73: 0 8b 55 c0 mov 0xffffffc0(%ebp),%edx c05a3b76: 0 89 c3 mov %eax,%ebx c05a3b78: 775 8b 82 e8 00 00 00 mov 0xe8(%edx),%eax c05a3b7e: 0 8b 48 18 mov 0x18(%eax),%ecx c05a3b81: 0 8b 45 c8 mov 0xffffffc8(%ebp),%eax c05a3b84: 792 89 1c 24 mov %ebx,(%esp) c05a3b87: 0 89 74 24 04 mov %esi,0x4(%esp) c05a3b8b: 0 ff d1 call *%ecx c05a3b8d: 0 8b 4d c8 mov 0xffffffc8(%ebp),%ecx c05a3b90: 925 8b 41 6c mov 0x6c(%ecx),%eax c05a3b93: 0 39 41 10 cmp %eax,0x10(%ecx) c05a3b96: 0 0f 85 a8 01 00 00 jne c05a3d44 <schedule+0x2a4> c05a3b9c: 949 89 da mov %ebx,%edx c05a3b9e: 0 89 f1 mov %esi,%ecx c05a3ba0: 0 8b 45 c8 mov 0xffffffc8(%ebp),%eax the second column is the number of times the profiling interrupt has hit that particular instruction. Note the many zero entries - this means that for instructions that are well-cached, the issue order _prevents_ interrupts from _ever_ hitting to within a bundle of micro-ops that the decoder will issue! The above workload was a plain lat_ctx, so nothing special, and i...
No. If anything it made the freezes somwhat more frequent.
And it's not a NO_HZ kernel.
What I notice is that the interrupt distribution between the CPUs is
very asymmetric like this:
CPU0 CPU1
0: 220496 42 IO-APIC-edge timer
1: 3841 0 IO-APIC-edge i8042
8: 1 0 IO-APIC-edge rtc
9: 2756 0 IO-APIC-fasteoi acpi
12: 2638 0 IO-APIC-edge i8042
14: 7776 0 IO-APIC-edge ide0
16: 6083 0 IO-APIC-fasteoi uhci_hcd:usb2
17: 34414 3 IO-APIC-fasteoi uhci_hcd:usb3, HDA Intel
18: 0 0 IO-APIC-fasteoi uhci_hcd:usb4
19: 32 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb5
313: 11405 1 PCI-MSI-edge eth0
314: 29417 10 PCI-MSI-edge ahci
NMI: 164 118
LOC: 220499 220463
ERR: 0
and the freezes don't really change that. And the NMI traces show,
that it's always CPU1 which is spinning in wait_task_inactive().
Miklos
-On Sat, Jun 23, 2007 at 12:36:08PM +0200, Miklos Szeredi wrote: ... BTW, maybe I've missed this and it's unconnected, but I hope the first config has been changed - especially this CONFIG_AGP_AMD64 = y, and this bug from mm/slab.c has gone long ago... Jarek P. -
Well, the LOC thing is for the local apic timer, so while regular interrupts are indeed very skewed, both CPU's is nicely getting the local apic timer thing.. That said, the timer interrupt generally happens just a few hundred times a second, and if there's just a higher likelihood that it happens when the spinlock is taken, then half-a-second pauses could easily be just because even when the interrupt happens, it could be skewed to happen when the lock is held. And that definitely is the case: the most expensive instruction _by_far_ in that loop is the actual locked instruction that acquires the lock (especially with the cache-line bouncing around), so an interrupt would be much more likely to happen right after that one rather than after the store that releases the lock, which can be buffered. It can be quite interesting to look at instruction-level cycle profiling with oprofile, just to see where the costs are.. Linus -
On Thu, 21 Jun 2007 10:31:53 -0700 (PDT) This reminds me Nick's proposal of 'queued spinlocks' 3 months ago Maybe this should be re-considered ? (unlock is still a non atomic op, so we dont pay the serializing cost twice) http://lwn.net/Articles/227506/ extract : Implement queued spinlocks for i386. This shouldn't increase the size of the spinlock structure, while still able to handle 2^16 CPUs. The queued spinlock has 2 fields, a head and a tail, which are indexes into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an "atomic_inc_return" on the head index, and keeps the returned value as a ticket. The CPU then spins until the tail index is equal to that ticket. To unlock a spinlock, the tail index is incremented (this can be non atomic, because only the lock owner will modify tail). Implementation inefficiencies aside, this change should have little effect on performance for uncontended locks, but will have quite a large cost for highly contended locks [O(N) cacheline transfers vs O(1) per lock aquisition, where N is the number of CPUs contending]. The benefit is is that contended locks will not cause any starvation. Just an idea. Big NUMA hardware seems to have fairness logic that prevents starvation for the regular spinlock logic. But it might be interesting for -rt kernel or systems with starvation issues. -
No. The point is simple: IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG! I don't understand why this is even controversial. Especially since we have a patch for the problem that proves my point: the _proper_ way to fix things is to just not do the bad thing, instead of trying to allow the bad behaviour and try to handle it. Things like queued spinlocks just make excuses for bad code. We don't do nesting locking either, for exactly the same reason. Are nesting locks "easier"? Absolutely. They are also almost always a sign of a *bug*. So making spinlocks and/or mutexes nest by default is just a way Umm. i386 spinlocks could and should be *one*byte*. In fact, I don't even know why they are wasting four bytes right now: the fact that somebody made them an "int" just wastes memory. All the actual code uses "decb", so it's not even a question of safety. I wonder why we have that 32-bit thing and the ugly casts. Ingo, any memory of that? (And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't think such an insane machine has ever existed). Linus -
Hmm, not that I have a strong opinion one way or the other, but I don't know that they would encourage bad code. They are not going to reduce latency under a locked section, but will improve determinism in the contended case. They should also improve performance in heavily contended case due to the nature of how they spin, but I know that's not something you want to hear about. And theoretically there should be no reason why xadd is any slower than dec and look at the status flags, should there? I never implementedit in optimised assembly to test though... Some hardware seems to have no idea of fair cacheline scheduling, and especially when there are more than 2 CPUs contending for the cacheline, there can be large starvations. And actually some times we have code that really wants to drop the lock and queue behind other contenders. Most of the lockbreak stuff for example. Suppose we could have a completely fair spinlock primitive that has *virtually* no downsides over the unfair version, you'd take the fair one, right? Not that I'm saying they'd ever be a good solution to bad code, but I Anyway, I think the important point is that they can remain within 4 bytes, which is obviously the most important boundary (they could be 2 bytes on i386). -- SUSE Labs, Novell Inc. -
xadd really generally *is* slower than an add. One is often microcoded, the other is not. But the real problem is that your "unlock" sequence is now about two orders of magnitude slower than it used to be. So it used to be that a spinlocked sequence only had a single synchronization point, now it has two. *That* is really bad, and I guarantee that it makes your spinlocks effectively twice as slow for the non-contended parts. But your xadd thing might be worth looking at, just to see how expensive it is. As an _alternative_ to spinlocks, it's certainly viable. (Side note: why make it a word? Word operations are slower on many x86 implementations, because they add yet another prefix. You only need a byte) Linus -
Oh. I found xadd to be not hugely slower on my P4, but it was a little I don't know why my unlock sequence should be that much slower? Unlocked mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't No real reason I guess. I'll change it. -- SUSE Labs, Novell Inc. -
Oh, that releasing "add" can be unlocked, and only the holder of the lock ever touches that field? I must not have looked closely enough. In that case, I withdraw that objection, and the sequence-number-based spinlock sounds like a perfectly fine one. Yes, the add will be slightly slower than the plain byte move, and the locked xadd will be slightly slower than a regular locked add, but compared to the serialization cost, that should be small. For some reason I thought you needed a locked instruction for the unlock too. So try it with just a byte counter, and test some stupid micro-benchmark on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make it the normal spinlock sequence just because it isn't noticeably slower. In fact, I think a "incb <mem>" instruction is even a byte shorter than "movb $1,mem", and with "unlock" being inlined, that could actually be a slight _win_. Linus -
Nick, call me a worry-wart, but I slept on this, and started worrying.. So I thought about this a bit more, and I like your sequence counter approach, but it still worried me. In the current spinlock code, we have a very simple setup for a successful grab of the spinlock: CPU#0 CPU#1 A (= code before the spinlock) lock release lock decb mem (serializing instruction) B (= code after the spinlock) and there is no question that memory operations in B cannot leak into A. With the sequence counters, the situation is more complex: CPU #0 CPU #1 A (= code before the spinlock) lock xadd mem (serializing instruction) B (= code afte xadd, but not inside lock) lock release cmp head, tail C (= code inside the lock) Now, B is basically the empty set, but that's not the issue I worry about. The thing is, I can guarantee by the Intel memory ordering rules that neither B nor C will ever have memops that leak past the "xadd", but I'm not at all as sure that we cannot have memops in C that leak into B! And B really isn't protected by the lock - it may run while another CPU still holds the lock, and we know the other CPU released it only as part of the compare. But that compare isn't a serializing instruction! IOW, I could imagine a load inside C being speculated, and being moved *ahead* of the load that compares the spinlock head with the tail! IOW, the load that is _inside_ the spinlock has effectively moved to outside the protected region, and the spinlock isn't really a reliable mutual exclusion barrier any more! (Yes, there is a data-dependency on the compare, but it is only used for a conditional branch, and conditional branches are control dependencies and can be speculated, so CPU speculation can easily break that apparent dependency chain and do later loads *before* the spinlock load completes!) Now, I have good reason to believe that all Intel and AMD CPU's have a stricter-than-documented memory o...
... Haven't made too much progress on this, but I have asked someone@amd who might be able to at least know the right person to ask :P (might be faster to ask Andi to ask :)) If you know someone at Intel then that would be appreciated. It would be nice if it is safe (and can be guaranteed to be safe in future). However OTOH, the fastpath may be even faster if we do it in a "definitely safe" way. That is, do the xaddw against 16-bits with the head in 8 of those and the tail in the other 8. Then compare the byte registers of the register returned by xaddw for the test. Although the xaddw is going to be slower than an xaddb, this way we subsequently avoid the extra load completely, while avoiding ordering issues. In the slowpath we would have to have a token locked op in there (like the current spinlocks do), but this could be taken out iff our inquiries come back positive. Anyway, I'll try redoing the patch and getting some numbers. -- SUSE Labs, Novell Inc. -
Nice catch ;) But wasn't Intel suggesting in not relying on the old "strict" ordering rules? IOW shouldn't an mfence always be there? Not only loads could leak up into the wait phase, but stores too, if they have no dependency with the "head" and "tail" loads. - Davide -
Actually, both Intel and AMD engineers have been talking about making the documentation _stricter_, rather than looser. They apparently already are pretty damn strict, because not being stricter than the docs imply just ends up exposing too many potential problems in software that didn't really follow the rules. For example, it's quite possible to do loads out of order, but guarantee that the result is 100% equivalent with a totally in-order machine. One way you do that is to keep track of the cacheline for any speculative loads, and if it gets invalidated before the speculative instruction has completed, you just throw the speculation away. End result: you can do any amount of speculation you damn well please at a micro-architectural level, but if the speculation would ever have been architecturally _visible_, it never happens! (Yeah, that is just me in my non-professional capacity of hw engineer wanna-be: I'm not saying that that is necessarily what Intel or AMD Stores never "leak up". They only ever leak down (ie past subsequent loads or stores), so you don't need to worry about them. That's actually already documented (although not in those terms), and if it wasn't true, then we couldn't do the spin unlock with just a regular store anyway. (There's basically never any reason to "speculate" stores before other mem ops. It's hard, and pointless. Stores you want to just buffer and move as _late_ as possible, loads you want to speculate and move as _early_ as possible. Anything else doesn't make sense). So I'm fairly sure that the only thing you really need to worry about in this thing is the load-load ordering (the load for the spinlock compare vs any loads "inside" the spinlock), and I'm reasonably certain that no existing x86 (and likely no future x86) will make load-load reordering effects architecturally visible, even if the implementation may do so *internally* when it's not possible to see it in the end result. Linus -
Yes, Intel has never done that. They'll probably never do it since it'll break a lot of system software (unless they use a new mode-bit that allows system software to enable lose-ordering). Although I clearly remember to have read in one of their P4 optimization manuals to not assume this in the future. - Davide -
That optimization manual was confused.
The Intel memory ordering documentation *clearly* states that only reads
pass writes, not the other way around.
Some very confused people have thought that "pass" is a two-way thing.
It's not. "Passing" in the Intel memory ordering means "go _ahead_ of",
exactly the same way it means in traffic. You don't "pass" people by
falling behind them.
It's also obvious from reading the manual, because any other reading would
be very strange: it says
1. Reads can be carried out speculatively and in any order
2. Reads can pass buffered writes, but the processor is self-consistent
3. Writes to memory are always carried out in program order [.. and then
lists exceptions that are not interesting - it's clflush and the
non-temporal stores, not any normal writes ]
4. Writes can be buffered
5. Writes are not performed speculatively; they are only performed for
instructions that have actually been retired.
6. Data from buffered writes can be forwarded to waiting reads within the
processor.
7. Reads or writes cannot pass (be carried out ahead of) I/O
instructions, locked instructions or serializing instructions.
8. Reads cannot pass LFENCE and MFENCE instructions.
9. Writes cannot pass SFENCE or MFENCE instructions.
The thing to note is:
a) in (1), Intel says that reads can occur in any order, but (2) makes it
clear that that is only relevant wrt other _reads_
b) in (2), they say "pass", but then they actually explain that "pass"
means "be carried out ahead of" in (7).
HOWEVER, it should be obvious in (2) even _without_ the explicit
clarification in (7) that "pass" is a one-way thing, because otherwise
(2) is totally _meaningless_. It would be meaningless for two reasons:
- (1) already said that reads can be done in any order, so if that
was a "any order wrt writes", then (2) would be pointless. So (2)
must mean something *else* than "...Yes, they were stating that clearly. IIWNOC (If I Were Not On Crack) I remember them saying to not assume any ordering besides the data dependency and the CPU self-consistency in the future CPUs, and to use *fence instructions when certain semantics were required. But google did not help me in finding that doc, so maybe I were really on crack :) - Davide -
hm, i agree with you that this is problematic. Especially on an SMT CPU it would be a big architectural restriction if prefetches couldnt cross cache misses. (and that's the only way i could see Nick's scheme working: MESI coherency coupled with the speculative use of that cacheline's value never "surviving" a MESI invalidation of that cacheline. That would guarantee that once we have the lock, any speculative result is fully coherent and no other CPU has modified it.) Ingo -
OK, I'll try running some tests and get back to you on it. -- SUSE Labs, Novell Inc. -
On Tue, Jun 26, 2007 at 06:42:10PM +1000, Nick Piggin wrote: ... BTW, could you explain why the below diagnose doesn't relate to your solution? On 06/21/2007 12:08 PM, Ingo Molnar wrote: Thanks & regards, Jarek P. -
and if people _really_ want to boot a large-smp 32-bit kernel on some new, tons-of-cpus box, as a workaround they can enable the spinlock debugging code, which has no limitation on the number of CPUs supported. Ingo -
no real reason that i can recall - i guess nobody dared to touch it
because it used to have that 'volatile', indicating black voodoo ;-) Now
that the bad stigma has been removed, we could try the patch below. It
boots fine here, and we save 1K of kernel text size:
text data bss dec hex filename
6236003 611992 401408 7249403 6e9dfb vmlinux.before
6235075 611992 401408 7248475 6e9a5b vmlinux.after
I can understand why no data is saved by this change: gcc is aligning
the next field to a natural boundary anyway and we dont really have
arrays of spinlocks (fortunately). [and we save no data even if using
the ((packed)) attribute.] Perhaps some data structure that is never in
the kernel image itself still got smaller? Any good way to determine
that?
But why is the text size different? Ah: i think it's spin_lock_init()
getting shorter :-)
but this is certainly not something for 2.6.22, it's an early 2.6.23
matter i suspect.
Ingo
------------------->
From: Ingo Molnar <mingo@elte.hu>
Subject: [patch] spinlocks i386: change them to byte fields
all spinlock ops are on byte operands, so change the spinlock field to
be unsigned char. This saves a bit of kernel text size:
text data bss dec hex filename
6236003 611992 401408 7249403 6e9dfb vmlinux.before
6235075 611992 401408 7248475 6e9a5b vmlinux.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/asm-i386/spinlock_types.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux/include/asm-i386/spinlock_types.h
===================================================================
--- linux.orig/include/asm-i386/spinlock_types.h
+++ linux/include/asm-i386/spinlock_types.h
@@ -6,7 +6,7 @@
#endif
typedef struct {
- unsigned int slock;
+ unsigned char slock;
} raw_spinlock_t;
#define __RAW_SPIN_LOCK_UNLOCKED { 1 }
-Actually, some data structures could well shrink. Look at "struct task_struct", for example. Right now it has two spinlocks right next to each other (alloc_lock and pi_lock). Other data structures may have things like bitfields etc. But yeah, I'd not expect that to be very common, and in some cases you might have to re-order data structures to take advantage of better Oh, absolutely. Linus -
yeah. We've got init_task's task-struct embedded in the vmlinux, but it's aligned to 32 bytes, which probably hides this effect. We'd only see it if the size change just happened to bring a key data structure (which is also embedded in the vmlinux) just below a modulo 32 bytes boundary. The chance for that is around 6:32 per 'merge' event. That means that there cannot be all that many such cases ;-) anyway, the shorter init sequence is worth it already. Ingo -
Side note, and as a "truth in advertising" section: I'll have to admit that I argued against fair semaphores on the same grounds. I was wrong then (and eventually admitted it, and we obviously try to make our mutexes and semaphores fair these days!), and maybe I'm wrong now. If somebody can actually come up with a sequence where we have spinlock starvation, and it's not about an example of bad locking, and nobody really can come up with any other way to fix it, we may eventually have to add the notion of "fair spinlocks". So my arguments are purely pragmatic. It's not that I hate fairness per se. I dislike it only when it's used to "solve" (aka hide) other problems. In the end, some situations do need fairness, and the fact that aiming for fairness is often harder, slower, and more complicated than not doing so at that point turns into a non-argument. If you need it, you need it. I just don't think we need it, and we're better off solving problems other ways. (For example, we might also solve such problems by creating a separate "fair_spin_lock" abstraction, and only making the particular users that need it actually use it. It would depend a bit on whether the cost of implementing the fairness is noticeable enough for it to be worth having a separate construct for it). Linus -
I tried to find such a sequence, but I think its more a matter of hardware evolution, and some degenerated cases. In some years (months ?), it might possible to starve say the file struct spinlock of a process in a open()/close() infinite loop. This because the number of instruction per 'memory cache line transfert between cpus/core' is raising. But then one can say its a bug in user code :) Another way to starve kernel might be a loop doing settime() , since seqlock are quite special in serialization : Only seqlock's writers perform atomic ops, readers could be starved because of Maybe some *big* NUMA machines really want this fairness (even if it cost some cycles as pointed by Davide in http://lkml.org/lkml/2007/3/29/246 ) , I am just guessing since I cannot test such monsters. I tested Davide program on a Dual Opteron and got some perf difference. $ ./qspins -n 2 now testing: TICKLOCK timeres=4000 uscycles=1991 AVG[0]: 2195.250000 cycles/loop SIG[0]: 11.813657 AVG[1]: 2212.312500 cycles/loop SIG[1]: 38.038991 $ ./qspins -n 2 -s now testing: SPINLOCK timeres=4000 uscycles=1991 AVG[0]: 2066.000000 cycles/loop SIG[0]: 0.000000 AVG[1]: 2115.687500 cycles/loop -
there was one bad case i can remember: the spinlock debugging code had a trylock open-coded loop and on certain Opterons CPUs were starving each other. This used to trigger with the ->tree_lock rwlock i think, on heavy MM loads. The starvation got so bad that the NMI watchdog started triggering ... int
