login
Header Space

 
 

Re: [BUG] long freezes on thinkpad t60

Previous thread: [PATCH] smpboot: cachesize comparison fix in smp_tune_scheduling() by Jarek Poplawski on Thursday, May 24, 2007 - 6:33 am. (3 messages)

Next thread: none
To: <linux-kernel@...>
Cc: <mingo@...>, <linux-acpi@...>
Date: Thursday, May 24, 2007 - 8:04 am

On some strange workload involving strace and fuse I get ocasional
long periods (10-100s) of total unresponsiveness, not even SysRq-*
working.  Then the machine continues as normal.  Nothing in dmesg,
absolutely no indication about what is happening.

Tried nmi_watchdog=1, but then the machine locks up hard shortly after
boot.

Tried nmi_watchdog=1 acpi=off, then I can't reproduce the problem.

Tried 2.6.22-rc2 and 2.6.21, both with the same result.

.config and dmesg attached.

Any ideas?  Possibly something ACPI related?

Thanks,
Miklos

Linux version 2.6.22-rc2 (mszeredi@tucsk) (gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)) #3 SMP Tue May 22 17:55:42 CEST 2007
Command line: root=/dev/sda2
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
 BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003fed0000 (usable)
 BIOS-e820: 000000003fed0000 - 000000003fedf000 (ACPI data)
 BIOS-e820: 000000003fedf000 - 000000003ff00000 (ACPI NVS)
 BIOS-e820: 000000003ff00000 - 0000000040000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
 BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
 BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 0000000100000000 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 261840) 1 entries of 256 used
end_pfn_map = 1048576
DMI present.
ACPI: RSDP 000F67D0, 0024 (r2 LENOVO)
ACPI: XSDT 3FED1308, 008C (r1 LENOVO TP-79        2110  LTP        0)
ACPI: FACP 3FED1400, 00F4 (r3 LENOVO TP-79        2110 LNVO        1)
ACPI Warning (tbfadt-0434): Optional field "G...
To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <mingo@...>, <linux-acpi@...>
Date: Thursday, May 24, 2007 - 6:08 pm

NMIs in some thinkpads are bad trouble, they lock up the blasted IBM/Lenovo
SMBIOS if they happen to hit it while it is servicing a SMI, and thinkpads
do SMIs like crazy.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
-
To: Henrique de Moraes Holschuh <hmh@...>
Cc: Miklos Szeredi <miklos@...>, <linux-kernel@...>, <mingo@...>, <linux-acpi@...>
Date: Thursday, May 24, 2007 - 6:13 pm

there's also a L1 ASPM issue with the e1000 chipset in the way (for T60/X60 
only). Make sure you are using 2.6.21 or newer. See netdev archives for more on 
that.

Auke
-
To: Kok, Auke <auke-jan.h.kok@...>
Cc: Henrique de Moraes Holschuh <hmh@...>, Miklos Szeredi <miklos@...>, <linux-kernel@...>, <linux-acpi@...>
Date: Friday, May 25, 2007 - 2:58 am

Miklos is using latest -git and he has nmi_watchdog disabled - still the 
long pauses persist. (i've never seen that on my t60)

	Ingo
-
To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <linux-acpi@...>, Thomas Gleixner <tglx@...>
Date: Thursday, May 24, 2007 - 8:54 am

how reproducable are these lockups - could you possibly trace it? If yes 
then please apply:

  http://www.tglx.de/private/tglx/ht-debug/tracer.diff

and run the attached trace-it-1sec.c thing in a loop:

  echo 1 &gt; /proc/sys/kernel/mcount_enabled

  while true; do
     ./trace-it-1sec &gt; trace-`date`.txt
  done

and wait for the lockup. Once it happens, please upload the trace*.txt 
file that contains the lockup, i guess we'll be able to tell you more 
about the nature of the lockup. (Perhaps increase the sleep(1) to 
sleep(5) to capture longer periods and to increase the odds that you 
catch the lockup while the utility is tracing.)

	Ingo
To: <mingo@...>
Cc: <linux-kernel@...>, <linux-acpi@...>, <tglx@...>
Date: Thursday, May 24, 2007 - 10:03 am

With this patch boot stops at segfaulting fsck.  I enabled all the new
config options, is that not a good idea?  Which one exactly do I need?

Thanks,
Miklos

PS. tracer.diff needed some hacking to make it apply/compile.

Index: linux-2.6.22-rc2/include/linux/linkage.h
===================================================================
--- linux-2.6.22-rc2.orig/include/linux/linkage.h	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/include/linux/linkage.h	2007-05-24 15:57:43.000000000 +0200
@@ -3,6 +3,8 @@
 
 #include &lt;asm/linkage.h&gt;
 
+#define notrace __attribute ((no_instrument_function))
+
 #ifdef __cplusplus
 #define CPP_ASMLINKAGE extern "C"
 #else
Index: linux-2.6.22-rc2/Documentation/stable_api_nonsense.txt
===================================================================
--- linux-2.6.22-rc2.orig/Documentation/stable_api_nonsense.txt	2007-04-26 05:08:32.000000000 +0200
+++ linux-2.6.22-rc2/Documentation/stable_api_nonsense.txt	2007-05-24 15:57:42.000000000 +0200
@@ -62,6 +62,9 @@ consider the following facts about the L
       - different structures can contain different fields
       - Some functions may not be implemented at all, (i.e. some locks
 	compile away to nothing for non-SMP builds.)
+      - Parameter passing of variables from function to function can be
+	done in different ways (the CONFIG_REGPARM option controls
+	this.)
       - Memory within the kernel can be aligned in different ways,
 	depending on the build options.
   - Linux runs on a wide range of different processor architectures.
Index: linux-2.6.22-rc2/arch/i386/Kconfig
===================================================================
--- linux-2.6.22-rc2.orig/arch/i386/Kconfig	2007-05-22 16:25:05.000000000 +0200
+++ linux-2.6.22-rc2/arch/i386/Kconfig	2007-05-24 15:57:42.000000000 +0200
@@ -764,6 +764,14 @@ config BOOT_IOREMAP
 	depends on (((X86_SUMMIT || X86_GENERICARCH) &amp;&amp; NUMA) || (X86 &amp;&amp; EFI))
 	default y
 
+#
+# function tracing might turn this ...
To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <linux-acpi@...>, <tglx@...>
Date: Thursday, May 24, 2007 - 10:10 am

hm, you should only need these:

 CONFIG_EVENT_TRACE=y
 CONFIG_FUNCTION_TRACE=y
 # CONFIG_WAKEUP_TIMING is not set
 # CONFIG_CRITICAL_IRQSOFF_TIMING is not set
 CONFIG_MCOUNT=y

does it boot with these?

	Ingo
-
To: <mingo@...>
Cc: <linux-kernel@...>, <linux-acpi@...>, <tglx@...>
Date: Thursday, May 24, 2007 - 10:28 am

Nope.  Same segfault.  If I try to continue manually with 'init 5',
then init segfaults as well :(

Miklos
-
To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <linux-acpi@...>, <tglx@...>
Date: Thursday, May 24, 2007 - 10:42 am

does it go away if you turn off CONFIG_FUNCTION_TRACE? (that will make 
the trace a lot less verbose, but still informative)

	Ingo
-
To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <linux-acpi@...>, <tglx@...>
Date: Thursday, May 24, 2007 - 10:44 am

could you just try v2.6.21 plus the -rt patch, which has the tracer 
built-in? That's a combination that should work well. You can pick it up 
from:

   http://people.redhat.com/mingo/realtime-preempt/

same config options as above. If you dont turn on PREEMPT_RT you'll get 
an almost-vanilla kernel.

	Ingo
-
To: <mingo@...>
Cc: <linux-kernel@...>, <linux-acpi@...>, <tglx@...>
Date: Thursday, May 24, 2007 - 1:09 pm

2.6.22-rc2, only EVENT_TRACE - boots, can't rerpoduce
2.6.21-vanila - can reproduce
2.6.21-rt7, trace options off - can reproduce
2.6.21-rt7, trace options on - can't reproduce

Possibly something timing related, that's altered by the trace code.
I tried the trace kernel without starting the trace app, but still no
bug.

Any other ideas?

Thanks,
Miklos
-
To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <linux-acpi@...>, <tglx@...>
Date: Thursday, May 24, 2007 - 5:01 pm

perhaps try 2.6.21-rt7 with EVENT_TRACE on (the other trace options off) 
- does that hide the bug too?

	Ingo
-
To: <mingo@...>
Cc: <chris@...>, <linux-kernel@...>, <linux-acpi@...>, <tglx@...>
Date: Friday, May 25, 2007 - 5:51 am

The option itself doesn't hide the bug this time, I got one freeze
pretty quickly.  But after starting the trace-it-1sec loop, I couldn't
get it any more.

Normally I get a freeze after 3-5 minutes of testing, but with
trace-it-1sec there's still nothing after 30 minutes.

If I stop trace-it, and do "echo 0 &gt; /proc/sys/kernel/trace_enabled",
I get the freeze again. It's a perfect heisenbug.

This issue came up when I was testing a userspace fuse bug, and now
the reporter of that bug (added to CC) says that he also sometimes
experienced a hard lockup during testing but ignored it up to now.  So
we may yet get some info from him.

It could be something fuse related, but it's very hard to imagine how
fuse could trigger such a low-level problem.

Miklos
-
To: <mingo@...>
Cc: <chris@...>, <linux-kernel@...>, <tglx@...>
Date: Thursday, June 14, 2007 - 12:04 pm

I've got some more info about this bug.  It is gathered with
nmi_watchdog=2 and a modified nmi_watchdog_tick(), which instead of
calling die_nmi() just prints a line and calls show_registers().

This makes the machine actually survive the NMI tracing.  The attached
traces are gathered over about an hour of stressing.  An mp3 player is
also going on continually, and I can hear a couple of seconds of
"looping" quite often, but it gets as far as the NMI trace only
rarely.  AFAICS only the last pair shows a trace for both CPUs during
the same "freeze".

I've put some effort into understanding what's going on, but I'm not
familiar with how interrupts work and that sort of thing.

The pattern that emerges is that on CPU0 we have an interrupt, which
is trying to acquire the rq lock, but can't.

On CPU1 we have strace which is doing wait_task_inactive(), which sort
of spins acquiring and releasing the rq lock.  I've checked some of
the traces and it is just before acquiring the rq lock, or just after
releasing it, but is not actually holding it.

So is it possible that wait_task_inactive() could be starving the
other waiters of the rq spinlock?  Any ideas?

Thanks,
Miklos

NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: fuse e1000
Pid: 4625, comm: strace Not tainted 2.6.22-rc4 #10
RIP: 0010:[&lt;ffffffff80227ce5&gt;]  [&lt;ffffffff80227ce5&gt;] task_rq_lock+0x14/0x6f
RSP: 0018:ffff81001cf17ed8  EFLAGS: 00000046
RAX: 0000000000000246 RBX: ffff81001c9da540 RCX: ffff81003fd1e5e8
RDX: 0000000000000001 RSI: ffff81001cf17f10 RDI: ffff81001c9da540
RBP: ffff81001cf17ef8 R08: 0000000000000003 R09: 0000000000000000
R10: 00007fffbcd7c018 R11: 0000000000000246 R12: ffff81001c9da540
R13: ffff81001cf17f10 R14: ffff81001c9da540 R15: 00007fffbcd7d44c
FS:  00002b05ee28a6f0(0000) GS:ffff810001fd8ec0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaac02f000 CR3: 000000001ce52000 CR4: 00000000000006e0
Process strace (pid: 4625, threadinfo ff...
To: Miklos Szeredi <miklos@...>
Cc: <chris@...>, <linux-kernel@...>, <tglx@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Saturday, June 16, 2007 - 6:37 am

hm, this is really interesting, and indeed a smoking gun. The T60 has a 
Core2Duo and i've _never_ seen MESI starvation happen on dual-core 
single-socket CPUs! (The only known serious MESI starvation i know about 
is on multi-socket Opterons: there the trylock loop of spinlock 
debugging is known to starve some CPUs out of those locks that are being 
polled, so we had to turn off that aspect of spinlock debugging.)

wait_task_inactive(), although it busy-loops, is pretty robust: it does 
a proper spin-lock/spin-unlock sequence and has a cpu_relax() inbetween. 
Furthermore, the rep_nop() that cpu_relax() is based on is 
unconditional, so it's not like we could somehow end up not having the 
REP; NOP sequence there (which should make the lock polling even more 
fair)

could you try the quick hack below, ontop of cfs-v17? It adds two things 
to wait_task_inactive():

- a cond_resched() [in case you are running !PREEMPT]

- use MONITOR+MWAIT to monitor memory transactions to the rq-&gt;curr 
  cacheline. This should make the polling loop definitely fair.

If this solves the problem on your box then i'll do a proper fix and 
introduce a cpu_relax_memory_change(*addr) type of API to around 
monitor/mwait. This patch boots fine on my T60 - but i never saw your 
problem.

[ btw., utrace IIRC fixes ptrace to get rid of wait_task_interactive(). ]

	Ingo

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -834,6 +834,16 @@ repeat:
 		cpu_relax();
 		if (preempted)
 			yield();
+		else
+			cond_resched();
+		/*
+		 * Wait for "curr" to change:
+		 */
+		__monitor((void *)&amp;rq-&gt;curr, 0, 0);
+		smp_mb();
+		if (rq-&gt;curr != p)
+			__mwait(0, 0);
+
 		goto repeat;
 	}
 	task_rq_unlock(rq, &amp;flags);
-
To: <mingo@...>
Cc: <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <torvalds@...>, <akpm@...>
Date: Sunday, June 17, 2007 - 5:46 pm

Yes, the patch does make the pauses go away.  In fact just the

I looked at the utrace patch, and it still has wait_task_inactive(),
and I can still reproduce the freezes with the utrace patch applied.

Miklos
-
To: Miklos Szeredi <miklos@...>
Cc: <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <torvalds@...>, <akpm@...>
Date: Monday, June 18, 2007 - 2:43 am

cool! Could you send me the smallest patch you tried that still made the 
hangs go away?

	Ingo
-
To: <mingo@...>
Cc: <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <torvalds@...>, <akpm@...>
Date: Monday, June 18, 2007 - 3:24 am

My previous attempt was just commenting out parts of your patch.  But
maybe it's more logical to move the barrier to immediately after the
unlock.

With this patch I can't reproduce the problem, which may not mean very
much, since it was rather a "fragile" bug anyway.  But at least the
fix looks pretty harmless.

Thanks,
Miklos

Index: linux-2.6.22-rc4/kernel/sched.c
===================================================================
--- linux-2.6.22-rc4.orig/kernel/sched.c	2007-06-18 08:59:17.000000000 +0200
+++ linux-2.6.22-rc4/kernel/sched.c	2007-06-18 09:04:13.000000000 +0200
@@ -1168,6 +1168,11 @@ repeat:
 		/* If it's preempted, we yield.  It could be a while. */
 		preempted = !task_running(rq, p);
 		task_rq_unlock(rq, &amp;flags);
+		/*
+		 * Without this barrier, wait_task_inactive() can starve
+		 * waiters of rq-&gt;lock (observed on Core2Duo)
+		 */
+		smp_mb();
 		cpu_relax();
 		if (preempted)
 			yield();
-
To: Miklos Szeredi <miklos@...>
Cc: <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <torvalds@...>, <akpm@...>
Date: Monday, June 18, 2007 - 4:12 am

yeah. The problem is, that the open-coded loop there is totally fine, 
and we have similar loops elsewhere, so this problem could hit us again, 
in an even harder to debug place! Since this affects our basic SMP 
primitives, quite some caution is warranted i think.

So ... to inquire about the scope of the problem, another possibility 
would be for the _spin loop_ being 'too nice', not wait_task_inactive() 
being too agressive!

To test this theory, could you try the patch below, does this fix your 
hangs too? This change causes the memory access of the "easy" spin-loop 
portion to be more agressive: after the REP; NOP we'd not do the 
'easy-loop' with a simple CMPB, but we'd re-attempt the atomic op. (in 
theory the non-LOCK-ed memory accesses should have a similar effect, but 
maybe the Core2Duo has some special optimization for non-LOCK-ed 
cacheline accesses that causes cacheline starvation?)

	Ingo

----------------------------------------------------&gt;
Subject: [patch] x86: fix spin-loop starvation bug
From: Ingo Molnar &lt;mingo@elte.hu&gt;

Miklos Szeredi reported very long pauses (several seconds, sometimes 
more) on his T60 (with a Core2Duo) which he managed to track down to 
wait_task_inactive()'s open-coded busy-loop. He observed that an 
interrupt on one core tries to acquire the runqueue-lock but does not 
succeed in doing so for a very long time - while wait_task_inactive() on 
the other core loops waiting for the first core to deschedule a task 
(which it wont do while spinning in an interrupt handler).

The problem is: both the spin_lock() code and the wait_task_inactive() 
loop uses cpu_relax()/rep_nop(), so in theory the CPU should have 
guaranteed MESI-fairness to the two cores - but that didnt happen: one 
of the cores was able to monopolize the cacheline that holds the 
runqueue lock, for extended periods of time.

This patch changes the spin-loop to assert an atomic op after every REP 
NOP instance - this will cause the CPU to express its "MESI in...
To: Ingo Molnar <mingo@...>
Cc: Miklos Szeredi <miklos@...>, <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Monday, June 18, 2007 - 12:34 pm

I really think this the the wrong approach, although *testing* it makes 
sense.

I think we need to handle loops that take, release, and then immediately 
re-take differently.

Such loops are _usually_ of the form where they really just release the 
lock for some latency reason, but in this case I think it's actually just 
a bug.

That code does:

	if (unlikely(p-&gt;array || task_running(rq, p))) {

to decide if it needs to just unlock and repeat, but then to decide if it 
need to *yield* it only uses *one* of those tests (namely 

	preempted = !task_running(rq, p);
	..
	if (preempted)
		yield();

and I think that's just broken. It basically says:

 - if the task is running, I will busy-loop on getting/releasing the 
   task_rq_lock

and that is the _real_ bug here.

Trying to make the spinlocks do somethign else than what they do is just 
papering over the real bug. The real bug is that anybody who just 
busy-loops getting a lock is wasting resources so much that we should not 
be at all surprised that some multi-core or NUMA situations will get 
starvation.

Blaming some random Core 2 hardware implementation issue that just makes 
it show up is wrong. It's a software bug, plain and simple.

So how about this diff? The diff looks big, but the *code* is actually 
simpler and shorter, I just added tons of comments, which is what blows it 
up.

The new *code* looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
	        rq = task_rq(p);
	        while (task_running(rq, p))
	                cpu_relax();

		/* Get the *real* values */
	        rq = task_rq_lock(p, &amp;flags);
	        running = task_running(rq, p);
	        array = p-&gt;array;
	        task_rq_unlock(rq, &amp;flags);

		/* Check them.. */
	        if (unlikely(running)) {
	                cpu_relax();
	                goto repeat;
	        }

	        if (unlikely(array)) {
	                yield();
	                goto repeat;
	        }

and while I haven't tes...
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Miklos Szeredi <miklos@...>, <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Wednesday, June 20, 2007 - 5:36 am

I don't agree with this (+ I know it doesn't matter).

The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
And here they are simply lawlessly not fair.

I cannot see any reason why any of tasks doing simultaneously
"busy-loop on getting/releasing" a spinlock should starve almost
to death another one doing the same (or simply waiting to get this
lock) even without cpu_relax. Of course, lawfulness of such

On the other hand it seems spinlocks should be at least a little
more immune to such bugs: slowdown is OK but not freezing. Current
behavior could suggest this unfairness could harm some tasks even
without any loops present - but it's not visible enough.

So, I'm surprised this thread seems to stop after this patch, and
there is no try to make the most of this ideal testing case to
improve spinlock design btw.

Regards,
Jarek P.
-
To: Jarek Poplawski <jarkao2@...>
Cc: Ingo Molnar <mingo@...>, Miklos Szeredi <miklos@...>, <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Wednesday, June 20, 2007 - 1:34 pm

Well, that's certainly a valid standpoint. I wouldn't claim you're 
_wrong_.

At least in theory.

In *practice*, fair spinlocks are simply not possible. Not as in "it's 
hard to do", but as in "you simply cannot do it".

The thing is, most spinlocks get held for a rather short time (if that 
wasn't the case, you'd use something like a mutex), and it's acually a 
short time not on a "human scale", but on a "CPU core scale".

IOW, the cost of doing the cacheline bouncing is often equal to, or bigger 
than, the actual cost of the operation that happens inside the spinlock!

What does this result in? It automatically means that software simply 
*cannot* do fair spinlocks, because all the timing costs are in parts that 
software doesn't even have any visibility into, or control over!

Yeah, you could add artificial delays, by doing things like cycle counting 
around the spinlock operation to see if you got the lock when it was 
_contended_, or whether you got a lock that wasn't, and then adding some 
statistics etc, but at that point, you don't have a spinlock any more, you 
have something else. I don't know what to call it.

And you could add flags like "this spinlock is getting a lot of 
contention", try some other algorithm etc. But it's all complicated and 
against the whole *point* of a spinlock, which is to get in and out as 
fast as humanly possible.

So in practice, spinlock fairness is inherently tied to the hardware 
behaviour of the cache coherency algorithm.

Which gets us to the next level: we can consider hardware that isn't 
"fair" in its cache coherency to be buggy hardware.

That's also a perfectly fine standpoint, and it's actually one that I have 
a lot of sympathy for. I think it's much easier to some degree to do 
fairness in the cache coherency than at any other level, because it's 
something where the hardware really *could* do things like counting 
bouncing etc.

However, while it's a perfectly fine standpoint, it's also totally 
unrealistic. Fi...
To: Linus Torvalds <torvalds@...>
Cc: Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 3:30 am

yeah, and if there's no easy solution, change it to a mutex. Fastpath 
performance of spinlocks and mutexes is essentially the same, and if 
there's any measurable contention then the scheduler is pretty good at 
sorting things out. Say if the average contention is longer than 10-20 
microseconds then likely we could already win by scheduling away to some 
other task. (the best is of course to have no contention at all - but 
there are causes where it is real hard, and there are cases where it's 
outright unmaintainable.)

Hw makers are currently producing transistors disproportionatly faster 
than humans are producing parallel code, as a result of which we've got 
more CPU cache than ever, even taking natural application bloat into 
account. (it just makes no sense to spend those transistors on 
parallelism when applications are just not making use of it yet. Plus 
caches are a lot less power intense than functional units of the CPU, 
and the limit these days is power input.)

So scheduling more frequently and more agressively makes more sense than 

what worries me a bit though is that my patch that made spinlocks 
equally agressive to that loop didnt solve the hangs! So there is some 
issue we dont understand yet - why was the wait_inactive_task() 
open-coded spin-trylock loop starving the other core which had ... an 
open-coded spin-trylock loop coded up in assembly? And we've got a 
handful of other open-coded loops in the kernel (networking for example) 
so this issue could come back and haunt us in a situation where we dont 
have a gifted hacker like Miklos being able to spend _weeks_ to track 
down the problem...

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 11:50 am

Your parch kept doing "spin_trylock()", didn't it?

That's a read-modify-write thing, and keeps bouncing the cacheline back 
and forth, and together with the fact that even *after* you get the 
spinlock the "wait_for_inactive()" would actually end up looping back, 
releasing it, and re-getting it.

So the problem was that "wait_for_inactive()" kept the lock (because it 
actually *got* it), and looped over getting it, and because it was an 
exclusive cacheline ownership, that implies that somebody else is not 
getting it, and is kept from ever getting it.

So trying to use "trylock" doesn't help. It still has all the same bad 
sides - it still gets the lock (getting the lock wasn't the problem: 
_holding_ the lock was the problem), and it still kept the cache line for 
the lock on one core.

The only way to avoid lock contention is to avoid any exclusive use at 
all.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 12:08 pm

yeah - it changed spin_lock()'s assembly to do a "LOCK BTRL", which is a 
trylock which tries to dirty the cacheline. There was a "REP NOP" after 

ok, it's not completely clear where exactly the other core was spinning, 
but i took it from Miklos' observations that the other core was hanging 
in the _very same_ task_rq_lock() - which is a true spinlock as well 
that acquires it. So on one core the spin_lock() was starving, on 

so the problem was not the trylock based spin_lock() itself (no matter 
how it's structured in the assembly), the problem was actually modifying 
the lock and re-modifying it again and again in a very tight 

yeah - i'm not at all arguing in favor of the BTRL patch i did: i always 
liked the 'nicer' inner loop of spinlocks, which could btw also easily 
use MONITOR/MWAIT. (my patch is also quite close to what we did in 
spinlocks many years ago, so it's more of a step backwards than real 
progress.)

So it seems the problem was that if a core kept _truly_ modifying a 
cacheline via atomics in a high enough frequency, it could artificially 
starve the other core. (which would keep waiting for the cacheline to be 
released one day, and which kept the first core from ever making any 
progress) To me that looks like a real problem on the hardware side - 
shouldnt cacheline ownership be arbitrated a bit better than that?

Up to the point where some external event (perhaps a periodic SMM 
related to thermal management) broke the deadlock/livelock scenario?

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 12:44 pm

The "nice" inner loop is necessary or else it would generate huge amounts

A while ago I showed that spinlocks were a lot more fair when doing
unlock with the xchg instruction on x86. Probably the arbitration is all
screwed up because we use a mov instruction, which while atomic is not
locked.

-
To: Chuck Ebbert <cebbert@...>
Cc: Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 1:31 pm

No, the cache line arbitration doesn't know anything about "locked" vs 
"unlocked" instructions (it could, but there really is no point).

The real issue is that locked instructions on x86 are serializing, which 
makes them extremely slow (compared to just a write), and then by being 
slow they effectively just make the window for another core bigger.

IOW, it doesn't "fix" anything, it just hides the bug with timing.

You can hide the problem other ways by just increasing the delay between 
the unlock and the lock (and adding one or more serializing instruction in 
between is generally the best way of doing that, simply because otherwise 
micro- architecture may just be re-ordering things on you, so that your 
"delay" isn't actually in between any more!).

But adding delays doesn't really fix anything, of course. It makes things 
"fairer" by making *both* sides suck more, but especially if both sides 
are actually the same exact thing, I could well imagine that they'd both 
just suck equally, and get into some pattern where they are now both 
slower, but still see exactly the same problem!

Of course, as long as interrupts are on, or things like DMA happen etc, 
it's really *really* hard to be totally unlucky, and after a while you're 
likely to break out of the steplock on your own, just because the CPU's 
get interrupted by something else.

It's in fact entirely possible that the long freezes have always been 
there, but the NOHZ option meant that we had much longer stretches of time 
without things like timer interrupts to jumble up the timing! So maybe the 
freezes existed before, but with timer interrupts happening hundreds of 
times a second, they weren't noticeable to humans.

(Btw, that's just _one_ theory. Don't take it _too_ seriously, but it 
could be one of the reasons why this showed up as a "new" problem, even 
though I don't think the "wait_for_inactive()" thing has changed lately.)

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 4:18 pm

yeah. I think Linux is i think the only OS on the planet that is using 
the movb trick for unlock, it even triggered a hardware erratum ;) So it 
might surprise some hw makers who might rely on the heuristics that each 
critical section lock and unlock is a LOCK-ed instruction.

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 4:36 pm

I'm pretty sure others do it too.

Maybe not on an OS level (but I actually doubt that - I'd be surprised if 
Windows doesn't do the exact same thing), but I know for a fact that a lot 
of people in threaded libraries end up depending very much on the "simple 
store closes a locked section".

Just googling for "xchg" "mov" "spinlock" "-linux" shows discussion boards 
for Windows developers with open-coded spinlocks like


	int ResourceFlag = 0; // 0=Free, 1=Inuse
	...
	// Wait until we get the resource
	while(InterlockedExchange(&amp;ResourceFlag, 1) != 0) {
	   Sleep(0); } // Wait a tad
	// Have the resource
	... // do your thing
	ResourceFlag = 0; // Release the resource


and that's definitely Windows code, not some Linux person doing it.

And this is from an OS2 forum

	unsigned owned=0;

	void request() {
	  while(LockedExchanged(&amp;owned,1)!=0)
	    ;
	}

	void release() {
	  owned = 0;
	}

so it's not even something unusual.

So while arguably these people don't know (and don't care) about subtle 
issues like memory ordering, I can *guarantee* that a lot of programs 
depend on them, even if that dependency may often come from a lack of 
knowledge, rather than actively understanding what we do like in the Linux 
kernel community.

(And yes, they rely on compilers not reordering either. Tough.)

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 4:16 pm

the freezes that Miklos was seeing were hardirq contexts blocking in 
task_rq_lock() - that is done with interrupts disabled. (Miklos i think 
also tried !NOHZ kernels and older kernels, with a similar result.)

plus on the ptrace side, the wait_task_inactive() code had most of its 
overhead in the atomic op, so if any timer IRQ hit _that_ core, it was 
likely while we were still holding the runqueue lock!

i think the only thing that eventually got Miklos' laptop out of the 
wedge were timer irqs hitting the ptrace CPU in exactly those 
instructions where it was not holding the runqueue lock. (or perhaps an 
asynchronous SMM event delaying it for a long time)

	Ingo
-
To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Friday, June 22, 2007 - 4:17 am

even considering that the 'LOCK'-ed intruction was the heaviest in the 
busy-loop, the numbers still just dont add up to 'tens of seconds of 
lockups', so there must be something else happening too.

So here's an addition to the existing theories: the Core2Duo is a 
4-issue CPU architecture. Now, why does this matter? It matters to the 
timing of the delivery of interrupts. For example, on a 3-issue 
architecture, the instruction level profile of well-cached workloads 
often looks like this:

c05a3b71:      710      89 d6                   mov    %edx,%esi
c05a3b73:        0      8b 55 c0                mov    0xffffffc0(%ebp),%edx
c05a3b76:        0      89 c3                   mov    %eax,%ebx
c05a3b78:      775      8b 82 e8 00 00 00       mov    0xe8(%edx),%eax
c05a3b7e:        0      8b 48 18                mov    0x18(%eax),%ecx
c05a3b81:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax
c05a3b84:      792      89 1c 24                mov    %ebx,(%esp)
c05a3b87:        0      89 74 24 04             mov    %esi,0x4(%esp)
c05a3b8b:        0      ff d1                   call   *%ecx
c05a3b8d:        0      8b 4d c8                mov    0xffffffc8(%ebp),%ecx
c05a3b90:      925      8b 41 6c                mov    0x6c(%ecx),%eax
c05a3b93:        0      39 41 10                cmp    %eax,0x10(%ecx)
c05a3b96:        0      0f 85 a8 01 00 00       jne    c05a3d44 &lt;schedule+0x2a4&gt;
c05a3b9c:      949      89 da                   mov    %ebx,%edx
c05a3b9e:        0      89 f1                   mov    %esi,%ecx
c05a3ba0:        0      8b 45 c8                mov    0xffffffc8(%ebp),%eax

the second column is the number of times the profiling interrupt has hit 
that particular instruction.

Note the many zero entries - this means that for instructions that are 
well-cached, the issue order _prevents_ interrupts from _ever_ hitting 
to within a bundle of micro-ops that the decoder will issue! The above 
workload was a plain lat_ctx, so nothing special, and i...
To: <mingo@...>
Cc: <torvalds@...>, <cebbert@...>, <jarkao2@...>, <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Saturday, June 23, 2007 - 6:36 am

No.  If anything it made the freezes somwhat more frequent.

And it's not a NO_HZ kernel.

What I notice is that the interrupt distribution between the CPUs is
very asymmetric like this:

           CPU0       CPU1
  0:     220496         42   IO-APIC-edge      timer
  1:       3841          0   IO-APIC-edge      i8042
  8:          1          0   IO-APIC-edge      rtc
  9:       2756          0   IO-APIC-fasteoi   acpi
 12:       2638          0   IO-APIC-edge      i8042
 14:       7776          0   IO-APIC-edge      ide0
 16:       6083          0   IO-APIC-fasteoi   uhci_hcd:usb2
 17:      34414          3   IO-APIC-fasteoi   uhci_hcd:usb3, HDA Intel
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:         32          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
313:      11405          1   PCI-MSI-edge      eth0
314:      29417         10   PCI-MSI-edge      ahci
NMI:        164        118
LOC:     220499     220463
ERR:          0

and the freezes don't really change that.  And the NMI traces show,
that it's always CPU1 which is spinning in wait_task_inactive().

Miklos
-
To: Miklos Szeredi <miklos@...>
Cc: <mingo@...>, <torvalds@...>, <cebbert@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Monday, June 25, 2007 - 2:45 am

On Sat, Jun 23, 2007 at 12:36:08PM +0200, Miklos Szeredi wrote:
...

BTW, maybe I've missed this and it's unconnected, but I hope the
first config has been changed - especially this CONFIG_AGP_AMD64 = y,
and this bug from mm/slab.c has gone long ago...

Jarek P.
-
To: Miklos Szeredi <miklos@...>
Cc: <mingo@...>, <cebbert@...>, <jarkao2@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Saturday, June 23, 2007 - 12:39 pm

Well, the LOC thing is for the local apic timer, so while regular 
interrupts are indeed very skewed, both CPU's is nicely getting the local 
apic timer thing..

That said, the timer interrupt generally happens just a few hundred times 
a second, and if there's just a higher likelihood that it happens when the 
spinlock is taken, then half-a-second pauses could easily be just because 
even when the interrupt happens, it could be skewed to happen when the 
lock is held.

And that definitely is the case: the most expensive instruction _by_far_ 
in that loop is the actual locked instruction that acquires the lock 
(especially with the cache-line bouncing around), so an interrupt would be 
much more likely to happen right after that one rather than after the 
store that releases the lock, which can be buffered.

It can be quite interesting to look at instruction-level cycle profiling 
with oprofile, just to see where the costs are..

		Linus


-
To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 2:29 pm

On Thu, 21 Jun 2007 10:31:53 -0700 (PDT)

This reminds me Nick's proposal of 'queued spinlocks' 3 months ago

Maybe this should be re-considered ? (unlock is still a non atomic op, 
so we dont pay the serializing cost twice)

http://lwn.net/Articles/227506/

extract : 

Implement queued spinlocks for i386. This shouldn't increase the size of
the spinlock structure, while still able to handle 2^16 CPUs.

The queued spinlock has 2 fields, a head and a tail, which are indexes
into a FIFO of waiting CPUs. To take a spinlock, a CPU performs an
"atomic_inc_return" on the head index, and keeps the returned value as
a ticket. The CPU then spins until the tail index is equal to that
ticket.

To unlock a spinlock, the tail index is incremented (this can be non
atomic, because only the lock owner will modify tail).

Implementation inefficiencies aside, this change should have little
effect on performance for uncontended locks, but will have quite a
large cost for highly contended locks [O(N) cacheline transfers vs
O(1) per lock aquisition, where N is the number of CPUs contending].
The benefit is is that contended locks will not cause any starvation.

Just an idea. Big NUMA hardware seems to have fairness logic that
prevents starvation for the regular spinlock logic. But it might be
interesting for -rt kernel or systems with starvation issues. 
-
To: Eric Dumazet <dada1@...>
Cc: Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 2:44 pm

No. The point is simple:

	IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG!

I don't understand why this is even controversial. Especially since we 
have a patch for the problem that proves my point: the _proper_ way to fix 
things is to just not do the bad thing, instead of trying to allow the bad 
behaviour and try to handle it.

Things like queued spinlocks just make excuses for bad code. 

We don't do nesting locking either, for exactly the same reason. Are 
nesting locks "easier"? Absolutely. They are also almost always a sign of 
a *bug*. So making spinlocks and/or mutexes nest by default is just a way 

Umm. i386 spinlocks could and should be *one*byte*.

In fact, I don't even know why they are wasting four bytes right now: the 
fact that somebody made them an "int" just wastes memory. All the actual 
code uses "decb", so it's not even a question of safety. I wonder why we 
have that 32-bit thing and the ugly casts.

Ingo, any memory of that?

(And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't think 
such an insane machine has ever existed).

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Tuesday, June 26, 2007 - 4:42 am

Hmm, not that I have a strong opinion one way or the other, but I
don't know that they would encourage bad code. They are not going to
reduce latency under a locked section, but will improve determinism
in the contended case.

They should also improve performance in heavily contended case due to
the nature of how they spin, but I know that's not something you want
to hear about. And theoretically there should be no reason why xadd is
any slower than dec and look at the status flags, should there? I never
implementedit in optimised assembly to test though...

Some hardware seems to have no idea of fair cacheline scheduling, and
especially when there are more than 2 CPUs contending for the
cacheline, there can be large starvations.

And actually some times we have code that really wants to drop the
lock and queue behind other contenders. Most of the lockbreak stuff
for example.

Suppose we could have a completely fair spinlock primitive that has
*virtually* no downsides over the unfair version, you'd take the fair
one, right?

Not that I'm saying they'd ever be a good solution to bad code, but I

Anyway, I think the important point is that they can remain within 4
bytes, which is obviously the most important boundary (they could be
2 bytes on i386).

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Tuesday, June 26, 2007 - 1:23 pm

xadd really generally *is* slower than an add. One is often microcoded, 
the other is not.

But the real problem is that your "unlock" sequence is now about two 
orders of magnitude slower than it used to be. So it used to be that a 
spinlocked sequence only had a single synchronization point, now it has 
two. *That* is really bad, and I guarantee that it makes your spinlocks 
effectively twice as slow for the non-contended parts.

But your xadd thing might be worth looking at, just to see how expensive 
it is. As an _alternative_ to spinlocks, it's certainly viable.

(Side note: why make it a word? Word operations are slower on many x86 
implementations, because they add yet another prefix. You only need a 
byte)

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Wednesday, June 27, 2007 - 1:23 am

Oh. I found xadd to be not hugely slower on my P4, but it was a little

I don't know why my unlock sequence should be that much slower? Unlocked
mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't

No real reason I guess. I'll change it.

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Wednesday, June 27, 2007 - 2:04 am

Oh, that releasing "add" can be unlocked, and only the holder of the lock 
ever touches that field?

I must not have looked closely enough. In that case, I withdraw that 
objection, and the sequence-number-based spinlock sounds like a perfectly 
fine one.

Yes, the add will be slightly slower than the plain byte move, and the 
locked xadd will be slightly slower than a regular locked add, but 
compared to the serialization cost, that should be small. For some reason 
I thought you needed a locked instruction for the unlock too.

So try it with just a byte counter, and test some stupid micro-benchmark 
on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make 
it the normal spinlock sequence just because it isn't noticeably slower.

In fact, I think a "incb &lt;mem&gt;" instruction is even a byte shorter than 
"movb $1,mem", and with "unlock" being inlined, that could actually be a 
slight _win_.

			Linus
-
To: Nick Piggin <nickpiggin@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Wednesday, June 27, 2007 - 3:47 pm

Nick,
 call me a worry-wart, but I slept on this, and started worrying..


So I thought about this a bit more, and I like your sequence counter 
approach, but it still worried me.

In the current spinlock code, we have a very simple setup for a 
successful grab of the spinlock:

	CPU#0					CPU#1

	A (= code before the spinlock)
						lock release

	lock decb mem	(serializing instruction)

	B (= code after the spinlock)

and there is no question that memory operations in B cannot leak into A.

With the sequence counters, the situation is more complex:

	CPU #0					CPU #1

	A (= code before the spinlock)

	lock xadd mem	(serializing instruction)

	B (= code afte xadd, but not inside lock)

						lock release

	cmp head, tail

	C (= code inside the lock)

Now, B is basically the empty set, but that's not the issue I worry about. 
The thing is, I can guarantee by the Intel memory ordering rules that 
neither B nor C will ever have memops that leak past the "xadd", but I'm 
not at all as sure that we cannot have memops in C that leak into B!

And B really isn't protected by the lock - it may run while another CPU 
still holds the lock, and we know the other CPU released it only as part 
of the compare. But that compare isn't a serializing instruction!

IOW, I could imagine a load inside C being speculated, and being moved 
*ahead* of the load that compares the spinlock head with the tail! IOW, 
the load that is _inside_ the spinlock has effectively moved to outside 
the protected region, and the spinlock isn't really a reliable mutual 
exclusion barrier any more!

(Yes, there is a data-dependency on the compare, but it is only used for a 
conditional branch, and conditional branches are control dependencies and 
can be speculated, so CPU speculation can easily break that apparent 
dependency chain and do later loads *before* the spinlock load completes!)

Now, I have good reason to believe that all Intel and AMD CPU's have a 
stricter-than-documented memory o...
To: Linus Torvalds <torvalds@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Monday, July 2, 2007 - 3:06 am

...


Haven't made too much progress on this, but I have asked someone@amd who
might be able to at least know the right person to ask :P (might be faster
to ask Andi to ask :))

If you know someone at Intel then that would be appreciated.

It would be nice if it is safe (and can be guaranteed to be safe in future).

However OTOH, the fastpath may be even faster if we do it in a "definitely
safe" way.

That is, do the xaddw against 16-bits with the head in 8 of those and the
tail in the other 8. Then compare the byte registers of the register
returned by xaddw for the test. Although the xaddw is going to be slower
than an xaddb, this way we subsequently avoid the extra load completely,
while avoiding ordering issues.

In the slowpath we would have to have a token locked op in there (like
the current spinlocks do), but this could be taken out iff our inquiries
come back positive.

Anyway, I'll try redoing the patch and getting some numbers.

-- 
SUSE Labs, Novell Inc.
-
To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <nickpiggin@...>, Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, Linux Kernel Mailing List <linux-kernel@...>, <tglx@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 27, 2007 - 4:17 pm

Nice catch ;) But wasn't Intel suggesting in not relying on the old 
"strict" ordering rules? IOW shouldn't an mfence always be there? Not only 
loads could leak up into the wait phase, but stores too, if they have no 
dependency with the "head" and "tail" loads.



- Davide


-
To: Davide Libenzi <davidel@...>
Cc: Nick Piggin <nickpiggin@...>, Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, Linux Kernel Mailing List <linux-kernel@...>, <tglx@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 27, 2007 - 6:11 pm

Actually, both Intel and AMD engineers have been talking about making the 
documentation _stricter_, rather than looser. They apparently already are 
pretty damn strict, because not being stricter than the docs imply just 
ends up exposing too many potential problems in software that didn't 
really follow the rules.

For example, it's quite possible to do loads out of order, but guarantee 
that the result is 100% equivalent with a totally in-order machine. One 
way you do that is to keep track of the cacheline for any speculative 
loads, and if it gets invalidated before the speculative instruction has 
completed, you just throw the speculation away.

End result: you can do any amount of speculation you damn well please at a 
micro-architectural level, but if the speculation would ever have been 
architecturally _visible_, it never happens!

(Yeah, that is just me in my non-professional capacity of hw engineer 
wanna-be: I'm not saying that that is necessarily what Intel or AMD 

Stores never "leak up". They only ever leak down (ie past subsequent loads 
or stores), so you don't need to worry about them. That's actually already 
documented (although not in those terms), and if it wasn't true, then we 
couldn't do the spin unlock with just a regular store anyway.

(There's basically never any reason to "speculate" stores before other mem 
ops. It's hard, and pointless. Stores you want to just buffer and move as 
_late_ as possible, loads you want to speculate and move as _early_ as 
possible. Anything else doesn't make sense).

So I'm fairly sure that the only thing you really need to worry about in 
this thing is the load-load ordering (the load for the spinlock compare vs 
any loads "inside" the spinlock), and I'm reasonably certain that no 
existing x86 (and likely no future x86) will make load-load reordering 
effects architecturally visible, even if the implementation may do so 
*internally* when it's not possible to see it in the end result.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <nickpiggin@...>, Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, Linux Kernel Mailing List <linux-kernel@...>, <tglx@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 27, 2007 - 7:30 pm

Yes, Intel has never done that. They'll probably never do it since it'll 
break a lot of system software (unless they use a new mode-bit that 
allows system software to enable lose-ordering). Although I clearly 
remember to have read in one of their P4 optimization manuals to not 
assume this in the future.



- Davide


-
To: Davide Libenzi <davidel@...>
Cc: Nick Piggin <nickpiggin@...>, Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, Linux Kernel Mailing List <linux-kernel@...>, <tglx@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 27, 2007 - 8:46 pm

That optimization manual was confused. 

The Intel memory ordering documentation *clearly* states that only reads 
pass writes, not the other way around.

Some very confused people have thought that "pass" is a two-way thing. 
It's not. "Passing" in the Intel memory ordering means "go _ahead_ of", 
exactly the same way it means in traffic. You don't "pass" people by 
falling behind them.

It's also obvious from reading the manual, because any other reading would 
be very strange: it says

 1. Reads can be carried out speculatively and in any order

 2. Reads can pass buffered writes, but the processor is self-consistent

 3. Writes to memory are always carried out in program order [.. and then 
    lists exceptions that are not interesting - it's clflush and the 
    non-temporal stores, not any normal writes ]

 4. Writes can be buffered

 5. Writes are not performed speculatively; they are only performed for 
    instructions that have actually been retired.

 6. Data from buffered writes can be forwarded to waiting reads within the 
    processor.

 7. Reads or writes cannot pass (be carried out ahead of) I/O 
    instructions, locked instructions or serializing instructions.

 8. Reads cannot pass LFENCE and MFENCE instructions.

 9. Writes cannot pass SFENCE or MFENCE instructions.

The thing to note is:

 a) in (1), Intel says that reads can occur in any order, but (2) makes it 
    clear that that is only relevant wrt other _reads_

 b) in (2), they say "pass", but then they actually explain that "pass" 
    means "be carried out ahead of" in (7). 

    HOWEVER, it should be obvious in (2) even _without_ the explicit 
    clarification in (7) that "pass" is a one-way thing, because otherwise 
    (2) is totally _meaningless_. It would be meaningless for two reasons:

     - (1) already said that reads can be done in any order, so if that 
       was a "any order wrt writes", then (2) would be pointless. So (2) 
       must mean something *else* than "...
To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <nickpiggin@...>, Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, Linux Kernel Mailing List <linux-kernel@...>, <tglx@...>, Andrew Morton <akpm@...>
Date: Wednesday, June 27, 2007 - 11:03 pm

Yes, they were stating that clearly. IIWNOC (If I Were Not On Crack) I 
remember them saying to not assume any ordering besides the data 
dependency and the CPU self-consistency in the future CPUs, and to use 
*fence instructions when certain semantics were required.
But google did not help me in finding that doc, so maybe I were really on 
crack :)



- Davide


-
To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <nickpiggin@...>, Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Wednesday, June 27, 2007 - 4:10 pm

hm, i agree with you that this is problematic. Especially on an SMT CPU 
it would be a big architectural restriction if prefetches couldnt cross 
cache misses. (and that's the only way i could see Nick's scheme 
working: MESI coherency coupled with the speculative use of that 
cacheline's value never "surviving" a MESI invalidation of that 
cacheline. That would guarantee that once we have the lock, any 
speculative result is fully coherent and no other CPU has modified it.)

	Ingo
-
To: Linus Torvalds <torvalds@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Wednesday, June 27, 2007 - 2:20 am

OK, I'll try running some tests and get back to you on it.

-- 
SUSE Labs, Novell Inc.
-
To: Nick Piggin <nickpiggin@...>
Cc: Linus Torvalds <torvalds@...>, Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Tuesday, June 26, 2007 - 6:56 am

On Tue, Jun 26, 2007 at 06:42:10PM +1000, Nick Piggin wrote:
...

BTW, could you explain why the below diagnose doesn't relate
to your solution?

On 06/21/2007 12:08 PM, Ingo Molnar wrote:

Thanks &amp; regards,
Jarek P.
-
To: Linus Torvalds <torvalds@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 4:12 pm

and if people _really_ want to boot a large-smp 32-bit kernel on some 
new, tons-of-cpus box, as a workaround they can enable the spinlock 
debugging code, which has no limitation on the number of CPUs supported.

	Ingo
-
To: Linus Torvalds <torvalds@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 3:56 pm

no real reason that i can recall - i guess nobody dared to touch it 
because it used to have that 'volatile', indicating black voodoo ;-) Now 
that the bad stigma has been removed, we could try the patch below. It 
boots fine here, and we save 1K of kernel text size:

     text    data     bss     dec     hex filename
  6236003  611992  401408 7249403  6e9dfb vmlinux.before
  6235075  611992  401408 7248475  6e9a5b vmlinux.after

I can understand why no data is saved by this change: gcc is aligning 
the next field to a natural boundary anyway and we dont really have 
arrays of spinlocks (fortunately). [and we save no data even if using 
the ((packed)) attribute.] Perhaps some data structure that is never in 
the kernel image itself still got smaller? Any good way to determine 
that?

But why is the text size different? Ah: i think it's spin_lock_init() 
getting shorter :-)

but this is certainly not something for 2.6.22, it's an early 2.6.23 
matter i suspect.

	Ingo

-------------------&gt;
From: Ingo Molnar &lt;mingo@elte.hu&gt;
Subject: [patch] spinlocks i386: change them to byte fields

all spinlock ops are on byte operands, so change the spinlock field to 
be unsigned char. This saves a bit of kernel text size:

   text    data     bss     dec     hex filename
6236003  611992  401408 7249403  6e9dfb vmlinux.before
6235075  611992  401408 7248475  6e9a5b vmlinux.after

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 include/asm-i386/spinlock_types.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/asm-i386/spinlock_types.h
===================================================================
--- linux.orig/include/asm-i386/spinlock_types.h
+++ linux/include/asm-i386/spinlock_types.h
@@ -6,7 +6,7 @@
 #endif
 
 typedef struct {
-	unsigned int slock;
+	unsigned char slock;
 } raw_spinlock_t;
 
 #define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }
-
To: Ingo Molnar <mingo@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 4:10 pm

Actually, some data structures could well shrink.

Look at "struct task_struct", for example. Right now it has two spinlocks 
right next to each other (alloc_lock and pi_lock).

Other data structures may have things like bitfields etc.

But yeah, I'd not expect that to be very common, and in some cases you 
might have to re-order data structures to take advantage of better 

Oh, absolutely.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 4:23 pm

yeah. We've got init_task's task-struct embedded in the vmlinux, but 
it's aligned to 32 bytes, which probably hides this effect. We'd only 
see it if the size change just happened to bring a key data structure 
(which is also embedded in the vmlinux) just below a modulo 32 bytes 
boundary. The chance for that is around 6:32 per 'merge' event. That 
means that there cannot be all that many such cases ;-)

anyway, the shorter init sequence is worth it already.

	Ingo
-
To: Eric Dumazet <dada1@...>
Cc: Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 3:35 pm

Side note, and as a "truth in advertising" section: I'll have to admit 
that I argued against fair semaphores on the same grounds. I was wrong 
then (and eventually admitted it, and we obviously try to make our mutexes 
and semaphores fair these days!), and maybe I'm wrong now.

If somebody can actually come up with a sequence where we have spinlock 
starvation, and it's not about an example of bad locking, and nobody 
really can come up with any other way to fix it, we may eventually have to 
add the notion of "fair spinlocks".

So my arguments are purely pragmatic. It's not that I hate fairness per 
se. I dislike it only when it's used to "solve" (aka hide) other problems.

In the end, some situations do need fairness, and the fact that aiming for 
fairness is often harder, slower, and more complicated than not doing so 
at that point turns into a non-argument. If you need it, you need it.

I just don't think we need it, and we're better off solving problems other 
ways.

(For example, we might also solve such problems by creating a separate
"fair_spin_lock" abstraction, and only making the particular users that 
need it actually use it. It would depend a bit on whether the cost of 
implementing the fairness is noticeable enough for it to be worth having 
a separate construct for it).

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Chuck Ebbert <cebbert@...>, Ingo Molnar <mingo@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 4:36 pm

I tried to find such a sequence, but I think its more a matter of hardware 
evolution, and some degenerated cases.

In some years (months ?), it might possible to starve say the file struct 
spinlock of a process in a open()/close() infinite loop. This because the 
number of instruction per 'memory cache line transfert between cpus/core' is 
raising.

But then one can say its a bug in user code :)

Another way to starve kernel might be a loop doing settime() , since seqlock 
are quite special in serialization :

Only seqlock's writers perform atomic ops, readers could be starved because of 

Maybe some *big* NUMA machines really want this fairness (even if it cost some 
cycles as pointed by Davide in http://lkml.org/lkml/2007/3/29/246 ) , I am 
just guessing since I cannot test such monsters. I tested Davide program on a 
Dual Opteron and got some perf difference.


$ ./qspins  -n 2
now testing: TICKLOCK
timeres=4000
uscycles=1991
AVG[0]: 2195.250000 cycles/loop
SIG[0]: 11.813657
AVG[1]: 2212.312500 cycles/loop
SIG[1]: 38.038991

$ ./qspins  -n 2 -s
now testing: SPINLOCK
timeres=4000
uscycles=1991
AVG[0]: 2066.000000 cycles/loop
SIG[0]: 0.000000
AVG[1]: 2115.687500 cycles/loop

-
To: Linus Torvalds <torvalds@...>
Cc: Eric Dumazet <dada1@...>, Chuck Ebbert <cebbert@...>, Jarek Poplawski <jarkao2@...>, Miklos Szeredi <miklos@...>, <chris@...>, <linux-kernel@...>, <tglx@...>, <akpm@...>
Date: Thursday, June 21, 2007 - 4:09 pm

there was one bad case i can remember: the spinlock debugging code had a 
trylock open-coded loop and on certain Opterons CPUs were starving each 
other. This used to trigger with the -&gt;tree_lock rwlock i think, on 
heavy MM loads. The starvation got so bad that the NMI watchdog started 
triggering ...

int