login
Header Space

 
 

AIM7 40% regression with 2.6.26-rc1

Previous thread: on CONFIG_MM_OWNER=y, kernel panic is possible. by KOSAKI Motohiro on Tuesday, May 6, 2008 - 1:40 am. (10 messages)

Next thread: Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition by Neil Brown on Tuesday, May 6, 2008 - 2:53 am. (2 messages)
To: Matthew Wilcox <matthew@...>
Cc: LKML <linux-kernel@...>
Date: Tuesday, May 6, 2008 - 1:48 am

Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more than 40% with 2.6.26-rc1
on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. Bisect located
below patch.

64ac24e738823161693bf791f87adc802cf529ff is first bad commit
commit 64ac24e738823161693bf791f87adc802cf529ff
Author: Matthew Wilcox &lt;matthew@wil.cx&gt;
Date:   Fri Mar 7 21:55:58 2008 -0500

    Generic semaphore implementation


After I manually reverted the patch against 2.6.26-rc1 while fixing lots of
conflictions/errors, aim7 regression became less than 2%.

-yanmin


--
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Tuesday, May 6, 2008 - 7:44 am

hm, which exact semaphore would that be due to?

My first blind guess would be the BKL - there's not much other semaphore 
use left in the core kernel otherwise that would affect AIM7 normally. 
The VFS still makes frequent use of the BKL and AIM7 is very VFS 
intense. Getting rid of that BKL use from the VFS might be useful to 
performance anyway.

Could you try to check that it's indeed the BKL?

Easiest way to check it would be to run AIM7 it on 
sched-devel.git/latest and do scheduler tracing via:

   http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt

by doing:

   echo stacktrace &gt; /debug/tracing/iter_ctl

you could get exact backtraces of all scheduling points in the trace. If 
the BKL's down() shows up in those traces then it's definitely the BKL 
that causes this. The backtraces will also tell us exactly which BKL use 
is the most frequent one.

To keep tracing overhead low on SMP i'd also suggest to only trace a 
single CPU, via:

  echo 1 &gt; /debug/tracing/tracing_cpumask

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Tuesday, May 6, 2008 - 10:11 pm

Thank you guys for the quick response. I ran into many regressions with 2.6.26-rc1, but
just reported 2 of them because I located the patches. My machine is locating the root cause
of 30% regression of sysbench+mysql(oltp readonly) now. Bisect is not so qucik because either
kernel hang with testing or compilation fails.

Another specjbb2005 on Montvale is also under investigation.

Let me figure out how to clone your tree quickly as the network speed is very slow.

One clear weird behavior of aim7 is cpu idle is 0% with 2.6.25, but is more than 50% with

--
To: Ingo Molnar <mingo@...>
Cc: Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Tuesday, May 6, 2008 - 11:41 pm

With my patch+gprof, I collected some data. Below was outputed by gprof.

index % time    self  children    called     name
                0.00    0.00       2/223305376     __down_write_nested [22749]
                0.00    0.00       3/223305376     journal_commit_transaction [10526]
                0.00    0.00       6/223305376     __down_read [22745]
                0.00    0.00       8/223305376     start_this_handle [19167]
                0.00    0.00      15/223305376     sys_pause [19808]
                0.00    0.00      17/223305376     log_wait_commit [11047]
                0.00    0.00      20/223305376     futex_wait [8122]
                0.00    0.00      64/223305376     pdflush [14335]
                0.00    0.00      71/223305376     do_get_write_access [5367]
                0.00    0.00      84/223305376     pipe_wait [14460]
                0.00    0.00     111/223305376     kjournald [10726]
                0.00    0.00     116/223305376     int_careful [9634]
                0.00    0.00     224/223305376     do_nanosleep [5418]
                0.00    0.00    1152/223305376     watchdog [22065]
                0.00    0.00    4087/223305376     worker_thread [22076]
                0.00    0.00    5003/223305376     __mutex_lock_killable_slowpath [23305]
                0.00    0.00    7810/223305376     ksoftirqd [10831]
                0.00    0.00    9389/223305376     __mutex_lock_slowpath [23306]
                0.00    0.00   10642/223305376     io_schedule [9813]
                0.00    0.00   23544/223305376     migration_thread [11495]
                0.00    0.00   35319/223305376     __cond_resched [22673]
                0.00    0.00   49065/223305376     retint_careful [16146]
                0.00    0.00  119757/223305376     sysret_careful [20074]
                0.00    0.00  151717/223305376     do_wait [5545]
                0.00    0.00  250221/223305376     do_exit [5356]
                0.00    0.00  303836/223305376     cpu_idle [43...
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 7:00 am

I have an older patchkit that introduced unlocked_fnctl for some cases. It was 
briefly in mm but then dropped. Sounds like it is worth resurrecting?

tty_* is being taken care of by Alan.

chrdev_open is more work.

-Andi (who BTW never quite understood why BKL is a semaphore now and not
a spinlock?)
--
To: Andi Kleen <andi@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Ingo Molnar <mingo@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 9:59 am

On Wed, 07 May 2008 13:00:14 +0200

The tty open/close paths are probably a few months away from dropping the
BKL.

Alan
--
To: Andi Kleen <andi@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 7:46 am

Not sure what you're talking about here, Andi.  The only lock_kernel in
fcntl.c is around the call to -&gt;fasync.  And Yanmin's traces don't show

See git commit 6478d8800b75253b2a934ddcb734e13ade023ad0

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 8:21 am

I am aware of that commit, thank you, but the comment was refering to that it 
came with about zero justification why it was done. For the left over BKL 
regions which are relatively short surely a spinlock would be better than a 
semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL.

If that was done all these regressions would disappear I bet. That said
of course it is still better to actually fix the lock_kernel()s, but shorter
time just fixing lock_kernel again would be easier.

-Andi

--
To: Andi Kleen <andi@...>
Cc: Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 10:36 am

I do agree. 

I think turning the BKL into a semaphore was fine per se, but that was 
when semaphores were fast. 

Considering the apparent AIM regressions, we really either need to revert 
the semaphore consolidation, or we need to fix the kernel lock. And by 
"fixing", I don't mean removing it - it will happen, but it almost 
certainly won't happen for 2.6.26.

The easiest approach would seem to just turn the BKL into a mutex instead, 
which should hopefully be about as optimized as the old semaphores.

But my preferred option would indeed be just turning it back into a 
spinlock - and screw latency and BKL preemption - and having the RT people 
who care deeply just work on removing the BKL in the long run.

Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious 
when we now apparently have even more reason to aim for removing the BKL 
rather than trying to mess around with it.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 12:20 pm

hm, do we know it for a fact that the 40% AIM regression is due to the 
fastpath overhead of the BKL? It would be extraordinary if so.

I think it is far more likely that it's due to the different scheduling 
and wakeup behavior of the new kernel/semaphore.c code. So the fix would 
be to restore the old scheduling behavior - that's what Yanmin's manual 
revert did and that's what got him back the previous AIM7 performance.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 12:35 pm

Yes, Yanmin's manual revert got rid of the new semaphores entirely. Which 
was what, 7500 lines of code removed that got reverted.

And the *WHOLE* and *ONLY* excuse for dropping the spinlock lock_kernel 
was this (and I quote your message):

    remove the !PREEMPT_BKL code.
    
    this removes 160 lines of legacy code.

in other words, your only stated valid reason for getting rid of the 
spinlock was 160 lines, and the comment didn't even match what it did (it 
removed the spinlocks entirely, not just the preemptible version).

In contrast, the revert adds 7500 lines. If you go by the only documented 
reason for the crap that is the current BKL, then I know which one I'll 
take. I'll take the spinlock back, and I'd rather put preemption back 
than ever take those semaphores.

And even that's ignoring another issue: did anybody ever even do that AIM7 
benchmark comparing spinlocks to the semaphore-BKL? It's quite possible 
that the semaphores (even the well-behaved ones) behaved worse than the 
spinlocks.

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 1:05 pm

i wouldnt advocate a 7500 revert instead of a 160 lines change.

my suggestion was that the scheduling behavior of the new 
kernel/semaphore.c code is causing the problem - i.e. making it match 

it was removed by me in the course of this discussion:

   http://lkml.org/lkml/2008/1/2/58

the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the 
spinlock version] was broken for a longer period of time (it crashed 
trivially), because nobody apparently used it. People (Nick) asked why 
it was still there and i agreed and removed it. CONFIG_PREEMPT_BKL=y was 
the default, that was what all distros used. I.e. the spinlock code was 
in essence dead code at that point in time.

the spinlock code might in fact perform _better_, but nobody came up 

that's a good question...

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 1:24 pm

Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL was 
the known-flaky one.

The thread you point to also says that it's PREEMPT_BKL=y that was the 
problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I was 
still using reiserfs."), not the plain spinlock approach.

But it would definitely be interesting to see the crash reports. And the 
help message always said "Say N if you are unsure." even if it ended up 
being marked 'y' by default at some point (and then in January was made 
first unconditional, and then removed entirely)

Because in many ways, the non-preempt BKL is the *much* simpler case. I 
don't see why it would crash - it just turns the BKL into a trivial 
counting spinlock that can sleep.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 1:36 pm

no, there was another problem (which i couldnt immediately find because 
lkml.org only indexes part of the threads, i'll research it some more), 

yeah. The latencies are a different problem, and indeed were reported 
against PREEMPT_BKL, and believed to be due to reiser3 and the tty code. 
(reiser3 runs almost all of its code under the BKL)

The !PREEMPT_BKL crash was some simple screwup on my part of getting 
atomicity checks wrong in cond_resched() - and it went unnoticed for a 
long time - or something like that. I'll try to find that discussion.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 1:55 pm

Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe 
way. It uses just

	!(preempt_count() &amp; PREEMPT_ACTIVE)

to see whether it can schedule, and it should probably use in_atomic() 
which ignores the kernel lock.

But right now, that whole thing is disabled if PREEMPT is on anyway, so in 
effect (with my test patch, at least) cond_preempt() would just be a no-op 
if PREEMPT is on, even if BKL isn't preemptable.

So it doesn't look buggy, but it looks like it might cause longer 
latencies than strictly necessary. And if somebody depends on 
cond_resched() to avoid some bad livelock situation, that would obviously 
not work (but that sounds like a fundamental bug anyway, I really hope 

Yes, some silly bug sounds more likely. Especially considering how many 
different cases there were (semaphores vs spinlocks vs preemptable 
spinlocks).

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 1:59 pm

Funny you should mention it; locks.c uses cond_resched() assuming that
it ignores the BKL.  Not through needing to avoid livelock, but it does
presume that other higher priority tasks contending for the lock will
get a chance to take it.  You'll notice the patch I posted yesterday
drops the file_lock_lock around the call to cond_resched().

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 2:17 pm

Well, this would only be noticeable with CONFIG_PREEMPT.

If you don't have preempt enabled, it looks like everything should work 
ok: the kernel lock wouldn't increase the preempt count, and 
_cond_resched() works fine.

If you're PREEMPT, then the kernel lock would increase the preempt count, 
and _cond_resched() would refuse to re-schedule it, *but* with PREEMPT 
you'd never see it *anyway*, because PREEMPT will disable cond_resched() 
entirely (because preemption takes care of normal scheduling latencies 
without it).

And I'm also sure that this all worked fine at some point, and it's 
largely a result just of the multiple different variations of BKL 
preemption coupled with some of them getting removed entirely, so the code 
that used to handle it just got corrupt over time. See commit 02b67cc3b, 
for example. 

.. Hmm ... Time passes. Linus looks at git history.

It does look like "cond_resched()" has not worked with the BKL since 2005, 
and hasn't taken the BKL into account. Commit 5bbcfd9000:

    [PATCH] cond_resched(): fix bogus might_sleep() warning

+       if (unlikely(preempt_count()))
+               return;

which talks about the BKS, ie it only took the *semaphore* implementation 
into account. Never the spinlock-with-preemption-count one.

Or am I blind?

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Matthew Wilcox <matthew@...>, Andi Kleen <andi@...>, Zhang, Yanmin <yanmin_zhang@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 2:49 pm

hm, i think you are right.

most latency reduction was concentrated on the PREEMPT+PREEMPT_BKL case, 
and not getting proper cond_resched() behavior in case of !PREEMPT_BKL 
would certainly not be noticed by distros or users.

We made CONFIG_PREEMPT_BKL=y the default on SMP in v2.6.8, in this 
post-2.6.7 commit that introduced the feature:

|  commit fb8f6499abc6a847109d9602b797aa6afd2d5a3d
|  Author: Ingo Molnar &lt;mingo@elte.hu&gt;
|  Date:   Fri Jan 7 21:59:57 2005 -0800
|
|     [PATCH] remove the BKL by turning it into a semaphore

There was constant trouble around all these variations of preemptability 
and their combination with debugging helpers. (So i was rather happy to 
get rid of !PREEMPT_BKL - in the (apparently wrong) assumption that no 
tears will be shed.)

	Ingo
--
To: Andi Kleen <andi@...>
Cc: Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 11:19 am

Here's a trial balloon patch to do that.

Yanmin - this is not well tested, but the code is fairly obvious, and it 
would be interesting to hear if this fixes the performance regression. 
Because if it doesn't, then it's not the BKL, or something totally 
different is going on.

Now, we should probably also test just converting the thing to a mutex, 
to see if that perhaps also fixes it.

			Linus

---
 arch/mn10300/Kconfig    |   11 ----
 include/linux/hardirq.h |   18 ++++---
 kernel/sched.c          |   27 ++---------
 lib/kernel_lock.c       |  120 +++++++++++++++++++++++++++++++---------------
 4 files changed, 95 insertions(+), 81 deletions(-)

diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
index 6a6409a..e856218 100644
--- a/arch/mn10300/Kconfig
+++ b/arch/mn10300/Kconfig
@@ -186,17 +186,6 @@ config PREEMPT
 	  Say Y here if you are building a kernel for a desktop, embedded
 	  or real-time system.  Say N if you are unsure.
 
-config PREEMPT_BKL
-	bool "Preempt The Big Kernel Lock"
-	depends on PREEMPT
-	default y
-	help
-	  This option reduces the latency of the kernel by making the
-	  big kernel lock preemptible.
-
-	  Say Y here if you are building a kernel for a desktop system.
-	  Say N if you are unsure.
-
 config MN10300_CURRENT_IN_E2
 	bool "Hold current task address in E2 register"
 	default y
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 897f723..181006c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -72,6 +72,14 @@
 #define in_softirq()		(softirq_count())
 #define in_interrupt()		(irq_count())
 
+#if defined(CONFIG_PREEMPT)
+# define PREEMPT_INATOMIC_BASE kernel_locked()
+# define PREEMPT_CHECK_OFFSET 1
+#else
+# define PREEMPT_INATOMIC_BASE 0
+# define PREEMPT_CHECK_OFFSET 0
+#endif
+
 /*
  * Are we running in atomic context?  WARNING: this macro cannot
  * always detect atomic context; in particular, it cannot know about
@@ -79,17 +87,11 @@
  * used in the general case to...
To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 10:44 pm

Congratulations! The patch really fixes the regression completely!
vmstat showed cpu idle is 0%, just like 2.6.25's.

Some config options in my .config file:

CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y


--
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 2:43 am

great! Yanmin, could you please also check the other patch i sent (also 
attached below), does it solve the regression similarly?

	Ingo

---
 lib/kernel_lock.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
 	task-&gt;lock_depth = -1;
 	preempt_enable_no_resched();
 
-	down(&amp;kernel_sem);
+	while (down_trylock(&amp;kernel_sem))
+		cpu_relax();
 
 	preempt_disable();
 	task-&gt;lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
 	struct task_struct *task = current;
 	int depth = task-&gt;lock_depth + 1;
 
-	if (likely(!depth))
+	if (likely(!depth)) {
 		/*
 		 * No recursion worries - we set up lock_depth _after_
 		 */
-		down(&amp;kernel_sem);
+		while (down_trylock(&amp;kernel_sem))
+			cpu_relax();
+	}
 
 	task-&gt;lock_depth = depth;
 }
--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 3:14 am

With your patch, aim7 regression becomes less than 2%. I ran the testing twice.

Linus' patch could recover it completely. As aim7 result is quite stable(usually

--
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 3:39 am

is this the old original aim7 you are running, or osdl-aim-7 or 
re-aim-7?

if it's aim7 then this is a workload that starts+stops 2000 parallel 
tasks that each start and exit at the same time. That might explain its 
sensitivity on the BKL - this is all about tty-controlled task startup 
and exit.

i could not get it to produce anywhere close to stable results though. I 
also frequently get into this problem:

  AIM Multiuser Benchmark - Suite VII Run Beginning
  Tasks    jobs/min  jti  jobs/min/task      real       cpu
   2000
  Failed to execute
          new_raph 200
  Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17

  Failed to execute
          disk_cp /mnt/shm
  disk_cp (1): cannot open /mnt/shm/tmpa.common
  disk1.c: No such file or directory

  [.. etc. a large stream of them .. ]

system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its 
work files.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 4:44 am

I useold AIM7 plus a small patch which is just to change a couple of data type to match
My machine has 8GB. To simulate your environment, I reserve 6GB for hugetlb, then reran the testing
and didn't see any failure except:
AIM Multiuser Benchmark - Suite VII Run Beginning

Tasks    jobs/min  jti  jobs/min/task      real       cpu
 2000create_shared_memory(): can't create semaphore, pausing...
create_shared_memory(): can't create semaphore, pausing...


Above info doesn't mean errors.

Perhaps you could:
1) Apply the attched aim9 patch;
2) check if you have write right under /mnt/shm;
3) echo "/mnt/shm"&gt;aim7_path/config;
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 5:21 am

that failure message you got worries me - it indicates that your test 
ran out of IPC semaphores. You can fix it via upping the semaphore 
limits via:

   echo "500 32000 128 512" &gt; /proc/sys/kernel/sem

could you check that you still get similar results with this limit 
fixed?

note that once i've fixed the semaphore limits it started running fine 
here. And i see zero idle time during the run on a quad core box.

here are my numbers:

  # on v2.6.26-rc1-166-gc0a1811

  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55851.4         93      208.4   793.6   0.4654   # BKL: sleep
  2000    55402.2         79      210.1   800.1   0.4617

  2000    55728.4         93      208.9   795.5   0.4644   # BKL: spin
  2000    55787.2         93      208.7   794.5   0.4649   #

so the results are the same within noise.

I'll also check this workload on an 8-way box to make sure it's OK on 
larger CPU counts too.

could you double-check your test?

plus a tty tidbit as well, during the test i saw a few of these:

 Warning: dev (tty1) tty-&gt;count(639) != #fd's(638) in release_dev
 Warning: dev (tty1) tty-&gt;count(462) != #fd's(463) in release_dev
 Warning: dev (tty1) tty-&gt;count(274) != #fd's(275) in release_dev
 Warning: dev (tty1) tty-&gt;count(4) != #fd's(3) in release_dev
 Warning: dev (tty1) tty-&gt;count(164) != #fd's(163) in release_dev

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 5:30 am

A quick test showed it does work.


--
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 5:29 am

false alarm there - these were due to the breakage in the hack-patch i 
used ...

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>
Date: Thursday, May 8, 2008 - 2:48 am

but but but.  Some other users of down() have presumably also regressed.  We just
have't found the workload to demonstrate that yet.

--
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Wednesday, May 7, 2008 - 11:29 pm

Well, that shows that it was the BKL.

That said, "idle 0%" is easy when you spin. Do you also have actual 
performance numbers? I'd hope that not only do we use full CPU time, it's 
also at least as fast as the old semaphores were?

While I've been dissing sleeping locks (because their overhead is so 
high), at least in _theory_ they can get better behavior when not 
spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure, 
but I'd still like to hear actual performance numbers too, just to be 
sure.

Anyway, at least the "was it the BKL or some other semaphore user" 
question is entirely off the table.

So we need to

 - fix the BKL. My patch may be a good starting point, but there are 
   alternatives:

    (a) reinstate the old BKL code entirely

	Quite frankly, I'd prefer not to. Not only did it have three 
	totally different cases, some of them were apparently broken (ie 
	BKL+regular preempt didn't cond_resched() right), and I just don't 
	think it's worth maintaining three different versions, when 
	distro's are going to pick one anyway. *We* should pick one, and 
	maintain it.

    (b) screw the special BKL preemption - it's a spinlock, we don't 
	preempt other spinlocks, but at least fix BKL+preempt+cond_resched
	thing. 

	This would be "my patch + fixes" where at least one of the fixes 
	is the known (apparently old) cond_preempt() bug.

    (c) Try to keep the 2.6.25 code as closely as possible, but just 
	switch over to mutexes instead.

	I dunno. I was never all that enamoured with the BKL as a sleeping 
	lock, so I'm biased against this one, but hey, it's just a 
	personal bias.

 - get rid of the BKL anyway, at least in anything that is noticeable.

	Matthew's patch to file locking is probably worth doing as-is, 
	simply because I haven't heard any better solutions. The BKL 
	certainly can't be it, and whatever comes out of the NFSD 
	discussion will almost certainly involve just making sure that 
	those leases just use the new ...
To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 12:08 am

Yes. My conclusion is based on the actual number. cpu idle 0% is just

--
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>
Date: Thursday, May 8, 2008 - 12:17 am

Thanks, that's all I wanted to verify.

I'll leave this overnight, and see if somebody has come up with some smart 
and wonderful patch. And if not, I think I'll apply mine as "known to fix 
a regression", and we can perhaps then improve on things further from 
there.

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 8:01 am

hey, i happen to have such a smart and wonderful patch =B-)

i reproduced the AIM7 workload and can confirm Yanmin's findings that 
-.26-rc1 regresses over .25 - by over 67% here.

Looking at the workload i found and fixed what i believe to be the real 
bug causing the AIM7 regression: it was inefficient wakeup / scheduling 
/ locking behavior of the new generic semaphore code, causing suboptimal 
performance.

The problem comes from the following code. The new semaphore code does 
this on down():

        spin_lock_irqsave(&amp;sem-&gt;lock, flags);
        if (likely(sem-&gt;count &gt; 0))
                sem-&gt;count--;
        else
                __down(sem);
        spin_unlock_irqrestore(&amp;sem-&gt;lock, flags);

and this on up():

        spin_lock_irqsave(&amp;sem-&gt;lock, flags);
        if (likely(list_empty(&amp;sem-&gt;wait_list)))
                sem-&gt;count++;
        else
                __up(sem);
        spin_unlock_irqrestore(&amp;sem-&gt;lock, flags);

where __up() does:

        list_del(&amp;waiter-&gt;list);
        waiter-&gt;up = 1;
        wake_up_process(waiter-&gt;task);

and where __down() does this in essence:

        list_add_tail(&amp;waiter.list, &amp;sem-&gt;wait_list);
        waiter.task = task;
        waiter.up = 0;
        for (;;) {
                [...]
                spin_unlock_irq(&amp;sem-&gt;lock);
                timeout = schedule_timeout(timeout);
                spin_lock_irq(&amp;sem-&gt;lock);
                if (waiter.up)
                        return 0;
        }

the fastpath looks good and obvious, but note the following property of 
the contended path: if there's a task on the -&gt;wait_list, the up() of 
the current owner will "pass over" ownership to that waiting task, in a 
wake-one manner, via the waiter-&gt;up flag and by removing the waiter from 
the wait list.

That is all and fine in principle, but as implemented in 
kernel/semaphore.c it also creates a nasty, hidden source of con...
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 9:56 am

On Thu, 8 May 2008 14:01:30 +0200

ok sounds like I like the fairness part of the new semaphores (but
obviously not the 67% performance downside; I'd expect to sacrifice a
little performance.. but this much??).

It sucks though; if this were a mutex, we could wake up the owner of
the bugger in the contented acquire path synchronously.... but these 
are semaphores, and don't have an owner ;( bah bah bah
--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 9:20 am

I did note that earlier downthread ... although to be fair, I thought of
it in terms of three tasks with the third task coming in and stealing
the second tasks's wakeup rather than the first task starving the second

Fair is certainly the enemy of throughput (see also dbench arguments
passim).  It may be that some semaphore users really do want fairness --
it seems pretty clear that we don't want fairness for the BKL.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Linus Torvalds <torvalds@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Thursday, May 8, 2008 - 11:01 am

i dont think we need to consider any theoretical arguments about 
fairness here as there's a fundamental down-to-earth maintenance issue 
that governs: old semaphores were similarly unfair too, so it is just a 
bad idea (and a bug) to change behavior when implementing new, generic 
semaphores that are supposed to be a seemless replacement! This is about 
legacy code that is intended to be phased out anyway.

This is already a killer argument and we wouldnt have to look any 
further.

but even on the more theoretical level i disagree: fairness of CPU time 
is something that is implemented by the scheduler in a natural way 
already. Putting extra ad-hoc synchronization and scheduling into the 
locking primitives around data structures only gives mathematical 
fairness and artificial micro-scheduling, it does not actually make the 
end result more useful! This is especially true for the BKL which is 
auto-dropped by the scheduler anyway. (so descheduling a task will 
automatically release it of its BKL ownership)

For example we've invested a _lot_ of time and effort into adding lock 
stealing (i.e. intentional "unfairness") to kernel/rtmutex.c. Which is a 
_lot_ harder to do atomically with PI constraints but still possible and 
makes sense in the grand scheme of things. kernel/mutex.c is also 
"unfair" - and that's correct IMO.

For the BKL in particular there's almost no sense to talk about any 
underlying resource and there's almost no expectation from users for 
that imaginery resource to be shared fairly.

	Ingo
--
To: Linus Torvalds <torvalds@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 8:28 am

Peter pointed it out that because sem-&gt;count is u32, the &lt;= 0 is in fact 
a "== 0" condition - the patch below does that. As expected gcc figured 
out the same thing too so the resulting code output did not change. (so 
this is just a cleanup)

i've got this lined up in sched.git and it's undergoing testing right 
now. If that testing goes fine and if there are no objections i'll send 
a pull request for it later today.

	Ingo

----------------&gt;
Subject: semaphores: improve code
From: Ingo Molnar &lt;mingo@elte.hu&gt;
Date: Thu May 08 14:19:23 CEST 2008

No code changed:

kernel/semaphore.o:

   text	   data	    bss	    dec	    hex	filename
   1207	      0	      0	   1207	    4b7	semaphore.o.before
   1207	      0	      0	   1207	    4b7	semaphore.o.after

md5:
   c10198c2952bd345a1edaac6db891548  semaphore.o.before.asm
   c10198c2952bd345a1edaac6db891548  semaphore.o.after.asm

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 kernel/semaphore.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,7 +54,7 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&amp;sem-&gt;lock, flags);
-	if (unlikely(sem-&gt;count &lt;= 0))
+	if (unlikely(!sem-&gt;count))
 		__down(sem);
 	sem-&gt;count--;
 	spin_unlock_irqrestore(&amp;sem-&gt;lock, flags);
@@ -76,7 +76,7 @@ int down_interruptible(struct semaphore 
 	int result = 0;
 
 	spin_lock_irqsave(&amp;sem-&gt;lock, flags);
-	if (unlikely(sem-&gt;count &lt;= 0))
+	if (unlikely(!sem-&gt;count))
 		result = __down_interruptible(sem);
 	if (!result)
 		sem-&gt;count--;
@@ -102,7 +102,7 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&amp;sem-&gt;lock, flags);
-	if (unlikely(sem-&gt;count &lt;= 0))
+	if (unlikely(!sem-&gt;count))
 		result = __down_killable(s...
To: Ingo Molnar <mingo@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 12:02 pm

Why don't we just make it do the same thing that the x86 semaphores used 
to do: make it signed, and decrement unconditionally. And callt eh 
slow-path if it became negative.

IOW, make the fast-path be

	spin_lock_irqsave(&amp;sem-&gt;lock, flags);
	if (--sem-&gt;count &lt; 0)
		__down();
	spin_unlock_irqrestore(&amp;sem-&gt;lock, flags);

and now we have an existing known-good implementation to look at?

Rather than making up a totally new and untested thing.

		Linus
--
To: Ingo Molnar <mingo@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 2:30 pm

Ok, after having thought that over, and looked at the code, I think I like 
your version after all. The old implementation was pretty complex due to 
the need to be so extra careful about the count that could change outside 
of the lock, so everything considered, a new implementation that is 
simpler is probably the better choice.

Ergo, I will just pull your scheduler tree.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 4:19 pm

yeah. I thought about that too, the problem i found is this thing in the 
old lib/semaphore-sleepers.c code's __down() path:

        remove_wait_queue_locked(&amp;sem-&gt;wait, &amp;wait);
        wake_up_locked(&amp;sem-&gt;wait);
        spin_unlock_irqrestore(&amp;sem-&gt;wait.lock, flags);
        tsk-&gt;state = TASK_RUNNING;

that mystery wakeup once i understood to be necessary for some weird 
ordering reason, but it would probably be hard to justify in the new 
code, because it's done unconditionally, regardless of whether there are 
sleepers around.

And once we deviate from the old code, we might as well go for the 
simplest approach - which also happens to be rather close to the mutex 
code's current slowpath - just with counting property added, legacy 

great! Meanwhile a 100 randconfigs booted fine with that tree so i'd say 
the implementation is robust.

i also did a quick re-test of AIM7 because the wakeup logic changed a 
bit from what i tested initially (from round-robin to strict FIFO), and 
as expected not much changed in the AIM7 results on the quad:

  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55019.9         96      211.6   806.5   0.4585
  2000    55116.2         90      211.2   804.7   0.4593
  2000    55082.3         82      211.3   805.5   0.4590

this is slightly lower but the test was not fully apples to apples 
because this also had some tracing active and other small details. It's 
still very close to the v2.6.25 numbers. I suspect some more performance 
could be won in this particular workload by getting rid of the BKL 
dependency altogether.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 4:27 pm

Did somebody have any trace on which BKL taker it is? It apparently wasn't 
file locking. Was it the tty layer?

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 5:45 pm

yeah, i captured a trace of all the down()s that happen in the workload, 
using ftrace/sched_switch + stacktrace + a tracepoint in down(). Here's 
the semaphore activities in the trace:

# grep lock_kernel /debug/tracing/trace | cut -d: -f2- | sort | \
                                          uniq -c | sort -n | cut -d= -f1-4

      9  down &lt;= lock_kernel &lt;= proc_lookup_de &lt;= proc_lookup &lt;
     12  down &lt;= lock_kernel &lt;= de_put &lt;= proc_delete_inode &lt;
     14  down &lt;= lock_kernel &lt;= proc_lookup_de &lt;= proc_lookup &lt;
     19  down &lt;= lock_kernel &lt;= vfs_ioctl &lt;= do_vfs_ioctl &lt;
     58  down &lt;= lock_kernel &lt;= tty_release &lt;= __fput &lt;
     62  down &lt;= lock_kernel &lt;= chrdev_open &lt;= __dentry_open &lt;
     70  down &lt;= lock_kernel &lt;= sys_fcntl &lt;= system_call_after_swapgs &lt;
   2512  down &lt;= lock_kernel &lt;= opost &lt;= write_chan &lt;
   2574  down &lt;= lock_kernel &lt;= write_chan &lt;= tty_write &lt;

note that this is running the fixed semaphore code, so contended 
semaphores are really rare in the trace. The histogram above includes 
all calls to down().

here's the full trace file (from a single CPU, on a dual-core box 
running aim7):

   http://redhat.com/~mingo/misc/aim7-ftrace.txt

some other interesting stats. Top 5 wakeups sources:

 # grep wake_up aim7-ftrace.txt  | cut -d: -f2- | sort | uniq -c | 
                                   sort -n | cut -d= -f1-6 | tail -5
  [...]
    340  default_wake_function &lt;= __wake_up_common &lt;= __wake_up_sync &lt;= unix_write_space &lt;= sock_wfree &lt;= skb_release_all &lt;
    411  default_wake_function &lt;= autoremove_wake_function &lt;= __wake_up_common &lt;= __wake_up_sync &lt;= pipe_read &lt;= do_sync_read &lt;
    924  default_wake_function &lt;= autoremove_wake_function &lt;= __wake_up_common &lt;= __wake_up_sync &lt;= pipe_write &lt;= do_sync_write &lt;
   1301  default_wake_function &lt;= __wake_up_common &lt;= __wake_up &lt;= n_tty_receive_b...
To: Ingo Molnar <mingo@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Alan Cox <alan@...>
Date: Thursday, May 8, 2008 - 6:55 pm

Ok. tty write handling. Nasty. But not as nasty as the open/close code, 
perhaps, and maybe we'll get it fixed some day.

In fact, I thought we had fixed most of this already, but hey, I was 


No, it's not rmap contention. Your profile hits are just on the actual 
calculations, and it's all data-dependent arithmetic and loads. Some cache 
misses on the page tables, clearly, but it looks like a lot of it is even 
just the plain arithmetic (the imul followed by a data-dependent 'lea' 
instruction).

Some of it is that "page_to_pfn(page)", which involves a nasty division 
(divide by sizeof(struct page)). It gets turned into that shift and 
multiply, but it's still quite expensive with big constants etc.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 7:16 pm

I have pushed it down to n_tty line discipline code but not within that.
It is on the hit list but I'm working on more pressing stuff first (USB
layer, extracting commonality to start to tackle open etc etc)

I don't think fixing n_tty is now a big job if someone wants to take a
swing at it. The driver write/throttle/etc routines below the n_tty ldisc
layer are now BKL clean so it should just be the internal locking of the
buffers, window and the like to tackle.

Feel free to have a go 8)
--
To: Alan Cox <alan@...>
Cc: Ingo Molnar <mingo@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 7:33 pm

Well, it turns out that Ingo's fixed statistics actually put the real cost 
in fcntl/ioctl/open/release:

    310  down &lt;= lock_kernel &lt;= sys_fcntl &lt;= system_call_after_swapgs &lt;
    332  down &lt;= lock_kernel &lt;= vfs_ioctl &lt;= do_vfs_ioctl &lt;
    380  down &lt;= lock_kernel &lt;= tty_release &lt;= __fput &lt;
    422  down &lt;= lock_kernel &lt;= chrdev_open &lt;= __dentry_open &lt;

rather than the write routines. But it may be that Ingo was just profiling 
two different sections, and it's really all of them.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Alan Cox <alan@...>, Ingo Molnar <mingo@...>, Zhang, Yanmin <yanmin_zhang@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Friday, May 9, 2008 - 4:29 am

That must be -&gt;fasync? If it was file locks the lock_kernel would not
be inlined into sys_fcntl. Or is that an truncated back trace?

-Andi (wondering if he should plug -&gt;fasync_unlocked again ...)
--
To: Linus Torvalds <torvalds@...>
Cc: Alan Cox <alan@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Friday, May 9, 2008 - 2:50 am

the first trace had general desktop load mixed into it as well - so 
while it's not interesting to AIM7 the BKL does matter in those 
situations and i'd not be surprised if it was responsible for certain 
categories of desktop lag.

The second trace was the correct 'pure' AIM7 workload which produces 
very little tty output. It is a quite stable workload and the trace i 
uploaded is representative of the totality of that workload. AIM7 runs 
for several minutes so there's no significant rampup/rampdown 
interaction either.

	Ingo
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 7:27 pm

tty release is probably a few months away from getting cured  - I'm
afraid it will almost certainly be the very last user of the BKL in tty
to get fixed as it depends on everything else being sanely locked.
--
To: Ingo Molnar <mingo@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Alan Cox <alan@...>
Date: Thursday, May 8, 2008 - 7:07 pm

Btw, sparse will complain about those, because the source code *looks* 
really cheap.

The normal "page_to_pfn()" looks trivial:

	((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)

which looks like a trivial subtraction and addition of a constant, but the 
subtraction is on C pointers, and basically turns into

	((unsigned long)page - (unsigned long)mem_map) / sizeof(struct page)

and because "struct page" is not some nice power-of-two in size, that 
division is rather nasty even though it's a constant size.

		Linus
--
To: Ingo Molnar <mingo@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Alan Cox <alan@...>
Date: Thursday, May 8, 2008 - 7:14 pm

Sometimes you can fix it.

For example, this change:

	-       if (pte_present(*pte) &amp;&amp; page_to_pfn(page) == pte_pfn(*pte)) {
	+       if (pte_present(*pte) &amp;&amp; page == pfn_to_page(pte_pfn(*pte))) {

can simplify things: instead of moving from a 'struct page' to a pfn, it 
moves from a pfn to a 'struct page', and that is generally cheaper 
(multiply rather than divide by size of struct page). It's not always the 
same thing to do, but I think in this case we can. For me, the code 
generation changes:

	-       movabsq $7905747460161236407, %rdx      #, tmp111
	-       movabsq $32985348833280, %rax   #, tmp107
	-       leaq    (%r12,%rax), %rax       #, tmp106
	-       sarq    $3, %rax        #, tmp106
	-       imulq   %rdx, %rax      # tmp111, tmp106
	-       movabsq $70368744177663, %rdx   #, tmp113
	-       andq    %rdx, %rcx      # tmp113, pte$pte
	-       shrq    $12, %rcx       #, pte$pte
	-       cmpq    %rcx, %rax      # pte$pte, tmp106
	+       movabsq $70368744177663, %rax   #, tmp107
	+       andq    %rax, %rdx      # tmp107, pte$pte
	+       shrq    $12, %rdx       #, pte$pte
	+       imulq   $56, %rdx, %rax #, pte$pte, tmp109
	+       movabsq $-32985348833280, %rdx  #, tmp111
	+       addq    %rdx, %rax      # tmp111, tmp110
	+       cmpq    %rax, %r13      # tmp110, page

which isn't a *huge* deal, but it certainly looks better. One less big 
constant, and one less shift.

It's not going to make a huge difference, though. That function is just 
called too much, and it would still be entirely data-dependent all the way 
through.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 6:02 pm

the first one had some idle time in it as well plus the effects of a 
2000-task Ctrl-Z - which skewed the histograms.

the new stats are more straightforward. down() callsite histogram:

     42  down &lt;= lock_kernel &lt;= de_put &lt;= proc_delete_inode &lt;
     42  down &lt;= lock_kernel &lt;= proc_lookup_de &lt;= proc_lookup &lt;
     78  down &lt;= lock_kernel &lt;= proc_lookup_de &lt;= proc_lookup &lt;
    310  down &lt;= lock_kernel &lt;= sys_fcntl &lt;= system_call_after_swapgs &lt;
    332  down &lt;= lock_kernel &lt;= vfs_ioctl &lt;= do_vfs_ioctl &lt;
    380  down &lt;= lock_kernel &lt;= tty_release &lt;= __fput &lt;
    422  down &lt;= lock_kernel &lt;= chrdev_open &lt;= __dentry_open &lt;

hm, why is chrdev_open() called all that often?

top-5 wakeups:

      4  default_wake_function &lt;= __wake_up_common &lt;= complete &lt;= migration_thread &lt;= kthread &lt;= child_rip &lt;
      4  wake_up_process &lt;= sched_exec &lt;= do_execve &lt;= sys_execve &lt;= stub_execve &lt;=  &lt;
      8  wake_up_process &lt;= __mutex_unlock_slowpath &lt;= mutex_unlock &lt;= do_filp_open &lt;= do_sys_open &lt;= sys_open &lt;
     40  default_wake_function &lt;= autoremove_wake_function &lt;= __wake_up_common &lt;= __wake_up &lt;= sock_def_wakeup &lt;= tcp_rcv_state_process &lt;
     98  default_wake_function &lt;= __wake_up_common &lt;= __wake_up_sync &lt;= do_notify_parent &lt;= do_exit &lt;= do_group_exit &lt;

i.e. very little wakeup activities. Top 10 scheduling points:

     10  schedule &lt;= kjournald &lt;= kthread &lt;= child_rip &lt;
     12  schedule &lt;= __down_write_nested &lt;= __down_write &lt;= down_write &lt;
     29  schedule &lt;= worker_thread &lt;= kthread &lt;= child_rip &lt;
     40  schedule &lt;= schedule_timeout &lt;= inet_stream_connect &lt;= sys_connect &lt;
     59  schedule &lt;= __cond_resched &lt;= _cond_resched &lt;= generic_file_buffered_write &lt;
    111  schedule &lt;= ksoftirqd &lt;= kthread &lt;= child_rip &lt;
    119  schedule &lt;= do_wa...
To: Linus Torvalds <torvalds@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 10:43 am

a second update patch, i've further simplified the semaphore wakeup 
logic: there's no need for the wakeup to remove the task from the wait 
list. This will make them a slighly bit more fair, but more importantly, 
this closes a race in my first patch for the unlikely case of a signal 
(or a timeout) and an unlock coming in at the same time and the task not 
getting removed from the wait-list.

( my performance testing with 2000 AIM7 tasks on a quad never hit that
  race but x86.git QA actually triggered it after about 30 random kernel
  bootups and it caused a nasty crash and lockup. )

	Ingo

----------------&gt;
Subject: sem: simplify queue management
From: Ingo Molnar &lt;mingo@elte.hu&gt;
Date: Tue May 06 19:32:42 CEST 2008

kernel/semaphore.o:

   text	   data	    bss	    dec	    hex	filename
   1040	      0	      0	   1040	    410	semaphore.o.before
    975	      0	      0	    975	    3cf	semaphore.o.after

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 kernel/semaphore.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -202,33 +202,34 @@ static inline int __sched __down_common(
 {
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
+	int ret = 0;
 
 	waiter.task = task;
+	list_add_tail(&amp;waiter.list, &amp;sem-&gt;wait_list);
 
 	for (;;) {
-		list_add_tail(&amp;waiter.list, &amp;sem-&gt;wait_list);
-
-		if (state == TASK_INTERRUPTIBLE &amp;&amp; signal_pending(task))
-			goto interrupted;
-		if (state == TASK_KILLABLE &amp;&amp; fatal_signal_pending(task))
-			goto interrupted;
-		if (timeout &lt;= 0)
-			goto timed_out;
+		if (state == TASK_INTERRUPTIBLE &amp;&amp; signal_pending(task)) {
+			ret = -EINTR;
+			break;
+		}
+		if (state == TASK_KILLABLE &amp;&amp; fatal_signal_pending(task)) {
+			ret = -EINTR;
...
To: Linus Torvalds <torvalds@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Matthew Wilcox <matthew@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Thursday, May 8, 2008 - 11:10 am

ok, it's looking good here so far so here's the scheduler fixes tree 
that you can pull if my semaphore fix looks good to you too:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus

also includes a scheduler arithmetics fix from Mike. Find the shortlog 
and diff below.

	Ingo

------------------&gt;
Ingo Molnar (1):
      semaphore: fix

Mike Galbraith (1):
      sched: fix weight calculations

 kernel/sched_fair.c |   11 ++++++--
 kernel/semaphore.c  |   64 ++++++++++++++++++++++++---------------------------
 2 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c863663..e24ecd3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -662,10 +662,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
 		if (sched_feat(NEW_FAIR_SLEEPERS)) {
+			unsigned long thresh = sysctl_sched_latency;
+
+			/*
+			 * convert the sleeper threshold into virtual time
+			 */
 			if (sched_feat(NORMALIZED_SLEEPER))
-				vruntime -= calc_delta_weight(sysctl_sched_latency, se);
-			else
-				vruntime -= sysctl_sched_latency;
+				thresh = calc_delta_fair(thresh, se);
+
+			vruntime -= thresh;
 		}
 
 		/* ensure we never gain time by being placed backwards. */
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..5e41217 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&amp;sem-&gt;lock, flags);
-	if (likely(sem-&gt;count &gt; 0))
-		sem-&gt;count--;
-	else
+	if (unlikely(!sem-&gt;count))
 		__down(sem);
+	sem-&gt;count--;
 	spin_unlock_irqrestore(&amp;sem-&gt;lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&amp;sem-&gt;lock, flags);
-	if (likely(sem-&gt;...
To: Ingo Molnar <mingo@...>, Sven Wegener <sven.wegener@...>
Cc: Linus Torvalds <torvalds@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Sunday, May 11, 2008 - 7:03 am

This might be the problem that causes the missing wakeups.  If you have a
semaphore with n=2, and four processes calling down(), tasks A and B
acquire the semaphore and tasks C and D go to sleep.  Task A calls up()
and wakes up C.  Then task B calls up() and doesn't wake up anyone
because C hasn't run yet.  I think we need another wakeup when task C
finishes in __down_common, like this (on top of your patch):

diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5e41217..e520ad4 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
 	}
 
 	list_del(&amp;waiter.list);
+
+	/* It's possible we need to wake up the next task on the list too */
+	if (unlikely(sem-&gt;count &gt; 1) &amp;&amp; !list_empty(&amp;sem-&gt;wait_list))
+		__up(sem);
+
 	return ret;
 }
 
Sven, can you try this with your workload?  I suspect this might be it
because XFS does use semaphores with n&gt;1.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Sunday, May 11, 2008 - 10:10 am

This one fixes the regression too, after applying it on top of bf726e.

Sven
--
To: Matthew Wilcox <matthew@...>
Cc: Sven Wegener <sven.wegener@...>, Linus Torvalds <torvalds@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Sunday, May 11, 2008 - 9:01 am

this needs to check for ret != 0 as well, otherwise we can be woken but 
a timeout can also trigger =&gt; we lose a wakeup. I.e. like the patch 
below. Hm?

	Ingo

-----------------------------&gt;
Subject: semaphore: fix #3
From: Ingo Molnar &lt;mingo@elte.hu&gt;
Date: Sun May 11 09:51:07 CEST 2008

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 kernel/semaphore.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
 	struct task_struct *task;
 };
 
+static noinline void __sched __up(struct semaphore *sem)
+{
+	struct semaphore_waiter *waiter = list_first_entry(&amp;sem-&gt;wait_list,
+						struct semaphore_waiter, list);
+	wake_up_process(waiter-&gt;task);
+}
+
 /*
  * Because this function is inlined, the 'state' parameter will be
  * constant, and thus optimised away by the compiler.  Likewise the
@@ -231,6 +238,10 @@ static inline int __sched __down_common(
 	}
 
 	list_del(&amp;waiter.list);
+	if (unlikely(!list_empty(&amp;sem-&gt;wait_list)) &amp;&amp;
+					((sem-&gt;count &gt; 1) || ret))
+		__up(sem);
+
 	return ret;
 }
 
@@ -254,9 +265,10 @@ static noinline int __sched __down_timeo
 	return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
 }
 
-static noinline void __sched __up(struct semaphore *sem)
-{
-	struct semaphore_waiter *waiter = list_first_entry(&amp;sem-&gt;wait_list,
-						struct semaphore_waiter, list);
-	wake_up_process(waiter-&gt;task);
-}
+
+	/*
+	 * Rotate sleepers - to make sure all of them get woken in case
+	 * of parallel up()s:
+	 */
+	list_move_tail(&amp;waiter-&gt;list, &amp;sem-&gt;wait_list);
+
--
To: Ingo Molnar <mingo@...>
Cc: Sven Wegener <sven.wegener@...>, Linus Torvalds <torvalds@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, LKML <linux-kernel@...>, Alexander Viro <viro@...>, Andrew Morton <akpm@...>