Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more than 40% with 2.6.26-rc1
on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. Bisect located
below patch.
64ac24e738823161693bf791f87adc802cf529ff is first bad commit
commit 64ac24e738823161693bf791f87adc802cf529ff
Author: Matthew Wilcox <matthew@wil.cx>
Date: Fri Mar 7 21:55:58 2008 -0500
Generic semaphore implementation
After I manually reverted the patch against 2.6.26-rc1 while fixing lots of
conflictions/errors, aim7 regression became less than 2%.
-yanmin
--hm, which exact semaphore would that be due to? My first blind guess would be the BKL - there's not much other semaphore use left in the core kernel otherwise that would affect AIM7 normally. The VFS still makes frequent use of the BKL and AIM7 is very VFS intense. Getting rid of that BKL use from the VFS might be useful to performance anyway. Could you try to check that it's indeed the BKL? Easiest way to check it would be to run AIM7 it on sched-devel.git/latest and do scheduler tracing via: http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt by doing: echo stacktrace > /debug/tracing/iter_ctl you could get exact backtraces of all scheduling points in the trace. If the BKL's down() shows up in those traces then it's definitely the BKL that causes this. The backtraces will also tell us exactly which BKL use is the most frequent one. To keep tracing overhead low on SMP i'd also suggest to only trace a single CPU, via: echo 1 > /debug/tracing/tracing_cpumask Ingo --
Thank you guys for the quick response. I ran into many regressions with 2.6.26-rc1, but just reported 2 of them because I located the patches. My machine is locating the root cause of 30% regression of sysbench+mysql(oltp readonly) now. Bisect is not so qucik because either kernel hang with testing or compilation fails. Another specjbb2005 on Montvale is also under investigation. Let me figure out how to clone your tree quickly as the network speed is very slow. One clear weird behavior of aim7 is cpu idle is 0% with 2.6.25, but is more than 50% with --
With my patch+gprof, I collected some data. Below was outputed by gprof.
index % time self children called name
0.00 0.00 2/223305376 __down_write_nested [22749]
0.00 0.00 3/223305376 journal_commit_transaction [10526]
0.00 0.00 6/223305376 __down_read [22745]
0.00 0.00 8/223305376 start_this_handle [19167]
0.00 0.00 15/223305376 sys_pause [19808]
0.00 0.00 17/223305376 log_wait_commit [11047]
0.00 0.00 20/223305376 futex_wait [8122]
0.00 0.00 64/223305376 pdflush [14335]
0.00 0.00 71/223305376 do_get_write_access [5367]
0.00 0.00 84/223305376 pipe_wait [14460]
0.00 0.00 111/223305376 kjournald [10726]
0.00 0.00 116/223305376 int_careful [9634]
0.00 0.00 224/223305376 do_nanosleep [5418]
0.00 0.00 1152/223305376 watchdog [22065]
0.00 0.00 4087/223305376 worker_thread [22076]
0.00 0.00 5003/223305376 __mutex_lock_killable_slowpath [23305]
0.00 0.00 7810/223305376 ksoftirqd [10831]
0.00 0.00 9389/223305376 __mutex_lock_slowpath [23306]
0.00 0.00 10642/223305376 io_schedule [9813]
0.00 0.00 23544/223305376 migration_thread [11495]
0.00 0.00 35319/223305376 __cond_resched [22673]
0.00 0.00 49065/223305376 retint_careful [16146]
0.00 0.00 119757/223305376 sysret_careful [20074]
0.00 0.00 151717/223305376 do_wait [5545]
0.00 0.00 250221/223305376 do_exit [5356]
0.00 0.00 303836/223305376 cpu_idle [43...I have an older patchkit that introduced unlocked_fnctl for some cases. It was briefly in mm but then dropped. Sounds like it is worth resurrecting? tty_* is being taken care of by Alan. chrdev_open is more work. -Andi (who BTW never quite understood why BKL is a semaphore now and not a spinlock?) --
On Wed, 07 May 2008 13:00:14 +0200 The tty open/close paths are probably a few months away from dropping the BKL. Alan --
Not sure what you're talking about here, Andi. The only lock_kernel in fcntl.c is around the call to ->fasync. And Yanmin's traces don't show See git commit 6478d8800b75253b2a934ddcb734e13ade023ad0 -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
I am aware of that commit, thank you, but the comment was refering to that it came with about zero justification why it was done. For the left over BKL regions which are relatively short surely a spinlock would be better than a semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL. If that was done all these regressions would disappear I bet. That said of course it is still better to actually fix the lock_kernel()s, but shorter time just fixing lock_kernel again would be easier. -Andi --
I do agree. I think turning the BKL into a semaphore was fine per se, but that was when semaphores were fast. Considering the apparent AIM regressions, we really either need to revert the semaphore consolidation, or we need to fix the kernel lock. And by "fixing", I don't mean removing it - it will happen, but it almost certainly won't happen for 2.6.26. The easiest approach would seem to just turn the BKL into a mutex instead, which should hopefully be about as optimized as the old semaphores. But my preferred option would indeed be just turning it back into a spinlock - and screw latency and BKL preemption - and having the RT people who care deeply just work on removing the BKL in the long run. Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious when we now apparently have even more reason to aim for removing the BKL rather than trying to mess around with it. Linus --
hm, do we know it for a fact that the 40% AIM regression is due to the fastpath overhead of the BKL? It would be extraordinary if so. I think it is far more likely that it's due to the different scheduling and wakeup behavior of the new kernel/semaphore.c code. So the fix would be to restore the old scheduling behavior - that's what Yanmin's manual revert did and that's what got him back the previous AIM7 performance. Ingo --
Yes, Yanmin's manual revert got rid of the new semaphores entirely. Which
was what, 7500 lines of code removed that got reverted.
And the *WHOLE* and *ONLY* excuse for dropping the spinlock lock_kernel
was this (and I quote your message):
remove the !PREEMPT_BKL code.
this removes 160 lines of legacy code.
in other words, your only stated valid reason for getting rid of the
spinlock was 160 lines, and the comment didn't even match what it did (it
removed the spinlocks entirely, not just the preemptible version).
In contrast, the revert adds 7500 lines. If you go by the only documented
reason for the crap that is the current BKL, then I know which one I'll
take. I'll take the spinlock back, and I'd rather put preemption back
than ever take those semaphores.
And even that's ignoring another issue: did anybody ever even do that AIM7
benchmark comparing spinlocks to the semaphore-BKL? It's quite possible
that the semaphores (even the well-behaved ones) behaved worse than the
spinlocks.
Linus
--i wouldnt advocate a 7500 revert instead of a 160 lines change. my suggestion was that the scheduling behavior of the new kernel/semaphore.c code is causing the problem - i.e. making it match it was removed by me in the course of this discussion: http://lkml.org/lkml/2008/1/2/58 the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the spinlock version] was broken for a longer period of time (it crashed trivially), because nobody apparently used it. People (Nick) asked why it was still there and i agreed and removed it. CONFIG_PREEMPT_BKL=y was the default, that was what all distros used. I.e. the spinlock code was in essence dead code at that point in time. the spinlock code might in fact perform _better_, but nobody came up that's a good question... Ingo --
Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL was the known-flaky one. The thread you point to also says that it's PREEMPT_BKL=y that was the problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I was still using reiserfs."), not the plain spinlock approach. But it would definitely be interesting to see the crash reports. And the help message always said "Say N if you are unsure." even if it ended up being marked 'y' by default at some point (and then in January was made first unconditional, and then removed entirely) Because in many ways, the non-preempt BKL is the *much* simpler case. I don't see why it would crash - it just turns the BKL into a trivial counting spinlock that can sleep. Linus --
no, there was another problem (which i couldnt immediately find because lkml.org only indexes part of the threads, i'll research it some more), yeah. The latencies are a different problem, and indeed were reported against PREEMPT_BKL, and believed to be due to reiser3 and the tty code. (reiser3 runs almost all of its code under the BKL) The !PREEMPT_BKL crash was some simple screwup on my part of getting atomicity checks wrong in cond_resched() - and it went unnoticed for a long time - or something like that. I'll try to find that discussion. Ingo --
Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe way. It uses just !(preempt_count() & PREEMPT_ACTIVE) to see whether it can schedule, and it should probably use in_atomic() which ignores the kernel lock. But right now, that whole thing is disabled if PREEMPT is on anyway, so in effect (with my test patch, at least) cond_preempt() would just be a no-op if PREEMPT is on, even if BKL isn't preemptable. So it doesn't look buggy, but it looks like it might cause longer latencies than strictly necessary. And if somebody depends on cond_resched() to avoid some bad livelock situation, that would obviously not work (but that sounds like a fundamental bug anyway, I really hope Yes, some silly bug sounds more likely. Especially considering how many different cases there were (semaphores vs spinlocks vs preemptable spinlocks). Linus --
Funny you should mention it; locks.c uses cond_resched() assuming that it ignores the BKL. Not through needing to avoid livelock, but it does presume that other higher priority tasks contending for the lock will get a chance to take it. You'll notice the patch I posted yesterday drops the file_lock_lock around the call to cond_resched(). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Well, this would only be noticeable with CONFIG_PREEMPT.
If you don't have preempt enabled, it looks like everything should work
ok: the kernel lock wouldn't increase the preempt count, and
_cond_resched() works fine.
If you're PREEMPT, then the kernel lock would increase the preempt count,
and _cond_resched() would refuse to re-schedule it, *but* with PREEMPT
you'd never see it *anyway*, because PREEMPT will disable cond_resched()
entirely (because preemption takes care of normal scheduling latencies
without it).
And I'm also sure that this all worked fine at some point, and it's
largely a result just of the multiple different variations of BKL
preemption coupled with some of them getting removed entirely, so the code
that used to handle it just got corrupt over time. See commit 02b67cc3b,
for example.
.. Hmm ... Time passes. Linus looks at git history.
It does look like "cond_resched()" has not worked with the BKL since 2005,
and hasn't taken the BKL into account. Commit 5bbcfd9000:
[PATCH] cond_resched(): fix bogus might_sleep() warning
+ if (unlikely(preempt_count()))
+ return;
which talks about the BKS, ie it only took the *semaphore* implementation
into account. Never the spinlock-with-preemption-count one.
Or am I blind?
Linus
--hm, i think you are right. most latency reduction was concentrated on the PREEMPT+PREEMPT_BKL case, and not getting proper cond_resched() behavior in case of !PREEMPT_BKL would certainly not be noticed by distros or users. We made CONFIG_PREEMPT_BKL=y the default on SMP in v2.6.8, in this post-2.6.7 commit that introduced the feature: | commit fb8f6499abc6a847109d9602b797aa6afd2d5a3d | Author: Ingo Molnar <mingo@elte.hu> | Date: Fri Jan 7 21:59:57 2005 -0800 | | [PATCH] remove the BKL by turning it into a semaphore There was constant trouble around all these variations of preemptability and their combination with debugging helpers. (So i was rather happy to get rid of !PREEMPT_BKL - in the (apparently wrong) assumption that no tears will be shed.) Ingo --
Here's a trial balloon patch to do that. Yanmin - this is not well tested, but the code is fairly obvious, and it would be interesting to hear if this fixes the performance regression. Because if it doesn't, then it's not the BKL, or something totally different is going on. Now, we should probably also test just converting the thing to a mutex, to see if that perhaps also fixes it. Linus --- arch/mn10300/Kconfig | 11 ---- include/linux/hardirq.h | 18 ++++--- kernel/sched.c | 27 ++--------- lib/kernel_lock.c | 120 +++++++++++++++++++++++++++++++--------------- 4 files changed, 95 insertions(+), 81 deletions(-) diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig index 6a6409a..e856218 100644 --- a/arch/mn10300/Kconfig +++ b/arch/mn10300/Kconfig @@ -186,17 +186,6 @@ config PREEMPT Say Y here if you are building a kernel for a desktop, embedded or real-time system. Say N if you are unsure. -config PREEMPT_BKL - bool "Preempt The Big Kernel Lock" - depends on PREEMPT - default y - help - This option reduces the latency of the kernel by making the - big kernel lock preemptible. - - Say Y here if you are building a kernel for a desktop system. - Say N if you are unsure. - config MN10300_CURRENT_IN_E2 bool "Hold current task address in E2 register" default y diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 897f723..181006c 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -72,6 +72,14 @@ #define in_softirq() (softirq_count()) #define in_interrupt() (irq_count()) +#if defined(CONFIG_PREEMPT) +# define PREEMPT_INATOMIC_BASE kernel_locked() +# define PREEMPT_CHECK_OFFSET 1 +#else +# define PREEMPT_INATOMIC_BASE 0 +# define PREEMPT_CHECK_OFFSET 0 +#endif + /* * Are we running in atomic context? WARNING: this macro cannot * always detect atomic context; in particular, it cannot know about @@ -79,17 +87,11 @@ * used in the general case to...
Congratulations! The patch really fixes the regression completely! vmstat showed cpu idle is 0%, just like 2.6.25's. Some config options in my .config file: CONFIG_NR_CPUS=32 CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y --
great! Yanmin, could you please also check the other patch i sent (also
attached below), does it solve the regression similarly?
Ingo
---
lib/kernel_lock.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
task->lock_depth = -1;
preempt_enable_no_resched();
- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();
preempt_disable();
task->lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;
- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();
+ }
task->lock_depth = depth;
}
--With your patch, aim7 regression becomes less than 2%. I ran the testing twice. Linus' patch could recover it completely. As aim7 result is quite stable(usually --
is this the old original aim7 you are running, or osdl-aim-7 or
re-aim-7?
if it's aim7 then this is a workload that starts+stops 2000 parallel
tasks that each start and exit at the same time. That might explain its
sensitivity on the BKL - this is all about tty-controlled task startup
and exit.
i could not get it to produce anywhere close to stable results though. I
also frequently get into this problem:
AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
2000
Failed to execute
new_raph 200
Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17
Failed to execute
disk_cp /mnt/shm
disk_cp (1): cannot open /mnt/shm/tmpa.common
disk1.c: No such file or directory
[.. etc. a large stream of them .. ]
system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its
work files.
Ingo
--I useold AIM7 plus a small patch which is just to change a couple of data type to match My machine has 8GB. To simulate your environment, I reserve 6GB for hugetlb, then reran the testing and didn't see any failure except: AIM Multiuser Benchmark - Suite VII Run Beginning Tasks jobs/min jti jobs/min/task real cpu 2000create_shared_memory(): can't create semaphore, pausing... create_shared_memory(): can't create semaphore, pausing... Above info doesn't mean errors. Perhaps you could: 1) Apply the attched aim9 patch; 2) check if you have write right under /mnt/shm; 3) echo "/mnt/shm">aim7_path/config;
that failure message you got worries me - it indicates that your test ran out of IPC semaphores. You can fix it via upping the semaphore limits via: echo "500 32000 128 512" > /proc/sys/kernel/sem could you check that you still get similar results with this limit fixed? note that once i've fixed the semaphore limits it started running fine here. And i see zero idle time during the run on a quad core box. here are my numbers: # on v2.6.26-rc1-166-gc0a1811 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 2000 55851.4 93 208.4 793.6 0.4654 # BKL: sleep 2000 55402.2 79 210.1 800.1 0.4617 2000 55728.4 93 208.9 795.5 0.4644 # BKL: spin 2000 55787.2 93 208.7 794.5 0.4649 # so the results are the same within noise. I'll also check this workload on an 8-way box to make sure it's OK on larger CPU counts too. could you double-check your test? plus a tty tidbit as well, during the test i saw a few of these: Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev Warning: dev (tty1) tty->count(462) != #fd's(463) in release_dev Warning: dev (tty1) tty->count(274) != #fd's(275) in release_dev Warning: dev (tty1) tty->count(4) != #fd's(3) in release_dev Warning: dev (tty1) tty->count(164) != #fd's(163) in release_dev Ingo --
A quick test showed it does work. --
false alarm there - these were due to the breakage in the hack-patch i used ... Ingo --
but but but. Some other users of down() have presumably also regressed. We just have't found the workload to demonstrate that yet. --
Well, that shows that it was the BKL.
That said, "idle 0%" is easy when you spin. Do you also have actual
performance numbers? I'd hope that not only do we use full CPU time, it's
also at least as fast as the old semaphores were?
While I've been dissing sleeping locks (because their overhead is so
high), at least in _theory_ they can get better behavior when not
spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure,
but I'd still like to hear actual performance numbers too, just to be
sure.
Anyway, at least the "was it the BKL or some other semaphore user"
question is entirely off the table.
So we need to
- fix the BKL. My patch may be a good starting point, but there are
alternatives:
(a) reinstate the old BKL code entirely
Quite frankly, I'd prefer not to. Not only did it have three
totally different cases, some of them were apparently broken (ie
BKL+regular preempt didn't cond_resched() right), and I just don't
think it's worth maintaining three different versions, when
distro's are going to pick one anyway. *We* should pick one, and
maintain it.
(b) screw the special BKL preemption - it's a spinlock, we don't
preempt other spinlocks, but at least fix BKL+preempt+cond_resched
thing.
This would be "my patch + fixes" where at least one of the fixes
is the known (apparently old) cond_preempt() bug.
(c) Try to keep the 2.6.25 code as closely as possible, but just
switch over to mutexes instead.
I dunno. I was never all that enamoured with the BKL as a sleeping
lock, so I'm biased against this one, but hey, it's just a
personal bias.
- get rid of the BKL anyway, at least in anything that is noticeable.
Matthew's patch to file locking is probably worth doing as-is,
simply because I haven't heard any better solutions. The BKL
certainly can't be it, and whatever comes out of the NFSD
discussion will almost certainly involve just making sure that
those leases just use the new ...Yes. My conclusion is based on the actual number. cpu idle 0% is just --
Thanks, that's all I wanted to verify. I'll leave this overnight, and see if somebody has come up with some smart and wonderful patch. And if not, I think I'll apply mine as "known to fix a regression", and we can perhaps then improve on things further from there. Linus --
hey, i happen to have such a smart and wonderful patch =B-)
i reproduced the AIM7 workload and can confirm Yanmin's findings that
-.26-rc1 regresses over .25 - by over 67% here.
Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing suboptimal
performance.
The problem comes from the following code. The new semaphore code does
this on down():
spin_lock_irqsave(&sem->lock, flags);
if (likely(sem->count > 0))
sem->count--;
else
__down(sem);
spin_unlock_irqrestore(&sem->lock, flags);
and this on up():
spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);
where __up() does:
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
and where __down() does this in essence:
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = 0;
for (;;) {
[...]
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;
}
the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter from
the wait list.
That is all and fine in principle, but as implemented in
kernel/semaphore.c it also creates a nasty, hidden source of con...On Thu, 8 May 2008 14:01:30 +0200 ok sounds like I like the fairness part of the new semaphores (but obviously not the 67% performance downside; I'd expect to sacrifice a little performance.. but this much??). It sucks though; if this were a mutex, we could wake up the owner of the bugger in the contented acquire path synchronously.... but these are semaphores, and don't have an owner ;( bah bah bah --
I did note that earlier downthread ... although to be fair, I thought of it in terms of three tasks with the third task coming in and stealing the second tasks's wakeup rather than the first task starving the second Fair is certainly the enemy of throughput (see also dbench arguments passim). It may be that some semaphore users really do want fairness -- it seems pretty clear that we don't want fairness for the BKL. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
i dont think we need to consider any theoretical arguments about fairness here as there's a fundamental down-to-earth maintenance issue that governs: old semaphores were similarly unfair too, so it is just a bad idea (and a bug) to change behavior when implementing new, generic semaphores that are supposed to be a seemless replacement! This is about legacy code that is intended to be phased out anyway. This is already a killer argument and we wouldnt have to look any further. but even on the more theoretical level i disagree: fairness of CPU time is something that is implemented by the scheduler in a natural way already. Putting extra ad-hoc synchronization and scheduling into the locking primitives around data structures only gives mathematical fairness and artificial micro-scheduling, it does not actually make the end result more useful! This is especially true for the BKL which is auto-dropped by the scheduler anyway. (so descheduling a task will automatically release it of its BKL ownership) For example we've invested a _lot_ of time and effort into adding lock stealing (i.e. intentional "unfairness") to kernel/rtmutex.c. Which is a _lot_ harder to do atomically with PI constraints but still possible and makes sense in the grand scheme of things. kernel/mutex.c is also "unfair" - and that's correct IMO. For the BKL in particular there's almost no sense to talk about any underlying resource and there's almost no expectation from users for that imaginery resource to be shared fairly. Ingo --
Peter pointed it out that because sem->count is u32, the <= 0 is in fact a "== 0" condition - the patch below does that. As expected gcc figured out the same thing too so the resulting code output did not change. (so this is just a cleanup) i've got this lined up in sched.git and it's undergoing testing right now. If that testing goes fine and if there are no objections i'll send a pull request for it later today. Ingo ----------------> Subject: semaphores: improve code From: Ingo Molnar <mingo@elte.hu> Date: Thu May 08 14:19:23 CEST 2008 No code changed: kernel/semaphore.o: text data bss dec hex filename 1207 0 0 1207 4b7 semaphore.o.before 1207 0 0 1207 4b7 semaphore.o.after md5: c10198c2952bd345a1edaac6db891548 semaphore.o.before.asm c10198c2952bd345a1edaac6db891548 semaphore.o.after.asm Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/semaphore.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) Index: linux/kernel/semaphore.c =================================================================== --- linux.orig/kernel/semaphore.c +++ linux/kernel/semaphore.c @@ -54,7 +54,7 @@ void down(struct semaphore *sem) unsigned long flags; spin_lock_irqsave(&sem->lock, flags); - if (unlikely(sem->count <= 0)) + if (unlikely(!sem->count)) __down(sem); sem->count--; spin_unlock_irqrestore(&sem->lock, flags); @@ -76,7 +76,7 @@ int down_interruptible(struct semaphore int result = 0; spin_lock_irqsave(&sem->lock, flags); - if (unlikely(sem->count <= 0)) + if (unlikely(!sem->count)) result = __down_interruptible(sem); if (!result) sem->count--; @@ -102,7 +102,7 @@ int down_killable(struct semaphore *sem) int result = 0; spin_lock_irqsave(&sem->lock, flags); - if (unlikely(sem->count <= 0)) + if (unlikely(!sem->count)) result = __down_killable(s...
Why don't we just make it do the same thing that the x86 semaphores used to do: make it signed, and decrement unconditionally. And callt eh slow-path if it became negative. IOW, make the fast-path be spin_lock_irqsave(&sem->lock, flags); if (--sem->count < 0) __down(); spin_unlock_irqrestore(&sem->lock, flags); and now we have an existing known-good implementation to look at? Rather than making up a totally new and untested thing. Linus --
Ok, after having thought that over, and looked at the code, I think I like your version after all. The old implementation was pretty complex due to the need to be so extra careful about the count that could change outside of the lock, so everything considered, a new implementation that is simpler is probably the better choice. Ergo, I will just pull your scheduler tree. Linus --
yeah. I thought about that too, the problem i found is this thing in the
old lib/semaphore-sleepers.c code's __down() path:
remove_wait_queue_locked(&sem->wait, &wait);
wake_up_locked(&sem->wait);
spin_unlock_irqrestore(&sem->wait.lock, flags);
tsk->state = TASK_RUNNING;
that mystery wakeup once i understood to be necessary for some weird
ordering reason, but it would probably be hard to justify in the new
code, because it's done unconditionally, regardless of whether there are
sleepers around.
And once we deviate from the old code, we might as well go for the
simplest approach - which also happens to be rather close to the mutex
code's current slowpath - just with counting property added, legacy
great! Meanwhile a 100 randconfigs booted fine with that tree so i'd say
the implementation is robust.
i also did a quick re-test of AIM7 because the wakeup logic changed a
bit from what i tested initially (from round-robin to strict FIFO), and
as expected not much changed in the AIM7 results on the quad:
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55019.9 96 211.6 806.5 0.4585
2000 55116.2 90 211.2 804.7 0.4593
2000 55082.3 82 211.3 805.5 0.4590
this is slightly lower but the test was not fully apples to apples
because this also had some tracing active and other small details. It's
still very close to the v2.6.25 numbers. I suspect some more performance
could be won in this particular workload by getting rid of the BKL
dependency altogether.
Ingo
--Did somebody have any trace on which BKL taker it is? It apparently wasn't file locking. Was it the tty layer? Linus --
yeah, i captured a trace of all the down()s that happen in the workload,
using ftrace/sched_switch + stacktrace + a tracepoint in down(). Here's
the semaphore activities in the trace:
# grep lock_kernel /debug/tracing/trace | cut -d: -f2- | sort | \
uniq -c | sort -n | cut -d= -f1-4
9 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
12 down <= lock_kernel <= de_put <= proc_delete_inode <
14 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
19 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
58 down <= lock_kernel <= tty_release <= __fput <
62 down <= lock_kernel <= chrdev_open <= __dentry_open <
70 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
2512 down <= lock_kernel <= opost <= write_chan <
2574 down <= lock_kernel <= write_chan <= tty_write <
note that this is running the fixed semaphore code, so contended
semaphores are really rare in the trace. The histogram above includes
all calls to down().
here's the full trace file (from a single CPU, on a dual-core box
running aim7):
http://redhat.com/~mingo/misc/aim7-ftrace.txt
some other interesting stats. Top 5 wakeups sources:
# grep wake_up aim7-ftrace.txt | cut -d: -f2- | sort | uniq -c |
sort -n | cut -d= -f1-6 | tail -5
[...]
340 default_wake_function <= __wake_up_common <= __wake_up_sync <= unix_write_space <= sock_wfree <= skb_release_all <
411 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_read <= do_sync_read <
924 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_write <= do_sync_write <
1301 default_wake_function <= __wake_up_common <= __wake_up <= n_tty_receive_b...Ok. tty write handling. Nasty. But not as nasty as the open/close code, perhaps, and maybe we'll get it fixed some day. In fact, I thought we had fixed most of this already, but hey, I was No, it's not rmap contention. Your profile hits are just on the actual calculations, and it's all data-dependent arithmetic and loads. Some cache misses on the page tables, clearly, but it looks like a lot of it is even just the plain arithmetic (the imul followed by a data-dependent 'lea' instruction). Some of it is that "page_to_pfn(page)", which involves a nasty division (divide by sizeof(struct page)). It gets turned into that shift and multiply, but it's still quite expensive with big constants etc. Linus --
I have pushed it down to n_tty line discipline code but not within that. It is on the hit list but I'm working on more pressing stuff first (USB layer, extracting commonality to start to tackle open etc etc) I don't think fixing n_tty is now a big job if someone wants to take a swing at it. The driver write/throttle/etc routines below the n_tty ldisc layer are now BKL clean so it should just be the internal locking of the buffers, window and the like to tackle. Feel free to have a go 8) --
Well, it turns out that Ingo's fixed statistics actually put the real cost
in fcntl/ioctl/open/release:
310 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
332 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
380 down <= lock_kernel <= tty_release <= __fput <
422 down <= lock_kernel <= chrdev_open <= __dentry_open <
rather than the write routines. But it may be that Ingo was just profiling
two different sections, and it's really all of them.
Linus
--That must be ->fasync? If it was file locks the lock_kernel would not be inlined into sys_fcntl. Or is that an truncated back trace? -Andi (wondering if he should plug ->fasync_unlocked again ...) --
the first trace had general desktop load mixed into it as well - so while it's not interesting to AIM7 the BKL does matter in those situations and i'd not be surprised if it was responsible for certain categories of desktop lag. The second trace was the correct 'pure' AIM7 workload which produces very little tty output. It is a quite stable workload and the trace i uploaded is representative of the totality of that workload. AIM7 runs for several minutes so there's no significant rampup/rampdown interaction either. Ingo --
tty release is probably a few months away from getting cured - I'm afraid it will almost certainly be the very last user of the BKL in tty to get fixed as it depends on everything else being sanely locked. --
Btw, sparse will complain about those, because the source code *looks* really cheap. The normal "page_to_pfn()" looks trivial: ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET) which looks like a trivial subtraction and addition of a constant, but the subtraction is on C pointers, and basically turns into ((unsigned long)page - (unsigned long)mem_map) / sizeof(struct page) and because "struct page" is not some nice power-of-two in size, that division is rather nasty even though it's a constant size. Linus --
Sometimes you can fix it.
For example, this change:
- if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
+ if (pte_present(*pte) && page == pfn_to_page(pte_pfn(*pte))) {
can simplify things: instead of moving from a 'struct page' to a pfn, it
moves from a pfn to a 'struct page', and that is generally cheaper
(multiply rather than divide by size of struct page). It's not always the
same thing to do, but I think in this case we can. For me, the code
generation changes:
- movabsq $7905747460161236407, %rdx #, tmp111
- movabsq $32985348833280, %rax #, tmp107
- leaq (%r12,%rax), %rax #, tmp106
- sarq $3, %rax #, tmp106
- imulq %rdx, %rax # tmp111, tmp106
- movabsq $70368744177663, %rdx #, tmp113
- andq %rdx, %rcx # tmp113, pte$pte
- shrq $12, %rcx #, pte$pte
- cmpq %rcx, %rax # pte$pte, tmp106
+ movabsq $70368744177663, %rax #, tmp107
+ andq %rax, %rdx # tmp107, pte$pte
+ shrq $12, %rdx #, pte$pte
+ imulq $56, %rdx, %rax #, pte$pte, tmp109
+ movabsq $-32985348833280, %rdx #, tmp111
+ addq %rdx, %rax # tmp111, tmp110
+ cmpq %rax, %r13 # tmp110, page
which isn't a *huge* deal, but it certainly looks better. One less big
constant, and one less shift.
It's not going to make a huge difference, though. That function is just
called too much, and it would still be entirely data-dependent all the way
through.
Linus
--the first one had some idle time in it as well plus the effects of a
2000-task Ctrl-Z - which skewed the histograms.
the new stats are more straightforward. down() callsite histogram:
42 down <= lock_kernel <= de_put <= proc_delete_inode <
42 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
78 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
310 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
332 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
380 down <= lock_kernel <= tty_release <= __fput <
422 down <= lock_kernel <= chrdev_open <= __dentry_open <
hm, why is chrdev_open() called all that often?
top-5 wakeups:
4 default_wake_function <= __wake_up_common <= complete <= migration_thread <= kthread <= child_rip <
4 wake_up_process <= sched_exec <= do_execve <= sys_execve <= stub_execve <= <
8 wake_up_process <= __mutex_unlock_slowpath <= mutex_unlock <= do_filp_open <= do_sys_open <= sys_open <
40 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up <= sock_def_wakeup <= tcp_rcv_state_process <
98 default_wake_function <= __wake_up_common <= __wake_up_sync <= do_notify_parent <= do_exit <= do_group_exit <
i.e. very little wakeup activities. Top 10 scheduling points:
10 schedule <= kjournald <= kthread <= child_rip <
12 schedule <= __down_write_nested <= __down_write <= down_write <
29 schedule <= worker_thread <= kthread <= child_rip <
40 schedule <= schedule_timeout <= inet_stream_connect <= sys_connect <
59 schedule <= __cond_resched <= _cond_resched <= generic_file_buffered_write <
111 schedule <= ksoftirqd <= kthread <= child_rip <
119 schedule <= do_wa...a second update patch, i've further simplified the semaphore wakeup
logic: there's no need for the wakeup to remove the task from the wait
list. This will make them a slighly bit more fair, but more importantly,
this closes a race in my first patch for the unlikely case of a signal
(or a timeout) and an unlock coming in at the same time and the task not
getting removed from the wait-list.
( my performance testing with 2000 AIM7 tasks on a quad never hit that
race but x86.git QA actually triggered it after about 30 random kernel
bootups and it caused a nasty crash and lockup. )
Ingo
---------------->
Subject: sem: simplify queue management
From: Ingo Molnar <mingo@elte.hu>
Date: Tue May 06 19:32:42 CEST 2008
kernel/semaphore.o:
text data bss dec hex filename
1040 0 0 1040 410 semaphore.o.before
975 0 0 975 3cf semaphore.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/semaphore.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -202,33 +202,34 @@ static inline int __sched __down_common(
{
struct task_struct *task = current;
struct semaphore_waiter waiter;
+ int ret = 0;
waiter.task = task;
+ list_add_tail(&waiter.list, &sem->wait_list);
for (;;) {
- list_add_tail(&waiter.list, &sem->wait_list);
-
- if (state == TASK_INTERRUPTIBLE && signal_pending(task))
- goto interrupted;
- if (state == TASK_KILLABLE && fatal_signal_pending(task))
- goto interrupted;
- if (timeout <= 0)
- goto timed_out;
+ if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+ ret = -EINTR;
+ break;
+ }
+ if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+ ret = -EINTR;
...ok, it's looking good here so far so here's the scheduler fixes tree
that you can pull if my semaphore fix looks good to you too:
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus
also includes a scheduler arithmetics fix from Mike. Find the shortlog
and diff below.
Ingo
------------------>
Ingo Molnar (1):
semaphore: fix
Mike Galbraith (1):
sched: fix weight calculations
kernel/sched_fair.c | 11 ++++++--
kernel/semaphore.c | 64 ++++++++++++++++++++++++---------------------------
2 files changed, 38 insertions(+), 37 deletions(-)
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c863663..e24ecd3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -662,10 +662,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
if (!initial) {
/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS)) {
+ unsigned long thresh = sysctl_sched_latency;
+
+ /*
+ * convert the sleeper threshold into virtual time
+ */
if (sched_feat(NORMALIZED_SLEEPER))
- vruntime -= calc_delta_weight(sysctl_sched_latency, se);
- else
- vruntime -= sysctl_sched_latency;
+ thresh = calc_delta_fair(thresh, se);
+
+ vruntime -= thresh;
}
/* ensure we never gain time by being placed backwards. */
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..5e41217 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
unsigned long flags;
spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(!sem->count))
__down(sem);
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)
int result = 0;
spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->...This might be the problem that causes the missing wakeups. If you have a semaphore with n=2, and four processes calling down(), tasks A and B acquire the semaphore and tasks C and D go to sleep. Task A calls up() and wakes up C. Then task B calls up() and doesn't wake up anyone because C hasn't run yet. I think we need another wakeup when task C finishes in __down_common, like this (on top of your patch): diff --git a/kernel/semaphore.c b/kernel/semaphore.c index 5e41217..e520ad4 100644 --- a/kernel/semaphore.c +++ b/kernel/semaphore.c @@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state, } list_del(&waiter.list); + + /* It's possible we need to wake up the next task on the list too */ + if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list)) + __up(sem); + return ret; } Sven, can you try this with your workload? I suspect this might be it because XFS does use semaphores with n>1. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
This one fixes the regression too, after applying it on top of bf726e. Sven --
this needs to check for ret != 0 as well, otherwise we can be woken but
a timeout can also trigger => we lose a wakeup. I.e. like the patch
below. Hm?
Ingo
----------------------------->
Subject: semaphore: fix #3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/semaphore.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
struct task_struct *task;
};
+static noinline void __sched __up(struct semaphore *sem)
+{
+ struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+ struct semaphore_waiter, list);
+ wake_up_process(waiter->task);
+}
+
/*
* Because this function is inlined, the 'state' parameter will be
* constant, and thus optimised away by the compiler. Likewise the
@@ -231,6 +238,10 @@ static inline int __sched __down_common(
}
list_del(&waiter.list);
+ if (unlikely(!list_empty(&sem->wait_list)) &&
+ ((sem->count > 1) || ret))
+ __up(sem);
+
return ret;
}
@@ -254,9 +265,10 @@ static noinline int __sched __down_timeo
return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
}
-static noinline void __sched __up(struct semaphore *sem)
-{
- struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
- struct semaphore_waiter, list);
- wake_up_process(waiter->task);
-}
+
+ /*
+ * Rotate sleepers - to make sure all of them get woken in case
+ * of parallel up()s:
+ */
+ list_move_tail(&waiter->list, &sem->wait_list);
+
--