Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more than 40% with 2.6.26-rc1
on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. Bisect located
below patch.64ac24e738823161693bf791f87adc802cf529ff is first bad commit
commit 64ac24e738823161693bf791f87adc802cf529ff
Author: Matthew Wilcox <matthew@wil.cx>
Date: Fri Mar 7 21:55:58 2008 -0500Generic semaphore implementation
After I manually reverted the patch against 2.6.26-rc1 while fixing lots of
conflictions/errors, aim7 regression became less than 2%.-yanmin
--
hm, which exact semaphore would that be due to?
My first blind guess would be the BKL - there's not much other semaphore
use left in the core kernel otherwise that would affect AIM7 normally.
The VFS still makes frequent use of the BKL and AIM7 is very VFS
intense. Getting rid of that BKL use from the VFS might be useful to
performance anyway.Could you try to check that it's indeed the BKL?
Easiest way to check it would be to run AIM7 it on
sched-devel.git/latest and do scheduler tracing via:http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt
by doing:
echo stacktrace > /debug/tracing/iter_ctl
you could get exact backtraces of all scheduling points in the trace. If
the BKL's down() shows up in those traces then it's definitely the BKL
that causes this. The backtraces will also tell us exactly which BKL use
is the most frequent one.To keep tracing overhead low on SMP i'd also suggest to only trace a
single CPU, via:echo 1 > /debug/tracing/tracing_cpumask
Ingo
--
Thank you guys for the quick response. I ran into many regressions with 2.6.26-rc1, but
just reported 2 of them because I located the patches. My machine is locating the root cause
of 30% regression of sysbench+mysql(oltp readonly) now. Bisect is not so qucik because either
kernel hang with testing or compilation fails.Another specjbb2005 on Montvale is also under investigation.
Let me figure out how to clone your tree quickly as the network speed is very slow.
One clear weird behavior of aim7 is cpu idle is 0% with 2.6.25, but is more than 50% with
--
With my patch+gprof, I collected some data. Below was outputed by gprof.
index % time self children called name
0.00 0.00 2/223305376 __down_write_nested [22749]
0.00 0.00 3/223305376 journal_commit_transaction [10526]
0.00 0.00 6/223305376 __down_read [22745]
0.00 0.00 8/223305376 start_this_handle [19167]
0.00 0.00 15/223305376 sys_pause [19808]
0.00 0.00 17/223305376 log_wait_commit [11047]
0.00 0.00 20/223305376 futex_wait [8122]
0.00 0.00 64/223305376 pdflush [14335]
0.00 0.00 71/223305376 do_get_write_access [5367]
0.00 0.00 84/223305376 pipe_wait [14460]
0.00 0.00 111/223305376 kjournald [10726]
0.00 0.00 116/223305376 int_careful [9634]
0.00 0.00 224/223305376 do_nanosleep [5418]
0.00 0.00 1152/223305376 watchdog [22065]
0.00 0.00 4087/223305376 worker_thread [22076]
0.00 0.00 5003/223305376 __mutex_lock_killable_slowpath [23305]
0.00 0.00 7810/223305376 ksoftirqd [10831]
0.00 0.00 9389/223305376 __mutex_lock_slowpath [23306]
0.00 0.00 10642/223305376 io_schedule [9813]
0.00 0.00 23544/223305376 migration_thread [11495]
0.00 0.00 35319/223305376 __cond_resched [22673]
0.00 0.00 49065/223305376 retint_careful [16146]
0.00 0.00 119757/223305376 sysret_careful [20074]
0.00 0.00 151717/223305376 do_wait [5545]
0.00 0.00 250221/223305376 do_exit [5356]
0.00 0.00 303836/223305376 cpu_idle [43...
I have an older patchkit that introduced unlocked_fnctl for some cases. It was
briefly in mm but then dropped. Sounds like it is worth resurrecting?tty_* is being taken care of by Alan.
chrdev_open is more work.
-Andi (who BTW never quite understood why BKL is a semaphore now and not
a spinlock?)
--
On Wed, 07 May 2008 13:00:14 +0200
The tty open/close paths are probably a few months away from dropping the
BKL.Alan
--
Not sure what you're talking about here, Andi. The only lock_kernel in
fcntl.c is around the call to ->fasync. And Yanmin's traces don't showSee git commit 6478d8800b75253b2a934ddcb734e13ade023ad0
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
I am aware of that commit, thank you, but the comment was refering to that it
came with about zero justification why it was done. For the left over BKL
regions which are relatively short surely a spinlock would be better than a
semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL.If that was done all these regressions would disappear I bet. That said
of course it is still better to actually fix the lock_kernel()s, but shorter
time just fixing lock_kernel again would be easier.-Andi
--
I do agree.
I think turning the BKL into a semaphore was fine per se, but that was
when semaphores were fast.Considering the apparent AIM regressions, we really either need to revert
the semaphore consolidation, or we need to fix the kernel lock. And by
"fixing", I don't mean removing it - it will happen, but it almost
certainly won't happen for 2.6.26.The easiest approach would seem to just turn the BKL into a mutex instead,
which should hopefully be about as optimized as the old semaphores.But my preferred option would indeed be just turning it back into a
spinlock - and screw latency and BKL preemption - and having the RT people
who care deeply just work on removing the BKL in the long run.Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious
when we now apparently have even more reason to aim for removing the BKL
rather than trying to mess around with it.Linus
--
hm, do we know it for a fact that the 40% AIM regression is due to the
fastpath overhead of the BKL? It would be extraordinary if so.I think it is far more likely that it's due to the different scheduling
and wakeup behavior of the new kernel/semaphore.c code. So the fix would
be to restore the old scheduling behavior - that's what Yanmin's manual
revert did and that's what got him back the previous AIM7 performance.Ingo
--
Yes, Yanmin's manual revert got rid of the new semaphores entirely. Which
was what, 7500 lines of code removed that got reverted.And the *WHOLE* and *ONLY* excuse for dropping the spinlock lock_kernel
was this (and I quote your message):remove the !PREEMPT_BKL code.
this removes 160 lines of legacy code.
in other words, your only stated valid reason for getting rid of the
spinlock was 160 lines, and the comment didn't even match what it did (it
removed the spinlocks entirely, not just the preemptible version).In contrast, the revert adds 7500 lines. If you go by the only documented
reason for the crap that is the current BKL, then I know which one I'll
take. I'll take the spinlock back, and I'd rather put preemption back
than ever take those semaphores.And even that's ignoring another issue: did anybody ever even do that AIM7
benchmark comparing spinlocks to the semaphore-BKL? It's quite possible
that the semaphores (even the well-behaved ones) behaved worse than the
spinlocks.Linus
--
i wouldnt advocate a 7500 revert instead of a 160 lines change.
my suggestion was that the scheduling behavior of the new
kernel/semaphore.c code is causing the problem - i.e. making it matchit was removed by me in the course of this discussion:
http://lkml.org/lkml/2008/1/2/58
the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the
spinlock version] was broken for a longer period of time (it crashed
trivially), because nobody apparently used it. People (Nick) asked why
it was still there and i agreed and removed it. CONFIG_PREEMPT_BKL=y was
the default, that was what all distros used. I.e. the spinlock code was
in essence dead code at that point in time.the spinlock code might in fact perform _better_, but nobody came up
that's a good question...
Ingo
--
Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL was
the known-flaky one.The thread you point to also says that it's PREEMPT_BKL=y that was the
problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I was
still using reiserfs."), not the plain spinlock approach.But it would definitely be interesting to see the crash reports. And the
help message always said "Say N if you are unsure." even if it ended up
being marked 'y' by default at some point (and then in January was made
first unconditional, and then removed entirely)Because in many ways, the non-preempt BKL is the *much* simpler case. I
don't see why it would crash - it just turns the BKL into a trivial
counting spinlock that can sleep.Linus
--
no, there was another problem (which i couldnt immediately find because
lkml.org only indexes part of the threads, i'll research it some more),yeah. The latencies are a different problem, and indeed were reported
against PREEMPT_BKL, and believed to be due to reiser3 and the tty code.
(reiser3 runs almost all of its code under the BKL)The !PREEMPT_BKL crash was some simple screwup on my part of getting
atomicity checks wrong in cond_resched() - and it went unnoticed for a
long time - or something like that. I'll try to find that discussion.Ingo
--
Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe
way. It uses just!(preempt_count() & PREEMPT_ACTIVE)
to see whether it can schedule, and it should probably use in_atomic()
which ignores the kernel lock.But right now, that whole thing is disabled if PREEMPT is on anyway, so in
effect (with my test patch, at least) cond_preempt() would just be a no-op
if PREEMPT is on, even if BKL isn't preemptable.So it doesn't look buggy, but it looks like it might cause longer
latencies than strictly necessary. And if somebody depends on
cond_resched() to avoid some bad livelock situation, that would obviously
not work (but that sounds like a fundamental bug anyway, I really hopeYes, some silly bug sounds more likely. Especially considering how many
different cases there were (semaphores vs spinlocks vs preemptable
spinlocks).Linus
--
Funny you should mention it; locks.c uses cond_resched() assuming that
it ignores the BKL. Not through needing to avoid livelock, but it does
presume that other higher priority tasks contending for the lock will
get a chance to take it. You'll notice the patch I posted yesterday
drops the file_lock_lock around the call to cond_resched().--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
Well, this would only be noticeable with CONFIG_PREEMPT.
If you don't have preempt enabled, it looks like everything should work
ok: the kernel lock wouldn't increase the preempt count, and
_cond_resched() works fine.If you're PREEMPT, then the kernel lock would increase the preempt count,
and _cond_resched() would refuse to re-schedule it, *but* with PREEMPT
you'd never see it *anyway*, because PREEMPT will disable cond_resched()
entirely (because preemption takes care of normal scheduling latencies
without it).And I'm also sure that this all worked fine at some point, and it's
largely a result just of the multiple different variations of BKL
preemption coupled with some of them getting removed entirely, so the code
that used to handle it just got corrupt over time. See commit 02b67cc3b,
for example... Hmm ... Time passes. Linus looks at git history.
It does look like "cond_resched()" has not worked with the BKL since 2005,
and hasn't taken the BKL into account. Commit 5bbcfd9000:[PATCH] cond_resched(): fix bogus might_sleep() warning
+ if (unlikely(preempt_count()))
+ return;which talks about the BKS, ie it only took the *semaphore* implementation
into account. Never the spinlock-with-preemption-count one.Or am I blind?
Linus
--
hm, i think you are right.
most latency reduction was concentrated on the PREEMPT+PREEMPT_BKL case,
and not getting proper cond_resched() behavior in case of !PREEMPT_BKL
would certainly not be noticed by distros or users.We made CONFIG_PREEMPT_BKL=y the default on SMP in v2.6.8, in this
post-2.6.7 commit that introduced the feature:| commit fb8f6499abc6a847109d9602b797aa6afd2d5a3d
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Fri Jan 7 21:59:57 2005 -0800
|
| [PATCH] remove the BKL by turning it into a semaphoreThere was constant trouble around all these variations of preemptability
and their combination with debugging helpers. (So i was rather happy to
get rid of !PREEMPT_BKL - in the (apparently wrong) assumption that no
tears will be shed.)Ingo
--
Here's a trial balloon patch to do that.
Yanmin - this is not well tested, but the code is fairly obvious, and it
would be interesting to hear if this fixes the performance regression.
Because if it doesn't, then it's not the BKL, or something totally
different is going on.Now, we should probably also test just converting the thing to a mutex,
to see if that perhaps also fixes it.Linus
---
arch/mn10300/Kconfig | 11 ----
include/linux/hardirq.h | 18 ++++---
kernel/sched.c | 27 ++---------
lib/kernel_lock.c | 120 +++++++++++++++++++++++++++++++---------------
4 files changed, 95 insertions(+), 81 deletions(-)diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
index 6a6409a..e856218 100644
--- a/arch/mn10300/Kconfig
+++ b/arch/mn10300/Kconfig
@@ -186,17 +186,6 @@ config PREEMPT
Say Y here if you are building a kernel for a desktop, embedded
or real-time system. Say N if you are unsure.-config PREEMPT_BKL
- bool "Preempt The Big Kernel Lock"
- depends on PREEMPT
- default y
- help
- This option reduces the latency of the kernel by making the
- big kernel lock preemptible.
-
- Say Y here if you are building a kernel for a desktop system.
- Say N if you are unsure.
-
config MN10300_CURRENT_IN_E2
bool "Hold current task address in E2 register"
default y
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 897f723..181006c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -72,6 +72,14 @@
#define in_softirq() (softirq_count())
#define in_interrupt() (irq_count())+#if defined(CONFIG_PREEMPT)
+# define PREEMPT_INATOMIC_BASE kernel_locked()
+# define PREEMPT_CHECK_OFFSET 1
+#else
+# define PREEMPT_INATOMIC_BASE 0
+# define PREEMPT_CHECK_OFFSET 0
+#endif
+
/*
* Are we running in atomic context? WARNING: this macro cannot
* always detect atomic context; in particular, it cannot know about
@@ -79,17 +87,11 @@
* used in the general case to...
Congratulations! The patch really fixes the regression completely!
vmstat showed cpu idle is 0%, just like 2.6.25's.Some config options in my .config file:
CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y--
great! Yanmin, could you please also check the other patch i sent (also
attached below), does it solve the regression similarly?Ingo
---
lib/kernel_lock.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
task->lock_depth = -1;
preempt_enable_no_resched();- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();preempt_disable();
task->lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();
+ }task->lock_depth = depth;
}
--
With your patch, aim7 regression becomes less than 2%. I ran the testing twice.
Linus' patch could recover it completely. As aim7 result is quite stable(usually
--
is this the old original aim7 you are running, or osdl-aim-7 or
re-aim-7?if it's aim7 then this is a workload that starts+stops 2000 parallel
tasks that each start and exit at the same time. That might explain its
sensitivity on the BKL - this is all about tty-controlled task startup
and exit.i could not get it to produce anywhere close to stable results though. I
also frequently get into this problem:AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
2000
Failed to execute
new_raph 200
Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17Failed to execute
disk_cp /mnt/shm
disk_cp (1): cannot open /mnt/shm/tmpa.common
disk1.c: No such file or directory[.. etc. a large stream of them .. ]
system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its
work files.Ingo
--
I useold AIM7 plus a small patch which is just to change a couple of data type to match
My machine has 8GB. To simulate your environment, I reserve 6GB for hugetlb, then reran the testing
and didn't see any failure except:
AIM Multiuser Benchmark - Suite VII Run BeginningTasks jobs/min jti jobs/min/task real cpu
2000create_shared_memory(): can't create semaphore, pausing...
create_shared_memory(): can't create semaphore, pausing...Above info doesn't mean errors.
Perhaps you could:
1) Apply the attched aim9 patch;
2) check if you have write right under /mnt/shm;
3) echo "/mnt/shm">aim7_path/config;
that failure message you got worries me - it indicates that your test
ran out of IPC semaphores. You can fix it via upping the semaphore
limits via:echo "500 32000 128 512" > /proc/sys/kernel/sem
could you check that you still get similar results with this limit
fixed?note that once i've fixed the semaphore limits it started running fine
here. And i see zero idle time during the run on a quad core box.here are my numbers:
# on v2.6.26-rc1-166-gc0a1811
Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55851.4 93 208.4 793.6 0.4654 # BKL: sleep
2000 55402.2 79 210.1 800.1 0.46172000 55728.4 93 208.9 795.5 0.4644 # BKL: spin
2000 55787.2 93 208.7 794.5 0.4649 #so the results are the same within noise.
I'll also check this workload on an 8-way box to make sure it's OK on
larger CPU counts too.could you double-check your test?
plus a tty tidbit as well, during the test i saw a few of these:
Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev
Warning: dev (tty1) tty->count(462) != #fd's(463) in release_dev
Warning: dev (tty1) tty->count(274) != #fd's(275) in release_dev
Warning: dev (tty1) tty->count(4) != #fd's(3) in release_dev
Warning: dev (tty1) tty->count(164) != #fd's(163) in release_devIngo
--
A quick test showed it does work.
--
false alarm there - these were due to the breakage in the hack-patch i
used ...Ingo
--
but but but. Some other users of down() have presumably also regressed. We just
have't found the workload to demonstrate that yet.--
Well, that shows that it was the BKL.
That said, "idle 0%" is easy when you spin. Do you also have actual
performance numbers? I'd hope that not only do we use full CPU time, it's
also at least as fast as the old semaphores were?While I've been dissing sleeping locks (because their overhead is so
high), at least in _theory_ they can get better behavior when not
spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure,
but I'd still like to hear actual performance numbers too, just to be
sure.Anyway, at least the "was it the BKL or some other semaphore user"
question is entirely off the table.So we need to
- fix the BKL. My patch may be a good starting point, but there are
alternatives:(a) reinstate the old BKL code entirely
Quite frankly, I'd prefer not to. Not only did it have three
totally different cases, some of them were apparently broken (ie
BKL+regular preempt didn't cond_resched() right), and I just don't
think it's worth maintaining three different versions, when
distro's are going to pick one anyway. *We* should pick one, and
maintain it.(b) screw the special BKL preemption - it's a spinlock, we don't
preempt other spinlocks, but at least fix BKL+preempt+cond_resched
thing.This would be "my patch + fixes" where at least one of the fixes
is the known (apparently old) cond_preempt() bug.(c) Try to keep the 2.6.25 code as closely as possible, but just
switch over to mutexes instead.I dunno. I was never all that enamoured with the BKL as a sleeping
lock, so I'm biased against this one, but hey, it's just a
personal bias.- get rid of the BKL anyway, at least in anything that is noticeable.
Matthew's patch to file locking is probably worth doing as-is,
simply because I haven't heard any better solutions. The BKL
certainly can't be it, and whatever comes out of the NFSD
discussion will almost certainly involve just making sure that
those leases just use the new ...
Yes. My conclusion is based on the actual number. cpu idle 0% is just
--
Thanks, that's all I wanted to verify.
I'll leave this overnight, and see if somebody has come up with some smart
and wonderful patch. And if not, I think I'll apply mine as "known to fix
a regression", and we can perhaps then improve on things further from
there.Linus
--
hey, i happen to have such a smart and wonderful patch =B-)
i reproduced the AIM7 workload and can confirm Yanmin's findings that
-.26-rc1 regresses over .25 - by over 67% here.Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing suboptimal
performance.The problem comes from the following code. The new semaphore code does
this on down():spin_lock_irqsave(&sem->lock, flags);
if (likely(sem->count > 0))
sem->count--;
else
__down(sem);
spin_unlock_irqrestore(&sem->lock, flags);and this on up():
spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);where __up() does:
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);and where __down() does this in essence:
list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = 0;
for (;;) {
[...]
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;
}the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter from
the wait list.That is all and fine in principle, but as implemented in
kernel/semaphore.c it also creates a nasty, hidden source of con...
On Thu, 8 May 2008 14:01:30 +0200
ok sounds like I like the fairness part of the new semaphores (but
obviously not the 67% performance downside; I'd expect to sacrifice a
little performance.. but this much??).It sucks though; if this were a mutex, we could wake up the owner of
the bugger in the contented acquire path synchronously.... but these
are semaphores, and don't have an owner ;( bah bah bah
--
I did note that earlier downthread ... although to be fair, I thought of
it in terms of three tasks with the third task coming in and stealing
the second tasks's wakeup rather than the first task starving the secondFair is certainly the enemy of throughput (see also dbench arguments
passim). It may be that some semaphore users really do want fairness --
it seems pretty clear that we don't want fairness for the BKL.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
i dont think we need to consider any theoretical arguments about
fairness here as there's a fundamental down-to-earth maintenance issue
that governs: old semaphores were similarly unfair too, so it is just a
bad idea (and a bug) to change behavior when implementing new, generic
semaphores that are supposed to be a seemless replacement! This is about
legacy code that is intended to be phased out anyway.This is already a killer argument and we wouldnt have to look any
further.but even on the more theoretical level i disagree: fairness of CPU time
is something that is implemented by the scheduler in a natural way
already. Putting extra ad-hoc synchronization and scheduling into the
locking primitives around data structures only gives mathematical
fairness and artificial micro-scheduling, it does not actually make the
end result more useful! This is especially true for the BKL which is
auto-dropped by the scheduler anyway. (so descheduling a task will
automatically release it of its BKL ownership)For example we've invested a _lot_ of time and effort into adding lock
stealing (i.e. intentional "unfairness") to kernel/rtmutex.c. Which is a
_lot_ harder to do atomically with PI constraints but still possible and
makes sense in the grand scheme of things. kernel/mutex.c is also
"unfair" - and that's correct IMO.For the BKL in particular there's almost no sense to talk about any
underlying resource and there's almost no expectation from users for
that imaginery resource to be shared fairly.Ingo
--
Peter pointed it out that because sem->count is u32, the <= 0 is in fact
a "== 0" condition - the patch below does that. As expected gcc figured
out the same thing too so the resulting code output did not change. (so
this is just a cleanup)i've got this lined up in sched.git and it's undergoing testing right
now. If that testing goes fine and if there are no objections i'll send
a pull request for it later today.Ingo
---------------->
Subject: semaphores: improve code
From: Ingo Molnar <mingo@elte.hu>
Date: Thu May 08 14:19:23 CEST 2008No code changed:
kernel/semaphore.o:
text data bss dec hex filename
1207 0 0 1207 4b7 semaphore.o.before
1207 0 0 1207 4b7 semaphore.o.aftermd5:
c10198c2952bd345a1edaac6db891548 semaphore.o.before.asm
c10198c2952bd345a1edaac6db891548 semaphore.o.after.asmSigned-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/semaphore.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,7 +54,7 @@ void down(struct semaphore *sem)
unsigned long flags;spin_lock_irqsave(&sem->lock, flags);
- if (unlikely(sem->count <= 0))
+ if (unlikely(!sem->count))
__down(sem);
sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
@@ -76,7 +76,7 @@ int down_interruptible(struct semaphore
int result = 0;spin_lock_irqsave(&sem->lock, flags);
- if (unlikely(sem->count <= 0))
+ if (unlikely(!sem->count))
result = __down_interruptible(sem);
if (!result)
sem->count--;
@@ -102,7 +102,7 @@ int down_killable(struct semaphore *sem)
int result = 0;spin_lock_irqsave(&sem->lock, flags);
- if (unlikely(sem->count <= 0))
+ if (unlikely(!sem->count))
result = __down_killable(s...
Why don't we just make it do the same thing that the x86 semaphores used
to do: make it signed, and decrement unconditionally. And callt eh
slow-path if it became negative.IOW, make the fast-path be
spin_lock_irqsave(&sem->lock, flags);
if (--sem->count < 0)
__down();
spin_unlock_irqrestore(&sem->lock, flags);and now we have an existing known-good implementation to look at?
Rather than making up a totally new and untested thing.
Linus
--
Ok, after having thought that over, and looked at the code, I think I like
your version after all. The old implementation was pretty complex due to
the need to be so extra careful about the count that could change outside
of the lock, so everything considered, a new implementation that is
simpler is probably the better choice.Ergo, I will just pull your scheduler tree.
Linus
--
yeah. I thought about that too, the problem i found is this thing in the
old lib/semaphore-sleepers.c code's __down() path:remove_wait_queue_locked(&sem->wait, &wait);
wake_up_locked(&sem->wait);
spin_unlock_irqrestore(&sem->wait.lock, flags);
tsk->state = TASK_RUNNING;that mystery wakeup once i understood to be necessary for some weird
ordering reason, but it would probably be hard to justify in the new
code, because it's done unconditionally, regardless of whether there are
sleepers around.And once we deviate from the old code, we might as well go for the
simplest approach - which also happens to be rather close to the mutex
code's current slowpath - just with counting property added, legacygreat! Meanwhile a 100 randconfigs booted fine with that tree so i'd say
the implementation is robust.i also did a quick re-test of AIM7 because the wakeup logic changed a
bit from what i tested initially (from round-robin to strict FIFO), and
as expected not much changed in the AIM7 results on the quad:Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55019.9 96 211.6 806.5 0.4585
2000 55116.2 90 211.2 804.7 0.4593
2000 55082.3 82 211.3 805.5 0.4590this is slightly lower but the test was not fully apples to apples
because this also had some tracing active and other small details. It's
still very close to the v2.6.25 numbers. I suspect some more performance
could be won in this particular workload by getting rid of the BKL
dependency altogether.Ingo
--
Did somebody have any trace on which BKL taker it is? It apparently wasn't
file locking. Was it the tty layer?Linus
--
yeah, i captured a trace of all the down()s that happen in the workload,
using ftrace/sched_switch + stacktrace + a tracepoint in down(). Here's
the semaphore activities in the trace:# grep lock_kernel /debug/tracing/trace | cut -d: -f2- | sort | \
uniq -c | sort -n | cut -d= -f1-49 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
12 down <= lock_kernel <= de_put <= proc_delete_inode <
14 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
19 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
58 down <= lock_kernel <= tty_release <= __fput <
62 down <= lock_kernel <= chrdev_open <= __dentry_open <
70 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
2512 down <= lock_kernel <= opost <= write_chan <
2574 down <= lock_kernel <= write_chan <= tty_write <note that this is running the fixed semaphore code, so contended
semaphores are really rare in the trace. The histogram above includes
all calls to down().here's the full trace file (from a single CPU, on a dual-core box
running aim7):http://redhat.com/~mingo/misc/aim7-ftrace.txt
some other interesting stats. Top 5 wakeups sources:
# grep wake_up aim7-ftrace.txt | cut -d: -f2- | sort | uniq -c |
sort -n | cut -d= -f1-6 | tail -5
[...]
340 default_wake_function <= __wake_up_common <= __wake_up_sync <= unix_write_space <= sock_wfree <= skb_release_all <
411 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_read <= do_sync_read <
924 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_write <= do_sync_write <
1301 default_wake_function <= __wake_up_common <= __wake_up <= n_tty_receive_b...
Ok. tty write handling. Nasty. But not as nasty as the open/close code,
perhaps, and maybe we'll get it fixed some day.In fact, I thought we had fixed most of this already, but hey, I was
No, it's not rmap contention. Your profile hits are just on the actual
calculations, and it's all data-dependent arithmetic and loads. Some cache
misses on the page tables, clearly, but it looks like a lot of it is even
just the plain arithmetic (the imul followed by a data-dependent 'lea'
instruction).Some of it is that "page_to_pfn(page)", which involves a nasty division
(divide by sizeof(struct page)). It gets turned into that shift and
multiply, but it's still quite expensive with big constants etc.Linus
--
I have pushed it down to n_tty line discipline code but not within that.
It is on the hit list but I'm working on more pressing stuff first (USB
layer, extracting commonality to start to tackle open etc etc)I don't think fixing n_tty is now a big job if someone wants to take a
swing at it. The driver write/throttle/etc routines below the n_tty ldisc
layer are now BKL clean so it should just be the internal locking of the
buffers, window and the like to tackle.Feel free to have a go 8)
--
Well, it turns out that Ingo's fixed statistics actually put the real cost
in fcntl/ioctl/open/release:310 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
332 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
380 down <= lock_kernel <= tty_release <= __fput <
422 down <= lock_kernel <= chrdev_open <= __dentry_open <rather than the write routines. But it may be that Ingo was just profiling
two different sections, and it's really all of them.Linus
--
That must be ->fasync? If it was file locks the lock_kernel would not
be inlined into sys_fcntl. Or is that an truncated back trace?-Andi (wondering if he should plug ->fasync_unlocked again ...)
--
the first trace had general desktop load mixed into it as well - so
while it's not interesting to AIM7 the BKL does matter in those
situations and i'd not be surprised if it was responsible for certain
categories of desktop lag.The second trace was the correct 'pure' AIM7 workload which produces
very little tty output. It is a quite stable workload and the trace i
uploaded is representative of the totality of that workload. AIM7 runs
for several minutes so there's no significant rampup/rampdown
interaction either.Ingo
--
tty release is probably a few months away from getting cured - I'm
afraid it will almost certainly be the very last user of the BKL in tty
to get fixed as it depends on everything else being sanely locked.
--
Btw, sparse will complain about those, because the source code *looks*
really cheap.The normal "page_to_pfn()" looks trivial:
((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)
which looks like a trivial subtraction and addition of a constant, but the
subtraction is on C pointers, and basically turns into((unsigned long)page - (unsigned long)mem_map) / sizeof(struct page)
and because "struct page" is not some nice power-of-two in size, that
division is rather nasty even though it's a constant size.Linus
--
Sometimes you can fix it.
For example, this change:
- if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
+ if (pte_present(*pte) && page == pfn_to_page(pte_pfn(*pte))) {can simplify things: instead of moving from a 'struct page' to a pfn, it
moves from a pfn to a 'struct page', and that is generally cheaper
(multiply rather than divide by size of struct page). It's not always the
same thing to do, but I think in this case we can. For me, the code
generation changes:- movabsq $7905747460161236407, %rdx #, tmp111
- movabsq $32985348833280, %rax #, tmp107
- leaq (%r12,%rax), %rax #, tmp106
- sarq $3, %rax #, tmp106
- imulq %rdx, %rax # tmp111, tmp106
- movabsq $70368744177663, %rdx #, tmp113
- andq %rdx, %rcx # tmp113, pte$pte
- shrq $12, %rcx #, pte$pte
- cmpq %rcx, %rax # pte$pte, tmp106
+ movabsq $70368744177663, %rax #, tmp107
+ andq %rax, %rdx # tmp107, pte$pte
+ shrq $12, %rdx #, pte$pte
+ imulq $56, %rdx, %rax #, pte$pte, tmp109
+ movabsq $-32985348833280, %rdx #, tmp111
+ addq %rdx, %rax # tmp111, tmp110
+ cmpq %rax, %r13 # tmp110, pagewhich isn't a *huge* deal, but it certainly looks better. One less big
constant, and one less shift.It's not going to make a huge difference, though. That function is just
called too much, and it would still be entirely data-dependent all the way
through.Linus
--
the first one had some idle time in it as well plus the effects of a
2000-task Ctrl-Z - which skewed the histograms.the new stats are more straightforward. down() callsite histogram:
42 down <= lock_kernel <= de_put <= proc_delete_inode <
42 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
78 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
310 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
332 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
380 down <= lock_kernel <= tty_release <= __fput <
422 down <= lock_kernel <= chrdev_open <= __dentry_open <hm, why is chrdev_open() called all that often?
top-5 wakeups:
4 default_wake_function <= __wake_up_common <= complete <= migration_thread <= kthread <= child_rip <
4 wake_up_process <= sched_exec <= do_execve <= sys_execve <= stub_execve <= <
8 wake_up_process <= __mutex_unlock_slowpath <= mutex_unlock <= do_filp_open <= do_sys_open <= sys_open <
40 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up <= sock_def_wakeup <= tcp_rcv_state_process <
98 default_wake_function <= __wake_up_common <= __wake_up_sync <= do_notify_parent <= do_exit <= do_group_exit <i.e. very little wakeup activities. Top 10 scheduling points:
10 schedule <= kjournald <= kthread <= child_rip <
12 schedule <= __down_write_nested <= __down_write <= down_write <
29 schedule <= worker_thread <= kthread <= child_rip <
40 schedule <= schedule_timeout <= inet_stream_connect <= sys_connect <
59 schedule <= __cond_resched <= _cond_resched <= generic_file_buffered_write <
111 schedule <= ksoftirqd <= kthread <= child_rip <
119 schedule <= do_wa...
a second update patch, i've further simplified the semaphore wakeup
logic: there's no need for the wakeup to remove the task from the wait
list. This will make them a slighly bit more fair, but more importantly,
this closes a race in my first patch for the unlikely case of a signal
(or a timeout) and an unlock coming in at the same time and the task not
getting removed from the wait-list.( my performance testing with 2000 AIM7 tasks on a quad never hit that
race but x86.git QA actually triggered it after about 30 random kernel
bootups and it caused a nasty crash and lockup. )Ingo
---------------->
Subject: sem: simplify queue management
From: Ingo Molnar <mingo@elte.hu>
Date: Tue May 06 19:32:42 CEST 2008kernel/semaphore.o:
text data bss dec hex filename
1040 0 0 1040 410 semaphore.o.before
975 0 0 975 3cf semaphore.o.afterSigned-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/semaphore.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -202,33 +202,34 @@ static inline int __sched __down_common(
{
struct task_struct *task = current;
struct semaphore_waiter waiter;
+ int ret = 0;waiter.task = task;
+ list_add_tail(&waiter.list, &sem->wait_list);for (;;) {
- list_add_tail(&waiter.list, &sem->wait_list);
-
- if (state == TASK_INTERRUPTIBLE && signal_pending(task))
- goto interrupted;
- if (state == TASK_KILLABLE && fatal_signal_pending(task))
- goto interrupted;
- if (timeout <= 0)
- goto timed_out;
+ if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+ ret = -EINTR;
+ break;
+ }
+ if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+ ret = -EINTR;
...
ok, it's looking good here so far so here's the scheduler fixes tree
that you can pull if my semaphore fix looks good to you too:git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus
also includes a scheduler arithmetics fix from Mike. Find the shortlog
and diff below.Ingo
------------------>
Ingo Molnar (1):
semaphore: fixMike Galbraith (1):
sched: fix weight calculationskernel/sched_fair.c | 11 ++++++--
kernel/semaphore.c | 64 ++++++++++++++++++++++++---------------------------
2 files changed, 38 insertions(+), 37 deletions(-)diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c863663..e24ecd3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -662,10 +662,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
if (!initial) {
/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS)) {
+ unsigned long thresh = sysctl_sched_latency;
+
+ /*
+ * convert the sleeper threshold into virtual time
+ */
if (sched_feat(NORMALIZED_SLEEPER))
- vruntime -= calc_delta_weight(sysctl_sched_latency, se);
- else
- vruntime -= sysctl_sched_latency;
+ thresh = calc_delta_fair(thresh, se);
+
+ vruntime -= thresh;
}/* ensure we never gain time by being placed backwards. */
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..5e41217 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
unsigned long flags;spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(!sem->count))
__down(sem);
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)
int result = 0;spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->...
This might be the problem that causes the missing wakeups. If you have a
semaphore with n=2, and four processes calling down(), tasks A and B
acquire the semaphore and tasks C and D go to sleep. Task A calls up()
and wakes up C. Then task B calls up() and doesn't wake up anyone
because C hasn't run yet. I think we need another wakeup when task C
finishes in __down_common, like this (on top of your patch):diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5e41217..e520ad4 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
}list_del(&waiter.list);
+
+ /* It's possible we need to wake up the next task on the list too */
+ if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
+ __up(sem);
+
return ret;
}Sven, can you try this with your workload? I suspect this might be it
because XFS does use semaphores with n>1.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
This one fixes the regression too, after applying it on top of bf726e.
Sven
--
this needs to check for ret != 0 as well, otherwise we can be woken but
a timeout can also trigger => we lose a wakeup. I.e. like the patch
below. Hm?Ingo
----------------------------->
Subject: semaphore: fix #3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/semaphore.c | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
struct task_struct *task;
};+static noinline void __sched __up(struct semaphore *sem)
+{
+ struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+ struct semaphore_waiter, list);
+ wake_up_process(waiter->task);
+}
+
/*
* Because this function is inlined, the 'state' parameter will be
* constant, and thus optimised away by the compiler. Likewise the
@@ -231,6 +238,10 @@ static inline int __sched __down_common(
}list_del(&waiter.list);
+ if (unlikely(!list_empty(&sem->wait_list)) &&
+ ((sem->count > 1) || ret))
+ __up(sem);
+
return ret;
}@@ -254,9 +265,10 @@ static noinline int __sched __down_timeo
return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
}-static noinline void __sched __up(struct semaphore *sem)
-{
- struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
- struct semaphore_waiter, list);
- wake_up_process(waiter->task);
-}
+
+ /*
+ * Rotate sleepers - to make sure all of them get woken in case
+ * of parallel up()s:
+ */
+ list_move_tail(&waiter->list, &sem->wait_list);
+
--
Still mangled ... and I don't see how we lose a wakeup. We test for
having the semaphore before we check for having been interrupted, and we
hold the lock the whole time.IOW, what I think you're checking for is:
task A task B
if sem->count >0
break;
sem->count++
wake_up_process(B)
if (state == TASK_INTERRUPTIBLE && signal_pending(task))
break;--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
ok, agreed, that race cannot happen.
Ingo
--
This is exactly it. Or rather, it's even simpler. Three tasks
involved; A and B call xlog_state_get_iclog_space() and end up calling
psema(&log->l_flushsema) [down() by any other name -- the semaphore is
initialised to 0; it's really a completion]. Then task C calls
xlog_state_do_callback() which does:while (flushcnt--)
vsema(&log->l_flushsema);[vsema is AKA up()]
It assumes this wakes up both A and B ... but it won't with Ingo's code;
it'll wake up A twice.So I deem my fix "proven by thought experiment". I haven't tried
booting it or anything.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
i actually have two fixes, made earlier today. The 'fix3' one has been
confirmed by Sven to fix the regression - but i think we need the 'fix
#2' one below as well to make it complete.Ingo
-------------------->
Subject: semaphore: fix3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/semaphore.c | 7 +++++++
1 file changed, 7 insertions(+)Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -258,5 +258,12 @@ static noinline void __sched __up(struct
{
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
+
+ /*
+ * Rotate sleepers - to make sure all of them get woken in case
+ * of parallel up()s:
+ */
+ list_move_tail(&waiter->list, &sem->wait_list);
+
wake_up_process(waiter->task);
}---------------->
Subject: semaphore: fix #2
From: Ingo Molnar <mingo@elte.hu>
Date: Thu May 08 11:53:48 CEST 2008Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/semaphore.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
struct task_struct *task;
};+static noinline void __sched __up(struct semaphore *sem)
+{
+ struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+ struct semaphore_waiter, list);
+ wake_up_process(waiter->task);
+}
+
/*
* Because this function is inlined, the 'state' parameter will be
* constant, and thus optimised away by the compiler. Likewise the
@@ -231,6 +238,9 @@ static inline int __sched __down_common(
}list_del(&waiter.list);
+ if (un...
i just combined them into a single fix, see below.
Ingo
--------------------------->
Subject: semaphore: fix #3
From: Ingo Molnar <mingo@elte.hu>
Date: Sun May 11 09:51:07 CEST 2008Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/semaphore.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -194,6 +194,13 @@ struct semaphore_waiter {
struct task_struct *task;
};+static noinline void __sched __up(struct semaphore *sem)
+{
+ struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
+ struct semaphore_waiter, list);
+ wake_up_process(waiter->task);
+}
+
/*
* Because this function is inlined, the 'state' parameter will be
* constant, and thus optimised away by the compiler. Likewise the
@@ -231,6 +238,9 @@ static inline int __sched __down_common(
}list_del(&waiter.list);
+ if (unlikely(!list_empty(&sem->wait_list)) && sem->count)
+ __up(sem);
+
return ret;
}@@ -254,9 +264,10 @@ static noinline int __sched __down_timeo
return __down_common(sem, TASK_UNINTERRUPTIBLE, jiffies);
}-static noinline void __sched __up(struct semaphore *sem)
-{
- struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
- struct semaphore_waiter, list);
- wake_up_process(waiter->task);
-}
+
+ /*
+ * Rotate sleepers - to make sure all of them get woken in case
+ * of parallel up()s:
+ */
+ list_move_tail(&waiter->list, &sem->wait_list);
+
--
Seems like extra cache line dirtying for no real gain over my solution.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
the gain is rather obvious: two parallel up()s (or just up()s which come
close enough after each other) will wake up two tasks in parallel. With
your patch, the first guy wakes up and then it wakes up the second guy.
I.e. your patch serializes the wakeup chain, mine keeps it parallel.the cache line dirtying is rather secondary to any solution - the first
goal for any locking primitive is to get scheduling precise: to not wake
up more tasks than optimal and to not wake up less tasks than optimal.i.e. can you see any conceptual hole in the patch below?
Ingo
---
kernel/semaphore.c | 6 ++++++
1 file changed, 6 insertions(+)Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -258,5 +258,11 @@ static noinline void __sched __up(struct
{
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
+ /*
+ * Rotate sleepers - to make sure all of them get woken in case
+ * of parallel up()s:
+ */
+ list_move_tail(&waiter->list, &sem->wait_list);
+
wake_up_process(waiter->task);
}
--
That's a laudable goal, but ultimately it's secondary to performance (or
No conceptual holes, just a performance one.
Either we want your patch below or mine; definitely not both.
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5e41217..e520ad4 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -229,6 +229,11 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
}list_del(&waiter.list);
+
+ /* It's possible we need to wake up the next task on the list too */
+ if (unlikely(sem->count > 1) && !list_empty(&sem->wait_list))
+ __up(sem);
+
return ret;--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
but the problem is that by serializing the wakeup chains naively you
introduced a more than 50% AIM7 performance regression.Ingo
--
No, the problem is that the BKL shouldn't be a semaphore at all.
Performance fixed.After that, it's *purely* a correctness issue, and then we might as well
be fair and not allow the stealing of semaphores from under waiters at
ALL. Which is what Matthews original code did.In other words, current -git is fine, and was already confirmed by Sven
to fix the bug (before *any* of your patches were), and was earlier
confirmed to fix the AIM7 performance regression (better than *any* of
your patches were).So I fixed the problems last night already. Stop wasting everybody's time.
Linus
--
That's a different issue. The AIM7 regression is to do with whether a
task that is currently running and hits a semaphore that has no current
holder but someone else waiting should be allowed to jump the queue.
No argument there; performance trumps theoretical fairness.This issue is whether multiple sleepers should be woken up all-at-once
or one-at-a-time. Here, you seem to be arguing for theoretical fairness
to trump performance.(Let's be quite clear; this issue affects *only* multiple
sleepers and multiple wakes given to those sleepers. ie
semaphores-being-used-as-completions and true counting semaphores).--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
Actually, let me just go into this a little further.
In principle, you'd think that we'd want to wake up all the tasks
possible as soon as possible. In practice, Dave Chinner has said that
the l_flushsema introduces a thundering herd (a few hundred tasks can
build up behind it on systems such as Columbia apparently) that then
run into a bottleneck as soon as they're unleashed.Current XFS CVS has a fix from myself and Christoph that gets rid of the
l_flushsema and replaces it with a staggered wakeup of each task that's
waiting as the previously woken task clears the critical section.Obviously, generic up() can't possibly do as well, but by staggering
the release of tasks from __down_common(), we mitigate the herd somewhat.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
the solution is to reduce semaphore usage by converting them to mutexes.
Is anyone working on removing legacy semaphore use from XFS?Ingo
--
This race is completely irrelevant to converting semaphores to mutexes.
It can only occur for semaphores which /can't/ be converted to mutexes.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
i was not talking about the race. I was just reacting on your comments
about thundering herds and staggered wakeups - which is a performance
detail. Semaphores should not regress AIM7 by 50% but otherwise they are
legacy code and their use should be reduced monotonically, so i was
asking why anyone still cares about tuning semaphore details in XFSexactly what usecase is that? Perhaps it could be converted to an atomic
counter + the wait_event() APIs.Ingo
--
That's what we did. l_flushsema is gone. It actually got replaced with
Effectively, it's a completion. It just works better with staggered
wakeups than it does with the naive completion.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
So why not transform it to real completions instead? And if our current
'struct completion' abstraction is insufficient for whatever reason, why
not extend that instead?Ingo
--
My point is that for the only user of counting semaphores and/or
semaphores-abused-as-completions that has so far hit this race, the
serialised wake-up performs better. You have not pointed at a scenario
that _shows_ a parallel wake-up to perform better. Some hand-waving
and talking about lofty principles, yes. But no actual data.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
a 50% AIM7 slowdown maybe? With the BKL being a spinlock again it doesnt
matter much in practice though.Ingo
--
You're not understanding me. This is completely inapplicable to the BKL
because only one task can be in wakeup at a time (due to it having a
maximum value of 1). There's no way to hit this race with the BKL.
The only kind of semaphore that can hit this race is the kind that can
have more than one wakeup in progress at a time -- ie one which can have
a value >1. Like completions and real counting semaphores.So the only thing worth talking about (and indeed, it's now entirely
moot) is what's the best way to solve this problem /for this kind of
semaphore/.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
yes, but even for parallel wakeups for completions it's good in general
to keep more tasks in flight than to keep less tasks in flight.Perhaps the code could throttle them to nr_cpus, but otherwise, as the
BKL example has shown it (in another context), we do much better if we
overload the scheduler (in which case it can and does batch
intelligently) than if we try to second-guess it and under-load it and
create lots of scheduling events.i'd agree with you that are no numbers available pro or contra, so you
it's not really moot in terms of improving the completions code i
suspect? For XFS i guess performance matters.Ingo
--
That might be the case for some users, but it isn't the case for XFS.
The first thing that each task does is grab a spinlock, so if you put as
much in flight as early as possible, you end up with horrible contention
on that spinlock. I have no idea whether this is the common case forI think the completion code is less optimised than the semaphore code
today. Clearly the same question does arise, but I don't know what the
answer is for completion users either.Let's do a quick survey. drivers/net has 5 users:
3c527.c -- execution_cmd has a mutex held, so never more than one task
waiting anyway. xceiver_cmd is called during open and close which I think
are serialised at a higher level. In any case, no performance issue here.iseries_veth.c -- grabs a spinlock soon after being woken.
plip.c -- called in close, no perf implication.
ppp_synctty.c -- called in close, no perf implication.
ps3_gelic_wireless.c - If this isn't serialised, it's buggy.
Maybe drivers/net is a bad example. Let's look at */*.c:
as-iosched.c -- in exit path.
blk-barrier.c -- completion on stack, so only one waiter.
blk-exec.c -- ditto
cfq-iosched.c -- in exit pathcrypto/api.c -- in init path
crypto/gcm.c -- in setkey path
crypto/tcrypt.c -- crypto testing. Not a perf path.fs/exec.c -- waiting for coredumps.
kernel/exit.c -- likewise
kernel/fork.c -- completion on stack
kernel/kmod.c -- completion on stack
kernel/kthread.c -- kthread creation and deletion. Shouldn't be a hot
path, plus this looks like there's only going to be one task waiting.
kernel/rcupdate.c -- one completion on stack, one synchronised by a mutex
kernel/rcutorture.c -- doesn't matter
kernel/sched.c -- both completions on stack
kernel/stop_machine.c -- completion on stack
kernel/sysctl.c -- completion on stack
kernel/workqueue.c -- completion on stacklib/klist.c -- This one seems like it could potentially have lots of
waiters, if only anything actually used klists.It seems like most users use comp...
hm, this sounds like damage that is inflicted on itself by the XFS code.
Why does it signal to its waiters that "resource is available", when in
reality that resource is not available but immediately serialized via a
lock? (even if the lock might technically be some _other_ object)I have not looked closely at this but the more natural wakeup flow here
would be that if you know there's going to be immediate contention, to
signal a _single_ resource to a _single_ waiter, and then once that
contention point is over a (hopefully) much more parallel processing
phase occurs, to use a multi-value completion there.in other words: dont tell the scheduler that there is parallelism in the
system when in reality there is not. And for the same reason, do not
throttle wakeups in a completion mechanism artificially because one
given user utilizes it suboptimally. Once throttled it's not possible toyeah. I'd guess XFS would be the primary user in this area who cares
yeah - although i guess in general it's a bit safer to use an explicit
completion. With a task pointer you have to be sure the task is still
present, etc. (with a completion you are forced to put that completion
object _somewhere_, which immediately forces one to think about lifetime
issues. A wakeup to a single task pointer is way too easy to get wrong.)yeah, that's definitely so.
Ingo
--
So you have 'n' resources available, and you use a counting semaphore for
that resource counting.But you'd still need a spinlock to actually protect the list of resources
themselves. The spinlock protects a different thing than the semaphore.
The semaphore isn't about mutual exclusion - it's about counting resources
and waiting when there are too many things in flight.And you seem to think that a counting semaphore is about mutual exclusion.
It has nothing what-so-ever to do with that.Linus
--
i was reacting to the point that Matthew made:
" The first thing that each task does is grab a spinlock, so if you
put as much in flight as early as possible, you end up with horrible
contention on that spinlock. "We were talking about the case of double, parallel up()s. My point was
that the best guess is to put two tasks in flight in the synchronization
primitive. Matthew's point was that the wakeups of the two tasks should
be chained: one task gets woken first, which then wakes the second task
in a chain. [ Matthew, i'm paraphrasing your opinion so please correct
me if i'm misinterpreting your point. ]My argument is that chaining like that in the synchronization primitive
is bad for parallelism in general, because wakeup latencies are just too
long in general and any action required in the context of the "first
waker" throttles parallelism artificially and introduces artificial
workload delays.Even in the simple list+lock case you are talking about it's beneficial
to keep as many wakers in flight as possible. The reason is that even
with the worst-possible cacheline bouncing imaginable, it's much better
to spin a bit on a spinlock (which is "horrible bouncing" only on paper,
in practice it's nicely ordered waiting on a FIFO spinlock, with a list
op every 100 nsecs or so) than to implement "soft bouncing" of tasks via
an artificial chain of wakeups.That artificial chain of wakeups has a latency of a few microseconds
even in the best case - and in the worst-case it can be a delay of
milliseconds later - throttling not just the current parallelism of the
workload but also hiding potential future parallelism. The hardware is
so much faster at messaging. [*] [**]Ingo
[*] Not without qualifications - maybe not so on 1024 CPU systems, but
certainly so on anything sane up to 16way or so. But if worry about
1024 CPU systems i'd suggest to first take a look at the current
practice in kernel/semaphore.c taking the sema...
Er, I mis-wrote there.
Task A calls up() and wakes up C. Then task B calls up() and wakes up C
again because C hasn't removed itself from the list yet. D never
receives a wakeup. The solution is for C to pass a wakeup along to the
next in line. (The solution remains the same).--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
The commit description says:
<-- snip -->
...
This bug could be related to the regression reported by Yanmin Zhang:
| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
| of regressions with 2.6.26-rc1:
|
| 1) 8-core stoakley: 28%;
| 2) 16-core tigerton: 20%;
| 3) Itanium Montvale: 50%.Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
...<-- snip -->
Can we get that verified and the description updated before it hits
Linus' tree?Otherwise this "could be related" will become unchangable metadata that
will stay forever - no matter whether there's any relation at all.Thanks
Adrian--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed--
that's not needed. Mike's fix is correct, regardless of whether it fixes
... and the problem with that is exactly what?
Ingo
--
Then scrap the part about it possibly fixing a regression and the
It is important that our metadata is as complete and correct as
reasonably possible. Our code is not as well documented as it should be,
and in my experience often the only way to understand what happens and
why it happens is to ask git for the metadata (and I'm actually doing
this even for most of my "trivial" patches).In 3 hours or 3 years someone might look at this commit trying to
understand what it does and why it does this.And there's a big difference between "we do it because it's correct from
a theoretical point of view" and "it is supposed to fix a hugecu
Adrian--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed--
here's a simpler trial baloon test-patch (well, hack) that is also
reasonably well tested. It turns the BKL into a "spin-semaphore". If
this resolves the performance problem then it's all due to the BKL's
scheduling/preemption properties.this approach is ugly (it's just a more expensive spinlock), but has an
advantage: the code logic is obviously correct, and it would also make
it much easier later on to turn the BKL back into a sleeping lock again
- once the TTY code's BKL use is fixed. (i think Alan said it might
happen in the next few months) The BKL is more expensive than a simple
spinlock anyway.Ingo
------------->
Subject: BKL: spin on acquire
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 19:05:40 CEST 2008NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
lib/kernel_lock.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
task->lock_depth = -1;
preempt_enable_no_resched();- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();preempt_disable();
task->lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();
+ }task->lock_depth = depth;
}
--
Are there really that many long lived BKL holders? I have some doubts.
Semaphore makes only sense when the critical region is at least
many thousands of cycles to amortize the scheduling overhead.Ok perhaps this needs some numbers to decide.
The semaphores should be still nearly as fast in theory, especially
Or figure out what made the semaphore consolidation slower? As Ingo
pointed out earlier 40% is unlikely to be a fast path problem, but some
algorithmic problem. Surely that is fixable (even for .26)?Perhaps we were lucky it showed so easily, not in something tricky.
-Andi
--
Absolutely. Yanmin is apparently showing that each call to __down()
results in 1,451 calls to schedule(). wtf?--
I can't figure it out either. Unless schedule() is broken somehow ...
but that should have shown up with semaphore-sleepers.c, shouldn't it?One other difference between semaphore-sleepers and the new generic code
is that in effect, semaphore-sleepers does a little bit of spinning
before it sleeps. That is, if up() and down() are called more-or-less
simultaneously, the increment of sem->count will happen before __down
calls schedule(). How about something like this:diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..ef83f5a 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -211,6 +211,7 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
waiter.up = 0;for (;;) {
+ int i;
if (state == TASK_INTERRUPTIBLE && signal_pending(task))
goto interrupted;
if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -219,7 +220,15 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
goto timed_out;
__set_task_state(task, state);
spin_unlock_irq(&sem->lock);
+
+ for (i = 0; i < 10; i++) {
+ if (waiter.up)
+ goto skip_schedule;
+ cpu_relax();
+ }
+
timeout = schedule_timeout(timeout);
+ skip_schedule:
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;Maybe it'd be enough to test it once ... or maybe we should use
spin_is_locked() ... Ingo?--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
It isn't as if the RT build can't use a different lock type to the
We have some horrible long lasting BKL users left unfortunately.
Alan
--
Well, considering just *how* bad the new BKL apparently is, I think that's
a separate issue. The semaphore implementation is simply not worth it. AtQuite frankly, maybe we _need_ to have a bad BKL for those to ever get
fixed. As it was, people worked on trying to make the BKL behave better,
and it was a failure. Rather than spend the effort on trying to make it
work better (at a horrible cost), why not just say "Hell no - if you have
issues with it, you need to work with people to get rid of the BKL
rather than cluge around it".Linus
--
Put another way: if we had introduced the BKL-as-semaphore with a known
40% performance drop in AIM7, I would simply never ever have accepted the
patch in the first place, regardless of _any_ excuses.Performance is a feature too.
Now, just because the code is already merged should not be an excuse for
it then being shown to be bad. It's not a valid excuse to say "but we
already merged it, so we can't unmerge it". We sure as hell _can_ unmerge
it.Linus
--
that's one often-forgotten BKL site: about 1000 ioctls are still running
under the BKL. The TTY one is hurting the most. To make sure it's only
that BKL acquire/release that hurts, could you try the hack patch below,
does it make any difference to performance?but even if taking the BKL does hurt, it's quite unexpected to cause a
40% drop. Perhaps AIM7 has tons of threads that exit at once and all try
to release their controlling terminal or something like that?Ingo
------------------------>
Subject: DANGEROUS tty hack: no BKL
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 08:21:22 CEST 2008NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/char/tty_io.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)Index: linux/drivers/char/tty_io.c
===================================================================
--- linux.orig/drivers/char/tty_io.c
+++ linux/drivers/char/tty_io.c
@@ -2844,9 +2844,10 @@ out:static int tty_release(struct inode *inode, struct file *filp)
{
- lock_kernel();
+ /* DANGEROUS - can crash your kernel! */
+// lock_kernel();
release_dev(filp);
- unlock_kernel();
+// unlock_kernel();
return 0;
}--
although it's an unlocked_ioctl() now in 2.6.26, so all the BKL locking
if you use a serial console you will need the updated patch below.
Ingo
---------------------->
Subject: no: tty bkl
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 08:21:22 CEST 2008Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/char/tty_io.c | 5 +++--
drivers/serial/serial_core.c | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)Index: linux/drivers/char/tty_io.c
===================================================================
--- linux.orig/drivers/char/tty_io.c
+++ linux/drivers/char/tty_io.c
@@ -2844,9 +2844,10 @@ out:static int tty_release(struct inode *inode, struct file *filp)
{
- lock_kernel();
+ /* DANGEROUS - can crash your kernel! */
+// lock_kernel();
release_dev(filp);
- unlock_kernel();
+// unlock_kernel();
return 0;
}Index: linux/drivers/serial/serial_core.c
===================================================================
--- linux.orig/drivers/serial/serial_core.c
+++ linux/drivers/serial/serial_core.c
@@ -1241,7 +1241,7 @@ static void uart_close(struct tty_struct
struct uart_state *state = tty->driver_data;
struct uart_port *port;- BUG_ON(!kernel_locked());
+// BUG_ON(!kernel_locked());if (!state || !state->port)
return;--
I tested it on my 8-core stoakley. The result is 4% worse than the one of pure
--
Really? Are you sure? That would imply that we keep on waking up tasks
which then fail to acquire the lock. But the code pretty plainly doesn't
do that.Still :(
--
Yes, totally based on the data.
The data means the calling times among functions. Initially , I just collected the caller
of schedule and schedule_timeout. Then I found most schedule/schedule_timeout are called by
__down which is called down. Then, I changes kernel to collect more functions' calling info.If comparing the calling times of down, __down and schedule_timeout, we could find
Yes. The data has an error difference, but the difference is small. My patch doesn't
use lock to protect data in case it might introduces too much overhead.--
That's slightly slanderous to the VFS ;-) The BKL really isn't used
that much any more. So little that I've gone through and produced a
list of places it's used:fs/block_dev.c opening and closing a block device. Unlikely to be
provoked by AIM7.
fs/char_dev.c chrdev_open. Unlikely to be provoked by AIM7.
fs/compat.c mount. Unlikely to be provoked by AIM7.
fs/compat_ioctl.c held around calls to ioctl translator.
fs/exec.c coredump. If this is a contention problem ...
fs/fcntl.c held around call to ->fasync.
fs/ioctl.c held around f_op ->ioctl call (tmpfs doesn't have
ioctl). ditto bmap. there's fasync, as previously
mentioned.
fs/locks.c hellhole. I hope AIM7 doesn't use locks.
fs/namespace.c mount, umount. Unlikely to be provoked by AIM7.
fs/read_write.c llseek. tmpfs uses the unlocked version.
fs/super.c shtdown, remount. Unlikely to be provoked by AIM7.So the only likely things I can see are:
- file locks
- fasync--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
I've wanted to fix file locks for a while. Here's a first attempt.
It was done quickly, so I concede that it may well have bugs in it.
I found (and fixed) one with LTP.It takes *no account* of nfsd, nor remote filesystems. We need to have
a serious discussion about their requirements.diff --git a/fs/locks.c b/fs/locks.c
index 663c069..cb09765 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -140,6 +140,8 @@ int lease_break_time = 45;
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)+static DEFINE_SPINLOCK(file_lock_lock);
+
static LIST_HEAD(file_lock_list);
static LIST_HEAD(blocked_list);@@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
*/
static void locks_delete_block(struct file_lock *waiter)
{
- lock_kernel();
+ spin_lock(&file_lock_lock);
__locks_delete_block(waiter);
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}/* Insert waiter into blocker's block list.
@@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
{
struct file_lock *cfl;- lock_kernel();
+ spin_lock(&file_lock_lock);
for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
if (!IS_POSIX(cfl))
continue;
@@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
fl->fl_pid = pid_vnr(cfl->fl_nspid);
} else
fl->fl_type = F_UNLCK;
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return;
}
EXPORT_SYMBOL(posix_test_lock);
@@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
int error = 0;
int found = 0;- lock_kernel();
- if (request->fl_flags & FL_ACCESS)
+ if (request->fl_flags & FL_ACCESS) {
+ spin_lock(&file_lock_lock);
goto find_conflict;
+ }if (request->fl_type != F_UNLCK) {
error = -ENOMEM;
+
new_fl = locks_alloc_lock();
if (new_fl == N...
I tested it on 8-core stoakley. aim7 result becomes 23% worse than the one of
pure 2.6.26-rc1.I replied this email in case you have many patches and I might test what you don't
expect.--
Ouch. That's really odd. The BKL->spinlock conversion looks really
obvious, so it shouldn't be that noticeably slower.The *one* difference is that the BKL has the whole "you can take it
recursively and you can sleep without dropping it because the scheduler
will drop it for you" thing. The spinlock conversion changed all of that
into explicit "drop and retake" locks, and maybe that causes some issues.But 23% worse? That sounds really odd/extreme.
Can you do a oprofile run or something?
Linus
--
I collected oprofile data. It looks not useful, as cpu idle is more than 50%.
samples % app name symbol name
270157 9.4450 multitask add_long
266419 9.3143 multitask add_int
238934 8.3534 multitask add_double
187184 6.5442 multitask mul_double
159448 5.5745 multitask add_float
156312 5.4649 multitask sieve
148081 5.1771 multitask mul_float
127192 4.4468 multitask add_short
80480 2.8137 multitask string_rtns_1
57520 2.0110 vmlinux clear_page_c
53935 1.8856 multitask div_long
48753 1.7045 libc-2.6.90.so strncat
40825 1.4273 multitask array_rtns
32807 1.1470 vmlinux __copy_user_nocache
31995 1.1186 multitask div_int
31143 1.0888 multitask div_float
28821 1.0076 multitask div_double
26400 0.9230 vmlinux find_lock_page
26159 0.9146 vmlinux unmap_vmas
25249 0.8827 multitask div_short
21509 0.7520 vmlinux native_read_tsc
18865 0.6595 vmlinux copy_user_generic_string
17993 0.6291 vmlinux copy_page_c
16367 0.5722 vmlinux system_call
14616 0.5110 libc-2.6.90.so msort_with_tmp
13630 0.4765 vmlinux native_sched_clock
12952 0.4528 vmlinux copy_page_range
12817 0.4481 libc-2.6.90.so strcat
12708 0.4443 vmlinux calc_delta_mine
12611 0.4409 libc-2.6.90.so memset
11631 0.4066 bash (no symbols)
9991 0.3493 vmlinux update_curr
9328 0.3261 vmlinux unlock_page--
Ahh, so it's probably still the BKL that is the problem, it's just not in
the file locking code. The changes to fs/locks.c probably didn't matter
all that much, and the additional regression was likely just some
perturbation.So it's probably fasync that AIM7 tests. Quite possibly coupled with
/dev/tty etc. No file locking.Linus
--
Do we actually know that the locks code is implicated in this regression?
I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
or remote_llseek().tmpfs tends to do weird stuff - it would be interesting to know if the
regression is also present on ramfs or ext2/ext3/xfs/etc.It would be interesting to see if the context switch rate has increased.
Finally: how come we regressed by swapping the semaphore implementation
anyway? We went from one sleeping lock implementation to another - I'd
have expected performance to be pretty much the same.<looks at the implementation>
down(), down_interruptible() and down_try() should use spin_lock_irq(), not
irqsave.up() seems to be doing wake-one, FIFO which is nice. Did the
implementation which we just removed also do that? Was it perhaps
accidentally doing LIFO or something like that?--
If heavily contended, it could do this.
up() would increment sem->count and cal __up() which would call wake_up()
down() would decrement sem->count.
The unlucky task woken by __up would lose the race and go back to sleep.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
It doesn' have to be heavily contended - if it's just hot and a bit lucky,
it would potentially never schedule at all, because it would never take
the spinlock and serialize the callers.It doesn't even need "unfairness" to work that way. The old semaphore
implementation was very much designed to be lock-free, and if you had one
CPU doing a lock while another did an unlock, the *common* situation was
that the unlock would succeed first, because the unlocker was also the
person who had the spinlock exclusively in its cache!The above may count as "lucky", but the hot-cache-line thing is a big
deal. It likely "lucky" into something that isn't a 50:50 chance, but
something that is quite possible to trigger consistently if you just have
mostly short holders of the lock.Which, btw, is probably true. The BKL is normally held for short times,
and released (by that thread) for relatively much longer times. Which
is when spinlocks tend to work the best, even when they are fair (because
it's not so much a fairness issue, it's simply a cost-of-taking-the-lock
issue!)Linus
--
.. and don't get me wrong: the old semaphores (and the new mutexes) should
also have this property when lucky: taking the lock is often a hot-path
case.And the spinlock+generic semaphore thing probably makes that "lucky"
behavior be exponentially less likely, because now to hit the lucky case,
rather than the hot path having just *one* access to the interesting cache
line, it has basically something like 4 accesses (spinlock, count test,
count decrement, spinunlock), in addition to various serializing
instructions, so I suspect it quite often gets serialized simply because
even the "fast path" is actually about ten times as long!As a result, a slow "fast path" means that the thing gets saturated much
more easily, and that in turn means that the "fast path" turns into a
"slow path" more easily, which is how you end up in the scheduler rather
than just taking the fast path.This is why sleeping locks are more expensive in general: they have a
*huge* cost from when they get contended. Hundreds of times higher than a
spinlock. And the faster they are, the longer it takes for them to get
contended under load. So slowing them down in the fast path is a double
whammy, in that it shows their bad behaviour much earlier.And the generic semaphores really are slower than the old optimized ones
in that fast path. By a *big* amount.Which is why I'm 100% convinced it's not even worth saving the old code.
It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
"slow path" other than the fact that it gets to that slow path much more
these days.Linus
--
i think your theory should be easy to test: Yanmin, could you turn on
CONFIG_MUTEX_DEBUG=y and check by how much AIM7 regresses?Because in the CONFIG_MUTEX_DEBUG=y case the mutex debug code does
exactly that: it doesnt use the single-instruction fastpath [it uses
asm-generic/mutex-null.h] but always drops into the slowpath (to be able
to access debug state). That debug code is about as expensive as the
generic semaphore code's current fastpath. (perhaps even more
expensive.)There's far more normal mutex fastpath use during an AIM7 run than any
BKL use. So if it's due to any direct fastpath overhead and the
resulting widening of the window for the real slowdown, we should see a
severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?Ingo
--
Not agreed.
The BKL is special because it is a *single* lock.
All the "normal" mutex code use fine-grained locking, so even if you slow
down the fast path, that won't cause the same kind of fastpath->slowpath
increase.In order to see the fastpath->slowpath thing, you do need to have many
threads hitting the same lock: ie the slowdown has to result in real
contention.Almost no mutexes have any potential for contention what-so-ever, except
for things that very consciously try to hit it (multiple threads doing
readdir and/or file creation on the *same* directory etc).The BKL really is special.
Linus
--
ok, indeed my suggestion is wrong and this would not be a good
comparison.another idea: my trial-baloon patch should test your theory too, because
the generic down_trylock() is still the 'fat' version, it does:spin_lock_irqsave(&sem->lock, flags);
count = sem->count - 1;
if (likely(count >= 0))
sem->count = count;
spin_unlock_irqrestore(&sem->lock, flags);if there is a noticeable performance difference between your
trial-ballon patch and mine, then the micro-cost of the BKL very much
matters to this workload. Agreed about that?but i'd be _hugely_ surprised about it. The tty code's BKL use should i
think only happen when a task exits and releases the tty - and a task
exit - even if this is a threaded test (which AIM7 can be - not sure
which exact parameters Yanmin used) - the costs of thread creation and
thread exit are just not in the same ballpark as any BKL micro-costs.
Dunno, maybe i overlooked some high-freq BKL user. (but any such site
would have shown up before) Even assuming a widening of the critical
path and some catastrophic domino effect (that does show up as increased
scheduling) i've never seen a 40% drop like this.this regression, to me, has "different scheduling behavior" written all
over it - but that's just an impression. I'm not going to bet against
you though ;-)Ingo
--
I agree that your trial-balloon should likely get rid of the big
regression, since it avoids the scheduler.So with your patch, lock_kernel() ends up being just a rather expensive
spinlock. And yes, I'd expect that it should get rid of the 40% cost,
because while it makes lock_kernel() more expensive than a spinlock and
you might end up having a few more cacheline bounces on the lock due to
that, that's still the "small" expense compared to going through the whole
scheduler on conflicts.So I'd expect that realistically the performance difference between your
version and just plain spinlocks shouldn't be *that* big. I'd expect it to
be visible, but in the (low) single-digit percentage range rather than in
any 40% range. That's just a guess.Linus
--
third attempt - the patch below ontop of v2.6.25 should be quite similar
fastpath atomic overhead to what generic semaphores do? So if Yanmin
tests this patch ontop of v2.6.25, we should see the direct fastpath
overhead - without any changes to the semaphore wakeup/scheduling logic
otherwise.[ this patch should in fact be a bit worse, because there's two more
atomics in the fastpath - the fastpath atomics of the old semaphore
code. ]Ingo
------------------>
Subject: v2.6.25 BKL: add atomic overhead
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008---
lib/kernel_lock.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,7 @@
* Don't use in new code.
*/
static DECLARE_MUTEX(kernel_sem);
+static DEFINE_SPINLOCK(global_lock);/*
* Re-acquire the kernel semaphore.
@@ -47,6 +48,9 @@ int __lockfunc __reacquire_kernel_lock(vdown(&kernel_sem);
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+
preempt_disable();
task->lock_depth = saved_lock_depth;@@ -55,6 +59,9 @@ int __lockfunc __reacquire_kernel_lock(v
void __lockfunc __release_kernel_lock(void)
{
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+
up(&kernel_sem);
}@@ -66,12 +73,16 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
down(&kernel_sem);+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+ }
+
task->lock_depth = depth;
}--
Well, it doesn't have the irq stuff, which is also pretty costly. Also, it
doesn't nest the accesses the same way (with the counts being *inside* the
spinlock and serialized against each other), so I'm not 100% sure you'd
get the same behaviour.But yes, it certainly has the potential to show the same slowdown. But
it's not a very good patch, since not showing it doesn't really prove
much.Linus
--
ok, the one below does irq ops and the counter behavior - and because
the critical section also has the old-semaphore atomics i think this
should definitely be a more expensive fastpath than what the new generic
code introduces. So if this patch produces a 40% AIM7 slowdown on
v2.6.25 it's the fastpath overhead (and its effects on slowpath
probability) that makes the difference.Ingo
------------------->
Subject: add BKL atomic overhead
From: Ingo Molnar <mingo@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008NOT-Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
lib/kernel_lock.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,8 @@
* Don't use in new code.
*/
static DECLARE_MUTEX(kernel_sem);
+static int global_count;
+static DEFINE_SPINLOCK(global_lock);/*
* Re-acquire the kernel semaphore.
@@ -39,6 +41,7 @@ int __lockfunc __reacquire_kernel_lock(v
{
struct task_struct *task = current;
int saved_lock_depth = task->lock_depth;
+ unsigned long flags;BUG_ON(saved_lock_depth < 0);
@@ -47,6 +50,10 @@ int __lockfunc __reacquire_kernel_lock(v
down(&kernel_sem);
+ spin_lock_irqsave(&global_lock, flags);
+ global_count++;
+ spin_unlock_irqrestore(&global_lock, flags);
+
preempt_disable();
task->lock_depth = saved_lock_depth;@@ -55,6 +62,10 @@ int __lockfunc __reacquire_kernel_lock(v
void __lockfunc __release_kernel_lock(void)
{
+ spin_lock_irqsave(&global_lock, flags);
+ global_count--;
+ spin_unlock_irqrestore(&global_lock, flags);
+
up(&kernel_sem);
}@@ -66,12 +77,17 @@ void __lockfunc lock_kernel(void)
struct task_struct *task = current;
int depth = task->lock_depth + 1;- if (likely(!depth))
+ if (likely(!depth)) {
/*
...
No it doesn't. The counter isn't used for any actual *testing*, so the
locking around it and the serialization of it has absolutely no impact on
the scheduling behaviour!Since the big slowdown was clearly accompanied by sleeping behaviour (the
processes who didn't get the lock end up sleeping!), that is a *big* part
of the slowdown.Is it possible that your patch gets similar behaviour? Absolutely. But
you're missing the whole point here. Anybody can make code behave badly
and perform worse. But if you want to just verify that it's about the
sleeping behaviour and timings of the BKL, then you need to do exactly
that: emulate the sleeping behavior, not just the timings _outside_ of the
sleeping behavior.The thing is, we definitely are interested to see whether it's the BKL or
some other semaphore that is the problem. But the best way to test that is
to just try my patch that *guarantees* that the BKL doesn't have any
semaphore behaviour AT ALL.Could it be something else entirely? Yes. We know it's semaphore- related.
We don't know for a fact that it's the BKL itself. There could be other
semaphores that are that hot. It sounds unlikely, but quite frankly,
regardless, I don't really see the point of your patches.If Yanmin tries my patch, it is *guaranteed* to show something. It either
shows that it's about the BKL (and that we absolutely have to do the BKL
as something _else_ than a generic semaphore), or it shows that it's not
about the BKL (and that _all_ the patches in this discussion are likely
pointless).In contrast, these "try to emulate bad behavior with the old known-ok
semaphores" don't show anything AT ALL. We already know it's related to
semaphores. And your patches aren't even guaranteed to show the same
issue.Linus
--
One patch I'd still like Yanmin to test is my one from yesterday which
removes the BKL from fs/locks.c.http://marc.info/?l=linux-fsdevel&m=121009123427437&w=2
Obviously, it won't help if the problem isn't the BKL.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
And I'd personally rather have the network-fs people test and comment on
that one ;)I think that patch is worth looking at regardless, but the problems with
that one aren't about performance, but about what the implications are for
the filesystems (if any)...Linus
--
Oh, well, they don't seem interested.
I can comment on some of the problems though.
fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
protect them against locks.c any more. That's probably OK for fs/nfs/*
since they'll be protected by their own data structures (Someone please
check me on that?), but it's a bad idea for lockd/nfsd which are walking
the lists for filesystems.Are we going to have to export the file_lock_lock? I'd rather not. But
we need to keep nfsd/lockd from tripping over locks.c.Maybe we could come up with a decent API that lockd could use? It all
seems a bit complex at the moment ... maybe lockd should be keeping
track of the locks it owns anyway (since surely the posix deadlock
detection code can't work properly if it's just passing all the locks
through).--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
Poor timing: we're all preparing for and travelling to the annual
Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the
locks it holds on the server. We could alternatively have coded aI'm not sure what you mean when you talk about lockd keeping track of
the locks it owns. It has to keep those locks on inode->i_flock in order
to make them visible to the host filesystem...All lockd really needs, is the ability to find a lock it owns, and then
obtain a copy. As for the nfs client, I suspect we can make do with
something similar...Cheers
Trond--
So, assuming nfs is taking care of its own locking (I don't know if
that's right), that leaves nlm_traverse_locks() and nlm_file_inuse()That sounds right.
--
yeah, i was just trying to come up with patches to probe which one of
the following two possibilities is actually the case:- if the regression is due to the difference in scheduling behavior of
new semaphores (different wakeup patterns, etc.), that's fixable in
the new semaphore code => then the BKL code need not change.- if the regression is due due to difference in the fastpath cost, then
the new semaphores can probably not be improved (much of their appeal
comes from them not being complex and not being in assembly) => then
the BKL code needs to change to become cheaper [i.e. then we want
your patch].Ingo
--
Put another way: let's say that the "good fastpath" is basically a single
locked instruction - ~12 cycles on AMD, ~35 Core 2. That's the
no-bouncing, no-contention case.Doing it with debugging (call overhead, spinlocks, local irq saving rtc)
will probably easily triple it or more, but we're not changing anything
else. There's no "downstream" effect: the behaviour itself doesn't change.
It doesn't get more bouncing, it doesn't start sleeping.But what happens if the lock has the *potential* for conflicts is
different.There, a "longish pause + fast lock + short average code sequece + fast
unlock" is quite likely to stay uncontended for a fair amount of time, and
while it will be much slower than the no-contention-at-all case (because
you do get a pretty likely cacheline event at the "fast lock" part), with
a fairly low number of CPU's and a long enough pause, you *also* easily
get into a pattern where the thing that got the lock will likely also get
to unlock without dropping the cacheline.So far so good.
But that basically depends on the fact that "lock + work + unlock" is
_much_ shorter than the "longish pause" in between, so that even if you
have <n> CPU's all doing the same thing, their pauses between the locked
section are still bigger than <n> times that short time.Once that is no longer true, you now start to bounce both at the lock
*and* the unlock, and now that whole sequence got likely ten times slower.
*AND* because it now actually has real contention, it actually got even
worse: if the lock is a sleeping one, you get *another* order of magnitude
just because you now started doing scheduling overhead too!So the thing is, it just breaks down very badly. A spinlock that gets
contention probably gets ten times slower due to bouncing the cacheline. A
semaphore that gets contention probably gets a *hundred* times slower, or
more.And so my bet is that both the old and the new semaphores had the same bad
break-down sit...
my own guesstimate about AIM7 performance impact resulting out of
CONFIG_MUTEX_DEBUG=y: performance overhead will not be measurable, or
will at most be in the sub-1% range. But i've been badly wrong before :)Ingo
--
Stupid question: why doesn't lock_kernel() use a mutex?
(stupid answer: it'll trigger might_sleep() checks when we do it early in
boot with irqs disabled, but we can fix that)(And __might_sleep()'s system_state check might even save us from that)
Of course, we shouldn't change anything until we've worked out why the new
semaphores got slower.--
Not stupid.
The only reason some code didn't get turned over to mutexes was literally
that they didn't want the debugging because they were doing intentionally
bad things.I think the BKL is one of them (the console semaphore was another, iirc).
Linus
--
down_trylock() is used in atomic code. See for example kernel/printk.c. So
no, that one needs to be irqsafe.Linus
--
i just checked the old implementation on x86. It used
lib/semaphore-sleepers.c which does one weird thing:- __down() when it returns wakes up yet another task via
wake_up_locked().i.e. we'll always keep yet another task in flight. This can mask wakeup
latencies especially when it takes time.The patch (hack) below tries to emulate this weirdness - it 'kicks'
another task as well and keeps it busy. Most of the time this just
causes extra scheduling, but if AIM7 is _just_ saturating the number of
CPUs, it might make a difference. Yanmin, does the patch below make any
difference to the AIM7 results?( it would be useful data to get a meaningful context switch trace from
the whole regressed workload, and compare it to a context switch trace
with the revert added. )Ingo
---
kernel/semaphore.c | 10 ++++++++++
1 file changed, 10 insertions(+)Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -261,4 +261,14 @@ static noinline void __sched __up(struct
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
+
+ if (likely(list_empty(&sem->wait_list)))
+ return;
+ /*
+ * Opportunistically wake up another task as well but do not
+ * remove it from the list:
+ */
+ waiter = list_first_entry(&sem->wait_list,
+ struct semaphore_waiter, list);
+ wake_up_process(waiter->task);
}
--
I tested it on my 8-core stoakley and the result is 12% worse than the one of
pure 2.6.26-rc1.--
Not yet. We don't even know it's the BKL. It's just my best guess.
We're waiting for the original reporter to run some tests Ingo pointedWe talked about this ... the BKL actually requires that you be able to
acquire it with interrupts disabled. Maybe we should make lock_kernel
do this:if (likely(!depth)) {
unsigned long flags;
local_save_flags(flags);
down();
local_irq_restore(flags);
}But tweaking down() is not worth it -- we should be eliminating users of
That's a question for someone who knows x86 assembler, I think.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
hm, where does it require it, besides the early bootup code? (which
should just be fixed)down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that
the assembly is mostly just for the fastpath - and a 40% regression
cannot be about fastpath differences. In the old code the scheduling
happens in lib/semaphore-sleeper.c, and from the looks of it it appears
to be a proper FIFO as well. (plus this small wakeup weirdness it has)i reviewed the new code in kernel/semaphore.c as well and can see
nothing bad in it - it does proper wake-up, FIFO queueing, like the
mutex code.Ingo
--
On Tue, 6 May 2008 19:49:54 +0200
Yeah, the early bootup code. The kernel does accidental lock_kernel()s in
various places and if that renables interrupts then powerpc goeth crunch.Matthew, that seemingly-unneeded irqsave in lib/semaphore.c is a prime site
There's the weird wakeup in down() which I understood for about five
minutes five years ago. Perhaps that accidentally sped something up.
Oh well, more investigation needed..
--
I was just reviewing the code and I came across one of these:
/*
* Some notes on the implementation:
*
* The spinlock controls access to the other members of the semaphore.
* down_trylock() and up() can be called from interrupt context, so we
* have to disable interrupts when taking the lock. It turns out various
* parts of the kernel expect to be able to use down() on a semaphore in
* interrupt context when they know it will succeed, so we have to use
* irqsave variants for down(), down_interruptible() and down_killable()
* too.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
there some user that doesn't have the inode info, or does anything that
might cross inode boundaries?This does seem to drop all locking around the "setlease()" calls down to
the filesystem, which worries me. That said, we clearly do need to do
this. Probably should have done it a long time ago.Hmm?
Linus
--
The deadlock detection crosses inode boundaries.
--b.
--
/proc/locks and deadlock detection both cross inode boundaries (and even
filesystem boundaries). The BKL-removal brigade tried this back in 2.4
and the locking ended up scaling worse than just plonking a singleThe only filesystems that are going to have their own setlease methods
will be remote ones (nfs, smbfs, etc). They're going to need to sleep
while the server responds to them. So holding a spinlock while we callSo that find_conflict doesn't end up in the first column, which causes
diff to treat it as a function name for the purposes of the @@ lines.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
Ok, no worries. Just as long as I know why it's a single lock. Looks ok to
me, apart from the need for testing (and talking to NFS etc people).Linus
--
Please can we just fix the tools not mangle the kernel to work around
silly bugs ?--
The people who control the tools refuse to fix them.
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
That's just plain bullocks.
First off, that "@@" line thing in diffs is not important enough to screw
up the source code for. Ever. It's just a small hint to make it somewhat
easier to see more context for humans.Second, it's not even that bad to show the last label there, rather than
the function name.Third, you seem to be a git user, so if you actually really care that much
about the @@ line, then git actually lets you set your very own pattern
for those things.In fact, you can even do it on a per-file basis based on things like
filename rules (ie you can have different patterns for what to trigger on
for a *.c file and for a *.S file, since in a *.S file the 'name:' thing
_is_ the right pattern).So not only are you making idiotic changes just for irrelevant tool usage,
you're also apparently lying about people "refusing to fix" things as an
excuse.You can play with it. It's documented in gitattributes (see "Defining a
custom hunk header"), and the default one is just the same one that GNU
diff uses for "-p". I think.You can add something like this to your ~/.gitconfig:
[diff "default"]
funcname=^[a-zA-Z_$].*(.*$to only trigger the funcname pattern on a line that starts with a valid C
identifier hing, and contains a '('.And you can just override the default like the above (that way you don't
have to specify attributes), but if you want to do things differently for
*.c files than from *.S files, you can edit your .git/info/attributes file
and make it contain something like*.S diff=assembler
*.c diff=Cand now you can make your ~/.gitconfig actually show them differently, ie
something like[diff "C"]
funcname=^[a-zA-Z_$].*(.*$[diff "assembler"]
funcname=^[a-zA-Z_$].*:etc.
Of course, there is a real cost to this, but it's cheap enough in practice
that you'll never notice.Linus
--
On Tue, 6 May 2008 10:51:12 -0600
That would be their problem. We should refuse to mash the kernel up
because they aren't doing their job. If need be someone can fork a
private git-patch.
--
40%?! That's shocking. Can you tell which semaphore was heavily
contended? I have a horrible feeling that it's the BKL.--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
| Amit K. Arora | [RFC] Heads up on sys_fallocate() |
| James Bottomley | Re: Integration of SCST in the mainstream Linux kernel |
| Stephen Rothwell | Re: Announce: Linux-next (Or Andrew's dream :-)) |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Patrick McHardy | Re: [GIT]: Networking |
| Natalie Protasevich | [BUG] New Kernel Bugs |
