"As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptible BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptible code again. This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL. Ingo explained:
"This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch. It takes top people like Alan Cox to map the semantics and to remove BKL code, and even for Alan (who is doing this for the TTY code) it is a long and difficult task."
Ingo went on to describe how the BKL works, how it differs from other locking mechanisms, and why this complicates removing it permanently from the kernel. He noted that the various dependencies of the lock are lost in the haze of 15 years of code changes: "All this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong." He then suggested "changing the rules of the game", creating a "kill-the-BKL" branch which "turns the BKL into an ordinary albeit somewhat big mutex, with a quirky lock/unlock interface called 'lock_kernel()' and 'unlock_kernel()'."
Ingo noted that the new tree already uncovered a serious regression in the kernel, and continued:
"Once this tree stabilizes, elimination of the BKL can be done the usual and well-known way of eliminating big locks: by pushing it down into subsystems and replacing it with subsystem locks, and splitting those locks and eliminating them. We've done this countless times in the past and there are lots of capable developers who can attack such problems."
Linus responded favorably, "ok, so I'm obviously happy. This is exactly the kind of thing I would want to see." He went on to suggest improvements to the new "kill-the-BKL" branch to make it easier to pull in obvious fixes, allowing the changes to get additional testing in the mainline kernel.
There was much more discussion about these planned changes, some of which is included below.
From: Ingo Molnar
Subject: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: Wednesday, May 14, 2008 - 12:49 pm
As some of the latency junkies on lkml already know it, commit 8e3e076
("BKL: revert back to the old spinlock implementation") in v2.6.26-rc2
removed the preemptible BKL feature and made the Big Kernel Lock a
spinlock and thus turned it into non-preemptible code again. This commit
returned the BKL code to the 2.6.7 state of affairs in essence.
Linus also indicated that pretty much the only acceptable way to change
this (to us -rt folks rather unfortunate) latency source and to get rid
of this non-preemptible locking complication is to remove the BKL.
This task is not easy at all. 12 years after Linux has been converted to
an SMP OS we still have 1300+ legacy BKL using sites. There are 400+
lock_kernel() critical sections and 800+ ioctls. They are spread out
across rather difficult areas of often legacy code that few people
understand and few people dare to touch.
It takes top people like Alan Cox to map the semantics and to remove BKL
code, and even for Alan (who is doing this for the TTY code) it is a
long and difficult task.
According to my quick & dirty git-log analysis, at the current pace of
BKL removal we'd have to wait more than 10 years to remove most BKL
critical sections from the kernel and to get acceptable latencies again.
The biggest technical complication is that the BKL is unlike any other
lock: it "self-releases" when schedule() is called. This makes the BKL
spinlock very "sticky", "invisible" and viral: it's very easy to add it
to a piece of code (even unknowingly) and you never really know whether
it's held or not. PREEMPT_BKL made it even more invisible, because it
made its effects even less visible to ordinary users.
Furthermore, the BKL is not covered by lockdep, so its dependencies are
largely unknown and invisible, and it is all lost in the haze of the
past ~15 years of code changes. All this has built up to a kind of Fear,
Uncertainty and Doubt about the BKL: nobody really knows it, nobody
really dares to touch it and code can break silently and subtly if BKL
locking is wrong.
So with these current rules of the game we cannot realistically fix this
amount of BKL code in the kernel. People won't just be able to change
1300 very difficult and fragile legacy codepaths in the kernel
overnight, just to improve the latencies of the kernel.
So ... because i find a 10+ year wait rather unacceptable, here is a
different attempt: let's try and change the rules of the game :-)
The technical goal is to make BKL removal much more easy and much more
natural - to make the BKL more visible and to remove its FUD component.
To achieve those goals i've created and uploaded the "kill-the-BKL"
prototype branch to the -tip tree, which branch consists of 19 various
commits at the moment:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git kill-the-BKL
This branch (against latest -git) implements the biggest (and by far
most critical) core kernel changes towards fast BKL elimination:
 - it fixes all "the BKL auto-releases on schedule()" assumptions i
   could trigger on my testboxes.
 - it adds a handful of debug facilities to warn about common BKL
   assumptions that are not valid anymore under the new locking model
 - it turns the BKL into an ordinary mutex and removes all
   "auto-release" BKL legacy code from the scheduler.
 - it thus adds lockdep support to the BKL
 - it activates the BKL on UP && !PREEMPT too - this makes the code
   simpler and more universal and hopefully activates more people to get
   rid of the BKL.
 - makes BKL sections again preemptible
 - ... simplifies the BKL code greatly, and moves it out of the core
   kernel
In other words: the kill-the-BKL tree turns the BKL into an ordinary
albeit somewhat big mutex, with a quirky lock/unlock interface called
"lock_kernel()" and "unlock_kernel()".
Certainly the most interesting commit to check is aa3187000:
"remove the BKL: remove it from the core kernel!".
Once this tree stabilizes, elimination of the BKL can be done the usual
and well-known way of eliminating big locks: by pushing it down into
subsystems and replacing it with subsystem locks, and splitting those
locks and eliminating them. We've done this countless times in the past
and there are lots of capable developers who can attack such problems.
In the future we might also want to try to eliminate the self-recursion
(nested locking) feature of the BKL - this would make BKL code even more
apparent.
Shortlog, diffstat and patches can be found below. I've built and boot
tested it on 32-bit and 64-bit x86.
NOTE: the code is highly experimental - it is recommended to try this
with PROVE_LOCKING and SOFTLOCKUP_DEBUG enabled. If you trigger a
lockdep warning and a softlockup warning, please report it.
Linus, Alan: the increased visibility and debuggability of the BKL
already uncovered a rather serious regression in upstream -git. You
might want to cherry pick this single fix, it will apply just fine to
current -git:
| commit d70785165e2ef13df53d7b365013aaf9c8b4444d
| Author: Ingo Molnar
| Date: Wed May 14 17:11:46 2008 +0200
|
| tty: fix BKL related leak and crash
This bug might explain a so far undebugged atomic-scheduling crash i saw
in overnight randconfig boot testing. I tried to keep the fix minimal
and safe. (although it might make sense to refactor the opost() code to
have a single exit site in the future)
Bugreports, comments and any other feedback is more than welcome,
Ingo
------------>
Ingo Molnar (19):
      revert ("BKL: revert back to the old spinlock implementation")
      remove the BKL: change get_fs_type() BKL dependency
      remove the BKL: reduce BKL locking during bootup
      remove the BKL: restruct ->bd_mutex and BKL dependency
      remove the BKL: change ext3 BKL assumption
      remove the BKL: reduce misc_open() BKL dependency
      remove the BKL: remove "BKL auto-drop" assumption from vt_waitactive()
      remove the BKL: remove it from the core kernel!
      softlockup helper: print BKL owner
      remove the BKL: flush_workqueue() debug helper & fix
      remove the BKL: tty updates
      remove the BKL: lockdep self-test fix
      remove the BKL: request_module() debug helper
      remove the BKL: procfs debug helper and BKL elimination
      remove the BKL: do not take the BKL in init code
      remove the BKL: restructure NFS code
      tty: fix BKL related leak and crash
      remove the BKL: fix UP build
      remove the BKL: use the BKL mutex on !SMP too
 arch/mn10300/Kconfig     |  11 ++++
 drivers/char/misc.c      |   8 +++
 drivers/char/n_tty.c     |  13 +++-
 drivers/char/tty_io.c    |  14 ++++-
 drivers/char/vt_ioctl.c  |   8 +++
 fs/block_dev.c           |   4 +-
 fs/ext3/super.c          |   4 -
 fs/filesystems.c         |  12 ++++
 fs/proc/generic.c        |  12 ++--
 fs/proc/inode.c          |   3 -
 fs/proc/root.c           |   9 +--
 include/linux/hardirq.h  |  18 +++---
 include/linux/smp_lock.h |  36 ++---------
 init/Kconfig             |   5 --
 init/main.c              |   7 +-
 kernel/fork.c            |   4 +
 kernel/kmod.c            |  22 +++++++
 kernel/sched.c           |  16 +-----
 kernel/softlockup.c      |   3 +
 kernel/workqueue.c       |  13 ++++
 lib/Makefile             |   4 +-
 lib/kernel_lock.c        | 142 +++++++++++++---------------------------------
 net/sunrpc/sched.c       |   6 ++
 23 files changed, 180 insertions(+), 194 deletions(-)
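
To make the locking semantics in Ingo's announcement more concrete, here is a minimal single-threaded C model of the two behaviours he contrasts: the old BKL, which is silently dropped and retaken around schedule(), and the mutex-like BKL of the kill-the-BKL branch, which sleeping leaves alone. This is an illustrative sketch, not code from the branch. Only the lock_kernel()/unlock_kernel() names and the nesting behaviour come from the announcement; struct task, bkl_held, schedule_old() and schedule_mutex() are names assumed for the example, and the model tracks lock state only rather than doing real locking.

```c
#include <assert.h>

/* Single-threaded user-space model; it tracks lock state only and is
 * not kernel code. All names other than lock_kernel()/unlock_kernel()
 * are assumptions made for this sketch. */

struct task {
    int lock_depth;   /* -1: BKL not held, >= 0: nesting level */
};

static int bkl_held;  /* 1 while the modelled lock is taken */

static void lock_kernel(struct task *t)
{
    if (++t->lock_depth == 0) {   /* only the outermost call takes the lock */
        assert(!bkl_held);
        bkl_held = 1;
    }
}

static void unlock_kernel(struct task *t)
{
    assert(t->lock_depth >= 0);
    if (--t->lock_depth < 0)      /* only the outermost call releases it */
        bkl_held = 0;
}

/* Old behaviour: the scheduler drops the lock while the task is
 * switched out and retakes it before the task runs again.
 * Returns 1 if the lock was dropped behind the caller's back. */
static int schedule_old(struct task *t)
{
    int dropped = 0;

    if (t->lock_depth >= 0) {
        bkl_held = 0;             /* released without the caller asking */
        dropped = 1;
    }
    /* ... other tasks could take the BKL here ... */
    if (t->lock_depth >= 0)
        bkl_held = 1;             /* retaken before the task continues */
    return dropped;
}

/* Mutex-like behaviour: sleeping does not touch the lock at all. */
static int schedule_mutex(struct task *t)
{
    (void)t;
    return 0;
}
```

In this model a task that nests lock_kernel() twice and then calls schedule_old() has its critical section interrupted without any visible unlock in its own code, which is the "invisible" property the announcement complains about; with schedule_mutex() the lock simply stays held.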
From: Linus Torvalds <torvalds@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 2:41 pm 2008
On Wed, 14 May 2008, Ingo Molnar wrote:
>
> Linus, Alan: the increased visibility and debuggability of the BKL
> already uncovered a rather serious regression in upstream -git. You
> might want to cherry pick this single fix, it will apply just fine to
> current -git:
Ok, so I'm obviously happy. This is exactly the kind of thing I would want
to see.
That said, the way it is now set up, it's unreasonable to merge anything
directly, and while I can cherry-pick obvious fixes this way, I do think
we could do things better.
It should be possible to set things up so that it's a config option, and
we can mark it EXPERIMENTAL but still merge it into the standard kernel,
so that we'd have the debug stuff there. That would get a lot more
coverage, especially if it all still *works*, even if the debug stuff then
complains (ie it would be nicer if the lock itself didn't start breaking).
So for example, have CONFIG_DEBUG_BKL turn it into a mutex (and select
mutex debugging), and get all the debug coverage that way, but then when
somebody enters the scheduler with the lock held, first complain, but then
auto-release it anyway. That way, bugs get found and complained about, but
hopefully the machine still ends up working.
Linus
--
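
Linus's suggestion above (complain when the scheduler is entered with the lock held, but auto-release it anyway so the machine keeps working) can be sketched in the same single-threaded style. CONFIG_DEBUG_BKL is the option name he proposes; everything else here, including schedule_debug(), bkl_warnings and the warning text, is hypothetical and only meant to show the "warn, then behave as before" idea.

```c
#include <assert.h>
#include <stdio.h>

/* Single-threaded user-space model of the proposed debug behaviour.
 * Not kernel code; all helper names are assumptions for this sketch. */

struct task {
    int lock_depth;    /* -1: BKL not held, >= 0: nesting level */
};

static int bkl_held;       /* 1 while the modelled lock is taken */
static int bkl_warnings;   /* number of complaints issued so far */

static void lock_kernel(struct task *t)
{
    if (++t->lock_depth == 0)
        bkl_held = 1;
}

static void unlock_kernel(struct task *t)
{
    assert(t->lock_depth >= 0);
    if (--t->lock_depth < 0)
        bkl_held = 0;
}

/* Scheduler entry in the hypothetical debug mode: first complain,
 * then auto-release the lock anyway and retake it afterwards. */
static void schedule_debug(struct task *t)
{
    int depth = t->lock_depth;

    if (depth >= 0) {
        bkl_warnings++;
        fprintf(stderr, "warning: schedule() with BKL held (depth %d)\n",
                depth);
        bkl_held = 0;          /* keep old semantics so nothing deadlocks */
    }
    /* ... the task sleeps here ... */
    if (depth >= 0)
        bkl_held = 1;          /* retake before the task continues */
}
```

The point of the design is that the warning and the fallback are independent: the bug gets reported, yet code that silently depended on the auto-release still runs.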
From: Jonathan Corbet <corbet@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 5:45 pm 2008
Sez Ingo:
> This task is not easy at all. 12 years after Linux has been converted to
> an SMP OS we still have 1300+ legacy BKL using sites. There are 400+
> lock_kernel() critical sections and 800+ ioctls.
There's also every char device open() method - a rather long list in its
own right. I'd be surprised if one in ten of them really needs it, but
one has to look...
I've been looking at the chrdev code anyway, and pondering on how this
might be addressed. Here's some thoughts on alternatives, I'd be
curious what people think:
1: We could add an unlocked_open() to the file_operations structure;
   drivers could be converted over as they are verified not to need the
   BKL on open. Disadvantages are that it grows this structure for a
   relatively rare case - most open() calls already don't need the BKL.
   But it's a relatively easy path without flag days.
2: Create a char_dev_ops structure for char devs and use it instead of
   file_operations. I vaguely remember seeing Al mutter about that a
   while back. Quite a while back. This mirrors what was done with
   block devices, and makes some sense - there's a lot of stuff in
   struct file_operations which is not really applicable to char devs.
   Then struct char_dev_ops could have open() and locked_open(), with
   the latter destined for removal sometime around 2015 or so.
   Advantages are that it's cleaner and separates out some things which
   perhaps shouldn't be mixed anyway. Disadvantage is...well...a fair
   amount of code churn. It would also require chrdev-specific wrappers
   to map straight file_operations calls in the VFS to the new
   callbacks.
3: Provide a new form of cdev_add() which lets the driver indicate
   that the BKL is not needed on open (or anything else?). At a
   minimum, it could just be a new parameter on cdev_add which has a
   value of zero or FIXME_I_STILL_NEED_BKL. Still some churn but easier
   to script and smaller because a lot of drivers are still using
   register_chrdev() - something else worth fixing.
   A more involved form might provide a new chardev_add() which takes
   the new char_dev_ops structure too. Mapping between new and old
   operations vectors would be done internally to avoid breaking older
   drivers before they can be fixed.
4: Just find every char dev open() function and shove in lock_kernel()
   calls, then remove the call from chrdev_open(). The disadvantage
   here is that, beyond lots of work and churn, there's no way to know
   which ones you missed.
I kind of like the combination of 2 and 3, done in such a way that
there's no "every driver must change" flag day. This could be an
interesting project, even... Thoughts?
jon
--
From: Linus Torvalds <torvalds@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 5:56 pm 2008
On Wed, 14 May 2008, Jonathan Corbet wrote:
>
> There's also every char device open() method - a rather long list in its
> own right. I'd be surprised if one in ten of them really needs it, but
> one has to look...
I don't think there are *that* many. I found only 83 instances of
"register_chrdev()" in the kernel, so the open methods should be pretty
limited.
Of course, some open methods call other sub-registrations, but you'd start
off by moving the lock_kernel() down just *one* stage.
So it literally should be:
- remove one lock_kernel/unlock_kernel pair in fs/char_dev.c
- add max 83 pairs in the places that register those things
- external modules will need to add it themselves some day.
> 1: We could add an unlocked_open() to the file_operations structure;
> drivers could be converted over as they are verified not to need the
> BKL on open. Disadvantages are that it grows this structure for a
> relatively rare case - most open() calls already don't need the BKL.
> But it's a relatively easy path without flag days.
I really don't think it's worth the pain. See above. The numbers aren't
that huge, and external modules simply aren't a pressing enough issue.
Linus
--
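
The "move the lock_kernel() down just one stage" step Linus describes can be pictured with a small user-space model. The sketch below is an assumption-laden simplification: the real chrdev_open() and struct file_operations take inode and file arguments, and legacy_open() and pushed_down_open() are hypothetical driver functions. It only shows that the driver body sees the BKL held both before and after the change, while the generic path no longer takes it.

```c
#include <assert.h>

/* User-space model of pushing the BKL from the generic char-device
 * open path down into one driver. Simplified and hypothetical. */

static int lock_depth = -1;          /* -1: BKL not held */

static void lock_kernel(void)
{
    lock_depth++;
}

static void unlock_kernel(void)
{
    assert(lock_depth >= 0);
    lock_depth--;
}

struct file_operations {
    int (*open)(void);               /* simplified: no inode/file args */
};

static int depth_seen_by_driver = -1;

/* A driver open() that has always relied on its caller holding the BKL. */
static int legacy_open(void)
{
    depth_seen_by_driver = lock_depth;
    return 0;
}

/* The same driver after the push-down: it takes the BKL itself. */
static int pushed_down_open(void)
{
    int ret;

    lock_kernel();
    ret = legacy_open();
    unlock_kernel();
    return ret;
}

/* Before: the generic code wraps every driver open() in the BKL. */
static int chrdev_open_old(const struct file_operations *fops)
{
    int ret = 0;

    lock_kernel();
    if (fops->open)
        ret = fops->open();
    unlock_kernel();
    return ret;
}

/* After: the generic code calls the driver without any BKL. */
static int chrdev_open_new(const struct file_operations *fops)
{
    return fops->open ? fops->open() : 0;
}
```

Once every in-tree open() looks like pushed_down_open(), each pair can be audited and removed driver by driver, which is the "usual and well-known way" of eliminating a big lock that Ingo refers to.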
From: Jonathan Corbet <corbet@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 6:07 pm 2008
Linus Torvalds wrote:
> I don't think there are *that* many. I found only 83 instances of
> "register_chrdev()" in the kernel, so the open methods should be pretty
> limited.
There's the drivers calling cdev_add() directly as well - another
40ish. Still not a huge list, I guess.
> So it literally should be:
> - remove one lock_kernel/unlock_kernel pair in fs/char_dev.c
> - add max 83 pairs in the places that register those things
> - external modules will need to add it themselves some day.
This is all certainly doable, but it leaves me with one concern: there
will be no signal to external module maintainers that the change needs
to be made. So, beyond doubt, quite a few of them will just continue to
be shipped unfixed - and they will still run. If any of them actually
*need* the BKL, something awful may happen to somebody someday.
jon
--
From: Linus Torvalds <torvalds@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 6:14 pm 2008
On Wed, 14 May 2008, Jonathan Corbet wrote:
>
> This is all certainly doable, but it leaves me with one concern: there
> will be no signal to external module maintainers that the change needs
> to be made. So, beyond doubt, quite a few of them will just continue to
> be shipped unfixed - and they will still run. If any of them actually
> *need* the BKL, something awful may happen to somebody someday.
External modules have bugs because interfaces change. Film at 11.
It's true, but it definitely shouldn't keep us from just doing it.
Especially since well-maintained external modules (ie the authors follow
big discussions like this) can just take the kernel lock regardless of
kernel version, since it won't even be broken with old kernels.
Of course, well-maintained kernel modules wouldn't depend on the BKL in
the first place. Oh, well.
Linus
--
From: Andi Kleen <andi@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 2:30 pm 2008
Ingo Molnar writes:
> As some of the latency junkies on lkml already know it, commit 8e3e076
> ("BKL: revert back to the old spinlock implementation") in v2.6.26-rc2
> removed the preemptible BKL feature and made the Big Kernel Lock a
> spinlock and thus turned it into non-preemptible code again. This commit
> returned the BKL code to the 2.6.7 state of affairs in essence.
It's a reasonable start, but have you considered doing this work
in tree instead? As in just add all the warnings, but don't actually
change the semantics yet. I suspect you would get far more users
this way and the work would go faster.
It would be reasonable to enable this in -mm if the warnings are
not too intrusive (self disable itself etc.)
Also for fixing the ioctls I'm not sure that dynamic instrumentation
will really work because it would be tough to execute them all.
I suspect some variant of static code analysis would make sense
for the ioctls.
I used to do some auditing with cflow. That won't
catch indirect function calls unfortunately, but if there's
some way to find those and bail out one could do an automated
tool that flags all the ioctls that don't sleep for example
(don't have any sleeping functions in the call chain -- this
might need some manual annotation, but hopefully not much)
Then it would be possible to safely switch those over to a blocking
mutex variant of BKL.
Now there could be some more automated analysis here: for example the
main other user of BKL is character open. I suspect to really
make progress here you would also need an open_unlocked() and
do the same for all the open functions etc.
> According to my quick & dirty git-log analysis, at the current pace of
> BKL removal we'd have to wait more than 10 years to remove most BKL
> critical sections from the kernel and to get acceptable latencies again.
Hmm, is BKL really that common still that it's a latency problem?
The few VFS cases like locks can be fixed without extreme measures.
Most of the legacy users are unlikely to be latency problems,
simply because only very few people (or nobody) still has that hardware
and the code will never run.
Also I wouldn't lose sleep over e.g. let ISDN continue using BKL forever.
-Andi
--
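
The static analysis Andi outlines (walk the call chain of each ioctl, flag those with no sleeping function in it, and bail out on indirect calls) could look roughly like the following. This is a toy sketch over a hand-written call graph, not a description of cflow or of any existing tool: every function name in the graph is hypothetical, and the sleeps[] and has_indirect[] tables stand in for the manual annotation he mentions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical call graph; none of these names are real kernel symbols. */
enum func {
    IOCTL_FAST, IOCTL_WAITS, IOCTL_CALLBACK,
    HELPER_PURE, HELPER_WAIT, SLEEPER,
    NFUNCS
};

enum verdict { NEVER_SLEEPS, MAY_SLEEP, UNKNOWN };

/* calls[a][b] != 0 means function a directly calls function b. */
static const int calls[NFUNCS][NFUNCS] = {
    [IOCTL_FAST]     = { [HELPER_PURE] = 1 },
    [IOCTL_WAITS]    = { [HELPER_PURE] = 1, [HELPER_WAIT] = 1 },
    [IOCTL_CALLBACK] = { [HELPER_PURE] = 1 },
    [HELPER_WAIT]    = { [SLEEPER] = 1 },
};

/* Manual annotations: known sleeping functions, and functions that call
 * through pointers, which this kind of analysis cannot follow. */
static const int sleeps[NFUNCS]       = { [SLEEPER] = 1 };
static const int has_indirect[NFUNCS] = { [IOCTL_CALLBACK] = 1 };

/* Depth-first walk of everything reachable from fn. */
static enum verdict walk(int fn, int *visited)
{
    enum verdict result = NEVER_SLEEPS;
    int callee;

    if (sleeps[fn])
        return MAY_SLEEP;
    if (visited[fn])
        return NEVER_SLEEPS;      /* already examined on this walk */
    visited[fn] = 1;

    if (has_indirect[fn])
        result = UNKNOWN;         /* some callee is not visible */

    for (callee = 0; callee < NFUNCS; callee++) {
        enum verdict v;

        if (!calls[fn][callee])
            continue;
        v = walk(callee, visited);
        if (v == MAY_SLEEP)
            return MAY_SLEEP;     /* one sleeping path is enough */
        if (v == UNKNOWN)
            result = UNKNOWN;
    }
    return result;
}

/* Classify one entry point, starting from a fresh visited set. */
static enum verdict classify(int fn)
{
    int visited[NFUNCS];

    memset(visited, 0, sizeof(visited));
    return walk(fn, visited);
}
```

Only entry points classified as NEVER_SLEEPS would be candidates for an automatic switch; MAY_SLEEP and UNKNOWN results would still need a human to look at them.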
From: Alan Cox <alan@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 5:00 pm 2008
> Most of the legacy users are unlikely to be latency problems,
> simply because only very few people (or nobody) still has that hardware
> and the code will never run.
>
> Also I wouldn't lose sleep over e.g. let ISDN continue using BKL forever.
Most of the legacy users inflict that locking on other code - eg the ISDN
use of the BKL directly impacts on the tty layer work.
--
From: Andi Kleen <andi@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 5:13 pm 2008
Alan Cox wrote:
>> Most of the legacy users are unlikely to be latency problems,
>> simply because only very few people (or nobody) still has that hardware
>> and the code will never run.
>>
>> Also I wouldn't lose sleep over e.g. let ISDN continue using BKL forever.
>
> Most
Most?
>of the legacy users inflict that locking on other code - eg the ISDN
> use of the BKL directly impacts on the tty layer work.
So you just stick unlock_kernel()/lock_kernel() around the call
to TTY (or similar to the entry points)
-Andi
--
From: Alan Cox <alan@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 5:19 pm 2008
> Most?
Yes
>
> >of the legacy users inflict that locking on other code - eg the ISDN
> > use of the BKL directly impacts on the tty layer work.
>
> So you just stick unlock_kernel()/lock_kernel() around the call
> to TTY (or similar to the entry points)
It isn't that simple - I've spent a good deal of time working on it.
There are lots of paths that rely on interactions between modules. Eg we
found stuff racing between the pid structs tty internals and procfs that
happened to be saved by the BKL.
That in itself is a problem Ingo's stuff won't help with: We have lots of
"magic" accidental, undocumented and pot luck BKL locking semantics
between subsystems that are not even visible.
--
From: Linus Torvalds <torvalds@...>
Subject: Re: [announce] "kill the Big Kernel Lock (BKL)" tree
Date: May 14, 5:45 pm 2008
On Wed, 14 May 2008, Alan Cox wrote:
>
> That in itself is a problem Ingo's stuff won't help with: We have lots of
> "magic" accidental, undocumented and pot luck BKL locking semantics
> between subsystems that are not even visible.
The good news is that I suspect they are going away. It probably is mainly
tty and /proc by now, and /proc is pretty close to done.
It's hard to have too many inter-module dependencies when most of the core
modules no longer even take the kernel lock any more.
In the VFS layer, we still have
 - the ioctl thing, obviously. That's just mind-numbing "move things
   down", not hard per se. But there's a *lot* of them (and I suspect the
   huge majority of them don't actually need it, since they'd already be
   racing against read/write anyway if they did).
 - default_llseek(). Probably the same, just a lot less of it.
 - superblock read/write.
and the latter one in particular is really dubious (we already have
"[un]lock_super()" around it all, I think).
The core kernel, VM and networking already don't really do BKL. And it's
seldom the case that subsystems interact with other unrelated subsystems
outside of the core areas.
So it's a lot of work, no doubt, but I do think we should be able to do
it. The most mind-numbing part is literally all the ioctl crud. There's
more ioctl points than there are lock_kernel() calls left anywhere else.
Linus
--
He noted that this had a
Shouldn't this be "effect" instead of "affect"?
Yes, I think that is right, it would effect it rather than affect it.
humhum
you're reading critical words spread by brilliant folks and you're talking about the misspelling of a single word ... !?!?
So? It's nice to know what's right and what's not, be it the subject of the message or its grammar ;-)
It is not a misspelling, affect and effect have different meanings. It is clear what was meant, but can make quite a difference in some circumstances. Sometimes a mastery of a language makes all the difference, especially in technical matters.
The noun "affect" is the emotional face one wears.
The noun "effect" is the result of something.
The verb "affect" is "to change."
The verb "effect" is "to cause."
In this case, the noun "effect" is what belongs.
wha?
a.) Wha?
b.) "He noted that this had a very negative affect on the real time kernel efforts" --> He noted that this ( affected || effected ) the real time kernel efforts negatively.
The grandparent post is correct.
In this sentence "had" is the verb and "effect" is the predicative (and thus noun).
Yes, this is correct because here "affected" is the verb.
Splitting hairs
It's not a spelling error though. "Affect" is spelled correctly. It's a word choice error. And you are correct, "effect" is the correct word. The person you're replying to is also correct that it's not a spelling error.
Gotta love hair-splitting.
--
Program Intellivision and play Space Patrol!
SMP OS
It should have been coded as a SMP OS in the first place.
Then it wouldn't have to be converted to a SMP OS.
I await your SMP OS with baited anticipation.
Break out the time-machine.
baited vs bated
'bated anticipation'... unless you're going for the pun
yeah, you should have designed it !!!!
damn I'm really tired of seeing guys talking about what others should have done ... do it!!
As they say, hindsight is 20-20
Math strikes back!
20-20 == 0
Equation corrected
Naturally, the parent meant to say that hindsight is 20/20.
And for those who are math challenged, 20/20 = 1; 1 = 100%.
It was designed to be a very small thing. It started as a modem interface. Linus didn't anticipate that it would become a SMP OS.
If Linux was started in 1991 (17 years ago), and converted to SMP 12 years ago, it means that it was a SMP OS for roughly 70% of its life.
SMP, sure
SMP boxes weren't widely available in 1991 when Linus started this.
RE: SMP, sure
| SMP boxes weren't widely available in 1991 when Linus started this.
Sure there were. However Linux was originally developed for x86 PC hardware, which didn't get SMP until Pentium Pro in 1995.
Dual Processor Servers with 75MHz Pentium Processors '93
Sorry.
Advent of PC SMP <> Pentium Pro
I got a dual P-120 Mhz (overclocked to 133!) in the shed.
Silicon graphics?
I'm not sure if this counts, but I've heard that old silicon graphics workstations contained two 386's, one for the main processing, and the other for graphics processing.
Nope
SGI was all MIPS, all the time (they owned it!), until they discovered linux.
The 'S' is for 'symmetric'.
Synthesis multiprocessor support.
1.3.4 Target Hardware (:
[*] Sony NEWS 1860 with two 68030 processors.
Nope
I still have a Tyan Dual Pentium motherboard. Max 200 MHz. Won't run MMX processors! I think I bought it in 1993. Still runs, very slowly.
It's so big and complex
Tanenbaum was right.