As some of the latency junkies on lkml already know it, commit 8e3e076
("BKL: revert back to the old spinlock implementation") in v2.6.26-rc2
removed the preemptible BKL feature and made the Big Kernel Lock a
spinlock and thus turned it into non-preemptible code again. This commit
returned the BKL code to the 2.6.7 state of affairs in essence.Linus also indicated that pretty much the only acceptable way to change
this (to us -rt folks rather unfortunate) latency source and to get rid
of this non-preemptible locking complication is to remove the BKL.This task is not easy at all. 12 years after Linux has been converted to
an SMP OS we still have 1300+ legacy BKL using sites. There are 400+
lock_kernel() critical sections and 800+ ioctls. They are spread out
across rather difficult areas of often legacy code that few people
understand and few people dare to touch.It takes top people like Alan Cox to map the semantics and to remove BKL
code, and even for Alan (who is doing this for the TTY code) it is a
long and difficult task.According to my quick & dirty git-log analysis, at the current pace of
BKL removal we'd have to wait more than 10 years to remove most BKL
critical sections from the kernel and to get acceptable latencies again.The biggest technical complication is that the BKL is unlike any other
lock: it "self-releases" when schedule() is called. This makes the BKL
spinlock very "sticky", "invisible" and viral: it's very easy to add it
to a piece of code (even unknowingly) and you never really know whether
it's held or not. PREEMPT_BKL made it even more invisible, because it
made its effects even less visible to ordinary users.Furthermore, the BKL is not covered by lockdep, so its dependencies are
largely unknown and invisible, and it is all lost in the haze of the
past ~15 years of code changes. All this has built up to a kind of Fear,
Uncertainty and Doubt about the BKL: nobody really knows it, nobody
really dares to touch it and c...
So looking a bit more at your trivial fixups, I'd suggest strongly that
they be re-organized a bit.I cherry-picked your tty layer thing, because it was a real fix.
The above doesn't even work in general. It depends on having just a single
level of locking, and is ugly to boot. So wow about we just expose some
version ofdepth = release_kernel_lock()
..
reacquire_kernel_lock(depth);to existing BKL users as a way to safely release and re-aquire it
regardless of depth. That makes the code more generic, but it *also* makesI think this one should be changed - that comment says "do not take", but
in fact you still do take it, you just release it earlier. So we should
just not start out with the kernel locked in the first place. The BKL
doesn't do anything for the init sequence anyway, since all of this is for
code that runs before there even are any other threads (not counting the
idle thread).I don't see anything in there that could *possibly* depend on the kernel
ACK. Code that relies on this is broken anyway.
We used to have lots of "proc_create()" followed by setup code that could
race with "proc_lookup()", but they were fundamentally racy anyway, soAnd obviously this one I'd never take. It would need to work with a
working BKL implementation.Linus
--
On Thu, 15 May 2008 10:41:54 -0700 (PDT)
can we make this even more specific/restricted? Like having something
likecall_bkl_unlocked(function_pointer, argument);
or something that will internally do the full unlock and then the
function call. The last thing we need is another nailgun that BKL using
code can use to staple themselves to something big and fast moving.
By having a more restricted interface... less likely.
Maybe we can even get away with only adrop_bkl_and_schedule();
and nothing else.
--
No, that would defeat the whole purpose of the exercise. This drop on
schedule property makes it possible to have inverse lock order and not
deadlock.That also makes the manual re-acquire on a different level pretty ugly
and deadlock prone.The whole purpose of this patch series was to get rid of this exact
problem so that the BKL turns into something that resembles a normal
lock within the regular locking hierarchy.--
On Thu, 15 May 2008 22:45:55 +0200
I would totally agree with you, except that all these patches
effectively do it manually again ANYWAY :(so what I propose is make it explicit drop_bkl_and_schedule() call only,
and only do them as a very very last resort.For 99% of the rest it does give exactly the regular benefits you
describe. And we can then prioritize these ugly cases to get de-bkl'd
first.
--
Ok, so I'm obviously happy. This is exactly the kind of thing I would want
to see.That said, the way it is now set up, it's unreasonable to merge anything
directly, and while I can cherry-pick obvious fixes this way, I do think
we could do things better.It should be possible to set things up so that it's a config option, and
we can mark it EXPERIMENTAL but still merge it into the standard kernel,
so that we'd have the debug stuff there. That would get a lot more
coverage, especially if it all still *works*, even if the debug stuff then
complains (ie it would be nicer if the lock itself didn't start breaking).So for example, have CONFIG_DEBUG_BKL turn it into a mutex (and select
mutex debugging), and get all the debug coverage that way, but then when
somebody enters the scheduler with the lock held, first complain, but then
auto-release it anyway. That way, bugs get found and complained about, but
hopefully the machine still ends up working.Linus
--
yeah. This is just a first approximation. It might be v2.6.27 stuff, if
hm, we'll got more ideas about other debug helpers, but i dont think
warning in the scheduler is realistic or useful: lots and lots of code
_does_ reschedule with the BKL held and always did - we never knew this
before in a reliable way due to the auto-release. Sleeping locks that
purely nest inside the BKL are the norm in the VFS, in the tty code and
in most other places - they should be fine and are frequently taken in
BKL sections (and frequently produce scheduling there).As the BKL gets pushed inside subsystems, so do inner locks vanish from
its scope - and at the final stage it can become a spinlock or mutex,
depending on what the actual use is.The main point with the mutex is to make the BKL _stricter_ and more
defined - this hurts BKL using code more (see the many fixes that were
needed), but it also makes things much more visible and much more
fixable IMO. This tree turns the BKL into "just another mutex", with a
tiny bit of self-recursion glue on top of it.Btw., often there's potential scheduling at points where BKL using code
does not expect it. So this series might also _fix_ some rare races.The fact that this also makes BKL critical sections involuntarily
preemptible is a side-effect (which is one of my main motivations to do
this whole thing), and it's a pretty much unavoidable side-effect.Also, turning it into a more or less simple mutex with no scheduler
smarts at all, it all fits into our "how do we remove a serializing
lock" workflow rather well. Even if for some piece of code not much
changes in reality, it becomes more familar, less mystic and more
trustable to fix and improve.Btw., while i hacked on this today, i _think_ i've got most of the worst
problems mapped out already. I needed two fixes to get it to boot to a
ssh shell prompt without hanging. I needed 10 more fixes to solve all
the dependencies that lockdep found. Another 5 fixes were exp...
Hi Ingo,
Must be a typo?Regards,
Frederik
--
It's a reasonable start, but have you considered doing this work
in tree instead? As in just add all the warnings, but don't actually
change the semantics yet. I suspect you would get far more users
this way and the work would go faster.It would be reasonable to enable this in -mm if it the warnings are
not too intrusive (self disable itself etc.)Also for fixing the ioctls I'm not sure that dynamic instrumentation
will really work because it would be tough to execute them all.I suspect some variant of static code analysis would make sense
for the ioctls.I used to do some auditing with cflow. That won't
catch indirect function calls unfortunately, but if there's
some way to find those and bail out one could do an automated
tool that flags all the ioctls that don't sleep for example
(don't have any sleeping functions in the call chain -- this
might need some manual annotation, but hopefully not much)Then it would be possible to safely switch those over to a blocking
mutex variant of BKL.Now there could be some more automated analysis here: for example the
main other user of BKL is character open. I suspect to really
make progress here you would also need a open_unlocked() andHmm, is BKL really that common still that it's a latency problem?
The few VFS cases like locks can be fixed without extreme measures.Most of the legacy users are unlikely to be latency problems,
simply because only very few people (or nobody) still has that hardware
and the code will never run.Also I wouldn't lose sleep over e.g. let ISDN continue using BKL forever.
-Andi
--
Most of the legacy users inflict that locking on other code - eg the ISN
use of the BKL directly impacts on the tty layer work.
--
So you just stick unlock_kernel()/lock_kernel() around the call
to TTY (or similar to the entry points)-Andi
--
... assuming that the ISDN code doesn't assume lock continuity across
the TTY call.-hpa
--
And procfs and between the tty and the net config code and ...
Keeping the BKL just in legacy places doesn't work. A counting mutex (ie
one you can self multi-lock) might be very useful to fix some of these
however as once we push it down to the point of being a driver specific
lock we can just give it a driver mutex
--
> Most?
It isn't that simple - I've spent a good deal of time working on it.
There are lots of paths that rely on interactions between modules. Eg we
found stuff racing between the pid structs tty internals and procfs that
happened to be saved by the BKL.That in itself is a problem Ingo's stuff won't help with: We have lots of
"magic" accidental, undocumented and pot luck BKL locking semantics
between subsystems that are not even visible.
--
The good news is that I suspect they are going away. It probably is mainly
tty and /proc by now, and /proc is pretty close to done.It's hard to have too many inter-module dependencies when most of the core
modules no longer even take the kernel lock any more.In the VFS layer, we still have
- the ioctl thing, obviously. That's just mind-numbing "move things
down", not hard per se. But there's a *lot* of them (and I suspect the
huge majority of them don't actually need it, since they'd already be
racing against read/write anyway if they did).- default_llseek(). Probably the same, just a lot less of it.
- superblock read/write.
and the latter one in particular is really dubious (we already have
"[un]lock_super()" around it all, I think).The core kernel, VM and networking already don't really do BKL. And it's
seldom the case that subsystems interact with other unrelated subsystems
outside of the core areas.So it's a lot of work, no doubt, but I do think we should be able to do
it. The most mind-numbing part is literally all the ioctl crud. There's
more ioctl points than there are lock_kernel() calls left anywhere else.Linus
--
Character devices in general.
And what's pretty nasty is that some interfaces force BKL still, so
- fasync
[had some patches for "fasync_locked", not sure if it's worth it]
- character device open
I tried to recruit kernel janitors some time ago to just do all the
ioctl -> ioctl_unlocked/explicit lock_kernel changes. There were a few
patches generated but the effort died down then.BTW for ioctl the dynamic instrumentation method proposed also won't
work because it's basically impossible to exercise all these ioctls-Andi
--
Start at the other end - you can't fix the ioctls until you fix what the
ioctls interact with. That ends up at the basic data structure and once
you fix those the rest just starts to fall into place.
--
In my experience there's usually not too much interaction with other
kernel structures at the random driver ioctl level. I'm sure tty is
different, but it's probably not typical.--
Take a look at how file->f_flags is locked.
Alan
--
Ah I posted patches to fix that one a couple of weeks
ago, but unfortunately they fell out of -mm again.It was part of the "unlocked fasync" patchkit.
-Andi
--
yeah. And Alan has a good point: there _is_ lots of magic stuff
happening below the BKL. One of them are the BKL <-> other lock
dependencies. My stuff helps with mapping that part of the magic: it
turns the BKL into an ordinary mutex and thus integrates it into
lockdep's existing dependency validation machinery.In other words: this stuff makes BKL validation _stronger_, not weaker,
and hence it ultimately helps its mapping and elimination. It turns the
"magic" into something more concrete.It might not help with other magic directly - but it helps indirectly,
because now the "magic" has shrunk, so there's more attention and more
resources available to fix it in the places where the magic hurts. (And
suggestions are welcome for more debug helpers to make more magic more
visible.)Whenever someone narrows the BKL's scope, that will always have to be
done carefully - and that's true of any other lock. This patchset
(except perhaps the boot bits) does not narrow the BKL's scope.It will still be no doubt a tough job (reducing/changing locking of
_any_ locked path is a tough job), but it will now fit into our existing
practices much better and we'll get various reminders from lockdep and
the other debug helpers when we forgot about some detail.Before this there was almost zero feedback from the kernel when
something around the BKL broke: pretty much the only remainder we had
from incorrect BKL elimination were subtle breakages.And my personal experience might matter as well: before this i never
dared to touch BKL code. I once removed _all_ BKL locking from all the
kernel _by accident_ [i typoed a single line in lib/smp_lock.c] and ran
it on my main desktop for about a day and never noticed a thing - until
a few weird TTY messages popped up in the syslog...But with this scheme, i felt _much_ more secure about touching BKL code,
and kicking the BKL from the scheduler was pure joy i have to say. (even
though it will of course remain in the u...
There's also every char device open() method - a rather long list in its
own right. I'd be surprised if one in ten of them really needs it, but
one has to look...I've been looking at the chrdev code anyway, and pondering on how this
might be addressed. Here's some thoughts on alternatives, I'd be
curious what people think:1: We could add an unlocked_open() to the file_operations structure;
drivers could be converted over as they are verified not to need the
BKL on open. Disadvantages are that it grows this structure for a
relatively rare case - most open() calls already don't need the BKL.
But it's a relatively easy path without flag days.2: Create a char_dev_ops structure for char devs and use it instead of
file_operations. I vaguely remember seeing Al mutter about that a
while back. Quite a while back. This mirrors what was done with
block devices, and makes some sense - there's a lot of stuff in
struct file_operations which is not really applicable to char devs.
Then struct char_dev_ops could have open() and locked_open(), with
the latter destined for removal sometime around 2015 or so.Advantages are that it's cleaner and separates out some things which
perhaps shouldn't be mixed anyway. Disadvantage is...well...a fair
amount of code churn. It would also require chrdev-specific wrappers
to map straight file_operations calls in the VFS to the new
callbacks.3: Provide a new form of cdev_add() which lets the driver indicate
that the BKL is not needed on open (or anything else?). At a
minimum, it could just be a new parameter on cdev_add which has a
value of zero or FIXME_I_STILL_NEED_BKL. Still some churn but easier
to script and smaller because a lot of drivers are still using
register_chrdev() - something else worth fixing.A more involved form might provide a new chardev_add() which takes
the new char_dev_ops structure too. Mapping between new and old
operations vectors would be don...
1b: add a .locked_open and move all BKL-requiring code to use that.
When time comes and BKL is gone, .locked_open can be removed again,Iff you create a new char_dev_ops, don't clutter it with the old stuff.
This is the BSD/Solaris tactic, heh :)
--
1c: make the BKL-unsafe drivers depend on !SMP && !PREEMPT.
--
I don't think there are *that* many. I found only 83 instances of
"register_chrdev()" in the kernel, so the open methods should be pretty
limited.Of course, some open methods call other sub-registrations, but you'd start
off by moving the lock_kernel() down just *one* stage.So it literally should be:
- remove one lock_kernel/unlock_kernel pair in fs/char_dev.c
- add max 83 pairs in the places that register those thingsI really don't think it's worth the pain. See above. The numbers aren't
that huge, and external modules simply aren't a pressing enough issue.Linus
--
There's the drivers calling cdev_add() directly as well - another
This is all certainly doable, but it leaves me with one concern: there
will be no signal to external module maintainers that the change needs
to be made. So, beyond doubt, quite a few of them will just continue to
be shipped unfixed - and they will still run. If any of them actually
*need* the BKL, something awful may happen to somebody someday.jon
--
I now have a large patch and my full x86-32 build tree building without
->ioctl() in file_operations. Its a 350K patch and took all day so I'll
begin splitting it out and sending chunks to tree maintainers.Alan
--
External modules have bugs because interfaces change. Film at 11.
It's true, but it definitely shouldn't keep us from just doing it.
Especially since well-maintained external modules (ie the authors follow
big discussions like this) can just take the kernel lock regardless of
kernel version, since it won't even be broken with old kernels.Of course, well-maintained kernel modules wouldn't depend on the BKL in
the first place. Oh, well.Linus
--
...and so that's what I've done. My approach was to find every
register_chrdev() and cdev_add() call, look at the associated
file_operations, then go back to the open() function, if any. Unless it
was almost immediately obvious to me that the function was either (1) so
trivial as to not require locking (quite few of them are "return 0;"), or
(2) clearly doing its own locking, I wrapped the code in the BKL.Finally, I removed the BKL from chrdev_open().
Allmodconfig and allyesconfig makes work here, and this kernel runs on my
dual-core desktop. But, clearly, I don't have all of this hardware.
Actually, I wonder if some of it still exists outside of museums. So
there's probably something stupid in there somewhere.The result is available at:
git://git.lwn.net/linux-2.6.git cdev
I'll put a shortlog and diffstat at the end of this message. For
completeness, there's also a list of files I examined and did *not* change.Assuming nobody tells me I'm completely off-base, I guess my next step is
to start running individual patches past maintainers. Some of them,
probably (I hope), will tell me that I've been wasting my time and that
their code doesn't need the BKL. In such cases, I'll gladly drop the
associated patch. But there's a fair amount of stuff here which clearly
*does* need it still. If all seems well, maybe this tree should get into
linux-next at some point too.Comments?
The changes come out to this:
arch/cris/arch-v10/drivers/gpio.c | 3 ++
arch/cris/arch-v10/drivers/sync_serial.c | 34 ++++++++++++++++----------
arch/cris/arch-v32/drivers/mach-a3/gpio.c | 4 +++
arch/cris/arch-v32/drivers/mach-fs/gpio.c | 5 +++
arch/cris/arch-v32/drivers/sync_serial.c | 33 +++++++++++++++-----------
arch/mips/kernel/rtlx.c | 7 ++++-
arch/mips/kernel/vpe.c | 12 +++++++--
arch/mips/sibyte/common/sb_tbprof.c | 25 ++++++++++++++-----
arch/sh/boards/landisk/gio.c | 10 +++++--
...
Btw, Jonathan, would you be willing to maintain some kind of tree of these
BKL removal patches?This is different from the work Ingo is doing in the sense that these
things should be (a) safe and reasonably obvious and thus (b) presumably
ready to be merged in the next merge window. Ingo's BKL debugging tree is
likely a good thing to use to find places that need work, but actually
removing the BKL from some subsystem is a different issue.(And when I say "safe and reasonably obvious" I obviously don't mean that
there can't be bugs. Mistakes happen, and some BKL use might be overly
subtle like the issue that Alan pointed out with an empty ->open routine
almost accidentally serializing with initialization, but that's why I'd
not merge these things after -rc2 anyway, but in the next merge window).Because if you're willing to maintain a BKL-cleanup tree that gets merged
into linux-next etc, I'd submit my VFAT/MSDOS BKL removal patch to you.
The reason I did that one was that Thomas actually reported that to be a
major source of latency problems on one of his embedded systems (80ms
latency!), so it would be nice to have that patch in some place where it
might get tested.Linus
--
Sure, I can do that - as long as people don't mind that committed to
being with the in-laws and off the net for the first couple of weeks in
June. I'll try to get the first version up shortly, after yet another
pass over the chardev pushdown stuff.jon
--
May I suggest just adding a comment in those files, just saying something
like/* This does not need the BKL, because .. */
where even the "because" part could be dropped when it's really obvious.
That way that "list of files I examined and did *not* change" would be
obvious in the patch itself, and we also have some documentation thatSame deal - just document the fact that the BKL isn't needed.
Yeah, in the long run that kind of documentation is worthless and we may
want to get rid of it again in a year or two, but in the short run it's a
good idea. If only to help people who want to review your patches.Btw, do you have gitweb running anywhere?
Linus
--
No, I guess I need to figure out how to set it up. Either that or get
one of those kernel.org accounts and put things there.Thanks,
jon
--
Note that the majority of drivers use (grep suggests up to 165
of them) uses misc_register instead of register_chrdev/cdev_add.
Your patches are still correct, because you pushed the BKL into the
misc_open function, but there is an obvious next step in pushing
it further into the misc drivers.
There are probably a few more subsystems with minor number specificIn your current git tree, this change is no longer the final one, so
bisecting the series may cause other bugs. You should probably reorder
the patches at some point to avoid this.Arnd <><
--
There's a few intermediate dispatcher levels like this, actually.
Lots of video drivers get called behind video_open(), usb drivers from
usb_open(), etc. Not much to be done but to push things down one levelBisection is going to be problem regardless - if a problem turns up,
it's going to be the chrdev_open() change which gets fingered. I bet,
though, that it will be a rare BKL-related problem which is reproducible
enough to be easily bisectable.But, yes, I do need to reorganize the patch series once I'm done adding
on changes.jon
--
I've given it a try for all the misc drivers that have an open() function.
The vast majority of them are actually watchdog drivers, all of which
register as a misc device by themselves. You seem to already have a script
to turn per-file changes into a patch each, so I'm sending you two patches:
one for all the watchdog drivers (maybe Wim can take care of that as well)
and one for all the other misc drivers (this one needs to be split).Arnd <><
--
OK, it looks like the "misc" misc drivers patch can go into the
bkl-removal tree, while the watchdog patches should not. What that
means, I guess, is that the final misc_open() patch cannot go in at this
point; Alan's watchdog stuff needs to find its way in first. MakeAlas, I have no such script. I just committed each change as I made it
- each one required individual attention anyway. The misc changes look
pretty straightforward, so I could probably hack up such a thing pretty
quickly if you don't have a tree with broken out patches.Thanks,
jon
--
Right, unless Alan or Wim are confident enough that removing the
BKL won't break the drivers (more than they are today).
Almost all of the open functions go along the lines ofint open(struct file *f, struct inode *i)
{
if (wd_is_open)
return -EBUSY;
wd_is_open = 1;start_wd();
return nonseekable_open(f, i);
}nonseekable_open doesn't need the BKL by itself, and the wd_is_open
variable is protected by the misc_mtx mutex.
I can't see any scenario in which start_wd() would need the BKL, or
where a watchdog driver needs cycle_kernel_lock(), but I was't confident
enough about that assessment, because I'm not really familiar withI've done a semi-automated split and applied the patches on top of your
tree. You can pull these fromgit://git.kernel.org/pub/scm/linux/kernel/git/arnd/cell-2.6 bkl-removal
(I guess I should do a separate tree for it, will do that if more stuff
comes up.)Arnd <><
--
You need to review the use of misc_register(). Which is what I did
already and sorted out for each watchdog - the job is done and completed
and the various problem cases fixed. Watchdog has already been made BKL
removal safe in the patch series I sent.Alan
--
The Big Kernel Lock has been pushed down from chardev_open
to misc_open, this change moves it to the individual misc
driver open functions.As before, the change was purely mechanical, most drivers
should actually not need the BKL. In particular, we still
hold the misc_mtx() while calling the open() function
The patch should probably be split into one changeset
per driver.Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Index: linux-2.6/arch/arm/common/rtctime.c
===================================================================
--- linux-2.6.orig/arch/arm/common/rtctime.c
+++ linux-2.6/arch/arm/common/rtctime.c
@@ -16,6 +16,7 @@
#include <linux/poll.h>
#include <linux/proc_fs.h>
#include <linux/miscdevice.h>
+#include <linux/smp_lock.h>
#include <linux/spinlock.h>
#include <linux/capability.h>
#include <linux/device.h>
@@ -282,6 +283,7 @@ static int rtc_open(struct inode *inode,
{
int ret;+ lock_kernel();
mutex_lock(&rtc_mutex);if (rtc_inuse) {
@@ -301,6 +303,7 @@ static int rtc_open(struct inode *inode,
}
}
mutex_unlock(&rtc_mutex);
+ unlock_kernel();return ret;
}
Index: linux-2.6/arch/blackfin/mach-bf561/coreb.c
===================================================================
--- linux-2.6.orig/arch/blackfin/mach-bf561/coreb.c
+++ linux-2.6/arch/blackfin/mach-bf561/coreb.c
@@ -32,6 +32,7 @@
#include <linux/device.h>
#include <linux/ioport.h>
#include <linux/module.h>
+#include <linux/smp_lock.h>
#include <linux/uaccess.h>
#include <linux/fs.h>
#include <asm/dma.h>
@@ -196,6 +197,7 @@ static loff_t coreb_lseek(struct file *fstatic int coreb_open(struct inode *inode, struct file *file)
{
+ lock_kernel();
spin_lock_irq(&coreb_lock);if (coreb_status & COREB_IS_OPEN)
@@ -204,10 +206,12 @@ static int coreb_open(struct inode *inod
coreb_status |= COREB_IS_OPEN;spin_unlock_irq(&coreb_...
please drop the coreb.c changes from your patch
-mike
--
At a minimum, I would hope such a request would say something like "I've
looked at the driver's locking and am convinced that the BKL is not
needed." Have you done that? There is a certain leap of faith involved
in removing that protection from a driver.I decided to take a quick look...
- You use spin_lock_irq(&coreb_lock) in a number of places, but you do
not take the lock in the interrupt handler. You also do not take the
lock in coreb_write() or coreb_read(), so those can race with the
interrupt handler, with ioctl(), and with each other.- coreb_write() and coreb_read() do interruptible waits, but do not
check to see whether they were interrupted. They will, in fact,
continue in their I/O loops after a signal.- In both functions you have:
unsigned long p = *ppos;
if (p + count > coreb_size)
return -EFAULT;that calculation can overflow.
- You also do this:
static ssize_t coreb_write(struct file *file, const char *buf, size_t count,
loff_t * ppos)
/* ... */
set_dma_start_addr(CH_MEM_STREAM2_SRC, (unsigned long)buf);In other words, the DMA is done directly to/from a user-space
address. Maybe that's safe on Blackfin, I don't know...- I have no idea why some of your functions are using d_inode->i_mutex.
- In coreb_ioctl():
spin_lock_irq(&coreb_lock);
if (coreb_status & COREB_IS_RUNNING) {
retval = -EBUSY;
break;
}this will exit the function with the spinlock still held and
interrupts disabled.case CMD_COREB_RESET:
printk(KERN_INFO "Resetting Core B\n");
bfin_write_SICB_SYSCR(bfin_read_SICB_SYSCR() | 0x0080);
break;You do not acquire the lock here, so this can race against other
ioctl() calls. And ioctl() can race against read() and write().Registration and such seem reasonable, so I can't come up with a
scenario where loss of BKL protection will create trouble. Given the
other problems there, though, I'll confess to being a bit nervous about
...
the lock is to protect one thing: coreb_status. we lock around any
access to it, so it not being grabbed in the irq handler or any other
function where coreb_status is not utilized is irrelevant. that means
the BKL is not needed in the driver.the rest of your comments are more or less on target, but again
irrelevant to the topic of the BKL. i'll keep them in mind when i
rewrite the driver, thanks.
-mike
--
On Tue, 20 May 2008 01:26:28 +0200
Acked-by: Alan Cox <alan@redhat.com>
--
this open func already has a spinlock protecting it. doesnt that mean
we dont need the bkl in it ?
-mike
--
The existence of a spinlock is a good sign. But, until somebody has
looked at the code and verified that said lock is really protecting
everything, it's best to leave the BKL protection (which has always been
there, just at a higher level) in place.jon
--
if the spinlock doesnt do what it's advertising (preventing mutual
access), then the BKL is needed. if there's some UP behavior i'm not
aware of, then the BKL is needed. otherwise, the BKL is not needed in
this driver.i should prob rewrite this driver anyways ... the open code could
easily be replaced with some atomic funcs.
-mike
--
Since all misc drivers that have an open() function now take the
BKL in there, there is no longer the need to take it in the common
misc_open() function.Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Index: linux-2.6/drivers/char/misc.c
===================================================================
--- linux-2.6.orig/drivers/char/misc.c
+++ linux-2.6/drivers/char/misc.c
@@ -50,7 +50,6 @@
#include <linux/device.h>
#include <linux/tty.h>
#include <linux/kmod.h>
-#include <linux/smp_lock.h>/*
* Head entry for the doubly linked miscdevice list
@@ -121,7 +120,6 @@ static int misc_open(struct inode * inod
int err = -ENODEV;
const struct file_operations *old_fops, *new_fops = NULL;- lock_kernel();
mutex_lock(&misc_mtx);list_for_each_entry(c, &misc_list, list) {
@@ -159,7 +157,6 @@ static int misc_open(struct inode * inod
fops_put(old_fops);
fail:
mutex_unlock(&misc_mtx);
- unlock_kernel();
return err;
}--
You have to be careful before assuming but yes - seems sensible.
I'm currently munching my way through the watchdog drivers fixing them up
for unlocked_ioctl/BKL drops and finding various things needing fixingDefinitely
--
If they literaly are 'return 0' you can just remove them, as a
Even if clearly does it's own locking please add the BKL for now and let
the maintainers sort it out later, better be safe then sorry.Except for that thanks a lot, this is the kind of work that's more
productive than all these discussions here :)For some reason about 80 instances seem awfully few, but we've move a
lot of device into subsystems from beeing plain chardevs so this
might actually be correct.
--
And here's a patch to do just that: remove all empty chardev
open/release methods. Based on the list compiled by Jonathan.(and yeah, ip2_ipl_open is not technically empty at the source level,
but only at the binary level :))Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: linux-2.6/arch/cris/arch-v10/drivers/i2c.c
===================================================================
--- linux-2.6.orig/arch/cris/arch-v10/drivers/i2c.c 2008-05-16 17:58:15.000000000 +0200
+++ linux-2.6/arch/cris/arch-v10/drivers/i2c.c 2008-05-16 17:58:19.000000000 +0200
@@ -563,18 +563,6 @@ i2c_readreg(unsigned char theSlave, unsi
return b;
}-static int
-i2c_open(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
-static int
-i2c_release(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
/* Main device API. ioctl's to write or read to/from i2c registers.
*/@@ -619,8 +607,6 @@ i2c_ioctl(struct inode *inode, struct fi
static const struct file_operations i2c_fops = {
.owner = THIS_MODULE,
.ioctl = i2c_ioctl,
- .open = i2c_open,
- .release = i2c_release,
};int __init
Index: linux-2.6/arch/cris/arch-v32/drivers/i2c.c
===================================================================
--- linux-2.6.orig/arch/cris/arch-v32/drivers/i2c.c 2008-05-16 17:57:56.000000000 +0200
+++ linux-2.6/arch/cris/arch-v32/drivers/i2c.c 2008-05-16 17:58:07.000000000 +0200
@@ -633,18 +633,6 @@ i2c_readreg(unsigned char theSlave, unsi
return b;
}-static int
-i2c_open(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
-static int
-i2c_release(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
/* Main device API. ioctl's to write or read to/from i2c registers.
*/@@ -689,8 +677,6 @@ i2c_ioctl(struct inode *inode, struct fi
static const struct file_operations i2c_fops = {
.owner = THIS_MODULE,
.ioctl = i2c_ioctl,
- .open = i2c_open,
- .release = i2c_release,
};static int __i...
Actually it turns out you can introduce bugs doing this when the BKL is
pushed down.The problem is the methods are not NULL, they (with the lock pushed down
are){
lock_kernel();
unlock_kernel();
}And we have drivers with setup code that does things in the wrong order
but under the BKL. eg one I just fixed didmisc_register()
init locks
allocate memory
do stuff
return 0;The lock/unlock in the open happens to save your butt against the wrong
order of intialisation because the open cannot occur before the lock is
taken, and thanks to the BKL it cannot make any progress until the setup
is completed. Fun too - udev loves opening things as they appear so in
some cases we might actually trigger them too.So when you remove the _open() empty methods *please* make sure you have
verified the correctness and ordering of the entire registration path.
I've found three examples of this so far just cleaning up
drivers/watchdog.Alan
--
Hmph.
As it turns out, a misc driver will still be OK because the BKL has not
(yet) been pushed past misc_open(). What this does mean, though, is
that all of those empty and trivial open functions need to be
revisited. I thought this looked too easy the first time through...jon
--
I think it would be best to make them lock/unlock kernel in the first
pass and then work through them. The BKL can be subtle and evil, but as I
brought it into the world I guess I must banish it ;)Alan
--
> Index: linux-2.6/drivers/char/ip2/ip2main.c
Looks fine but please send it via my tty tree so it doesn't collide with
the tty work.--
In general when changing semantics drastically you should force
compile errors by renaming the respective entry point. That
has been the standard Linux method for this for years.I doubt it will be very interesting, but it would be useful.
The goal less being to get rid of BKL in old drivers, but not
requiring BKL in new drivers. Basically all BKL assumptions
in interfaces really should go.-Andi
--
No, we really do want to get rid of BKL in old drivers too. Or at least in
the interfaces.Linus
--
In the interfaces definitely yes and all subsystems should have their
own lock_kernel calls, but why in the old drivers? For those it's very
unlikely they are used on any SMP system anyways (e.g. anything
depending on CONFIG_ISA) or if they do only on 2 CPU systems.Of course if you can find someone to do the work it wouldn't be
bad, just wouldn't seem like a particularly useful investment of time to
me.Also it would be bad if the people who did such conversions didn't
actually test it and that's a great danger with many old drivers because
nearly nobody has the hardware (and if they do it won't be in a SMP system)-Andi
--
Because
- You need to verify the locking assumptions that remain are entirely
driver internal
- At the point you achieve that you've done *ALL* the work required to
add a driver specific lock--
>>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes:
Andi> In the interfaces definitely yes and all subsystems should have
Andi> their own lock_kernel calls, but why in the old drivers? For
Andi> those it's very unlikely they are used on any SMP system anyways
Andi> (e.g. anything depending on CONFIG_ISA) or if they do only on 2
Andi> CPU systems.I'm still running an SMP server with ISA slots. I'd love to
contribute testing and possibly coding to this effort, but
realisticlly I'll be able to compile and boot stuff.John
--
I do too (although one CPU has died recently), but how many ISA devices
do you use in it? Mine used to have a ISA ISDN card, but that was it
and then no ISA anymore even though the slots are still in there.Also on 2 CPU systems BKL is not that critical anyways. It only starts
to hurt on larger CPU counts.-Andi
--
>>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes:
Andi> In the interfaces definitely yes and all subsystems should have
Andi> their own lock_kernel calls, but why in the old drivers? For
Andi> those it's very unlikely they are used on any SMP system anyways
Andi> (e.g. anything depending on CONFIG_ISA) or if they do only on 2Andi> I do too (although one CPU has died recently), but how many ISA
Andi> devices do you use in it? Mine used to have a ISA ISDN card, but
Andi> that was it and then no ISA anymore even though the slots are
Andi> still in there.I must admit I used to have an ISA Cyclades 8 port serial card running
in there, but now it's all PCI stuff. It only had one ISA slot. So
yes, ISA SMP boxes are slowly dying, but they'll be around for a long
time to come.Andi> Also on 2 CPU systems BKL is not that critical anyways. It only
Andi> starts to hurt on larger CPU counts.True, but with the growing number of multicore systems, esp on
desktops, it's going to be an issue. How will the BKL work on quad
core boxes?I'm going to shutup now, since I don't have the knowledge to really
contribute much to the discussion.John
--
My point was that for those few users it's actually better to keep
the BKL. If you try to remove it on some old driver you cannot test
due to lack of hardware the risk of breaking the driver is higher than the
gain you would get from removing it. The best you can do in this
legacy code is to keep it running with minimal changes in the oldMulticore systems don't have ISA slots.
[yes I'm sure someone will tell me now about the ISA-over-USB device
that exists. Don't bother, it doesn't add anything to the point]-Andi
--
Its beginning to sound like should start 2.7 ;)
Alan
--
Ingo Molnar wrote:
<snip description and patches>
I decided to give this tree a try, and I got:
[4294034.386085] ------------[ cut here ]------------
[4294034.387882] WARNING: at fs/proc/generic.c:669
create_proc_entry+0x3d/0xc5()
[4294034.390059] Pid: 2565, comm: Xorg Not tainted
2.6.26-rc2-00456-gd9df34e #35
[4294034.392682]
[4294034.392683] Call Trace:
[4294034.394071] [<ffffffff8022a8ac>] warn_on_slowpath+0x53/0x81
[4294034.394077] [<ffffffff802b57b4>] ? proc_register+0xf7/0x162
[4294034.394081] [<ffffffff802b580b>] ? proc_register+0x14e/0x162
[4294034.394087] [<ffffffff804afa05>] ? _spin_unlock+0x30/0x4b
[4294034.394091] [<ffffffff802b580b>] ? proc_register+0x14e/0x162
[4294034.394095] [<ffffffff8021a127>] ? startup_ioapic_irq+0x54/0x5f
[4294034.394099] [<ffffffff802b5fcc>] create_proc_entry+0x3d/0xc5
[4294034.394103] [<ffffffff80252008>] register_irq_proc+0x84/0xa0
[4294034.394108] [<ffffffff80250b1a>] setup_irq+0x1b2/0x21b
[4294034.394113] [<ffffffff80250cc8>] request_irq+0xf1/0x117
[4294034.394117] [<ffffffff8038aaaa>] ? radeon_driver_irq_handler+0x0/0x7e
[4294034.394122] [<ffffffff803789b6>] ? drm_control+0x0/0x186
[4294034.394126] [<ffffffff80378acf>] drm_control+0x119/0x186
[4294034.394130] [<ffffffff803770be>] drm_ioctl+0x1d3/0x265
[4294034.394589] [<ffffffff80283e5e>] vfs_ioctl+0x5e/0x77
[4294034.394593] [<ffffffff802840d2>] do_vfs_ioctl+0x25b/0x270
[4294034.394598] [<ffffffff804af23e>] ? trace_hardirqs_on_thunk+0x35/0x3a
[4294034.394601] [<ffffffff80284129>] sys_ioctl+0x42/0x65
[4294034.394606] [<ffffffff8020b2eb>] system_call_after_swapgs+0x7b/0x80
[4294034.411597]
[4294034.411601] ---[ end trace 7f52164e4c2b9927 ]---I have no idea if that is even related to the BKL or not, I haven't even
opened the source file yet, but I figured I'd report it.--
Kevin Winchester
--
And now applying the debugging tips that Linus, Al and others supplied
to me awhile back, I see from GDB that:vfs_ioctl locks the kernel before calling drm_ioctl, and, that
create_proc_entry() has the following new line thanks to Ingo:WARN_ON_ONCE(kernel_locked());
According to Ingo's patch log:
The functions, if called from the BKL, show that the calling site
might have a dependency on the procfs code previously using the BKL
in the dir-entry manipulation functions.I do not really know what that means, so I cc'd Dave Airlie to see if he
has a solution.--
Kevin Winchester--
Out of amusement I took the watchdog drivers and started looking for
large cans of worms in the BKL drop arena.Here is a fun one for general discussion - right now driver probe
functions request resources. We have no ordering on the requests so we
have deadlocks if two drivers do resource requests for conflicting
resources in reverse order."Discuss"
Alan
--
resource requests aren't blocking, so it wouldn't actually be a deadlock.
It would just be a "both failed, try again".That said, two drivers shouldn't be probing the same hardware at the same
time regardless, so I can't imagine that it's much of a problem in real
life.Linus
--
What deadlocks? resource allocation normally doesn't block. So if there's
a ordering issue one of them will fail and should bail out.That said if you have conflicting resources then failing is the correct
behavior anyways.-Andi
--
| Washington Odhiambo | Weird Problem with NAT - more details |
| Greg Kroah-Hartman | [PATCH 001/196] Chinese: Add the known_regression URI to the HOWTO |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Andrew Morton | -mm merge plans for 2.6.23 |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 0/37] dccp: Feature negotiation - last call for comments |
| David Miller | Re: [GIT]: Networking |
| Denys Fedoryshchenko | thousands of classes, e1000 TX unit hang |
