As some of the latency junkies on lkml already know it, commit 8e3e076
("BKL: revert back to the old spinlock implementation") in v2.6.26-rc2
removed the preemptible BKL feature and made the Big Kernel Lock a
spinlock and thus turned it into non-preemptible code again. This commit
returned the BKL code to the 2.6.7 state of affairs in essence.
Linus also indicated that pretty much the only acceptable way to change
this (to us -rt folks rather unfortunate) latency source and to get rid
of this non-preemptible locking complication is to remove the BKL.
This task is not easy at all. 12 years after Linux has been converted to
an SMP OS we still have 1300+ legacy BKL using sites. There are 400+
lock_kernel() critical sections and 800+ ioctls. They are spread out
across rather difficult areas of often legacy code that few people
understand and few people dare to touch.
It takes top people like Alan Cox to map the semantics and to remove BKL
code, and even for Alan (who is doing this for the TTY code) it is a
long and difficult task.
According to my quick & dirty git-log analysis, at the current pace of
BKL removal we'd have to wait more than 10 years to remove most BKL
critical sections from the kernel and to get acceptable latencies again.
The biggest technical complication is that the BKL is unlike any other
lock: it "self-releases" when schedule() is called. This makes the BKL
spinlock very "sticky", "invisible" and viral: it's very easy to add it
to a piece of code (even unknowingly) and you never really know whether
it's held or not. PREEMPT_BKL made it even more invisible, because it
made its effects even less visible to ordinary users.
Furthermore, the BKL is not covered by lockdep, so its dependencies are
largely unknown and invisible, and it is all lost in the haze of the
past ~15 years of code changes. All this has built up to a kind of Fear,
Uncertainty and Doubt about the BKL: nobody really knows it, nobody
really dares to touch it and c...So looking a bit more at your trivial fixups, I'd suggest strongly that they be re-organized a bit. I cherry-picked your tty layer thing, because it was a real fix. The above doesn't even work in general. It depends on having just a single level of locking, and is ugly to boot. So wow about we just expose some version of depth = release_kernel_lock() .. reacquire_kernel_lock(depth); to existing BKL users as a way to safely release and re-aquire it regardless of depth. That makes the code more generic, but it *also* makes I think this one should be changed - that comment says "do not take", but in fact you still do take it, you just release it earlier. So we should just not start out with the kernel locked in the first place. The BKL doesn't do anything for the init sequence anyway, since all of this is for code that runs before there even are any other threads (not counting the idle thread). I don't see anything in there that could *possibly* depend on the kernel ACK. Code that relies on this is broken anyway. We used to have lots of "proc_create()" followed by setup code that could race with "proc_lookup()", but they were fundamentally racy anyway, so And obviously this one I'd never take. It would need to work with a working BKL implementation. Linus --
On Thu, 15 May 2008 10:41:54 -0700 (PDT) can we make this even more specific/restricted? Like having something like call_bkl_unlocked(function_pointer, argument); or something that will internally do the full unlock and then the function call. The last thing we need is another nailgun that BKL using code can use to staple themselves to something big and fast moving. By having a more restricted interface... less likely. Maybe we can even get away with only a drop_bkl_and_schedule(); and nothing else. --
No, that would defeat the whole purpose of the exercise. This drop on schedule property makes it possible to have inverse lock order and not deadlock. That also makes the manual re-acquire on a different level pretty ugly and deadlock prone. The whole purpose of this patch series was to get rid of this exact problem so that the BKL turns into something that resembles a normal lock within the regular locking hierarchy. --
On Thu, 15 May 2008 22:45:55 +0200 I would totally agree with you, except that all these patches effectively do it manually again ANYWAY :( so what I propose is make it explicit drop_bkl_and_schedule() call only, and only do them as a very very last resort. For 99% of the rest it does give exactly the regular benefits you describe. And we can then prioritize these ugly cases to get de-bkl'd first. --
Ok, so I'm obviously happy. This is exactly the kind of thing I would want to see. That said, the way it is now set up, it's unreasonable to merge anything directly, and while I can cherry-pick obvious fixes this way, I do think we could do things better. It should be possible to set things up so that it's a config option, and we can mark it EXPERIMENTAL but still merge it into the standard kernel, so that we'd have the debug stuff there. That would get a lot more coverage, especially if it all still *works*, even if the debug stuff then complains (ie it would be nicer if the lock itself didn't start breaking). So for example, have CONFIG_DEBUG_BKL turn it into a mutex (and select mutex debugging), and get all the debug coverage that way, but then when somebody enters the scheduler with the lock held, first complain, but then auto-release it anyway. That way, bugs get found and complained about, but hopefully the machine still ends up working. Linus --
yeah. This is just a first approximation. It might be v2.6.27 stuff, if hm, we'll got more ideas about other debug helpers, but i dont think warning in the scheduler is realistic or useful: lots and lots of code _does_ reschedule with the BKL held and always did - we never knew this before in a reliable way due to the auto-release. Sleeping locks that purely nest inside the BKL are the norm in the VFS, in the tty code and in most other places - they should be fine and are frequently taken in BKL sections (and frequently produce scheduling there). As the BKL gets pushed inside subsystems, so do inner locks vanish from its scope - and at the final stage it can become a spinlock or mutex, depending on what the actual use is. The main point with the mutex is to make the BKL _stricter_ and more defined - this hurts BKL using code more (see the many fixes that were needed), but it also makes things much more visible and much more fixable IMO. This tree turns the BKL into "just another mutex", with a tiny bit of self-recursion glue on top of it. Btw., often there's potential scheduling at points where BKL using code does not expect it. So this series might also _fix_ some rare races. The fact that this also makes BKL critical sections involuntarily preemptible is a side-effect (which is one of my main motivations to do this whole thing), and it's a pretty much unavoidable side-effect. Also, turning it into a more or less simple mutex with no scheduler smarts at all, it all fits into our "how do we remove a serializing lock" workflow rather well. Even if for some piece of code not much changes in reality, it becomes more familar, less mystic and more trustable to fix and improve. Btw., while i hacked on this today, i _think_ i've got most of the worst problems mapped out already. I needed two fixes to get it to boot to a ssh shell prompt without hanging. I needed 10 more fixes to solve all the dependencies that lockdep found. Another 5 fixes were exp...
Hi Ingo, Must be a typo? Regards, Frederik --
It's a reasonable start, but have you considered doing this work in tree instead? As in just add all the warnings, but don't actually change the semantics yet. I suspect you would get far more users this way and the work would go faster. It would be reasonable to enable this in -mm if it the warnings are not too intrusive (self disable itself etc.) Also for fixing the ioctls I'm not sure that dynamic instrumentation will really work because it would be tough to execute them all. I suspect some variant of static code analysis would make sense for the ioctls. I used to do some auditing with cflow. That won't catch indirect function calls unfortunately, but if there's some way to find those and bail out one could do an automated tool that flags all the ioctls that don't sleep for example (don't have any sleeping functions in the call chain -- this might need some manual annotation, but hopefully not much) Then it would be possible to safely switch those over to a blocking mutex variant of BKL. Now there could be some more automated analysis here: for example the main other user of BKL is character open. I suspect to really make progress here you would also need a open_unlocked() and Hmm, is BKL really that common still that it's a latency problem? The few VFS cases like locks can be fixed without extreme measures. Most of the legacy users are unlikely to be latency problems, simply because only very few people (or nobody) still has that hardware and the code will never run. Also I wouldn't lose sleep over e.g. let ISDN continue using BKL forever. -Andi --
Most of the legacy users inflict that locking on other code - eg the ISN use of the BKL directly impacts on the tty layer work. --
So you just stick unlock_kernel()/lock_kernel() around the call to TTY (or similar to the entry points) -Andi --
... assuming that the ISDN code doesn't assume lock continuity across the TTY call. -hpa --
And procfs and between the tty and the net config code and ... Keeping the BKL just in legacy places doesn't work. A counting mutex (ie one you can self multi-lock) might be very useful to fix some of these however as once we push it down to the point of being a driver specific lock we can just give it a driver mutex --
> Most? It isn't that simple - I've spent a good deal of time working on it. There are lots of paths that rely on interactions between modules. Eg we found stuff racing between the pid structs tty internals and procfs that happened to be saved by the BKL. That in itself is a problem Ingo's stuff won't help with: We have lots of "magic" accidental, undocumented and pot luck BKL locking semantics between subsystems that are not even visible. --
The good news is that I suspect they are going away. It probably is mainly tty and /proc by now, and /proc is pretty close to done. It's hard to have too many inter-module dependencies when most of the core modules no longer even take the kernel lock any more. In the VFS layer, we still have - the ioctl thing, obviously. That's just mind-numbing "move things down", not hard per se. But there's a *lot* of them (and I suspect the huge majority of them don't actually need it, since they'd already be racing against read/write anyway if they did). - default_llseek(). Probably the same, just a lot less of it. - superblock read/write. and the latter one in particular is really dubious (we already have "[un]lock_super()" around it all, I think). The core kernel, VM and networking already don't really do BKL. And it's seldom the case that subsystems interact with other unrelated subsystems outside of the core areas. So it's a lot of work, no doubt, but I do think we should be able to do it. The most mind-numbing part is literally all the ioctl crud. There's more ioctl points than there are lock_kernel() calls left anywhere else. Linus --
Character devices in general. And what's pretty nasty is that some interfaces force BKL still, so - fasync [had some patches for "fasync_locked", not sure if it's worth it] - character device open I tried to recruit kernel janitors some time ago to just do all the ioctl -> ioctl_unlocked/explicit lock_kernel changes. There were a few patches generated but the effort died down then. BTW for ioctl the dynamic instrumentation method proposed also won't work because it's basically impossible to exercise all these ioctls -Andi --
Start at the other end - you can't fix the ioctls until you fix what the ioctls interact with. That ends up at the basic data structure and once you fix those the rest just starts to fall into place. --
In my experience there's usually not too much interaction with other kernel structures at the random driver ioctl level. I'm sure tty is different, but it's probably not typical. --
Take a look at how file->f_flags is locked. Alan --
Ah I posted patches to fix that one a couple of weeks ago, but unfortunately they fell out of -mm again. It was part of the "unlocked fasync" patchkit. -Andi --
yeah. And Alan has a good point: there _is_ lots of magic stuff happening below the BKL. One of them are the BKL <-> other lock dependencies. My stuff helps with mapping that part of the magic: it turns the BKL into an ordinary mutex and thus integrates it into lockdep's existing dependency validation machinery. In other words: this stuff makes BKL validation _stronger_, not weaker, and hence it ultimately helps its mapping and elimination. It turns the "magic" into something more concrete. It might not help with other magic directly - but it helps indirectly, because now the "magic" has shrunk, so there's more attention and more resources available to fix it in the places where the magic hurts. (And suggestions are welcome for more debug helpers to make more magic more visible.) Whenever someone narrows the BKL's scope, that will always have to be done carefully - and that's true of any other lock. This patchset (except perhaps the boot bits) does not narrow the BKL's scope. It will still be no doubt a tough job (reducing/changing locking of _any_ locked path is a tough job), but it will now fit into our existing practices much better and we'll get various reminders from lockdep and the other debug helpers when we forgot about some detail. Before this there was almost zero feedback from the kernel when something around the BKL broke: pretty much the only remainder we had from incorrect BKL elimination were subtle breakages. And my personal experience might matter as well: before this i never dared to touch BKL code. I once removed _all_ BKL locking from all the kernel _by accident_ [i typoed a single line in lib/smp_lock.c] and ran it on my main desktop for about a day and never noticed a thing - until a few weird TTY messages popped up in the syslog... But with this scheme, i felt _much_ more secure about touching BKL code, and kicking the BKL from the scheduler was pure joy i have to say. (even though it will of course remain in the u...
There's also every char device open() method - a rather long list in its own right. I'd be surprised if one in ten of them really needs it, but one has to look... I've been looking at the chrdev code anyway, and pondering on how this might be addressed. Here's some thoughts on alternatives, I'd be curious what people think: 1: We could add an unlocked_open() to the file_operations structure; drivers could be converted over as they are verified not to need the BKL on open. Disadvantages are that it grows this structure for a relatively rare case - most open() calls already don't need the BKL. But it's a relatively easy path without flag days. 2: Create a char_dev_ops structure for char devs and use it instead of file_operations. I vaguely remember seeing Al mutter about that a while back. Quite a while back. This mirrors what was done with block devices, and makes some sense - there's a lot of stuff in struct file_operations which is not really applicable to char devs. Then struct char_dev_ops could have open() and locked_open(), with the latter destined for removal sometime around 2015 or so. Advantages are that it's cleaner and separates out some things which perhaps shouldn't be mixed anyway. Disadvantage is...well...a fair amount of code churn. It would also require chrdev-specific wrappers to map straight file_operations calls in the VFS to the new callbacks. 3: Provide a new form of cdev_add() which lets the driver indicate that the BKL is not needed on open (or anything else?). At a minimum, it could just be a new parameter on cdev_add which has a value of zero or FIXME_I_STILL_NEED_BKL. Still some churn but easier to script and smaller because a lot of drivers are still using register_chrdev() - something else worth fixing. A more involved form might provide a new chardev_add() which takes the new char_dev_ops structure too. Mapping between new and old operations vectors would be don...
1b: add a .locked_open and move all BKL-requiring code to use that. When time comes and BKL is gone, .locked_open can be removed again, Iff you create a new char_dev_ops, don't clutter it with the old stuff. This is the BSD/Solaris tactic, heh :) --
1c: make the BKL-unsafe drivers depend on !SMP && !PREEMPT. --
I don't think there are *that* many. I found only 83 instances of "register_chrdev()" in the kernel, so the open methods should be pretty limited. Of course, some open methods call other sub-registrations, but you'd start off by moving the lock_kernel() down just *one* stage. So it literally should be: - remove one lock_kernel/unlock_kernel pair in fs/char_dev.c - add max 83 pairs in the places that register those things I really don't think it's worth the pain. See above. The numbers aren't that huge, and external modules simply aren't a pressing enough issue. Linus --
There's the drivers calling cdev_add() directly as well - another This is all certainly doable, but it leaves me with one concern: there will be no signal to external module maintainers that the change needs to be made. So, beyond doubt, quite a few of them will just continue to be shipped unfixed - and they will still run. If any of them actually *need* the BKL, something awful may happen to somebody someday. jon --
I now have a large patch and my full x86-32 build tree building without ->ioctl() in file_operations. Its a 350K patch and took all day so I'll begin splitting it out and sending chunks to tree maintainers. Alan --
External modules have bugs because interfaces change. Film at 11. It's true, but it definitely shouldn't keep us from just doing it. Especially since well-maintained external modules (ie the authors follow big discussions like this) can just take the kernel lock regardless of kernel version, since it won't even be broken with old kernels. Of course, well-maintained kernel modules wouldn't depend on the BKL in the first place. Oh, well. Linus --
...and so that's what I've done. My approach was to find every register_chrdev() and cdev_add() call, look at the associated file_operations, then go back to the open() function, if any. Unless it was almost immediately obvious to me that the function was either (1) so trivial as to not require locking (quite few of them are "return 0;"), or (2) clearly doing its own locking, I wrapped the code in the BKL. Finally, I removed the BKL from chrdev_open(). Allmodconfig and allyesconfig makes work here, and this kernel runs on my dual-core desktop. But, clearly, I don't have all of this hardware. Actually, I wonder if some of it still exists outside of museums. So there's probably something stupid in there somewhere. The result is available at: git://git.lwn.net/linux-2.6.git cdev I'll put a shortlog and diffstat at the end of this message. For completeness, there's also a list of files I examined and did *not* change. Assuming nobody tells me I'm completely off-base, I guess my next step is to start running individual patches past maintainers. Some of them, probably (I hope), will tell me that I've been wasting my time and that their code doesn't need the BKL. In such cases, I'll gladly drop the associated patch. But there's a fair amount of stuff here which clearly *does* need it still. If all seems well, maybe this tree should get into linux-next at some point too. Comments? The changes come out to this: arch/cris/arch-v10/drivers/gpio.c | 3 ++ arch/cris/arch-v10/drivers/sync_serial.c | 34 ++++++++++++++++---------- arch/cris/arch-v32/drivers/mach-a3/gpio.c | 4 +++ arch/cris/arch-v32/drivers/mach-fs/gpio.c | 5 +++ arch/cris/arch-v32/drivers/sync_serial.c | 33 +++++++++++++++----------- arch/mips/kernel/rtlx.c | 7 ++++- arch/mips/kernel/vpe.c | 12 +++++++-- arch/mips/sibyte/common/sb_tbprof.c | 25 ++++++++++++++----- arch/sh/boards/landisk/gio.c | 10 +++++-- ...
Btw, Jonathan, would you be willing to maintain some kind of tree of these BKL removal patches? This is different from the work Ingo is doing in the sense that these things should be (a) safe and reasonably obvious and thus (b) presumably ready to be merged in the next merge window. Ingo's BKL debugging tree is likely a good thing to use to find places that need work, but actually removing the BKL from some subsystem is a different issue. (And when I say "safe and reasonably obvious" I obviously don't mean that there can't be bugs. Mistakes happen, and some BKL use might be overly subtle like the issue that Alan pointed out with an empty ->open routine almost accidentally serializing with initialization, but that's why I'd not merge these things after -rc2 anyway, but in the next merge window). Because if you're willing to maintain a BKL-cleanup tree that gets merged into linux-next etc, I'd submit my VFAT/MSDOS BKL removal patch to you. The reason I did that one was that Thomas actually reported that to be a major source of latency problems on one of his embedded systems (80ms latency!), so it would be nice to have that patch in some place where it might get tested. Linus --
Sure, I can do that - as long as people don't mind that committed to being with the in-laws and off the net for the first couple of weeks in June. I'll try to get the first version up shortly, after yet another pass over the chardev pushdown stuff. jon --
May I suggest just adding a comment in those files, just saying something like /* This does not need the BKL, because .. */ where even the "because" part could be dropped when it's really obvious. That way that "list of files I examined and did *not* change" would be obvious in the patch itself, and we also have some documentation that Same deal - just document the fact that the BKL isn't needed. Yeah, in the long run that kind of documentation is worthless and we may want to get rid of it again in a year or two, but in the short run it's a good idea. If only to help people who want to review your patches. Btw, do you have gitweb running anywhere? Linus --
No, I guess I need to figure out how to set it up. Either that or get one of those kernel.org accounts and put things there. Thanks, jon --
Note that the majority of drivers use (grep suggests up to 165 of them) uses misc_register instead of register_chrdev/cdev_add. Your patches are still correct, because you pushed the BKL into the misc_open function, but there is an obvious next step in pushing it further into the misc drivers. There are probably a few more subsystems with minor number specific In your current git tree, this change is no longer the final one, so bisecting the series may cause other bugs. You should probably reorder the patches at some point to avoid this. Arnd <>< --
There's a few intermediate dispatcher levels like this, actually. Lots of video drivers get called behind video_open(), usb drivers from usb_open(), etc. Not much to be done but to push things down one level Bisection is going to be problem regardless - if a problem turns up, it's going to be the chrdev_open() change which gets fingered. I bet, though, that it will be a rare BKL-related problem which is reproducible enough to be easily bisectable. But, yes, I do need to reorganize the patch series once I'm done adding on changes. jon --
I've given it a try for all the misc drivers that have an open() function. The vast majority of them are actually watchdog drivers, all of which register as a misc device by themselves. You seem to already have a script to turn per-file changes into a patch each, so I'm sending you two patches: one for all the watchdog drivers (maybe Wim can take care of that as well) and one for all the other misc drivers (this one needs to be split). Arnd <>< --
OK, it looks like the "misc" misc drivers patch can go into the bkl-removal tree, while the watchdog patches should not. What that means, I guess, is that the final misc_open() patch cannot go in at this point; Alan's watchdog stuff needs to find its way in first. Make Alas, I have no such script. I just committed each change as I made it - each one required individual attention anyway. The misc changes look pretty straightforward, so I could probably hack up such a thing pretty quickly if you don't have a tree with broken out patches. Thanks, jon --
Right, unless Alan or Wim are confident enough that removing the
BKL won't break the drivers (more than they are today).
Almost all of the open functions go along the lines of
int open(struct file *f, struct inode *i)
{
if (wd_is_open)
return -EBUSY;
wd_is_open = 1;
start_wd();
return nonseekable_open(f, i);
}
nonseekable_open doesn't need the BKL by itself, and the wd_is_open
variable is protected by the misc_mtx mutex.
I can't see any scenario in which start_wd() would need the BKL, or
where a watchdog driver needs cycle_kernel_lock(), but I was't confident
enough about that assessment, because I'm not really familiar with
I've done a semi-automated split and applied the patches on top of your
tree. You can pull these from
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/cell-2.6 bkl-removal
(I guess I should do a separate tree for it, will do that if more stuff
comes up.)
Arnd <><
--You need to review the use of misc_register(). Which is what I did already and sorted out for each watchdog - the job is done and completed and the various problem cases fixed. Watchdog has already been made BKL removal safe in the patch series I sent. Alan --
The Big Kernel Lock has been pushed down from chardev_open
to misc_open, this change moves it to the individual misc
driver open functions.
As before, the change was purely mechanical, most drivers
should actually not need the BKL. In particular, we still
hold the misc_mtx() while calling the open() function
The patch should probably be split into one changeset
per driver.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Index: linux-2.6/arch/arm/common/rtctime.c
===================================================================
--- linux-2.6.orig/arch/arm/common/rtctime.c
+++ linux-2.6/arch/arm/common/rtctime.c
@@ -16,6 +16,7 @@
#include <linux/poll.h>
#include <linux/proc_fs.h>
#include <linux/miscdevice.h>
+#include <linux/smp_lock.h>
#include <linux/spinlock.h>
#include <linux/capability.h>
#include <linux/device.h>
@@ -282,6 +283,7 @@ static int rtc_open(struct inode *inode,
{
int ret;
+ lock_kernel();
mutex_lock(&rtc_mutex);
if (rtc_inuse) {
@@ -301,6 +303,7 @@ static int rtc_open(struct inode *inode,
}
}
mutex_unlock(&rtc_mutex);
+ unlock_kernel();
return ret;
}
Index: linux-2.6/arch/blackfin/mach-bf561/coreb.c
===================================================================
--- linux-2.6.orig/arch/blackfin/mach-bf561/coreb.c
+++ linux-2.6/arch/blackfin/mach-bf561/coreb.c
@@ -32,6 +32,7 @@
#include <linux/device.h>
#include <linux/ioport.h>
#include <linux/module.h>
+#include <linux/smp_lock.h>
#include <linux/uaccess.h>
#include <linux/fs.h>
#include <asm/dma.h>
@@ -196,6 +197,7 @@ static loff_t coreb_lseek(struct file *f
static int coreb_open(struct inode *inode, struct file *file)
{
+ lock_kernel();
spin_lock_irq(&coreb_lock);
if (coreb_status & COREB_IS_OPEN)
@@ -204,10 +206,12 @@ static int coreb_open(struct inode *inod
coreb_status |= COREB_IS_OPEN;
spin_unlock_irq(&coreb_...please drop the coreb.c changes from your patch -mike --
At a minimum, I would hope such a request would say something like "I've
looked at the driver's locking and am convinced that the BKL is not
needed." Have you done that? There is a certain leap of faith involved
in removing that protection from a driver.
I decided to take a quick look...
- You use spin_lock_irq(&coreb_lock) in a number of places, but you do
not take the lock in the interrupt handler. You also do not take the
lock in coreb_write() or coreb_read(), so those can race with the
interrupt handler, with ioctl(), and with each other.
- coreb_write() and coreb_read() do interruptible waits, but do not
check to see whether they were interrupted. They will, in fact,
continue in their I/O loops after a signal.
- In both functions you have:
unsigned long p = *ppos;
if (p + count > coreb_size)
return -EFAULT;
that calculation can overflow.
- You also do this:
static ssize_t coreb_write(struct file *file, const char *buf, size_t count,
loff_t * ppos)
/* ... */
set_dma_start_addr(CH_MEM_STREAM2_SRC, (unsigned long)buf);
In other words, the DMA is done directly to/from a user-space
address. Maybe that's safe on Blackfin, I don't know...
- I have no idea why some of your functions are using d_inode->i_mutex.
- In coreb_ioctl():
spin_lock_irq(&coreb_lock);
if (coreb_status & COREB_IS_RUNNING) {
retval = -EBUSY;
break;
}
this will exit the function with the spinlock still held and
interrupts disabled.
case CMD_COREB_RESET:
printk(KERN_INFO "Resetting Core B\n");
bfin_write_SICB_SYSCR(bfin_read_SICB_SYSCR() | 0x0080);
break;
You do not acquire the lock here, so this can race against other
ioctl() calls. And ioctl() can race against read() and write().
Registration and such seem reasonable, so I can't come up with a
scenario where loss of BKL protection will create trouble. Given the
other problems there, though, I'll confess to being a bit nervous about
...the lock is to protect one thing: coreb_status. we lock around any access to it, so it not being grabbed in the irq handler or any other function where coreb_status is not utilized is irrelevant. that means the BKL is not needed in the driver. the rest of your comments are more or less on target, but again irrelevant to the topic of the BKL. i'll keep them in mind when i rewrite the driver, thanks. -mike --
On Tue, 20 May 2008 01:26:28 +0200 Acked-by: Alan Cox <alan@redhat.com> --
this open func already has a spinlock protecting it. doesnt that mean we dont need the bkl in it ? -mike --
The existence of a spinlock is a good sign. But, until somebody has looked at the code and verified that said lock is really protecting everything, it's best to leave the BKL protection (which has always been there, just at a higher level) in place. jon --
if the spinlock doesnt do what it's advertising (preventing mutual access), then the BKL is needed. if there's some UP behavior i'm not aware of, then the BKL is needed. otherwise, the BKL is not needed in this driver. i should prob rewrite this driver anyways ... the open code could easily be replaced with some atomic funcs. -mike --
Since all misc drivers that have an open() function now take the
BKL in there, there is no longer the need to take it in the common
misc_open() function.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Index: linux-2.6/drivers/char/misc.c
===================================================================
--- linux-2.6.orig/drivers/char/misc.c
+++ linux-2.6/drivers/char/misc.c
@@ -50,7 +50,6 @@
#include <linux/device.h>
#include <linux/tty.h>
#include <linux/kmod.h>
-#include <linux/smp_lock.h>
/*
* Head entry for the doubly linked miscdevice list
@@ -121,7 +120,6 @@ static int misc_open(struct inode * inod
int err = -ENODEV;
const struct file_operations *old_fops, *new_fops = NULL;
- lock_kernel();
mutex_lock(&misc_mtx);
list_for_each_entry(c, &misc_list, list) {
@@ -159,7 +157,6 @@ static int misc_open(struct inode * inod
fops_put(old_fops);
fail:
mutex_unlock(&misc_mtx);
- unlock_kernel();
return err;
}
--You have to be careful before assuming but yes - seems sensible. I'm currently munching my way through the watchdog drivers fixing them up for unlocked_ioctl/BKL drops and finding various things needing fixing Definitely --
If they literaly are 'return 0' you can just remove them, as a Even if clearly does it's own locking please add the BKL for now and let the maintainers sort it out later, better be safe then sorry. Except for that thanks a lot, this is the kind of work that's more productive than all these discussions here :) For some reason about 80 instances seem awfully few, but we've move a lot of device into subsystems from beeing plain chardevs so this might actually be correct. --
And here's a patch to do just that: remove all empty chardev
open/release methods. Based on the list compiled by Jonathan.
(and yeah, ip2_ipl_open is not technically empty at the source level,
but only at the binary level :))
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: linux-2.6/arch/cris/arch-v10/drivers/i2c.c
===================================================================
--- linux-2.6.orig/arch/cris/arch-v10/drivers/i2c.c 2008-05-16 17:58:15.000000000 +0200
+++ linux-2.6/arch/cris/arch-v10/drivers/i2c.c 2008-05-16 17:58:19.000000000 +0200
@@ -563,18 +563,6 @@ i2c_readreg(unsigned char theSlave, unsi
return b;
}
-static int
-i2c_open(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
-static int
-i2c_release(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
/* Main device API. ioctl's to write or read to/from i2c registers.
*/
@@ -619,8 +607,6 @@ i2c_ioctl(struct inode *inode, struct fi
static const struct file_operations i2c_fops = {
.owner = THIS_MODULE,
.ioctl = i2c_ioctl,
- .open = i2c_open,
- .release = i2c_release,
};
int __init
Index: linux-2.6/arch/cris/arch-v32/drivers/i2c.c
===================================================================
--- linux-2.6.orig/arch/cris/arch-v32/drivers/i2c.c 2008-05-16 17:57:56.000000000 +0200
+++ linux-2.6/arch/cris/arch-v32/drivers/i2c.c 2008-05-16 17:58:07.000000000 +0200
@@ -633,18 +633,6 @@ i2c_readreg(unsigned char theSlave, unsi
return b;
}
-static int
-i2c_open(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
-static int
-i2c_release(struct inode *inode, struct file *filp)
-{
- return 0;
-}
-
/* Main device API. ioctl's to write or read to/from i2c registers.
*/
@@ -689,8 +677,6 @@ i2c_ioctl(struct inode *inode, struct fi
static const struct file_operations i2c_fops = {
.owner = THIS_MODULE,
.ioctl = i2c_ioctl,
- .open = i2c_open,
- .release = i2c_release,
};
static int __i...Actually it turns out you can introduce bugs doing this when the BKL is
pushed down.
The problem is the methods are not NULL, they (with the lock pushed down
are)
{
lock_kernel();
unlock_kernel();
}
And we have drivers with setup code that does things in the wrong order
but under the BKL. eg one I just fixed did
misc_register()
init locks
allocate memory
do stuff
return 0;
The lock/unlock in the open happens to save your butt against the wrong
order of intialisation because the open cannot occur before the lock is
taken, and thanks to the BKL it cannot make any progress until the setup
is completed. Fun too - udev loves opening things as they appear so in
some cases we might actually trigger them too.
So when you remove the _open() empty methods *please* make sure you have
verified the correctness and ordering of the entire registration path.
I've found three examples of this so far just cleaning up
drivers/watchdog.
Alan
--Hmph. As it turns out, a misc driver will still be OK because the BKL has not (yet) been pushed past misc_open(). What this does mean, though, is that all of those empty and trivial open functions need to be revisited. I thought this looked too easy the first time through... jon --
I think it would be best to make them lock/unlock kernel in the first pass and then work through them. The BKL can be subtle and evil, but as I brought it into the world I guess I must banish it ;) Alan --
> Index: linux-2.6/drivers/char/ip2/ip2main.c Looks fine but please send it via my tty tree so it doesn't collide with the tty work. --
In general when changing semantics drastically you should force compile errors by renaming the respective entry point. That has been the standard Linux method for this for years. I doubt it will be very interesting, but it would be useful. The goal less being to get rid of BKL in old drivers, but not requiring BKL in new drivers. Basically all BKL assumptions in interfaces really should go. -Andi --
No, we really do want to get rid of BKL in old drivers too. Or at least in the interfaces. Linus --
In the interfaces definitely yes and all subsystems should have their own lock_kernel calls, but why in the old drivers? For those it's very unlikely they are used on any SMP system anyways (e.g. anything depending on CONFIG_ISA) or if they do only on 2 CPU systems. Of course if you can find someone to do the work it wouldn't be bad, just wouldn't seem like a particularly useful investment of time to me. Also it would be bad if the people who did such conversions didn't actually test it and that's a great danger with many old drivers because nearly nobody has the hardware (and if they do it won't be in a SMP system) -Andi --
Because - You need to verify the locking assumptions that remain are entirely driver internal - At the point you achieve that you've done *ALL* the work required to add a driver specific lock --
>>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes: Andi> In the interfaces definitely yes and all subsystems should have Andi> their own lock_kernel calls, but why in the old drivers? For Andi> those it's very unlikely they are used on any SMP system anyways Andi> (e.g. anything depending on CONFIG_ISA) or if they do only on 2 Andi> CPU systems. I'm still running an SMP server with ISA slots. I'd love to contribute testing and possibly coding to this effort, but realisticlly I'll be able to compile and boot stuff. John --
I do too (although one CPU has died recently), but how many ISA devices do you use in it? Mine used to have a ISA ISDN card, but that was it and then no ISA anymore even though the slots are still in there. Also on 2 CPU systems BKL is not that critical anyways. It only starts to hurt on larger CPU counts. -Andi --
>>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes: Andi> In the interfaces definitely yes and all subsystems should have Andi> their own lock_kernel calls, but why in the old drivers? For Andi> those it's very unlikely they are used on any SMP system anyways Andi> (e.g. anything depending on CONFIG_ISA) or if they do only on 2 Andi> I do too (although one CPU has died recently), but how many ISA Andi> devices do you use in it? Mine used to have a ISA ISDN card, but Andi> that was it and then no ISA anymore even though the slots are Andi> still in there. I must admit I used to have an ISA Cyclades 8 port serial card running in there, but now it's all PCI stuff. It only had one ISA slot. So yes, ISA SMP boxes are slowly dying, but they'll be around for a long time to come. Andi> Also on 2 CPU systems BKL is not that critical anyways. It only Andi> starts to hurt on larger CPU counts. True, but with the growing number of multicore systems, esp on desktops, it's going to be an issue. How will the BKL work on quad core boxes? I'm going to shutup
