Re: [PATCH] kill empty chardev open/release methods

Previous thread: [GIT PATCH] driver core fixes against 2.6.26-rc2 by Greg KH on Wednesday, May 14, 2008 - 10:49 am. (3 messages)

Next thread: [PATCH] Remove final references to deprecated, unreferenced TOPDIR. by Robert P. J. Day on Wednesday, May 14, 2008 - 11:25 am. (1 message)
From: Ingo Molnar
Date: Wednesday, May 14, 2008 - 10:49 am

As some of the latency junkies on lkml already know it, commit 8e3e076 
("BKL: revert back to the old spinlock implementation") in v2.6.26-rc2 
removed the preemptible BKL feature and made the Big Kernel Lock a 
spinlock and thus turned it into non-preemptible code again. This commit 
returned the BKL code to the 2.6.7 state of affairs in essence.

Linus also indicated that pretty much the only acceptable way to change 
this (to us -rt folks rather unfortunate) latency source and to get rid 
of this non-preemptible locking complication is to remove the BKL.

This task is not easy at all. 12 years after Linux has been converted to 
an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ 
lock_kernel() critical sections and 800+ ioctls. They are spread out 
across rather difficult areas of often legacy code that few people 
understand and few people dare to touch.

It takes top people like Alan Cox to map the semantics and to remove BKL 
code, and even for Alan (who is doing this for the TTY code) it is a 
long and difficult task.

According to my quick & dirty git-log analysis, at the current pace of 
BKL removal we'd have to wait more than 10 years to remove most BKL 
critical sections from the kernel and to get acceptable latencies again.
 
The biggest technical complication is that the BKL is unlike any other 
lock: it "self-releases" when schedule() is called. This makes the BKL 
spinlock very "sticky", "invisible" and viral: it's very easy to add it 
to a piece of code (even unknowingly) and you never really know whether 
it's held or not. PREEMPT_BKL made it even more invisible, because it 
made its effects even less visible to ordinary users.

Furthermore, the BKL is not covered by lockdep, so its dependencies are 
largely unknown and invisible, and it is all lost in the haze of the 
past ~15 years of code changes. All this has built up to a kind of Fear, 
Uncertainty and Doubt about the BKL: nobody really knows it, nobody 
really dares to touch it and ...
From: Alan Cox
Date: Wednesday, May 14, 2008 - 2:46 pm

Out of amusement I took the watchdog drivers and started looking for
large cans of worms in the BKL drop arena.

Here is a fun one for general discussion - right now driver probe
functions request resources. We have no ordering on the requests so we
have deadlocks if two drivers do resource requests for conflicting
resources in reverse order.

"Discuss"

Alan
--

From: Andi Kleen
Date: Wednesday, May 14, 2008 - 3:15 pm

What deadlocks? resource allocation normally doesn't block. So if there's
a ordering issue one of them will fail and should bail out.

That said if you have conflicting resources then failing is the correct
behavior anyways.

-Andi
--

From: Linus Torvalds
Date: Wednesday, May 14, 2008 - 3:11 pm

resource requests aren't blocking, so it wouldn't actually be a deadlock. 
It would just be a "both failed, try again".

That said, two drivers shouldn't be probing the same hardware at the same 
time regardless, so I can't imagine that it's much of a problem in real 
life. 

		Linus
--

From: Kevin Winchester
Date: Friday, May 16, 2008 - 5:14 pm

Ingo Molnar wrote:

<snip description and patches>

I decided to give this tree a try, and I got:


[4294034.386085] ------------[ cut here ]------------
[4294034.387882] WARNING: at fs/proc/generic.c:669 
create_proc_entry+0x3d/0xc5()
[4294034.390059] Pid: 2565, comm: Xorg Not tainted 
2.6.26-rc2-00456-gd9df34e #35
[4294034.392682]
[4294034.392683] Call Trace:
[4294034.394071]  [<ffffffff8022a8ac>] warn_on_slowpath+0x53/0x81
[4294034.394077]  [<ffffffff802b57b4>] ? proc_register+0xf7/0x162
[4294034.394081]  [<ffffffff802b580b>] ? proc_register+0x14e/0x162
[4294034.394087]  [<ffffffff804afa05>] ? _spin_unlock+0x30/0x4b
[4294034.394091]  [<ffffffff802b580b>] ? proc_register+0x14e/0x162
[4294034.394095]  [<ffffffff8021a127>] ? startup_ioapic_irq+0x54/0x5f
[4294034.394099]  [<ffffffff802b5fcc>] create_proc_entry+0x3d/0xc5
[4294034.394103]  [<ffffffff80252008>] register_irq_proc+0x84/0xa0
[4294034.394108]  [<ffffffff80250b1a>] setup_irq+0x1b2/0x21b
[4294034.394113]  [<ffffffff80250cc8>] request_irq+0xf1/0x117
[4294034.394117]  [<ffffffff8038aaaa>] ? radeon_driver_irq_handler+0x0/0x7e
[4294034.394122]  [<ffffffff803789b6>] ? drm_control+0x0/0x186
[4294034.394126]  [<ffffffff80378acf>] drm_control+0x119/0x186
[4294034.394130]  [<ffffffff803770be>] drm_ioctl+0x1d3/0x265
[4294034.394589]  [<ffffffff80283e5e>] vfs_ioctl+0x5e/0x77
[4294034.394593]  [<ffffffff802840d2>] do_vfs_ioctl+0x25b/0x270
[4294034.394598]  [<ffffffff804af23e>] ? trace_hardirqs_on_thunk+0x35/0x3a
[4294034.394601]  [<ffffffff80284129>] sys_ioctl+0x42/0x65
[4294034.394606]  [<ffffffff8020b2eb>] system_call_after_swapgs+0x7b/0x80
[4294034.411597]
[4294034.411601] ---[ end trace 7f52164e4c2b9927 ]---

I have no idea if that is even related to the BKL or not, I haven't even 
opened the source file yet, but I figured I'd report it.

-- 
Kevin Winchester
--

From: Kevin Winchester
Date: Friday, May 16, 2008 - 5:37 pm

And now applying the debugging tips that Linus, Al and others supplied 
to me awhile back, I see from GDB that:

vfs_ioctl locks the kernel before calling drm_ioctl, and, that 
create_proc_entry() has the following new line thanks to Ingo:

     WARN_ON_ONCE(kernel_locked());

According to Ingo's patch log:

     The functions, if called from the BKL, show that the calling site
     might have a dependency on the procfs code previously using the BKL
     in the dir-entry manipulation functions.

I do not really know what that means, so I cc'd Dave Airlie to see if he 
has a solution.

-- 
Kevin Winchester


--

From: Jonathan Corbet
Date: Wednesday, May 14, 2008 - 2:45 pm

There's also every char device open() method - a rather long list in its
own right.  I'd be surprised if one in ten of them really needs it, but
one has to look...

I've been looking at the chrdev code anyway, and pondering on how this
might be addressed.  Here's some thoughts on alternatives, I'd be
curious what people think:

1: We could add an unlocked_open() to the file_operations structure;
   drivers could be converted over as they are verified not to need the
   BKL on open.  Disadvantages are that it grows this structure for a
   relatively rare case - most open() calls already don't need the BKL.
   But it's a relatively easy path without flag days.

2: Create a char_dev_ops structure for char devs and use it instead of
   file_operations.  I vaguely remember seeing Al mutter about that a
   while back.  Quite a while back.  This mirrors what was done with
   block devices, and makes some sense - there's a lot of stuff in
   struct file_operations which is not really applicable to char devs.
   Then struct char_dev_ops could have open() and locked_open(), with
   the latter destined for removal sometime around 2015 or so.

   Advantages are that it's cleaner and separates out some things which
   perhaps shouldn't be mixed anyway.  Disadvantage is...well...a fair
   amount of code churn.  It would also require chrdev-specific wrappers
   to map straight file_operations calls in the VFS to the new
   callbacks.

3: Provide a new form of cdev_add() which lets the driver indicate
   that the BKL is not needed on open (or anything else?).  At a
   minimum, it could just be a new parameter on cdev_add which has a
   value of zero or FIXME_I_STILL_NEED_BKL.  Still some churn but easier
   to script and smaller because a lot of drivers are still using
   register_chrdev() - something else worth fixing.

   A more involved form might provide a new chardev_add() which takes
   the new char_dev_ops structure too.  Mapping between new and old
   operations vectors would be ...
From: Alan Cox
Date: Wednesday, May 14, 2008 - 2:39 pm

Its beginning to sound like should start 2.7 ;)

Alan

--

From: Andi Kleen
Date: Wednesday, May 14, 2008 - 3:11 pm

In general when changing semantics drastically you should force
compile errors by renaming the respective entry point. That
has been the standard Linux method for this for years.


I doubt it will be very interesting, but it would be useful.
The goal less being to get rid of BKL in old drivers, but not 
requiring BKL in new drivers. Basically all BKL assumptions
in interfaces really should go.

-Andi
--

From: Linus Torvalds
Date: Wednesday, May 14, 2008 - 3:16 pm

No, we really do want to get rid of BKL in old drivers too. Or at least in 
the interfaces.

		Linus
--

From: Andi Kleen
Date: Wednesday, May 14, 2008 - 3:21 pm

In the interfaces definitely yes and all subsystems should have their
own lock_kernel calls, but why in the old drivers? For those it's very
unlikely they are used on any SMP system anyways (e.g. anything
depending on CONFIG_ISA) or if they do only on 2 CPU systems.

Of course if you can find someone to do the work it wouldn't be
bad, just wouldn't seem like a particularly useful investment of time to
me.

Also it would be bad if the people who did such conversions didn't
actually test it and that's a great danger with many old drivers because
nearly nobody has the hardware (and if they do it won't be in a SMP system)

-Andi

--

From: John Stoffel
Date: Thursday, May 15, 2008 - 8:05 am

>>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes:


Andi> In the interfaces definitely yes and all subsystems should have
Andi> their own lock_kernel calls, but why in the old drivers? For
Andi> those it's very unlikely they are used on any SMP system anyways
Andi> (e.g. anything depending on CONFIG_ISA) or if they do only on 2
Andi> CPU systems.

I'm still running an SMP server with ISA slots.  I'd love to
contribute testing and possibly coding to this effort, but
realisticlly I'll be able to compile and boot stuff.

John
--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 8:10 am

I do too (although one CPU has died recently), but how many ISA devices
do you use in it? Mine used to have a ISA ISDN card, but that was it
and then no ISA anymore even though the slots are still in there.

Also on 2 CPU systems BKL is not that critical anyways. It only starts
to hurt on larger CPU counts.

-Andi

--

From: John Stoffel
Date: Thursday, May 15, 2008 - 8:18 am

>>>>> "Andi" == Andi Kleen <andi@firstfloor.org> writes:

Andi> In the interfaces definitely yes and all subsystems should have
Andi> their own lock_kernel calls, but why in the old drivers? For
Andi> those it's very unlikely they are used on any SMP system anyways
Andi> (e.g. anything depending on CONFIG_ISA) or if they do only on 2

Andi> I do too (although one CPU has died recently), but how many ISA
Andi> devices do you use in it? Mine used to have a ISA ISDN card, but
Andi> that was it and then no ISA anymore even though the slots are
Andi> still in there.

I must admit I used to have an ISA Cyclades 8 port serial card running
in there, but now it's all PCI stuff.  It only had one ISA slot.  So
yes, ISA SMP boxes are slowly dying, but they'll be around for a long
time to come.

Andi> Also on 2 CPU systems BKL is not that critical anyways. It only
Andi> starts to hurt on larger CPU counts.

True, but with the growing number of multicore systems, esp on
desktops, it's going to be an issue.  How will the BKL work on quad
core boxes?

I'm going to shutup now,  since I don't have the knowledge to really
contribute much to the discussion.  

John



--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 8:45 am

My point was that for those few users it's actually better to keep
the BKL. If you try to remove it on some old driver you cannot test
due to lack of hardware the risk of breaking the driver is higher than the 
gain you would get from removing it.  The best you can do in this
legacy code is to keep it running with minimal changes in the old

Multicore systems don't have ISA slots.

[yes I'm sure someone will tell me now about the ISA-over-USB device
that exists. Don't bother, it doesn't add anything to the point]

-Andi
--

From: Alan Cox
Date: Thursday, May 15, 2008 - 6:30 am

Because

- You need to verify the locking assumptions that remain are entirely
driver internal
- At the point you achieve that you've done *ALL* the work required to
add a driver specific lock

--

From: Linus Torvalds
Date: Wednesday, May 14, 2008 - 2:56 pm

I don't think there are *that* many. I found only 83 instances of 
"register_chrdev()" in the kernel, so the open methods should be pretty 
limited.

Of course, some open methods call other sub-registrations, but you'd start 
off by moving the lock_kernel() down just *one* stage. 

So it literally should be:
 - remove one lock_kernel/unlock_kernel pair in fs/char_dev.c
 - add max 83 pairs in the places that register those things

I really don't think it's worth the pain. See above. The numbers aren't 
that huge, and external modules simply aren't a pressing enough issue.

		Linus
--

From: Jonathan Corbet
Date: Friday, May 16, 2008 - 8:44 am

...and so that's what I've done.  My approach was to find every
register_chrdev() and cdev_add() call, look at the associated
file_operations, then go back to the open() function, if any.  Unless it
was almost immediately obvious to me that the function was either (1) so
trivial as to not require locking (quite few of them are "return 0;"), or
(2) clearly doing its own locking, I wrapped the code in the BKL.

Finally, I removed the BKL from chrdev_open().

Allmodconfig and allyesconfig makes work here, and this kernel runs on my
dual-core desktop.  But, clearly, I don't have all of this hardware.
Actually, I wonder if some of it still exists outside of museums.  So
there's probably something stupid in there somewhere.

The result is available at:

	git://git.lwn.net/linux-2.6.git cdev

I'll put a shortlog and diffstat at the end of this message.  For
completeness, there's also a list of files I examined and did *not* change.

Assuming nobody tells me I'm completely off-base, I guess my next step is
to start running individual patches past maintainers.  Some of them,
probably (I hope), will tell me that I've been wasting my time and that
their code doesn't need the BKL.  In such cases, I'll gladly drop the
associated patch.  But there's a fair amount of stuff here which clearly
*does* need it still.  If all seems well, maybe this tree should get into
linux-next at some point too.

Comments?

The changes come out to this:

 arch/cris/arch-v10/drivers/gpio.c         |    3 ++
 arch/cris/arch-v10/drivers/sync_serial.c  |   34 ++++++++++++++++----------
 arch/cris/arch-v32/drivers/mach-a3/gpio.c |    4 +++
 arch/cris/arch-v32/drivers/mach-fs/gpio.c |    5 +++
 arch/cris/arch-v32/drivers/sync_serial.c  |   33 +++++++++++++++-----------
 arch/mips/kernel/rtlx.c                   |    7 ++++-
 arch/mips/kernel/vpe.c                    |   12 +++++++--
 arch/mips/sibyte/common/sb_tbprof.c       |   25 ++++++++++++++-----
 arch/sh/boards/landisk/gio.c              |   10 ...
From: Christoph Hellwig
Date: Friday, May 16, 2008 - 8:49 am

If they literaly are 'return 0' you can just remove them, as a

Even if clearly does it's own locking please add the BKL for now and let
the maintainers sort it out later, better be safe then sorry.

Except for that thanks a lot, this is the kind of work that's more
productive than all these discussions here :)

For some reason about 80 instances seem awfully few, but we've move a
lot of device into subsystems from beeing plain chardevs so this
might actually be correct.
--

From: Christoph Hellwig
Date: Friday, May 16, 2008 - 9:03 am

And here's a patch to do just that:  remove all empty chardev
open/release methods.  Based on the list compiled by Jonathan.

(and yeah, ip2_ipl_open is not technically empty at the source level,
 but only at the binary level :))


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/arch/cris/arch-v10/drivers/i2c.c
===================================================================
--- linux-2.6.orig/arch/cris/arch-v10/drivers/i2c.c	2008-05-16 17:58:15.000000000 +0200
+++ linux-2.6/arch/cris/arch-v10/drivers/i2c.c	2008-05-16 17:58:19.000000000 +0200
@@ -563,18 +563,6 @@ i2c_readreg(unsigned char theSlave, unsi
 	return b;
 }
 
-static int
-i2c_open(struct inode *inode, struct file *filp)
-{
-	return 0;
-}
-
-static int
-i2c_release(struct inode *inode, struct file *filp)
-{
-	return 0;
-}
-
 /* Main device API. ioctl's to write or read to/from i2c registers.
  */
 
@@ -619,8 +607,6 @@ i2c_ioctl(struct inode *inode, struct fi
 static const struct file_operations i2c_fops = {
 	.owner    = THIS_MODULE,
 	.ioctl    = i2c_ioctl,
-	.open     = i2c_open,
-	.release  = i2c_release,
 };
 
 int __init
Index: linux-2.6/arch/cris/arch-v32/drivers/i2c.c
===================================================================
--- linux-2.6.orig/arch/cris/arch-v32/drivers/i2c.c	2008-05-16 17:57:56.000000000 +0200
+++ linux-2.6/arch/cris/arch-v32/drivers/i2c.c	2008-05-16 17:58:07.000000000 +0200
@@ -633,18 +633,6 @@ i2c_readreg(unsigned char theSlave, unsi
 	return b;
 }
 
-static int
-i2c_open(struct inode *inode, struct file *filp)
-{
-	return 0;
-}
-
-static int
-i2c_release(struct inode *inode, struct file *filp)
-{
-	return 0;
-}
-
 /* Main device API. ioctl's to write or read to/from i2c registers.
  */
 
@@ -689,8 +677,6 @@ i2c_ioctl(struct inode *inode, struct fi
 static const struct file_operations i2c_fops = {
 	.owner =    THIS_MODULE,
 	.ioctl =    i2c_ioctl,
-	.open =     i2c_open,
-	.release =  i2c_release,
 };
 
 static int ...
From: Alan Cox
Date: Friday, May 16, 2008 - 9:24 am

> Index: linux-2.6/drivers/char/ip2/ip2main.c

Looks fine but please send it via my tty tree so it doesn't collide with
the tty work.

--

From: Alan Cox
Date: Friday, May 16, 2008 - 1:55 pm

Actually it turns out you can introduce bugs doing this when the BKL is
pushed down.

The problem is the methods are not NULL, they (with the lock pushed down
are)

{
	lock_kernel();
	unlock_kernel();
}

And we have drivers with setup code that does things in the wrong order
but under the BKL. eg one I just fixed did

	misc_register()
	init locks
	allocate memory
	do stuff
	return 0;

The lock/unlock in the open happens to save your butt against the wrong
order of intialisation because the open cannot occur before the lock is
taken, and thanks to the BKL it cannot make any progress until the setup
is completed. Fun too - udev loves opening things as they appear so in
some cases we might actually trigger them too.

So when you remove the _open() empty methods *please* make sure you have
verified the correctness and ordering of the entire registration path.
I've found three examples of this so far just cleaning up
drivers/watchdog.

Alan
--

From: Jonathan Corbet
Date: Sunday, May 18, 2008 - 12:46 pm

Hmph.

As it turns out, a misc driver will still be OK because the BKL has not
(yet) been pushed past misc_open().  What this does mean, though, is
that all of those empty and trivial open functions need to be
revisited.  I thought this looked too easy the first time through...

jon
--

From: Alan Cox
Date: Sunday, May 18, 2008 - 12:58 pm

I think it would be best to make them lock/unlock kernel in the first
pass and then work through them. The BKL can be subtle and evil, but as I
brought it into the world I guess I must banish it ;)


Alan
--

From: Alan Cox
Date: Friday, May 16, 2008 - 9:22 am

You have to be careful before assuming but yes - seems sensible.

I'm currently munching my way through the watchdog drivers fixing them up
for unlocked_ioctl/BKL drops and finding various things needing fixing

Definitely
--

From: Arnd Bergmann
Date: Saturday, May 17, 2008 - 2:15 pm

Note that the majority of drivers use (grep suggests up to 165
of them) uses misc_register instead of register_chrdev/cdev_add.
Your patches are still correct, because you pushed the BKL into the
misc_open function, but there is an obvious next step in pushing
it further into the misc drivers.
There are probably a few more subsystems with minor number specific

In your current git tree, this change is no longer the final one, so
bisecting the series may cause other bugs. You should probably reorder
the patches at some point to avoid this.

	Arnd <><
--

From: Jonathan Corbet
Date: Sunday, May 18, 2008 - 1:26 pm

There's a few intermediate dispatcher levels like this, actually.
Lots of video drivers get called behind video_open(), usb drivers from
usb_open(), etc.  Not much to be done but to push things down one level

Bisection is going to be problem regardless - if a problem turns up,
it's going to be the chrdev_open() change which gets fingered.  I bet,
though, that it will be a rare BKL-related problem which is reproducible
enough to be easily bisectable.

But, yes, I do need to reorganize the patch series once I'm done adding
on changes.

jon
--

From: Arnd Bergmann
Date: Monday, May 19, 2008 - 4:07 pm

I've given it a try for all the misc drivers that have an open() function.
The vast majority of them are actually watchdog drivers, all of which
register as a misc device by themselves. You seem to already have a script
to turn per-file changes into a patch each, so I'm sending you two patches:
one for all the watchdog drivers (maybe Wim can take care of that as well)
and one for all the other misc drivers (this one needs to be split).

	Arnd <><

--

From: Arnd Bergmann
Date: Monday, May 19, 2008 - 4:34 pm

Since all misc drivers that have an open() function now take the
BKL in there, there is no longer the need to take it in the common
misc_open() function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

---
Index: linux-2.6/drivers/char/misc.c
===================================================================
--- linux-2.6.orig/drivers/char/misc.c
+++ linux-2.6/drivers/char/misc.c
@@ -50,7 +50,6 @@
 #include <linux/device.h>
 #include <linux/tty.h>
 #include <linux/kmod.h>
-#include <linux/smp_lock.h>
 
 /*
  * Head entry for the doubly linked miscdevice list
@@ -121,7 +120,6 @@ static int misc_open(struct inode * inod
 	int err = -ENODEV;
 	const struct file_operations *old_fops, *new_fops = NULL;
 	
-	lock_kernel();
 	mutex_lock(&misc_mtx);
 	
 	list_for_each_entry(c, &misc_list, list) {
@@ -159,7 +157,6 @@ static int misc_open(struct inode * inod
 	fops_put(old_fops);
 fail:
 	mutex_unlock(&misc_mtx);
-	unlock_kernel();
 	return err;
 }
 

--

From: Arnd Bergmann
Date: Monday, May 19, 2008 - 4:26 pm

The Big Kernel Lock has been pushed down from chardev_open
to misc_open, this change moves it to the individual misc
driver open functions.

As before, the change was purely mechanical, most drivers
should actually not need the BKL. In particular, we still
hold the misc_mtx() while calling the open() function
The patch should probably be split into one changeset
per driver.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>

---
Index: linux-2.6/arch/arm/common/rtctime.c
===================================================================
--- linux-2.6.orig/arch/arm/common/rtctime.c
+++ linux-2.6/arch/arm/common/rtctime.c
@@ -16,6 +16,7 @@
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
 #include <linux/miscdevice.h>
+#include <linux/smp_lock.h>
 #include <linux/spinlock.h>
 #include <linux/capability.h>
 #include <linux/device.h>
@@ -282,6 +283,7 @@ static int rtc_open(struct inode *inode,
 {
 	int ret;
 
+	lock_kernel();
 	mutex_lock(&rtc_mutex);
 
 	if (rtc_inuse) {
@@ -301,6 +303,7 @@ static int rtc_open(struct inode *inode,
 		}
 	}
 	mutex_unlock(&rtc_mutex);
+	unlock_kernel();
 
 	return ret;
 }
Index: linux-2.6/arch/blackfin/mach-bf561/coreb.c
===================================================================
--- linux-2.6.orig/arch/blackfin/mach-bf561/coreb.c
+++ linux-2.6/arch/blackfin/mach-bf561/coreb.c
@@ -32,6 +32,7 @@
 #include <linux/device.h>
 #include <linux/ioport.h>
 #include <linux/module.h>
+#include <linux/smp_lock.h>
 #include <linux/uaccess.h>
 #include <linux/fs.h>
 #include <asm/dma.h>
@@ -196,6 +197,7 @@ static loff_t coreb_lseek(struct file *f
 
 static int coreb_open(struct inode *inode, struct file *file)
 {
+	lock_kernel();
 	spin_lock_irq(&coreb_lock);
 
 	if (coreb_status & COREB_IS_OPEN)
@@ -204,10 +206,12 @@ static int coreb_open(struct inode *inod
 	coreb_status |= COREB_IS_OPEN;
 
 	spin_unlock_irq(&coreb_lock);
+	unlock_kernel();
 	return 0;
 
  out_busy:
 ...
From: Mike Frysinger
Date: Monday, May 19, 2008 - 5:07 pm

this open func already has a spinlock protecting it.  doesnt that mean
we dont need the bkl in it ?
-mike
--

From: Jonathan Corbet
Date: Monday, May 19, 2008 - 5:21 pm

The existence of a spinlock is a good sign.  But, until somebody has
looked at the code and verified that said lock is really protecting
everything, it's best to leave the BKL protection (which has always been
there, just at a higher level) in place.

jon
--

From: Mike Frysinger
Date: Monday, May 19, 2008 - 5:46 pm

if the spinlock doesnt do what it's advertising (preventing mutual
access), then the BKL is needed.  if there's some UP behavior i'm not
aware of, then the BKL is needed.  otherwise, the BKL is not needed in
this driver.

i should prob rewrite this driver anyways ... the open code could
easily be replaced with some atomic funcs.
-mike
--

From: Alan Cox
Date: Tuesday, May 20, 2008 - 1:46 am

On Tue, 20 May 2008 01:26:28 +0200

Acked-by: Alan Cox <alan@redhat.com>
--

From: Mike Frysinger
Date: Tuesday, May 20, 2008 - 4:01 pm

please drop the coreb.c changes from your patch
-mike
--

From: Jonathan Corbet
Date: Tuesday, May 20, 2008 - 4:25 pm

At a minimum, I would hope such a request would say something like "I've
looked at the driver's locking and am convinced that the BKL is not
needed."  Have you done that?  There is a certain leap of faith involved
in removing that protection from a driver.

I decided to take a quick look...

- You use spin_lock_irq(&coreb_lock) in a number of places, but you do
  not take the lock in the interrupt handler.  You also do not take the
  lock in coreb_write() or coreb_read(), so those can race with the
  interrupt handler, with ioctl(), and with each other.

- coreb_write() and coreb_read() do interruptible waits, but do not
  check to see whether they were interrupted.  They will, in fact,
  continue in their I/O loops after a signal.

- In both functions you have:

	unsigned long p = *ppos;

	if (p + count > coreb_size)
		return -EFAULT;

  that calculation can overflow.

- You also do this:

  static ssize_t coreb_write(struct file *file, const char *buf, size_t count,
	 		     loff_t * ppos)
  /* ... */
  		set_dma_start_addr(CH_MEM_STREAM2_SRC, (unsigned long)buf);

  In other words, the DMA is done directly to/from a user-space
  address.  Maybe that's safe on Blackfin, I don't know...

- I have no idea why some of your functions are using d_inode->i_mutex.

- In coreb_ioctl():

		spin_lock_irq(&coreb_lock);
		if (coreb_status & COREB_IS_RUNNING) {
			retval = -EBUSY;
			break;
		}

  this will exit the function with the spinlock still held and
  interrupts disabled.

	case CMD_COREB_RESET:
		printk(KERN_INFO "Resetting Core B\n");
		bfin_write_SICB_SYSCR(bfin_read_SICB_SYSCR() | 0x0080);
		break;

  You do not acquire the lock here, so this can race against other
  ioctl() calls.  And ioctl() can race against read() and write().

Registration and such seem reasonable, so I can't come up with a
scenario where loss of BKL protection will create trouble.  Given the
other problems there, though, I'll confess to being a bit nervous ...
From: Mike Frysinger
Date: Wednesday, May 21, 2008 - 9:22 am

the lock is to protect one thing: coreb_status.  we lock around any
access to it, so it not being grabbed in the irq handler or any other
function where coreb_status is not utilized is irrelevant.  that means
the BKL is not needed in the driver.

the rest of your comments are more or less on target, but again
irrelevant to the topic of the BKL.  i'll keep them in mind when i
rewrite the driver, thanks.
-mike
--

From: Jonathan Corbet
Date: Tuesday, May 20, 2008 - 8:13 am

OK, it looks like the "misc" misc drivers patch can go into the
bkl-removal tree, while the watchdog patches should not.  What that
means, I guess, is that the final misc_open() patch cannot go in at this
point; Alan's watchdog stuff needs to find its way in first.  Make

Alas, I have no such script.  I just committed each change as I made it
- each one required individual attention anyway.  The misc changes look
pretty straightforward, so I could probably hack up such a thing pretty
quickly if you don't have a tree with broken out patches.

Thanks,

jon
--

From: Arnd Bergmann
Date: Tuesday, May 20, 2008 - 10:21 am

Right, unless Alan or Wim are confident enough that removing the
BKL won't break the drivers (more than they are today).
Almost all of the open functions go along the lines of

int open(struct file *f, struct inode *i)
{
	if (wd_is_open)
		return -EBUSY;
	wd_is_open = 1;
	
	start_wd();

	return nonseekable_open(f, i);
}

nonseekable_open doesn't need the BKL by itself, and the wd_is_open
variable is protected by the misc_mtx mutex.
I can't see any scenario in which start_wd() would need the BKL, or
where a watchdog driver needs cycle_kernel_lock(), but I was't confident
enough about that assessment, because I'm not really familiar with

I've done a semi-automated split and applied the patches on top of your
tree. You can pull these from

git://git.kernel.org/pub/scm/linux/kernel/git/arnd/cell-2.6 bkl-removal

(I guess I should do a separate tree for it, will do that if more stuff
comes up.)

	Arnd <><
--

From: Alan Cox
Date: Tuesday, May 20, 2008 - 11:51 am

You need to review the use of misc_register(). Which is what I did
already and sorted out for each watchdog - the job is done and completed
and the various problem cases fixed. Watchdog has already been made BKL
removal safe in the patch series I sent.

Alan
--

From: Linus Torvalds
Date: Friday, May 16, 2008 - 9:30 am

May I suggest just adding a comment in those files, just saying something 
like

	/* This does not need the BKL, because .. */

where even the "because" part could be dropped when it's really obvious.

That way that "list of files I examined and did *not* change" would be 
obvious in the patch itself, and we also have some documentation that 

Same deal - just document the fact that the BKL isn't needed.

Yeah, in the long run that kind of documentation is worthless and we may 
want to get rid of it again in a year or two, but in the short run it's a 
good idea. If only to help people who want to review your patches.

Btw, do you have gitweb running anywhere? 

		Linus
--

From: Jonathan Corbet
Date: Friday, May 16, 2008 - 9:43 am

No, I guess I need to figure out how to set it up.  Either that or get
one of those kernel.org accounts and put things there.

Thanks,

jon
--

From: Linus Torvalds
Date: Saturday, May 17, 2008 - 2:58 pm

Btw, Jonathan, would you be willing to maintain some kind of tree of these 
BKL removal patches? 

This is different from the work Ingo is doing in the sense that these 
things should be (a) safe and reasonably obvious and thus (b) presumably 
ready to be merged in the next merge window. Ingo's BKL debugging tree is 
likely a good thing to use to find places that need work, but actually 
removing the BKL from some subsystem is a different issue.

(And when I say "safe and reasonably obvious" I obviously don't mean that 
there can't be bugs. Mistakes happen, and some BKL use might be overly 
subtle like the issue that Alan pointed out with an empty ->open routine 
almost accidentally serializing with initialization, but that's why I'd 
not merge these things after -rc2 anyway, but in the next merge window).

Because if you're willing to maintain a BKL-cleanup tree that gets merged 
into linux-next etc, I'd submit my VFAT/MSDOS BKL removal patch to you. 
The reason I did that one was that Thomas actually reported that to be a 
major source of latency problems on one of his embedded systems (80ms 
latency!), so it would be nice to have that patch in some place where it 
might get tested.

		Linus
--

From: Jonathan Corbet
Date: Sunday, May 18, 2008 - 1:07 pm

Sure, I can do that - as long as people don't mind that committed to
being with the in-laws and off the net for the first couple of weeks in
June.  I'll try to get the first version up shortly, after yet another
pass over the chardev pushdown stuff.

jon
--

From: Jonathan Corbet
Date: Wednesday, May 14, 2008 - 3:07 pm

There's the drivers calling cdev_add() directly as well - another

This is all certainly doable, but it leaves me with one concern: there
will be no signal to external module maintainers that the change needs
to be made.  So, beyond doubt, quite a few of them will just continue to
be shipped unfixed - and they will still run.  If any of them actually
*need* the BKL, something awful may happen to somebody someday.

jon
--

From: Linus Torvalds
Date: Wednesday, May 14, 2008 - 3:14 pm

External modules have bugs because interfaces change. Film at 11.

It's true, but it definitely shouldn't keep us from just doing it. 
Especially since well-maintained external modules (ie the authors follow 
big discussions like this) can just take the kernel lock regardless of 
kernel version, since it won't even be broken with old kernels.

Of course, well-maintained kernel modules wouldn't depend on the BKL in 
the first place. Oh, well.

		Linus


--

From: Alan Cox
Date: Thursday, May 22, 2008 - 1:20 pm

I now have a large patch and my full x86-32 build tree building without
->ioctl() in file_operations. Its a 350K patch and took all day so I'll
begin splitting it out and sending chunks to tree maintainers.

Alan
--

From: Jan Engelhardt
Date: Thursday, May 15, 2008 - 1:44 am

1b: add a .locked_open and move all BKL-requiring code to use that.
When time comes and BKL is gone, .locked_open can be removed again,

Iff you create a new char_dev_ops, don't clutter it with the old stuff.

This is the BSD/Solaris tactic, heh :)

--

From: Diego Calleja
Date: Thursday, May 15, 2008 - 7:54 am

1c: make the BKL-unsafe drivers depend on !SMP && !PREEMPT.
--

From: Andi Kleen
Date: Wednesday, May 14, 2008 - 11:30 am

It's a reasonable start, but have you considered doing this work
in tree instead? As in just add all the warnings, but don't actually
change the semantics yet.  I suspect you would get far more users
this way and the work would go faster. 

It would be reasonable to enable this in -mm if it the warnings are
not too intrusive (self disable itself etc.)

Also for fixing the ioctls I'm not sure that dynamic instrumentation
will really work because it would be tough to execute them all.

I suspect some variant of static code analysis would make sense
for the ioctls. 

I used to do some auditing with cflow. That won't
catch indirect function calls unfortunately, but if there's 
some way to find those and bail out one could do an automated
tool that flags all the ioctls that don't sleep for example
(don't have any sleeping functions in the call chain -- this
might need some manual annotation, but hopefully not much)

Then it would be possible to safely switch those over to a blocking
mutex variant of BKL. 

Now there could be some more automated analysis here: for example the 
main other user of BKL is character open. I suspect to really
make progress here you would also need a open_unlocked() and

Hmm, is BKL really that common still that it's a latency problem?
The few VFS cases like locks can be fixed without extreme measures.

Most of the legacy users are unlikely to be latency problems,
simply because only very few people (or nobody) still has that hardware
and the code will never run.

Also I wouldn't lose sleep over e.g. let ISDN continue using BKL forever.

-Andi
--

From: Alan Cox
Date: Wednesday, May 14, 2008 - 2:00 pm

Most of the legacy users inflict that locking on other code - eg the ISN
use of the BKL directly impacts on the tty layer work.
--

From: Andi Kleen
Date: Wednesday, May 14, 2008 - 2:13 pm

So you just stick unlock_kernel()/lock_kernel() around the call
to TTY (or similar to the entry points)

-Andi




--

From: Alan Cox
Date: Wednesday, May 14, 2008 - 2:19 pm

> Most?


It isn't that simple - I've spent a good deal of time working on it.
There are lots of paths that rely on interactions between modules. Eg we
found stuff racing between the pid structs tty internals and procfs that
happened to be saved by the BKL.

That in itself is a problem Ingo's stuff won't help with: We have lots of
"magic" accidental, undocumented and pot luck BKL locking semantics
between subsystems that are not even visible.
--

From: Linus Torvalds
Date: Wednesday, May 14, 2008 - 2:45 pm

The good news is that I suspect they are going away. It probably is mainly 
tty and /proc by now, and /proc is pretty close to done.

It's hard to have too many inter-module dependencies when most of the core 
modules no longer even take the kernel lock any more.

In the VFS layer, we still have 

 - the ioctl thing, obviously. That's just mind-numbing "move things 
   down", not hard per se. But there's a *lot* of them (and I suspect the 
   huge majority of them don't actually need it, since they'd already be 
   racing against read/write anyway if they did).

 - default_llseek(). Probably the same, just a lot less of it.

 - superblock read/write.

and the latter one in particular is really dubious (we already have 
"[un]lock_super()" around it all, I think).

The core kernel, VM and networking already don't really do BKL. And it's 
seldom the case that subsystems interact with other unrelated subsystems 
outside of the core areas.

So it's a lot of work, no doubt, but I do think we should be able to do 
it. The most mind-numbing part is literally all the ioctl crud. There's 
more ioctl points than there are lock_kernel() calls left anywhere else.

		Linus
--

From: Ingo Molnar
Date: Thursday, May 15, 2008 - 1:02 am

yeah. And Alan has a good point: there _is_ lots of magic stuff 
happening below the BKL. One of them are the BKL <-> other lock 
dependencies. My stuff helps with mapping that part of the magic: it 
turns the BKL into an ordinary mutex and thus integrates it into 
lockdep's existing dependency validation machinery.

In other words: this stuff makes BKL validation _stronger_, not weaker, 
and hence it ultimately helps its mapping and elimination. It turns the 
"magic" into something more concrete.

It might not help with other magic directly - but it helps indirectly, 
because now the "magic" has shrunk, so there's more attention and more 
resources available to fix it in the places where the magic hurts. (And 
suggestions are welcome for more debug helpers to make more magic more 
visible.)

Whenever someone narrows the BKL's scope, that will always have to be 
done carefully - and that's true of any other lock. This patchset 
(except perhaps the boot bits) does not narrow the BKL's scope.

It will still be no doubt a tough job (reducing/changing locking of 
_any_ locked path is a tough job), but it will now fit into our existing 
practices much better and we'll get various reminders from lockdep and 
the other debug helpers when we forgot about some detail.

Before this there was almost zero feedback from the kernel when 
something around the BKL broke: pretty much the only remainder we had 
from incorrect BKL elimination were subtle breakages.

And my personal experience might matter as well: before this i never 
dared to touch BKL code. I once removed _all_ BKL locking from all the 
kernel _by accident_ [i typoed a single line in lib/smp_lock.c] and ran 
it on my main desktop for about a day and never noticed a thing - until 
a few weird TTY messages popped up in the syslog...

But with this scheme, i felt _much_ more secure about touching BKL code, 
and kicking the BKL from the scheduler was pure joy i have to say. (even 
though it will of course remain in the ...
From: Andi Kleen
Date: Wednesday, May 14, 2008 - 3:03 pm

Character devices in general.

And what's pretty nasty is that some interfaces force BKL still, so


- fasync

[had some patches for "fasync_locked", not sure if it's worth it]

- character device open


I tried to recruit kernel janitors some time ago to just do all the
ioctl -> ioctl_unlocked/explicit lock_kernel changes. There were a few
patches generated but the effort died down then.

BTW for ioctl the dynamic instrumentation method proposed also won't
work because it's basically impossible to exercise all these ioctls

-Andi


--

From: Alan Cox
Date: Thursday, May 15, 2008 - 6:34 am

Start at the other end - you can't fix the ioctls until you fix what the
ioctls interact with. That ends up at the basic data structure and once
you fix those the rest just starts to fall into place.
--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 7:27 am

In my experience there's usually not too much interaction with other
kernel structures at the random driver ioctl level. I'm sure tty is
different, but it's probably not typical.

--

From: Alan Cox
Date: Thursday, May 15, 2008 - 8:36 am

Take a look at how file->f_flags is locked.

Alan
--

From: Andi Kleen
Date: Friday, May 16, 2008 - 3:21 am

Ah I posted patches to fix that one a couple of weeks
ago, but unfortunately they fell out of -mm again.

It was part of the "unlocked fasync" patchkit.

-Andi

--

From: H. Peter Anvin
Date: Wednesday, May 14, 2008 - 2:16 pm

... assuming that the ISDN code doesn't assume lock continuity across 
the TTY call.

	-hpa
--

From: Alan Cox
Date: Wednesday, May 14, 2008 - 2:17 pm

And procfs and between the tty and the net config code and ...

Keeping the BKL just in legacy places doesn't work. A counting mutex (ie
one you can self multi-lock) might be very useful to fix some of these
however as once we push it down to the point of being a driver specific
lock we can just give it a driver mutex
--

From: Linus Torvalds
Date: Wednesday, May 14, 2008 - 11:41 am

Ok, so I'm obviously happy. This is exactly the kind of thing I would want 
to see.

That said, the way it is now set up, it's unreasonable to merge anything 
directly, and while I can cherry-pick obvious fixes this way, I do think 
we could do things better.

It should be possible to set things up so that it's a config option, and 
we can mark it EXPERIMENTAL but still merge it into the standard kernel, 
so that we'd have the debug stuff there. That would get a lot more 
coverage, especially if it all still *works*, even if the debug stuff then 
complains (ie it would be nicer if the lock itself didn't start breaking).

So for example, have CONFIG_DEBUG_BKL turn it into a mutex (and select 
mutex debugging), and get all the debug coverage that way, but then when 
somebody enters the scheduler with the lock held, first complain, but then 
auto-release it anyway. That way, bugs get found and complained about, but 
hopefully the machine still ends up working.

		Linus
--

From: Ingo Molnar
Date: Wednesday, May 14, 2008 - 12:41 pm

yeah. This is just a first approximation. It might be v2.6.27 stuff, if 


hm, we'll got more ideas about other debug helpers, but i dont think 
warning in the scheduler is realistic or useful: lots and lots of code 
_does_ reschedule with the BKL held and always did - we never knew this 
before in a reliable way due to the auto-release. Sleeping locks that 
purely nest inside the BKL are the norm in the VFS, in the tty code and 
in most other places - they should be fine and are frequently taken in 
BKL sections (and frequently produce scheduling there).

As the BKL gets pushed inside subsystems, so do inner locks vanish from 
its scope - and at the final stage it can become a spinlock or mutex, 
depending on what the actual use is.

The main point with the mutex is to make the BKL _stricter_ and more 
defined - this hurts BKL using code more (see the many fixes that were 
needed), but it also makes things much more visible and much more 
fixable IMO. This tree turns the BKL into "just another mutex", with a 
tiny bit of self-recursion glue on top of it.

Btw., often there's potential scheduling at points where BKL using code 
does not expect it. So this series might also _fix_ some rare races.

The fact that this also makes BKL critical sections involuntarily 
preemptible is a side-effect (which is one of my main motivations to do 
this whole thing), and it's a pretty much unavoidable side-effect.

Also, turning it into a more or less simple mutex with no scheduler 
smarts at all, it all fits into our "how do we remove a serializing 
lock" workflow rather well. Even if for some piece of code not much 
changes in reality, it becomes more familar, less mystic and more 
trustable to fix and improve.

Btw., while i hacked on this today, i _think_ i've got most of the worst 
problems mapped out already. I needed two fixes to get it to boot to a 
ssh shell prompt without hanging. I needed 10 more fixes to solve all 
the dependencies that lockdep found. Another 5 fixes were ...
From: Frederik Deweerdt
Date: Wednesday, May 14, 2008 - 1:05 pm

Hi Ingo,
Must be a typo?

Regards,
Frederik
--

From: Linus Torvalds
Date: Thursday, May 15, 2008 - 10:41 am

So looking a bit more at your trivial fixups, I'd suggest strongly that 
they be re-organized a bit.

I cherry-picked your tty layer thing, because it was a real fix.


The above doesn't even work in general. It depends on having just a single 
level of locking, and is ugly to boot. So wow about we just expose some 
version of

	depth = release_kernel_lock()
	..
	reacquire_kernel_lock(depth);

to existing BKL users as a way to safely release and re-aquire it 
regardless of depth. That makes the code more generic, but it *also* makes 

I think this one should be changed - that comment says "do not take", but 
in fact you still do take it, you just release it earlier. So we should 
just not start out with the kernel locked in the first place. The BKL 
doesn't do anything for the init sequence anyway, since all of this is for 
code that runs before there even are any other threads (not counting the 
idle thread).

I don't see anything in there that could *possibly* depend on the kernel 

ACK. Code that relies on this is broken anyway. 

We used to have lots of "proc_create()" followed by setup code that could 
race with "proc_lookup()", but they were fundamentally racy anyway, so 





And obviously this one I'd never take. It would need to work with a 
working BKL implementation.

			Linus
--

From: Arjan van de Ven
Date: Thursday, May 15, 2008 - 1:27 pm

On Thu, 15 May 2008 10:41:54 -0700 (PDT)


can we make this even more specific/restricted? Like having something
like

call_bkl_unlocked(function_pointer, argument);

or something that will internally do the full unlock and then the
function call. The last thing we need is another nailgun that BKL using
code can use to staple themselves to something big and fast moving.
By having a more restricted interface... less likely.
Maybe we can even get away with only a

drop_bkl_and_schedule();

and nothing else.



--

From: Peter Zijlstra
Date: Thursday, May 15, 2008 - 1:45 pm

No, that would defeat the whole purpose of the exercise. This drop on
schedule property makes it possible to have inverse lock order and not
deadlock.

That also makes the manual re-acquire on a different level pretty ugly
and deadlock prone.

The whole purpose of this patch series was to get rid of this exact
problem so that the BKL turns into something that resembles a normal
lock within the regular locking hierarchy.

--

From: Arjan van de Ven
Date: Thursday, May 15, 2008 - 2:22 pm

On Thu, 15 May 2008 22:45:55 +0200

I would totally agree with you, except that all these patches
effectively do it manually again ANYWAY :(

so what I propose is make it explicit drop_bkl_and_schedule() call only,
and only do them as a very very last resort.

For 99% of the rest it does give exactly the regular benefits you
describe. And we can then prioritize these ugly cases to get de-bkl'd
first.
--

Previous thread: [GIT PATCH] driver core fixes against 2.6.26-rc2 by Greg KH on Wednesday, May 14, 2008 - 10:49 am. (3 messages)

Next thread: [PATCH] Remove final references to deprecated, unreferenced TOPDIR. by Robert P. J. Day on Wednesday, May 14, 2008 - 11:25 am. (1 message)