The block layer is one of the remaining users of the Big Kernel Lock. In there, it is used mainly for serializing the blkdev_get and blkdev_put functions and some ioctl implementations as well as some less common open functions of related character devices following a previous pushdown and parts of the blktrace code. The only one that seems to be a bit nasty is the blkdev_get function which is actually recursive and may try to get the BKL twice. All users except the one in blkdev_get seem to be outermost locks, meaning we don't rely on the release-on-sleep semantics to avoid deadlocks. The ctl_mutex (pktcdvd.ko), raw_mutex (raw.ko), state_mutex (dasd.ko), reconfig_mutex (md.ko), and jfs_log_mutex (jfs.ko) may be held when blkdev_get is called, but as far as I can tell, these mutexes are never acquired from any of the functions that get converted in this patch. In order to get rid of the BKL, this introduces a new global mutex called blkdev_mutex, which replaces the BKL in all drivers that directly interact with the block layer. In case of blkdev_get, the mutex is moved outside of the function itself in order to avoid the recursive taking of blkdev_mutex. Testing so far has shown no problems whatsoever from this patch, but the usage in blkdev_get may introduce extra latencies, and I may have missed corner cases where an block device ioctl function sleeps for a significant amount of time, which may be harmful to the performance of other threads. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: linux-scsi@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- block/bsg.c | 2 -- ...
From: Matthew Wilcox <willy@linux.intel.com> I've taken a patch originally written by Matthew Wilcox and ported it to the current version. It seems that there were originally concerns that this breaks NFS, but since Trond has recently removed the BKL from NFS, my naive assumption would be that it's all good now, despite not having tried to understand what it does. Original introduction from Willy: I've been promising to do this for about seven years now. It seems to work well enough, but I haven't run any serious stress tests on it. This implementation uses one spinlock to protect both lock lists and all the i_flock chains. It doesn't seem worth splitting up the locking any further. I had to move one memory allocation out from under the file_lock_lock. I hope I got that logic right. I'm rather tempted to split out the find_conflict algorithm from that function into something that can be called separately for the FL_ACCESS case. I also have to drop and reacquire the file_lock_lock around the call to cond_resched(). This was done automatically for us before by the special BKL semantics. I had to change vfs_setlease() as it relied on the special BKL ability to recursively acquire the same lock. The internal caller now calls __vfs_setlease and the exported interface acquires and releases the file_lock_lock around calling __vfs_setlease. I should probably split out the removal of interruptible_sleep_on_locked() as it's basically unrelated to all this. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: linux-kernel@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org --- fs/locks.c | ...
Hi Arnd, We still need to fix up the bits in NFS that dereference inode->i_flock. On the client side, those are mainly the bits that deal with lock recovery when the NFS server has rebooted or restarted. AFAICS, there are two places in the NFSv4 client that need to be changed to call lock_flocks(): nfs_delegation_claim_locks(), and nfs4_reclaim_locks(). In both cases, the replacement is trivial. For NFSv3, I think we are already safe, since AFAICS the host->h_rwsem already provides exclusion between file locking and lock recovery attempts. I think we should therefore be able to immediately remove the BKL in fs/lockd/clntlock.c:reclaimer(). I'm not as sure about how sensitive the NFS server is to the switch from BKL -> lock_flocks(). Perhaps Bruce can comment... Cheers Trond --
There's a callback from the lease code, the implementation of which sleeps in the NFSv4 server case. I've got patches for the next merge window which remove that sleep (by just deferring work to a workqueue). release_lockowner() traverses the lock list. I need to fix that. The lockd thread is entirely under the BKL--it takes it at the top of fs/lockd/svc.c:lockd(), and drops it at the bottom. It's possible there may be code that lockd runs that assumes mutual exclusion with code in locks.c, but I don't know. --b. --
That one seems interesting as the lockd thread calls lots of sleeping functions as well as file lock code that takes the BKL again, both of which of course is not allowed when converting to a spinlock. I just attempted to blindly convert lockd to taking lock_flocks() in place of the BKL and releasing it where necessary, but quickly gave up because this seems rather pointless. It basically never does anything interesting between places where it would need to drop the lock. Arnd --
Don't we need to have access to this new lock from modules? In particular, nfsd/lockd currently call lock_kernel to be able to safely access i_flock. This would seem to imply that they would either need new functions inside locks.c to do the same work or export the new lock functions. It seems easiest to just do EXPORT_SYMBOL on the new lock/unlock functions added in this patch, but I do understand if that isn't desireable. Brad Boyer flar@allandria.com --
Hi, Some comments... Why not just put the spin lock calls inline? Use cond_resched_lock() here perhaps? This looks to me like generic_setlease() or whatever fs specific ->setlease() there might be will be called under a spin lock. That doesn't look right to me. Rather than adding locking here, why not push the BKL down into generic_setlease() and ->setlease() first, and then eliminate it on a case by case basis later on? That is the pattern that has been followed elsewhere in the kernel. I might have some further comment on this, but thats as far as I've got at the moment, Steve. --
Sounds fair. Besides generic_setlease (which is in this file as well), the only non-trivial one is cifs_setlease (Cc'ing Steve French now) and that calls generic_setlease in the end. If we can show that cifs_setlease does not need locking, the new lock could be put into generic_setlease directly. Arnd --
The sg driver has no seek semantics on its read() and
write() calls. And sg_open() calls nonseekable_open(). So
.llseek = no_llseek,
And I just checked st.c (SCSI tape driver) and it calls
lock_kernel() .
Doug Gilbert
--
Ok, I missed the nonseekable_open here and assumed someone might be calling seek on it. I'll use no_llseek then, or Ah, good point. So even if the st driver does not need any locking against the block layer, it might need to lock its ioctl against sg. The most simple solution for this would be to let sg take both blkdev_mutex and the BKL, which of course feels like a step backwards. A better way is to get rid of the BKL in sg, which requires a better understanding of what it's actually protecting. It only gets it in the open and ioctl functions, which is a result of the pushdown from the respective file operations. Chances are that it's not needed at all, but that's really hard to tell. Can you shed some more light on this? Arnd --
At the level of SCSI commands, tape device state assumptions made by the st driver could be compromised by SCSI commands sent by the sg driver. However the BKL was never meant to address that concern. From the comment in st_open() [st.c] it would be using nonseekable_open() as well but there are apps out there that do lseek()s on its file descriptors. Not sure how long nonseekable_open() has been in the sg driver The BKL is not used to protect any of the internal objects within the sg driver. From memory it was added in some large code sweep through, not unlike what you are proposing now. So I would not be concerned about any kernel locking interactions between the sg and st drivers. I have added Kai Makisara (st maintainer) to the cc list. Doug Gilbert --
It's been there for a long time, at least since the start of The one in the open function was moved there when the BKL was moved out from vfs_open(), while the use in ioctl is implicit by never having been converted to unlocked_ioctl. I don't see anything that really needs BKL protection in sg_open, so that can probably just be killed. For sg_ioctl, at least the blk_trace_* and scsi_ioctl stuff is currently called with BKL held everywhere else (not in st_ioctl though) Ok. I've also checked st.c again and noticed that it doesn't use use the BKL in ioctl() but only in open(), which is very unlikely to race against anything in sg.c or the block subsystem. Arnd --
Using sg with a tape opened by st may lead to incorrect state within st. However, to prevent this would be far too complicated and so the coordination responsibility is left to the user. (There are some cases where use of both sg and st can be justified but then the user should know what he/she is doing and properly serialize access in user space.) BKL does not have any "hidden duties" in open() in st.c. I don't know any reason why it would be needed, but, because I have not been absolutely sure, I have not removed it. (The tape devices are not opened often and so the overhead has been negligible. That is, while BKL has been available.) If you don't see any reason for BKL in open(), go ahead and remove it. Kai --
The st_open function is serialized through the st_dev_arr_lock
and the STp->in_use flag, so there is no race that the BKL
can protect against in the driver itself, and the function
does not access any global state outside of the driver that
might be protected with the BKL.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
drivers/scsi/st.c | 9 +--------
1 files changed, 1 insertions(+), 8 deletions(-)
Ok, that certainly simplifies the sg.c conversion. Thanks!
Arnd
diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index 3ea1a71..b1462e0 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -39,7 +39,6 @@ static const char *verstr = "20081215";
#include <linux/cdev.h>
#include <linux/delay.h>
#include <linux/mutex.h>
-#include <linux/smp_lock.h>
#include <asm/uaccess.h>
#include <asm/dma.h>
@@ -1180,7 +1179,6 @@ static int st_open(struct inode *inode, struct file *filp)
int dev = TAPE_NR(inode);
char *name;
- lock_kernel();
/*
* We really want to do nonseekable_open(inode, filp); here, but some
* versions of tar incorrectly call lseek on tapes and bail out if that
@@ -1188,10 +1186,8 @@ static int st_open(struct inode *inode, struct file *filp)
*/
filp->f_mode &= ~(FMODE_PREAD | FMODE_PWRITE);
- if (!(STp = scsi_tape_get(dev))) {
- unlock_kernel();
+ if (!(STp = scsi_tape_get(dev)))
return -ENXIO;
- }
write_lock(&st_dev_arr_lock);
filp->private_data = STp;
@@ -1200,7 +1196,6 @@ static int st_open(struct inode *inode, struct file *filp)
if (STp->in_use) {
write_unlock(&st_dev_arr_lock);
scsi_tape_put(STp);
- unlock_kernel();
DEB( printk(ST_DEB_MSG "%s: Device already in use.\n", name); )
return (-EBUSY);
}
@@ -1249,14 +1244,12 @@ static int st_open(struct inode *inode, struct file *filp)
retval = (-EIO);
goto err_out;
}
- unlock_kernel();
return 0;
err_out:
normalize_buffer(STp->buffer);
STp->in_use = 0;
scsi_tape_put(STp);
- unlock_kernel();
return ...Kai, can we get your ack on this? Thanks. --
