Re: [PATCH 2/2] [RFC] Remove BKL from fs/locks.c

Previous thread: Re: USB transfer_buffer allocations on 64bit systems by Alan Stern on Wednesday, April 14, 2010 - 1:29 pm. (3 messages)

Next thread: [PATCH] Reset ps/2 port should psmouse_probe fail before retrying by Peter M. Petrakis on Wednesday, April 14, 2010 - 1:46 pm. (2 messages)
From: Arnd Bergmann
Date: Wednesday, April 14, 2010 - 1:36 pm

The block layer is one of the remaining users
of the Big Kernel Lock. In there, it is used
mainly for serializing the blkdev_get and
blkdev_put functions and some ioctl implementations
as well as some less common open functions of
related character devices following a previous
pushdown and parts of the blktrace code.

The only one that seems to be a bit nasty is the
blkdev_get function which is actually recursive
and may try to get the BKL twice.

All users except the one in blkdev_get seem to
be outermost locks, meaning we don't rely on
the release-on-sleep semantics to avoid deadlocks.

The ctl_mutex (pktcdvd.ko), raw_mutex (raw.ko),
state_mutex (dasd.ko), reconfig_mutex (md.ko),
and jfs_log_mutex (jfs.ko) may be held when
blkdev_get is called, but as far as I can tell,
these mutexes are never acquired from any of the
functions that get converted in this patch.

In order to get rid of the BKL, this introduces
a new global mutex called blkdev_mutex, which
replaces the BKL in all drivers that directly
interact with the block layer. In case of blkdev_get,
the mutex is moved outside of the function itself
in order to avoid the recursive taking of blkdev_mutex.

Testing so far has shown no problems whatsoever
from this patch, but the usage in blkdev_get
may introduce extra latencies, and I may have
missed corner cases where an block device ioctl
function sleeps for a significant amount of time,
which may be harmful to the performance of other
threads.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: linux-scsi@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 block/bsg.c               |    2 --
 ...
From: Arnd Bergmann
Date: Wednesday, April 14, 2010 - 1:36 pm

From: Matthew Wilcox <willy@linux.intel.com>

I've taken a patch originally written by Matthew Wilcox and
ported it to the current version. It seems that there were
originally concerns that this breaks NFS, but since Trond
has recently removed the BKL from NFS, my naive assumption
would be that it's all good now, despite not having tried to
understand what it does.

Original introduction from Willy:

   I've been promising to do this for about seven years now.

   It seems to work well enough, but I haven't run any serious stress
   tests on it.  This implementation uses one spinlock to protect both lock
   lists and all the i_flock chains.  It doesn't seem worth splitting up
   the locking any further.

   I had to move one memory allocation out from under the file_lock_lock.
   I hope I got that logic right.  I'm rather tempted to split out the
   find_conflict algorithm from that function into something that can be
   called separately for the FL_ACCESS case.

   I also have to drop and reacquire the file_lock_lock around the call
   to cond_resched().  This was done automatically for us before by the
   special BKL semantics.

   I had to change vfs_setlease() as it relied on the special BKL ability
   to recursively acquire the same lock.  The internal caller now calls
   __vfs_setlease and the exported interface acquires and releases the
   file_lock_lock around calling __vfs_setlease.

   I should probably split out the removal of interruptible_sleep_on_locked()
   as it's basically unrelated to all this.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
---
 fs/locks.c |  ...
From: Trond Myklebust
Date: Wednesday, April 14, 2010 - 1:52 pm

Hi Arnd,

We still need to fix up the bits in NFS that dereference inode->i_flock.
On the client side, those are mainly the bits that deal with lock
recovery when the NFS server has rebooted or restarted.

AFAICS, there are two places in the NFSv4 client that need to be changed
to call lock_flocks(): nfs_delegation_claim_locks(), and
nfs4_reclaim_locks(). In both cases, the replacement is trivial.

For NFSv3, I think we are already safe, since AFAICS the host->h_rwsem
already provides exclusion between file locking and lock recovery
attempts. I think we should therefore be able to immediately remove the
BKL in fs/lockd/clntlock.c:reclaimer().

I'm not as sure about how sensitive the NFS server is to the switch from
BKL -> lock_flocks(). Perhaps Bruce can comment...

Cheers
  Trond

--

From: J. Bruce Fields
Date: Wednesday, April 14, 2010 - 2:04 pm

There's a callback from the lease code, the implementation of which
sleeps in the NFSv4 server case.  I've got patches for the next merge
window which remove that sleep (by just deferring work to a workqueue).

release_lockowner() traverses the lock list.  I need to fix that.

The lockd thread is entirely under the BKL--it takes it at the top of
fs/lockd/svc.c:lockd(), and drops it at the bottom.  It's possible
there may be code that lockd runs that assumes mutual exclusion with
code in locks.c, but I don't know.

--b.
--

From: Arnd Bergmann
Date: Thursday, April 15, 2010 - 1:36 pm

That one seems interesting as the lockd thread calls lots of sleeping
functions as well as file lock code that takes the BKL again, both
of which of course is not allowed when converting to a spinlock.

I just attempted to blindly convert lockd to taking lock_flocks()
in place of the BKL and releasing it where necessary, but quickly gave
up because this seems rather pointless. It basically never does
anything interesting between places where it would need to drop the
lock.

	Arnd
--

From: Brad Boyer
Date: Wednesday, April 14, 2010 - 9:14 pm

Don't we need to have access to this new lock from modules? In particular,
nfsd/lockd currently call lock_kernel to be able to safely access i_flock.
This would seem to imply that they would either need new functions inside
locks.c to do the same work or export the new lock functions. It seems
easiest to just do EXPORT_SYMBOL on the new lock/unlock functions added
in this patch, but I do understand if that isn't desireable.

	Brad Boyer
	flar@allandria.com

--

From: Steven Whitehouse
Date: Thursday, April 15, 2010 - 7:48 am

Hi,

Some comments...

Why not just put the spin lock calls inline?

Use cond_resched_lock() here perhaps?

This looks to me like generic_setlease() or whatever fs specific
->setlease() there might be will be called under a spin lock. That
doesn't look right to me.

Rather than adding locking here, why not push the BKL down into
generic_setlease() and ->setlease() first, and then eliminate it on a
case by case basis later on? That is the pattern that has been followed
elsewhere in the kernel.

I might have some further comment on this, but thats as far as I've got
at the moment,

Steve.


--

From: Arnd Bergmann
Date: Thursday, April 15, 2010 - 8:17 am

Sounds fair. Besides generic_setlease (which is in this file as well),
the only non-trivial one is cifs_setlease (Cc'ing Steve French now)
and that calls generic_setlease in the end.

If we can show that cifs_setlease does not need locking, the new lock
could be put into generic_setlease directly.

	Arnd
--

From: Douglas Gilbert
Date: Wednesday, April 14, 2010 - 3:48 pm

The sg driver has no seek semantics on its read() and
write() calls. And sg_open() calls nonseekable_open(). So
     .llseek = no_llseek,

And I just checked st.c (SCSI tape driver) and it calls
lock_kernel() .


Doug Gilbert
--

From: Arnd Bergmann
Date: Thursday, April 15, 2010 - 12:11 am

Ok, I missed the nonseekable_open here and assumed someone
might be calling seek on it. I'll use no_llseek then, or

Ah, good point. So even if the st driver does not need
any locking against the block layer, it might need to
lock its ioctl against sg.

The most simple solution for this would be to let sg
take both blkdev_mutex and the BKL, which of course
feels like a step backwards.

A better way is to get rid of the BKL in sg, which requires
a better understanding of what it's actually protecting.
It only gets it in the open and ioctl functions, which is a
result of the pushdown from the respective file operations.
Chances are that it's not needed at all, but that's really
hard to tell. Can you shed some more light on this?

	Arnd
--

From: Douglas Gilbert
Date: Thursday, April 15, 2010 - 6:15 am

At the level of SCSI commands, tape device state assumptions
made by the st driver could be compromised by SCSI commands
sent by the sg driver. However the BKL was never meant
to address that concern.

 From the comment in st_open() [st.c] it would be using
nonseekable_open() as well but there are apps out there
that do lseek()s on its file descriptors. Not sure
how long nonseekable_open() has been in the sg driver

The BKL is not used to protect any of the internal
objects within the sg driver. From memory it was added
in some large code sweep through, not unlike what you
are proposing now.

So I would not be concerned about any kernel locking
interactions between the sg and st drivers. I have
added Kai Makisara (st maintainer) to the cc list.

Doug Gilbert

--

From: Arnd Bergmann
Date: Thursday, April 15, 2010 - 7:29 am

It's been there for a long time, at least since the start of

The one in the open function was moved there when the
BKL was moved out from vfs_open(), while the use in ioctl is
implicit by never having been converted to unlocked_ioctl.

I don't see anything that really needs BKL protection in
sg_open, so that can probably just be killed. For sg_ioctl,
at least the blk_trace_* and scsi_ioctl stuff is currently
called with BKL held everywhere else (not in st_ioctl though)

Ok. I've also checked st.c again and noticed that it
doesn't use use the BKL in ioctl() but only in open(),
which is very unlikely to race against anything in sg.c
or the block subsystem.

	Arnd
--

From: Kai Makisara
Date: Thursday, April 15, 2010 - 1:03 pm

Using sg with a tape opened by st may lead to incorrect state within st. 
However, to prevent this would be far too complicated and so the 
coordination responsibility is left to the user. (There are some 
cases where use of both sg and st can be justified but then the user 
should know what he/she is doing and properly serialize access in user 
space.)

BKL does not have any "hidden duties" in open() in st.c. I don't know any 
reason why it would be needed, but, because I have not been absolutely 
sure, I have not removed it. (The tape devices are not opened often and so 
the overhead has been negligible. That is, while BKL has been available.) 
If you don't see any reason for BKL in open(), go ahead and remove it.

Kai

--

From: Arnd Bergmann
Date: Thursday, April 15, 2010 - 1:51 pm

The st_open function is serialized through the st_dev_arr_lock
and the STp->in_use flag, so there is no race that the BKL
can protect against in the driver itself, and the function
does not access any global state outside of the driver that
might be protected with the BKL.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/scsi/st.c |    9 +--------
 1 files changed, 1 insertions(+), 8 deletions(-)


Ok, that certainly simplifies the sg.c conversion. Thanks!

	Arnd

diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index 3ea1a71..b1462e0 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -39,7 +39,6 @@ static const char *verstr = "20081215";
 #include <linux/cdev.h>
 #include <linux/delay.h>
 #include <linux/mutex.h>
-#include <linux/smp_lock.h>
 
 #include <asm/uaccess.h>
 #include <asm/dma.h>
@@ -1180,7 +1179,6 @@ static int st_open(struct inode *inode, struct file *filp)
 	int dev = TAPE_NR(inode);
 	char *name;
 
-	lock_kernel();
 	/*
 	 * We really want to do nonseekable_open(inode, filp); here, but some
 	 * versions of tar incorrectly call lseek on tapes and bail out if that
@@ -1188,10 +1186,8 @@ static int st_open(struct inode *inode, struct file *filp)
 	 */
 	filp->f_mode &= ~(FMODE_PREAD | FMODE_PWRITE);
 
-	if (!(STp = scsi_tape_get(dev))) {
-		unlock_kernel();
+	if (!(STp = scsi_tape_get(dev)))
 		return -ENXIO;
-	}
 
 	write_lock(&st_dev_arr_lock);
 	filp->private_data = STp;
@@ -1200,7 +1196,6 @@ static int st_open(struct inode *inode, struct file *filp)
 	if (STp->in_use) {
 		write_unlock(&st_dev_arr_lock);
 		scsi_tape_put(STp);
-		unlock_kernel();
 		DEB( printk(ST_DEB_MSG "%s: Device already in use.\n", name); )
 		return (-EBUSY);
 	}
@@ -1249,14 +1244,12 @@ static int st_open(struct inode *inode, struct file *filp)
 			retval = (-EIO);
 		goto err_out;
 	}
-	unlock_kernel();
 	return 0;
 
  err_out:
 	normalize_buffer(STp->buffer);
 	STp->in_use = 0;
 	scsi_tape_put(STp);
-	unlock_kernel();
 	return ...
From: Frederic Weisbecker
Date: Thursday, April 29, 2010 - 7:18 pm

Kai, can we get your ack on this?

Thanks.

--

Previous thread: Re: USB transfer_buffer allocations on 64bit systems by Alan Stern on Wednesday, April 14, 2010 - 1:29 pm. (3 messages)

Next thread: [PATCH] Reset ps/2 port should psmouse_probe fail before retrying by Peter M. Petrakis on Wednesday, April 14, 2010 - 1:46 pm. (2 messages)