The problem: There is a class of deadlocks that we currently have in the kernel that lockdep does not recognize. In particular with the network stack we can have: rtnl_lock(); use_table(); unregister_netdevice(); rtnl_lock(); unregister_sysctl_table(); .... wait_for_completion(); rtnl_lock(); .... unuse_table() rtnl_unlock(); complete(); Where we never make it to the lines labled .... My patch following patches treats the sysctl use count as a read/writer lock for purposes of lockdep. The code works but I'm not certain I have plugged into lockdep quite correctly. Eric --
The current code works fine, and is actually not buggy but it
does prevent enabling the use of lockdep to check refcounting
vs lock holding ordering problems. So since we can ensure
that we are only hold a single sysctl_head at a time. Allowing
lockdep to complain.
Signed-off-by: Eric Biederman <ebiederm@aristanetworks.com>
---
fs/proc/proc_sysctl.c | 24 ++++++++++--------------
1 files changed, 10 insertions(+), 14 deletions(-)
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 94fcfff..46eb34c 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -79,7 +79,6 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
{
struct ctl_table_header *head = grab_header(dir);
struct ctl_table *table = PROC_I(dir)->sysctl_entry;
- struct ctl_table_header *h = NULL;
struct qstr *name = &dentry->d_name;
struct ctl_table *p;
struct inode *inode;
@@ -97,10 +96,11 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
p = find_in_table(table, name);
if (!p) {
- for (h = sysctl_head_next(NULL); h; h = sysctl_head_next(h)) {
- if (h->attached_to != table)
+ sysctl_head_finish(head);
+ for (head = sysctl_head_next(NULL); head; head = sysctl_head_next(head)) {
+ if (head->attached_to != table)
continue;
- p = find_in_table(h->attached_by, name);
+ p = find_in_table(head->attached_by, name);
if (p)
break;
}
@@ -110,9 +110,7 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
goto out;
err = ERR_PTR(-ENOMEM);
- inode = proc_sys_make_inode(dir->i_sb, h ? h : head, p);
- if (h)
- sysctl_head_finish(h);
+ inode = proc_sys_make_inode(dir->i_sb, head, p);
if (!inode)
goto out;
@@ -243,7 +241,6 @@ static int proc_sys_readdir(struct file *filp, void *dirent, filldir_t filldir)
struct inode *inode = dentry->d_inode;
struct ctl_table_header *head = grab_header(inode);
struct ctl_table *table = ...It is possible for get lock ordering deadlocks between locks
and waiting for the sysctl used count to drop to zero. We have
recently observed one of these in the networking code.
So teach the sysctl code how to speak lockdep so the kernel
can warn about these kinds of rare issues proactively.
Signed-off-by: Eric Biederman <ebiederm@aristanetworks.com>
---
include/linux/sysctl.h | 4 ++
kernel/sysctl.c | 108 ++++++++++++++++++++++++++++++++++++++---------
2 files changed, 91 insertions(+), 21 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 39d471d..ec9b1dd 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -28,6 +28,7 @@
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/compiler.h>
+#include <linux/lockdep.h>
struct file;
struct completion;
@@ -1087,6 +1088,9 @@ struct ctl_table_header
struct ctl_table *attached_by;
struct ctl_table *attached_to;
struct ctl_table_header *parent;
+#ifdef CONFIG_PROVE_LOCKING
+ struct lockdep_map dep_map;
+#endif
};
/* struct ctl_path describes where in the hierarchy a table is added */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c5ef44f..ea8cc39 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1454,12 +1454,63 @@ static struct ctl_table dev_table[] = {
static DEFINE_SPINLOCK(sysctl_lock);
+#ifndef CONFIG_PROVE_LOCKING
+
+# define lock_sysctl() spin_lock(&sysctl_lock)
+# define unlock_sysctl() spin_unlock(&sysctl_lock)
+
+static inline void table_acquire_use(struct ctl_table_header *hdr) { }
+static inline void table_release_use(struct ctl_table_header *hdr) { }
+static inline void table_acquire(struct ctl_table_header *hdr) { }
+static inline void table_contended(struct ctl_table_header *hdr) { }
+static inline void table_acquired(struct ctl_table_header *hdr) { }
+static inline void table_release(struct ctl_table_header *hdr) { }
+
+#else /* CONFIG_PROVE_LOCKING */
+
+# define lock_sysctl() ...So I merged these two patches. I designated them as to-be-merged via Alexey, with a Cc:Peter (IOW: wakey wakey ;)) It doesn't make a lot of sense to have a different Author: address, IMO. So I rewrote the From: address to be ebiederm@aristanetworks.com, OK? But it would be better if you were to do this, so please do put an explicit From: line at the top of your changelogs. --
It is more sysctl than proc but whichever. As long as people aren't From my perspective it does. It is much easier to use my personal email setup for doing kernel work. But I need to indicate I am signing off as an employee of AristaNetworks. Saying aristanetworks in my signed-off-by seems a little more official. Eric --
On Mon, 30 Mar 2009 15:53:04 -0700 Yep. But keeping all fs/proc/ changes in Alexey's tree keeps things OK, fine, but to avoid confusion and to overtly announce this preference, please do add the From: Line? Plus that saves me from having to edit "ebiederm@xmission.com (Eric W. Biederman)" to "Eric W. Biederman <ebiederm@xmission.com>" a zillion times ;) --
Sounds like a plan. You work on that. I will work on undermining Sounds like a plan. In truth I have been deleting that line from my patches anyway.... It just seemed silly to have it in there when the real From matched. Eric --
Ah, I was under the impression they weren't quite finished yet, apparently I was mistaken. Eric, didn't you need recursive locks fixed and such? Let me have a look at them. --
In theory. In practice I removed the one case that could legitimately --
It would be very good to extend this changelog with a more detailed explanation of the deadlock in question. Let me see if I got it right: We're holding a lock, while waiting for the refcount to drop to 0. Dropping that refcount is blocked on that lock. This means every sysctl thingy gets the same class, is that intended/desired? --
Exactly. I must have written an explanation so many times that it got lost when I wrote that commit message. In particular the problem can be see with /proc/sys/net/ipv4/conf/*/forwarding. The problem is that the handler for fowarding takes the rtnl_lock with the reference count held. Then we call unregister_sysctl_table under the rtnl_lock. If the refcount is to be considered a lock. sysctl_lock must be considered the internals of that lock. lockdep gets extremely confused otherwise. Since the spinlock is static to this file I'm not especially worried There is only one place we initialize it, and as far as I know really only one place we take it. Which is the definition of a lockdep class as far as I know. Eric --
Usually lock internal locks still get lockdep coverage. Let see if we can find a way for this to be true even here. I suspect the below to There you acquire the table while holding the spinlock, generating: sysctl_lock -> table_lock, however you then release the sysctl_lock and re-acquire it, generating table_lock -> sysctl_lock. Indeed, just checking. --
That is an artifact of sysctl_lock being used to implement table_lock as best as I can tell. The case you point out I could probably play with where I claim the lock is acquired and make it work. __sysctl_head_next on the read side is trickier. We come in with table_lock held for read. We grab sysctl_lock. We release table_lock (aka the reference count is decremented) We grab table_lock on the next table (aka the reference count is incremented) We release sysctl_lock If we generate lockdep annotations for that it would seem to transition through the states: table_lock table_lock -> sysctl_lock sysctl_lock sysctl_lock -> table_lock table_lock Short of saying table_lock is an implementation detail. Used to make certain operations atomic I do not see how to model this case. Let me take a slightly simpler case and ask how that gets modeled. Looking at rwsem. Ok all of the annotations are outside of the spin_lock. So in some sense we are sloppy, and fib to lockdep about when the we acquire/release a lock. In another sense we are simply respecting the abstraction. I guess I can take a look and see if I can model things a slightly The only difference I can possibly see is read side versus write side. Or in my case refcount side versus wait side. Eric --
`make headers_check': /usr/src/25/usr/include/linux/sysctl.h:31: included file 'linux/lockdep.h' is not exported make[2]: *** [/usr/src/25/usr/include/linux/.check] Error 1 It looks like we'll need version 2 on these patches.. --
