[PATCH 2/2] sysctl: lockdep support for sysctl reference counting.

Previous thread: [PATCH 15/77] irda: convert smsc driver to net_device_ops by Stephen Hemminger on Friday, March 20, 2009 - 10:35 pm. (3 messages)

Next thread: [PATCH] 3c59x: convert to net_device_ops by Jiri Pirko on Saturday, March 21, 2009 - 4:35 am. (3 messages)
From: Eric W. Biederman
Date: Saturday, March 21, 2009 - 12:39 am

The problem:  There is a class of deadlocks that we currently
have in the kernel that lockdep does not recognize. 

In particular with the network stack we can have:

rtnl_lock();                         use_table();
unregister_netdevice();              rtnl_lock();
unregister_sysctl_table();           ....
wait_for_completion();               rtnl_lock();
....                                 unuse_table()
rtnl_unlock();                       complete();



Where we never make it to the lines labled ....

My patch following patches treats the sysctl use count as a read/writer
lock for purposes of lockdep.

The code works but I'm not certain I have plugged into lockdep quite
correctly.

Eric

--

From: Eric W. Biederman
Date: Saturday, March 21, 2009 - 12:40 am

The current code works fine, and is actually not buggy but it
does prevent enabling the use of lockdep to check refcounting
vs lock holding ordering problems.   So since we can ensure
that we are only hold a single sysctl_head at a time.  Allowing
lockdep to complain.

Signed-off-by: Eric Biederman <ebiederm@aristanetworks.com>
---
 fs/proc/proc_sysctl.c |   24 ++++++++++--------------
 1 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 94fcfff..46eb34c 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -79,7 +79,6 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
 {
 	struct ctl_table_header *head = grab_header(dir);
 	struct ctl_table *table = PROC_I(dir)->sysctl_entry;
-	struct ctl_table_header *h = NULL;
 	struct qstr *name = &dentry->d_name;
 	struct ctl_table *p;
 	struct inode *inode;
@@ -97,10 +96,11 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
 
 	p = find_in_table(table, name);
 	if (!p) {
-		for (h = sysctl_head_next(NULL); h; h = sysctl_head_next(h)) {
-			if (h->attached_to != table)
+		sysctl_head_finish(head);
+		for (head = sysctl_head_next(NULL); head; head = sysctl_head_next(head)) {
+			if (head->attached_to != table)
 				continue;
-			p = find_in_table(h->attached_by, name);
+			p = find_in_table(head->attached_by, name);
 			if (p)
 				break;
 		}
@@ -110,9 +110,7 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
 		goto out;
 
 	err = ERR_PTR(-ENOMEM);
-	inode = proc_sys_make_inode(dir->i_sb, h ? h : head, p);
-	if (h)
-		sysctl_head_finish(h);
+	inode = proc_sys_make_inode(dir->i_sb, head, p);
 
 	if (!inode)
 		goto out;
@@ -243,7 +241,6 @@ static int proc_sys_readdir(struct file *filp, void *dirent, filldir_t filldir)
 	struct inode *inode = dentry->d_inode;
 	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = ...
From: Eric W. Biederman
Date: Saturday, March 21, 2009 - 12:42 am

It is possible for get lock ordering deadlocks between locks
and waiting for the sysctl used count to drop to zero.  We have
recently observed one of these in the networking code.

So teach the sysctl code how to speak lockdep so the kernel
can warn about these kinds of rare issues proactively.

Signed-off-by: Eric Biederman <ebiederm@aristanetworks.com>
---
 include/linux/sysctl.h |    4 ++
 kernel/sysctl.c        |  108 ++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 91 insertions(+), 21 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 39d471d..ec9b1dd 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -28,6 +28,7 @@
 #include <linux/kernel.h>
 #include <linux/types.h>
 #include <linux/compiler.h>
+#include <linux/lockdep.h>
 
 struct file;
 struct completion;
@@ -1087,6 +1088,9 @@ struct ctl_table_header
 	struct ctl_table *attached_by;
 	struct ctl_table *attached_to;
 	struct ctl_table_header *parent;
+#ifdef CONFIG_PROVE_LOCKING
+	struct lockdep_map dep_map;
+#endif
 };
 
 /* struct ctl_path describes where in the hierarchy a table is added */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c5ef44f..ea8cc39 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1454,12 +1454,63 @@ static struct ctl_table dev_table[] = {
 
 static DEFINE_SPINLOCK(sysctl_lock);
 
+#ifndef CONFIG_PROVE_LOCKING
+
+# define lock_sysctl() spin_lock(&sysctl_lock)
+# define unlock_sysctl() spin_unlock(&sysctl_lock)
+
+static inline void table_acquire_use(struct ctl_table_header *hdr) { }
+static inline void table_release_use(struct ctl_table_header *hdr) { }
+static inline void table_acquire(struct ctl_table_header *hdr) { }
+static inline void table_contended(struct ctl_table_header *hdr) { }
+static inline void table_acquired(struct ctl_table_header *hdr) { }
+static inline void table_release(struct ctl_table_header *hdr) { }
+
+#else	/* CONFIG_PROVE_LOCKING */
+
+#  define lock_sysctl() ...
From: Andrew Morton
Date: Monday, March 30, 2009 - 3:26 pm

So I merged these two patches.  I designated them as to-be-merged via
Alexey, with a Cc:Peter (IOW: wakey wakey ;))


It doesn't make a lot of sense to have a different Author: address,
IMO.  So I rewrote the From: address to be ebiederm@aristanetworks.com,
OK?

But it would be better if you were to do this, so please do put an
explicit From: line at the top of your changelogs.

--

From: Eric W. Biederman
Date: Monday, March 30, 2009 - 3:53 pm

It is more sysctl than proc but whichever.  As long as people aren't

From my perspective it does.  It is much easier to use my personal
email setup for doing kernel work.  But I need to indicate I am
signing off as an employee of AristaNetworks.  Saying aristanetworks
in my signed-off-by seems a little more official.

Eric

--

From: Andrew Morton
Date: Monday, March 30, 2009 - 4:18 pm

On Mon, 30 Mar 2009 15:53:04 -0700

Yep.  But keeping all fs/proc/ changes in Alexey's tree keeps things

OK, fine, but to avoid confusion and to overtly announce this
preference, please do add the From: Line?

Plus that saves me from having to edit
"ebiederm@xmission.com (Eric W. Biederman)" to
"Eric W. Biederman <ebiederm@xmission.com>"

a zillion times ;)


--

From: Eric W. Biederman
Date: Monday, March 30, 2009 - 4:50 pm

Sounds like a plan.  You work on that.  I will work on undermining

Sounds like a plan.  In truth I have been deleting that line from my
patches anyway....  It just seemed silly to have it in there when
the real From matched.

Eric

--

From: Peter Zijlstra
Date: Tuesday, March 31, 2009 - 1:10 am

Ah, I was under the impression they weren't quite finished yet,
apparently I was mistaken.

Eric, didn't you need recursive locks fixed and such?

Let me have a look at them.
--

From: Eric W. Biederman
Date: Tuesday, March 31, 2009 - 1:47 am

In theory.  In practice I removed the one case that could legitimately
--

From: Peter Zijlstra
Date: Tuesday, March 31, 2009 - 1:17 am

It would be very good to extend this changelog with a more detailed
explanation of the deadlock in question.

Let me see if I got it right:

We're holding a lock, while waiting for the refcount to drop to 0.
Dropping that refcount is blocked on that lock.



This means every sysctl thingy gets the same class, is that
intended/desired?

--

From: Eric W. Biederman
Date: Tuesday, March 31, 2009 - 6:40 am

Exactly.

I must have written an explanation so many times that it got
lost when I wrote that commit message.

In particular the problem can be see with /proc/sys/net/ipv4/conf/*/forwarding.

The problem is that the handler for fowarding takes the rtnl_lock
with the reference count held.

Then we call unregister_sysctl_table under the rtnl_lock.

If the refcount is to be considered a lock.  sysctl_lock must be considered
the internals of that lock.  lockdep gets extremely confused otherwise.
Since the spinlock is static to this file I'm not especially worried

There is only one place we initialize it, and as far as I know really
only one place we take it.  Which is the definition of a lockdep
class as far as I know.

Eric

--

From: Peter Zijlstra
Date: Tuesday, March 31, 2009 - 8:35 am

Usually lock internal locks still get lockdep coverage. Let see if we
can find a way for this to be true even here. I suspect the below to

There you acquire the table while holding the spinlock, generating:
sysctl_lock -> table_lock, however you then release the sysctl_lock and
re-acquire it, generating table_lock -> sysctl_lock.


Indeed, just checking.
--

From: Eric W. Biederman
Date: Tuesday, March 31, 2009 - 3:44 pm

That is an artifact of sysctl_lock being used to implement
table_lock as best as I can tell.  The case you point
out I could probably play with where I claim the lock
is acquired and make it work.

__sysctl_head_next on the read side is trickier.
We come in with table_lock held for read.
We grab sysctl_lock.
We release table_lock (aka the reference count is decremented)
We grab table_lock on the next table (aka the reference count is incremented)
We release sysctl_lock

If we generate lockdep annotations for that it would seem to transition
through the states:
table_lock
table_lock -> sysctl_lock
sysctl_lock
sysctl_lock -> table_lock
table_lock

Short of saying table_lock is an implementation detail.  Used to
make certain operations atomic I do not see how to model this case.

Let me take a slightly simpler case and ask how that gets modeled.
Looking at rwsem.  Ok all of the annotations are outside of the
spin_lock.  So in some sense we are sloppy, and fib to lockdep
about when the we acquire/release a lock.  In another sense
we are simply respecting the abstraction.

I guess I can take a look and see if I can model things a slightly

The only difference I can possibly see is read side versus write side.
Or in my case refcount side versus wait side.

Eric

--

From: Andrew Morton
Date: Friday, April 10, 2009 - 2:18 am

`make headers_check':

/usr/src/25/usr/include/linux/sysctl.h:31: included file 'linux/lockdep.h' is not exported
make[2]: *** [/usr/src/25/usr/include/linux/.check] Error 1

It looks like we'll need version 2 on these patches..
--

Previous thread: [PATCH 15/77] irda: convert smsc driver to net_device_ops by Stephen Hemminger on Friday, March 20, 2009 - 10:35 pm. (3 messages)

Next thread: [PATCH] 3c59x: convert to net_device_ops by Jiri Pirko on Saturday, March 21, 2009 - 4:35 am. (3 messages)