Re: [PATCH] lockdep: lock_set_subclass - reset a held lock's subclass

Previous thread: [git pull] x86 fixes by Ingo Molnar on Thursday, July 31, 2008 - 2:42 pm. (1 message)

Next thread: [PATCH] x86: fdiv bug detection fix by Krzysztof Helt on Thursday, July 31, 2008 - 2:43 pm. (5 messages)
From: Ingo Molnar
Date: Thursday, July 31, 2008 - 2:43 pm

Linus,

Please pull the latest sched-fixes-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git sched-fixes-for-linus

Thanks,

	Ingo

------------------>
Hugh Dickins (1):
      sched: move sched_clock before first use

OGAWA Hirofumi (1):
      sched: fix SCHED_HRTICK dependency

Peter Zijlstra (2):
      sched: fix warning in hrtick_start_fair()
      lockdep: change scheduler annotation

roel kluin (1):
      sched: test runtime rather than period in global_rt_runtime()


 kernel/Kconfig.hz    |    2 +-
 kernel/sched.c       |   12 +++++-------
 kernel/sched_clock.c |   19 +++++++++----------
 kernel/sched_fair.c  |    2 +-
 4 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 382dd5a..94fabd5 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -55,4 +55,4 @@ config HZ
 	default 1000 if HZ_1000
 
 config SCHED_HRTICK
-	def_bool HIGH_RES_TIMERS && USE_GENERIC_SMP_HELPERS
+	def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
diff --git a/kernel/sched.c b/kernel/sched.c
index 0236958..f63af45 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -600,7 +600,6 @@ struct rq {
 	/* BKL stats */
 	unsigned int bkl_count;
 #endif
-	struct lock_class_key rq_lock_key;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -834,7 +833,7 @@ static inline u64 global_rt_period(void)
 
 static inline u64 global_rt_runtime(void)
 {
-	if (sysctl_sched_rt_period < 0)
+	if (sysctl_sched_rt_runtime < 0)
 		return RUNTIME_INF;
 
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
@@ -2759,10 +2758,10 @@ static void double_rq_lock(struct rq *rq1, struct rq *rq2)
 	} else {
 		if (rq1 < rq2) {
 			spin_lock(&rq1->lock);
-			spin_lock(&rq2->lock);
+			spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
 		} else {
 			spin_lock(&rq2->lock);
-			spin_lock(&rq1->lock);
+			spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
 		}
 ...
From: David Miller
Date: Thursday, July 31, 2008 - 3:04 pm

From: Ingo Molnar <mingo@elte.hu>

Ingo, Peter forgot to tell you but this lockdep change
blows up on my Niagara boxes in a way I haven't figured
out yet.

So could you please not merge that change in for the
time being?

Thanks.
--

From: Ingo Molnar
Date: Thursday, July 31, 2008 - 3:26 pm

are you sure it's this (trivial) annotation in sched.c that blows up, 
not your other (more invasive) changes to lockdep? This patch looked 

sure thing, we can defer it until your bug is understood more fully. 
I've created a second, sched-fixes-for-linus-2 tree for Linus below - 
Linus, if you have not pulled the previous tree yet could you please 
pull this one?

	Ingo

------------------->
Linus,

Please pull the latest sched-fixes-for-linus-2 git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git sched-fixes-for-linus-2

 Thanks,

	Ingo

------------------>
Hugh Dickins (1):
      sched: move sched_clock before first use

OGAWA Hirofumi (1):
      sched: fix SCHED_HRTICK dependency

Peter Zijlstra (1):
      sched: fix warning in hrtick_start_fair()

roel kluin (1):
      sched: test runtime rather than period in global_rt_runtime()


 kernel/Kconfig.hz    |    2 +-
 kernel/sched.c       |    2 +-
 kernel/sched_clock.c |   19 +++++++++----------
 kernel/sched_fair.c  |    2 +-
 4 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 382dd5a..94fabd5 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -55,4 +55,4 @@ config HZ
 	default 1000 if HZ_1000
 
 config SCHED_HRTICK
-	def_bool HIGH_RES_TIMERS && USE_GENERIC_SMP_HELPERS
+	def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
diff --git a/kernel/sched.c b/kernel/sched.c
index 0236958..0d1717b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -834,7 +834,7 @@ static inline u64 global_rt_period(void)
 
 static inline u64 global_rt_runtime(void)
 {
-	if (sysctl_sched_rt_period < 0)
+	if (sysctl_sched_rt_runtime < 0)
 		return RUNTIME_INF;
 
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 22ed55d..5a2dc7d 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -32,6 +32,15 @@
 #include <linux/ktime.h>
 ...
From: David Miller
Date: Thursday, July 31, 2008 - 3:55 pm

From: Ingo Molnar <mingo@elte.hu>

I am absolutely sure, I spent the whole night yesterday trying to
debug this.
--

From: David Miller
Date: Friday, August 1, 2008 - 1:11 am

From: David Miller <davem@davemloft.net>

Followup.  I lost two days of my life debugging this because seemingly
nobody can friggin' agree on what to do about the "printk() wakeup
issue".  Thanks!

Can we fix this now, please?

The problem was that Peter's patch triggers a print_deadlock_bug()
in lockdep.c on the runqueue locks.

But those printk()'s quickly want to do a wakeup, which wants to
take the runqueue lock this thread already holds.  So I would only
get the first line of the lockdep debugging followed by a complete
hang.

Doing these wakeups in such a BUG message is unwise.  Please can
we apply something like the following and save other developers
countless wasted hours of their time?

Thanks :-)

--------------------

debug_locks: Set oops_in_progress if we will log messages.

Otherwise lock debugging messages on runqueue locks can deadlock the
system due to the wakeups performed by printk().

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/lib/debug_locks.c b/lib/debug_locks.c
index 0ef01d1..0218b46 100644
--- a/lib/debug_locks.c
+++ b/lib/debug_locks.c
@@ -8,6 +8,7 @@
  *
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  */
+#include <linux/kernel.h>
 #include <linux/rwsem.h>
 #include <linux/mutex.h>
 #include <linux/module.h>
@@ -37,6 +38,7 @@ int debug_locks_off(void)
 {
 	if (xchg(&debug_locks, 0)) {
 		if (!debug_locks_silent) {
+			oops_in_progress = 1;
 			console_verbose();
 			return 1;
 		}
--

From: Ingo Molnar
Date: Friday, August 1, 2008 - 2:01 am

ugh. In the context of lockdep you are the first one to trigger this bug 
- thanks for the fix!

We had a few other incidents of printks generated by bugs within the 
scheduler code causing lockups (due to the wakeup) - and Steve Rostedt 
sent a more generic solution for that: to first trylock the runqueue 
lock in that case instead of doing an unconditional wakeup.

The patch made it to linux-next but Andrew NAK-ed that patch because it 
caused other problems: it made printk wakeups conceptually less 
reliable. (a spurious lock taken from another CPU could prevent a printk 

applied to tip/core/locking.

I'm wondering, does this mean that Peter's:

    lockdep: change scheduler annotation

is still not good enough yet?

	Ingo
--

From: David Miller
Date: Friday, August 1, 2008 - 2:13 am

From: Ingo Molnar <mingo@elte.hu>

It doesn't work, the problem is this:

[ 7819.764377] =============================================
[ 7819.775495] [ INFO: possible recursive locking detected ]
[ 7819.780969] 2.6.27-rc1-lockdep #12
[ 7819.786256] ---------------------------------------------
[ 7819.791548] cc1/29466 is trying to acquire lock:
[ 7819.796685]  (&rq->lock/1){.+..}, at: [<0000000000455548>] double_lock_balance+0x78/0x90
[ 7819.806849] 
[ 7819.806859] but task is already holding lock:
[ 7819.815871]  (&rq->lock/1){.+..}, at: [<0000000000455538>] double_lock_balance+0x68/0x90
[ 7819.825448] 
[ 7819.825456] other info that might help us debug this:
[ 7819.834373] 1 lock held by cc1/29466:
[ 7819.838709]  #0:  (&rq->lock/1){.+..}, at: [<0000000000455538>] double_lock_balance+0x68/0x90
[ 7819.847818] 
[ 7819.847827] stack backtrace:
[ 7819.855832] Call Trace:
[ 7819.859696]  [000000000047ecac] __lock_acquire+0xbdc/0xf98
[ 7819.863669]  [000000000047f7c4] lock_acquire+0x64/0x7c
[ 7819.867613]  [00000000006d67d4] _spin_lock_nested+0x1c/0x58
[ 7819.871612]  [0000000000455548] double_lock_balance+0x78/0x90
[ 7819.875600]  [00000000006d3afc] schedule+0x324/0x8d0
[ 7819.879536]  [00000000004c089c] pipe_wait+0x58/0x8c
[ 7819.883462]  [00000000004c0c94] pipe_write+0x3c4/0x480
[ 7819.887597]  [00000000004ba150] do_sync_write+0x80/0xd0
[ 7819.891460]  [00000000004ba92c] vfs_write+0x70/0x10c
[ 7819.895264]  [00000000004bad18] sys_write+0x2c/0x60
[ 7819.899073]  [0000000000406294] linux_sparc_syscall32+0x34/0x40
--

From: Peter Zijlstra
Date: Friday, August 1, 2008 - 4:08 am

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: lockdep: lock_set_subclass - reset a held lock's subclass

this can be used to reset a held lock's subclass, for arbitrary-depth
iterated data structures such as trees or lists which have per-node
locks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/lockdep.h |    4 ++
 kernel/lockdep.c        |   69 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+)

Index: linux-2.6/include/linux/lockdep.h
===================================================================
--- linux-2.6.orig/include/linux/lockdep.h
+++ linux-2.6/include/linux/lockdep.h
@@ -300,6 +300,9 @@ extern void lock_acquire(struct lockdep_
 extern void lock_release(struct lockdep_map *lock, int nested,
 			 unsigned long ip);
 
+extern void lock_set_subclass(struct lockdep_map *lock, unsigned int subclass,
+			      unsigned long ip);
+
 # define INIT_LOCKDEP				.lockdep_recursion = 0,
 
 #define lockdep_depth(tsk)	(debug_locks ? (tsk)->lockdep_depth : 0)
@@ -316,6 +319,7 @@ static inline void lockdep_on(void)
 
 # define lock_acquire(l, s, t, r, c, i)		do { } while (0)
 # define lock_release(l, n, i)			do { } while (0)
+# define lock_set_subclass(l, s, i)		do { } while (0)
 # define lockdep_init()				do { } while (0)
 # define lockdep_info()				do { } while (0)
 # define lockdep_init_map(lock, name, key, sub)	do { (void)(key); } while (0)
Index: linux-2.6/kernel/lockdep.c
===================================================================
--- linux-2.6.orig/kernel/lockdep.c
+++ linux-2.6/kernel/lockdep.c
@@ -2658,6 +2658,55 @@ static int check_unlock(struct task_stru
 	return 1;
 }
 
+static int
+__lock_set_subclass(struct lockdep_map *lock,
+		    unsigned int subclass, unsigned long ip)
+{
+	struct task_struct *curr = current;
+	struct held_lock *hlock, *prev_hlock;
+	struct lock_class *class;
+	unsigned int depth;
+	int i;
+
+	depth = ...
From: Jeremy Fitzhardinge
Date: Friday, August 1, 2008 - 11:06 am

Hey, can I use this to suppress the spurious lockdep warnings I get when 
I try to hold more than one pte lock at once?

    J
--

From: Linus Torvalds
Date: Friday, August 1, 2008 - 11:14 am

So how sure are you that they are spurious?

			Linus
--

From: Jeremy Fitzhardinge
Date: Friday, August 1, 2008 - 12:10 pm

I have a function traversing a pagetable in vaddr order (low to high), 
taking pte locks as it builds up batches of pte page updates.  When the 
batch is issued, it releases all the locks, and won't end up holding 
more than ~16 at a time.

So, I think this is OK.  There are no internal lock ordering issues, and 
I don't think there'll be any bad interactions from someone trying to 
take pte locks for two separate pagetables.  I don't think there's 
anyone else trying to take more than one pte lock at once, but if there 
were "lock low vaddr then high" seems like a reasonable locking rule (or 
more precisely "lowest" to deal with the case of a pte page being 
aliased at multiple vaddrs).

Lockdep complains because the split pte locks are all in the same lock 
class, so it reports it as taking a spinlock recursively.  I'd reported 
this to PeterZ before, and he responded with "uh, I'll think about 
it...", which I took to mean "...when someone else has a problem".  So 
I'm wondering if this is that time...

(This is all with split pte locks, obviously.)

    J
--

From: Linus Torvalds
Date: Friday, August 1, 2008 - 12:24 pm

Hmm. 

With hashed locks, that is _not_ safe in general. Now, it may well be safe 
in your case, so I'm not going to say you have a bug, but even if you do 
them in vaddr order, the thing is, we don't guarantee that the _locks_ are 
ordered in virtual address order.

Right now, I think the pte locks are either a single one per mm (in which 
case I assume you don't take any lock at all), or it's a lock that is 
embedded in the pmd page iirc.

What if you have pmd sharing through some shared area being mapped at two 
different processes at different addresses? Yeah, I don't think we share 
pmd's at all (except if you use hugetables and for the kernel), but it's 
one of those things where subtle changes in how the pte lock allocation 
could cause problems.

Eg, I could easily see somebody doing the pte lock as a hash over not just 
the address, but the "struct mm" pointer too. At which point different 
parts of the address space might even share the PTE lock, and you'd get 
deadlocks even without any ABBA behavior, just because the pte lock might 
be A B C A or something inside the same process.

			Linus
--

From: Jeremy Fitzhardinge
Date: Friday, August 1, 2008 - 1:08 pm

Yes, it's current simplicity definitely relies on a simple relationship 

Actually the locks in the pte page, so it's already fairly 
fine-grained.  If it were made coarser - at the pmd level - then I'd 
simply move my lock-taking out a level in the pagetable traversal.

But making it finer - by hashing individual pte entries to their own 
locks - would work badly for me.  I'm taking the lock specifically to 
protect against updates to any pte within the pte page, so I'd have to 
manually take all the locks that the ptes could possibly hash to.  
Presumably in that eventuality we could define correct order for taking 
the hashed pte locks, independent of how the ptes actually map to them 

In this particular case it isn't an issue.  The traversal is performing 
first/last use init/deinit, so any shared parts of the pagetable can be 
skipped with no action because they're already done.  Presumably there'd 
be separate lifetime management for whatever parts of the pagetable can 
be shared, so I'd need to hook in there, rather than the wholesale 

Yep.  That would be awkward.  A function which says "here's a pte page, 
return the set of all pte locks these ptes map to in correct locking 
order" would be useful in that case.  Ugh, but still unpleasant.

    J
--

From: Hugh Dickins
Date: Friday, August 1, 2008 - 12:59 pm

Please check the spin_lock_nested() in move_ptes() in mm/mremap.c.

If you have down_write(&mm->mmap_sem) then you should be safe,
but may need to do something to placate lockdep.  If you don't
have down_write(&mm->mmap_sem), then I think you're in trouble?

Not a big deal, the move_ptes() locking can be adjusted to suit
your rule, it was just easier to do it the way it is at the time.

Hugh
--

From: Jeremy Fitzhardinge
Date: Friday, August 1, 2008 - 1:22 pm

Ah, yes, I did look at that.  I think it isn't an issue, because my code 
is called from dup_mmap(), activate_mm() or exit_mmap().

dup_mmap() already holds mmap_sem.

activate_mm() in exec doesn't hold the sem, but I don't think it's 
possible for anyone to be racing against it.
activate_mm() in unshare doesn't seem to get used.

exit_mmap() gets called when there are no other users, so we'd better 
not be racing...

    J
--

From: Hugh Dickins
Date: Friday, August 1, 2008 - 1:33 pm

All safe against mremap, agreed.

Hugh
--

From: David Miller
Date: Friday, August 1, 2008 - 4:20 pm

From: Hugh Dickins <hugh@veritas.com>

It won't work because spin_lock_nested() is limited to a depth
of 8 and he aparently needs 16.

Taking more than a few locks of the same class at once is bad
news and it's better to find an alternative method.

--

From: Linus Torvalds
Date: Friday, August 1, 2008 - 4:26 pm

It's not always wrong.

If you can guarantee that anybody that takes more than one lock of a 
particular class will always take a single top-level lock _first_, then 
that's all good. You can obviously screw up and take the same lock _twice_ 
(which will deadlock), but at least you cannot get into ABBA situations.

So maybe the right thing to do is to just teach lockdep about "lock 
protection locks". That would have solved the multi-queue issues for 
networking too - all the actual network drivers would still have taken 
just their single queue lock, but the one case that needs to take all of 
them would have taken a separate top-level lock first.

Never mind that the multi-queue locks were always taken in the same order: 
it's never wrong to just have some top-level serialization, and anybody 
who needs to take <n> locks might as well do <n+1>, because they sure as 
hell aren't going to be on _any_ fastpaths.

So the simplest solution really sounds like just teaching lockdep about 
that one special case. It's not "nesting" exactly, although it's obviously 
related to it.

			Linus
--

From: Peter Zijlstra
Date: Friday, August 1, 2008 - 1:49 pm

No, you still cannot hold more than 48 locks at any one time, and this
somewhat horrible annotation only works if you don't generate double
classes in the same chain.

What it can do is:

  L1
   L2
    U1
     L3
      U2
       etc..

What you do is:

  spin_lock(L1)
   spin_lock_nested(L2, 1)
    spin_unlock(L1);
    lockdep_set_subclass(L2, 0); // because L1, who was 0 isn't there anymore
     spin_lock_nested(L3, 1);
      spin_unlock(L2);
      lockdep_set_subclass(L2, 0); // blah blah..
       etc...



--

From: Peter Zijlstra
Date: Friday, August 1, 2008 - 4:08 am

Ingo,

I cannot seem to reproduce this on my quad (I'll let it run a fwe hours
just to make sure), could you meanwhile try on a somewhat larger
machine?

The thing that triggered it for David is a regular kernel build with
large concurrency.

(don't forget to 'fix' printk or apply David's oops_in_progress patch)

---
Subject: lockdep: re-annotate scheduler runqueues

Instead of using a per-rq lock class, use the regular nesting operations.

However, take extra care with double_lock_balance() as it can release the
already held rq->lock (and therefore change its nesting class).

So what can happen is:

 spin_lock(rq->lock);	// this rq subclass 0

 double_lock_balance(rq, other_rq);
   // release rq
   // acquire other_rq->lock subclass 0
   // acquire rq->lock subclass 1

 spin_unlock(other_rq->lock);

leaving you with rq->lock in subclass 1

So a subsequent double_lock_balance() call can try to nest a subclass 1
lock while already holding a subclass 1 lock.

Fix this by introducing double_unlock_balance() which releases the other
rq's lock, but also re-sets the subclass for this rq's lock to 0.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c    |   21 +++++++++++++--------
 kernel/sched_rt.c |    8 +++++---
 2 files changed, 18 insertions(+), 11 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -600,7 +600,6 @@ struct rq {
 	/* BKL stats */
 	unsigned int bkl_count;
 #endif
-	struct lock_class_key rq_lock_key;
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -2759,10 +2758,10 @@ static void double_rq_lock(struct rq *rq
 	} else {
 		if (rq1 < rq2) {
 			spin_lock(&rq1->lock);
-			spin_lock(&rq2->lock);
+			spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
 		} else {
 			spin_lock(&rq2->lock);
-			spin_lock(&rq1->lock);
+			spin_lock_nested(&rq1->lock, ...
From: Linus Torvalds
Date: Friday, August 1, 2008 - 10:04 am

Ingo, I hope you haven't gone off for a vacation yet, because right now 
I'm not sure any of the scheduler trees are safe to pull. So pls advise.

		Linus
--

From: David Miller
Date: Saturday, August 2, 2008 - 1:34 am

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Peter, I've been beating up these two patches here for a while
on my 64-cpu box and it hasn't budged yet.

I'll let you know if anything happens, but it looks good so far.
--

Previous thread: [git pull] x86 fixes by Ingo Molnar on Thursday, July 31, 2008 - 2:42 pm. (1 message)

Next thread: [PATCH] x86: fdiv bug detection fix by Krzysztof Helt on Thursday, July 31, 2008 - 2:43 pm. (5 messages)