Re: [RFC][PATCH 7/7] lockdep: spin_lock_nest_lock()

Previous thread: [RFC][PATCH 4/7] lockdep: shrink held_lock structure by Peter Zijlstra on Monday, August 4, 2008 - 6:03 am. (3 messages)

Next thread: [RFC][PATCH 5/7] lockdep: map_acquire by Peter Zijlstra on Monday, August 4, 2008 - 6:03 am. (1 message)
From: Peter Zijlstra
Date: Monday, August 4, 2008 - 6:03 am

Expose the new lock protection lock.

This can be used to annotate places where we take multiple locks of the
same class and avoid deadlocks by always taking another (top-level) lock
first.

NOTE: we're still bound to the MAX_LOCK_DEPTH (48) limit.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/lockdep.h  |    2 ++
 include/linux/spinlock.h |    6 ++++++
 kernel/spinlock.c        |   11 +++++++++++
 3 files changed, 19 insertions(+)

Index: linux-2.6/include/linux/lockdep.h
===================================================================
--- linux-2.6.orig/include/linux/lockdep.h
+++ linux-2.6/include/linux/lockdep.h
@@ -410,8 +410,10 @@ static inline void print_irqtrace_events
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # ifdef CONFIG_PROVE_LOCKING
 #  define spin_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 2, NULL, i)
+#  define spin_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 2, n, i)
 # else
 #  define spin_acquire(l, s, t, i)		lock_acquire(l, s, t, 0, 1, NULL, i)
+#  define spin_acquire_nest(l, s, t, n, i)	lock_acquire(l, s, t, 0, 1, NULL, i)
 # endif
 # define spin_release(l, n, i)			lock_release(l, n, i)
 #else
Index: linux-2.6/include/linux/spinlock.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock.h
+++ linux-2.6/include/linux/spinlock.h
@@ -183,8 +183,14 @@ do {								\
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define spin_lock_nested(lock, subclass) _spin_lock_nested(lock, subclass)
+# define spin_lock_nest_lock(lock, nest_lock)				\
+	 do {								\
+		 typecheck(struct lockdep_map *, &nest_lock->dep_map);	\
+		 _spin_lock_nest_lock(lock, &nest_lock->dep_map);	\
+	 } while (0)
 #else
 # define spin_lock_nested(lock, subclass) _spin_lock(lock)
+# define spin_lock_nest_lock(lock, nest_lock) _spin_lock(lock)
 #endif
 
 #define write_lock(lock)		_write_lock(lock)
Index: ...
From: Roland Dreier
Date: Monday, August 4, 2008 - 7:07 am

> NOTE: we're still bound to the MAX_LOCK_DEPTH (48) limit.

A) It is probably a good idea to put this in a comment somewhere near
   where spin_lock_nest_lock() is declared.

B) It is probably a good idea to write that comment in such a way that
   dumb people like me understand what the limit is.  The sentence I
   quoted above is too telegraphic for me to get.  Is the point that no
   more than 48 spinlocks can be held at once, even if the inner locks
   are protected by some top level lock?  Or do you mean something else?

 - R.
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 7:19 am

No more than 48 locks (mutexes, rwlocks, spinlock, RCU, everything
covered by lockdep) held by any one code-path; including nested
interrupt contexts.

--

From: Roland Dreier
Date: Monday, August 4, 2008 - 7:26 am

> No more than 48 locks (mutexes, rwlocks, spinlock, RCU, everything
 > covered by lockdep) held by any one code-path; including nested
 > interrupt contexts.

Does that mean that something like the new mm_take_all_locks() operation
is going to explode if someone tries to use it with lockdep on?

 - R.
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 7:32 am

Gah - yes, clearly nobody tried this.. :-/

Just looking at the code it will not only run into this limit, but it
would warn about recursion on the second file/anon vma due to utter lack
of annotation.

Why are people still developing without lockdep?

/me sad

--

From: Dave Jones
Date: Monday, August 4, 2008 - 7:53 am

On Mon, Aug 04, 2008 at 04:32:12PM +0200, Peter Zijlstra wrote:
 > On Mon, 2008-08-04 at 07:26 -0700, Roland Dreier wrote:
 > > > No more than 48 locks (mutexes, rwlocks, spinlock, RCU, everything
 > >  > covered by lockdep) held by any one code-path; including nested
 > >  > interrupt contexts.
 > > 
 > > Does that mean that something like the new mm_take_all_locks() operation
 > > is going to explode if someone tries to use it with lockdep on?
 > 
 > Gah - yes, clearly nobody tried this.. :-/
 > 
 > Just looking at the code it will not only run into this limit, but it
 > would warn about recursion on the second file/anon vma due to utter lack
 > of annotation.
 > 
 > Why are people still developing without lockdep?

More puzzling, is why hasn't this triggered in the Fedora rawhide kernels,
which do have lockdep enabled.

	Dave

-- 
http://www.codemonkey.org.uk
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 7:56 am

My guess is that the kvm thing attaches before there are any vma, and
leaves after all the vma are gone. So it would never actually trigger.



--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 9:26 am

Yes, lockdep seems to be fine with kvm in kernel mainline in my
current and past testing. Andrew asked me to check this long ago.

vmx ~ # zgrep LOCKDEP /proc/config.gz 
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y

CPA self-test:
 4k 26112 large 1997 gb 0 x 2652[ffff880000000000-ffff8800bffff000]
 miss 262144
 4k 186880 large 1683 gb 0 x 43021[ffff880000000000-ffff8800bffff000]
 miss 262144
 4k 186880 large 1683 gb 0 x 43021[ffff880000000000-ffff8800bffff000]
 miss 262144
ok.
loaded kvm module ()
apic write: bad size=1 fee00030
Ignoring de-assert INIT to vcpu 0
Ignoring de-assert INIT to vcpu 0
kvm: emulating exchange as write
apic write: bad size=1 fee00030
Ignoring de-assert INIT to vcpu 0
Ignoring de-assert INIT to vcpu 0
apic write: bad size=1 fee00030
Ignoring de-assert INIT to vcpu 0
Ignoring de-assert INIT to vcpu 0
apic write: bad size=1 fee00030
Ignoring de-assert INIT to vcpu 0
apic write: bad size=1 fee00030
Ignoring de-assert INIT to vcpu 0
apic write: bad size=1 fee00030
Ignoring de-assert INIT to vcpu 0
Ignoring de-assert INIT to vcpu 0
Ignoring de-assert INIT to vcpu 0

I can't see lockdep errors in dmesg starting one more multiple VM in a
loop (all run on a quadcore).

GRU is likely the same.

The only real place where lockdep is unusable in my experience is
preempt-RT, it grinds it to an halt during boot on 8-way before
reaching the shell.
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 9:38 am

Dave Jones just handed me:


David Miller just did a patch that might fix that.

--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 10:27 am

Sorry, that wasn't the problem, but my current testing passed because
of another error... to test this report I rebuilt a kvm configured for
rhel not for mainline so it wasn't the right test...

When "long ago" I tested that this was working fine (actually when
Andrew asked me), I guess lockdep was working because the
implementation with the vmalloc array was slightly different,
otherwise I don't know. I'm fairly certain that it worked fine at some

I can reproduce this now yes after a 'make sync'.

Obviously this is a bug in lockdep that it trips over this otherwise
if lockdep was right the kernel should deadlock while this is just a
false positive and everything runs fine.

I assume it can't understand the spinlock address is different (I
think it uses the address as key only for static locks), so I wonder
if you could call print_deadlock_bug() from the failure path of the
spinlock to solve this?

Do things like double_rq_lock works just because rq1 and rq2 don't

Woow cool, after 11 months I lost any hope that lockdep could ever
work in that environment... Was it an actual bug or is this some way
to lower the complexity of the graph build?
--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 10:46 am

From: Andrea Arcangeli <andrea@qumranet.com>

Lockdep can't recognize if spinlocks are at a different address. So
trylock avoids lockdep to generate false positives. After lockdep will
be fixed this change can and should be reverted.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---


(btw it's not like I forgot to sync, but I sync against the wrong
source tree previously because it was in the bash history, it's a bit

In the meantime (as I doubt lockdep will get fixed any time soon) this
will workaround it.

diff -r 3469dce61df1 mm/mmap.c
--- a/mm/mmap.c	Tue Jul 29 20:01:28 2008 +0200
+++ b/mm/mmap.c	Mon Aug 04 19:41:53 2008 +0200
@@ -2279,8 +2279,13 @@ static void vm_lock_anon_vma(struct anon
 		/*
 		 * The LSB of head.next can't change from under us
 		 * because we hold the mm_all_locks_mutex.
+		 *
+		 * spin_lock would confuse lockdep who can't
+		 * differentiate between the 'mapping' always changing
+		 * address.
 		 */
-		spin_lock(&anon_vma->lock);
+		while (!spin_trylock(&anon_vma->lock))
+			cpu_relax();
 		/*
 		 * We can safely modify head.next after taking the
 		 * anon_vma->lock. If some other vma in this mm shares
@@ -2310,7 +2315,13 @@ static void vm_lock_mapping(struct addre
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		spin_lock(&mapping->i_mmap_lock);
+		/*
+		 * spin_lock would confuse lockdep who can't
+		 * differentiate between the 'mapping' always changing
+		 * address.
+		 */
+		while (!spin_trylock(&mapping->i_mmap_lock))
+			cpu_relax();
 	}
 }
 
--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 10:57 am

From: Andrea Arcangeli <andrea@qumranet.com>

Lockdep can't recognize if spinlocks are at a different address. So
using trylock in a loop is one way to avoid lockdep to generate false
positives. After lockdep will be fixed this change can and should be
reverted.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
---

Resubmit because I didn't update the subject and so I improved the
comments too. This is mostly to show it's a bug in lockdep and it can
be trivially worked around without having to fix lockdep for real, any
superior solution to this hack is more than welcome and recommended.

diff -r 3469dce61df1 mm/mmap.c
--- a/mm/mmap.c	Tue Jul 29 20:01:28 2008 +0200
+++ b/mm/mmap.c	Mon Aug 04 19:54:27 2008 +0200
@@ -2279,8 +2279,12 @@ static void vm_lock_anon_vma(struct anon
 		/*
 		 * The LSB of head.next can't change from under us
 		 * because we hold the mm_all_locks_mutex.
+		 *
+		 * spin_lock would confuse lockdep who doesn't notice
+		 * the 'anon_vma' always changing address.
 		 */
-		spin_lock(&anon_vma->lock);
+		while (!spin_trylock(&anon_vma->lock))
+			cpu_relax();
 		/*
 		 * We can safely modify head.next after taking the
 		 * anon_vma->lock. If some other vma in this mm shares
@@ -2310,7 +2314,12 @@ static void vm_lock_mapping(struct addre
 		 */
 		if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
 			BUG();
-		spin_lock(&mapping->i_mmap_lock);
+		/*
+		 * spin_lock would confuse lockdep who doesn't notice
+		 * the 'mapping' always changing address.
+		 */
+		while (!spin_trylock(&mapping->i_mmap_lock))
+			cpu_relax();
 	}
 }
 
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 11:48 am

NAK, come-on, you didn't even bother to look at the available

--

From: Roland Dreier
Date: Monday, August 4, 2008 - 11:56 am

> NAK, come-on, you didn't even bother to look at the available
 > annotations..

Agree that this trylock-in-a-loop is a horrible hack that should never
be merged.

However I wonder what you think the right way to annotate the
potentially unbounded number of locks taken in mm_take_all_locks() is?

 - R.
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 12:05 pm

Good question, one which I don't currently have an answer for - see my
other mail why getting rid of the MAX_LOCK_DEPTH limitation is hard.

I'll give it another serious go tomorrow (if nothing else preempts me).

--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 1:15 pm

Let's say when I hear prove-locking my instinct tells me to walk away
as fast as I can. I'll try to explain why I prefer the trylock loop
(which btw is the only one that will hide the lockdep false positives
here and I welcome you to fix it in another way, well another way that
comes to mind is to call __raw_spin_lock which I didn't do because I
don't like those lowlevel details in common code).

So about prove-locking:

1) in production is disabled so when I get bugreports I've to grab
   locking deadlock information as usual (sysrq+t/p or preferably
   lkcd/kdump)

2) while coding it's useless as well because I don't need this thing
   to debug and fix any deadlocks

3) this only finds bugs after the system is hung and I can fix it by
   other means then

Despite having used this for a while, it never happened to me that
this has been useful as single time, I only run into false positives
and systems grinding to an halt so not allowing debug kernels to run
which reduced even more the ability to debug the kernel. I'm not aware
of anybody else who was debugging that fixed bugs thanks to this
code. So in my experience this only has hurt me and I don't want it
ever enabled in anything I deal with.

If only this had a slight chance to find bugs that don't easily
trigger at runtime it'd be an entirely different matter.

To be clear: my criticism is only to prove-locking, for example I find
very useful that there are other things like spinlock debug options
that warns if you call GFP with GFP_KERNEL while you're in atomic
context (which unfortunately only works with preempt enabled for no
good reason because the atomic context is maintained by non-preempt
too but anyway). The spinlock sleep debug effectively find bugs that
would otherwise go unnoticed. Let's say I'm only interested in stuff
that shows bugs that would otherwise go unnoticed, everything else I
can debug it by myself without prove-locking printk in the logs.

So this can only help if you aren't capable of ...
From: Peter Zijlstra
Date: Monday, August 4, 2008 - 1:37 pm

You're so wrong it not even funny. It reports about deadlocks before
they happen. All it needs is to observe a lock order violation and it
will report it. In order for the dead-lock to happen, you need to
actually hit the violation concurrently with the normal order. 

IOW, lockdep can even spot deadlocks on a non-preempt single cpu setup
where they can never actually happen.

Furthermore, it does more than the simple lock deadlocks, it also takes
IRQ state into account. So it can tell about hard or soft irq recursion
deadlocks.

Having lockdep on while developing saves a lot of trouble - in fact it
_has_ caught many real bugs before they could be introduced to mainline,
ask Arjan who has supervised driver development.

Not only that, it caught plenty of real bugs in mainline as well as -rt.
These days it appears to not catch many because the tree is in such good
shape, but that is fully thanks to lockdep.

That is not to say it's perferct - lockdep certainly does have it
limitations. But your portrail is very in-accurate.


--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 2:09 pm

Now tell me how it helps to report them... It tells me the system has
crashed and where, it's not like I couldn't figure it out by myself
but just noticing nothing works and all cpus are spinning in some

The point is that this is a runtime evaluation of lock orders, if
runtime isn't the lucky one that reproduces the deadlock, it'll find

Ask the preempt-RT folks, advertising infinite-cpu-smp on UP. No need
of lockdep to reproduce deadlocks that couldn't be found without it in
UP. Infact lockdep only avoids to optimize away the spinlock, you
don't need preempt-RT to reproduce at zero-runtime-cost in a equally
easily debuggable way whatever the lockdep can reproduced on a UP
compile. You've only to run a SMP kernel on UP, big deal that lockdep

Ok this gets just a bit more interesting but my understanding is that
again this only "reports" stuff that crashes hard and is trivially

Are you saying those bugs weren't trivially debuggable anyway by
kernel-hackers with a simple sysrq+p/t or kgdb stack trace?

If yes, I can only imagine that nobody takes care of setting up
debugging options that actually are _zero_ runtime cost and don't

I've an hard time to believe you, the only thing I can agree is that
if you don't want to run smp kernel on UP (or even better preempt-RT
on UP) you can run lockdep to find recursion.

smp-preempt-RT on UP is a real feature that is useful debugging,
because it makes bugs visible that weren't, lockdep is useless as far
as I can tell.
--

From: Pekka Enberg
Date: Monday, August 4, 2008 - 2:14 pm

Hi Andrea,



Sometimes it's hard to actually trigger a deadlock condition during
development, especially if you're not developing on a big iron
machine.
--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 2:30 pm

My understand is that lockdep can't do static analysis of code, it's
not a checker. It can only evaluate the locks in the order the cpu
takes them. And if the CPU never takes them in the wrong order because
only your unlikely to happen slow path takes them in the wrong order
while the other cpus hold them in the opposite order, lockdep can't
say nothing at all, how could it? If it would say something when the
cpu doesn't deadlock a moment later, then it would be a false positive
99% of the time.

So now we learn lockdep also lead developers to think if their code
doesn't generate lockdep errors it will be safe in production, so
again more damage thanks to lockdep.

Lockdep is a printk that reports that you were incredibly lucky, that
you got the lock inversion during development even if there was only
once chance in a thousand, and that you can celebrate. But I know I
can figure it out by myself when the system has crashed without
something eating all cpu and polluting kernel common code to tell me.

In other words I imagine lockdep as something like this:

	char *ptr = NULL
	BUG_ON(!ptr); /* this is lockdep */
	*ptr = 100;

Luckily nobody dreams to add BUG_ON that checks for null pointer to
dereference, as there's the mmu for that.

The same way for the null pointer dereference, there are zerocost
means to debug all sort of kernel crashing deadlocks.

Now perhaps I misunderstood lockdep entirely but I doubt as this thing
is driven by calls made by the spin_lock functions, and it can't know
anything about the spin_lock that haven't been run by the cpu yet.
--

From: Andrew Morton
Date: Monday, August 4, 2008 - 2:41 pm

On Mon, 4 Aug 2008 23:30:18 +0200

You did ;)

At runtime lockdep will "learn" the lock ordering rules, building a
graph in memory.  If it ever sees the thus-learnt rules violated, it
will warn.


Simple example:


static DEFINE_MUTEX(a);
static DEFINE_MUTEX(b);

void f1(void)
{
	mutex_lock(a);
	mutex_lock(b);
}

void f2(void)
{
	mutex_lock(b);
	mutex_lock(a);
}

void doit(void)
{
	f1();
	f2();
}



lockdep will warn here about the ranking violation.  As soon as it
occurs.  Even though the system never deadlocked.

It's very very clever and powerful.
--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 3:12 pm

Yes I wasn't aware of the AB BA feature that doesn't require the
inversion to happen on both cpus at the same time, despite having
spent weeks when it was kernel crashing systems at boot (which shows
it's not immediately easy to understand how that thing works by
reading the code).

static DEFINE_MUTEX(a);
static DEFINE_MUTEX(b);
static DEFINE_MUTEX(c);

void f1(void)
{
	mutex_lock(c);
	mutex_lock(a);
	mutex_lock(b);
	mutex_unlock(c);
	mutex_unlock(a);
	mutex_unlock(b);
}

void f2(void)
{
	mutex_lock(c);
	mutex_lock(b);
	mutex_lock(a);
	mutex_unlock(c);
	mutex_unlock(b);
	mutex_unlock(a);
}

void f3(void)
{
	mutex_lock(a);
	mutex_lock(b);
	mutex_unlock(a);
	mutex_unlock(b);
}

void f4(void)
{
	mutex_lock(b);
	mutex_lock(a);
	mutex_unlock(b);
	mutex_unlock(a);
}

void doit(void)
{
	f1();
	f2();
	stop_machine(f3);
	stop_machine(f4);

I'm curious, does it also handle the above? I mean not a big deal to
refactor the code to shutdown lockdep warnings, but it doesn't look
that clever to me. It seems just clever the minimum enough to be
useful with its AB BA memory.

Now I see lockdep has a good point to exist, for the lock inversion
"memory" but this is by no mean a feature I can remotely like in
general given how inaccurate and prone for false positives it is. It
looks more a necessary evil to deal with, like I tried to do with my
patch (even before understanding it was more useful than what I
thought initially).
--

From: Arjan van de Ven
Date: Monday, August 4, 2008 - 2:42 pm

On Mon, 4 Aug 2008 23:30:18 +0200

yes lockdep will only complain WHEN you take them in the wrong order

But you claimed you would for sure be in a deadlock at that point which

this comment totally puzzles me... rather than calling you naive...
where was this said or even implied????




-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 3:30 pm

I already said I didn't know about that despite having spent a fair
amount of time trying to understand why lockdep crashes systems at
boot about an year ago. I admit I didn't understand much about it and
reducing its computation time didn't look feasible, perhaps my fault,

It was just a logical conclusion of the statement that lockdep will
warn of the following classes of locking bugs: "lock inversion
scenarios"... There's no warning anywhere that it might not find them
at all. Or point me to where it is warned you still have to read the
code and verify it yourself, or you still risk to AB BA in
production. If it was warned I wouldn't have mentioned it, people seem
to talk like if lockdep is a checker doing static analyses of all
paths when it can't.

Partly of what I said before is true, even if I didn't understand the
actual details of the AB BA memory it has by reading the code: the BA
may happen only when system is OOM etc... so even lockdep memory may
never find it. So my warning that code might AB BA deadlock even if
lockdep doesn't warn sounds fair enough and without it I'm still
afraid it can lead to developers think everything is ok with regard to
AB BA. In any case (even if everyone already understand lockdep better
than I did before this discussion), even if I'm wrong a sign of
warning that lockdep isn't enough, can't hurt.
--

From: Arjan van de Ven
Date: Monday, August 4, 2008 - 4:38 pm

On Tue, 5 Aug 2008 00:30:11 +0200

interesting; lockdep has been working for the last.. 2 1/2 years at
least, and I don't remember seeing bugreports against it from you that
would describe it as totally non-functional.


Oh well.. seems you're rather preoccupied about it; that's ok, you're
entitled to your opinion even if I don't agree with it ;-)

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 5:47 pm

I reported it to Peter. If you see David's email, I guess it can be
implied that I wasn't the only one aware that prove-locking made
certain systems non functional, I thought it was widespread knowledge
maybe not.

It's amazing that things seem to have improved on that side, it surely

So let me understand better: your opinion is that all of lockdep is
useful, not just the AB BA detection?

By reading the source again after 11 months to me it still looks
check_deadlock() only has knowledge of the current context. It loops
over the task struct checking all the locks of the current task!
Combine the great feature that check_deadlock provides, with crashing
at boot, and I hope that better explains my feeling about
lockdep-prove-locking.

This check_deadlock() thing is the real core of my dislike of
prove-locking! The check_noncircular part I totally agree it's useful
now that I see it works differently than check_deadlock (when I read
it last time I thought it worked the same as check_deadlock).

check_noncircular being useful doesn't automatically make
check_deadlock useful.

And incidentally it's exactly this check_deadlock part that is
trapping on my code and that is now requiring silly changes to the
common code (the ones I did) or to make the common code even more
complex (what Peter is planning I guess).
--

From: Arjan van de Ven
Date: Monday, August 4, 2008 - 2:27 pm

On Mon, 4 Aug 2008 23:09:54 +0200


I think you totally misunderstand things then.

Lockdep will report a problem if it *ever* sees a BA order after it has
seen a BA order. They don't have to happen at the same time. Or even
within hours of eachother.

They MIGHT happen... in a narrow time window, when you have a deadlock.
But lockdep will warn you about the order violation without actually
having to dealock... because the AB is likely to be done already most
of the time when the BA happens.
--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 2:54 pm

For me the false positive was generated by the recursive spinlock
feature, not the deadlock inversion feature. If you limit my comments
to the recursive spinlock feature perhaps you'll be more likely to

So this is about the lock inversion feature, I admit I didn't realize
it has memory that locks have been taken in AB order and spawns the
first time locks have been taken in BA order. But this will lead to
more false positives because there's nothing wrong to do AB BA if
there's C lock taken before them!

Perhaps enforcing lockdep all over the kernel is worth it just for
this AB BA thing in case it doesn't only spawn false positives, but
still I can't like something that is as inaccurate and prone for
errors as lockdep and spreading all over the place to try to reduce
the false positives it emits. I can't like that but that's just
me... others clearly love it.
--

From: David Miller
Date: Monday, August 4, 2008 - 2:57 pm

From: Andrea Arcangeli <andrea@qumranet.com>

It tells you about deadlock scenerios that you haven't even encoutered
yet.  It shows bugs for parallel code path sequences that have not
even occurred yet.

But I think you're beyond the point of being able to see why lockdep
is valuable.  So why don't we just move on.
--

From: Roland Dreier
Date: Monday, August 4, 2008 - 7:00 pm

> The point is that this is a runtime evaluation of lock orders, if
 > runtime isn't the lucky one that reproduces the deadlock, it'll find
 > nothing at all.

I think the point you miss is that lockdep can report a potential
deadlock, even if the deadlock does not actually occur.  For example
suppose there is an AB-BA deadlock somewhere.  For this to actually
trigger, we have to have one CPU running the AB code path at exactly the
moment another CPU runs the BA code path, with the right timing so one
CPU holds A and tries to grab B while the other CPU already holds B.

With lockdep, we just have to have the AB code path run once at any
point, and then the BA code path run at any later time (even days after
the AB code path has released all the locks).  And then we get a
warning dump that explains the exact potential deadlock.

 - R.
--

From: Andrea Arcangeli
Date: Monday, August 4, 2008 - 7:18 pm

Thanks a lot for the detailed explanation of check_noncircular. I
agree check_noncircular is surely a good argument not to get rid of
prove-locking as a whole. But check_noncircular is also a red-herring
in this context. It's not check_noncircular trapping here,
check_deadlock traps with false positives instead. The question is
what are those false positives buying us? To avoid a developer to
press sysrq+p or break on kgdb?

Let's focus on check_deadlock->print_deadlock_bug and somebody who's
not beyond the point please explain what print_deadlock_bug reports
that does not actually occur and why it's a good idea to change the
common code to accommodate for its false positives instead of getting
rid of it for good.
--

From: Roland Dreier
Date: Tuesday, August 5, 2008 - 5:02 am

> Let's focus on check_deadlock->print_deadlock_bug and somebody who's
 > not beyond the point please explain what print_deadlock_bug reports
 > that does not actually occur and why it's a good idea to change the
 > common code to accommodate for its false positives instead of getting
 > rid of it for good.

check_deadlock operates on classes of locks, so it can warn about
potential deadlocks, eg if we have

	foo(obj1, obj2)
	{
		lock(obj1);
		lock(obj2);
		...

then foo(obj, obj); is a deadlock but lockdep can warn about foo(obj,
different_obj) without triggering the deadlock in reality.  Of course
this leads to false positives, and we sometimes have to change correct
code to help lockdep, but usually such rewriting leads to simpler
clearer better locking anyway.

 - R.
--

From: Andrea Arcangeli
Date: Tuesday, August 5, 2008 - 5:20 am

It surely doesn't lead to simpler and clearer better locking in the
case we're discussing here, and I don't know of other cases where it
leads to better locking but feel free to make examples.
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 11:48 am

Lockdep doesn't look at individual lock instances, but instead works on
lock classes. The rq locks used to have a separate class each, but the
third patch in this series changes that.

The advantage of not working on lock instances is that the dependency
graph becomes much smaller, and it will report problems much earlier,
the down-side is that you will have to augment the information a little.

Normally locks are grouped in classes based on their init site; eg.
every instance initialized by a particular spin_lock_init() will belong
to the same class.

Besides this grouping, you can use things like spin_lock_nested(), which
can be used to annotate simple recursion (up to 8 sub-classes).

Then there is the possibility to explicitly create classes and assign
locks to classes using lockdep_set_class().

Either way - there is a limit of MAX_LOCK_DEPTH (48) locks that can be
tracked per task (including nesting interrupts).

Making that dynamic is rather challenging - as it would involve

He reduced the search complexity of the graph, look at the first patch
in this series.

--

From: David Miller
Date: Monday, August 4, 2008 - 2:32 pm

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Peter, your patches are just as effective at making all of my 64-cpu,
128-cpu, etc. machines usable with lockdep.
--

From: Jeremy Fitzhardinge
Date: Monday, August 4, 2008 - 11:06 am

OK, so the expected usage is:

	spin_lock(&outer_lock);
	/* take in any order */
	spin_lock_nest_lock(&inner[0], &outer_lock);
	spin_lock_nest_lock(&inner[2], &outer_lock);
	spin_lock_nest_lock(&inner[1], &outer_lock);
	...

?

And it's OK to

   1. take inner locks one at a time without holding the outer lock
   2. use plain spin_lock on inner locks when you're taking them one at
      a time, and
   3. release the outer lock before releasing the inner locks

but it's not OK to try to use different outer locks for a given inner lock.

Yes?


--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 11:54 am

Yes (there it no requirement that the outer lock is a spinlock_t, just



Only if you then release the inner locks in the reverse order you took
them - the nested release code (releasing a lock that is not on the top
of the stack) basically pops and pushes all the locks, the push will

It doesn't validate this part - as with most lockdep annotations you can
annotate away real deadlocks.


--

From: Jeremy Fitzhardinge
Date: Monday, August 4, 2008 - 12:26 pm

OK.  I don't actually need to do this, but I was asking for 
completeness.  But to clarify, you only need to do the reverse unlock if 
you do it after unlocking the outer lock?  If you're still holding the 

Hm, OK.


    J
--

From: Linus Torvalds
Date: Monday, August 4, 2008 - 12:31 pm

Release order should always be totally irrelevant, whether you hold outer 
locks or not. Only the order of _getting_ locks matter.

And yes, if there is an outer lock, even the order of getting locks is 
irrelevant, as long as anybody who gets more than one inner lock always 
holds the outer one.

		Linus
--

From: Peter Zijlstra
Date: Monday, August 4, 2008 - 12:39 pm

I agree, its just non-trivial to convince lockdep of this. I can give it
a go though.

--

From: Jeremy Fitzhardinge
Date: Monday, August 4, 2008 - 1:16 pm

Right.  But Peter said lockdep was fussy about it for some reason.  If 
the point of the exercise is to eliminate spurious warnings, it would be 
nice to avoid new ones in the process...

    J
--

From: Steven Rostedt
Date: Wednesday, October 8, 2008 - 8:27 am

But I need to disagree on a programming practice style.  Unlocking locks
in a non nested order is just bad programming practice. Unless there is
a good reason to do so, one should never release locks in a non reverse
order they were taken.

This can be a source of bugs, where people might notice an outer lock
being released and think the inner locks were too.

Lately the kernel has been going through a lot of clean ups that have
been making the kernel a much more maintainable beast. I feel we should
enforce the rule of unlocking order (again, unless there is a good
reason not to). Not for a technical reason, but just for a more
maintainable one.

-- Steve

--

From: Linus Torvalds
Date: Wednesday, October 8, 2008 - 8:43 am

No it is not.

Unlocking locks in non-nested order can sometimes be very much the rigth 
thing to do, and thinking otherwise is (a) naive and (b) can generate 
totally unnecessary and pointless bugs.

The thing is, sometimes you have to do it, and imposing totally made-up 
rules ("unlocks have to nest") just confuses everybody.

The FACT is, that unlocks do not have to nest cleanly. That's a rock solid 
*FACT*. The locking order matters, and the unlocking order does not.

And if you cannot accept that as a fact, and you then say "unlock order 
should matter just to keep things nice and clean", then you end up being 
screwed and/or confused when you can't hold to the unlock order.

There are many perfectly valid reasons not to unlock in reverse order. 
Don't create make-believe rules that break those reasons for no gain.

For example:
 - let's say that you have a singly-linked list of objects.
 - you need to lock all objects, do something, and then unlock all 
   objects.
 - the *only* sane way to do that is to just traverse the list twice.
 - that means that you will unlock the objects in the same order you 
   locked them, _not_ in reverse ("nested") order.
 - if you force a rule of "unlocks must be nested", then

	YOU ARE A F*CKING MORON.

It's that simple. Don't do made-up rules that have no technical reason for 
them. 

Lock ordering matters. Unlock ordering does not. It really is that simple. 
Don't confuse the issue by claiming anything else.

		Linus
--

From: Steven Rostedt
Date: Wednesday, October 8, 2008 - 9:03 am

Unfortunately, you cut out my comment that I stated "unless there is a 
good reason not to",

Yes, I totally agree that there are good reasons not to unlock in 
reverse order, and the example
you gave happens to be one of them.

I just find that seeing something like:

    lock(A);
    lock(B);

    [do something]

    unlock(A);
    unlock(B);

just seems to be sloppy.

I wont harp on this, it only came up in conversation in which someone 
pointed out your
post.

-- Steve

--

From: Linus Torvalds
Date: Wednesday, October 8, 2008 - 9:19 am

Well, teh thing is, as long as you think nesting is good, you end up being 

Of course you'll often get nesting almost by mistake.

For example, any normal locking that is hierarchical tends to nest 
naturally, especially if you end up using lots of small functions (which 
you should). Or simply due to error handling.

So in that sense, nesting may be a "natural" thing, but partly exactly 
_because_ it is fairly natural, we should not try to make it even more 
than that, because then when things don't nest (and it does happen), if 
the "things should nest" camp gets too strong, bugs happen.

They happen because somebody looks at the non-nesting case, and thinks 
it's sloppy, and tries to fix it. And in the process just complicates 
matters - and very likely introduces bugs.

And THAT is why people shouldn't think that locks "should" nest. Because 
they shouldn't. They often do, but we're better off making people very 
aware of the fact that 'often' isn't 'good design', it's just a matter of 
'it happens in practice' rather than 'it should be that way'.

I know for a fact that some people thought unlocking in non-nested order 
was a bug. And I believe that belief is a dangerous one.

				Linus
--

From: Steven Rostedt
Date: Wednesday, October 8, 2008 - 9:53 am

Ah, OK. You are fighting against nesting nazis, fair enough.

I have written a bit of code where nesting was not possible (similar to
your example, but I call those traversal locking not nesting). I just 
find that
the locks should be nested when the nesting is natural. Breaking the nesting
on natural nesting locks is a bug, IMHO.  But as you know, there are several
programmers out there that can not determine the difference between natural
nesting locks and non nesting locks.

By adding such a rule, those that can not tell the difference will be 
making a
lot of needless noise, hence, it is best not to make any such rule.

Lesson learned.  I'll now go back to debugging my code.

-- Steve

--

From: Nick Piggin
Date: Wednesday, October 8, 2008 - 8:52 am

An outer one might be more likely to be contended, so you might want
to release it asap.

Other times, you have lock A and lock B held (like scheduler rqs).
You can say unlock(A); unlock(B);

I don't really think it would make things more maintainable, FWIW.
--

From: Steven Rostedt
Date: Wednesday, October 8, 2008 - 10:18 am

I actually did come across one bug in my lifetime where the out of 
nesting order
of unlocks caused a bug. IIRC, it had to do (as Linus mentioned) with 
lots of
little functions that required locking. One of these functions was 
between the
out of order unlocking and was taking another inner lock.

I don't remember the exact details, but it was something that made me 
try to nest
locks and unlocks nicely when possible. And as Linus pointed out, there 
are several
cases where it just doesn't make sense to nest.

-- Steve

--

From: Peter Zijlstra
Date: Thursday, August 7, 2008 - 4:25 am

Please fold (or tell me to provide a new patch)


Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff -u linux-2.6/include/linux/spinlock.h linux-2.6/include/linux/spinlock.h
--- linux-2.6/include/linux/spinlock.h
+++ linux-2.6/include/linux/spinlock.h
@@ -185,8 +185,8 @@
 # define spin_lock_nested(lock, subclass) _spin_lock_nested(lock, subclass)
 # define spin_lock_nest_lock(lock, nest_lock)				\
 	 do {								\
-		 typecheck(struct lockdep_map *, &nest_lock->dep_map);	\
-		 _spin_lock_nest_lock(lock, &nest_lock->dep_map);	\
+		 typecheck(struct lockdep_map *, &(nest_lock)->dep_map);\
+		 _spin_lock_nest_lock(lock, &(nest_lock)->dep_map);	\
 	 } while (0)
 #else
 # define spin_lock_nested(lock, subclass) _spin_lock(lock)
diff -u linux-2.6/kernel/spinlock.c linux-2.6/kernel/spinlock.c
--- linux-2.6/kernel/spinlock.c
+++ linux-2.6/kernel/spinlock.c
@@ -323,7 +323,7 @@
 	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
 }
 
-EXPORT_SYMBOL(_spin_lock_nest_lock)
+EXPORT_SYMBOL(_spin_lock_nest_lock);
 
 #endif
 
--- linux-2.6.orig/include/linux/spinlock_api_smp.h
+++ linux-2.6/include/linux/spinlock_api_smp.h
@@ -22,6 +22,8 @@ int in_lock_functions(unsigned long addr
 void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
 void __lockfunc _spin_lock_nested(spinlock_t *lock, int subclass)
 							__acquires(lock);
+void __lockfunc _spin_lock_nest_lock(spinlock_t *lock, struct lockdep_map *map)
+							__acquires(lock);
 void __lockfunc _read_lock(rwlock_t *lock)		__acquires(lock);
 void __lockfunc _write_lock(rwlock_t *lock)		__acquires(lock);
 void __lockfunc _spin_lock_bh(spinlock_t *lock)		__acquires(lock);





--

Previous thread: [RFC][PATCH 4/7] lockdep: shrink held_lock structure by Peter Zijlstra on Monday, August 4, 2008 - 6:03 am. (3 messages)

Next thread: [RFC][PATCH 5/7] lockdep: map_acquire by Peter Zijlstra on Monday, August 4, 2008 - 6:03 am. (1 message)