Re: 2.6.34-rc1: rcu lockdep bug?

Previous thread: [PATCH net-next 00/13] RDS bugfixes and cleanups by Andy Grover on Thursday, March 11, 2010 - 4:49 pm. (4 messages)

Next thread: System Administrator by System Administrator on Friday, March 12, 2010 - 2:51 am. (1 message)
From: =?UTF-8?Q?Am=C3=A9rico_Wang?=
Date: Friday, March 12, 2010 - 12:56 am

(Cc'ing netdev)


Ok, after decoding the lockdep output, it looks like that
netif_receive_skb() should call rcu_read_lock_bh() instead of rcu_read_lock()?
But I don't know if all callers of netif_receive_skb() are in softirq context.

Paul, what do you think?

Thank you.
--

From: David Miller
Date: Friday, March 12, 2010 - 1:07 am

From: Américo Wang <xiyou.wangcong@gmail.com>

Normally, netif_receive_skb() is invoked from softirq context.

However, via netpoll it can be invoked essentially from any context.

But, when this happens, the networking receive path makes amends such
that this works fine.  That's what the netpoll_receive_skb() check in
netif_receive_skb() is for.  That check makes it bail out early if the
call to netif_receive_skb() is via a netpoll invocation.
--

From: =?UTF-8?Q?Am=C3=A9rico_Wang?=
Date: Friday, March 12, 2010 - 1:59 am

Oh, I see. This means we should call rcu_read_lock_bh() instead.
If Paul has no objections, I will send a patch for this.

Thanks much, David!
--

From: Eric Dumazet
Date: Friday, March 12, 2010 - 4:11 am

Nope, its calling rcu_read_lock() from interrupt context and it should
stay as is (we dont need to disable bh, this has a cpu cost)



--

From: =?UTF-8?Q?Am=C3=A9rico_Wang?=
Date: Friday, March 12, 2010 - 6:11 am

Oh, but lockdep complains about rcu_read_lock(), it said
rcu_read_lock() can't be used in softirq context.

Am I missing something?
--

From: Eric Dumazet
Date: Friday, March 12, 2010 - 6:37 am

Well, lockdep might be dumb, I dont know...

I suggest you read rcu_read_lock_bh kernel doc :

/**
 * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical
section
 *
 * This is equivalent of rcu_read_lock(), but to be used when updates
 * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
 * consider completion of a softirq handler to be a quiescent state,
 * a process in RCU read-side critical section must be protected by
 * disabling softirqs. Read-side critical sections in interrupt context
 * can use just rcu_read_lock().
 *
 */


Last sentence being perfect :

Read-side critical sections in interrupt context
can use just rcu_read_lock().



--

From: =?utf-8?Q?Am=C3=A9rico?= Wang
Date: Friday, March 12, 2010 - 10:33 pm

Yeah, right, then it is more likely to be a bug of rcu lockdep.
Paul is looking at it.

Thanks!
--

From: Paul E. McKenney
Date: Saturday, March 13, 2010 - 2:58 pm

Except that it seems to be working correctly for me...

							Thanx, Paul
--

From: =?utf-8?Q?Am=C3=A9rico?= Wang
Date: Sunday, March 14, 2010 - 6:08 pm

Hmm, then I am confused. The only possibility here is that this is
a lockdep bug...

Thanks for your help!
--

From: =?UTF-8?Q?Am=C3=A9rico_Wang?=
Date: Sunday, March 14, 2010 - 8:10 pm

I believe so...

Peter, this looks odd:

 kernel:  (usbfs_mutex){+.?...}, at: [<ffffffff8146419f>]
netif_receive_skb+0xe7/0x819

netif_receive_skb() never has a chance to take usbfs_mutex. How can this
comes out?
--

From: =?UTF-8?Q?Am=C3=A9rico_Wang?=
Date: Monday, March 15, 2010 - 2:39 am

Ok, I think I found what lockdep really complains about, it is that we took
spin_lock in netpoll_poll_lock() which is in hardirq-enabled environment,
later, we took another spin_lock with spin_lock_irqsave() in netpoll_rx(),
so lockdep thought we broke the locking rule.

I don't know why netpoll_rx() needs irq disabled, it looks like that no one
takes rx_lock in hardirq context. So can we use spin_lock(&rx_lock)
instead? Or am I missing something here? Eric? David?

Thanks!
--

From: Eric Dumazet
Date: Monday, March 15, 2010 - 3:04 am

I am a bit lost.

Could you give the complete picture, because I cannot find it in my
netdev archives.



--

From: =?UTF-8?Q?Am=C3=A9rico_Wang?=
Date: Monday, March 15, 2010 - 3:12 am

Sure, sorry for this.

Here is the whole thread:

http://lkml.org/lkml/2010/3/11/100
--

From: Eric Dumazet
Date: Monday, March 15, 2010 - 3:41 am

OK thanks

netpoll_rx() can be called from hard irqs (netif_rx()), so rx_lock
definitly needs irq care.

netpoll_poll_lock() does take a spinlock with irq enabled, but its not
rx_lock, its napi->poll_lock.

I dont see what could be the problem, is it reproductible with vanilla
kernel ?


--

From: =?UTF-8?Q?Am=C3=A9rico_Wang?=
Date: Tuesday, March 16, 2010 - 3:26 am

Yeah, I knew, but besides rcu locks, these two locks are the only
locks that can be taken in the call chain. I suspect lockdep got

No. I don't know why, my patch doesn't touch any function in the
call chain.

I already "fix" this in another way, so no need to worry this any more.

Thanks for your help!
--

From: Paul E. McKenney
Date: Friday, March 12, 2010 - 3:03 pm

Hmmm...  It is supposed to be OK to use rcu_read_lock() in pretty much
any context, even NMI.  I will take a look.

							Thanx, Paul
--

From: =?utf-8?Q?Am=C3=A9rico?= Wang
Date: Friday, March 12, 2010 - 10:31 pm

Thanks! Please let me know if you have new progress.
--

Previous thread: [PATCH net-next 00/13] RDS bugfixes and cleanups by Andy Grover on Thursday, March 11, 2010 - 4:49 pm. (4 messages)

Next thread: System Administrator by System Administrator on Friday, March 12, 2010 - 2:51 am. (1 message)