Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled

Previous thread: [net-2.6 PATCH 0/7] qlge: Fixes for bonding, RSS and reset. by Ron Mercer on Thursday, October 8, 2009 - 12:54 pm. (9 messages)

Next thread: pull request: wireless-2.6 2009-10-08 by John W. Linville on Thursday, October 8, 2009 - 3:34 pm. (10 messages)
From: Anirban Sinha
Date: Thursday, October 8, 2009 - 3:19 pm

Hi All;

We are noticing a kernel OOPS on 2.6.26 kernel when we issue the command
"clear ip bgp <bgp-peer-ip>" on Quagga BGP routing software. Also we are
noticing a very similar
crash when we set and reset the local TCP MD5 password. I am attaching the
backtrace. I wonder if this has been seen by any other person and whether it
is a known issue? We are running our kernel on the mips hardware.

Thanks,

Ani


# [23:10:35.108808] Kernel bug detected[#1]:
[23:10:35.112527] Cpu 0
[23:10:35.114676] $ 0   : 0000000000000000 0000000014001fe0
0000000000000066 0000000000000004
[23:10:35.122845] $ 4   : ffffffff80516c10 0000000014001fe0
ffffffff8050c010 0000000000000004
[23:10:35.131015] $ 8   : 0000000000000000 0000000000000041
ffffffff805142e8 0000000000000001
[23:10:35.139184] $12   : ffffffff80600000 ffffffff805f0000
0000000000000064 0000000000000190
[23:10:35.147354] $16   : 0000000000000102 ffffffff803afdf0
ffffffff80539040 ffffffff80600780
[23:10:35.155526] $20   : ffffffff80540000 0000000000200200
ffffffff804c0000 000000000000000a
[23:10:35.163695] $24   : a3d70a3d70a3d70b 8000000000000003
[23:10:35.171865] $28   : ffffffff8050c000 ffffffff8050fd90
9000000010030000 ffffffff801487a8
[23:10:35.180035] Hi    : 0000000000000000
[23:10:35.183819] Lo    : 0000000000000000
[23:10:35.187603] epc   : ffffffff801487a8 run_timer_softirq+0x198/0x258
Tainted: P
[23:10:35.196032] ra    : ffffffff801487a8 run_timer_softirq+0x198/0x258
[23:10:35.202395] Status: 14001fe3    KX SX UX KERNEL EXL IE
[23:10:35.207814] Cause : 00808024
[23:10:35.210911] PrId  : 01041100 (SiByte SB1A)
[23:10:35.215209] Modules linked in: xt_state ipt_REJECT iptable_filter
nf_conntrack_ftp ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
ip_tables ebtable_filter ebtables bridge llc zeug_ipmcdrv(P) irqdisp(P)
zvirt(P) zeugmod(P) softdog
[23:10:35.236024] Process swapper (pid: 0, threadinfo=ffffffff8050c000,
task=ffffffff805142e8, tls=0000000000000000)
[23:10:35.246169] Stack : ffffffff8050fd90 ...
From: David Miller
Date: Thursday, October 8, 2009 - 3:54 pm

From: Anirban Sinha <asinha@zeugmasystems.com>

You will need to update your kernel, there have been many TCP
MD5 bug fixes since 2.6.26
--

From: Anirban Sinha
Date: Thursday, October 8, 2009 - 4:33 pm

Hi:


Sigh ... wish that were that easy! Anyway, as far as I could, I have tried to
apply the upstream patches that seemed relevant to TCP MD5SUM. Am I missing
some other patches? It will be great if someone can point me to any patch that
I might be missing related to the TCP MD5SUM support.

I applied the following patches:

(a)

author	Adam Langley <agl@imperialviolet.org>
	Sat, 19 Jul 2008 07:01:42 +0000 (00:01 -0700)
committer	David S. Miller <davem@davemloft.net>
	Sat, 19 Jul 2008 07:01:42 +0000 (00:01 -0700)
commit	49a72dfb8814c2d65bd9f8c9c6daf6395a1ec58d
tree	38804d609f21503573bbdd8bb9af38df99275ff5	tree | snapshot
parent	845525a642c1c9e1335c33a274d4273906ee58eb	commit | diff
tcp: Fix MD5 signatures for non-linear skbs

Currently, the MD5 code assumes that the SKBs are linear and, in the case
that they aren't, happily goes off and hashes off the end of the SKB and
into random memory.

Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy
Polyakov. Also includes a couple of missed route_caps from Stephen's patch
in [2].

[1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2
[2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2

Signed-off-by: Adam Langley <agl@imperialviolet.org>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

(b)

author	YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
	 Fri, 18 Apr 2008 03:45:16 +0000 (12:45 +0900)
committer	YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
	 Wed, 11 Jun 2008 18:46:30 +0000 (03:46 +0900)
commit	9501f9722922f2e80e1f9dc6682311d65c2b5690
tree	ca8195e04ea63e8273801030ce26527fe5a8a7c7	tree | snapshot
parent	8d26d76dd4a4c87ef037a44a42a0608ffc730199	commit | diff

tcp md5sig: Let the caller pass appropriate key for
tcp_v{4,6}_do_calc_md5_hash().

As we do for other socket/timewait-socket specific parameters,
let the callers pass appropriate arguments to
tcp_v{4,6}_do_calc_md5_hash().

Signed-off-by: YOSHIFUJI Hideaki ...
From: David Miller
Date: Thursday, October 8, 2009 - 5:57 pm

From: Anirban Sinha <asinha@zeugmasystems.com>

Contact your vendor for support :-)

--

From: Anirban Sinha
Date: Saturday, October 17, 2009 - 10:57 am

but we *are* the vendors for our distribution and even though I may not be
a networking guru, I have little bit knowledge in working my way through the
kernel code. I have traced down the cause of the BUG() and when I did a git
pull against Linus' tree, I see the same issue in teh git tip as well.

The BUG() is triggered from kernel/timer.c, line 1037 within the
function __run_timers(). I am reporting these line numbers from the latest git
tip.

What happens is that before and after the callback, the code grabs the preempt
count and catches unbalanced preempt_enable() and preempt_disable() calls from
within the callback function.  In this case, the callback function is
inet_twdr_hangman() as can be seen from this instrumented log:

[02:15:15.941981] Kernel panic - not syncing: <3>huh, entered ffffffff803fbd60
(inet_twdr_hangman+0x0/0xe0)with preempt_count 00000102, exited with 00000101?

Clearly there is an extra unbalanced preempt_enable() somewhere within the
callback function.

When I looked at the function, I see that in net/ipv4/inet_timewait_sock.c
line 215, the function calls schedule_work(). schedule_work() calls
queue_work() which in turn calls put_cpu() that ultimately does a
preempt_enable(). It is this unbalanced preempt_enable() that decriments the
preempt_count by one as can be seen from the above trace.

I suspect that workqueue related operatios are illegal from a timer callback
function. In that case, the above mentioned callback function needs to be
fixed.

Yes, I can't explain why others are also not seeing the same bug crash. We
don't have the luxury to pull in the latest and greatest kernel from the git
tree everytime an update is made and try it out. So I am unable to repo the
issue with the latest kernel. But if that means that this issue should be
ignored, then that is fine by me. We will fix our private kenrel with an
appropriate patch as we continue to investigate more.

Cheers,

Ani

--

From: Anirban Sinha
Date: Saturday, October 17, 2009 - 7:35 pm

and btw, this is the crash (on mips) we are talking about:

# [23:10:35.108808] Kernel bug detected[#1]:
[23:10:35.112527] Cpu 0
[23:10:35.114676] $ 0   : 0000000000000000 0000000014001fe0
0000000000000066 0000000000000004
[23:10:35.122845] $ 4   : ffffffff80516c10 0000000014001fe0
ffffffff8050c010 0000000000000004
[23:10:35.131015] $ 8   : 0000000000000000 0000000000000041
ffffffff805142e8 0000000000000001
[23:10:35.139184] $12   : ffffffff80600000 ffffffff805f0000
0000000000000064 0000000000000190
[23:10:35.147354] $16   : 0000000000000102 ffffffff803afdf0
ffffffff80539040 ffffffff80600780
[23:10:35.155526] $20   : ffffffff80540000 0000000000200200
ffffffff804c0000 000000000000000a
[23:10:35.163695] $24   : a3d70a3d70a3d70b 8000000000000003
[23:10:35.171865] $28   : ffffffff8050c000 ffffffff8050fd90
9000000010030000 ffffffff801487a8
[23:10:35.180035] Hi    : 0000000000000000
[23:10:35.183819] Lo    : 0000000000000000
[23:10:35.187603] epc   : ffffffff801487a8 run_timer_softirq+0x198/0x258
Tainted: P
[23:10:35.196032] ra    : ffffffff801487a8 run_timer_softirq+0x198/0x258
[23:10:35.202395] Status: 14001fe3    KX SX UX KERNEL EXL IE
[23:10:35.207814] Cause : 00808024
[23:10:35.210911] PrId  : 01041100 (SiByte SB1A)
[23:10:35.215209] Modules linked in: xt_state ipt_REJECT iptable_filter
nf_conntrack_ftp ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
ip_tables ebtable_filter ebtables bridge llc zeug_ipmcdrv(P) irqdisp(P)
zvirt(P) zeugmod(P) softdog
[23:10:35.236024] Process swapper (pid: 0, threadinfo=ffffffff8050c000,
task=ffffffff805142e8, tls=0000000000000000)
[23:10:35.246169] Stack : ffffffff8050fd90 ffffffff8050fd90
0000000014001fe0 ffffffff805ff3e0
[23:10:35.254166]         ffffffff806003c4 0000000000000001
ffffffff8053f650 ffffffff805706d0
[23:10:35.262337]         ffffffff80572020 ffffffff80142280
ffffffff806003c0 0000000000000000
[23:10:35.270507]         0000000014001fe0 000000000000c5b0
ffffffff8fefc520 ffffffff8feea52c
[23:10:35.278676]         ...
From: Anirban Sinha
Date: Sunday, October 18, 2009 - 1:19 pm

Hi Oleg:

I have a question for you. The queue_work() routine which is called from schedule_work() does a put_cpu() which in turn does a enable_preempt(). Is this an attempt to trigger the scheduler? One of the side affects of this enable_preempt() is the crash that we see below. What is happening is that  a timer callback routine, in  this case inet_twdr_hangman(), tries a bunch of cleanup until a threshold is reached. If further cleanups needs to be done beyond the threshold, it queues a work function. Now when the timer callback is run in __run_timers(), the routine grabs the value of preempt_count before and after the callback function call. If the two counts do not match, it calls BUG() (line 1037 in kernel/timer.c). Is is it illegal to schedule a work function from within a timer callback? What would be a good solution? I have already posted in netdev but since workqueues and timers are general kernel infrastructure, I thought I might as well post the question in the main linux m
ailing list and to you.

Here's the output from my instrumented BUG() call:

[02:15:15.941981] Kernel panic - not syncing: <3>huh, entered ffffffff803fbd60
(inet_twdr_hangman+0x0/0xe0)with preempt_count 00000102, exited with 00000101?

 I was thinking of a hacky solution, to replace schedule_work() with schedule_delayed_work() just to get around the issue. But I am sure this is just too hacky and probably not the ideal solution ...

Cheers,

Ani


--

From: Oleg Nesterov
Date: Monday, October 19, 2009 - 5:13 am

Hi Anirban,


No. please note that queue_work() does get_cpu() + put_cpu() to protect
against cpu_down() in between.


Yes, but I can't see how queue_work() can be involved, it doesn't change

Yes sure.

I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.

Could you verify this?

Oleg.

--

From: Anirban Sinha
Date: Monday, October 19, 2009 - 8:32 am

I'll keep looking. Thanks for the help Oleg. 

Ani

--

From: Oleg Nesterov
Date: Monday, October 19, 2009 - 8:36 am

OOPS!!!! I misread your question, didn't notice "il" above...

I meant: yes sure it _is legal_ to schedule a work from within a timer
callback (in fact it is legal from any context).

Sorry for confusion.

Oleg.

--

From: Anirban Sinha
Date: Monday, October 19, 2009 - 9:01 am

ok, then that part of the function looks fine.

Ani
--

From: Anirban Sinha
Date: Monday, October 19, 2009 - 5:56 pm

Yes, I have now verified this. There is indeed an issue with one of the
functions called by inet_twdr_hangman(). The call sequence is:

inet_twdr_hangman() -> inet_twdr_do_twkill_work() -> inet_twsk_put() ->
twsk_destructor().

In this case, the destructor callback is tcp_twsk_destructor() (installed
from line 1208 in net/ipv4/tcp_ipv4.c and line 906 in net/ipv6/tcp_ipv6.c) .
Without the TCP_MD5SUM compiled in, the function is a no-op. However, with the MD5SUM
compiled in, it calls tcp_put_md5_sig_pool() (when keylen is non zero) which
does an unbalanced put_cpu(). I did a grep across the entire tree.
tcp_put_md5_sig_pool() is a matching function for tcp_get_md5_sig_pool() and
in all other TCP IPV4 cases, it is called from net/ipv4/tcp_ipv4.c from
functions tcp_v4_md5_hash_hdr() and

--

From: Anirban Sinha
Date: Monday, October 19, 2009 - 6:08 pm

Yes, I have now verified this. There is indeed an issue with one of the
functions called by inet_twdr_hangman(). The call sequence is:

inet_twdr_hangman() -> inet_twdr_do_twkill_work() -> inet_twsk_put() ->
twsk_destructor().

In this case, the destructor callback is tcp_twsk_destructor() (installed
from line 1208 in net/ipv4/tcp_ipv4.c and line 906 in net/ipv6/tcp_ipv6.c) .
Without the TCP_MD5SUM compiled in, the function is a no-op. However, with the MD5SUM
compiled in, it calls tcp_put_md5_sig_pool() (when keylen is non zero) which
does an unbalanced put_cpu(). I did a grep across the entire tree.
tcp_put_md5_sig_pool() is a matching function for tcp_get_md5_sig_pool() and
in all other TCP IPV4 cases, it is called from net/ipv4/tcp_ipv4.c from
functions tcp_v4_md5_hash_hdr() and tcp_v4_hash_skb() along with the matching
get()
function. So I would think that in tcp_twsk_destructor(), the call should be
replaced by  tcp_free_md5_sig_pool() instead.

Signed-of-by: Anirban Sinha <asinha@zeugmasystems.com>
---

net/ipv4/tcp_minisocks.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e48c37d..dccc01e 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk)
 #ifdef CONFIG_TCP_MD5SIG
        struct tcp_timewait_sock *twsk = tcp_twsk(sk);
        if (twsk->tw_md5_keylen)
-               tcp_put_md5sig_pool();
+               tcp_free_md5sig_pool();
 #endif
 }







--

From: David Miller
Date: Monday, October 19, 2009 - 6:13 pm

From: Anirban Sinha <asinha@zeugmasystems.com>

This has been fixed in the tree for a month of so:

commit 657e9649e745b06675aa5063c84430986cdc3afa
Author: Robert Varga <nite@hq.alert.sk>
Date:   Tue Sep 15 23:49:21 2009 -0700

    tcp: fix CONFIG_TCP_MD5SIG + CONFIG_PREEMPT timer BUG()
    
    I have recently came across a preemption imbalance detected by:
    
    <4>huh, entered ffffffff80644630 with preempt_count 00000102, exited with 00000101?
    <0>------------[ cut here ]------------
    <2>kernel BUG at /usr/src/linux/kernel/timer.c:664!
    <0>invalid opcode: 0000 [1] PREEMPT SMP
    
    with ffffffff80644630 being inet_twdr_hangman().
    
    This appeared after I enabled CONFIG_TCP_MD5SIG and played with it a
    bit, so I looked at what might have caused it.
    
    One thing that struck me as strange is tcp_twsk_destructor(), as it
    calls tcp_put_md5sig_pool() -- which entails a put_cpu(), causing the
    detected imbalance. Found on 2.6.23.9, but 2.6.31 is affected as well,
    as far as I can tell.
    
    Signed-off-by: Robert Varga <nite@hq.alert.sk>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 045bcfd..624c3c9 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -363,7 +363,7 @@ void tcp_twsk_destructor(struct sock *sk)
 #ifdef CONFIG_TCP_MD5SIG
 	struct tcp_timewait_sock *twsk = tcp_twsk(sk);
 	if (twsk->tw_md5_keylen)
-		tcp_put_md5sig_pool();
+		tcp_free_md5sig_pool();
 #endif
 }
 
--

From: Anirban Sinha
Date: Monday, October 19, 2009 - 6:17 pm

Grrrr! Time for me to do a git pull again. The kernel source tree in my work
machine must be out of date by about the same time.

--

Previous thread: [net-2.6 PATCH 0/7] qlge: Fixes for bonding, RSS and reset. by Ron Mercer on Thursday, October 8, 2009 - 12:54 pm. (9 messages)

Next thread: pull request: wireless-2.6 2009-10-08 by John W. Linville on Thursday, October 8, 2009 - 3:34 pm. (10 messages)