Hello Linus. This is the main networking merge for 2.6.27
Highlights:
1) Explicit support for multiple hardware TX queues, from your's
truly.2) Making MIB statistics namespace aware, from Pavel Emelyanov.
3) GVRP support from Patrick McHardy.
4) Packet capture tools can now recreate the VLAN header even
when hardware offloading of VLAN decapsulation is being
performed. Also from Patrick McHardy.5) Dynamic queueing discipline hash table sizing from Patrick
McHardy.6) Lots of wireless stack and driver updates from John Linville
and the wireless crew.7) IPV6 stack improvements from Yoshifuji Hideaki and co.
8) Various wired driver updates via Jeff Garzik and all the various
driver maintainers.Please pull, thanks a lot!
The following changes since commit 5b664cb235e97afbf34db9c4d77f08ebd725335e:
Linus Torvalds (1):
Merge branch 'upstream-linus' of git://git.kernel.org/.../mfasheh/ocfs2are available in the git repository at:
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master
Abhijeet Kolekar (6):
iwlwifi : Set monitor mode for 4965
iwlwifi : Set monitor mode for 3945
iwlwifi : Patch adds rfkill subsystem for 3945
iwlwifi: Remove unnecessary code
iwlwifi: Fix LEDs for 3945
iwlwifi: make index unsigned int for iwl_send_led_cmdAdam Langley (3):
tcp: Fix MD5 signatures for non-linear skbs
tcp: options clean up
tcp: Remove redundant checks when setting eff_sacksAdel Gadllah (5):
iwlwifi: fix rfkill deps and remove input device usage
b43/b43legacy: use RFKILL_STATE_UNBLOCKED instead of RFKILL_STATE_ON
b43/b43legacy: add RFKILL_STATE_HARD_BLOCKED support
iwlwifi: remove input device and fix rfkill state
iwl3965: remove useless network and duplicate checkingAdrian Bunk (13):
make sta_rx_agg_session_timer_expired() static
remove ieee80211_tx_frame()
remove ieee80211_wx_{get,set}_auth()
...
David,
networking in -git (v2.6.26-5253-g14b395e, dead e1000 interface)
silently broke on two testboxes of mine earlier today, and i've bisected
[*] it back to:| d03157babed7424f5391af43200593768ce69c9a is first bad commit
| commit d03157babed7424f5391af43200593768ce69c9a
| Author: Auke Kok <auke-jan.h.kok@intel.com>
| Date: Sun Jun 22 15:21:29 2008 -0700
|
| e1000: remove PCI Express device IDs
|
| We do not want to prolong the situation much longer that e1000
| and e1000e support these devices at the same time. As a result,
| take out the bandage that was added for the interim period
| and remove all the PCI Express device IDs from e1000.i have migrated these testboxes to e1000e. (i migrated a third one
already but forgot about these two and i was stupid enough to do a
bisection suspecting some new bug.)I'm wondering, couldnt we warn about the removed pci express support via
the kernel log? Some simple printk. (Perhaps also with a default-off
.config option that shuts up this warning for users who intentionally
boot E1000=y without e1000e support.)Thanks,
Ingo
[*] the bisection log:
# bad: [14b395e1] Merge branch 'for-2.6.27' of git://linux-nfs.org/~
# good: [bce7f795] Linux 2.6.26
# good: [cadc7236] Merge branch 'bkl-removal' into next
# bad: [a0c80b8d] pkt_sched: Make default qdisc nonshared-multiqueue
# good: [30902dc4] ax25: Fix std timer socket destroy handling.
# bad: [fbd8f13a] net-sched: sch_htb: move hash and sibling list rem
# good: [40af48c1] rt2x00: kill URB for all TX queues during disable_
# bad: [4977929a] iwlwifi: control 11n capabilities through module p
# bad: [28f49d89] Merge branch 'master' of master.kernel.org:/pub/sc
# bad: [0caa116d] net: sh_eth: Fix compile error sh_eth
# good: [485ca22c] DM9000: Re-unite menuconfig entries for DM9000 dri
# good: [177db6f0] ixgbe: add LRO support
# bad: [6e4f6f6f] e1000e: make ioport free
# bad: [d03157ba] e1000: remove PCI Express device ID...
e1000_init_module() already has two printk()s. A third unconditional
one won't hurt a lot. The first two could also be joined.
--
Stefan Richter
-=====-==--- -=== =-=-=
http://arcgraph.de/sr/
--
David,
-tip testing on latest -git (v2.6.26-5253-g14b395e) triggered the
following boot crash on a Core2Duo 64-bit testsystem:ADDRCONF(NETDEV_UP): eth0: link is not ready
eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
------------[ cut here ]------------
Kernel BUG at ffffffff8079afb1 [verbose debug info unavailable]
invalid opcode: 0000 [1] SMP
CPU 0
Pid: 7, comm: events/0 Not tainted 2.6.26-rc8 #21302
RIP: 0010:[<ffffffff8079afb1>] [<ffffffff8079afb1>] __netif_schedule+0xd/0x64
RSP: 0018:ffff81003fa4be30 EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffff81003e9f49f0 RCX: ffffffff80c38fe0
RDX: ffff81003e9e7940 RSI: ffffffff80c3fdc0 RDI: ffffffff80c3fdc0
RBP: ffff81003e9f49f0 R08: 0000000000001607 R09: ffff810003b1c380
R10: 0000000000000005 R11: 00000100ffffffff R12: 0000000000000000
R13: ffff81003e9f49f0 R14: ffff81003e9f4000 R15: ffff81003e9f46c0
FS: 0000000000000000(0000) GS:ffffffff80c7c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000de7ef0 CR3: 000000003e5f1000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process events/0 (pid: 7, threadinfo ffff81003fa4a000, task ffff81003fa20c30)
Stack: ffff81003e9f49f0 ffffffff805083a4 ffffffff80d19840 ffff81003e9f4a08
ffff81003e9e7880 0000000000000000 0000000000000000 0000000000000202
000000003f9f5f10 ffff81003fa06a40 ffffffff80507f5c ffff81003fa06a48
Call Trace:
[<ffffffff805083a4>] ? e1000_watchdog_task+0x448/0x635
[<ffffffff80507f5c>] ? e1000_watchdog_task+0x0/0x635
[<ffffffff8023e11b>] ? run_workqueue+0x80/0x112
[<ffffffff8023e9f6>] ? worker_thread+0xd9/0xe8
[<ffffffff802410af>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff8023e91d>] ? worker_thread+0x0/0xe8
[<ffffffff80240f58>] ? kthread+0x47/0x73
[<ffffffff8022d557>] ? schedule_tail+0x28/0x5d
[<ffffffff8020c288>] ? child_ri...
From: Ingo Molnar <mingo@elte.hu>
Should be fixed by:
e1000: resolve tx multiqueue bug
With the recent changes to tx mutiqueue, e1000 was not calling
netif_start_queue() before calling netif_wake_queue().
This causes an oops during loading of the driver.(Based on commit d55b53fff0c2ddb639dca04c3f5a0854f292d982
("igb/ixgbe/e1000e: resolve tx multiqueue bug").)Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index f8df8bd..cf12b05 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -1472,6 +1472,8 @@ e1000_open(struct net_device *netdev)e1000_irq_enable(adapter);
+ netif_start_queue(netdev);
+
/* fire a link status change interrupt to start the watchdog */
E1000_WRITE_REG(&adapter->hw, ICS, E1000_ICS_LSC);--
note, my tests have also triggered another boot crash on the same
system, using the same config:PM: Removing info for No Bus:phy0
mac80211_hwsim: ieee80211_register_hw failed (-2)
BUG: unable to handle kernel NULL pointer dereference at 0000000000000370
IP: [<ffffffff808da9f1>] rollback_registered+0x2a/0xd6
PGD 0
Oops: 0000 [1] SMP
CPU 1
Pid: 1, comm: swapper Not tainted 2.6.26-tip-00013-g6de15c6-dirty #21290
RIP: 0010:[<ffffffff808da9f1>] [<ffffffff808da9f1>] rollback_registered+0x2a/0xd6
RSP: 0018:ffff88003f83fe00 EFLAGS: 00010212
RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff88003d4baed8
RDX: ffffffff80979f1d RSI: 0000000000000046 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff80d6f4a0 R09: ffff880004576800
R10: 0000000000000000 R11: ffffffff80406afe R12: 0000000000000000
R13: ffff88003d4bb9a0 R14: 0000000000000000 R15: 0000000000000008
FS: 0000000000000000(0000) GS:ffff88003f829160(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000370 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88003f83e000, task ffff88003f824000)
Stack: 0000000000000000 ffffffff808daacf ffff88003d4ba2c0 ffffffff8097e1da
ffff88003d4bb9a0 ffffffff8060eb76 00000000fffffffe ffff88003d4ba2c0
ffff88003d4bb9e0 ffffffff811be87a ffff88003f83fea0 ffffffff8024e672
Call Trace:
[<ffffffff808daacf>] unregister_netdevice+0x32/0x77
[<ffffffff8097e1da>] ieee80211_unregister_hw+0x35/0xd4
[<ffffffff8060eb76>] mac80211_hwsim_free+0x1d/0x6a
[<ffffffff811be87a>] init_mac80211_hwsim+0x2df/0x2f0
[<ffffffff8024e672>] getnstimeofday+0x38/0x95
[<ffffffff8024c76a>] ktime_get_ts+0x21/0x49
[<ffffffff811be59b>] init_mac80211_hwsim+0x0/0x2f0
[<ffffffff8020a042>] do_one_initcall+0x42/0x13b
[<ffffffff80247105>] _...
another one:
[ 24.426510] bus: 'platform': add device i8042
[ 24.430389] PM: Adding info for platform:i8042
[ 24.434799] ohci1394: fw-host0: IntEvent: 00030010
[ 24.434799] ohci1394: fw-host0: irq_handler: Bus reset requested
[ 24.434799] ohci1394: fw-host0: Cancel request received
[ 24.434799] ohci1394: fw-host0: Got RQPkt interrupt status=0x00008409
[ 24.434799] ohci1394: fw-host0: SelfID interrupt received (phyid 0, root)
[ 24.434799] ohci1394: fw-host0: SelfID packet 0x807f8c56 received
[ 24.434799] ieee1394: Including SelfID 0x568c7f80
[ 24.434799] ohci1394: fw-host0: SelfID for this node is 0x807f8c56
[ 24.434799] ohci1394: fw-host0: SelfID complete
[ 24.434799] ohci1394: fw-host0: PhyReqFilter=ffffffffffffffff
[ 24.434799] ieee1394: selfid_complete called with successful SelfID stage ... irm_id: 0xFFC0 node_id: 0xFFC0
[ 24.434799] ieee1394: NodeMgr: Processing reset for host 0
[ 24.434799] ------------[ cut here ]------------
[ 24.434799] kernel BUG at net/core/dev.c:1328!
[ 24.434799] invalid opcode: 0000 [1] SMP
[ 24.434799] CPU 0
[ 24.434799] Pid: 6, comm: events/0 Not tainted 2.6.26-tip-05790-gcad008f-dirty #13278
[ 24.434799] RIP: 0010:[<ffffffff809f5a9d>] [<ffffffff809f5a9d>] __netif_schedule+0x9d/0xb0
[ 24.434799] RSP: 0018:ffffffff812f6de8 EFLAGS: 00010046
[ 24.434799] RAX: 00000000ffffffff RBX: ffffffff81021d80 RCX: 0000000000000000
[ 24.434799] RDX: ffff88003c894930 RSI: 0000000000000000 RDI: ffffffff81021d80
[ 24.434799] RBP: ffffffff812f6df8 R08: 0000000000000000 R09: 0000000000000001
[ 24.434799] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88003c884000
[ 24.434799] R13: 0000000000000086 R14: ffff88003c8959d8 R15: ffff88003c886410
[ 24.434799] FS: 0000000000000000(0000) GS:ffffffff8115af80(0000) knlGS:0000000000000000
[ 24.434799] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 24.434799] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
[ 24.434799] DR0: 0...
Ok, it is now a WARN_ON_ONCE() in my tree (which I _just_ pushed out).
So it's going to cause irritating messages (once), but the machine should
It's going to bisect down to the same commit you already bisected once,
it's the networking code that changed some of the rules, so various
network drivers that didn't follow the expected rules are now unhappy.Maybe the network drivers are few enough that it will get fixed, or maybe
the WARN_ON_ONCE() will just be removed and the rule not reinforced.I personally suspect the latter, since it seems to happen with just about
_any_ random network driver, including the common and well-maintained ones
(ie the Gods only help us for the truly odd/random cases)Linus
--
From: Linus Torvalds <torvalds@linux-foundation.org>
Yes, we'll see how this plays out.
Ian Schram just posted a patch for the NULL pointer derfer in
wireless Ingo reported, so we'll see if that bug will be fixed
now as well.
--
Yes, the fix from Ian below solved the CONFIG_MAC80211_HWSIM=y crash i
was getting. I have no other pending issues other than a few low-prio
ne2000 build failures.Thanks guys,
Ingo
--------->
commit 2f77dd3a3b5c3a27298fa0a09d8703c09c633fc6
Author: Ian Schram <ischram@telenet.be>
Date: Mon Jul 21 20:18:25 2008 +0200mac80211_hwsim.c: fix: BUG: unable to handle kernel NULL pointer dereference at 0000000000000370
I was looking at this out of interest, but I'm in no way familiar with
the code.Looks to me that the error handling code in mac80211_hwsim is awkward.
Which leads to it calling ieee80211_unregister_hw even when
ieee80211_register_hw failed.The function has a for loop where it generates all simulated radios.
when something fails, the error handling will call mac80211_hwsim_free
which frees all simulated radios who's pointer isn't zero. However the
information stored is insufficient to determine whether or not the call
to ieee80211_register_hw succeeded or not for a specific radio. The
included patch makes init_mac80211_hwsim clean up the current simulated
radio, and then calls into mac80211_hwsim_free to clean up all the
radios that did succeed.This however doesn't explain why the rate control registration failed..
build tested this, but had some problems reproducing the original
problem.Signed-off-by: Ian Schram <ischram@telenet.be>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/net/wireless/mac80211_hwsim.c | 18 ++++++++++++------
1 files changed, 12 insertions(+), 6 deletions(-)diff --git a/drivers/net/wireless/mac80211_hwsim.c b/drivers/net/wireless/mac80211_hwsim.c
index 913dc9f..5816230 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -364,8 +364,7 @@ static void mac80211_hwsim_free(void)
struct mac80211_hwsim_data *data;
data = hwsim_radios[i]->...
From: Ingo Molnar <mingo@elte.hu>
Thanks everyone, applied.
--
In the meantime: Is there perhaps something obviously wrong with
drivers/ieee1394/eth1394.c's netdevice initialization? We do it in
ether1394_add_host(), and shortly thereafter the crashing
ether1394_host_reset() is called. So we have essentially(add host)
dev = alloc_netdev(...);
initialize various members in dev...
register_netdev(dev);(host reset)
netif_stop_queue(dev);
discard some stale 1394 stuff if there were some...
netif_wake_queue(dev); <-- crashes in __netif_schedule(dev);--
Stefan Richter
-=====-==--- -=== =-=-=
http://arcgraph.de/sr/
--
From: Stefan Richter <stefanr@s5r6.in-berlin.de>
You should only do a netif_stop_queue() in your device
initialization, at the very end of ->open() processing
when you've fully committed to returning success.You should not, in particular, be doing a netif_wake_queue()
before you've even done a netif_start_queue().Many of these drivers are using netif_{stop,wake}_queue()
to stop packet flow, in particular when link state changes,
and netif_carrier_{on,off}() already does all of that for
you.Really, anything outside of:
1) netif_start_queue() in ->open()
2) netif_stop_queue() in ->stop()
3) netif_{stop,wake}_queue() in the TX packet handling pathis superfluous.
--
Thanks, I wasn't aware of this. I'll try to fix up eth1394 this way as
time permits.
--
Stefan Richter
-=====-==--- -=== =-=-=
http://arcgraph.de/sr/
--
ok, have updated the testboxes to your latest push.
Btw., otherwise the big networking pull held up pretty well on a healthy
range of testboxes i have, it looked a lot scarier to me in the morning
than it turned out to be during the day. A couple of hundred tests
passed already and no indication of any runtime fragility so far. Boot
crashes/warnings can be annoying and hard to get a proper log of but
once the log is available they are normally quite easy for developers to
act upon.Ingo
--
hm, the distcc TCP hangs are back:
Distcc client box (quad, 10.0.1.16) running v2.6.24:
dione:~> netstat -nt | grep -vw TIME_WAIT | grep 3632
tcp 0 250455 10.0.1.16:55559 10.0.1.19:3632 ESTABLISHED
tcp 0 254743 10.0.1.16:56096 10.0.1.19:3632 ESTABLISHED
tcp 0 219617 10.0.1.16:55674 10.0.1.19:3632 ESTABLISHED[ ^--- note the stuck send-queue ]
Distcc server box (16-way, 10.0.1.19) running very-latest:
phoenix:~> netstat -nt | grep 10.0.1.16 | grep 3632
tcp 0 0 10.0.1.19:3632 10.0.1.16:55559 ESTABLISHED
tcp 0 0 10.0.1.19:3632 10.0.1.16:56096 ESTABLISHED
tcp 0 0 10.0.1.19:3632 10.0.1.16:55674 ESTABLISHEDtcp 0 0 10.0.1.19:3632 10.0.1.16:34411 ESTABLISHED
tcp 0 0 10.0.1.19:3632 10.0.1.16:51094 ESTABLISHED
tcp 0 0 10.0.1.19:3632 10.0.1.16:60787 ESTABLISHED
tcp 0 0 10.0.1.19:3632 10.0.1.16:50874 ESTABLISHEDI.e. the client side send-queue is stuck in established state, server
side thinks it's a proper established connection. Nobody makes any
progress.Also note the final 4 connections on the server side - those are not
present on the client box.The hung condition seemed permanent (i waited a couple of minutes).
Then i shut down the distccd on the server side, which propagated to the
client:distcc[18496] (dcc_pump_sendfile) ERROR: sendfile failed: Broken pipe
distcc[18496] (dcc_readx) ERROR: unexpected eof on fd4
distcc[18496] (dcc_r_token_int) ERROR: read failed while waiting for token "DONE"
distcc[18496] Warning: failed to distribute kernel/futex.c to ph/20, running locally insteadServer side lingered in FIN_WAIT2 a bit:
Proto Recv-Q Send-Q Local Address ...
The missing four client-side connections are more interesting than the
I might be missing something obvious, but I don't think there's anything
unusual in the three sessions displayed on the client. They should beNow this is interesting. I would be much more interested in how the
Not nearly long enough. Retransmits can be sent as infrequently as per
180 seconds. I think there's an argument to use one of the the various
patches that reduce your TCP_RTO_MAX, for example OBATA Noboru's
(http://marc.info/?l=linux-netdev&m=118422471428855): you don't have toYou really should start that capture, and on both client and server.
You don't need to dump everything, only traffic to or from server:distcc.
--
i know, i waited much more than 180 minutes - about 15 minutes. That is
more than enough for this LAN connection.It's all on the LAN directly via a single gigabit switch and no packet
It's not feasible. That box did in excess of 200 GB of network traffic
in the past 7 hours alone. ~10 clients are doing make -j200 type of
kernel builds to this 16way buildbox so it is not realistic to tcpdump
it - especially given the rarity of this problem. (it has not reoccured
since then) The network is local LAN, gigabit ethernet over a single
gigabit switch.Ingo
--
You only need distcc traffic, and perhaps only after it's hung. With
250k outstanding per socket, are you certain that no traffic was sent?
Is it certain that one packet wasn't being sent each three minutes? I
suppose you're right and the stack really is stuck, but this is such an
easy thing to check and eliminate that you should do so. I suppose,
too, that you should trace the server-side processes and confirm that
they are waiting for socket input. You should dump tcp (for the distcc
port) next time the problem recurs and also check that the server
processes are waiting for socket input.
--
ok, will do that if it happens again.
Ingo
--
Ingo,
if it can help, I have a "capture" script which allows you to define
a size and will rotate captures within that size. That's what I'm
using to troubleshoot rarely occuring problems in datacenters, so
it's horrible but efficient :-)You just have to stop it once the problem has happened again. Ping
me if you're interested (I'm lazy to start my laptop right just for
it now in fact).Willy
--
yeah, that would be handy, thanks.
Alas, the problem has not reoccured since then - more than a thousand
kernel builds down the line. Yesterday it triggered so quickly when i
updated the buildbox to the new kernel, and happened repeatedly when i
tried to build a new kernel, that i didnt assume it was something hard
to reproduce - but it went poof after i restarted distccd on the server.So i'd suggest we do not count this as a regression, i've got no way at
the moment of reproducing it reliably.Ingo
--
the permanently hug distcc kernel build bug triggered again, twice.
First time it happened yesterday, i left it running overnight and it
never recovered after a 14+ hours of wait.It shows a similar pattern, 'ESTABLISHED' state on both sides, but the
client-side is stuck and the server (running latest kernel) is seemingly
clueless about that fact:client:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 375450 10.0.1.16:39201 10.0.1.19:3632 ESTABLISHEDserver:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.0.1.19:3632 10.0.1.16:39201 ESTABLISHEDi waited ~30 minutes in this second case.
the client (running 2.6.24) does periodic 120 seconds retransmits:
07:40:48.255452 IP dione.39201 > phoenix.distcc: . 1608:2144(536) ack 1 win 584007:40:48.255547 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:40:48.255564 IP dione.39201 > phoenix.distcc: . 67143:67679(536) ack 1 win 5840
07:40:48.255648 IP phoenix.distcc > dione.39201: . ack 2144 win 65535
07:42:48.255440 IP dione.39201 > phoenix.distcc: . 2144:2680(536) ack 1 win 5840
07:42:48.255559 IP phoenix.distcc > dione.39201: . ack 2680 win 65535
07:42:48.255570 IP dione.39201 > phoenix.distcc: . 67679:68215(536) ack 1 win 5840
07:42:48.255659 IP phoenix.distcc > dione.39201: . ack 2680 win 65535
07:44:48.255436 IP dione.39201 > phoenix.distcc: . 2680:3216(536) ack 1 win 584007:44:48.255570 IP phoenix.distcc > dione.39201: . ack 3216 win 65535
07:44:48.255585 IP dione.39201 > phoenix.distcc: . 68215:68751(536) ack 1 win 5840
07:44:48.255669 IP phoenix.distcc > dione.39201: . ack 3216 win 65535the server (running the latest kernel) responds:
07:40:47.551098 IP dione.39201 > phoenix.distcc: . 1072:1608(536) ack 1 win 584007:40:47.551141 IP phoenix.distcc > dione.39201: . ack 1608 win 65535
07:40:47.551204 IP dione.39201 > phoenix.distcc: . 66607:67143(536) ack 1 win 5...
OK, something's seriously screwed up on dione's kernel. Could
you please disable syncookies (which should enable SACK for you)
and see if the problem goes away?I think our Reno code may have been broken.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
This looks like the FRTO bugs we fixed in 2.6.25.7, afaik, 2.6.24.y wasn't
anymore updated at that time so it's a bit obsolete...--
i.
--
On Thu, 24 Jul 2008, Ilpo J
ok, i've booted the latest kernel on dione and the builds are running on
it now. It's using these settings currently:# cat /proc/sys/net/ipv4/tcp_syncookies
0
# cat /proc/sys/net/ipv4/tcp_sack
1Just to make sure i have the right test-plan: assuming this
suspected-good kernel is trouble-free for say 48 hours, i should be able
to switch back to the suspected-broken variant, by doing:echo 1 > /proc/sys/net/ipv4/tcp_syncookies
without rebooting the kernel, right? And then if the bug comes back
again we've nailed it down to the area you suspect, correct?Ingo
--
Yes, except that you'll have to reestablish those connections
to make the new setting take effect. In either case please
verify with tcpdump to make sure that both sides are proposing
sack at the start of the connection. Alternatively see whether
sacks are generated in case of packet loss.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
From: Ingo Molnar <mingo@elte.hu>
You don't need the explicit tcp_sack=1, it's going to be '1' by
default.Simply switching syncookies on and off will turn both SACK and
timestamps off when running the dist kernel.
--
BTW, if someone can contact the Fedora folks, specifically whoever
added that syncookies enable to /etc/sysctl.conf, and let them know
that they've set our TCP stack back by 10 years by disabling SACK
and timestamps, I'd really appreciate it.
--
But syn cookies should not trigger unless the syn backlog is full, right ?
Anyway, it would also explain why Ingo's running no default MSS (though
it should be correctly negociated by syn cookies).willy
--
Server box (F9) has:
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1timestamp of /etc/sysctl.conf: 2008-05-15, installation default i think.
Other (F8) box has:
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 0timestamp of /etc/sysctl.conf: 2007-01-28, touched 1.5 years ago.
Have not checked the install defaults - if i touched them it was a long
time ago.Ingo
--
here's a new type of crash a -tip testbox triggered today:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<0000000000000000>] 0x0
PGD 0
Oops: 0010 [1] PREEMPT
CPU 0
Pid: 5331, comm: ssh Not tainted 2.6.26 #21613
RIP: 0010:[<0000000000000000>] [<0000000000000000>] 0x0
RSP: 0018:ffff88003b959a70 EFLAGS: 00010217
RAX: ffff88003b9ba91c RBX: ffff88003b9baac0 RCX: 0000000000000002
RDX: ffff88003b9baac0 RSI: 0000000000000014 RDI: ffff88003b9ba3d8
RBP: ffff88003b959a98 R08: ffff88003e434130 R09: ffff88003b9598b8
R10: 0000000000002000 R11: 00000000d85507ac R12: ffff88003b9ba3d8
R13: 0000000000000002 R14: ffff88003e06a020 R15: 0000000000000000
FS: 00007fb2b7806780(0000) GS:ffffffff80c1c180(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003b9e8000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ssh (pid: 5331, threadinfo ffff88003b958000, task ffff88003b99a000)
Stack: ffffffff80777f25 ffff88003b959a98 ffffffff80c24170 ffff88003b959b28
0000000000000004 ffff88003b959ae8 ffffffff80733107 ffff88003b959ab8
ffff88003b9ba3d8 ffffffff80c1ed60 0000000000000004 ffff88003b9ba3d8
Call Trace:
[<ffffffff80777f25>] ? ipv4_confirm+0x8d/0x122
[<ffffffff80733107>] nf_iterate+0x43/0xa5
[<ffffffff8074bbb2>] ? ip_finish_output+0x0/0x62
[<ffffffff807331cd>] nf_hook_slow+0x64/0xcb
[<ffffffff8074bbb2>] ? ip_finish_output+0x0/0x62
[<ffffffff8074b63b>] nf_hook_thresh+0x1e/0x20
[<ffffffff8074bd44>] ip_output+0xb3/0xc9
[<ffffffff8074a27e>] dst_output+0xb/0xd
[<ffffffff8074b6ef>] ip_local_out+0x1e/0x22
[<ffffffff8074c75c>] ip_queue_xmit+0x2a6/0x321
[<ffffffff80241961>] ? sub_preempt_count+0x2d/0x40
[<ffffffff8028f239>] ? slob_alloc+0x11b/0x248
[<ffffffff8075b85a>] tcp_transmit_skb+0x5ec/0x6...
here's the full bootlog:
http://redhat.com/~mingo/misc/crash-Thu_Jul_24_13_23_34_CEST_2008.log
kernel is latest -git, v2.6.26-6371-g338b9bb.
Ingo
--
FYI, this was the most complex bisection i have ever done under Linux.
Firstly, automated bisection honed in on an already known bad commit, so
i had to do another, manual bisection.There i hit four other regression and i had to work them around at
multiple bisection points to be able to bisect this regression. I also
had to do a "nested" sub-bisection to fix one of the bisection points.In case someone else has to bisect in the future in this region as well,
here is the list:1) I hit the mac802 hwsim NULL dereference bootup crash and
cherry-picked its fix, 3a33cc108d1. Alas, that didnt work - so i
tweaked the .config. (hoping that it would not change the crash
pattern - fortunately it didnt)2) build failure:
net/built-in.o: In function `dev_queue_xmit':
: undefined reference to `qdisc_calculate_pkt_len'
net/built-in.o: In function `__qdisc_destroy':
sch_generic.c:(.text+0x22874): undefined reference to qdisc_put_stab'
net/built-in.o: In function `ieee80211_requeue':
: undefined reference to `qdisc_calculate_pkt_len'had to do a secondary, nested bisection to figure out that the
build at this commit point was broken by 175f9c1bba9b ("net_sched:
Add size table for qdiscs") and reverted it.3) on bisection point 11ea859d64b i got a hard lockup:
BUG: NMI Watchdog detected LOCKUP on CPU0, ip ffffffff8028ef43, registers:
Call Trace:
[<ffffffff8028f3c6>] slob_alloc+0x9c/0x248
[<ffffffff8028f59d>] kmem_cache_alloc_node+0x2b/0x5e
[<ffffffff80292e42>] get_empty_filp+0x4f/0xee
[<ffffffff8029c635>] __path_lookup_intent_open+0x2f/0x9f
[<ffffffff8029c719>] path_lookup_open+0xc/0xe
[<ffffffff8029c7c8>] do_filp_open+0xad/0x7e7
[<ffffffff80272bec>] ? trace_preempt_on+0x9/0xb
[<ffffffff802419d9>] ? sub_preempt_count+0x2d/0x40
[<ffffffff802903ca>] ? get_unused_fd_flags+0x110/0x128
[<ffffffff80290433>] do_sys_op...
Just to make sure - the "netfilter/RCU use-after-free fix" was the
Thats odd. I don't think anything is wrong with that patch
itself, its more likely that its triggering a bug in
ct_extend. You config has a few helper enabled (FTP, H.323,
TFTP) and the crash is when trying to call the helper functions.
Did you actually have traffic of one of these protocols?--
yes. You can see it in tip/out-of-tree:
no, that's not likely - it's a default distro bootup.
Ingo
--
The commit makes ct_extend area to be used *very* frequently. Could you=20
try to boot your kernel with nf_conntrack.acct=3D0 to disable accounting?
Does it help?Best regards,
=09=09=09=09Krzysztof Ol=EAdzki
OK thanks. I'll look into it.
--
Does reverting 31d8519c fix this?
--
Hmm, why do you think it's related? It looks like elem->hook() is a
NULL pointer but my patch shouldn't make any difference here...
--
No, its already in ipv4_confirm, so its most likely helper->help()
thats NULL, which is contained in an extension.The reason why I think its this patch is (quoting from an old
email that I never got a response to):---
Your patch introduced a use-after-free and double-free.
krealloc() frees the old pointer, but it is still used
for the ->move operations, then freed again.To fix this I think we need a __krealloc() that doesn't
free the old memory, especially since it must not be
freed immediately because it may still be used in a RCU
read side (see the last part in the patch attached to
this mail (based on a kernel without your patch)).
---I've attached that patch again.
Hmm. Don't you need to fix some of the ordering of the initialization too?
If there are possible readers that happen in parallel with changing this
thing, don't you need to protect the update of "ext->len" against the
actual changes? And the readers should probably have a read barrier
between checking "len" and actually looking at the values? Finally, why do
the "ct->ext" dereference thing, when we know it has to be equal to "new"?ie something like this on the writing side (in _addition_ to both the
patches already seen), but I didn't do the reading side (ie there are no
"smp_rmb()"'s on the reading side)And no, I don't know the code, so I don't know who/what can read those
things with RCU, so maybe there is some reason why the actual data doesn't
need protecting. But I somehow doubt it.Linus
---
net/netfilter/nf_conntrack_extend.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)diff --git a/net/netfilter/nf_conntrack_extend.c b/net/netfilter/nf_conntrack_extend.c
index 3469bc7..135e095 100644
--- a/net/netfilter/nf_conntrack_extend.c
+++ b/net/netfilter/nf_conntrack_extend.c
@@ -115,10 +115,11 @@ void *__nf_ct_ext_add(struct nf_conn *ct, enum nf_ct_ext_id id, gfp_t gfp)
ct->ext = new;
}- ct->ext->offset[id] = newoff;
- ct->ext->len = newlen;
- memset((void *)ct->ext + newoff, 0, newlen - newoff);
- return (void *)ct->ext + newoff;
+ new->offset[id] = newoff;
+ memset((void *)new + newoff, 0, newlen - newoff);
+ smp_wmb();
+ new->len = newlen;
+ return (void *)new + newoff;
}
EXPORT_SYMBOL(__nf_ct_ext_add);--
There certainly needs to be an rcu_assign_pointer() or a smp_wmb()
in there somewhere -- whenever it is that the new structure becomes
accessible to RCU readers. I looked around a bit, but couldn't tell
when this stuff becomes accessible to readers...--
Extensions can only be added while the conntrack is "unconfirmed",
meaning its not in the hash tables yet and the reference is exclusive.
The reason why we need RCU at all is that the extension areasThats a relict of the old code, which allocated "new" conditionally.
I've taken that part from your patch without the smb_wmb(), which
shouldn't be necessary.
--
From: Linus Torvalds <torvalds@linux-foundation.org>
Actually in the old code this precondition didn't hold, which explains
how it is.The old code looked like:
if (newlen >= ksize(ct->ext)) {
new = kmalloc(newlen, gfp);
if (!new)
return NULL;
...
ct->ext = new;
}ct->ext->offset[id] = newoff;
ct->ext->len = newlen;
memset((void *)ct->ext + newoff, 0, newlen - newoff);
return (void *)ct->ext + newoff;and in that context 'new' is only assigned in the "newlen >=" guarded
code block.Anyways, it does seem that we should indeed only update the
new larger length only after we've initialized the contents.Note that we could make krealloc() and friends clear out the trailing
bits of the new buffer, and therefore the caller wouldn't even need to
be mindful of such things.I don't know if that's desirable in general, probably it isn't.
--
Agreed. Something like this, perhaps?
[PATCH] netfilter: fix double-free and use-after free
As suggested by Patrick McHardy, introduce a __krealloc() that doesn't
free the original buffer to fix a double-free and use-after-free bug
introduced by me in netfilter that uses RCU.Reported-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9aa90a6..be6f1d4 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -96,6 +96,7 @@ int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr);
/*
* Common kmalloc functions provided by all allocators
*/
+void * __must_check __krealloc(const void *, size_t, gfp_t);
void * __must_check krealloc(const void *, size_t, gfp_t);
void kfree(const void *);
size_t ksize(const void *);
diff --git a/mm/util.c b/mm/util.c
index 8f18683..6ef9e99 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -68,25 +68,22 @@ void *kmemdup(const void *src, size_t len, gfp_t gfp)
EXPORT_SYMBOL(kmemdup);/**
- * krealloc - reallocate memory. The contents will remain unchanged.
+ * __krealloc - like krealloc() but don't free @p.
* @p: object to reallocate memory for.
* @new_size: how many bytes of memory are required.
* @flags: the type of memory to allocate.
*
- * The contents of the object pointed to are preserved up to the
- * lesser of the new and old sizes. If @p is %NULL, krealloc()
- * behaves exactly like kmalloc(). If @size is 0 and @p is not a
- * %NULL pointer, the object pointed to is freed.
+ * This function is like krealloc() except it never frees the originally
+ * allocated buffer. Use this if you don't want to free the buffer immediately
+ * like, for example, with RCU.
*/
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
+void *__krealloc(const void *p, size_t new_size, gfp_t flags)
{
void *ret;
size_t ks = 0;- if (unlikely(!new_size)) {
- kfree(p);
+ if (unlikely...
Looks good to me, thanks.
--
Ingo, can you please test this? Andrew, I'm at OLS so can you pick up
the patch in your tree?--
Sure. Or Patrick can do so and it can be merged via the net tree.
Ingo, did this patch actually fix something over there?
--
I've picked it up and will merge it via Dave.
--
not that i know of. I'll do some re-tests tomorrow to make sure.
Ingo
--
I managed to reproduce the bug you reported, but it reliably
goes away if I add Pekka's patch. Did you get a chance to
retest already?--
not yet - please consider it fixed, i'll let you know if something
similar to that occurs.Ingo
--
Apparently it didn't but it did fix Dieter's problem:
http://lkml.org/lkml/2008/7/24/337
Dieter, can we add a Tested-by tag from you to this patch?
--
Yes, it definitely fixed my issue and I have not encountered
further problems with the patch. The machine
is running fine with it.Do I have to explicitly add my
Tested-by: Dieter Ries <clip2@gmx.de>
tag somewhere (if yes, where?) or is this enough for you?cu
Dieter
--
No, that's fine. The people who are merging the patch can pick it up.
Thanks a lot for your help!Pekka
--
From: Pekka Enberg <penberg@cs.helsinki.fi>
Patrick will toss it my way, and it'll get integrated
via net-2.6 as a result.
--
Regardless of whether this is the problem, banning ksize because
it can be abused is like banning cars because they can kill people.For example, Ethernet skbs are 1500 bytes long, so using ksize
we could potentially use the left-over memory for temporary storage.
Without it all this memory goes to waste. On a machine that's
doing a lot of networking this could save a signifcant amount
of memory.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Hi Herbert,
Using ksize() for skbs will crash your kernel for some configurations
because calling that function for memory allocated with
kmem_cache_alloc() is not supported by all the allocators (well,
SLOB).Pekka
--
So how about fixing the interface so that it can return an error
to indicate that the allocator doesn't support it? You're taking
away an entire interface just because an underlying implementation
that's used by a very small proportion of users doesn't do the
right thing.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
You don't honestly expect people to correctly code to such a standard,
do you? People will assume that ksize never fails, they will be wrong,Umm, no. There were very few users to being with, so it was actually a
fairly large proportion. And that suggested the interface was a bad
idea.--
Mathematics is the supreme nostalgia of our time.--
If we follow this argument then most of our interfaces would be
Huh? I'm talking about users of the kernel. A very small proportion
of those use SLOB.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Let's try this again: did you know that ksize could fail depending on
kernel configuration? Most of us would answer no. That suggests the API
is bad. This ranks 12 on Rusty's spectrum of user-friendly APIs:Ahh. I see what you're saying. You may be right about that, or you may
be wrong: I don't know which of the millions of mass market embedded
Linux devices out there are using it.And of course I argue that SLOB did do the right thing, which was only
allowing ksize on kmalloced objects. It's an accident of implementation
that kmalloc and kmem_cache_alloc use the same underlying allocator. It
has not been true at all points in the past, it's not true for some
users in the present, and it may not be true for most users in the
future. Thus, it's a bad idea to try to use ksize on something that
wasn't kmalloced.If you have an argument for reintroducing ksize based on an actual use
case, let's please move on to (or back to) it so that we can have some
substance to this discussion.--
Mathematics is the supreme nostalgia of our time.--
I think you misunderstood my argument. I never suggested changing
the existing ksize interface to return an error onto unsuspecting
users. I suggested creating a new interface that is explicitly
designed to return an error if the underlying implementation
is unable to support this.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
I think that could probably be made to work. Perhaps something like:
size_t kmalloc_extra(void *); /* how many extra bytes in this kmalloc?
*/Which, if it didn't work, could return a nice safe 0. We could argue
about signedness a bit, but I think this would always be safe.This will also work with all our current kmalloc implementations. The
trouble was calling ksize() on kmem_cache_alloc objects, which happens
to work with SLAB and SLOB.--
Mathematics is the supreme nostalgia of our time.--
You could easily add some function which returns a constant error for
SLOB and ksize for others, because slob typically contains very little
slack space (unless explicit alignment is asked for).OTOH, skb allocation uses kmalloc don't they? So you could still use
SLOB ksize for that I guess.
--
Yes I was referring to the data portion which is kmalloc'ed.
That is also why I'm interested in ksize because a priori we
don't know exactly how big it's going to be. However, we do
know that statistically 1500 will dominate.I'm not interested in ksize for kmem_cache at all. So in fact
we could have something simpler that's based on kmalloc's rounding
algorithm instead.Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
Yes you could definitely have a function that returns allocated
bytes for a given kmalloc size. Should be about as fast or faster
than extracting the size from the kaddr...--
Yup, makes sense.
--
On the other hand, I can imagine useful allocator changes where this
would not be a constant of requested size. For instance, imagine we had
a classless bucket allocator, but with a heuristic to try a larger
bucket when it wasn't cheap/possible to allocate a right-sized object
(because of memory pressure, etc.) and larger ones were available.
This sort of thing is a pretty small change for SLAB/SLUB.--
Mathematics is the supreme nostalgia of our time.--
here's a longer log from the server, with sequences, flags, etc:
08:06:47.809947 IP (tos 0x0, ttl 64, id 13998, offset 0, flags [DF], proto TCP (6), length 576) dione.39201 > phoenix.distcc: . 234555110:234555646(536) ack 2272574194 win 5840
08:06:47.809974 IP (tos 0x0, ttl 64, id 27389, offset 0, flags [DF], proto TCP (6), length 40) phoenix.distcc > dione.39201: ., cksum 0x9900 (correct), 2272574194:2272574194(0) ack 234555646 win 65535
08:06:47.810051 IP (tos 0x0, ttl 64, id 13999, offset 0, flags [DF], proto TCP (6), length 576) dione.39201 > phoenix.distcc: . 234620645:234621181(536) ack 2272574194 win 5840
08:06:47.810065 IP (tos 0x0, ttl 64, id 27390, offset 0, flags [DF], proto TCP (6), length 40) phoenix.distcc > dione.39201: ., cksum 0x9900 (correct), 2272574194:2272574194(0) ack 234555646 win 65535
08:08:47.829813 IP (tos 0x0, ttl 64, id 14000, offset 0, flags [DF], proto TCP (6), length 576) dione.39201 > phoenix.distcc: . 234555646:234556182(536) ack 2272574194 win 5840
08:08:47.829909 IP (tos 0x0, ttl 64, id 27391, offset 0, flags [DF], proto TCP (6), length 40) phoenix.distcc > dione.39201: ., cksum 0x96e8 (correct), 2272574194:2272574194(0) ack 234556182 win 65535
08:08:47.830009 IP (tos 0x0, ttl 64, id 14001, offset 0, flags [DF], proto TCP (6), length 576) dione.39201 > phoenix.distcc: . 234621181:234621717(536) ack 2272574194 win 5840
08:08:47.830026 IP (tos 0x0, ttl 64, id 27392, offset 0, flags [DF], proto TCP (6), length 40) phoenix.distcc > dione.39201: ., cksum 0x96e8 (correct), 2272574194:2272574194(0) ack 234556182 win 65535
08:10:47.849756 IP (tos 0x0, ttl 64, id 14002, offset 0, flags [DF], proto TCP (6), length 576) dione.39201 > phoenix.distcc: . 234556182:234556718(536) ack 2272574194 win 5840
08:10:47.849845 IP (tos 0x0, ttl 64, id 27393, offset 0, flags [DF], proto TCP (6), length 40) phoenix.distcc > dione.39201: ., cksum 0x94d0 (correct), 2272574194:2272574194(0) ack 234556718 win 65535
08:10:47.849936 IP (tos 0x0, ttl 64, id 14003, o...
No further traffic is generated for this connection?
For a problem like this, you really need to dump on both sides
to get the full picture. Otherwise it's difficult to tell why
we have a gap after 234561792 which is never filled in by dione.BTW any reason why you appear to have SACK off?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
well, i dumped on both sides and posted the result (in my first mail),
default distro setup, Fedora 9 and Fedora 8. No extra tweaking that i'm
aware of.Fortunately this is my lucky day and i just triggered an half-incident,
a "half hang". The connection was stuck for about 40 minutes - and 2
minutes after i started tracing it on both sides it got unstuck. If the
stall is shorter than that i'd normally not notice it. (except for a
drop in testing throughput)Server (phoenix) and client (dione) dumps attached.
Note, these hung TCP problems started after i upgraded the server's
kernel to the big networking merge. Somewhere in the 24 hours preceding
my first report:Date: Tue, 22 Jul 2008 13:21:33 +0200
I upgrade the server's kernel daily, so the regression window is pretty
accurate. So the prime suspect is one of these 1232 commits:| commit db6d8c7a4027b48d797b369a53f8470aaeed7063
| Merge: 3a53337... fb65a7c...
| Author: Linus Torvalds <torvalds@linux-foundation.org>
| Date: Sun Jul 20 17:43:29 2008 -0700
|
| Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
|
| * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits)i.e.:
last known good kernel is: v2.6.26-3465-g5b664cb
first known broken kernel is: v2.6.26-4811-gdb6d8c7
last tested (still broken) kernel is: v2.6.26-6077-gc010b2fStill not reproducible enough to bisect it.
Thanks,
Ingo
------------------>
[traced on the client, dione]10:03:23.590481 IP (tos 0x0, ttl 64, id 46098, offset 0, flags [DF], proto TCP (6), length 576) dione.40344 > phoenix.distcc: ., cksum 0x1855 (incorrect (-> 0xb6ce), 1979565914:1979566450(536) ack 2260258133 win 5840
10:03:23.590593 IP (tos 0x0, ttl 64, id 19850, offset 0, flags [DF], proto TCP (6), length 40) phoenix.distcc > dione.40344: ., cksum 0x5a6b (correct), 2260258133:2260258133(0) ack 1979566450 win 65535
10:03:23.590605 IP (tos 0x0, ttl 64, id 46099...
From: Ingo Molnar <mingo@elte.hu>
It's probably a side effect of their turning on syncookies in /etc/sysctl.conf :-/
I thought we added code that would allow enabling those options
even with syncookies enabled? There is code in net/ipv4/syncookies.c
function cookie_check_timestamp() to handle this.Of course this is a recent addition, so if either side doesn't
have that change.
--
Given the TTLs, it looks to me like both are on the same LAN. Also,
and also MSS is at the lowest value (536), maybe because of numerous
losses on large segments ?Willy
--
good. The sequence numbers from dione to phoenix are bouncing back and
forth because a big data chunk was lost, so dione is alternatively
sending and old segment (retransmit) which, when ACKed, allows it tohere it looks like dione believes it does not have the chunk starting
at 234561792. Maybe it moved its window too far ? It only sends theAnd phoenix insists on getting what is missing from the window.
So dione is wrong here.
Regards,
Willy--
hm, but dione has been running the same 2.6.24.7-92.fc8 kernel for a
long time:/var/log/yum.log-20080528:May 27 23:44:36 Installed: kernel - 2.6.24.7-92.fc8.x86_64
and has 6 days uptime at the moment and no trouble communicating with
any other buildbox - except the [only] one running latest -git.Ingo
--
and indeed i got the new warning:
[ 24.460170] ieee1394: selfid_complete called with successful SelfID stage ... irm_id: 0xFFC0 node_id: 0xFFC0
[ 24.460170] ieee1394: NodeMgr: Processing reset for host 0
[ 24.460170] ------------[ cut here ]------------
[ 24.460170] WARNING: at net/core/dev.c:1330 __netif_schedule+0xc6/0xe0()
[ 24.460170] Pid: 6, comm: events/0 Not tainted 2.6.26-tip-05989-g5172ce5-dirty #13279
[ 24.460170]
[ 24.460170] Call Trace:
[ 24.460170] <IRQ> [<ffffffff8024461a>] warn_on_slowpath+0x5a/0x90
[ 24.460170] [<ffffffff80264c4d>] ? trace_hardirqs_off+0xd/0x10
[ 24.460170] [<ffffffff80ba98ca>] ? _spin_unlock_irqrestore+0x4a/0x70
[ 24.460170] [<ffffffff8088b501>] ? ether1394_reset_priv+0x121/0x140
[ 24.460170] [<ffffffff809f5db6>] __netif_schedule+0xc6/0xe0
[ 24.460170] [<ffffffff8088b62a>] ether1394_host_reset+0x10a/0x120
[ 24.460170] [<ffffffff808719fe>] highlevel_host_reset+0x3e/0x80
[ 24.460170] [<ffffffff8086fe7c>] hpsb_selfid_complete+0x25c/0x330
[ 24.460170] [<ffffffff8087e1eb>] ohci_irq_handler+0x82b/0xbc0
[ 24.460170] [<ffffffff80276e4b>] handle_IRQ_event+0x3b/0x70
[ 24.460170] [<ffffffff80278de9>] handle_level_irq+0xa9/0x130
[ 24.460170] [<ffffffff80210d98>] do_IRQ+0xc8/0x1b0
[ 24.460170] [<ffffffff8020ce13>] ret_from_intr+0x0/0x2e
[ 24.460170] <EOI> [<ffffffff80ba98df>] ? _spin_unlock_irqrestore+0x5f/0x70
[ 24.460170] [<ffffffff80879bd6>] ? set_phy_reg+0x86/0xd0
[ 24.460170] [<ffffffff80871000>] ? delayed_reset_bus+0x0/0xf0
[ 24.460170] [<ffffffff8087a3c3>] ? ohci_devctl+0x153/0x240
[ 24.460170] [<ffffffff8086e762>] ? hpsb_reset_bus+0x22/0x30
[ 24.460170] [<ffffffff808710ac>] ? delayed_reset_bus+0xac/0xf0
[ 24.460170] [<ffffffff802562f9>] ? run_workqueue+0x179/0x220
[ 24.460170] [<ffffffff80256f04>] ? worker_thread+0xa4/0x110
[ 24.46017...
some more information: find below the same crash with vanilla
linus/master and no extra patches. The crash site is:(gdb) list *0xffffffff808be0c2
0xffffffff808be0c2 is in rollback_registered (net/core/dev.c:3793).
3788 {
3789 BUG_ON(dev_boot_phase);
3790 ASSERT_RTNL();
3791
3792 /* Some devices call without registering for initialization unwind. */
3793 if (dev->reg_state == NETREG_UNINITIALIZED) {
3794 printk(KERN_DEBUG "unregister_netdevice: device %s/%p never "
3795 "was registered\n", dev->name, dev);
3796
3797 WARN_ON(1);
(gdb)Thanks,
Ingo
----------------------------------->
Linux version 2.6.26-05253-g14b395e (mingo@dione) (gcc version 4.2.3) #21308 SMP Mon Jul 21 16:14:51 CEST 2008
Command line: root=/dev/sda1 earlyprintk=vga console=ttyS0,115200 console=tty 5 profile=0 debug initcall_debug apic=debug apic=verbose ignore_loglevel sysrq_always_enabled pci=nomsi
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000003ed94000 (usable)
BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS)
BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable)
BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS)
BIOS-e820: 000000003fee9000 - 000000003feed000 (usable)
BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data)
BIOS-e820: 000000003feff000 - 000000003ff00000 (usable)
KERNEL supported cpus:
Intel GenuineIntel
AMD AuthenticAMD
Centaur CentaurHauls
console [earlyvga0] enabled
debug: ignoring loglevel setting.
last_pfn = 0x3ff00 max_arch_pfn = 0x3ffffffff
init_memory_mapping
0000000000 - 003fe00000 page 2M
003fe00000 - 003ff00000 page 4k
kernel direct mapping tables up to 3ff00000 @ 8000-b000
last_map_addr: 3ff00000 end: ...
From: Ingo Molnar <mingo@elte.hu>
Johannes/wireless-folks, can you take a look at this?
a 32-bit testbox just triggered the same crash too:
calling init_mac80211_hwsim+0x0/0x310
mac80211_hwsim: Initializing radio 0
phy0: Failed to select rate control algorithm
phy0: Failed to initialize rate control algorithm
mac80211_hwsim: ieee80211_register_hw failed (-2)
BUG: unable to handle kernel NULL pointer dereference at 00000298
IP: [<c06efb98>] rollback_registered+0x28/0x120
*pdpt = 0000000000bc9001 *pde = 0000000000000000
Oops: 0000 [#1] PREEMPT SMPand that system has no wireless so i guess it's just some unregister
inbalance kind of init/deinit buglet.Ingo
--
From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 21 Jul 2008 17:04:48 +0200[ Adding linux-wireless CC, again, Ingo please retain it for
--
I was looking at this out of interest, but I'm in no way familiar with the code.
Looks to me that the error handling code in mac80211_hwsim is awkward. Which
leads to it calling ieee80211_unregister_hw even when ieee80211_register_hw failed.The function has a for loop where it generates all simulated radios. when something
fails, the error handling will call mac80211_hwsim_free which frees all simulated radios
who's pointer isn't zero. However the information stored is insufficient to determine
whether or not the call to ieee80211_register_hw succeeded or not for a specific radio.
The included patch makes init_mac80211_hwsim clean up the current simulated radio,
and then calls into mac80211_hwsim_free to clean up all the radios that did succeed.This however doesn't explain why the rate control registration failed.. build tested this,
but had some problems reproducing the original problem.signed-off-by: Ian Schram <ischram@telenet.be>
--- a/mac80211_hwsim.c 2008-07-21 18:48:38.000000000 +0200
+++ b/mac80211_hwsim.c 2008-07-21 19:31:44.000000000 +0200
@@ -364,8 +364,7 @@ static void mac80211_hwsim_free(void)
struct mac80211_hwsim_data *data;
data = hwsim_radios[i]->priv;
ieee80211_unregister_hw(hwsim_radios[i]);
- if (!IS_ERR(data->dev))
- device_unregister(data->dev);
+ device_unregister(data->dev);
ieee80211_free_hw(hwsim_radios[i]);
}
}
@@ -437,7 +436,7 @@ static int __init init_mac80211_hwsim(vo
"mac80211_hwsim: device_create_drvdata "
"failed (%ld)\n", PTR_ERR(data->dev));
err = -ENOMEM;
- goto failed;
+ goto failed_drvdata;
}
data->dev->driver = &mac80211_hwsim_driver;@@ -461,7 +460,7 @@ static int __init init_mac80211_hwsim(vo
if (err < 0) {
printk(KERN_DEBUG "mac80211_hwsim: "
"ieee80211_register_hw failed (%d)\n", err);
- goto failed;
+ goto failed_hw;
}printk(KERN_DEBUG "%s: hwaddr %s registered\n",
@@ -479,9 ...
thanks Ian for the patch, i'll test it.
Note that it was whitespace damaged, find below a tidied up version of
the patch that i've applied to tip/out-of-tree.Ingo
----------------------->
commit 2f77dd3a3b5c3a27298fa0a09d8703c09c633fc6
Author: Ian Schram <ischram@telenet.be>
Date: Mon Jul 21 20:18:25 2008 +0200mac80211_hwsim.c: fix: BUG: unable to handle kernel NULL pointer dereference at 0000000000000370
I was looking at this out of interest, but I'm in no way familiar with
the code.Looks to me that the error handling code in mac80211_hwsim is awkward.
Which leads to it calling ieee80211_unregister_hw even when
ieee80211_register_hw failed.The function has a for loop where it generates all simulated radios.
when something fails, the error handling will call mac80211_hwsim_free
which frees all simulated radios who's pointer isn't zero. However the
information stored is insufficient to determine whether or not the call
to ieee80211_register_hw succeeded or not for a specific radio. The
included patch makes init_mac80211_hwsim clean up the current simulated
radio, and then calls into mac80211_hwsim_free to clean up all the
radios that did succeed.This however doesn't explain why the rate control registration failed..
build tested this, but had some problems reproducing the original
problem.Signed-off-by: Ian Schram <ischram@telenet.be>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/net/wireless/mac80211_hwsim.c | 18 ++++++++++++------
1 files changed, 12 insertions(+), 6 deletions(-)diff --git a/drivers/net/wireless/mac80211_hwsim.c b/drivers/net/wireless/mac80211_hwsim.c
index 913dc9f..5816230 100644
--- a/drivers/net/wireless/mac80211_hwsim.c
+++ b/drivers/net/wireless/mac80211_hwsim.c
@@ -364,8 +364,7 @@ static void mac80211_hwsim_free(void)
struct mac80211_hwsim_data *data;
data = hwsim_radios[i]->priv;
...
This patch may be needed to fix error handling in the hw_sim code, but I get the
crash even with that code disabled. I'm currently bisecting to find the culprit.Larry
--
ok. I just reactivated CONFIG_MAC80211_HWSIM, applied Ian's fix and the
crash went away:calling iwl4965_init+0x0/0x6c
iwl4965: Intel(R) Wireless WiFi Link 4965AGN driver for Linux, 1.3.27kd
iwl4965: Copyright(c) 2003-2008 Intel Corporation
initcall iwl4965_init+0x0/0x6c returned 0 after 10 msecs
calling init_mac80211_hwsim+0x0/0x31c
mac80211_hwsim: Initializing radio 0
PM: Adding info for No Bus:hwsim0
PM: Adding info for No Bus:phy0
PM: Adding info for No Bus:wmaster0
phy0: Failed to select rate control algorithm
phy0: Failed to initialize rate control algorithm
PM: Removing info for No Bus:wmaster0
PM: Removing info for No Bus:phy0
mac80211_hwsim: ieee80211_register_hw failed (-2)
PM: Removing info for No Bus:hwsim0
initcall init_mac80211_hwsim+0x0/0x31c returned -2 after 58 msecs
initcall init_mac80211_hwsim+0x0/0x31c returned with error code -2
calling dmfe_init_module+0x0/0xea
dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17)
initcall dmfe_init_module+0x0/0xea returned 0 after 5 msecsSo at least as far as the init_mac80211_hwsim() deinit crash goes:
Tested-by: Ingo Molnar <mingo@elte.hu>
Ingo
--
Yes, I'm chasing a distinct bug. The header for mine is
Jul 21 12:19:37 larrylap kernel: kernel BUG at net/core/dev.c:1328!
Jul 21 12:19:37 larrylap kernel: invalid opcode: 0000 [1] SMP
Jul 21 12:19:37 larrylap kernel: CPU 0
Jul 21 12:19:37 larrylap kernel: Modules linked in: af_packet rfkill_input nfs
lockd nfs_acl sunrpc cpufreq_conservative cpu
freq_userspace cpufreq_powersave powernow_k8 fuse loop dm_mod arc4 ecb
crypto_blkcipher b43 firmware_class rfkill mac80211 c
fg80211 snd_hda_intel snd_pcm snd_timer led_class snd k8temp input_polldev
sr_mod soundcore button battery hwmon cdrom force
deth ac serio_raw ssb snd_page_alloc sg ehci_hcd sd_mod ohci_hcd usbcore edd fan
thermal processor ext3 mbcache jbd pata_amd
ahci libata scsi_mod dock
Jul 21 12:19:37 larrylap kernel: Pid: 2057, comm: b43 Not tainted
2.6.26-Linus-git-05253-g14b395e #1
Jul 21 12:19:37 larrylap kernel: RIP: 0010:[<ffffffff8039ec4d>]
[<ffffffff8039ec4d>] __netif_schedule+0x12/0x75
Jul 21 12:19:37 larrylap kernel: RSP: 0000:ffff8800b9ae1de0 EFLAGS: 00010246With an invalid opcode, mine is likely due to stack corruption.
Larry
--
From: Larry Finger <Larry.Finger@lwfinger.net>
No further backtrace? That will tell us what driver is causing
this.--
Yes, I have a full backtrace.
It starts with possible recursive locking in NetworkManager, and goes directly
into the Warning - this came from a later pull of Linus's tree.Jul 21 15:11:07 larrylap kernel: [ INFO: possible recursive locking detected ]
Jul 21 15:11:07 larrylap kernel: 2.6.26-Linus-git-05614-ge89970a #8
Jul 21 15:11:07 larrylap kernel: ---------------------------------------------
Jul 21 15:11:07 larrylap kernel: NetworkManager/2661 is trying to acquire lock:
Jul 21 15:11:07 larrylap kernel: (&dev->addr_list_lock){-...}, at:
[<ffffffff803a2961>] dev_mc_sync+0x19/0x57
Jul 21 15:11:07 larrylap kernel:
Jul 21 15:11:07 larrylap kernel: but task is already holding lock:
Jul 21 15:11:07 larrylap kernel: (&dev->addr_list_lock){-...}, at:
[<ffffffff8039e7c5>] dev_set_rx_mode+0x19/0x2e
Jul 21 15:11:07 larrylap kernel:
Jul 21 15:11:07 larrylap kernel: other info that might help us debug this:
Jul 21 15:11:07 larrylap kernel: 2 locks held by NetworkManager/2661:
Jul 21 15:11:07 larrylap kernel: #0: (rtnl_mutex){--..}, at:
[<ffffffff803a8318>] rtnetlink_rcv+0x12/0x27
Jul 21 15:11:07 larrylap kernel: #1: (&dev->addr_list_lock){-...}, at:
[<ffffffff8039e7c5>] dev_set_rx_mode+0x19/0x2e
Jul 21 15:11:07 larrylap kernel:
Jul 21 15:11:07 larrylap kernel: stack backtrace:
Jul 21 15:11:07 larrylap kernel: Pid: 2661, comm: NetworkManager Not tainted
2.6.26-Linus-git-05614-ge89970a #8
Jul 21 15:11:07 larrylap kernel:
Jul 21 15:11:07 larrylap kernel: Call Trace:
Jul 21 15:11:07 larrylap kernel: [<ffffffff80251b02>] __lock_acquire+0xb7b/0xecc
Jul 21 15:11:07 larrylap kernel: [<ffffffff80251ea4>] lock_acquire+0x51/0x6a
Jul 21 15:11:07 larrylap kernel: [<ffffffff803a2961>] dev_mc_sync+0x19/0x57
Jul 21 15:11:07 larrylap kernel: [<ffffffff80408f9c>] _spin_lock_bh+0x23/0x2c
Jul 21 15:11:07 larrylap kernel: [<ffffffff803a2961>] dev_mc_sync+0x19/0x57
Jul 21 15:11:07 larrylap kernel: [<ffffffff8039e7...
From: Larry Finger <Larry.Finger@lwfinger.net>
That helps a lot, I'm looking at this now.
Thanks.
--
not sure whether it got reported already, but -tip testing triggered
this new ATA over Ethernet lockdep warning on latest -git:=================================
[ INFO: inconsistent lock state ]
2.6.26-tip-00346-gb4d3941-dirty #13925
---------------------------------
inconsistent {softirq-on-W} -> {in-softirq-W} usage.
swapper/0 [HC0[0]:SC1[2]:HE1:SE0] takes:
(noop_qdisc.q.lock){-+..}, at: [<ffffffff80880ba8>] dev_queue_xmit+0xd8/0x520
{softirq-on-W} state was registered at:
[<ffffffff80266517>] __lock_acquire+0x3a7/0x10e0
[<ffffffff802672a7>] lock_acquire+0x57/0x80
[<ffffffff80a080da>] _spin_lock+0x2a/0x40
[<ffffffff808a141f>] shutdown_scheduler_queue+0x3f/0x60
[<ffffffff808a1481>] dev_shutdown+0x41/0x90
[<ffffffff8087e9da>] rollback_registered+0x6a/0x100
[<ffffffff8087ea92>] unregister_netdevice+0x22/0x80
[<ffffffff80999bcc>] ieee80211_register_hw+0x2ec/0x3b0
[<ffffffff80eb0dbc>] init_mac80211_hwsim+0x1bc/0x370
[<ffffffff80209045>] do_one_initcall+0x45/0x180
[<ffffffff80e7fb75>] kernel_init+0x1d5/0x2f0
[<ffffffff8020d599>] child_rip+0xa/0x11
[<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 326488
hardirqs last enabled at (326488): [<ffffffff80265cbd>] trace_hardirqs_on+0xd/0x10
hardirqs last disabled at (326487): [<ffffffff802642ed>] trace_hardirqs_off+0xd/0x10
softirqs last enabled at (326450): [<ffffffff80248462>] __do_softirq+0x102/0x120
softirqs last disabled at (326461): [<ffffffff8020d86c>] call_softirq+0x1c/0x30other info that might help us debug this:
1 lock held by swapper/0:
#0: (rcu_read_lock){..--}, at: [<ffffffff80880b2b>] dev_queue_xmit+0x5b/0x520stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.26-tip-00346-gb4d3941-dirty #13925Call Trace:
<IRQ> [<ffffffff80264c6b>] print_usage_bug+0x18b/0x190
[<ffffffff80265a67>] mark_lock+0x517/0x570
[<ffffffff802664ce>] __lo...
From: Ingo Molnar <mingo@elte.hu>
Thanks for the report, does this fix it?
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 4ac7e3a..43abd4d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -736,9 +736,9 @@ static void shutdown_scheduler_queue(struct net_device *dev,
dev_queue->qdisc = qdisc_default;
dev_queue->qdisc_sleeping = qdisc_default;- spin_lock(root_lock);
+ spin_lock_bh(root_lock);
qdisc_destroy(qdisc);
- spin_unlock(root_lock);
+ spin_unlock_bh(root_lock);
}
}--
i've tested it and the warning went away - thanks David!
Tested-by: Ingo Molnar <mingo@elte.hu>
Ingo
--
From: Ingo Molnar <mingo@elte.hu>
Thanks for testing.
--
I'm guessing this needs similar lockdep class initializations
to _xmit_lock since it essentially has the same nesting rules.--
From: Patrick McHardy <kaber@trash.net>
Yes, I figured that out just now :-)
Maybe something like the following should do it?
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 9737c06..a641eea 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5041,6 +5041,7 @@ static int bond_check_params(struct bond_params *params)
}static struct lock_class_key bonding_netdev_xmit_lock_key;
+static struct lock_class_key bonding_netdev_addr_lock_key;static void bond_set_lockdep_class_one(struct net_device *dev,
struct netdev_queue *txq,
@@ -5052,6 +5053,8 @@ static void bond_set_lockdep_class_one(struct net_device *dev,static void bond_set_lockdep_class(struct net_device *dev)
{
+ lockdep_set_class(&dev->addr_list_lock,
+ &bonding_netdev_addr_lock_key);
netdev_for_each_tx_queue(dev, bond_set_lockdep_class_one, NULL);
}diff --git a/drivers/net/hamradio/bpqether.c b/drivers/net/hamradio/bpqether.c
index b6500b2..58f4b1d 100644
--- a/drivers/net/hamradio/bpqether.c
+++ b/drivers/net/hamradio/bpqether.c
@@ -123,6 +123,7 @@ static LIST_HEAD(bpq_devices);
* off into a separate class since they always nest.
*/
static struct lock_class_key bpq_netdev_xmit_lock_key;
+static struct lock_class_key bpq_netdev_addr_lock_key;static void bpq_set_lockdep_class_one(struct net_device *dev,
struct netdev_queue *txq,
@@ -133,6 +134,7 @@ static void bpq_set_lockdep_class_one(struct net_device *dev,static void bpq_set_lockdep_class(struct net_device *dev)
{
+ lockdep_set_class(&dev->addr_list_lock, &bpq_netdev_addr_lock_key);
netdev_for_each_tx_queue(dev, bpq_set_lockdep_class_one, NULL);
}diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index efbc155..4239450 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -276,6 +276,7 @@ static int macvlan_change_mtu(struct net_device *dev, int new_mtu)
* separate cla...
It looks correct in any case. I'm not sure whether it fixes
this lockdep warning though, according to the backtrace and
module list its b43 and dev_mc_sync in net/mac80211/main.c
that are causing the error, which don't seem to be included
in your patch. I'm unable to find where it previously
initialized the xmit_lock lockdep class though, so I must
be missing something :)--
This is what I was missing, we're setting a lockdep class
by default depending on dev->type. This patch combined
with yours should fix all addr_list_lock warnings.
No cigar yet. I tried davem's patch first, then yours on top of his. I still get
both the recursive locking and the kernel warning.BTW, wireless doesn't work but if I plug in the wire, then networking is OK. It
seems to be in mac80211, which is strange because I routinely run the latest
wireless-testing kernel, and all the wireless bits should be there already.I'm still plugging away at the bisection. I think I got away from the kernel
that won't build.Larry
--
Does this one earn me my cigar? :)
Sorry :(
I used the davem patch, the second version of your first one, and your second
one. Both problems persist.Still plugging away on bisection.
Larry
--
From: Larry Finger <Larry.Finger@lwfinger.net>
GIT bisecting the lockdep problem is surely going the land you on:
commit e308a5d806c852f56590ffdd3834d0df0cbed8d7
Author: David S. Miller <davem@davemloft.net>
Date: Tue Jul 15 00:13:44 2008 -0700netdev: Add netdev->addr_list_lock protection.
Add netif_addr_{lock,unlock}{,_bh}() helpers.
Use them to protect operations that operate on or read
the network device unicast and multicast address lists.Also use them in cases where the code simply wants to
block calls into the driver's ->set_rx_mode() and
->set_multicast_list() methods.Signed-off-by: David S. Miller <davem@davemloft.net>
--
No. It landed on this one.
37437bb2e1ae8af470dfcd5b4ff454110894ccaf is first bad commit
commit 37437bb2e1ae8af470dfcd5b4ff454110894ccaf
Author: David S. Miller <davem@davemloft.net>
Date: Wed Jul 16 02:15:04 2008 -0700pkt_sched: Schedule qdiscs instead of netdev_queue.
When we have shared qdiscs, packets come out of the qdiscs
for multiple transmit queues.Therefore it doesn't make any sense to schedule the transmit
queue when logically we cannot know ahead of time the TX
queue of the SKB that the qdisc->dequeue() will give us.Just for sanity I added a BUG check to make sure we never
get into a state where the noop_qdisc is scheduled.4854d4f4df9726a2e8837037f82bde807bed2ede M net
Larry
--
From: Larry Finger <Larry.Finger@lwfinger.net>
For the lockdep warnings?
--
When I was just one commit later, I got both the lockdep warning and the BUG.
This is the commit in question.commit 16361127ebed0fb8f9d7cc94c6e137eaf710f676
Author: David S. Miller <davem@davemloft.net>
Date: Wed Jul 16 02:23:17 2008 -0700pkt_sched: dev_init_scheduler() does not need to lock qdisc tree.
We are registering the device, there is no way anyone can get
at this object's qdiscs yet in any meaningful way.Signed-off-by: David S. Miller <davem@davemloft.net>
Larry
--
David and Patrick,
Here is the latest on this problem.
I pulled from Linus's tree this morning and now have git-05752-g93ded9b. The
kernel WARNING from __netif_schedule and the lockdep warning are present with or
without the patches from yesterday.As I stated earlier, the kernel WARNING (it was a BUG then) was introduced in
commit 37437bb2 when the BUG statement was entered.The lockdep warning started with the next commit (16361127).
I am not using any network traffic shaping. Is it correct that the faulty
condition is not that q == &noop_qdisc, but that __netif_schedule was called
when that condition exists?The lockdep warning is:
=============================================
[ INFO: possible recursive locking detected ]
2.6.26-Linus-git-05752-g93ded9b #49
---------------------------------------------
NetworkManager/2611 is trying to acquire lock:
(&dev->addr_list_lock){-...}, at: [<ffffffff803a2ad1>] dev_mc_sync+0x19/0x57but task is already holding lock:
(&dev->addr_list_lock){-...}, at: [<ffffffff8039e909>] dev_set_rx_mode+0x19/0x2eother info that might help us debug this:
2 locks held by NetworkManager/2611:
#0: (rtnl_mutex){--..}, at: [<ffffffff803a8488>] rtnetlink_rcv+0x12/0x27
#1: (&dev->addr_list_lock){-...}, at: [<ffffffff8039e909>]
dev_set_rx_mode+0x19/0x2estack backtrace:
Pid: 2611, comm: NetworkManager Not tainted 2.6.26-Linus-git-05752-g93ded9b #49Call Trace:
[<ffffffff80251b02>] __lock_acquire+0xb7b/0xecc
[<ffffffff80251ea4>] lock_acquire+0x51/0x6a
[<ffffffff803a2ad1>] dev_mc_sync+0x19/0x57
[<ffffffff8040b3fc>] _spin_lock_bh+0x23/0x2c
[<ffffffff803a2ad1>] dev_mc_sync+0x19/0x57
[<ffffffff8039e911>] dev_set_rx_mode+0x21/0x2e
[<ffffffff803a04da>] dev_open+0x8e/0xb0
[<ffffffff8039fe84>] dev_change_flags+0xa6/0x163
[<ffffffff803a7591>] do_setlink+0x286/0x349
[<ffffffff803a849d>] rtnetlink_rcv_ms...
^^^^^^^^^^^^^^ dirty?
This kernel is not using the patches we sent.
--
No, but they didn't make any difference. I tried with all 3 applied, then backed
them out one by one. That was the state when I posted before.Here are the dumps with all 3 patches applied:
=============================================
[ INFO: possible recursive locking detected ]
2.6.26-Linus-05752-g93ded9b-dirty #53
---------------------------------------------
b43/1997 is trying to acquire lock:
(_xmit_IEEE80211#2){-...}, at: [<ffffffffa028f322>]
ieee80211_scan_completed+0x130/0x2e1 [mac80211]but task is already holding lock:
(_xmit_IEEE80211#2){-...}, at: [<ffffffffa028f322>]
ieee80211_scan_completed+0x130/0x2e1 [mac80211]other info that might help us debug this:
3 locks held by b43/1997:
#0: ((name)){--..}, at: [<ffffffff80245185>] run_workqueue+0xa7/0x1f2
#1: (&(&local->scan_work)->work){--..}, at: [<ffffffff80245185>]
run_workqueue+0xa7/0x1f2
#2: (_xmit_IEEE80211#2){-...}, at: [<ffffffffa028f322>]
ieee80211_scan_completed+0x130/0x2e1 [mac80211]stack backtrace:
Pid: 1997, comm: b43 Not tainted 2.6.26-Linus-05752-g93ded9b-dirty #53Call Trace:
[<ffffffff80255616>] __lock_acquire+0xb7b/0xecc
[<ffffffff8040c9a0>] __mutex_unlock_slowpath+0x100/0x10b
[<ffffffff802559b8>] lock_acquire+0x51/0x6a
[<ffffffffa028f322>] ieee80211_scan_completed+0x130/0x2e1 [mac80211]
[<ffffffff8040dc08>] _spin_lock+0x1e/0x27
[<ffffffffa028f322>] ieee80211_scan_completed+0x130/0x2e1 [mac80211]
[<ffffffffa028f6ce>] ieee80211_sta_scan_work+0x0/0x1b8 [mac80211]
[<ffffffff802451ce>] run_workqueue+0xf0/0x1f2
[<ffffffff802453ab>] worker_thread+0xdb/0xea
[<ffffffff80248a5f>] autoremove_wake_function+0x0/0x2e
[<ffffffff802452d0>] worker_thread+0x0/0xea
[<ffffffff80248731>] kthread+0x47/0x73
[<ffffffff8040d7b1>] trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8020ceb9>] child_rip+0xa/0x11
[<ffffffff8020c4ef>] restore...
From: Larry Finger <Larry.Finger@lwfinger.net>
Lockdep doesn't like that we have an array of objects (the TX queues)
and we're iterating over them grabbing all of their locks.Does anyone know how to teach lockdep that this is OK?
--
I guess, David Miller knows...:
http://permalink.gmane.org/gmane.linux.network/99784
Jarek P.
PS: if there is nothing new in lockdep the classical method would
be to change this static array:static struct lock_class_key
netdev_xmit_lock_key[ARRAY_SIZE(netdev_lock_type)];to
static struct lock_class_key
netdev_xmit_lock_key[ARRAY_SIZE(netdev_lock_type)][MAX_NUM_TX_QUEUES];and set lockdep classes per queue as well. (If we are sure we don't
need lockdep subclasses anywhere this could be optimized by using
one lock_class_key per 8 queues and spin_lock_nested()).
--
From: Jarek Poplawski <jarkao2@gmail.com>
Unfortunately MAX_NUM_TX_QUEUES is USHORT_MAX, so this isn't really
a feasible approach.spin_lock_nested() isn't all that viable either, as the subclass
limit is something like 8.--
Is it used by real devices already? Maybe for the beginning we could
This method would need to do some additional counting: depending of
a queue number each 8 subsequent queues share (are set to) the same
class and their number mod 8 gives the subqueue number for
spin_lock_nested().I'll try to find if there is something new around this in lockdep.
(lockdep people added to CC.)Jarek P.
--
There isn't.
Is there a static data structure that the driver needs to instantiate to
'create' a queue? Something like:/* this imaginary e1000 hardware has 16 hardware queues */
static struct net_tx_queue e1000e_tx_queues[16];In that case you can stick the key in there and do:
int e1000e_init_tx_queue(struct net_tx_queue *txq)
{
...spin_lock_init(&txq->tx_lock);
lockdep_set_class(&txq->tx_lock, &txq->tx_lock_key);...
}( This is what the scheduler runqueues also do )
--
I guess, not.
Then, IMHO, we could be practical and simply skip lockdep validation
for "some" range of locks: e.g. to set the table for the first 256
queues only, and to do e.g. __raw_spin_lock() for bigger numbers. (If
there are any bad locking patterns this should be enough for checking.)Jarek P.
--
definite NAK on using raw spinlock ops...
I'll go look at this multiqueue stuff to see if there is anything to be
done.. David, what would be a good commit to start reading?--
On Wed, Jul 23, 2008 at 11:50:14AM +0200, Peter Zijlstra wrote:
...In the case David ever sleeps: I think, the current Linus' git is
good enough (the problem is with netdevice.h: netif_tx_lock()).Jarek P.
--
Ah, right,...
that takes a whole bunch of locks at once..
Is that really needed? - when I grep for its usage its surprisingly few
drivers using it and even less generic code.When I look at the mac802.11 code in ieee80211_tx_pending() it looks
like it can do with just one lock at a time, instead of all - but I
might be missing some obvious details.So I guess my question is, is netif_tx_lock() here to stay, or is the
right fix to convert all those drivers to use __netif_tx_lock() which
locks only a single queue?--
From: Peter Zijlstra <peterz@infradead.org>
It's staying.
It's trying to block all potential calls into the ->hard_start_xmit()
method of the driver, and the only reliable way to do that is to take
all the TX queue locks. And in one form or another, we're going to
have this "grab/release all the TX queue locks" construct.--
Hi David,
I'm sure as hell, I miss sth. but can't it be done by this pseudo-code:
netif_tx_lock(device)
{
mutex_lock(device->queue_entry_mutex);
foreach_queue_entries(queue, device->queues)
{
spin_lock(queue->tx_lock);
set_noop_tx_handler(queue);
spin_unlock(queue->tx_lock);
}
mutex_unlock(device->queue_entry_mutex);
}netif_tx_unlock(device)
{
mutex_lock(device->queue_entry_mutex);
foreach_queue_entries(queue, device->queues)
{
spin_lock(queue->tx_lock);
set_useful_tx_handler(queue);
spin_unlock(queue->tx_lock);
}
mutex_unlock(device->queue_entry_mutex);
}Then protect use of the queues by queue->tx_lock in transmit path.
The first setup of the queue doesn't need to be protected, since no-one
knows the device. The final cleanup of the device doesn't need to be
protected either, because netif_tx_lock() and netif_tx_unlock() should
not be called after entering the final cleanup.Some VM locking works this way...
Best Regards
Ingo Oeser
--
On Fri, Jul 25, 2008 at 07:04:36PM +0200, Ingo Oeser wrote:
...And I really doubt it can't be done like this.
--
Umm, of course it cannot, because then we'd have to take the mutex in
the TX path, which we cannot. We cannot have another lock in the TX
path, what's so hard to understand about? We need to be able to lock all
queues to lock out multiple tx paths at once in some (really) slow paths
but not have any extra lock overhead for the tx path, especially not a
single lock.johannes
But this mutex doesn't have to be mutex. And it's not for the tx path,
only for "service" just like netif_tx_lock(). The fast path needs only
queue->tx_lock.Jarek P.
--
No, we need to be able to lock out multiple TX paths at once.
johannes
IMHO, it can do the same. We could e.g. insert a locked spinlock into
this noop_tx_handler(), to give everyone some waiting.Jarek P.
--
From: Jarek Poplawski <jarkao2@gmail.com>
I think there might be an easier way, but we may have
to modify the state bits a little.Every call into ->hard_start_xmit() is made like this:
1. lock TX queue
2. check TX queue stopped
3. call ->hard_start_xmit() if not stoppedThis means that we can in fact do something like:
unsigned int i;
for (i = 0; i < dev->num_tx_queues; i++) {
struct netdev_queue *txq;txq = netdev_get_tx_queue(dev, i);
spin_lock_bh(&txq->_xmit_lock);
netif_tx_freeze_queue(txq);
spin_unlock_bh(&txq->_xmit_lock);
}netif_tx_freeze_queue() just sets a new bit we add.
Then we go to the ->hard_start_xmit() call sites and check this new
"frozen" bit as well as the existing "stopped" bit.When we unfreeze each queue later, we see if it is stopped, and if not
we schedule it's qdisc for packet processing.A patch below shows how the guarding would work. It doesn't implement
the actual freeze/unfreeze.We need to use a side-state bit to do this because we don't
want this operation to get all mixed up with the queue waking
operations that the driver TX reclaim code will be doing
asynchronously.diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b4d056c..cba98fb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -440,6 +440,7 @@ static inline void napi_synchronize(const struct napi_struct *n)
enum netdev_queue_state_t
{
__QUEUE_STATE_XOFF,
+ __QUEUE_STATE_FROZEN,
};struct netdev_queue {
@@ -1099,6 +1100,11 @@ static inline int netif_queue_stopped(const struct net_device *dev)
return netif_tx_queue_stopped(netdev_get_tx_queue(dev, 0));
}+static inline int netif_tx_queue_frozen(const struct netdev_queue *dev_queue)
+{
+ return test_bit(__QUEUE_STATE_FROZEN, &dev_queue->state);
+}
+
/**
* netif_running - test if up
* @dev: network device
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index c127208..6c7af39 1...
On Sat, Jul 26, 2008 at 02:18:46AM -0700, David Miller wrote:
I guess some additional synchronization will be added yet to prevent
parallel freeze and especially unfreeze.Jarek P.
--
From: Jarek Poplawski <jarkao2@gmail.com>
Yes, that could be a problem. Using test_and_set_bit() can
guard the freezing sequence itself, but it won't handle
letting two threads of control freeze and unfreeze safely
without a reference count.We want this thing to be able to be used flexbly, which means
we can't just assume that this is a short code sequence and
the unfreeze will come quickly. That pretty much rules
out using a new lock around the operation or anything
like that.So I guess we could replace the state bit with a reference
count. It doesn't even need to be atomic since it is set
and tested under dev_queue->_xmit_lock--
Looks like enough to me. (Probably it could even share space with
the state.)Jarek P.
--
From: Jarek Poplawski <jarkao2@gmail.com>
So I made some progress on this, three things:
1) I remember why I choose a to use a bit in my design, it's so that
it does not increase the costs of the checks in the fast paths.
test_bit(X) && test_bit(Y) can be combined into a single test by
the compiler.2) We can't use the reference counting scheme, because we don't want
to let a second cpu into these protected code paths just because
another is in the middle of using a freeze too.3) So we can simply put a top-level TX spinlock around these things.
Therefore all the hot paths:
a) grab _xmit_lock
b) check XOFF and FROZEN
c) only call ->hard_start_xmit() if both bits are clearnetif_tx_lock() does:
1) grab netdev->tx_global_lock
2) for_each_tx_queue() {
lock(txq);
set_bit(FROZEN);
unlock(txq);
}and unlock does:
1) for_each_tx_queue() {
clear_bit(FROZEN);
if (!test_bit(XOFF))
__netif_schedule();
}
2) release netdev->tx_global_lockAnd this seems to satisfy all the constraints which are:
1) Must act like a lock and protect execution of the code path
which occurs inside of "netif_tx_{lock,unlock}()"2) Must ensure no cpus are executing inside of ->hard_start_xmit()
after netif_tx_lock() returns.3) Must not try to grab all the TX queue locks at once.
This top-level tx_global_lock also simplifies the freezing, as
it makes sure only one cpu is initiating or finishing a freeze
at any given time.I've also adjusted code that really and truly only wanted to
lock one queue at a time, which in particular was IFB and the
teql scheduler.It's late here, but I'll start testing the following patch on my
multiqueue capable cards after some sleep.diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 0960e69..e4fbefc 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -69,18 +69,20 @@ static void ri_tasklet(unsigned long dev)
struct net_device *_dev = (struct net_device *)dev;
s...
Alas I've some doubts here...
The comments in asm-x86/bitops.h to set_bit/clear_bit are rather queer
about reordering on non x86: isn't eg. smp_mb_before_clear_bit()This will probably need some lockdep annotations similar to
This thing is the most doubtful to me: before this patch callers would
wait on this lock. Now they take the lock without problems, check the
flags, and let to take this lock again, doing some re-queing in the
meantime.So, it seems HARD_TX_LOCK should rather do some busy looping now with
a trylock, and re-checking the _FROZEN flag. Maybe even this should
be done in __netif_tx_lock(). On the other hand, this shouldn't block
too much the owner of tx_global_lock() with taking such a lock.Jarek P.
--
From: Jarek Poplawski <jarkao2@gmail.com>
It doesn't matter, we need no synchronization here at all.
We unconditionally perform a __netif_schedule(), and that
will run the TX queue on the local cpu. We will take the
_xmit_lock at least once time if in fact the queue was notI highly doubt it. It will never be taken nested with another
device's instance.It is only ->hard_start_xmit() leading to another ->hard_start_xmit()
where this can currently happen, but tx_global_lock will not be used'ret' will be NETDEV_TX_BUSY in such a case (finding the queue
frozen), which will cause the while() loop in __qdisc_run() to
terminate.The freezer will unconditionally schedule a new __qdisc_run()
when it unfreezes the queue.Sure it's possible for some cpus to bang in and out of there
a few times, but that's completely harmless. And it can only
happen a few times since this freeze state is only held across
a critical section.--
...BTW, we probably could also consider some optimization here: the
xmit_lock of the first queue could be treated as special, and only
the owner could do such a freezing. This would save changes of
functionality to non mq devices. On the other hand, it would need
remembering about this special treatment (so, eg. a separate lockdep
initialization than all the others).Jarek P.
--
From: Jarek Poplawski <jarkao2@gmail.com>
I think special casing the zero's queue's lock is a bad idea.
Having a real top-level synchronizer is a powerful tool and
we could use it for other things.
--
Sure, if there is really no problem with lockdep here, there is no
need for this at all.Thanks for the explanations,
Jarek P.
--
From: David Miller <davem@davemloft.net>
As a quick followup, I tested this on a machine where I had
a multiqueue interface and could reproduce the lockdep warnings,
and the patch makes them go away.So I've pushed the patch into net-2.6 and will send it to Linus.
--
Thanks david!
--
Except for the braindead volatile that gets stuck on the bitops pointer.
Last time I complained about this, a lot of noise was made and I think
Linus wanted it to stay around so we could pass volatile pointers to
bitops & co without warnings. I say we should just remove the volatile
and kill any callers that might warn...
--
From: Nick Piggin <nickpiggin@yahoo.com.au>
Ho hum... :)
Another way to approach that, and keep the volatile, is to have
a "test_flags()" interface that takes the bit mask of values
you want to test for cases where you know it is a single word
flags value.The downside is that this kind of interface is easy to use
incorrectly especially when accesses to the same flags use
bot test_bit() and test_flags().
--
Yes, this looks definitely easier, but here is this one little bit
more, plus additional code to handle this in various places. Ingo's
proposal needs a (one?!) bit more thinking in one place, but it
shouldn't add even a bit to tx path (and it looks really cool!). Of
course, it could be re-considered in some other time too.BTW, it seems with "Ingo's method" this netif_queue_stopped() check
could be removed too - the change of handlers could be done with
single qdiscs as well.Jarek P.
--
If you think its OK to take USHORT_MAX locks at once, I'm afraid we'll
have to agree to disagree :-/Thing is, lockdep wants to be able to describe the locking hierarchy
with classes, and each class needs to be in static storage for various
reasons.So if you make a locking hierarchy that is USHORT_MAX deep, you need at
least that amount of static classes.Also, you'll run into the fact that lockdep will only track like 48 held
locks, after that it self terminates.I'm aware of only 2 sites in the kernel that break this limit. The
down-side of stretching this limit is that deep lock chains come with
costs (esp so on -rt), so I'm not particularly eager to grow this - it
might give the impresssion its a good idea to have very long lock
chains.--
On Wed, Jul 23, 2008 at 12:58:16PM +0200, Peter Zijlstra wrote:
It's a new thing mainly for new hardware/drivers, and just after
conversion (older drivers effectively use __netif_tx_lock()), so it'll
probably stay for some time until something better is found. David,
will tell the rest, I hope.Jarek P.
--
...And, of course, these new drivers should also lock a single queue
where possible.Jarek P.
--
From: Jarek Poplawski <jarkao2@gmail.com>
It isn't going away.
There will always be a need for a "stop all the TX queues" operation.
--
Ok, then how about something like this, the idea is to wrap the per tx
lock with a read lock of the device and let the netif_tx_lock() be the
write side, therefore excluding all device locks, but not incure the
cacheline bouncing on the read side by using per-cpu counters like rcu
does.This of course requires that netif_tx_lock() is rare, otherwise stuff
will go bounce anyway...Probably missed a few details,.. but I think the below ought to show the
idea...struct tx_lock {
int busy;
spinlock_t lock;
unsigned long *counters;
};int tx_lock_init(struct tx_lock *txl)
{
txl->busy = 0;
spin_lock_init(&txl->lock);
txl->counters = alloc_percpu(unsigned long);if (!txl->counters)
return -ENOMEM;return 0;
}void __netif_tx_lock(struct netdev_queue *txq, cpu)
{
struct net_device *dev = txq->dev;if (rcu_dereference(dev->tx_lock.busy)) {
spin_lock(&dev->tx_lock.lock);
(*percpu_ptr(dev->tx_lock.counters, cpu))++;
spin_unlock(&dev->tx_lock.lock);
} else
(*percpu_ptr(dev->tx_lock.counters, cpu))++;spin_lock(&txq->_xmit_lock);
txq->xmit_lock_owner = cpu;
}void __netif_tx_unlock(struct netdev_queue *txq)
{
struct net_device *dev = txq->dev;(*percpu_ptr(dev->tx_lock.counters, txq->xmit_lock_owner))--;
txq->xmit_lock_owner = -1;
spin_unlock(&txq->xmit_lock);
}unsigned long tx_lock_read_counters(struct tx_lock *txl)
{
int i;
unsigned long counter = 0;/* can use online - the inc/dec are matched per cpu */
for_each_online_cpu(i)
counter += *percpu_ptr(txl->counters, i);return counter;
}void netif_tx_lock(struct net_device *dev)
{
spin_lock(&dev->tx_lock.lock);
rcu_assign_pointer(dev->tx_lock.busy, 1);while (tx_lock_read_counters(&dev->tx_lock)
cpu_relax();
}void netif_tx_unlock(struct net_device *dev)
{
rcu_assign_pointer(dev->tx_lock.busy, 0);
smp_wmb(); /* because rcu_assign_pointer is bro...
From: Peter Zijlstra <peterz@infradead.org>
Thanks for the effort, but I don't think we can seriously consider
this.This lock is taken for every packet transmitted by the system, adding
another memory reference (the RCU deref) and a counter bump is just
not something we can just add to placate lockdep. We going through
all of this effort to seperate the TX locking into individual
queues, it would be silly to go back and make it more expensive.I have other ideas which I've expanded upon in other emails. They
involve creating a netif_tx_freeze() interface and getting the drivers
to start using it.
--
Well, not only lockdep, taking a very large number of locks is expensive
OK, as long as we get there :-)
--
From: Peter Zijlstra <peterz@infradead.org>
Right now it would be on the order of 16 or 32 for
real hardware.Much less than the scheduler currently takes on some
of my systems, so currently you are the pot calling the
kettle black. :-)USHORT_MAX is just the upper hard limit imposed by the
software interface merely as a side effect of storing
the queue number of the SKB as a u16.
--
One nit, and then I'll let this issue rest :-)
The scheduler has a long lock dependancy chain (nr_cpu_ids rq locks),
but it never takes all of them at the same time. Any one code path will
at most hold two rq locks.--
Aside from lockdep, is there a particular problem with taking 64k locks
at once? (in a very slow path, of course) I don't think it causes a
problem with preempt_count, does it cause issues with -rt kernel?Hey, something kind of cool (and OT) I've just thought of that we can
do with ticket locks is to take tickets for 2 (or 64K) nested locks,
and then wait for them both (all), so the cost is N*lock + longest spin,
rather than N*lock + N*avg spin.That would mean even at the worst case of a huge amount of contention
on all 64K locks, it should only take a couple of ms to take all of
them (assuming max spin time isn't ridiculous).Probably not the kind of feature we want to expose widely, but for
really special things like the scheduler, it might be a neat hack to
save a few cycles ;) Traditional implementations would just have
#define spin_lock_async spin_lock
#define spin_lock_async_wait do {} while (0)Sorry it's offtopic, but if I didn't post it, I'd forget to. Might be
a fun quick hack for someone.
--
FWIW, I did something similar in a previous life for the write-side of
a brlock-like locking mechanism. This was especially helpful if the
read-side critical sections were long.Thanx, Paul
--
PI-chains might explode I guess, Thomas?
Besides that, I just have this voice in my head telling me that
It might just be worth it for double_rq_lock() - if you can sort out the
deadlock potential Miklos just raised ;-)--
Isn't this deadlocky?
E.g. one task takes ticket x=1, then other task comes in and takes x=2
and y=1, then first task takes y=2. Then neither can actually
complete both locks.Miklos
--
Oh duh of course you still need mutual exclusion from the first lock
to order the subsequent :PSo yeah it only works for N > 2 locks, and you have to spin_lock the
first one... so unsuitable for scheduler.
--
Or sort the locks by address or some such.
Thanx, Paul
--
On Wed, Jul 23, 2008 at 01:16:07PM -0700, David Miller wrote:
The question is if the current way is "all correct". As a matter of
fact I think Peter's doubts could be justified: taking "USHORT_MAX"
locks looks really dubious (so maybe it's not so strange lockdep
didn't get used to this).Jarek P.
--
From: Jarek Poplawski <jarkao2@gmail.com>
There are, of course, potentially other ways to achieve the objective.
And for non-multiqueue aware devices (which is the vast majority of
the 400 or so networking drivers we have) there is only one queue and
thus one lock taken.
--
--
Sorry. After all those kernel builds I got bleary-eyed and just looked for
recursive locking without any regard for the details.Larry
--
I actually don't see how you could still get the warning with
Dave's patch and the two I sent applied.The warning is triggered by the dev_mc_sync call in
ieee80211_set_multicast_list:dev_mc_sync(local->mdev, dev);
local->mdev is the wmaster device, which has its type set to
ARPHRD_IEEE80211. dev is regular wireless device with type set
to ARPHRD_ETHER. So they have distinct lockdep classes set
by register_netdevice.The warning is:
Jul 21 15:11:07 larrylap kernel: NetworkManager/2661 is trying to
acquire lock:
Jul 21 15:11:07 larrylap kernel: (&dev->addr_list_lock){-...}, at:
[<ffffffff803a2961>] dev_mc_sync+0x19/0x57
Jul 21 15:11:07 larrylap kernel:
Jul 21 15:11:07 larrylap kernel: but task is already holding lock:
Jul 21 15:11:07 larrylap kernel: (&dev->addr_list_lock){-...}, at:
[<ffffffff8039e7c5>] dev_set_rx_mode+0x19/0x2e
Jul 21 15:11:07 larrylap kernel:The only already held is dev->addr_list_lock, the one taken
by dev_mc_sync is local->mdev->addr_list_lock. And this shouldn't
cause any warnings because of the distinct lockdep classes.Could you please retry with the three patches attached to this
mail? If the lockdep warning still triggers, please post it again.
From: Patrick McHardy <kaber@trash.net>
Since I'm convinced the original lockdep spurious warning
issue is cured, and we're looking at something different
now, I've integrated all of these fixes together as one
commit as below.Thanks a lot Patrick.
netdev: Handle ->addr_list_lock just like ->_xmit_lock for lockdep.
The new address list lock needs to handle the same device layering
issues that the _xmit_lock one does.This integrates work done by Patrick McHardy.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 9737c06..a641eea 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -5041,6 +5041,7 @@ static int bond_check_params(struct bond_params *params)
}static struct lock_class_key bonding_netdev_xmit_lock_key;
+static struct lock_class_key bonding_netdev_addr_lock_key;static void bond_set_lockdep_class_one(struct net_device *dev,
struct netdev_queue *txq,
@@ -5052,6 +5053,8 @@ static void bond_set_lockdep_class_one(struct net_device *dev,static void bond_set_lockdep_class(struct net_device *dev)
{
+ lockdep_set_class(&dev->addr_list_lock,
+ &bonding_netdev_addr_lock_key);
netdev_for_each_tx_queue(dev, bond_set_lockdep_class_one, NULL);
}diff --git a/drivers/net/hamradio/bpqether.c b/drivers/net/hamradio/bpqether.c
index b6500b2..58f4b1d 100644
--- a/drivers/net/hamradio/bpqether.c
+++ b/drivers/net/hamradio/bpqether.c
@@ -123,6 +123,7 @@ static LIST_HEAD(bpq_devices);
* off into a separate class since they always nest.
*/
static struct lock_class_key bpq_netdev_xmit_lock_key;
+static struct lock_class_key bpq_netdev_addr_lock_key;static void bpq_set_lockdep_class_one(struct net_device *dev,
struct netdev_queue *txq,
@@ -133,6 +134,7 @@ static void bpq_set_lockdep_class_one(struct net_device *dev,static void bpq_set_lockdep_class(struct net_devi...
No - this one triggers the kernel BUG as follows:
------------[ cut here ]------------
kernel BUG at net/core/dev.c:1328!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in: af_packet rfkill_input nfs lockd nfs_acl sunrpc
cpufreq_conservative cpufreq_userspace cpufreq_powersave powernow_k8 fuse loop
dm_mod arc4 ecb crypto_blkcipher b43 firmware_class rfkill mac80211 cfg80211
led_class input_polldev k8temp sr_mod battery ac ssb button hwmon forcedeth
cdrom serio_raw sg ohci_hcd ehci_hcd sd_mod usbcore edd fan thermal processor
ext3 mbcache jbd pata_amd ahci libata scsi_mod dock
Pid: 2003, comm: b43 Not tainted 2.6.26-rc8-Linus-git-01424-g37437bb #43
RIP: 0010:[<ffffffff803958c6>] [<ffffffff803958c6>] __netif_schedule+0x12/0x75
RSP: 0018:ffff8100b9e33de0 EFLAGS: 00010246
RAX: ffff8100b63819c0 RBX: ffffffff80545300 RCX: ffff8100b6381980
RDX: 00000000ffffffff RSI: 0000000000000001 RDI: ffffffff80545300
RBP: ffff8100b7b45158 R08: ffff8100b89d8000 R09: ffff8100b9d26000
R10: ffff8100b7b44480 R11: ffffffffa01239ef R12: ffff8100b7b44480
R13: ffff8100b9d26000 R14: ffff8100b89d8000 R15: 0000000000000000
FS: 00007f494406a6f0(0000) GS:ffffffff8055e000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f49440933dc CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process b43 (pid: 2003, threadinfo ffff8100b9e32000, task ffff8100b4a3e480)
Stack: ffff8100b7b45158 ffff8100b89d8900 ffff8100b7b45158 ffffffffa0158455
ffff8100ba3287c0 0000000000000246 0000000000000000 0000000000000000
ffff8100b9e33e70 ffff8100b7b451b8 ffff8100ba3287c0 ffff8100b7b451b0
Call Trace:
[<ffffffffa0158455>] ? :mac80211:ieee80211_scan_completed+0x25b/0x2e1
[<ffffffffa01586d6>] ? :mac80211:ieee80211_sta_scan_work+0x0/0x1b8
[<ffffffff8023f7d7>] ? run_workqueue+0xf1/0x1f3
[<ffffffff8023f9b4>] ? worker_threa...
From: Larry Finger <Larry.Finger@lwfinger.net>
Well, we were trying to fix the lockdep warnings. One
thing at a time :-)And that BUG is now just a WARN in Linus's tree, so if
you update you should get further along.
--
On 22-07-2008 08:34, Larry Finger wrote:
Could you send lockdep info after Patrick's "set lockdep classes"
patch?Jarek P.
--
This one is a bit nicer, since we only have a single
addr_list_lock we don't need to pass a pointer to the
lock.
Ok, that one is fixed now in my tree. Or at least it's turned into a
No, invalid opcode is because we use the "ud2" instruction for BUG(),
which causes an invalid op exception. So any BUG[_ON]() will always cause
that on x86.Linus
--
Thanks for the explanation.
With your latest tree, I do get the warning. Unfortunately, it still breaks my
wireless and I still need to do the bisection. That is complicated by getting a
kernel that won't build after the first try. I think I now have a workaround.Larry
--
From: Linus Torvalds <torvalds@linux-foundation.org>
Is there really no more backtrace from that crash message?
It would tell me what driver it's in.There is some "comm: b43" in the log so I'll check that one.
--
please find a small build fix below.
Ingo
-------------->
commit c61b0199e779caf2dcfdb6e83439c1fdf9f20209
Author: Ingo Molnar <mingo@elte.hu>
Date: Mon Jul 21 10:33:42 2008 +0200iwlwifi: fix build bug in "iwlwifi: fix LED stall"
-tip testing found the following build failure:
drivers/net/wireless/iwlwifi/iwl-led.c: In function ‘iwl_led_brightness_set’:
drivers/net/wireless/iwlwifi/iwl-led.c:198: error: ‘led_type_str’ undeclared (first use in this function)
drivers/net/wireless/iwlwifi/iwl-led.c:198: error: (Each undeclared identifier is reported only once
drivers/net/wireless/iwlwifi/iwl-led.c:198: error: for each function it appears in.)Triggered if this driver is built with !CONFIG_IWLWIFI_DEBUG. Introduced
by commit 0eee61273.The best fix is to make led_type_str available as a zero-size symbol and to
only add members to the array if CONFIG_IWLWIFI_DEBUG is set. This way
there's no overhead in the debugging case and we have type checking in the
IWL_DEBUG_LED() macro as well.Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/net/wireless/iwlwifi/iwl-led.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)diff --git a/drivers/net/wireless/iwlwifi/iwl-led.c b/drivers/net/wireless/iwlwifi/iwl-led.c
index 899d7a2..d7129f7 100644
--- a/drivers/net/wireless/iwlwifi/iwl-led.c
+++ b/drivers/net/wireless/iwlwifi/iwl-led.c
@@ -44,15 +44,15 @@
#include "iwl-io.h"
#include "iwl-helpers.h"-#ifdef CONFIG_IWLWIFI_DEBUG
static const char *led_type_str[] = {
+#ifdef CONFIG_IWLWIFI_DEBUG
__stringify(IWL_LED_TRG_TX),
__stringify(IWL_LED_TRG_RX),
__stringify(IWL_LED_TRG_ASSOC),
__stringify(IWL_LED_TRG_RADIO),
NULL
-};
#endif /* CONFIG_IWLWIFI_DEBUG */
+};static const struct {
--
From: Denis V. Lunev <den@openvz.org>
iwl-agn-rs.c: In function 'rs_clear':
iwl-agn-rs.c:2405: warning: unused variable 'privSigned-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
---
drivers/net/wireless/iwlwifi/iwl-agn-rs.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)diff --git a/drivers/net/wireless/iwlwifi/iwl-agn-rs.c b/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
index 6a4b229..7ddd07d 100644
--- a/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
+++ b/drivers/net/wireless/iwlwifi/iwl-agn-rs.c
@@ -2396,6 +2396,7 @@ static void rs_free(void *priv_rate)static void rs_clear(void *priv_rate)
{
+#ifdef CONFIG_IWLWIFI_DEBUG
struct iwl_priv *priv = (struct iwl_priv *) priv_rate;IWL_DEBUG_RATE("enter\n");
@@ -2403,6 +2404,7 @@ static void rs_clear(void *priv_rate)
/* TODO - add rate scale state reset */IWL_DEBUG_RATE("leave\n");
+#endif /* CONFIG_IWLWIFI_DEBUG */
}static void rs_free_sta(void *priv_rate, void *priv_sta)
--
1.5.4.1---------------------------------------------------------------------
Intel Israel (74) LimitedThis e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.--
From: Denis V. Lunev <den@openvz.org>
CC [M] drivers/net/wireless/iwlwifi/iwl-scan.o
drivers/net/wireless/iwlwifi/iwl-scan.c: In function 'iwl_rx_scan_complete_notif':
drivers/net/wireless/iwlwifi/iwl-scan.c:274: warning: unused variable 'scan_notif'Signed-off-by: Denis V. Lunev <den@openvz.org>
Acked-by: Tomas Winkler <tomasw@gmail.com>
---
drivers/net/wireless/iwlwifi/iwl-scan.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)diff --git a/drivers/net/wireless/iwlwifi/iwl-scan.c b/drivers/net/wireless/iwlwifi/iwl-scan.c
index 847690b..0033232 100644
--- a/drivers/net/wireless/iwlwifi/iwl-scan.c
+++ b/drivers/net/wireless/iwlwifi/iwl-scan.c
@@ -270,6 +270,7 @@ static void iwl_rx_scan_results_notif(struct iwl_priv *priv,
static void iwl_rx_scan_complete_notif(struct iwl_priv *priv,
struct iwl_rx_mem_buffer *rxb)
{
+#ifdef CONFIG_IWLWIFI_DEBUG
struct iwl_rx_packet *pkt = (struct iwl_rx_packet *)rxb->skb->data;
struct iwl_scancomplete_notification *scan_notif = (void *)pkt->u.raw;@@ -277,6 +278,7 @@ static void iwl_rx_scan_complete_notif(struct iwl_priv *priv,
scan_notif->scanned_channels,
scan_notif->tsf_low,
scan_notif->tsf_high, scan_notif->status);
+#endif/* The HW is no longer scanning */
clear_bit(STATUS_SCAN_HW, &priv->status);
--
1.5.4.1---------------------------------------------------------------------
Intel Israel (74) LimitedThis e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.--
From: Denis V. Lunev <den@openvz.org>
CC [M] drivers/net/wireless/iwlwifi/iwl-rfkill.o
drivers/net/wireless/iwlwifi/iwl-led.c: In function 'iwl_led_brightness_set':
drivers/net/wireless/iwlwifi/iwl-led.c:198: error: 'led_type_str' undeclared (first use in this function)
drivers/net/wireless/iwlwifi/iwl-led.c:198: error: (Each undeclared identifier is reported only once
drivers/net/wireless/iwlwifi/iwl-led.c:198: error: for each function it appears in.)The problem is that led_type_str is defined under CONFIG_IWLWIFI_DEBUG
while IWL_DEBUG is a static inline function in this case. Replace it
with macro.Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>
---
drivers/net/wireless/iwlwifi/iwl-debug.h | 8 ++------
1 files changed, 2 insertions(+), 6 deletions(-)diff --git a/drivers/net/wireless/iwlwifi/iwl-debug.h b/drivers/net/wireless/iwlwifi/iwl-debug.h
index 097a72f..b4ffd33 100644
--- a/drivers/net/wireless/iwlwifi/iwl-debug.h
+++ b/drivers/net/wireless/iwlwifi/iwl-debug.h
@@ -68,12 +68,8 @@ void iwl_dbgfs_unregister(struct iwl_priv *priv);
#endif#else
-static inline void IWL_DEBUG(int level, const char *fmt, ...)
-{
-}
-static inline void IWL_DEBUG_LIMIT(int level, const char *fmt, ...)
-{
-}
+#define IWL_DEBUG(level, fmt, args...)
+#define IWL_DEBUG_LIMIT(level, fmt, args...)
#endif /* CONFIG_IWLWIFI_DEBUG */--
1.5.4.1---------------------------------------------------------------------
Intel Israel (74) LimitedThis e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.--
There are already patches that fix this. John has to pick them up from
the mail.
Thanks
Tomas---------------------------------------------------------------------
Intel Israel (74) LimitedThis e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.--
could you post the fix in this thread too please?
Ingo
--
Grr. And I quote:
Security table (IP_NF_SECURITY) [Y/n/?] (NEW) ?
This option adds a `security' table to iptables, for use
with Mandatory Access Control (MAC) policy.If unsure, say N.
why the heck does this new config option apparently default to 'Y'? It's a
new option, so no old users can need it, and the docs even say you should
say 'N' unless you know what you're doing.(Same issue with the IPv6 version).
Don't do this.
Linus
--
This warning also seems to be new:
net/ipv6/addrconf.c: In function ‘addrconf_add_linklocal’:
net/ipv6/addrconf.c:2318: warning: unused variable ‘net’and looking at the code it's apparently because I'm not an optimistic
enough dad.But hey, if you had three pre-teenage girls, you might not be all that
optimistic either. So I think that's reasonable.Problem seems to have been introduced by 53b7997f ("ipv6 netns: Make
several "global" sysctl variables namespace aware")Linus
--
RnJvbTogTGludXMgVG9ydmFsZHMgPHRvcnZhbGRzQGxpbnV4LWZvdW5kYXRpb24ub3JnPg0KRGF0
ZTogU3VuLCAyMCBKdWwgMjAwOCAxODowNzoyOCAtMDcwMCAoUERUKQ0KDQo+IFRoaXMgd2Fybmlu
ZyBhbHNvIHNlZW1zIHRvIGJlIG5ldzoNCj4gDQo+IAluZXQvaXB2Ni9hZGRyY29uZi5jOiBJbiBm
dW5jdGlvbiChYWRkcmNvbmZfYWRkX2xpbmtsb2NhbKI6DQo+IAluZXQvaXB2Ni9hZGRyY29uZi5j
OjIzMTg6IHdhcm5pbmc6IHVudXNlZCB2YXJpYWJsZSChbmV0og0KPiANCj4gYW5kIGxvb2tpbmcg
YXQgdGhlIGNvZGUgaXQncyBhcHBhcmVudGx5IGJlY2F1c2UgSSdtIG5vdCBhbiBvcHRpbWlzdGlj
IA0KPiBlbm91Z2ggZGFkLg0KPiANCj4gQnV0IGhleSwgaWYgeW91IGhhZCB0aHJlZSBwcmUtdGVl
bmFnZSBnaXJscywgeW91IG1pZ2h0IG5vdCBiZSBhbGwgdGhhdCANCj4gb3B0aW1pc3RpYyBlaXRo
ZXIuIFNvIEkgdGhpbmsgdGhhdCdzIHJlYXNvbmFibGUuDQo+IA0KPiBQcm9ibGVtIHNlZW1zIHRv
IGhhdmUgYmVlbiBpbnRyb2R1Y2VkIGJ5IDUzYjc5OTdmICgiaXB2NiBuZXRuczogTWFrZSANCj4g
c2V2ZXJhbCAiZ2xvYmFsIiBzeXNjdGwgdmFyaWFibGVzIG5hbWVzcGFjZSBhd2FyZSIpDQoNCkxl
dCdzIGp1c3QgZXhwYW5kIHRoZSB0aGluZyBpbiB0aGUgb25lIHNwb3QgaXQgZ2V0cw0KcmVmZXJl
bmNlZCBpbi4NCg0KUGxlYXNlIGFwcGx5LCB0aGFua3MuDQoNCmlwdjY6IEZpeCB3YXJuaW5nIGlu
IGFkZHJjb25mIGNvZGUuDQoNClJlcG9ydGVkIGJ5IExpbnVzLg0KDQpTaWduZWQtb2ZmLWJ5OiBE
YXZpZCBTLiBNaWxsZXIgPGRhdmVtQGRhdmVtbG9mdC5uZXQ+DQoNCmRpZmYgLS1naXQgYS9uZXQv
aXB2Ni9hZGRyY29uZi5jIGIvbmV0L2lwdjYvYWRkcmNvbmYuYw0KaW5kZXggNTgwYWU1MC4uOWY0
ZmNjZSAxMDA2NDQNCi0tLSBhL25ldC9pcHY2L2FkZHJjb25mLmMNCisrKyBiL25ldC9pcHY2L2Fk
ZHJjb25mLmMNCkBAIC0yMzE1LDEyICsyMzE1LDExIEBAIHN0YXRpYyB2b2lkIGluaXRfbG9vcGJh
Y2soc3RydWN0IG5ldF9kZXZpY2UgKmRldikNCiBzdGF0aWMgdm9pZCBhZGRyY29uZl9hZGRfbGlu
a2xvY2FsKHN0cnVjdCBpbmV0Nl9kZXYgKmlkZXYsIHN0cnVjdCBpbjZfYWRkciAqYWRkcikNCiB7
DQogCXN0cnVjdCBpbmV0Nl9pZmFkZHIgKiBpZnA7DQotCXN0cnVjdCBuZXQgKm5ldCA9IGRldl9u
ZXQoaWRldi0+ZGV2KTsNCiAJdTMyIGFkZHJfZmxhZ3MgPSBJRkFfRl9QRVJNQU5FTlQ7DQogDQog
I2lmZGVmIENPTkZJR19JUFY2X09QVElNSVNUSUNfREFEDQogCWlmIChpZGV2LT5jbmYub3B0aW1p
c3RpY19kYWQgJiYNCi0JICAgICFuZXQtPmlwdjYuZGV2Y29uZl9hbGwtPmZvcndhcmRpbmcpDQor
CSAgICAhZGV2X25ldChpZGV2LT5kZXYpLT5pcHY2LmRldmNvbmZfYWxsLT5mb3J3YXJkaW5nKQ0K
IAkJYWRkcl9mbGFncyB8PSBJRkFfRl9PUFRJTUlTVElDOw0KICNlbmRpZg0KIA0K
--
From: Linus Torvalds <torvalds@linux-foundation.org>
James/Patrick please fix this.
Thanks.
--
This is only the NETFILTER_ADVANCED=n default (for SECURITY=y).
The netfilter defaults for NETFILTER_ADVANCED=n should be m/y for
things that are needed by mainstream distributions for normal
usage.I'm not sure how this is going to be used, James?
--
I think the idea now is that everything new is N by default, but the
intention is to have this enabled in Fedora/RHEL.Patrick, would you please fix this up? The only dev box I have access to
at the moment doesn't boot with recent git (I think it's the macbook2
issue).- James
--
James Morris
<jmorris@namei.org>
--
Well, this option (NETFILTER_ADVANCED) was introduced specifically
so Linus doesn't have to go through and enable all the netfilter
options manually :)The idea was that NETFILTER_ADVANCED=n enables everything needed
by mainstream distributions and hides the rest. We can certainly
change the default for this option, but that makes NETFILTER_ADVANCEDSure. I'd like to hear whether Linus still wants this changed though.
--
From: Patrick McHardy <kaber@trash.net>
A new feature cannot possibly be used by existing distributions. I
think that's the main gripe.Distributions themselves will enable the feature no matter what we
mark it's default as.And once the feature thus becomes pervasive we can re-adjust the
default.
--
Well, if the feature really is going to be something that a _normal_
netfilter config needs, then it should indeed be turned on.However, nothing in the docs imply that at all. Can you explain? Why
should IP_NF_SECURITY be on, and why should a default netfilter table
enable it? And if it should, WHY THE HELL IS IT DOCUMENTED THAT YOU SHOULD
SAY 'N'?That option as it stands now MAKES NO SENSE. Either you should say 'Y'
(and you should explain _why_), or you should say 'N' (as documented) and
it should damn well default to 'N' too!Linus
--
As I said, I don't know whether its needed, but judging by James'
response, its going to be needed for a regular FC installation.Its not needed today of course, so the attached patch changes it
I think I'll just change all the help texts for options having
different defaults with NETFILTER_ADVANCED=n to say "If unsure,
choose the default" to remove the contradictions we'd otherwise
always have.
From: Patrick McHardy <kaber@trash.net>
I've applied this, thanks Patrick.
--
Don't ask what mainstream distributors would choose, ask how "make
oldconfig" or "make silentoldconfig" should behave. They should not add
features that weren't there before and are not necessary to continue
using all features of the old configuration.
--
Stefan Richter
-=====-==--- -=== =-=-=
http://arcgraph.de/sr/
--
> > > This is the main networking merge for 2.6.27
Looks like multiqueue changes broken netconsole (it remains silent).
Also, I had to enable CONFIG_NET_SCHED=y, otherwise it wouldn't link:
net/built-in.o: In function `dev_queue_xmit':
(.text+0x11210): undefined reference to `qdisc_calculate_pkt_len'
net/built-in.o: In function `__qdisc_destroy':
sch_generic.c:(.text+0x1ddcd): undefined reference to `qdisc_put_stab'And finally, boot process hangs after TCP cubic is initialized.
This is net-2.6 tree (175f9c1bba9b825d22b142d183c9e175488b260c)
(tainted with conntracking-in-net-ns changes, though).--
From: Alexey Dobriyan <adobriyan@gmail.com>
Can you dig a little bit deeper into this one? It ought to
work fine.I just pushed Linus a fix for that, Stephen Rothwell noticed it
That very much should not happen. Can you turn on initcall
tracing so we can see where that is happening exactly?
--
OK, the problem is really this buglet:
BUG_ON(q == &noop_qdisc);
__netif_schedule
netif_tx_wake_queue
netif_wake_queue
atl1_check_link
atl1_up or atlx_link_chg_task
run_workqueue--
From: Alexey Dobriyan <adobriyan@gmail.com>
Thanks for the backtrace I'll work on fixing this.
--
From: David Miller <davem@davemloft.net>
[ Jeff and co., this is basically the kind of patch I want
to see eventually for the Intel drivers too... ]Alexey, please try this patch:
atl1: Do not wake queue before queue has been started.
Based upon a bug report by Alexey Dobriyan.
Packet flow during link state events should not be done by
waking and stopping the TX queue anyways, that is handled
transparently by netif_carrier_{on,off}().So, remove the netif_{wake,stop}_queue() calls in the link
check code, and add the necessary netif_start_queue() call
to atl1_up().Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/drivers/net/atlx/atl1.c b/drivers/net/atlx/atl1.c
index 3e22e78..f12e3d1 100644
--- a/drivers/net/atlx/atl1.c
+++ b/drivers/net/atlx/atl1.c
@@ -1308,7 +1308,6 @@ static u32 atl1_check_link(struct atl1_adapter *adapter)
dev_info(&adapter->pdev->dev, "link is down\n");
adapter->link_speed = SPEED_0;
netif_carrier_off(netdev);
- netif_stop_queue(netdev);
}
return 0;
}
@@ -1358,7 +1357,6 @@ static u32 atl1_check_link(struct atl1_adapter *adapter)
if (!netif_carrier_ok(netdev)) {
/* Link down -> Up */
netif_carrier_on(netdev);
- netif_wake_queue(netdev);
}
return 0;
}
@@ -2627,6 +2625,7 @@ static s32 atl1_up(struct atl1_adapter *adapter)
mod_timer(&adapter->watchdog_timer, jiffies);
atlx_irq_enable(adapter);
atl1_check_link(adapter);
+ netif_start_queue(netdev);
return 0;err_up:
--
David, can we please make this all a bit less fragile?
There are _millions_ of network drivers, and these changes seem to have
broken not just common drivers, but also drivers that "work" seem to now
have broken suspend/resume.There's at least one report of suspend apparently oopsing now, and I
assume it's basically the same thing - it was bisected down to that same
37437bb2e1ae8af470dfcd5b4ff454110894ccaf commit ("pkt_sched: Schedule
qdiscs instead of netdev_queue.")Why is it so unnecessarily fragile to begin with? Especially for stuff
that happens at bootup or suspend, doing a BUG_ON() is _particularly_
painful, because a dead machine means that you cannot get any logs or
anything else.So wouldn't it be *much* better to do something like the appended, and at
least try to limp on, and maybe have a system that people can get logs
out of?Linus
---
net/core/dev.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)diff --git a/net/core/dev.c b/net/core/dev.c
index 2eed17b..43ab4f5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1325,7 +1325,8 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)void __netif_schedule(struct Qdisc *q)
{
- BUG_ON(q == &noop_qdisc);
+ if (WARN_ON_ONCE(q == &noop_qdisc))
+ return;if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state)) {
struct softnet_data *sd;
---
--
From: Linus Torvalds <torvalds@linux-foundation.org>
--
Patch helps and networking is back with and without using netconsole,
thank you!Also wan to say, 8139too driver survived multiqueue TX changes fine.
--
From: Alexey Dobriyan <adobriyan@gmail.com>
Thank you very much for testing, I'll push this fix.
--
Hi David!
There is another problem on sparc64 and happymeal ethernet card.
when tring to up interface:
kernel BUG at net/core/dev.c:1328
ip(1090): Kernel bad sw trap 5
__netif_schedule
happy_meal_set_multicast
__dev_set_rx_mode
dev_set_rx_mode
dev_open
dev_change_flags
do_setlink
rtnl_newlink
rtnetlink_rcv_msg
netlink_rcv_skb
--
[...]
This is yet another driver calling netif_wake_queue() during dev_open(),
when there is no real qdisc present. (And yes, sfc is another of those
drivers - I will post a patch after internal review.)Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
--
From: Ben Hutchings <bhutchings@solarflare.com>
Yep, what idiot wrote this driver? ;-)
Alexander please try this patch:
sunhme: Remove stop/wake TX queue calls in set-multicast-list handler.
Based upon a bug report by Alexander Beregalov and commentary
from Ben Hutchings.These are totally unnecessary, in particular because this
driver's ->hard_start_xmit() handler takes the same driver
spinlock that the set-multicast-list handler uses.Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/drivers/net/sunhme.c b/drivers/net/sunhme.c
index 1aa425b..b79d5f0 100644
--- a/drivers/net/sunhme.c
+++ b/drivers/net/sunhme.c
@@ -2377,8 +2377,6 @@ static void happy_meal_set_multicast(struct net_device *dev)spin_lock_irq(&hp->happy_lock);
- netif_stop_queue(dev);
-
if ((dev->flags & IFF_ALLMULTI) || (dev->mc_count > 64)) {
hme_write32(hp, bregs + BMAC_HTABLE0, 0xffff);
hme_write32(hp, bregs + BMAC_HTABLE1, 0xffff);
@@ -2410,8 +2408,6 @@ static void happy_meal_set_multicast(struct net_device *dev)
hme_write32(hp, bregs + BMAC_HTABLE3, hash_table[3]);
}- netif_wake_queue(dev);
-
spin_unlock_irq(&hp->happy_lock);
}--
It works, thanks.
--
Before that BUG_ON in __netif_schedule() happens (and quickly scrolls
off). But after netconsole ups device (which is ATL1 if it matters).--
Hi Dave,
I noticed the patch below isn't in this (yet)
did it slip through or does it still need more work?From: Arjan van de Ven <arjan@linux.intel.com>
Subject: Print the module name as part of the watchdog messageAs suggested by Dave:
This patch adds a function to get the driver name from a struct net_device,
and consequently uses this in the watchdog timeout handler to print as
part of the message.Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
include/linux/netdevice.h | 2 ++
net/core/dev.c | 20 ++++++++++++++++++++
net/sched/sch_generic.c | 6 +++---
3 files changed, 25 insertions(+), 3 deletions(-)Index: linux.trees.git/include/linux/netdevice.h
===================================================================
--- linux.trees.git.orig/include/linux/netdevice.h
+++ linux.trees.git/include/linux/netdevice.h
@@ -1516,6 +1516,8 @@ extern void dev_seq_stop(struct seq_file
extern int netdev_class_create_file(struct class_attribute *class_attr);
extern void netdev_class_remove_file(struct class_attribute *class_attr);+extern char *netdev_drivername(struct net_device *dev, char *buffer, int len);
+
extern void linkwatch_run_queue(void);extern int netdev_compute_features(unsigned long all, unsigned long one);
Index: linux.trees.git/net/core/dev.c
===================================================================
--- linux.trees.git.orig/net/core/dev.c
+++ linux.trees.git/net/core/dev.c
@@ -4554,6 +4554,26 @@ err_name:
return -ENOMEM;
}+char *netdev_drivername(struct net_device *dev, char *buffer, int len)
+{
+ struct device_driver *driver;
+ struct device *parent;
+
+ if (len <= 0 || !buffer)
+ return buffer;
+ buffer[0] = 0;
+
+ parent = dev->dev.parent;
+
+ if (!parent)
+ return buffer;
+
+ driver = parent->driver;
+ if (driver && driver->name)
+ strlcpy(buffer, driver->name, len);
+ return buffer;
+}
+
static void __net_exit netdev_exit(struct...
From: Arjan van de Ven <arjan@infradead.org>
Applied, thanks ARjan.
--
From: Arjan van de Ven <arjan@infradead.org>
It slipped through the cracks, sorry about that Arjan.
Thanks for reminding me, I'll look at it soon.
--
