Hi Got recently NULL pointer dereference Note - it seems in 32-bit i didn't experience this issue. Router with NAT, HFSC shapers terminated with bfifo qdiscs (some attached to 802.1Q vlan's), userspace application using tun. Hardware - network cards e1000e, Core 2 based architecture (two Quad Xeon), full message received over netconsole attached More info can provide on request
Hi Denys !
txq->qdisc is NULL at this point, I dont think its even possible, so my
random guess is dev_pick_tx() got an queue_index >
dev->real_num_tx_queues
txq = dev_pick_tx(dev, skb);
q = rcu_dereference(txq->qdisc); // q = NULL
#ifdef CONFIG_NET_CLS_ACT
skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
#endif
if (q->enqueue) { // crash in dereference
rc = __dev_xmit_skb(skb, q, dev, txq);
goto out;
}
I believe the following lines from dev_pick_tx() are not the problem :
if (sk && sk->sk_dst_cache)
sk_tx_queue_set(sk, queue_index);
It is IMHO not safe, because route for this socket might have just
changed and we are transmitting an old packet (queued some milli seconds
before, when route was different).
We then memorize a queue_index that might be too big for the new device
of new selected route.
Next packet we want to transmit will take the cached value of
queue_index, correct for old device, maybe not correct for new device.
You could try to revert commit a4ee3ce3293dc931fab19beb472a8bde1295aebe
commit a4ee3ce3293dc931fab19beb472a8bde1295aebe
Author: Krishna Kumar <krkumar2@in.ibm.com>
Date: Mon Oct 19 23:50:07 2009 +0000
net: Use sk_tx_queue_mapping for connected sockets
For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has a queue select or the socket is not
connected. Next iterations of dev_pick_tx uses the cached value
of sk_tx_queue_mapping.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..fa88dcd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
static struct netdev_queue *dev_pick_tx(struct net_device *dev,
struct sk_buff *skb)
{
- const struct ...I will try to revert it now. But to trigger bug probably i need 1-2 days. --
I can try to reproduce bug here, with a multiqueue device, a non multiqueue device, and changing routes while transmitting bulk of packets, and of course some trafic shaping to slow down xmits. Could you send : ifconfig -a lspci Thanks --
A bit more even, hopefully it is useful. Additional note - i have few blackhole routes, so if specific route announce will disappear - it will be less specific blackhole route. I have also some source routing (ip rule). lspci 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev b1) 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev b1) 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev b1) 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 4-5 (rev b1) 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev b1) 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 6-7 (rev b1) 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev b1) 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1) 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1) 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev b1) 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1) 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev b1) 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1) 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev b1) 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09) 00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09) 00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09) 00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09) 00:1d.3 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09) 00:1d.7 USB Controller: Intel ...
Hi Eric,
When route changes, I think my patch had reset sk->sk_tx_queue_mapping
by calling sk_tx_queue_clear. I don't know if I missed any path where
the route changes and sk_dst_reset() was not called.
The following might be better to prove the panic is due to this, since
your suggestion will hide a panic that happens somewhat rare (according
to Denys):
if (sk_tx_queue_recorded(sk)) {
queue_index = sk_tx_queue_get(sk);
+ queue_index = dev_cap_txqueue(dev, queue_index);
} else {
Thanks,
--
Problem is when you reset sk->sk_tx_queue_mapping at the very moment route (or destination) changes, we might have old packets queued in tx queues, of the old ethernet device (eth0 : multi queue compatable) 2) Application does a sendmsg() or connect() call and sk->sk_dst_cache is rebuild, it points to a dst_entry referring a new device (eth1 : non multiqueue) 3) When one old packet finally is transmitted, we do : queue_index = 1; // any value > 0 if (sk && sk->sk_dst_cache) sk_tx_queue_set(sk, queue_index); // remember a >0 value 4) application does a sendmsg(), enqueues a new skb on eth1 5) We re-enter dev_pick_tx(), and consider cached value in 3) is valid. we pick a non existent txq for eth1 device. Sure, but I thought I was clear enough to prove this commit was wrong, --
Here the patch I cooked, this is a stable candidate, once tested and
acknowledged.
[PATCH] net: dev_pick_tx() fix
When dev_pick_tx() caches tx queue_index on a socket, we must check
socket dst_entry matches skb one, or risk a crash later, as reported by
Denys Fedorysychenko, if old packets are in flight during a route
change, involving devices with different number of queues.
Bug introduced by commit a4ee3ce3
(net: Use sk_tx_queue_mapping for connected sockets)
Reported-by: Denys Fedorysychenko <nuclearcat@nuclearcat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 1c8a0ce..92584bf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1989,8 +1989,12 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
if (dev->real_num_tx_queues > 1)
queue_index = skb_tx_hash(dev, skb);
- if (sk && sk->sk_dst_cache)
- sk_tx_queue_set(sk, queue_index);
+ if (sk) {
+ struct dst_entry *dst = rcu_dereference(sk->sk_dst_cache);
+
+ if (dst && skb_dst(skb) == dst)
+ sk_tx_queue_set(sk, queue_index);
+ }
}
}
--
From: Eric Dumazet <eric.dumazet@gmail.com> Looks good Eric. I'll integrate this into net-2.6 and also queue it up for -stable in a day or two unless some problem is discovered. Thanks! --
From: Eric Dumazet <eric.dumazet@gmail.com> It looks like Denys is still getting crashes even with this patch applied. And I also think there is some meric to some of Krishna's analysis. To me it seems to make more sense to validate the SKB's queue against the real actual choosen device's range. The socket queue index will catch up and eventually become valid because the dst reset will invalidate the queue setting, and we'll thus recompute it as needed, as Krishna stated. So I'm tossing this patch for now since it doesn't even aparently fix the bug. --
I am a bit lost here David. Denys got a crash that we cannot explain yet. He said he has no multiqueue devices, so obviously my patch cant help him. But this patch was fixing a real issue, I believe I pointed it twice already... I'll try to setup an environment to trigger this bug for real, but this will take time, my dev machines are not multiqueue. --
From: Eric Dumazet <eric.dumazet@gmail.com> Ok, I thought your patch was specifically meant to fix Denys's bug. The confusion comes from the fact that you mention Denys's crash in your commit message: -------------------- When dev_pick_tx() caches tx queue_index on a socket, we must check socket dst_entry matches skb one, or risk a crash later, as reported by Denys Fedorysychenko, if old packets are in flight during a route change, involving devices with different number of queues. -------------------- Anyways, I studied your patch once more and read the thread discussion with Krishna again and your patch looks fine. I'll apply it to net-2.6, thanks! --
In any case, I think there is a fundamental problem with this sk caching. Because one packet can travel in many stacked devices before hitting the wire. (bonding, vlan, ethernet) for example. Socket cache is meaningfull for one level only... --
From: Eric Dumazet <eric.dumazet@gmail.com> We were talking the other day about that 'tun' change to orphan the SKB on TX, and I mentioned the possibility of just doing this in some generic location before we give the packet to the device ->xmit() method. Such a scheme could help with this problem too. --
Btw i have application using tun. --
Could you add following sanity test to catch the error ?
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..b67274a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -988,6 +988,7 @@ static inline
struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
unsigned int index)
{
+ WARN_ON(index >= dev->num_tx_queues);
return &dev->_tx[index];
}
--
Very sorry for being late, just i found way to stabilize kernel for me and to solve my personal life issues. It took 2 weeks... I will try this patch, but i'm sure i dont have multiqueue card there. I had shaper on eth0.33 and eth0 so i disable all offloading except checksumming there. Recently i found that on some interfaces (where is no shaper) gso left on. And yes, probably some traffic was queued, then route changed, and maybe it went from gso off interface to gso on... It can be just pure luck, that i dont have crashes anymore, but maybe trick is in gso... --
Same thing we did with
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(skb);
in dev_hard_start_xmit() ?
Problem is this skb_tstamp_tx() thing....
One possibility would be to change skb_orphan() to let everything done
by destructor.
One more argument would let destructor() know if this is the final
destructor() called from skb_release_head_state()
destructor() would be responsible to set skb->destructor and/or skb->sk
to NULL when possible.
All normal destructors would not care of this 2nd argument and just do
what they actually do, plus setting skb->sk = NULL, skb->destructor =
NULL
Fast path would stay as today, no extra test.
Only tstamp users would need to setup another destructor, a bit more
complex (it would have to take a look at 2nd argument before really
doing the job)
Completely untested patch to get the idea :
(to be completed for the tstamp thing)
include/linux/skbuff.h | 8 +++-----
include/net/sctp/sctp.h | 2 +-
include/net/sock.h | 4 ++--
net/caif/caif_socket.c | 4 +++-
net/core/dev.c | 26 +++++++++-----------------
net/core/skbuff.c | 2 +-
net/core/sock.c | 8 ++++++--
net/l2tp/l2tp_core.c | 4 +++-
net/netfilter/nf_tproxy_core.c | 2 +-
net/packet/af_packet.c | 4 ++--
net/unix/af_unix.c | 4 ++--
11 files changed, 33 insertions(+), 35 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 38501d2..7dfd833 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -353,7 +353,7 @@ struct sk_buff {
kmemcheck_bitfield_end(flags1);
__be16 protocol;
- void (*destructor)(struct sk_buff *skb);
+ void (*destructor)(struct sk_buff *skb, int final);
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
struct nf_conntrack *nfct;
struct sk_buff *nfct_reasm;
@@ -1407,15 +1407,13 @@ static inline void ...Well, might need to test if skb is not shared before orphaning it if (!skb_shared(skb)) skb_orphan(nskb); --
From: Eric Dumazet <eric.dumazet@gmail.com> If it's not legal to skb_orphan() here then it would not be legal for the drivers to unconditionally skb_orphan(), which they do. --
I cooked following patch, introducing skb_orphan_try() helper, to document all known exceptions. I have a possible followup for this patch : Orphaning skbs earlier could also make dev_kfree_skb_irq() faster. Instead of queing skb into completion_queue and triggering NET_TX_SOFTIRQ, we would directly free an orphaned skb ? [PATCH net-next-2.6] net: Introduce skb_orphan_try() Transmitted skb might be attached to a socket and a destructor, for memory accounting purposes. Traditionally, this destructor is called at tx completion time, when skb is freed. When tx completion is performed by another cpu than the sender, this forces some cache lines to change ownership. XPS was an attempt to give tx completion to initial cpu. David idea is to call destructor right before giving skb to device (call to ndo_start_xmit()). Because device queues are usually small, orphaning skb before tx completion is not a big deal. Some drivers already do this, we could do it in upper level. There is one known exception to this early orphaning, called tx timestamping. It needs to keep a reference to socket until device can give a hardware or software timestamp. This patch adds a skb_orphan_try() helper, to centralize all exceptions to early orphaning in one spot, and use it in dev_hard_start_xmit(). "tbench 16" results on a Nehalem machine (2 X5570 @ 2.93GHz) before: Throughput 4428.9 MB/sec 16 procs after: Throughput 4448.14 MB/sec 16 procs UDP should get even better results, its destructor being more complex, since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- diff --git a/net/core/dev.c b/net/core/dev.c index e8041eb..acae5fe 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1880,6 +1880,17 @@ static int dev_gso_segment(struct sk_buff *skb) return 0; } +/* + * Try to orphan skb early, right before transmission by the device. + * We cannot orphan skb if tx timestamp is requested, ...
From: Eric Dumazet <eric.dumazet@gmail.com> Looks good, applied, thanks Eric. That timestamping issue, I bet there is some simple solution hiding in the bushes for that one? Hmmm, something like... there is a less transient piece of state (perhaps in the socket) and the skb has a pointer to that piece of state. Then the driver writes into this pointed-to place rather into some member of the skb itself. Then all of these problems with referencing the skb metadata across Sounds great. But I don't see dev_kfree_skb_irq() as much of a performance priority, since all sane modern drivers free skbs from softirq context (via NAPI). --
Hmm, looking at the GSO stuff, I believe we should not call
skb_orphan_try() on gso skbs ?
Thanks !
[PATCH net-next-2.6] net: Dont call skb_orphan_try() for GSO
At this point, skb->destructor is not the original one (stored in
DEV_GSO_CB(skb)->destructor)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index b31d5d6..200d1e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1937,7 +1937,6 @@ gso:
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(nskb);
- skb_orphan_try(nskb);
rc = ops->ndo_start_xmit(nskb, dev);
if (unlikely(rc != NETDEV_TX_OK)) {
if (rc & ~NETDEV_TX_MASK)
--
From: Eric Dumazet <eric.dumazet@gmail.com> Right, I've applied this, thanks. What we should probably do instead is call and NULL out the DEV_GSO_CB() destructor. Right? --
Yes, probably, I'll take a look at this if you want. Thanks --
From: Eric Dumazet <eric.dumazet@gmail.com>
It might look something like this:
diff --git a/net/core/dev.c b/net/core/dev.c
index 9bf1ccc..13241da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1892,6 +1892,20 @@ static inline void skb_orphan_try(struct sk_buff *skb)
skb_orphan(skb);
}
+/*
+ * GSO packets need to be handled specially because such packets
+ * hold the normal SKB destructor in a backup pointer.
+ */
+static inline void skb_orphan_try_gso(struct sk_buff *skb)
+{
+ if (!skb_tx(skb)->flags) {
+ if (DEV_GSO_CB(skb)->destructor)
+ DEV_GSO_CB(skb)->destructor(skb);
+ DEV_GSO_CB(skb)->destructor = NULL;
+ skb->sk = NULL;
+ }
+}
+
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq)
{
@@ -1937,6 +1951,7 @@ gso:
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(nskb);
+ skb_orphan_try_gso(skb);
rc = ops->ndo_start_xmit(nskb, dev);
if (unlikely(rc != NETDEV_TX_OK)) {
if (rc & ~NETDEV_TX_MASK)
--
Hmm... are you sure we want to call destructor for each skb ? Should'nt we do it before initial skb is split ? --
From: Eric Dumazet <eric.dumazet@gmail.com> Good idea, therefore you mean something like this? diff --git a/net/core/dev.c b/net/core/dev.c index 3ba774b..f3c3885 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1865,6 +1865,7 @@ static int dev_gso_segment(struct sk_buff *skb) int features = dev->features & ~(illegal_highdma(dev, skb) ? NETIF_F_SG : 0); + skb_orphan_try(skb); segs = skb_gso_segment(skb, features); /* Verifying header integrity only. */ --
Yes, it seems better. What about the if (dev->priv_flags & IFF_XMIT_DST_RELEASE) skb_dst_drop(skb); This thing might also be moved before the split, since split probably clone all dst ? --
From: Eric Dumazet <eric.dumazet@gmail.com>
Good catch, agreed.
diff --git a/net/core/dev.c b/net/core/dev.c
index 3ba774b..4f897e2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1851,6 +1851,17 @@ static void dev_gso_skb_destructor(struct sk_buff *skb)
cb->destructor(skb);
}
+/*
+ * Try to orphan skb early, right before transmission by the device.
+ * We cannot orphan skb if tx timestamp is requested, since
+ * drivers need to call skb_tstamp_tx() to send the timestamp.
+ */
+static inline void skb_orphan_try(struct sk_buff *skb)
+{
+ if (!skb_tx(skb)->flags)
+ skb_orphan(skb);
+}
+
/**
* dev_gso_segment - Perform emulated hardware segmentation on skb.
* @skb: buffer to segment
@@ -1865,6 +1876,7 @@ static int dev_gso_segment(struct sk_buff *skb)
int features = dev->features & ~(illegal_highdma(dev, skb) ?
NETIF_F_SG : 0);
+ skb_orphan_try(skb);
segs = skb_gso_segment(skb, features);
/* Verifying header integrity only. */
@@ -1881,17 +1893,6 @@ static int dev_gso_segment(struct sk_buff *skb)
return 0;
}
-/*
- * Try to orphan skb early, right before transmission by the device.
- * We cannot orphan skb if tx timestamp is requested, since
- * drivers need to call skb_tstamp_tx() to send the timestamp.
- */
-static inline void skb_orphan_try(struct sk_buff *skb)
-{
- if (!skb_tx(skb)->flags)
- skb_orphan(skb);
-}
-
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq)
{
@@ -1902,13 +1903,6 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);
- if (netif_needs_gso(dev, skb)) {
- if (unlikely(dev_gso_segment(skb)))
- goto out_kfree_skb;
- if (skb->next)
- goto gso;
- }
-
/*
* If device doesnt need skb->dst, release it right now while
* its hot in this cpu cache
@@ -1916,6 +1910,13 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
...You could have one skb_orphan_try() call before the
if (netif_needs_gso(dev, skb)) {
and remove it from dev_gso_segment() ?
--
From: Eric Dumazet <eric.dumazet@gmail.com>
Yes, that's much more concise. This should be ready to go:
net: Orphan and de-dst skbs earlier in xmit path.
This way GSO packets don't get handled differently.
With help from Eric Dumazet.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/core/dev.c b/net/core/dev.c
index 3ba774b..a4a7c36 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1902,13 +1902,6 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);
- if (netif_needs_gso(dev, skb)) {
- if (unlikely(dev_gso_segment(skb)))
- goto out_kfree_skb;
- if (skb->next)
- goto gso;
- }
-
/*
* If device doesnt need skb->dst, release it right now while
* its hot in this cpu cache
@@ -1917,6 +1910,14 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
skb_dst_drop(skb);
skb_orphan_try(skb);
+
+ if (netif_needs_gso(dev, skb)) {
+ if (unlikely(dev_gso_segment(skb)))
+ goto out_kfree_skb;
+ if (skb->next)
+ goto gso;
+ }
+
rc = ops->ndo_start_xmit(skb, dev);
if (rc == NETDEV_TX_OK)
txq_trans_update(txq);
--
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Thanks David ! --
Hi Eric, Thanks for your patch, just one question on it though: If the dst got changed between call to vlan_dev_hwaccel_hard_start_xmit and it's call to dev_queue_xmit, that change to dst should have reset sk_tx_queue_mapping to -1 by calling sk_tx_queue_clear (assuming that I have changed in all paths, eg __sk_dst_reset), and thus result in a new mapping in dev_pick_tx. Would the patch hide the actual bug where we do not clear sk_tx_queue_mapping, eg __sk_dst_set does it? I agree the patch will fix the panic, but this check could be removed if the code which changes the dst is fixed to clear the mapping. I could check that if you think this assumption is correct. Thanks, - KK --
I believe you focus on another problem. I am not saying we dont have another bug (forgetting to reset sk_dst_cache somewhere). I am only saying that when we want to cache the queue number on a given socket, we have to make sure current packet routing decision was taken on same dst_entries than current and future ones. Denys hit the problem because of long delays caused by traffic shaping. So the cache renew must be safe, which I tried to fix. You are saying that cache invalidation might be missing from some paths. I dont think so because I took an extensive look at these spots when working on yet another RCU conversion two days ago (sk_dst_lock becomes a spinlock). This was fresh in my mind, this is why I probably found Denys problem origin so quickly ;) --
On Monday 12 April 2010 12:31:43 Eric Dumazet wrote: . Seems problem still remain. Patched kernel, but paniced now. Btw, i dont have any multiqueue card, i think.
This is becoming tricky :( This is forwarding case, no socket involved in this case. If no multiqueue is involved, I dont see how it can happen. We should take a look at requeues (qdisc congestion), there might be a problem with them. --
