Re: NULL pointer dereference panic in stable (2.6.33.2), amd64

Previous thread: [RFC PATCH 0/9] net: support multiple independant multicast routing instances by kaber on Sunday, April 11, 2010 - 10:37 am. (7 messages)

Next thread: none
From: Denys Fedorysychenko
Date: Sunday, April 11, 2010 - 1:38 pm

Hi

Got recently NULL pointer dereference
Note - it seems in 32-bit i didn't experience this issue.

Router with NAT, HFSC shapers terminated with bfifo qdiscs (some attached to 
802.1Q vlan's), userspace application using tun.
Hardware - network cards e1000e, Core 2 based architecture (two Quad Xeon), 

full message received over netconsole attached
More info can provide on request
From: Eric Dumazet
Date: Sunday, April 11, 2010 - 3:35 pm

Hi Denys !

txq->qdisc is NULL at this point, I dont think its even possible, so my
random guess is dev_pick_tx() got an queue_index >
dev->real_num_tx_queues


        txq = dev_pick_tx(dev, skb);
        q = rcu_dereference(txq->qdisc);  // q = NULL

#ifdef CONFIG_NET_CLS_ACT
        skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
#endif
        if (q->enqueue) {  // crash in dereference
                rc = __dev_xmit_skb(skb, q, dev, txq);
                goto out;
        }


I believe the following lines from dev_pick_tx() are not the problem :

	if (sk && sk->sk_dst_cache) 
		sk_tx_queue_set(sk, queue_index);

It is IMHO not safe, because route for this socket might have just
changed and we are transmitting an old packet (queued some milli seconds
before, when route was different).

We then memorize a queue_index that might be too big for the new device
of new selected route.

Next packet we want to transmit will take the cached value of
queue_index, correct for old device, maybe not correct for new device.

You could try to revert commit a4ee3ce3293dc931fab19beb472a8bde1295aebe

commit a4ee3ce3293dc931fab19beb472a8bde1295aebe
Author: Krishna Kumar <krkumar2@in.ibm.com>
Date:   Mon Oct 19 23:50:07 2009 +0000

    net: Use sk_tx_queue_mapping for connected sockets
    
    For connected sockets, the first run of dev_pick_tx saves the
    calculated txq in sk_tx_queue_mapping. This is not saved if
    either the device has a queue select or the socket is not
    connected. Next iterations of dev_pick_tx uses the cached value
    of sk_tx_queue_mapping.
    
    Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/dev.c b/net/core/dev.c
index 28b0b9e..fa88dcd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1791,13 +1791,25 @@ EXPORT_SYMBOL(skb_tx_hash);
 static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 					struct sk_buff *skb)
 {
-	const struct ...
From: Denys Fedorysychenko
Date: Sunday, April 11, 2010 - 4:04 pm

I will try to revert it now. But to trigger bug probably i need 1-2 days.

--

From: Eric Dumazet
Date: Sunday, April 11, 2010 - 4:11 pm

I can try to reproduce bug here, with a multiqueue device, a non
multiqueue device, and changing routes while transmitting bulk of
packets, and of course some trafic shaping to slow down xmits.

Could you send :

ifconfig -a
lspci

Thanks


--

From: Denys Fedorysychenko
Date: Sunday, April 11, 2010 - 4:36 pm

A bit more even, hopefully it is useful.
Additional note - i have few blackhole routes, so if specific route announce 
will disappear - it will be less specific blackhole route.

I have also some source routing (ip rule).


lspci

00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub 
(rev b1)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 
2 (rev b1)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 
3 (rev b1)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 
4-5 (rev b1)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 
5 (rev b1)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 
6-7 (rev b1)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 
7 (rev b1)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 
b1)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 
b1)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 
b1)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers 
(rev b1)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers 
(rev b1)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 
b1)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 
b1)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express 
Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI 
USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI 
USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI 
USB Controller #3 (rev 09)
00:1d.3 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI 
USB Controller #4 (rev 09)
00:1d.7 USB Controller: Intel ...
From: Krishna Kumar2
Date: Sunday, April 11, 2010 - 8:38 pm

Hi Eric,


When route changes, I think my patch had reset sk->sk_tx_queue_mapping
by calling sk_tx_queue_clear. I don't know if I missed any path where
the route changes and sk_dst_reset() was not called.

The following might be better to prove the panic is due to this, since
your suggestion will hide a panic that happens somewhat rare (according
to Denys):

      if (sk_tx_queue_recorded(sk)) {
            queue_index = sk_tx_queue_get(sk);
+           queue_index = dev_cap_txqueue(dev, queue_index);
      } else {

Thanks,


--

From: Eric Dumazet
Date: Sunday, April 11, 2010 - 11:01 pm

Problem is when you reset sk->sk_tx_queue_mapping at the very moment
route (or destination) changes, we might have old packets queued in tx
queues, of the old ethernet device (eth0 : multi queue compatable)

2) Application does a sendmsg() or connect() call and sk->sk_dst_cache
is rebuild, it points to a dst_entry referring a new device (eth1 : non
multiqueue)

3) When one old packet finally is transmitted, we do :

	queue_index = 1; // any value > 0

	if (sk && sk->sk_dst_cache)
		sk_tx_queue_set(sk, queue_index); // remember a >0 value

4) application does a sendmsg(), enqueues a new skb on eth1

5) We re-enter dev_pick_tx(), and consider cached value in 3) is valid.
   we pick a non existent txq for eth1 device.


Sure, but I thought I was clear enough to prove this commit was wrong,


--

From: Eric Dumazet
Date: Monday, April 12, 2010 - 12:18 am

Here the patch I cooked, this is a stable candidate, once tested and
acknowledged.

[PATCH] net: dev_pick_tx() fix

When dev_pick_tx() caches tx queue_index on a socket, we must check
socket dst_entry matches skb one, or risk a crash later, as reported by
Denys Fedorysychenko, if old packets are in flight during a route
change, involving devices with different number of queues.

Bug introduced by commit a4ee3ce3
(net: Use sk_tx_queue_mapping for connected sockets)

Reported-by: Denys Fedorysychenko <nuclearcat@nuclearcat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 1c8a0ce..92584bf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1989,8 +1989,12 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 			if (dev->real_num_tx_queues > 1)
 				queue_index = skb_tx_hash(dev, skb);
 
-			if (sk && sk->sk_dst_cache)
-				sk_tx_queue_set(sk, queue_index);
+			if (sk) {
+				struct dst_entry *dst = rcu_dereference(sk->sk_dst_cache);
+
+				if (dst && skb_dst(skb) == dst)
+					sk_tx_queue_set(sk, queue_index);
+			}
 		}
 	}
 


--

From: David Miller
Date: Monday, April 12, 2010 - 12:36 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Looks good Eric.  I'll integrate this into net-2.6 and also queue it
up for -stable in a day or two unless some problem is discovered.

Thanks!
--

From: David Miller
Date: Wednesday, April 14, 2010 - 11:52 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

It looks like Denys is still getting crashes even with this patch
applied.  And I also think there is some meric to some of Krishna's
analysis.

To me it seems to make more sense to validate the SKB's queue against
the real actual choosen device's range.

The socket queue index will catch up and eventually become valid
because the dst reset will invalidate the queue setting, and we'll
thus recompute it as needed, as Krishna stated.

So I'm tossing this patch for now since it doesn't even aparently
fix the bug.

--

From: Eric Dumazet
Date: Thursday, April 15, 2010 - 1:02 am

I am a bit lost here David.

Denys got a crash that we cannot explain yet. He said he has no
multiqueue devices, so obviously my patch cant help him.

But this patch was fixing a real issue, I believe I pointed it twice
already...

I'll try to setup an environment to trigger this bug for real, but this
will take time, my dev machines are not multiqueue.



--

From: David Miller
Date: Thursday, April 15, 2010 - 1:26 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Ok, I thought your patch was specifically meant to fix Denys's bug.

The confusion comes from the fact that you mention Denys's crash in
your commit message:

--------------------
When dev_pick_tx() caches tx queue_index on a socket, we must check
socket dst_entry matches skb one, or risk a crash later, as reported by
Denys Fedorysychenko, if old packets are in flight during a route
change, involving devices with different number of queues.
--------------------

Anyways, I studied your patch once more and read the thread discussion
with Krishna again and your patch looks fine.  I'll apply it to
net-2.6, thanks!

--

From: Eric Dumazet
Date: Thursday, April 15, 2010 - 1:51 am

In any case, I think there is a fundamental problem with this sk
caching. Because one packet can travel in many stacked devices before
hitting the wire.

(bonding, vlan, ethernet) for example.

Socket cache is meaningfull for one level only...



--

From: David Miller
Date: Thursday, April 15, 2010 - 2:06 am

From: Eric Dumazet <eric.dumazet@gmail.com>

We were talking the other day about that 'tun' change to orphan the
SKB on TX, and I mentioned the possibility of just doing this in some
generic location before we give the packet to the device ->xmit()
method.

Such a scheme could help with this problem too.
--

From: Denys Fedorysychenko
Date: Thursday, April 15, 2010 - 2:11 am

Btw i have application using tun.

--

From: Eric Dumazet
Date: Thursday, April 15, 2010 - 3:37 am

Could you add following sanity test to catch the error ?

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..b67274a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -988,6 +988,7 @@ static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
 {
+	WARN_ON(index >= dev->num_tx_queues);
 	return &dev->_tx[index];
 }
 


--

From: Denys Fedorysychenko
Date: Thursday, April 29, 2010 - 3:50 am

Very sorry for being late, just i found way to stabilize kernel for me and to 
solve my personal life issues. It took 2 weeks...
I will try this patch, but i'm sure i dont have multiqueue card there.

I had shaper on eth0.33 and eth0 so i disable all offloading except 
checksumming there. Recently i found  that on some interfaces (where is no 
shaper) gso left on. And yes, probably some traffic was queued, then route 
changed, and maybe it went from gso off interface to gso on... 
It can be just pure luck, that i dont have crashes anymore, but maybe trick is 
in gso...
--

From: Eric Dumazet
Date: Thursday, April 15, 2010 - 1:30 pm

Same thing we did with 

if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
	skb_dst_drop(skb);

in dev_hard_start_xmit() ?

Problem is this skb_tstamp_tx() thing....

One possibility would be to change skb_orphan() to let everything done
by destructor. 

One more argument would let destructor() know if this is the final
destructor() called from skb_release_head_state()

destructor() would be responsible to set skb->destructor and/or skb->sk
to NULL when possible. 


All normal destructors would not care of this 2nd argument and just do
what they actually do, plus setting skb->sk = NULL, skb->destructor =
NULL

Fast path would stay as today, no extra test.

Only tstamp users would need to setup another destructor, a bit more
complex (it would have to take a look at 2nd argument before really
doing the job)


Completely untested patch to get the idea :
(to be completed for the tstamp thing)


 include/linux/skbuff.h         |    8 +++-----
 include/net/sctp/sctp.h        |    2 +-
 include/net/sock.h             |    4 ++--
 net/caif/caif_socket.c         |    4 +++-
 net/core/dev.c                 |   26 +++++++++-----------------
 net/core/skbuff.c              |    2 +-
 net/core/sock.c                |    8 ++++++--
 net/l2tp/l2tp_core.c           |    4 +++-
 net/netfilter/nf_tproxy_core.c |    2 +-
 net/packet/af_packet.c         |    4 ++--
 net/unix/af_unix.c             |    4 ++--
 11 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 38501d2..7dfd833 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -353,7 +353,7 @@ struct sk_buff {
 	kmemcheck_bitfield_end(flags1);
 	__be16			protocol;
 
-	void			(*destructor)(struct sk_buff *skb);
+	void			(*destructor)(struct sk_buff *skb, int final);
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 	struct nf_conntrack	*nfct;
 	struct sk_buff		*nfct_reasm;
@@ -1407,15 +1407,13 @@ static inline void ...
From: Eric Dumazet
Date: Thursday, April 15, 2010 - 1:46 pm

Well, might need to test if skb is not shared before orphaning it

	if (!skb_shared(skb))
		skb_orphan(nskb);



--

From: David Miller
Date: Thursday, April 15, 2010 - 2:33 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

If it's not legal to skb_orphan() here then it would not be legal for
the drivers to unconditionally skb_orphan(), which they do.


--

From: Eric Dumazet
Date: Friday, April 16, 2010 - 3:18 pm

I cooked following patch, introducing skb_orphan_try() helper, to
document all known exceptions.

I have a possible followup for this patch :

Orphaning skbs earlier could also make dev_kfree_skb_irq() faster.
Instead of queing skb into completion_queue and triggering
NET_TX_SOFTIRQ, we would directly free an orphaned skb ?



[PATCH net-next-2.6] net: Introduce skb_orphan_try()

Transmitted skb might be attached to a socket and a destructor, for
memory accounting purposes.

Traditionally, this destructor is called at tx completion time, when skb
is freed.

When tx completion is performed by another cpu than the sender, this
forces some cache lines to change ownership. XPS was an attempt to give
tx completion to initial cpu.

David idea is to call destructor right before giving skb to device (call
to ndo_start_xmit()). Because device queues are usually small, orphaning
skb before tx completion is not a big deal. Some drivers already do
this, we could do it in upper level.

There is one known exception to this early orphaning, called tx
timestamping. It needs to keep a reference to socket until device can
give a hardware or software timestamp.

This patch adds a skb_orphan_try() helper, to centralize all exceptions
to early orphaning in one spot, and use it in dev_hard_start_xmit().

"tbench 16" results on a Nehalem machine (2 X5570  @ 2.93GHz)
before: Throughput 4428.9 MB/sec 16 procs
after: Throughput 4448.14 MB/sec 16 procs

UDP should get even better results, its destructor being more complex,
since SOCK_USE_WRITE_QUEUE is not set (four atomic ops instead of one)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index e8041eb..acae5fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1880,6 +1880,17 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+/*
+ * Try to orphan skb early, right before transmission by the device.
+ * We cannot orphan skb if tx timestamp is requested, ...
From: David Miller
Date: Sunday, April 18, 2010 - 2:46 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, applied, thanks Eric.

That timestamping issue, I bet there is some simple solution hiding
in the bushes for that one?

Hmmm, something like... there is a less transient piece of state
(perhaps in the socket) and the skb has a pointer to that piece of
state.  Then the driver writes into this pointed-to place rather
into some member of the skb itself.

Then all of these problems with referencing the skb metadata across

Sounds great.  But I don't see dev_kfree_skb_irq() as much of a
performance priority, since all sane modern drivers free skbs from
softirq context (via NAPI).
--

From: Eric Dumazet
Date: Tuesday, April 20, 2010 - 11:08 pm

Hmm, looking at the GSO stuff, I believe we should not call
skb_orphan_try() on gso skbs ?

Thanks !

[PATCH net-next-2.6] net: Dont call skb_orphan_try() for GSO

At this point, skb->destructor is not the original one (stored in
DEV_GSO_CB(skb)->destructor)

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index b31d5d6..200d1e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1937,7 +1937,6 @@ gso:
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(nskb);
 
-		skb_orphan_try(nskb);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			if (rc & ~NETDEV_TX_MASK)


--

From: David Miller
Date: Wednesday, April 21, 2010 - 10:56 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

Right, I've applied this, thanks.

What we should probably do instead is call and NULL out the
DEV_GSO_CB() destructor.  Right?
--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 12:10 am

Yes, probably, I'll take a look at this if you want.

Thanks


--

From: David Miller
Date: Thursday, April 22, 2010 - 12:16 am

From: Eric Dumazet <eric.dumazet@gmail.com>

It might look something like this:

diff --git a/net/core/dev.c b/net/core/dev.c
index 9bf1ccc..13241da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1892,6 +1892,20 @@ static inline void skb_orphan_try(struct sk_buff *skb)
 		skb_orphan(skb);
 }
 
+/*
+ * GSO packets need to be handled specially because such packets
+ * hold the normal SKB destructor in a backup pointer.
+ */
+static inline void skb_orphan_try_gso(struct sk_buff *skb)
+{
+	if (!skb_tx(skb)->flags) {
+		if (DEV_GSO_CB(skb)->destructor)
+			DEV_GSO_CB(skb)->destructor(skb);
+		DEV_GSO_CB(skb)->destructor = NULL;
+		skb->sk = NULL;
+	}
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq)
 {
@@ -1937,6 +1951,7 @@ gso:
 		if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
 			skb_dst_drop(nskb);
 
+		skb_orphan_try_gso(skb);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			if (rc & ~NETDEV_TX_MASK)
--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 12:24 am

Hmm... are you sure we want to call destructor for each skb ?

Should'nt we do it before initial skb is split ?



--

From: David Miller
Date: Thursday, April 22, 2010 - 12:26 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Good idea, therefore you mean something like this?

diff --git a/net/core/dev.c b/net/core/dev.c
index 3ba774b..f3c3885 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1865,6 +1865,7 @@ static int dev_gso_segment(struct sk_buff *skb)
 	int features = dev->features & ~(illegal_highdma(dev, skb) ?
 					 NETIF_F_SG : 0);
 
+	skb_orphan_try(skb);
 	segs = skb_gso_segment(skb, features);
 
 	/* Verifying header integrity only. */
--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 12:33 am

Yes, it seems better.

What about the 

if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
	skb_dst_drop(skb);

This thing might also be moved before the split, since split probably
clone all dst ?


--

From: David Miller
Date: Thursday, April 22, 2010 - 12:41 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Good catch, agreed.

diff --git a/net/core/dev.c b/net/core/dev.c
index 3ba774b..4f897e2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1851,6 +1851,17 @@ static void dev_gso_skb_destructor(struct sk_buff *skb)
 		cb->destructor(skb);
 }
 
+/*
+ * Try to orphan skb early, right before transmission by the device.
+ * We cannot orphan skb if tx timestamp is requested, since
+ * drivers need to call skb_tstamp_tx() to send the timestamp.
+ */
+static inline void skb_orphan_try(struct sk_buff *skb)
+{
+	if (!skb_tx(skb)->flags)
+		skb_orphan(skb);
+}
+
 /**
  *	dev_gso_segment - Perform emulated hardware segmentation on skb.
  *	@skb: buffer to segment
@@ -1865,6 +1876,7 @@ static int dev_gso_segment(struct sk_buff *skb)
 	int features = dev->features & ~(illegal_highdma(dev, skb) ?
 					 NETIF_F_SG : 0);
 
+	skb_orphan_try(skb);
 	segs = skb_gso_segment(skb, features);
 
 	/* Verifying header integrity only. */
@@ -1881,17 +1893,6 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
-/*
- * Try to orphan skb early, right before transmission by the device.
- * We cannot orphan skb if tx timestamp is requested, since
- * drivers need to call skb_tstamp_tx() to send the timestamp.
- */
-static inline void skb_orphan_try(struct sk_buff *skb)
-{
-	if (!skb_tx(skb)->flags)
-		skb_orphan(skb);
-}
-
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq)
 {
@@ -1902,13 +1903,6 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
 
-		if (netif_needs_gso(dev, skb)) {
-			if (unlikely(dev_gso_segment(skb)))
-				goto out_kfree_skb;
-			if (skb->next)
-				goto gso;
-		}
-
 		/*
 		 * If device doesnt need skb->dst, release it right now while
 		 * its hot in this cpu cache
@@ -1916,6 +1910,13 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 ...
From: Eric Dumazet
Date: Thursday, April 22, 2010 - 12:47 am

You could have one skb_orphan_try() call before the

if (netif_needs_gso(dev, skb)) {

and remove it from dev_gso_segment() ?


--

From: David Miller
Date: Thursday, April 22, 2010 - 12:54 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Yes, that's much more concise.  This should be ready to go:

net: Orphan and de-dst skbs earlier in xmit path.

This way GSO packets don't get handled differently.

With help from Eric Dumazet.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/dev.c b/net/core/dev.c
index 3ba774b..a4a7c36 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1902,13 +1902,6 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
 
-		if (netif_needs_gso(dev, skb)) {
-			if (unlikely(dev_gso_segment(skb)))
-				goto out_kfree_skb;
-			if (skb->next)
-				goto gso;
-		}
-
 		/*
 		 * If device doesnt need skb->dst, release it right now while
 		 * its hot in this cpu cache
@@ -1917,6 +1910,14 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			skb_dst_drop(skb);
 
 		skb_orphan_try(skb);
+
+		if (netif_needs_gso(dev, skb)) {
+			if (unlikely(dev_gso_segment(skb)))
+				goto out_kfree_skb;
+			if (skb->next)
+				goto gso;
+		}
+
 		rc = ops->ndo_start_xmit(skb, dev);
 		if (rc == NETDEV_TX_OK)
 			txq_trans_update(txq);
--

From: Eric Dumazet
Date: Thursday, April 22, 2010 - 12:59 am

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks David !


--

From: Krishna Kumar2
Date: Monday, April 12, 2010 - 12:54 am

Hi Eric,

Thanks for your patch, just one question on it though:


If the dst got changed between call to vlan_dev_hwaccel_hard_start_xmit
and it's call to dev_queue_xmit, that change to dst should have reset
sk_tx_queue_mapping to -1 by calling sk_tx_queue_clear (assuming that I
have changed in all paths, eg __sk_dst_reset), and thus result in a new
mapping in dev_pick_tx. Would the patch hide the actual bug where we do
not clear sk_tx_queue_mapping, eg __sk_dst_set does it? I agree the
patch will fix the panic, but this check could be removed if the code
which changes the dst is fixed to clear the mapping. I could check that
if you think this assumption is correct.

Thanks,

- KK

--

From: Eric Dumazet
Date: Monday, April 12, 2010 - 2:31 am

I believe you focus on another problem. I am not saying we dont have
another bug (forgetting to reset sk_dst_cache somewhere).

I am only saying that when we want to cache the queue number on a given
socket, we have to make sure current packet routing decision was taken
on same dst_entries than current and future ones. Denys hit the problem
because of long delays caused by traffic shaping.

So the cache renew must be safe, which I tried to fix.

You are saying that cache invalidation might be missing from some paths.
I dont think so because I took an extensive look at these spots when
working on yet another RCU conversion two days ago (sk_dst_lock becomes
a spinlock). This was fresh in my mind, this is why I probably found
Denys problem origin so quickly ;)



--

From: Denys Fedorysychenko
Date: Monday, April 12, 2010 - 9:11 am

On Monday 12 April 2010 12:31:43 Eric Dumazet wrote:
.
Seems problem still remain. Patched kernel, but paniced now.
Btw, i dont have any multiqueue card, i think.
From: Eric Dumazet
Date: Monday, April 12, 2010 - 1:09 pm

This is becoming tricky :(

This is forwarding case, no socket involved in this case.

If no multiqueue is involved, I dont see how it can happen.

We should take a look at requeues (qdisc congestion), there might be a
problem with them.



--

Previous thread: [RFC PATCH 0/9] net: support multiple independant multicast routing instances by kaber on Sunday, April 11, 2010 - 10:37 am. (7 messages)

Next thread: none