Re: [RFC] net: release dst entry in dev_queue_xmit()

Previous thread: [PATCH] Phonet:fix build problem by Alexander Beregalov on Friday, March 20, 2009 - 4:58 am. (8 messages)

Next thread: This Internet company is giving away money worldwide by Dice-k on Friday, March 20, 2009 - 6:44 am. (1 message)
From: Eric Dumazet
Date: Friday, March 20, 2009 - 4:40 am

One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls skb free
long after original call to dev_queue_xmit(skb).

CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.

I believe we can release dst in dev_queue_xmit(), while cpu cache is hot, since
caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce
work to be done by softirq handler, and decrease cache misses.

I also believe only pktgen can call dev_queue_xmit() with skb which have
a skb->users != 1. But pkthen skbs have a NULL dst entry.

I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst
if skb->users != 1


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


diff --git a/net/core/dev.c b/net/core/dev.c
index f112970..9e0fd01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1852,6 +1852,20 @@ gso:
 	if (q->enqueue) {
 		spinlock_t *root_lock = qdisc_lock(q);
 
+		/*
+		 * Release dst while its refcnt is hot in CPU cache, instead
+		 * of waiting NIC tx completion
+		 */
+		if (likely(skb->dst)) {
+			if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) {
+				int newrefcnt;
+
+				smp_mb__before_atomic_dec();
+				newrefcnt = atomic_dec_return(&skb->dst->__refcnt);
+				WARN_ON(newrefcnt < 0);
+				skb->dst = NULL;
+			}
+		}
 		spin_lock(root_lock);
 
 		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
--

From: Neil Horman
Date: Friday, March 20, 2009 - 7:10 am

I think this seems like a pretty good idea.  I thought for a moment that some
stacked interfaces (bonds, vlan devices), might have a problem with this, since
they tend to pass through dev_queue_xmit twice, but I can't see a problem with
either one of those cases either, since neither of thier xmit routines makes any
use of the dst pointer.  I'd say include it

Acked-by: Neil Horman <nhorman@tuxdriver.com>

--

From: David Miller
Date: Tuesday, March 24, 2009 - 11:43 pm

From: Eric Dumazet <dada1@cosmosbay.com>

This will break various packet schedulers and classifiers.

Heck sch_sfq.c even uses skb->dst as part of it's flow hash
function :-)
--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 12:13 am

Well, as one of the hash perturbator, for other protocols than
IPV4 & IPV6...

        default:
                h = (unsigned long)skb->dst ^ skb->protocol;
                h2 = (unsigned long)skb->sk;
        }

        return sfq_fold_hash(q, h, h2);

But teql indeed mandates dst in __teql_resolve() 

Darn...

This dst freeing should be done very late then, in the NIC driver itself, just
before giving skb to hardware, or right before in dev_hard_start_xmit()

If done in dev_hard_start_xmit(), skb could be requeued (because of NETDEV_TX_BUSY).
Then if requeued, maybe at this time, dst being NULL is not a problem ?

Thanks a lot David

--

From: David Miller
Date: Wednesday, March 25, 2009 - 12:17 am

From: Eric Dumazet <dada1@cosmosbay.com>

Usually it should be OK because the packet schedulers have
a sort-of one-behind state that allows them to reinsert
the SKB into their queue datastructures without reclassifying.

But I'm not %100 sure there isn't some case that might still
need skb->dst there.
--

From: Jarek Poplawski
Date: Wednesday, March 25, 2009 - 11:22 am

Actually, since David has dumped requeuing there is no reinserting;
this last one "requeued" skb is buffered at the top in q->gso_skb
and waiting for better times.

Jarek P.
--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 11:41 am

Yes indeed, this is what I thought too, thanks Jarek.

I tested following patch today on my machine, but obviously could not try 
all possible quirks :)

[PATCH] net: release dst entry in dev_hard_start_xmit()

One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls skb free
long after original call to dev_queue_xmit(skb).

CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.

I believe we can release dst in dev_hard_start_xmit(), while cpu cache is hot, since
caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce
work to be done by softirq handler, and decrease cache misses.

I also believe only pktgen can call dev_queue_xmit() with skb which have
a skb->users != 1. But pkthen skbs have a NULL dst entry.

I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst
if skb->users != 1

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..a622db6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1664,6 +1664,26 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+
+/*
+ * Release dst while its refcnt is likely hot in CPU cache, instead
+ * of waiting NIC tx completion.
+ * We inline dst_release() code for performance reason
+ */
+static void release_skb_dst(struct sk_buff *skb)
+{
+	if (likely(skb->dst)) {
+		if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) {
+			int newrefcnt;
+
+			smp_mb__before_atomic_dec();
+			newrefcnt = atomic_dec_return(&skb->dst->__refcnt);
+			WARN_ON(newrefcnt < 0);
+			skb->dst = NULL;
+		}
+	}
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq)
 {
@@ -1681,6 +1701,7 @@ int ...
From: Jarek Poplawski
Date: Wednesday, March 25, 2009 - 12:18 pm

Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at
xmits in macvlan and pppoe. Maybe this patch should exclude them?

--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 12:40 pm

Yes, MACVLAN :) its macvlan_start_xmit() function calls
dev_queue_xmit(skb) again, so we go back to packet schedulers and
classifiers, they might need dst again :(

Only other potential problem I found was in 
drivers/net/appletalk/ipddp.c

static int ipddp_xmit(struct sk_buff *skb, struct net_device *dev)
{
__be32 paddr = ((struct rtable*)skb->dst)->rt_gateway;

Not sure this driver is still supported, or if this paddr can be found elsewhere...
 __sk_dst_get(skb->sk) ???

--

From: Jarek Poplawski
Date: Wednesday, March 25, 2009 - 12:54 pm

I guess you've considered the loopback too...

Jarek P.
--

From: Eric Dumazet
Date: Wednesday, March 25, 2009 - 1:28 pm

I had for the first patch (carefuly avoiding loopback being not responsive) :
I was releasing dst only if enqueue() was called.

You are right, loopback doesnt work anymore with last patch,
and I have no idea why...

Do you know why ?

Thank you

--

From: Jarek Poplawski
Date: Wednesday, March 25, 2009 - 2:12 pm

Hmm.. isn't this test for dst == NULL in ip_rcv_finish expected to be
negative for loopback source?

Jarek P.
--

From: Patrick McHardy
Date: Wednesday, March 25, 2009 - 2:20 pm

ip_route_input() doesn't accept loopback addresses, so loopback packets
already need to have a dst_entry attached.
--

From: Eric Dumazet
Date: Tuesday, May 12, 2009 - 1:12 am

One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls
dev_kree_skb() long after original call to dev_queue_xmit(skb).

CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.

It seems right place to release dst is in dev_hard_start_xmit(), for most
devices but ones that are virtual, and some exceptions.

David Miller suggested to define a new device flag, set in alloc_netdev_mq()
(so that most devices set it at init time), and carefuly unset in devices
which dont want a NULL skb->dst in their ndo_start_xmit().

List of devices that must clear this flag is :

- loopback device, because it calls netif_rx() and quoting Patrick :
    "ip_route_input() doesn't accept loopback addresses, so loopback packets
     already need to have a dst_entry attached."
- appletalk/ipddp.c : needs skb->dst in its xmit function

- And all devices that call again dev_queue_xmit() from their xmit function
(as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 drivers/net/appletalk/ipddp.c   |    1 +
 drivers/net/bonding/bond_main.c |    1 +
 drivers/net/eql.c               |    1 +
 drivers/net/ifb.c               |    1 +
 drivers/net/loopback.c          |    1 +
 drivers/net/macvlan.c           |    1 +
 drivers/net/wan/hdlc_fr.c       |    1 +
 include/linux/if.h              |    3 +++
 net/core/dev.c                  |    9 +++++++++
 9 files changed, 19 insertions(+)

diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c
index da64ba8..0268561 100644
--- a/drivers/net/appletalk/ipddp.c
+++ b/drivers/net/appletalk/ipddp.c
@@ -71,6 +71,7 @@ static struct net_device * __init ipddp_init(void)
 	if (!dev)
 		return ERR_PTR(-ENOMEM);
 ...
From: Eric Dumazet
Date: Tuesday, May 12, 2009 - 1:21 am

From: Jarek Poplawski
Date: Tuesday, May 12, 2009 - 12:27 pm

Why not vlan (and/or maybe others in net/ using dev_queue_xmit)?

Jarek P.
--

From: Eric Dumazet
Date: Tuesday, May 12, 2009 - 12:44 pm

Yes I spoted vlan earlier this afternoon.

For other net/*, I didnt not find candidates yet.


--

From: Jarek Poplawski
Date: Tuesday, May 12, 2009 - 1:05 pm

Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/
using dev_queue_xmit)?

Jarek P.
--

From: Jarek Poplawski
Date: Tuesday, May 12, 2009 - 1:24 pm

On Tue, May 12, 2009 at 10:05:15PM +0200, Jarek Poplawski wrote:

I wonder if we can't simplify this e.g. by checking indirectly for
some hardware flag/capability like checksums etc. Some older hardware
could be omitted, by is it really so big deal?

Jarek P.
--

From: Eric Dumazet
Date: Tuesday, May 12, 2009 - 1:52 pm

As I said, I didnt found other relevant uses, but I am probably wrong :)

About ppoe for example, two calls to dev_queue_xmit() are not relevant.

One is from 

static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock,
          struct msghdr *m, size_t total_len)

This is a normal direct call, not a call from its ndo_start_xmit()

Second one is from

static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)

Same observation, there is no impact for this one as well.



So we have to care on device drivers that have a ndo_start_xmit() call that could
re-enter dev_queue_xmit(). Not care about drivers that call dev_queue_xmit() from
other paths.

--

From: Jarek Poplawski
Date: Tuesday, May 12, 2009 - 1:59 pm

Isn't it called through ppp_generic's ndo_start_xmit()?

Jarek P.
--

From: Eric Dumazet
Date: Tuesday, May 12, 2009 - 12:26 pm

Here is a second version, including change to vlan code as well.

Thank you

[PATCH] net: release dst entry in dev_hard_start_xmit()

One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls
dev_kree_skb() long after original call to dev_queue_xmit(skb).

CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.

It seems right place to release dst is in dev_hard_start_xmit(), for most
devices but ones that are virtual, and some exceptions.

David Miller suggested to define a new device flag, set in alloc_netdev_mq()
(so that most devices set it at init time), and carefuly unset in devices
which dont want a NULL skb->dst in their ndo_start_xmit().

List of devices that must clear this flag is :

- loopback device, because it calls netif_rx() and quoting Patrick :
    "ip_route_input() doesn't accept loopback addresses, so loopback packets
     already need to have a dst_entry attached."
- appletalk/ipddp.c : needs skb->dst in its xmit function

- And all devices that call again dev_queue_xmit() from their xmit function
(as some classifiers need skb->dst) : bonding, vlan, macvlan, eql, ifb, hdlc_fr

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 drivers/net/appletalk/ipddp.c   |    1 +
 drivers/net/bonding/bond_main.c |    1 +
 drivers/net/eql.c               |    1 +
 drivers/net/ifb.c               |    1 +
 drivers/net/loopback.c          |    1 +
 drivers/net/macvlan.c           |    1 +
 drivers/net/wan/hdlc_fr.c       |    1 +
 include/linux/if.h              |    3 +++
 net/8021q/vlan_dev.c            |    1 +
 net/core/dev.c                  |    9 +++++++++
 10 files changed, 20 insertions(+)

diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c
index da64ba8..f939e92 ...
From: David Miller
Date: Monday, May 18, 2009 - 10:19 pm

From: Eric Dumazet <dada1@cosmosbay.com>

Applied, thanks Eric.

Eric, please followup and double-check the pppoe paths
that Jarek mentioned.  I never saw that fully resolved.

Thanks!
--

From: Eric Dumazet
Date: Monday, May 18, 2009 - 10:44 pm

Ah OK, I 'll do this David

Thanks !


--

From: Eric Dumazet
Date: Tuesday, May 19, 2009 - 12:44 pm

[PATCH] ppp: unset IFF_XMIT_DST_RELEASE in ppp_setup()

Jarek pointed pppoe can call back dev_queue_xmit(), and might need
skb->dst, so its safer to unset IFF_XMIT_DST_RELEASE on ppp devices.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 8ee9142..639d11b 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1054,6 +1054,7 @@ static void ppp_setup(struct net_device *dev)
 	dev->type = ARPHRD_PPP;
 	dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST;
 	dev->features |= NETIF_F_NETNS_LOCAL;
+	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
 }
 
 /*

--

From: Jarek Poplawski
Date: Tuesday, May 19, 2009 - 2:09 pm

Hmm... Of course, this patch looks OK to me, but actually my main
concern was more general. We avoid adding such flags for each "real"
dev, but if so IMHO it would be safer to generally add them to all
"virtual" devs - needed or not. You prefer to do this only where
necessary, but it's not always clear if it's omitted on purpose or
by chance. So, now I'm wondering about xen-netfront - needlessly I
hope ;-)

--

From: Eric Dumazet
Date: Tuesday, May 19, 2009 - 2:21 pm

This is the deal in fact, tracking all valid uses, and I'll check this.

Another path would have to set the flag only for fast devices (Gb and 10Gb)



--

From: David Miller
Date: Tuesday, May 19, 2009 - 2:24 pm

From: Jarek Poplawski <jarkao2@gmail.com>

It is an issue that surely needs to be fleshed out, to make
sure we use this where possible without breaking odd cases
too easily.

But for now I'm applying Eric's patch because it does fix
something until we have these issues sorted more generally.
--

Previous thread: [PATCH] Phonet:fix build problem by Alexander Beregalov on Friday, March 20, 2009 - 4:58 am. (8 messages)

Next thread: This Internet company is giving away money worldwide by Dice-k on Friday, March 20, 2009 - 6:44 am. (1 message)