One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls skb free
long after original call to dev_queue_xmit(skb).
CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.
I believe we can release dst in dev_queue_xmit(), while cpu cache is hot, since
caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce
work to be done by softirq handler, and decrease cache misses.
I also believe only pktgen can call dev_queue_xmit() with skb which have
a skb->users != 1. But pkthen skbs have a NULL dst entry.
I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst
if skb->users != 1
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
diff --git a/net/core/dev.c b/net/core/dev.c
index f112970..9e0fd01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1852,6 +1852,20 @@ gso:
if (q->enqueue) {
spinlock_t *root_lock = qdisc_lock(q);
+ /*
+ * Release dst while its refcnt is hot in CPU cache, instead
+ * of waiting NIC tx completion
+ */
+ if (likely(skb->dst)) {
+ if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) {
+ int newrefcnt;
+
+ smp_mb__before_atomic_dec();
+ newrefcnt = atomic_dec_return(&skb->dst->__refcnt);
+ WARN_ON(newrefcnt < 0);
+ skb->dst = NULL;
+ }
+ }
spin_lock(root_lock);
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
--
I think this seems like a pretty good idea. I thought for a moment that some stacked interfaces (bonds, vlan devices), might have a problem with this, since they tend to pass through dev_queue_xmit twice, but I can't see a problem with either one of those cases either, since neither of thier xmit routines makes any use of the dst pointer. I'd say include it Acked-by: Neil Horman <nhorman@tuxdriver.com> --
From: Eric Dumazet <dada1@cosmosbay.com> This will break various packet schedulers and classifiers. Heck sch_sfq.c even uses skb->dst as part of it's flow hash function :-) --
Well, as one of the hash perturbator, for other protocols than
IPV4 & IPV6...
default:
h = (unsigned long)skb->dst ^ skb->protocol;
h2 = (unsigned long)skb->sk;
}
return sfq_fold_hash(q, h, h2);
But teql indeed mandates dst in __teql_resolve()
Darn...
This dst freeing should be done very late then, in the NIC driver itself, just
before giving skb to hardware, or right before in dev_hard_start_xmit()
If done in dev_hard_start_xmit(), skb could be requeued (because of NETDEV_TX_BUSY).
Then if requeued, maybe at this time, dst being NULL is not a problem ?
Thanks a lot David
--
From: Eric Dumazet <dada1@cosmosbay.com> Usually it should be OK because the packet schedulers have a sort-of one-behind state that allows them to reinsert the SKB into their queue datastructures without reclassifying. But I'm not %100 sure there isn't some case that might still need skb->dst there. --
Actually, since David has dumped requeuing there is no reinserting; this last one "requeued" skb is buffered at the top in q->gso_skb and waiting for better times. Jarek P. --
Yes indeed, this is what I thought too, thanks Jarek.
I tested following patch today on my machine, but obviously could not try
all possible quirks :)
[PATCH] net: release dst entry in dev_hard_start_xmit()
One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls skb free
long after original call to dev_queue_xmit(skb).
CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.
I believe we can release dst in dev_hard_start_xmit(), while cpu cache is hot, since
caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce
work to be done by softirq handler, and decrease cache misses.
I also believe only pktgen can call dev_queue_xmit() with skb which have
a skb->users != 1. But pkthen skbs have a NULL dst entry.
I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst
if skb->users != 1
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..a622db6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1664,6 +1664,26 @@ static int dev_gso_segment(struct sk_buff *skb)
return 0;
}
+
+/*
+ * Release dst while its refcnt is likely hot in CPU cache, instead
+ * of waiting NIC tx completion.
+ * We inline dst_release() code for performance reason
+ */
+static void release_skb_dst(struct sk_buff *skb)
+{
+ if (likely(skb->dst)) {
+ if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) {
+ int newrefcnt;
+
+ smp_mb__before_atomic_dec();
+ newrefcnt = atomic_dec_return(&skb->dst->__refcnt);
+ WARN_ON(newrefcnt < 0);
+ skb->dst = NULL;
+ }
+ }
+}
+
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq)
{
@@ -1681,6 +1701,7 @@ int ...Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at xmits in macvlan and pppoe. Maybe this patch should exclude them? --
Yes, MACVLAN :) its macvlan_start_xmit() function calls
dev_queue_xmit(skb) again, so we go back to packet schedulers and
classifiers, they might need dst again :(
Only other potential problem I found was in
drivers/net/appletalk/ipddp.c
static int ipddp_xmit(struct sk_buff *skb, struct net_device *dev)
{
__be32 paddr = ((struct rtable*)skb->dst)->rt_gateway;
Not sure this driver is still supported, or if this paddr can be found elsewhere...
__sk_dst_get(skb->sk) ???
--
I guess you've considered the loopback too... Jarek P. --
I had for the first patch (carefuly avoiding loopback being not responsive) : I was releasing dst only if enqueue() was called. You are right, loopback doesnt work anymore with last patch, and I have no idea why... Do you know why ? Thank you --
Hmm.. isn't this test for dst == NULL in ip_rcv_finish expected to be negative for loopback source? Jarek P. --
ip_route_input() doesn't accept loopback addresses, so loopback packets already need to have a dst_entry attached. --
One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls
dev_kree_skb() long after original call to dev_queue_xmit(skb).
CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.
It seems right place to release dst is in dev_hard_start_xmit(), for most
devices but ones that are virtual, and some exceptions.
David Miller suggested to define a new device flag, set in alloc_netdev_mq()
(so that most devices set it at init time), and carefuly unset in devices
which dont want a NULL skb->dst in their ndo_start_xmit().
List of devices that must clear this flag is :
- loopback device, because it calls netif_rx() and quoting Patrick :
"ip_route_input() doesn't accept loopback addresses, so loopback packets
already need to have a dst_entry attached."
- appletalk/ipddp.c : needs skb->dst in its xmit function
- And all devices that call again dev_queue_xmit() from their xmit function
(as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
drivers/net/appletalk/ipddp.c | 1 +
drivers/net/bonding/bond_main.c | 1 +
drivers/net/eql.c | 1 +
drivers/net/ifb.c | 1 +
drivers/net/loopback.c | 1 +
drivers/net/macvlan.c | 1 +
drivers/net/wan/hdlc_fr.c | 1 +
include/linux/if.h | 3 +++
net/core/dev.c | 9 +++++++++
9 files changed, 19 insertions(+)
diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c
index da64ba8..0268561 100644
--- a/drivers/net/appletalk/ipddp.c
+++ b/drivers/net/appletalk/ipddp.c
@@ -71,6 +71,7 @@ static struct net_device * __init ipddp_init(void)
if (!dev)
return ERR_PTR(-ENOMEM);
...Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? Jarek P. --
Yes I spoted vlan earlier this afternoon. For other net/*, I didnt not find candidates yet. --
Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ using dev_queue_xmit)? Jarek P. --
On Tue, May 12, 2009 at 10:05:15PM +0200, Jarek Poplawski wrote: I wonder if we can't simplify this e.g. by checking indirectly for some hardware flag/capability like checksums etc. Some older hardware could be omitted, by is it really so big deal? Jarek P. --
As I said, I didnt found other relevant uses, but I am probably wrong :)
About ppoe for example, two calls to dev_queue_xmit() are not relevant.
One is from
static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len)
This is a normal direct call, not a call from its ndo_start_xmit()
Second one is from
static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb)
Same observation, there is no impact for this one as well.
So we have to care on device drivers that have a ndo_start_xmit() call that could
re-enter dev_queue_xmit(). Not care about drivers that call dev_queue_xmit() from
other paths.
--
Isn't it called through ppp_generic's ndo_start_xmit()? Jarek P. --
Here is a second version, including change to vlan code as well.
Thank you
[PATCH] net: release dst entry in dev_hard_start_xmit()
One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls
dev_kree_skb() long after original call to dev_queue_xmit(skb).
CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.
It seems right place to release dst is in dev_hard_start_xmit(), for most
devices but ones that are virtual, and some exceptions.
David Miller suggested to define a new device flag, set in alloc_netdev_mq()
(so that most devices set it at init time), and carefuly unset in devices
which dont want a NULL skb->dst in their ndo_start_xmit().
List of devices that must clear this flag is :
- loopback device, because it calls netif_rx() and quoting Patrick :
"ip_route_input() doesn't accept loopback addresses, so loopback packets
already need to have a dst_entry attached."
- appletalk/ipddp.c : needs skb->dst in its xmit function
- And all devices that call again dev_queue_xmit() from their xmit function
(as some classifiers need skb->dst) : bonding, vlan, macvlan, eql, ifb, hdlc_fr
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
drivers/net/appletalk/ipddp.c | 1 +
drivers/net/bonding/bond_main.c | 1 +
drivers/net/eql.c | 1 +
drivers/net/ifb.c | 1 +
drivers/net/loopback.c | 1 +
drivers/net/macvlan.c | 1 +
drivers/net/wan/hdlc_fr.c | 1 +
include/linux/if.h | 3 +++
net/8021q/vlan_dev.c | 1 +
net/core/dev.c | 9 +++++++++
10 files changed, 20 insertions(+)
diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c
index da64ba8..f939e92 ...From: Eric Dumazet <dada1@cosmosbay.com> Applied, thanks Eric. Eric, please followup and double-check the pppoe paths that Jarek mentioned. I never saw that fully resolved. Thanks! --
[PATCH] ppp: unset IFF_XMIT_DST_RELEASE in ppp_setup() Jarek pointed pppoe can call back dev_queue_xmit(), and might need skb->dst, so its safer to unset IFF_XMIT_DST_RELEASE on ppp devices. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c index 8ee9142..639d11b 100644 --- a/drivers/net/ppp_generic.c +++ b/drivers/net/ppp_generic.c @@ -1054,6 +1054,7 @@ static void ppp_setup(struct net_device *dev) dev->type = ARPHRD_PPP; dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST; dev->features |= NETIF_F_NETNS_LOCAL; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } /* --
Hmm... Of course, this patch looks OK to me, but actually my main concern was more general. We avoid adding such flags for each "real" dev, but if so IMHO it would be safer to generally add them to all "virtual" devs - needed or not. You prefer to do this only where necessary, but it's not always clear if it's omitted on purpose or by chance. So, now I'm wondering about xen-netfront - needlessly I hope ;-) --
This is the deal in fact, tracking all valid uses, and I'll check this. Another path would have to set the flag only for fast devices (Gb and 10Gb) --
From: Jarek Poplawski <jarkao2@gmail.com> It is an issue that surely needs to be fleshed out, to make sure we use this where possible without breaking odd cases too easily. But for now I'm applying Eric's patch because it does fix something until we have these issues sorted more generally. --
