Re: High contention on the sk_buff_head.lock

Previous thread: [PATCH] netdev: handle setting transmit queue length on active device. by Stephen Hemminger on Wednesday, March 18, 2009 - 10:02 am. (9 messages)

Next thread: [PATCH -next] gianfar, ucc_geth: Use proper address translation for MDIO buses by Anton Vorontsov on Wednesday, March 18, 2009 - 12:21 pm. (4 messages)
From: Vernon Mauery
Date: Wednesday, March 18, 2009 - 10:24 am

I have been beating on network throughput in the -rt kernel for some time
now.  After digging down through the send path of UDP packets, I found
that the sk_buff_head.lock is under some very high contention.  This lock
is acquired each time a packet is enqueued on a qdisc and then acquired
again to dequeue the packet.  Under high networking loads, the enqueueing
processes are not only contending among each other for the lock, but also
with the net-tx soft irq.  This makes for some very high contention on this
one lock.  My testcase is running varying numbers of concurrent netperf
instances pushing UDP traffic to another machine.  As the count goes from
1 to 2, the network performance increases.  But from 2 to 4 and from 4 to 8,
we see a big decline, with 8 instances pushing about half of what a single
thread can do.

Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried
both NetXen and Broadcom, with very similar results), I can only push about
1200 Mb/s.  Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly
6000 Mb/s. But still not as much as I think is possible.  I was curious and
decided to see if the mainline kernel was hitting the same lock, and using
/proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the
number one contended lock).

So while this issue really hits -rt kernels hard, it has a real effect on
mainline kernels as well.  The contention of the spinlocks is amplified
when they get turned into rt-mutexes, which causes a double context switch.

Below is the top of the lock_stat for 2.6.29-rc8.  This was captured from
a 1 minute network stress test.  The next high contender had 2 orders of
magnitude fewer contentions.  Think of the throughput increase if we could
ease this contention a bit.  We might even be able to saturate a 10GbE
link.

lock_stat version ...
From: Eric Dumazet
Date: Wednesday, March 18, 2009 - 12:07 pm

Yes we have a known contention point here, but before adding more complex code,
could you try following patch please ?

[PATCH] net: Reorder fields of struct Qdisc

dev_queue_xmit() needs to dirty fields "state" and "q"

On x86_64 arch, they currently span two cache lines, involving more
cache line ping pongs than necessary.

Before patch :

offsetof(struct Qdisc, state)=0x38
offsetof(struct Qdisc, q)=0x48
offsetof(struct Qdisc, dev_queue)=0x60

After patch :

offsetof(struct Qdisc, dev_queue)=0x38
offsetof(struct Qdisc, state)=0x48
offsetof(struct Qdisc, q)=0x50


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f8c4742..e24feeb 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -51,10 +51,11 @@ struct Qdisc
 	u32			handle;
 	u32			parent;
 	atomic_t		refcnt;
-	unsigned long		state;
+	struct netdev_queue	*dev_queue;
+
 	struct sk_buff		*gso_skb;
+	unsigned long		state;
 	struct sk_buff_head	q;
-	struct netdev_queue	*dev_queue;
 	struct Qdisc		*next_sched;
 	struct list_head	list;
 


--

From: Eric Dumazet
Date: Monday, March 23, 2009 - 1:32 am

I dont understand, doesnt it defeat the ticket spinlock thing and fairness ?

Thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit.

trying or taking spinlock has same effect, since it force a cache line ping pong,
and this is the real problem.

--

From: Jarek Poplawski
Date: Friday, March 20, 2009 - 4:29 pm

Vernon Mauery wrote, On 03/18/2009 09:17 PM:

I think there would be interesting to check another idea around this
contention: not all contenders are equal here. One thread is doing
qdisc_run() and owning the transmit queue (even after releasing the TX
lock). So if it waits for the qdisc lock the NIC, if not multiqueue,
is idle. Probably some handicap like in the patch below could make
some difference in throughput; alas I didn't test it.

Jarek P.
---

 net/core/dev.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f112970..d5ad808 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1852,7 +1852,11 @@ gso:
 	if (q->enqueue) {
 		spinlock_t *root_lock = qdisc_lock(q);
 
-		spin_lock(root_lock);
+		while (!spin_trylock(root_lock)) {
+			do {
+				cpu_relax();
+			} while (spin_is_locked(root_lock));
+		}
 
 		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 			kfree_skb(skb);
--

From: Jarek Poplawski
Date: Monday, March 23, 2009 - 1:50 am

But this "busy cpu" can't take packets out of the queue when it's
waiting on the contended spinlock. Anyway, it's only for testing,
and I didn't say it has to be right.

Jarek P.
--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 7:13 am

Come on guys, if this lock is a problem. go out and buy a proper
NIC that supports multiequeue TX!

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Vernon Mauery
Date: Wednesday, March 18, 2009 - 1:17 pm

This patch does seem to reduce the number of contentions by about 10%.  That is
a good start (and a good catch on the cacheline bounces).  But, like I mentioned
above, this lock still has 2 orders of magnitude greater contention than the
next lock, so even a large decrease like 10% makes little difference in the
overall contention characteristics.

So we will have to do something more.  Whether it needs to be more complex or
not is still up in the air.  Batched enqueueing/dequeueing are just two options
and the former would be a *lot* less complex than the latter.

If anyone else has any ideas they have been holding back, now would be a great
time to get them out in the open.


--

From: Herbert Xu
Date: Thursday, April 2, 2009 - 7:15 am

Nevermind, I was reading old emails again :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Monday, March 23, 2009 - 1:37 am

From: Eric Dumazet <dada1@cosmosbay.com>

Right.

Remember, the way this is designed is that if there is a busy
cpu taking packets out of the queue and putting them into the
device then other cpus will simply add to the queue and immediately
return.  This effectively keeps the queue running there processing
all the new work that other cpus are adding to the qdisc.

Those other cpus make these decisions by looking at that
__QDISC_STATE_RUNNING bit, which the queue runner grabs before
it does any work.
--

From: Andi Kleen
Date: Wednesday, March 18, 2009 - 1:54 pm

The new adaptive spin heuristics that have been discussed some time
ago didn't help? Unsurprisingly making locks a lot more expensive

The real "fix" would be probably to use a multi queue capable NIC
and a NIC driver that sets up multiple queues for TX (normally they
only do for RX). Then cores or a set of cores (often the number
of cores is larger than the number of NIC queues) could avoid this
problem. Disadvantage: more memory use.

But then again I'm not sure it's  worth it if the problem only
happens in out of tree RT.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: David Miller
Date: Wednesday, March 18, 2009 - 2:03 pm

From: Andi Kleen <andi@firstfloor.org>

The list of problems that only show up with the RT kernel seems to be
constantly increasing, but honestly is very surprising to me.

I don't understand why we even need to be concerned about this stuff
upstream, to be honest.

Please reproduce this in the vanilla kernel, then get back to us.
--

From: Vernon Mauery
Date: Wednesday, March 18, 2009 - 2:10 pm

Huh? The numbers that I posted *were* from the vanilla kernel. I ran
the 2.6.29-rc8 kernel with lock_stat enabled.  The lock contention
happens on the same lock in both vanilla and -rt, it just happens
to be more pronounced in the -rt kernel because of the double context
switches that the sleeping spinlock/rt-mutexes introduce.

--Vernon
--

From: David Miller
Date: Wednesday, March 18, 2009 - 2:38 pm

From: Vernon Mauery <vernux@us.ibm.com>

And the double context switches are probably also why less
natural batching and locality are achieved in the RT kernel.

Isn't that true?
--

From: Vernon Mauery
Date: Wednesday, March 18, 2009 - 2:49 pm

Yes, the double context switches surely hurt the temporal and
spatial locality of the vanilla codepath, but it also induces a
longer penalty for blocking on a lock -- instead of a nanoseconds
or a few microseconds, the task gets delayed for tens of
microseconds.  So really, the -rt kernel has more to fix than
the vanilla kernel in this case, but any improvement in the lock
contention in the vanilla case would be magnified and would cause
dramatic improvements in the -rt kernel.

--Vernon
--

From: David Miller
Date: Wednesday, March 18, 2009 - 6:02 pm

From: Vernon Mauery <vernux@us.ibm.com>

Contention on a shared resource is not implicitly bad.

And with upstream spinlocks the cost is relatively low for a case like
this where a thread of control goes in and only holds the lock for
long enough to unlink a packet from the list and immediately the lock
is released.

The cost should be, cache line move from cpu-to-cpu, atomic lock,
linked list unlink, a store, and a memory barrier.  And that's
all it is upstream.

If the -rt kernel makes this 10 times more expensive, I really don't
see why that is an upstream concern at the current point in time.

That's the tradeoff, common situations where locks are held by
multiple threads with contention, but only for mere cycles, are
seemingly a lot more expensive in the -rt kernel.

I mean, for example, why doesn't the -rt kernel just spin for a little
while like a normal spinlock would instead of always entering that
expensive contended code path?  It would be a huge win here, but I
have yet to see discussion of changes in that area of the -rt kernel
to combat these effects.  Is it the networking that always has to
change for the sake of -rt? :-)
--

From: Gregory Haskins
Date: Wednesday, March 18, 2009 - 2:54 pm

Note that -rt doesnt typically context-switch under contention anymore
since we introduced adaptive-locks.  Also note that the contention
against the lock is still contention, regardless of whether you have -rt
or not.  Its just that the slow-path to handle the contended case for
-rt is more expensive than mainline.  However, once you have the
contention as stated, you have already lost.

We have observed the posters findings ourselves in both mainline and
-rt.  I.e. That lock doesnt scale very well once you have more than a
handful of cores.  It's certainly a great area to look at for improving
the overall stack, IMO, as I believe there is quite a bit of headroom
left to be recovered that is buried there.

Regards,
-Greg

From: David Miller
Date: Wednesday, March 18, 2009 - 6:03 pm

From: Gregory Haskins <ghaskins@novell.com>

First, contention is not implicitly a bad thing.

Second, if the -rt kernel is doing adaptive spinning I see no
reason why that adaptive spinning is not kicking in here to
make this problem just go away.

This lock is held for mere cycles, just to unlink an SKB from
the networking qdisc, and then it is immediately released.
--

From: Sven-Thorsten Dietrich
Date: Wednesday, March 18, 2009 - 6:13 pm

If only the first of N contending threads gets to spin, 2..N would

For very short hold times, and heavy contention, as well as for
scalability, the solution may lie in tunable spinner-count and adaptive
spinner time-out.


--

From: Gregory Haskins
Date: Wednesday, March 18, 2009 - 8:48 pm

However, when the contention in question is your top bottleneck, even
small improvements have the potential to yield large performance gains. :=
Well, "go away" is a relative term.  Before adaptive-locks, the box was
heavily context-switching under a network workload like we are
discussing, and that is where much of the performance was lost.=20
Adaptive-locks mitigate that particular problem so that spinlock_t
clients now more closely model mainline behavior (i.e. spin under
contention, at least when they can) and this brought -rt much closer to
what you expect for performance from mainline.

However, in -rt the contention path is, by necessity, still moderately
more heavy weight than a simple mainline spin (PI, etc), so any
contention will still hurt relatively more.   And obviously
adaptive-locks do nothing to address the contention itself.  Therefore,
as long as the contention remains it will have a higher impact in -rt.=20
But make no mistake: the contention can (and I believe does) affect both
trees.

To see this in action, try taking a moderately large smp system (8-way+)
and scaling the number of flows.  At 1 flow the stack is generally quite
capable of maintaining line-rate (at least up to GigE).  With two flows
this should result in each achieving roughly 50% of the line-rate, for a
sum total of line-rate.  But as you add flows to the equation, the
aggregate bandwidth typically starts to drop off to be something less
than line-rate (many times its actually much less).  And when this
happens, the contention in question is usually at the top of the charts
To clarify: I haven't been looking at the stack code since last summer,
so some things may have changed since then.  However, the issue back
then from my perspective was the general backpressure in the qdisc
subsystem (e.g. dev->queue_lock), not just the skb head.lock per se.=20
This dev->queue_lock can be held for quite a long time, and the
mechanism in general scales poorly as the fabric speeds and ...
From: David Miller
Date: Wednesday, March 18, 2009 - 10:38 pm

From: Gregory Haskins <ghaskins@novell.com>

I can maintain line-rate over 10GB with a 64-cpu box.  It's not
a problem.
--

From: Gregory Haskins
Date: Thursday, March 19, 2009 - 5:42 am

Oh man, I am jealous of that 64-way :)

How many simultaneous flows?  What hardware?  What qdisc and other
config do you use?  MTU?  I cannot replicate such results on 10GB even
with much smaller cpu counts.

On my test rig here, I have a 10GB link connected by crossover between
two 8-core boxes.  Running one unidirectional TCP flow is typically
toping out at ~5.5Gb/s on 2.6.29-rc8.  Granted we are using MTU=3D1500,
which in of itself is part of the upper limit.  However, that result in
of itself isn't a problem, per se.  What is a problem is the aggregate
bandwidth drops with scaling the number of flows.  I would like to
understand how to make this better, if possible, and perhaps I can learn

To clarify terms, we are not saying "the stack performs inadequately".=20
What we are saying here is that analysis of our workloads and of the
current stack indicates that we are io-bound, and that this particular
locking architecture in the qdisc subsystem is the apparent top gating
factor from going faster.  Therefore we are really saying "how can we
make it even better"?  This is not a bad question to ask in general,
would you agree?

To vet our findings, we built that prototype I mentioned in the last
mail where we substituted the single queue and queue_lock with a
per-cpu, lockless queue.  This meant each cpu could submit work
independent of the others with substantially reduced contention.  More
importantly, it eliminated the property of scaling the RFO pressure on a
single cache-lne for the queue-lock.  When we did this, we got
significant increases in aggregate throughput (somewhere on the order of
6%-25% depending on workload,  but this was last summer so I am a little
hazy on the exact numbers).

So you had said something to the effect of "Contention isn't implicitly
a bad thing".   I agree to a point.  At least so much as contention
cannot always be avoided.  Ultimately we only have one resource in this
equation:  the phy-link in question.  So naturally multiple ...
From: David Miller
Date: Thursday, March 19, 2009 - 1:52 pm

From: Gregory Haskins <ghaskins@novell.com>

Sun Neptune NIU 10G with 24 TX queues.

And it's all because of the number of TX queues, nothing more.

It is essential for good performance with any level of cpu
arity.

--

From: Peter W. Morreale
Date: Thursday, March 19, 2009 - 5:50 am

The basic 'problem' with comparing RT adaptive spinning to non-rt
spinlocks is that if the lock owner is !oncpu, all spinners must break
and go to sleep, otherwise we (potentially) deadlock.  This does not
exist for non-rt spinners. 

Best,

--

From: Evgeniy Polyakov
Date: Thursday, March 19, 2009 - 12:15 am

Something tells me that you observer skb head lock contention because
you stressed the network and anything else on that machine slacked in
sleep. What if you will start IO stress, will __queue_lock contention
have the same magnitude order? Or having as many processes as skbs each
of which will race for the scheduler? Will run-queue lock show up in the
stats?

I believe the asnwer is yes for all the questions.
You stressed one subsystem and it showed up in the statistics.

-- 
	Evgeniy Polyakov
--

From: Vernon Mauery
Date: Wednesday, March 18, 2009 - 2:07 pm

Yes.  Well, while the adaptive spinlocks did great things for the
network throughput last time I tested them, they also didn't quite
give the determinism in other areas.  It would be nice to be able to
target a handful of trouble locks with adaptive spinlocks.

Even so, though I saw dramatic throughput increases with adaptive
spinlocks, they would still be bound by this same lock contention

Hmmm.  So do either the netxen_nic or bnx2x drivers support multiple
queues?  (that is the HW that I have access to right now).  And do I

The effects of the high contention are not quite so pronounced in the
vanilla kernel, but I think we are still limited by this lock.  In the
-rt kernel, it is obvious that the lock contention is causing lots of
trouble.

--Vernon
--

From: Eilon Greenstein
Date: Wednesday, March 18, 2009 - 2:45 pm

The version of bnx2x in net-next support multi Tx queues (and Rx). It
will open an equal number of Tx and Rx queues up to 16 or the number of
cores in the system. You can validate that all queues are transmitting
with "ethtool -S" which has per queue statistics in that version.

I hope it will help,
Eilon


--

From: Vernon Mauery
Date: Wednesday, March 18, 2009 - 2:51 pm

Thanks.  I will test to see how this affects this lock contention the
next time the broadcom hardware is available.

--Vernon
--

From: Andi Kleen
Date: Wednesday, March 18, 2009 - 2:59 pm

The other strategy to reduce lock contention here is to use TSO/GSO/USO.
With that the lock has to be taken less often because there are less packets
travelling down the stack. I'm not sure how well that works with netperf style 
workloads though. Using multiple TX queues is probably better though if you have
capable hardware. Or ideally both.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Rick Jones
Date: Wednesday, March 18, 2009 - 3:19 pm

All depends on what the user provides with the test-specific -m option for how 
much data they shove into the socket each time "send" is called, and I suppose if 
they use a test-specific -D option to set TCP_NODELAY in the case of a TCP test 
when they have small values of -m.  Eg

netperf -t TCP_STREAM ... -- -m 64K
vs
netperf -t TCP_STREAM ... -- -m 1024
vs
netperf -t TCP_STREAM ... -- -m 1024 -D
vs
netperf -t UDP_STREAM ... -- -m 1024

etc etc.

If the netperf test is:

netperf -t TCP_RR ... -- -r 1   (single-byte request/response)

then TSO/GSO/USO won't matter at all, and probably still wont matter even if the 
user has ./configure'd netperf with --enable-burst and does:

netperf -t TCP_RR ... -- -r 1 -b 64
or
netperf -t TCP_RR ... -- -r 1 -b 64 -D

which was basically what I was doing for the 32-core scaling stuff I posted about 
a few weeks ago.  That was running on multi-queue NICs, so looking at some of the 
profiles of the "no iptables" data may help show how big/small the problem is, 
keeping in mind that my runs (either the XFrame II runs, or the Chelsio T3C runs 
before them) had one queue per core in the system...and as such may be a best 
case scenario as far as lock contention on a per-queue basis goes.

ftp://ftp.netperf.org/

happy benchmarking,

rick jones

BTW, that setup went "poof" and had to go to other nefarious porpoises.  I'm not 
sure when I can recreate it, but I still have both the XFrame and T3C NICs when 
the HW comes free again.
--

From: Peter W. Morreale
Date: Thursday, March 19, 2009 - 5:59 am

Heh.  That darn bastard child again, eh?  :-)


Recall that adaptive spin,developed in RT, is now in mainline. 

Best,
-PWM


--

From: Peter W. Morreale
Date: Thursday, March 19, 2009 - 6:36 am

Err..  correction... *about* to be in mainline.   Only point I'm making
is that RT does implicitly, if not explicitly, contributes to mainline.
Its not a one-way street. 

Best,

--

From: Andi Kleen
Date: Thursday, March 19, 2009 - 6:46 am

> Recall that adaptive spin,developed in RT, is now in mainline. 

For mutexes, but not for spinlocks. And qdisc uses spinlocks.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

Previous thread: [PATCH] netdev: handle setting transmit queue length on active device. by Stephen Hemminger on Wednesday, March 18, 2009 - 10:02 am. (9 messages)

Next thread: [PATCH -next] gianfar, ucc_geth: Use proper address translation for MDIO buses by Anton Vorontsov on Wednesday, March 18, 2009 - 12:21 pm. (4 messages)