I have been beating on network throughput in the -rt kernel for some time now. After digging down through the send path of UDP packets, I found that the sk_buff_head.lock is under some very high contention. This lock is acquired each time a packet is enqueued on a qdisc and then acquired again to dequeue the packet. Under high networking loads, the enqueueing processes are not only contending among each other for the lock, but also with the net-tx soft irq. This makes for some very high contention on this one lock. My testcase is running varying numbers of concurrent netperf instances pushing UDP traffic to another machine. As the count goes from 1 to 2, the network performance increases. But from 2 to 4 and from 4 to 8, we see a big decline, with 8 instances pushing about half of what a single thread can do. Running 2.6.29-rc6-rt3 on an 8-way machine with a 10GbE card (I have tried both NetXen and Broadcom, with very similar results), I can only push about 1200 Mb/s. Whereas with the mainline 2.6.29-rc8 kernel, I can push nearly 6000 Mb/s. But still not as much as I think is possible. I was curious and decided to see if the mainline kernel was hitting the same lock, and using /proc/lock_stat, it is hitting the sk_buff_head.lock as well (it was the number one contended lock). So while this issue really hits -rt kernels hard, it has a real effect on mainline kernels as well. The contention of the spinlocks is amplified when they get turned into rt-mutexes, which causes a double context switch. Below is the top of the lock_stat for 2.6.29-rc8. This was captured from a 1 minute network stress test. The next high contender had 2 orders of magnitude fewer contentions. Think of the throughput increase if we could ease this contention a bit. We might even be able to saturate a 10GbE link. lock_stat version ...
Yes we have a known contention point here, but before adding more complex code, could you try following patch please ? [PATCH] net: Reorder fields of struct Qdisc dev_queue_xmit() needs to dirty fields "state" and "q" On x86_64 arch, they currently span two cache lines, involving more cache line ping pongs than necessary. Before patch : offsetof(struct Qdisc, state)=0x38 offsetof(struct Qdisc, q)=0x48 offsetof(struct Qdisc, dev_queue)=0x60 After patch : offsetof(struct Qdisc, dev_queue)=0x38 offsetof(struct Qdisc, state)=0x48 offsetof(struct Qdisc, q)=0x50 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index f8c4742..e24feeb 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -51,10 +51,11 @@ struct Qdisc u32 handle; u32 parent; atomic_t refcnt; - unsigned long state; + struct netdev_queue *dev_queue; + struct sk_buff *gso_skb; + unsigned long state; struct sk_buff_head q; - struct netdev_queue *dev_queue; struct Qdisc *next_sched; struct list_head list; --
I dont understand, doesnt it defeat the ticket spinlock thing and fairness ? Thread doing __qdisc_run() already owns the __QDISC_STATE_RUNNING bit. trying or taking spinlock has same effect, since it force a cache line ping pong, and this is the real problem. --
Vernon Mauery wrote, On 03/18/2009 09:17 PM:
I think there would be interesting to check another idea around this
contention: not all contenders are equal here. One thread is doing
qdisc_run() and owning the transmit queue (even after releasing the TX
lock). So if it waits for the qdisc lock the NIC, if not multiqueue,
is idle. Probably some handicap like in the patch below could make
some difference in throughput; alas I didn't test it.
Jarek P.
---
net/core/dev.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index f112970..d5ad808 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1852,7 +1852,11 @@ gso:
if (q->enqueue) {
spinlock_t *root_lock = qdisc_lock(q);
- spin_lock(root_lock);
+ while (!spin_trylock(root_lock)) {
+ do {
+ cpu_relax();
+ } while (spin_is_locked(root_lock));
+ }
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
kfree_skb(skb);
--
But this "busy cpu" can't take packets out of the queue when it's waiting on the contended spinlock. Anyway, it's only for testing, and I didn't say it has to be right. Jarek P. --
Come on guys, if this lock is a problem. go out and buy a proper NIC that supports multiequeue TX! Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
This patch does seem to reduce the number of contentions by about 10%. That is a good start (and a good catch on the cacheline bounces). But, like I mentioned above, this lock still has 2 orders of magnitude greater contention than the next lock, so even a large decrease like 10% makes little difference in the overall contention characteristics. So we will have to do something more. Whether it needs to be more complex or not is still up in the air. Batched enqueueing/dequeueing are just two options and the former would be a *lot* less complex than the latter. If anyone else has any ideas they have been holding back, now would be a great time to get them out in the open. --
Nevermind, I was reading old emails again :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Eric Dumazet <dada1@cosmosbay.com> Right. Remember, the way this is designed is that if there is a busy cpu taking packets out of the queue and putting them into the device then other cpus will simply add to the queue and immediately return. This effectively keeps the queue running there processing all the new work that other cpus are adding to the qdisc. Those other cpus make these decisions by looking at that __QDISC_STATE_RUNNING bit, which the queue runner grabs before it does any work. --
The new adaptive spin heuristics that have been discussed some time ago didn't help? Unsurprisingly making locks a lot more expensive The real "fix" would be probably to use a multi queue capable NIC and a NIC driver that sets up multiple queues for TX (normally they only do for RX). Then cores or a set of cores (often the number of cores is larger than the number of NIC queues) could avoid this problem. Disadvantage: more memory use. But then again I'm not sure it's worth it if the problem only happens in out of tree RT. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
From: Andi Kleen <andi@firstfloor.org> The list of problems that only show up with the RT kernel seems to be constantly increasing, but honestly is very surprising to me. I don't understand why we even need to be concerned about this stuff upstream, to be honest. Please reproduce this in the vanilla kernel, then get back to us. --
Huh? The numbers that I posted *were* from the vanilla kernel. I ran the 2.6.29-rc8 kernel with lock_stat enabled. The lock contention happens on the same lock in both vanilla and -rt, it just happens to be more pronounced in the -rt kernel because of the double context switches that the sleeping spinlock/rt-mutexes introduce. --Vernon --
From: Vernon Mauery <vernux@us.ibm.com> And the double context switches are probably also why less natural batching and locality are achieved in the RT kernel. Isn't that true? --
Yes, the double context switches surely hurt the temporal and spatial locality of the vanilla codepath, but it also induces a longer penalty for blocking on a lock -- instead of a nanoseconds or a few microseconds, the task gets delayed for tens of microseconds. So really, the -rt kernel has more to fix than the vanilla kernel in this case, but any improvement in the lock contention in the vanilla case would be magnified and would cause dramatic improvements in the -rt kernel. --Vernon --
From: Vernon Mauery <vernux@us.ibm.com> Contention on a shared resource is not implicitly bad. And with upstream spinlocks the cost is relatively low for a case like this where a thread of control goes in and only holds the lock for long enough to unlink a packet from the list and immediately the lock is released. The cost should be, cache line move from cpu-to-cpu, atomic lock, linked list unlink, a store, and a memory barrier. And that's all it is upstream. If the -rt kernel makes this 10 times more expensive, I really don't see why that is an upstream concern at the current point in time. That's the tradeoff, common situations where locks are held by multiple threads with contention, but only for mere cycles, are seemingly a lot more expensive in the -rt kernel. I mean, for example, why doesn't the -rt kernel just spin for a little while like a normal spinlock would instead of always entering that expensive contended code path? It would be a huge win here, but I have yet to see discussion of changes in that area of the -rt kernel to combat these effects. Is it the networking that always has to change for the sake of -rt? :-) --
Note that -rt doesnt typically context-switch under contention anymore since we introduced adaptive-locks. Also note that the contention against the lock is still contention, regardless of whether you have -rt or not. Its just that the slow-path to handle the contended case for -rt is more expensive than mainline. However, once you have the contention as stated, you have already lost. We have observed the posters findings ourselves in both mainline and -rt. I.e. That lock doesnt scale very well once you have more than a handful of cores. It's certainly a great area to look at for improving the overall stack, IMO, as I believe there is quite a bit of headroom left to be recovered that is buried there. Regards, -Greg
From: Gregory Haskins <ghaskins@novell.com> First, contention is not implicitly a bad thing. Second, if the -rt kernel is doing adaptive spinning I see no reason why that adaptive spinning is not kicking in here to make this problem just go away. This lock is held for mere cycles, just to unlink an SKB from the networking qdisc, and then it is immediately released. --
If only the first of N contending threads gets to spin, 2..N would For very short hold times, and heavy contention, as well as for scalability, the solution may lie in tunable spinner-count and adaptive spinner time-out. --
However, when the contention in question is your top bottleneck, even small improvements have the potential to yield large performance gains. := Well, "go away" is a relative term. Before adaptive-locks, the box was heavily context-switching under a network workload like we are discussing, and that is where much of the performance was lost.=20 Adaptive-locks mitigate that particular problem so that spinlock_t clients now more closely model mainline behavior (i.e. spin under contention, at least when they can) and this brought -rt much closer to what you expect for performance from mainline. However, in -rt the contention path is, by necessity, still moderately more heavy weight than a simple mainline spin (PI, etc), so any contention will still hurt relatively more. And obviously adaptive-locks do nothing to address the contention itself. Therefore, as long as the contention remains it will have a higher impact in -rt.=20 But make no mistake: the contention can (and I believe does) affect both trees. To see this in action, try taking a moderately large smp system (8-way+) and scaling the number of flows. At 1 flow the stack is generally quite capable of maintaining line-rate (at least up to GigE). With two flows this should result in each achieving roughly 50% of the line-rate, for a sum total of line-rate. But as you add flows to the equation, the aggregate bandwidth typically starts to drop off to be something less than line-rate (many times its actually much less). And when this happens, the contention in question is usually at the top of the charts To clarify: I haven't been looking at the stack code since last summer, so some things may have changed since then. However, the issue back then from my perspective was the general backpressure in the qdisc subsystem (e.g. dev->queue_lock), not just the skb head.lock per se.=20 This dev->queue_lock can be held for quite a long time, and the mechanism in general scales poorly as the fabric speeds and ...
From: Gregory Haskins <ghaskins@novell.com> I can maintain line-rate over 10GB with a 64-cpu box. It's not a problem. --
Oh man, I am jealous of that 64-way :) How many simultaneous flows? What hardware? What qdisc and other config do you use? MTU? I cannot replicate such results on 10GB even with much smaller cpu counts. On my test rig here, I have a 10GB link connected by crossover between two 8-core boxes. Running one unidirectional TCP flow is typically toping out at ~5.5Gb/s on 2.6.29-rc8. Granted we are using MTU=3D1500, which in of itself is part of the upper limit. However, that result in of itself isn't a problem, per se. What is a problem is the aggregate bandwidth drops with scaling the number of flows. I would like to understand how to make this better, if possible, and perhaps I can learn To clarify terms, we are not saying "the stack performs inadequately".=20 What we are saying here is that analysis of our workloads and of the current stack indicates that we are io-bound, and that this particular locking architecture in the qdisc subsystem is the apparent top gating factor from going faster. Therefore we are really saying "how can we make it even better"? This is not a bad question to ask in general, would you agree? To vet our findings, we built that prototype I mentioned in the last mail where we substituted the single queue and queue_lock with a per-cpu, lockless queue. This meant each cpu could submit work independent of the others with substantially reduced contention. More importantly, it eliminated the property of scaling the RFO pressure on a single cache-lne for the queue-lock. When we did this, we got significant increases in aggregate throughput (somewhere on the order of 6%-25% depending on workload, but this was last summer so I am a little hazy on the exact numbers). So you had said something to the effect of "Contention isn't implicitly a bad thing". I agree to a point. At least so much as contention cannot always be avoided. Ultimately we only have one resource in this equation: the phy-link in question. So naturally multiple ...
From: Gregory Haskins <ghaskins@novell.com> Sun Neptune NIU 10G with 24 TX queues. And it's all because of the number of TX queues, nothing more. It is essential for good performance with any level of cpu arity. --
The basic 'problem' with comparing RT adaptive spinning to non-rt spinlocks is that if the lock owner is !oncpu, all spinners must break and go to sleep, otherwise we (potentially) deadlock. This does not exist for non-rt spinners. Best, --
Something tells me that you observer skb head lock contention because you stressed the network and anything else on that machine slacked in sleep. What if you will start IO stress, will __queue_lock contention have the same magnitude order? Or having as many processes as skbs each of which will race for the scheduler? Will run-queue lock show up in the stats? I believe the asnwer is yes for all the questions. You stressed one subsystem and it showed up in the statistics. -- Evgeniy Polyakov --
Yes. Well, while the adaptive spinlocks did great things for the network throughput last time I tested them, they also didn't quite give the determinism in other areas. It would be nice to be able to target a handful of trouble locks with adaptive spinlocks. Even so, though I saw dramatic throughput increases with adaptive spinlocks, they would still be bound by this same lock contention Hmmm. So do either the netxen_nic or bnx2x drivers support multiple queues? (that is the HW that I have access to right now). And do I The effects of the high contention are not quite so pronounced in the vanilla kernel, but I think we are still limited by this lock. In the -rt kernel, it is obvious that the lock contention is causing lots of trouble. --Vernon --
The version of bnx2x in net-next support multi Tx queues (and Rx). It will open an equal number of Tx and Rx queues up to 16 or the number of cores in the system. You can validate that all queues are transmitting with "ethtool -S" which has per queue statistics in that version. I hope it will help, Eilon --
Thanks. I will test to see how this affects this lock contention the next time the broadcom hardware is available. --Vernon --
The other strategy to reduce lock contention here is to use TSO/GSO/USO. With that the lock has to be taken less often because there are less packets travelling down the stack. I'm not sure how well that works with netperf style workloads though. Using multiple TX queues is probably better though if you have capable hardware. Or ideally both. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
All depends on what the user provides with the test-specific -m option for how much data they shove into the socket each time "send" is called, and I suppose if they use a test-specific -D option to set TCP_NODELAY in the case of a TCP test when they have small values of -m. Eg netperf -t TCP_STREAM ... -- -m 64K vs netperf -t TCP_STREAM ... -- -m 1024 vs netperf -t TCP_STREAM ... -- -m 1024 -D vs netperf -t UDP_STREAM ... -- -m 1024 etc etc. If the netperf test is: netperf -t TCP_RR ... -- -r 1 (single-byte request/response) then TSO/GSO/USO won't matter at all, and probably still wont matter even if the user has ./configure'd netperf with --enable-burst and does: netperf -t TCP_RR ... -- -r 1 -b 64 or netperf -t TCP_RR ... -- -r 1 -b 64 -D which was basically what I was doing for the 32-core scaling stuff I posted about a few weeks ago. That was running on multi-queue NICs, so looking at some of the profiles of the "no iptables" data may help show how big/small the problem is, keeping in mind that my runs (either the XFrame II runs, or the Chelsio T3C runs before them) had one queue per core in the system...and as such may be a best case scenario as far as lock contention on a per-queue basis goes. ftp://ftp.netperf.org/ happy benchmarking, rick jones BTW, that setup went "poof" and had to go to other nefarious porpoises. I'm not sure when I can recreate it, but I still have both the XFrame and T3C NICs when the HW comes free again. --
Heh. That darn bastard child again, eh? :-) Recall that adaptive spin,developed in RT, is now in mainline. Best, -PWM --
Err.. correction... *about* to be in mainline. Only point I'm making is that RT does implicitly, if not explicitly, contributes to mainline. Its not a one-way street. Best, --
> Recall that adaptive spin,developed in RT, is now in mainline. For mutexes, but not for spinlocks. And qdisc uses spinlocks. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
