Jens Axboe has written some hacks for the block layer that allow queueing softirq work to remote cpus. In the context of the block layer he used this facility to trigger the softirq block I/O completion on the same cpu where the I/O was submitted.

I want to make use of a similar facility for the networking, so we should make this thing generic.

It depends upon the generic SMP call function infrastructure, which Jens wrote specifically to do these remote softirq hacks.

For each softirq there is a per-cpu list head which is where the work is queued up. If the platform doesn't support the generic SMP call function bits, the work is queued onto the local cpu.

The first patch adds a NR_SOFTIRQS so that we can size these arrays by the actual number of softirqs instead of the magic number "32" which is what is used now.

The second patch adds the infrastructure and provides interfaces to invoke softirqs on remote cpus.

Jens, as stated, has block layer uses for this. I intend to use this for receive side flow separation on non-multiqueue network cards. And Steffen Klassert has a set of IPSEC parallelization changes that can very likely make use of this.

These patches are against current 2.6.27-rcX.

I would suggest that if nobody has any problems with this, we put it into a GIT tree on kernel.org and any subsystem that wants to use it can just pull that tree into their GIT tree. This way, it doesn't matter which tree Linus pulls in first, he'll get this stuff properly regardless of ordering. --
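The queueing scheme described above (one work list per cpu per softirq, with a fallback to the local cpu) can be modelled in plain user-space C. This is only a sketch for illustration, not the patch itself: every name here (`MODEL_NR_CPUS`, `model_send_remote_softirq()`, `model_have_remote_call`, and so on) is hypothetical, and the real code relies on interrupt disabling and the SMP call function machinery, which this model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS     4   /* hypothetical cpu count for this model */
#define MODEL_NR_SOFTIRQS 10  /* stands in for NR_SOFTIRQS */

struct work_item {
    struct work_item *next;
    int payload;
};

/* One FIFO per (cpu, softirq) pair, like the per-cpu list heads. */
struct work_list {
    struct work_item *head;
    struct work_item *tail;
};

static struct work_list softirq_work[MODEL_NR_CPUS][MODEL_NR_SOFTIRQS];

/* Models "the platform supports the generic SMP call function bits". */
static bool model_have_remote_call = true;

static void list_append(struct work_list *l, struct work_item *w)
{
    w->next = NULL;
    if (l->tail)
        l->tail->next = w;
    else
        l->head = w;
    l->tail = w;
}

/*
 * Queue work for softirq 'nr' on 'target_cpu'. If remote queueing is not
 * available (or the target is invalid), the work lands on 'this_cpu'.
 * Returns the cpu whose list actually received the work.
 */
static int model_send_remote_softirq(struct work_item *w, int this_cpu,
                                     int target_cpu, int nr)
{
    int cpu = this_cpu;

    assert(nr >= 0 && nr < MODEL_NR_SOFTIRQS);
    assert(this_cpu >= 0 && this_cpu < MODEL_NR_CPUS);

    if (model_have_remote_call &&
        target_cpu >= 0 && target_cpu < MODEL_NR_CPUS)
        cpu = target_cpu;

    list_append(&softirq_work[cpu][nr], w);
    return cpu;
}

/* Softirq handler side: each cpu only dequeues from its own list. */
static struct work_item *model_dequeue(int cpu, int nr)
{
    struct work_list *l = &softirq_work[cpu][nr];
    struct work_item *w = l->head;

    if (w) {
        l->head = w->next;
        if (!l->head)
            l->tail = NULL;
    }
    return w;
}
```

In this model, a submitter on cpu 0 that targets cpu 3 sees the work appear on cpu 3's list; with `model_have_remote_call` cleared, the same call queues onto cpu 0 instead.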
What's the benefit that you (or Jens) sees from migrating softirqs from specific cpu's to others? Daniel --
it means you do all the processing on the CPU that submitted the IO in the first place, and likely still has the various metadata pieces in its CPU cache (or at least you know you won't need to bounce them over) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
In the case of networking and block I would think a lot of the softirq activity is asserted from userspace.. Maybe the scheduler shouldn't be migrating these tasks, or could take this softirq activity into account .. Daniel --
On Sat, 20 Sep 2008 09:02:09 -0700 well a lot of it comes from completion interrupts. and moving userspace isn't a good option; think of the case of 1 nic but 4 apache processes doing the work... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
One nic, so one interrupt? I guess we're talking about an SMP machine? It seems case dependent .. If you send a lot, or receive a lot.. BUT it's all speculation on my part.. Dave didn't supply the users of his code, or what kind of improvement was seen, or the case in which it would be needed. I think Dave knows his subsystem, but the code on the surface looks like an end run around some other problem area.. Daniel --
On Sat, 20 Sep 2008 10:40:04 -0700

completions trigger the next send as well (for both block and net), so for multicore it's very fundamental, and has been talked about at various conferences as well.

the basic problem is that the submitter of the IO (be it block or net) creates a ton of metadata state on submit, and ideally the completion processing happens on the same CPU, for two reasons:
1) to use the state in the cache
2) for the case where you touch userland data/structures, we assume the scheduler kept affinity

it's a Moses-to-the-Mountain problem, except we have four Moses' but only one Mountain. Or in CS terms: we move the work to the CPU where the userland is rather than moving the userland to the IRQ CPU, since there is usually only one IRQ but many userlands and many cpu cores.

(for the UP case this is all very irrelevant obviously)

I assume Dave will pipe in if he disagrees with me ;-)

-- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
There must be some kind of trade off here .. There's a fairly good performance gain from having the softirq asserted and run on the same cpu since it runs in interrupt context right after the interrupt. If you move the softirq to another cpu then you have to re-assert and either wait for ksoftirqd to handle it or wait for an interrupt on the new cpu .. Neither is very predictable.. All that vs. bouncing data around the caches.. To what degree has all that been handled or thought about? Daniel --
From: Daniel Walker <dwalker@mvista.com> Give concrete things to discuss or just be quiet. --
From: Daniel Walker <dwalker@mvista.com> I posted an example use case on netdev a few days ago, and the block layer example is in Jens's block layer tree. It's for networking cards that don't do flow separation on receive using multiple RX queues and MSI-X interrupts. It's also for things like IPSEC where the per-packet cpu usage is so huge (to do the crypto) that it makes sense to even split up the work to multiple cpus within the same flow. --
Unfortunately doing this with IPsec is going to be non-trivial since we still want to maintain packet ordering inside IPsec and you don't get the inner flow information until you decrypt the packet. So if we want to process IPsec packets in parallel it's best to implement that from within the crypto API where we can queue the result in order to ensure proper ordering. Of course, we need to balance any effort spent on this with the likelihood that hardware improvements will soon make this obsolete (for IPsec anyway). Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
From: Herbert Xu <herbert@gondor.apana.org.au> That's another option, of course. And crypto could use remote True, but old hardware will always exist. A lot of very reasonable machines out there will benefit from software RX flow separation. --
...Also, producing buggy hardware will not suddenly just vanish either ("can you please turn off ipsec offloading and see if you can still reproduce" :-))... -- i. --
That's fine. If your AES hardware is buggy you just fall back to using the software version. In fact if this was done through the crypto API it would even happen automatically. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
Why do you have to preserve packet ordering? TCP/IP does not preserve packet ordering across the network. IPSEC uses a sliding window for anti-replay detection precisely because it has to be able to handle out-of-order packets. Sharing the sliding window between CPUs might be interesting! James --
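For readers unfamiliar with the sliding window mentioned above, here is a minimal single-threaded sketch of a bitmap-based anti-replay check. It is an illustration under stated assumptions, not the kernel's xfrm code: the names `replay_state` and `replay_check_and_update()` are hypothetical, the window is fixed at 32 packets, and sequence number wraparound is ignored. Note that it carries shared mutable state (`top` and `bitmap`), which is exactly what makes sharing it between cpus awkward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLAY_WINDOW 32u  /* window size in packets (assumed for this sketch) */

struct replay_state {
    uint32_t top;     /* highest sequence number accepted so far */
    uint32_t bitmap;  /* bit i set => sequence number (top - i) was seen */
};

/*
 * Returns true and records 'seq' if it is new and inside (or ahead of)
 * the window; returns false for replays and for packets that are too old.
 * Out-of-order arrival within the window is accepted.
 */
static bool replay_check_and_update(struct replay_state *s, uint32_t seq)
{
    assert(s != NULL);

    if (seq == 0)
        return false;              /* sequence numbers start at 1 */

    if (seq > s->top) {            /* advances the window */
        uint32_t shift = seq - s->top;

        s->bitmap = (shift < REPLAY_WINDOW) ? (s->bitmap << shift) | 1u : 1u;
        s->top = seq;
        return true;
    }

    uint32_t offset = s->top - seq;

    if (offset >= REPLAY_WINDOW)
        return false;              /* fell off the left edge of the window */

    uint32_t bit = 1u << offset;

    if (s->bitmap & bit)
        return false;              /* already seen: replay */

    s->bitmap |= bit;              /* late but new: accept */
    return true;
}
```

Starting from a zeroed state, packets 1, 3, 2 are all accepted in that arrival order, while a second copy of packet 2 is rejected.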
From: James Courtier-Dutton <James@superbug.co.uk> Yes, but we should preserve per-flow ordering as much as possible within the local system for optimal performance. Things fall apart completely, even with TCP, once you reorder more Again, Steffen's patches take care of this issue. --
It's non-trivial but possible. I have a test implementation that runs the whole IP layer in parallel. The basic idea to keep track of the packet ordering is to give the packets sequence numbers before we run in parallel. Before we push the packets to the upper layers or to the neighboring subsystem I have a mechanism that brings them back to the right order. With my test environment (two quad core boxes) I get with IPSEC aes192-sha1 and one tcp stream a throughput of about 600 Mbit/s compared to about 200 Mbit/s without the parallel processing. --
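The ordering idea described above (number the packets before the parallel stage, then release them in sequence afterwards) can be sketched as a small reorder ring. This is not Steffen's implementation, which is not shown in this thread; it is a single-threaded model with hypothetical names (`reorder`, `reorder_assign()`, `reorder_complete()`, `reorder_release()`), and a real version would need locking or per-cpu queues around the shared state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define REORDER_SLOTS 8u  /* maximum packets in flight (assumed value) */

struct reorder {
    uint32_t next_seq;            /* next sequence number to hand out */
    uint32_t next_release;        /* next sequence number allowed to leave */
    bool     done[REORDER_SLOTS]; /* slot holds a finished packet */
    int      pkt[REORDER_SLOTS];  /* stand-in for a packet pointer */
};

/* Serial stage: tag a packet before it goes to a worker cpu. */
static bool reorder_assign(struct reorder *r, uint32_t *seq)
{
    if (r->next_seq - r->next_release >= REORDER_SLOTS)
        return false;             /* window full, caller must back off */
    *seq = r->next_seq++;
    return true;
}

/* Called when a worker finishes packet 'seq'; workers may finish in any order. */
static void reorder_complete(struct reorder *r, uint32_t seq, int pkt)
{
    assert(seq - r->next_release < REORDER_SLOTS);
    r->pkt[seq % REORDER_SLOTS] = pkt;
    r->done[seq % REORDER_SLOTS] = true;
}

/* Hand finished packets to the next layer, strictly in sequence order. */
static unsigned reorder_release(struct reorder *r, int *out, unsigned max)
{
    unsigned n = 0;

    while (n < max && r->next_release != r->next_seq) {
        unsigned slot = r->next_release % REORDER_SLOTS;

        if (!r->done[slot])
            break;                /* head of line not finished yet */
        out[n++] = r->pkt[slot];
        r->done[slot] = false;
        r->next_release++;
    }
    return n;
}
```

If three packets are tagged and the workers finish the third and second first, `reorder_release()` returns nothing until the first one completes, and then releases all three in their original order.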
Yes this would definitely help IPsec. However, I'm not so sure of its benefit to routing and other parts of networking. That's why I'd rather have this sort of hack stay in the crypto system where it's isolated rather than having it proliferate throughout the network stack. When the time comes to weed out this because all CPUs that matter have encryption in hardware then it'll be much easier to delete a crypto algorithm as opposed to removing parts of the network infrastructure :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
The crypto benefits the most of course, but routing and xfrm lookups could benefit on bigger networks too. However, the method to bring the packets back to order is quite generic and could be used even in the crypto system. The important thing for me is that we Yes, if you think about how to remove it I agree here. --
From: Daniel Walker <dwalker@mvista.com> Absolutely wrong. On a per-flow basis you want to push the work down as far as possible down to individual cpus. Why do you think the hardware folks are devoting silicon to RX multiqueue facilities that spread the RX work amongst available cpus using MSI-X? --
I'm not sure this belongs in this particular thread but I was interested in how you're planning on doing this? Is there going to be a way for userspace to specify which traffic flows they'd like to direct to particular cpus, or will the kernel try to figure it out on the fly? We have application guys that would like very much to be able to nail specific apps to specific cores and have the kernel send all their packets to those cores for processing. Chris --
From: "Chris Friesen" <cfriesen@nortel.com>
Something like this patch which I posted last week on
netdev.
net: Do software flow separation on receive.
Push netif_receive_skb() work to remote cpus via flow
hashing and remote softirqs.
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/linux/interrupt.h | 1 +
include/linux/netdevice.h | 2 -
include/linux/skbuff.h | 3 +
net/core/dev.c | 273 +++++++++++++++++++++++++--------------------
4 files changed, 157 insertions(+), 122 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 806b38f..223e68f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,6 +247,7 @@ enum
TIMER_SOFTIRQ,
NET_TX_SOFTIRQ,
NET_RX_SOFTIRQ,
+ NET_RECEIVE_SOFTIRQ,
BLOCK_SOFTIRQ,
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 488c56e..a044caa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -965,11 +965,9 @@ static inline int unregister_gifconf(unsigned int family)
struct softnet_data
{
struct Qdisc *output_queue;
- struct sk_buff_head input_pkt_queue;
struct list_head poll_list;
struct sk_buff *completion_queue;
- struct napi_struct backlog;
#ifdef CONFIG_NET_DMA
struct dma_chan *net_dma;
#endif
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9099237..e36bc86 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -18,6 +18,7 @@
#include <linux/compiler.h>
#include <linux/time.h>
#include <linux/cache.h>
+#include <linux/smp.h>
#include <asm/atomic.h>
#include <asm/types.h>
@@ -255,6 +256,8 @@ struct sk_buff {
struct sk_buff *next;
struct sk_buff *prev;
+ struct call_single_data csd;
+
struct sock *sk;
ktime_t tstamp;
struct net_device *dev;
diff --git a/net/core/dev.c b/net/core/dev.c
index e719ed2..09827c7 100644
--- a/net/core/dev.c
+++ ...

That patch basically just picks an arbitrary cpu for each flow. This would spread the load out across cpus, but it doesn't allow any input from userspace.

We have a current application where there are 16 cores and 16 threads. They would really like to be able to pin one thread to each core and tell the kernel what packets they're interested in so that the kernel can process those packets on that core to gain the maximum caching benefit as well as reduce reordering issues. In our case the hardware supports filtering for multiqueues, so we could pass this information down to the hardware to avoid software filtering.

Either way, it requires some way for userspace to indicate interest in a particular flow. Has anyone given any thought to what an API like this would look like?

I suppose we could automatically look at bound network sockets owned by tasks that are affined to single cpus. This would simplify userspace but would reduce flexibility for things like packet sockets with socket filters applied.

Chris --
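To make "picks an arbitrary cpu for each flow" concrete, here is a hedged sketch of hash-based cpu selection. The net/core/dev.c part of the patch is truncated above, so this does not reproduce its hash; it uses FNV-1a purely as a stand-in, and `flow_key`, `flow_hash()` and `flow_to_cpu()` are hypothetical names. The point is only that every packet of a flow maps to the same cpu, with no input from userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow identifier: IPv4 addresses plus transport ports. */
struct flow_key {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
};

/* Fold one 32-bit word into an FNV-1a hash, byte by byte. */
static uint32_t fnv1a_word(uint32_t h, uint32_t v)
{
    for (int i = 0; i < 4; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;           /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Stand-in flow hash; the real patch's hash function is not shown here. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;     /* FNV-1a 32-bit offset basis */

    h = fnv1a_word(h, k->saddr);
    h = fnv1a_word(h, k->daddr);
    h = fnv1a_word(h, ((uint32_t)k->sport << 16) | k->dport);
    return h;
}

/* Same flow always yields the same cpu; different flows spread out. */
static unsigned flow_to_cpu(const struct flow_key *k, unsigned nr_cpus)
{
    assert(k != NULL && nr_cpus > 0);
    return flow_hash(k) % nr_cpus;
}
```

Because the mapping depends only on the packet headers, an application pinned to a particular core has no way to influence where its packets are processed, which is the limitation raised in the message above.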
We've been running softRSS for a while (http://marc.info/?l=linux-netdev&m=120475045519940&w=2) which I believe has very similar functionality to this patch. From this work we found some nice ways to improve scaling that might be applicable:

- When routing packets to a CPU based on hash, sending to another CPU sharing L2 or L3 cache gives the best performance.
- We added simple functionality to route packets to the CPU on which the application last did a read for the socket. This seems to be a win for cache locality.
- We added a lookup table that maps the Toeplitz hash to the receiving CPU where the application is running. This is for those devices that provide the Toeplitz hash in the receive descriptor. This is a win since the CPU receiving the interrupt doesn't need to take any cache misses on the packet itself.
- In our (preliminary) 10G testing we found that routing packets in software with the above trick actually allows higher PPS and better CPU utilization than using hardware RSS. Also, using both the software routing and hardware RSS yields the best results.
--
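The lookup-table idea in the list above can be sketched as follows. The softRSS code itself is not shown in this thread, so this is only a model under assumptions: the table size, the `steer_*` names, and the fallback to `hash % nr_cpus` when no reader has been recorded are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STEER_TABLE_SIZE 256u  /* assumed table size */
#define STEER_NO_CPU     (-1)  /* no reader recorded for this bucket */

/* Maps a flow hash bucket to the cpu that last read from the socket. */
struct steer_table {
    int cpu[STEER_TABLE_SIZE];
};

static void steer_init(struct steer_table *t)
{
    for (size_t i = 0; i < STEER_TABLE_SIZE; i++)
        t->cpu[i] = STEER_NO_CPU;
}

/* Called from the socket read path: remember where the application runs. */
static void steer_record_read(struct steer_table *t, uint32_t hash, int cpu)
{
    t->cpu[hash % STEER_TABLE_SIZE] = cpu;
}

/*
 * Called on receive with the hash from the descriptor (for example a
 * Toeplitz hash supplied by the device). Prefers the recorded cpu and
 * falls back to plain hash distribution when nothing is recorded.
 */
static int steer_pick_cpu(const struct steer_table *t, uint32_t hash,
                          int nr_cpus)
{
    assert(nr_cpus > 0);

    int cpu = t->cpu[hash % STEER_TABLE_SIZE];

    if (cpu >= 0 && cpu < nr_cpus)
        return cpu;
    return (int)(hash % (uint32_t)nr_cpus);
}
```

Since the interrupt cpu only needs the hash from the receive descriptor and one table lookup, it can pick the target without touching the packet data, which matches the cache-miss argument made above. Distinct flows that land in the same bucket share an entry in this simple model.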
From: "Chris Friesen" <cfriesen@nortel.com> Many cards cannot configure this, but yes we should allow an interface to configure RX flow separation preferences, and we do plan on adding that at some point. It'll probably be an ethtool operation of some sort. We already have a minimalistic RX flow hashing configuration knob, see ETHTOOL_GRXFH and ETHTOOL_SRXFH. --
From: David Miller <davem@davemloft.net>

As a followup to this, I've refreshed my patches and put them in a tree cloned from Linus's current GIT tree:

master.kernel.org:/pub/scm/linux/kernel/git/davem/softirq-2.6.git

I made minor touchups to the second patch, such as adding a few more descriptive comments, and adding the missing export of the softirq_work list array.

Updated version below for reference:

softirq: Add support for triggering softirq work on softirqs.

This is basically a genericization of Jens Axboe's block layer remote softirq changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
include/linux/interrupt.h | 21 +++++++
include/linux/smp.h | 4 +-
kernel/softirq.c | 129 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 153 insertions(+), 1 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index fdd7b90..0a7a14b 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -11,6 +11,8 @@
#include <linux/hardirq.h>
#include <linux/sched.h>
#include <linux/irqflags.h>
+#include <linux/smp.h>
+#include <linux/percpu.h>
#include <asm/atomic.h>
#include <asm/ptrace.h>
#include <asm/system.h>
@@ -272,6 +274,25 @@ extern void softirq_init(void);
extern void raise_softirq_irqoff(unsigned int nr);
extern void raise_softirq(unsigned int nr);
+/* This is the worklist that queues up per-cpu softirq work.
+ *
+ * send_remote_softirq() adds work to these lists, and
+ * the softirq handler itself dequeues from them. The queues
+ * are protected by disabling local cpu interrupts and they must
+ * only be accessed by the local cpu that they are for.
+ */
+DECLARE_PER_CPU(struct list_head [NR_SOFTIRQS], softirq_work_list);
+
+/* Try to send a softirq to a remote cpu. If this cannot be done, the
+ * work will be queued to the local cpu.
+ */
+extern void send_remote_softirq(struct call_single_data *cp, ...
