Re: [PATCH 0/2]: Remote softirq invocation infrastructure.

Previous thread: [PATCH] debug: Introduce a dev_WARN() function by Arjan van de Ven on Friday, September 19, 2008 - 9:07 pm. (7 messages)

Next thread: [PATCH 1/2]: softirq: Define and use NR_SOFTIRQ by David Miller on Friday, September 19, 2008 - 11:48 pm. (1 message)
From: David Miller
Date: Friday, September 19, 2008 - 11:48 pm

Jens Axboe has written some hacks for the block layer that allow
queueing softirq work to remote cpus.  In the context of the block
layer he used this facility to trigger the softirq block I/O
completion on the same cpu where the I/O was submitted.

I want to make use of a similar facility for the networking, so we
should make this thing generic.

It depends upon the generic SMP call function infrastructure, which
Jens wrote specifically to do these remote softirq hacks.

For each softirq there is a per-cpu list head which is where the work
is queued up.

If the platform doesn't support the generic SMP call function bits,
the work is queued onto the local cpu.
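
For illustration, the queueing rule described above can be modelled in plain user-space C. This is only a sketch under stated assumptions: every name here (`wl_queue_softirq_work`, `wl_have_remote_call`, the `WL_*` constants) is hypothetical and not taken from the patch, and the model writes straight into the target cpu's list, whereas in the real infrastructure each list is only touched by the cpu it belongs to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WL_NR_CPUS     4   /* assumption: small fixed cpu count for the model */
#define WL_NR_SOFTIRQS 3   /* assumption: number of softirq types in the model */

struct work_item {
    struct work_item *next;
    int payload;
};

/* One FIFO per (cpu, softirq) pair, mirroring "a per-cpu list head per softirq". */
struct work_list {
    struct work_item *head;
    struct work_item *tail;
};

static struct work_list work_lists[WL_NR_CPUS][WL_NR_SOFTIRQS];

/* Stand-in for "the platform supports the generic SMP call function bits". */
static bool wl_have_remote_call = true;

static void wl_add_tail(struct work_list *l, struct work_item *w)
{
    w->next = NULL;
    if (l->tail)
        l->tail->next = w;
    else
        l->head = w;
    l->tail = w;
}

/* Queue work for target_cpu if possible, otherwise on this_cpu.
 * Returns the cpu whose list received the work. */
static int wl_queue_softirq_work(struct work_item *w, int this_cpu,
                                 int target_cpu, int softirq)
{
    assert(this_cpu >= 0 && this_cpu < WL_NR_CPUS);
    assert(softirq >= 0 && softirq < WL_NR_SOFTIRQS);

    int cpu = target_cpu;
    if (!wl_have_remote_call || cpu < 0 || cpu >= WL_NR_CPUS)
        cpu = this_cpu;   /* fallback: queue onto the local cpu */

    wl_add_tail(&work_lists[cpu][softirq], w);
    return cpu;
}
```

With remote calls available, work aimed at cpu 2 lands on cpu 2's list; with `wl_have_remote_call` cleared, the same request lands on the caller's own list.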

The first patch adds a NR_SOFTIRQS so that we can size these arrays
by the actual number of softirqs instead of the magic number "32"
which is what is used now.
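
A minimal sketch of the idea, assuming the usual C idiom of a trailing enum sentinel. The `EX_*` names are hypothetical and only loosely modelled on the softirq enum; they are not the patch itself.

```c
#include <assert.h>

enum ex_softirq {
    EX_HI_SOFTIRQ = 0,
    EX_TIMER_SOFTIRQ,
    EX_NET_TX_SOFTIRQ,
    EX_NET_RX_SOFTIRQ,
    EX_BLOCK_SOFTIRQ,
    EX_TASKLET_SOFTIRQ,

    EX_NR_SOFTIRQS   /* always last: equals the number of entries above */
};

struct ex_softirq_action {
    void (*action)(void);
};

/* Sized by the sentinel instead of a hard-coded 32. */
static struct ex_softirq_action ex_softirq_vec[EX_NR_SOFTIRQS];

static unsigned int ex_vec_entries(void)
{
    return sizeof(ex_softirq_vec) / sizeof(ex_softirq_vec[0]);
}
```

Adding a new softirq before the sentinel grows every array sized this way without touching any other line.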

The second patch adds the infrastructure and provides interfaces to
invoke softirqs on remote cpus.

Jens, as stated, has block layer uses for this.  I intend to use this
for receive side flow separation on non-multiqueue network cards.  And
Steffen Klassert has a set of IPSEC parallelization changes that can
very likely make use of this.

These patches are against current 2.6.27-rcX

I would suggest that if nobody has any problems with this, we put it
into a GIT tree on kernel.org and any subsystem that wants to use it
can just pull that tree into their GIT tree.  This way, it doesn't
matter which tree Linus pulls in first, he'll get this stuff properly
regardless of ordering.
--

From: Daniel Walker
Date: Saturday, September 20, 2008 - 8:29 am

What's the benefit that you (or Jens) see from migrating softirqs from
specific cpus to others?

Daniel

--

From: Arjan van de Ven
Date: Saturday, September 20, 2008 - 8:45 am

it means you do all the processing on the CPU that submitted the IO in
the first place, and likely still has the various metadata pieces in
its CPU cache (or at least you know you won't need to bounce them over)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Daniel Walker
Date: Saturday, September 20, 2008 - 9:02 am

In the case of networking and block I would think a lot of the softirq
activity is asserted from userspace.. Maybe the scheduler shouldn't be
migrating these tasks, or could take this softirq activity into
account ..

Daniel

--

From: Arjan van de Ven
Date: Saturday, September 20, 2008 - 9:19 am

On Sat, 20 Sep 2008 09:02:09 -0700

well a lot of it comes from completion interrupts.

and moving userspace isn't a good option; think of the case of 1 nic
but 4 apache processes doing the work...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Daniel Walker
Date: Saturday, September 20, 2008 - 10:40 am

One nic, so one interrupt ? I guess we're talking about an SMP machine? 
It seems case dependent .. If you send a lot, or receive a lot.. BUT
it's all speculation on my part..

Dave didn't supply the users of his code, or what kind of improvement
was seen, or the case in which it would be needed. I think Dave knows
his subsystem, but the code on the surface looks like an end run around
some other problem area..

Daniel

--

From: Arjan van de Ven
Date: Saturday, September 20, 2008 - 11:09 am

On Sat, 20 Sep 2008 10:40:04 -0700

completions trigger the next send as well (for both block and net) so

or multicore


it's very fundamental, and has been talked about at various conferences
as well.

the basic problem is that the submitter of the IO (be it block or net)
creates a ton of metadata state on submit, and ideally the completion
processing happens on the same CPU, for two reasons
1) to use the state in the cache
2) for the case where you touch userland data/structures, we assume the
scheduler kept affinity

it's a Moses-to-the-Mountain problem, except we have four Moses' but
only one Mountain. 

Or in CS terms: we move the work to the CPU where the userland is
rather than moving the userland to the IRQ CPU, since there is usually
only one IRQ but many userlands and many cpu cores.
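
To make that concrete, here is a small sketch of "remember the submitting cpu, complete there". The struct and function names are hypothetical and are not the block layer's or the network stack's real ones.

```c
#include <assert.h>

struct io_request {
    int submit_cpu;   /* cpu that built the request and its metadata */
};

/* Called on the submitting cpu when the request is created. */
static void io_submit(struct io_request *rq, int this_cpu)
{
    rq->submit_cpu = this_cpu;
}

/* Called on the cpu that took the device interrupt.  Returns the cpu
 * that should run completion processing: the submitter if it is a
 * valid cpu, otherwise the interrupt cpu itself. */
static int io_completion_cpu(const struct io_request *rq, int irq_cpu,
                             int nr_cpus)
{
    assert(nr_cpus > 0);
    if (rq->submit_cpu >= 0 && rq->submit_cpu < nr_cpus)
        return rq->submit_cpu;
    return irq_cpu;
}
```

The interrupt still arrives on one cpu; only the softirq part of the completion is steered to wherever the request was submitted.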

(for the UP case this is all very irrelevant obviously)

I assume Dave will pipe in if he disagrees with me ;-)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Daniel Walker
Date: Saturday, September 20, 2008 - 11:52 am

There must be some kind of trade off here .. There's a fairly good
performance gain from having the softirq asserted and run on the same cpu
since it runs in interrupt context right after the interrupt.

If you move the softirq to another cpu then you have to re-assert and
either wait for ksoftirqd to handle it or wait for an interrupt on the
new cpu .. Neither is very predictable..

All that vs. bouncing data around the caches.. To what degree has all
that been handled or thought about?

Daniel

--

From: David Miller
Date: Saturday, September 20, 2008 - 1:04 pm

From: Daniel Walker <dwalker@mvista.com>

Give concrete things to discuss or just be quiet.
--

From: David Miller
Date: Saturday, September 20, 2008 - 12:59 pm

From: Daniel Walker <dwalker@mvista.com>

I posted an example use case on netdev a few days ago, and
the block layer example is in Jens's block layer tree.

It's for networking cards that don't do flow separation on
receive using multiple RX queues and MSI-X interrupts.  It's
also for things like IPSEC where the per-packet cpu usage
is so huge (to do the crypto) that it makes sense to even
split up the work to multiple cpus within the same flow.
--

From: Herbert Xu
Date: Saturday, September 20, 2008 - 11:05 pm

Unfortunately doing this with IPsec is going to be non-trivial
since we still want to maintain packet ordering inside IPsec
and you don't get the inner flow information until you decrypt
the packet.

So if we want to process IPsec packets in parallel it's best to
implement that from within the crypto API where we can queue the
result in order to ensure proper ordering.

Of course, we need to balance any effort spent on this with the
likelihood that hardware improvements will soon make this obsolete
(for IPsec anyway).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: David Miller
Date: Saturday, September 20, 2008 - 11:57 pm

From: Herbert Xu <herbert@gondor.apana.org.au>


That's another option, of course.  And crypto could use remote

True, but old hardware will always exist.

A lot of very reasonable machines out there will benefit from software
RX flow separation.
--

From: Ilpo Järvinen
Date: Monday, September 22, 2008 - 3:36 am

...Also, producing buggy hardware will not suddenly just vanish either 
("can you please turn off ipsec offloading and see if you can still
reproduce" :-))...

-- 
 i.
--

From: Herbert Xu
Date: Tuesday, September 23, 2008 - 9:54 pm

That's fine.  If your AES hardware is buggy you just fall back
to using the software version.

In fact if this was done through the crypto API it would even
happen automatically.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: James Courtier-Dutton
Date: Sunday, September 21, 2008 - 2:13 am

Why do you have to preserve packet ordering?
TCP/IP does not preserve packet ordering across the network.
IPSEC uses a sliding window for anti-replay detection precisely because
it has to be able to handle out-of-order packets.

Sharing the sliding window between CPUs might be interesting!
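
For reference, a single-cpu version of such a sliding window can be sketched as below. This is a simplified bitmap check in the style commonly used for ESP sequence numbers, not the kernel's xfrm code: the names are hypothetical, the window is fixed at 32 packets, and sequence number wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REPLAY_WINDOW 32u   /* assumption: fixed 32-packet window */

struct replay_state {
    uint32_t top;      /* highest sequence number accepted so far */
    uint32_t bitmap;   /* bit i set => sequence number (top - i) was seen */
};

/* Returns true and records seq if the packet is acceptable,
 * false if it is a duplicate or older than the window. */
static bool replay_check_and_update(struct replay_state *st, uint32_t seq)
{
    assert(st != NULL);
    if (seq == 0)
        return false;               /* 0 is never a valid sequence number here */

    if (seq > st->top) {
        /* Packet advances the window: slide the bitmap left. */
        uint32_t shift = seq - st->top;
        st->bitmap = (shift < REPLAY_WINDOW) ? (st->bitmap << shift) : 0;
        st->bitmap |= 1u;           /* bit 0 is the new top */
        st->top = seq;
        return true;
    }

    uint32_t offset = st->top - seq;
    if (offset >= REPLAY_WINDOW)
        return false;               /* too old, fell out of the window */
    if (st->bitmap & (1u << offset))
        return false;               /* already seen: replay */
    st->bitmap |= 1u << offset;     /* out of order but new: accept */
    return true;
}
```

Out-of-order packets inside the window are accepted exactly once. Every accepted packet updates `top` or `bitmap`, which is the shared state that would have to be synchronised if several cpus processed one SA.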

James
--

From: David Miller
Date: Sunday, September 21, 2008 - 2:17 am

From: James Courtier-Dutton <James@superbug.co.uk>

Yes, but we should preserve per-flow ordering as much as
possible within the local system for optimal performance.

Things fall apart completely, even with TCP, once you reorder more

Again, Steffen's patches take care of this issue.
--

From: Steffen Klassert
Date: Sunday, September 21, 2008 - 2:46 am

It's non-trivial but possible. I have a test implementation that
runs the whole IP layer in parallel. The basic idea to keep track
of the packet ordering is to give the packets sequence numbers
before we run in parallel. Before we push the packets to the upper
layers or to the neighboring subsystem I have a mechanism that
brings them back to the right order. 
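
The general technique (stamp a sequence number before fanning out, release strictly in sequence afterwards) can be sketched as follows. This is not the test implementation mentioned above; the names, the fixed slot count, and the use of an `int` in place of a packet pointer are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REORDER_SLOTS 8u   /* assumption: at most 8 packets in flight */

struct reorder_queue {
    uint32_t next_seq;                 /* next sequence number to release */
    bool     ready[REORDER_SLOTS];     /* slot holds a finished packet */
    int      packet[REORDER_SLOTS];    /* stand-in for a packet pointer */
};

/* A worker reports that packet `seq` has finished parallel processing.
 * Returns false if seq lies outside the current window. */
static bool reorder_complete(struct reorder_queue *q, uint32_t seq, int pkt)
{
    assert(q != NULL);
    if (seq - q->next_seq >= REORDER_SLOTS)   /* also rejects seq < next_seq */
        return false;
    q->ready[seq % REORDER_SLOTS] = true;
    q->packet[seq % REORDER_SLOTS] = pkt;
    return true;
}

/* Hand finished packets onward in their original order, stopping at
 * the first gap.  Returns the number of packets written to out. */
static size_t reorder_release(struct reorder_queue *q, int *out, size_t max)
{
    size_t n = 0;
    while (n < max && q->ready[q->next_seq % REORDER_SLOTS]) {
        uint32_t slot = q->next_seq % REORDER_SLOTS;
        out[n++] = q->packet[slot];
        q->ready[slot] = false;
        q->next_seq++;
    }
    return n;
}
```

Packets 2 and 1 finishing early are held back until packet 0 completes, at which point all three are released in order. A real version would also need locking or per-cpu handoff, which this sketch leaves out.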

With my test environment (two quad core boxes) I get with IPSEC
aes192-sha1 and one tcp stream a throughput of about 600 Mbit/s
compared to about 200 Mbit/s without the parallel processing. 
--

From: Herbert Xu
Date: Monday, September 22, 2008 - 1:23 am

Yes this would definitely help IPsec.  However, I'm not so sure
of its benefit to routing and other parts of networking.  That's
why I'd rather have this sort of hack stay in the crypto system
where it's isolated rather than having it proliferate throughout
the network stack.

When the time comes to weed out this because all CPUs that matter
have encryption in hardware then it'll be much easier to delete a
crypto algorithm as opposed to removing parts of the network
infrastructure :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Steffen Klassert
Date: Monday, September 22, 2008 - 6:54 am

The crypto benefits the most of course, but routing and xfrm lookups
could benefit on bigger networks too. However, the method to bring
the packets back to order is quite generic and could be used
even in the crypto system. The important thing for me is that we

Yes, if you think about how to remove it I agree here.
--

From: David Miller
Date: Saturday, September 20, 2008 - 1:00 pm

From: Daniel Walker <dwalker@mvista.com>

Absolutely wrong.

On a per-flow basis you want to push the work down as far
as possible down to individual cpus.  Why do you think the
hardware folks are devoting silicon to RX multiqueue facilities
that spread the RX work amongst available cpus using MSI-X?
--

From: Chris Friesen
Date: Monday, September 22, 2008 - 2:22 pm

I'm not sure this belongs in this particular thread but I was
interested in how you're planning on doing this?  Is there going to be a
way for userspace to specify which traffic flows they'd like to direct
to particular cpus, or will the kernel try to figure it out on the fly?

We have application guys that would like very much to be able to nail 
specific apps to specific cores and have the kernel send all their 
packets to those cores for processing.

Chris
--

From: David Miller
Date: Monday, September 22, 2008 - 3:12 pm

From: "Chris Friesen" <cfriesen@nortel.com>

Something like this patch which I posted last week on
netdev.

net: Do software flow separation on receive.

Push netif_receive_skb() work to remote cpus via flow
hashing and remote softirqs.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/interrupt.h |    1 +
 include/linux/netdevice.h |    2 -
 include/linux/skbuff.h    |    3 +
 net/core/dev.c            |  273 +++++++++++++++++++++++++--------------------
 4 files changed, 157 insertions(+), 122 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 806b38f..223e68f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,6 +247,7 @@ enum
 	TIMER_SOFTIRQ,
 	NET_TX_SOFTIRQ,
 	NET_RX_SOFTIRQ,
+	NET_RECEIVE_SOFTIRQ,
 	BLOCK_SOFTIRQ,
 	TASKLET_SOFTIRQ,
 	SCHED_SOFTIRQ,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 488c56e..a044caa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -965,11 +965,9 @@ static inline int unregister_gifconf(unsigned int family)
 struct softnet_data
 {
 	struct Qdisc		*output_queue;
-	struct sk_buff_head	input_pkt_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
-	struct napi_struct	backlog;
 #ifdef CONFIG_NET_DMA
 	struct dma_chan		*net_dma;
 #endif
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9099237..e36bc86 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -18,6 +18,7 @@
 #include <linux/compiler.h>
 #include <linux/time.h>
 #include <linux/cache.h>
+#include <linux/smp.h>
 
 #include <asm/atomic.h>
 #include <asm/types.h>
@@ -255,6 +256,8 @@ struct sk_buff {
 	struct sk_buff		*next;
 	struct sk_buff		*prev;
 
+	struct call_single_data	csd;
+
 	struct sock		*sk;
 	ktime_t			tstamp;
 	struct net_device	*dev;
diff --git a/net/core/dev.c b/net/core/dev.c
index e719ed2..09827c7 100644
--- a/net/core/dev.c
+++ ...
--

From: Chris Friesen
Date: Tuesday, September 23, 2008 - 10:03 am

That patch basically just picks an arbitrary cpu for each flow.  This 
would spread the load out across cpus, but it doesn't allow any input 
from userspace.
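
As a rough picture of what "picks a cpu for each flow" means, the sketch below hashes the flow's addresses and ports and reduces the result modulo the cpu count. It is an assumption-laden illustration: the struct, the function names, and the FNV-1a mixer are stand-ins and not what the patch actually uses.

```c
#include <assert.h>
#include <stdint.h>

struct flow_key {
    uint32_t saddr, daddr;   /* IPv4 source and destination address */
    uint16_t sport, dport;   /* transport source and destination port */
};

/* Illustrative 32-bit FNV-1a over the key, byte by byte. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t words[3] = {
        k->saddr,
        k->daddr,
        ((uint32_t)k->sport << 16) | k->dport,
    };
    uint32_t h = 2166136261u;               /* FNV offset basis */

    for (int i = 0; i < 3; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xffu;
            h *= 16777619u;                 /* FNV prime */
        }
    }
    return h;
}

/* Every packet of a flow maps to the same cpu, but which cpu that is
 * depends only on the hash, not on where the consumer runs. */
static unsigned int flow_to_cpu(const struct flow_key *k, unsigned int nr_cpus)
{
    assert(nr_cpus > 0);
    return flow_hash(k) % nr_cpus;
}
```

This keeps per-flow ordering, since a flow never changes cpu, but it gives userspace no say in the placement.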

We have a current application where there are 16 cores and 16 threads. 
They would really like to be able to pin one thread to each core and 
tell the kernel what packets they're interested in so that the kernel 
can process those packets on that core to gain the maximum caching 
benefit as well as reduce reordering issues.  In our case the hardware 
supports filtering for multiqueues, so we could pass this information 
down to the hardware to avoid software filtering.

Either way, it requires some way for userspace to indicate interest in a 
particular flow.  Has anyone given any thought to what an API like this 
would look like?

I suppose we could automatically look at bound network sockets owned by 
tasks that are affined to single cpus.  This would simplify userspace 
but would reduce flexibility for things like packet sockets with socket 
filters applied.

Chris
--

From: Tom Herbert
Date: Tuesday, September 23, 2008 - 2:10 pm

We've been running softRSS for a while
(http://marc.info/?l=linux-netdev&m=120475045519940&w=2) which I
believe has very similar functionality to this patch.  From this work
we found some nice ways to improve scaling that might be applicable:

- When routing packets to CPU based on hash, sending to another CPU
sharing L2 or L3 cache is best performance.
- We added a simple functionality to route packets to the CPU on which
the application last did a read for the socket.  This seems to be a
win for cache locality.
- We added a lookup table that maps the Toeplitz hash to the receiving
CPU where the application is running.  This is for those devices that
provide the Toeplitz hash in the receive descriptor.  This is a win
since the CPU receiving the interrupt doesn't need to take any cache
misses on the packet itself.
- In our (preliminary) 10G testing we found that routing packets in
software with the above trick actually allows higher PPS and
better CPU utilization than using hardware RSS.  Also, using both the
software routing and hardware RSS yields the best results.
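
The lookup-table and last-read ideas in the list above might be combined roughly as sketched here. This is not the softRSS code; the table size, the names, and the fallback to a plain modulo are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define STEER_TABLE_SIZE 256u   /* assumption: power-of-two table size */
#define STEER_NO_CPU     (-1)

struct steer_table {
    int cpu[STEER_TABLE_SIZE];  /* hash bucket -> preferred cpu */
};

static void steer_init(struct steer_table *t)
{
    for (unsigned int i = 0; i < STEER_TABLE_SIZE; i++)
        t->cpu[i] = STEER_NO_CPU;
}

/* Socket read path: remember which cpu the application last read on. */
static void steer_note_read(struct steer_table *t, uint32_t rx_hash, int cpu)
{
    t->cpu[rx_hash & (STEER_TABLE_SIZE - 1)] = cpu;
}

/* Interrupt cpu: choose a target using only the hash taken from the
 * receive descriptor, without touching the packet data. */
static int steer_lookup(const struct steer_table *t, uint32_t rx_hash,
                        int nr_cpus)
{
    assert(nr_cpus > 0);
    int cpu = t->cpu[rx_hash & (STEER_TABLE_SIZE - 1)];
    if (cpu == STEER_NO_CPU || cpu >= nr_cpus)
        return (int)(rx_hash % (uint32_t)nr_cpus);   /* no hint yet */
    return cpu;
}
```

Flows whose hashes share a bucket also share a hint, so a small table trades accuracy for footprint.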

--

From: David Miller
Date: Tuesday, September 23, 2008 - 2:51 pm

From: "Chris Friesen" <cfriesen@nortel.com>


Many cards cannot configure this, but yes we should allow an interface to configure
RX flow separation preferences, and we do plan on adding that at some point.

It will probably be an ethtool operation of some sort.  We already have a minimalistic
RX flow hashing configuration knob, see ETHTOOL_GRXFH and ETHTOOL_SRXFH.
--

From: David Miller
Date: Wednesday, September 24, 2008 - 12:42 am

From: David Miller <davem@davemloft.net>

As a followup to this, I've refreshed my patches and put them
in a tree cloned from Linus's current GIT tree:

	master.kernel.org:/pub/scm/linux/kernel/git/davem/softirq-2.6.git

I made minor touchups to the second patch, such as adding a few more
descriptive comments, and adding the missing export of the softirq_work
list array.

Updated version below for reference:

softirq: Add support for triggering softirq work on softirqs.

This is basically a genericization of Jens Axboe's block layer
remote softirq changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 include/linux/interrupt.h |   21 +++++++
 include/linux/smp.h       |    4 +-
 kernel/softirq.c          |  129 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 153 insertions(+), 1 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index fdd7b90..0a7a14b 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -11,6 +11,8 @@
 #include <linux/hardirq.h>
 #include <linux/sched.h>
 #include <linux/irqflags.h>
+#include <linux/smp.h>
+#include <linux/percpu.h>
 #include <asm/atomic.h>
 #include <asm/ptrace.h>
 #include <asm/system.h>
@@ -272,6 +274,25 @@ extern void softirq_init(void);
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
 
+/* This is the worklist that queues up per-cpu softirq work.
+ *
+ * send_remote_softirq() adds work to these lists, and
+ * the softirq handler itself dequeues from them.  The queues
+ * are protected by disabling local cpu interrupts and they must
+ * only be accessed by the local cpu that they are for.
+ */
+DECLARE_PER_CPU(struct list_head [NR_SOFTIRQ], softirq_work_list);
+
+/* Try to send a softirq to a remote cpu.  If this cannot be done, the
+ * work will be queued to the local cpu.
+ */
+extern void send_remote_softirq(struct call_single_data *cp, ...