Re: [RFC v2: Patch 1/3] net: hand off skb list to other cpu to submit to upper layer

Previous thread: [PATCH] eHEA: Don't do memory allocation under lock if not necessary by David Howells on Wednesday, March 11, 2009 - 1:44 am. (2 messages)

Next thread: [RFC v2: Patch 3/3] net: hand off skb list to other cpu to submit to upper layer by Zhang, Yanmin on Wednesday, March 11, 2009 - 1:53 am. (1 message)
From: Zhang, Yanmin
Date: Wednesday, March 11, 2009 - 1:53 am

I received some comments. Special thanks to Stephen Hemminger for teaching me
what reorder is and for his other comments. Thanks also to the others who raised comments.

v2 has some improvements.
1) Add a new sysfs interface, /sys/class/net/ethXXX/rx_queueXXX/processing_cpu. The admin
can use it to configure the binding between an RX queue and a cpu number, so it's convenient
for drivers to use the new capability.
2) Delete function netif_rx_queue.
3) Optimize ipi notification. No new notification is sent when the destination's
input_pkt_alien_queue isn't empty.
4) Did lots of testing, mostly focusing on the slab allocators (slab/slub/slqb). I currently
use SLUB with a big slub_max_order.
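
To illustrate item 3), here is a minimal user-space sketch in Python. It is a toy model, not the patch code: the class name AlienQueue, the methods hand_off and drain, and the ipis_sent counter are all hypothetical, and the locking a real per-cpu queue needs is left out. It only shows the rule that a notification is issued when the destination queue goes from empty to non-empty.

```python
from collections import deque


class AlienQueue:
    """Toy model of a per-cpu input_pkt_alien_queue (illustration only)."""

    def __init__(self):
        self.packets = deque()
        self.ipis_sent = 0  # hypothetical counter of notifications issued

    def hand_off(self, skb_list):
        """Append a batch of packets; return True if a notification was sent."""
        if not skb_list:
            return False
        was_empty = not self.packets
        self.packets.extend(skb_list)
        if was_empty:
            # The owning cpu has nothing pending, so it must be told to run.
            self.ipis_sent += 1
            return True
        # A notification is already outstanding; the owner will also see
        # these packets when it drains the queue.
        return False

    def drain(self):
        """Owner cpu takes everything queued so far in one pass."""
        batch = list(self.packets)
        self.packets.clear()
        return batch
```

With this model, two back-to-back hand_off calls before a drain produce one notification, and the next hand_off after a drain produces another.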

---

Subject: net: hand off skb list to other cpu to submit to upper layer
From: Zhang Yanmin <yanmin.zhang@linux.intel.com>

Recently, I have been investigating an ip_forward performance issue with 10G IXGBE NICs.
I run the testing on 2 machines. Each machine has 2 10G NICs. The 1st one sends
packets with pktgen. The 2nd receives the packets on one NIC and forwards them out
from its 2nd NIC.

Initial testing showed that cpu cache sharing has an impact on speed. As the NICs support
multi-queue, I bound the queues to different logical cpus of different physical cpus
while considering cache sharing carefully. I could get about 30~40% improvement.

Compared with the sending speed on the 1st machine, the forwarding speed is still not good,
only about 60% of the sending speed. As a matter of fact, the IXGBE driver starts NAPI when
an interrupt arrives. When ip_forward=1, the receiver collects a packet and forwards it out
immediately. So although IXGBE collects packets with NAPI, the forwarding really has
much impact on collection. As IXGBE runs very fast, it drops packets quickly. The better
way is for the receiving cpu to do nothing other than collect packets.

Currently the kernel has the backlog to support a similar capability, but process_backlog still
runs on the receiving cpu. I enhance the backlog by adding a new input_pkt_alien_queue to
softnet_data. Receiving cpu ...
From: Andi Kleen
Date: Wednesday, March 11, 2009 - 4:13 am

Seems very inconvenient to have to configure this by hand. How about
auto selecting one that shares the same LLC or somesuch? Passing
data to anything with the same LLC should be cheap enough.

BTW the standard idea to balance processing over multiple CPUs was to
use MSI-X to multiple CPUs and just use the hash function on the
NIC. Have you considered this for forwarding too? The trick here would
be to try to avoid reordering inside streams as far as possible, but
since the NIC hash should work on flow basis that should be ok.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Zhang, Yanmin
Date: Thursday, March 12, 2009 - 1:16 am

There are 2 kinds of LLC sharing here.
1) RX/TX share the LLC;
2) All RX share the LLC of some cpus and TX share the LLC of other cpus.

Item 1) is important, but sometimes item 2) is also important when the sending speed is
very high and a huge amount of data is in flight, which flushes the cpu cache quickly.

Yes, when the data isn't huge. My forwarding testing currently could reach at 270M bytes per

Yes. My method still depends on MSI-X and multi-queue. One difference is I just need less than

Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something

Yes. Originally, I planned to add a tx_num under the same sysfs directory, so the admin could
define that all packets received from an RX queue should be sent out from a specific TX queue.
So struct sk_buff->queue_mapping would be a union of 2 sub-members, rx_num and tx_num. But
sk_buff->queue_mapping is just a u16 which is a small type. We might use the most-significant
bit of sk_buff->queue_mapping as a flag as rx_num and tx_num wouldn't exist at the

It's not meant to solve the reorder issue. The starting point is that a 10G NIC is very fast.
We need some cpus dedicated to packet receiving. If they work on other things, the NIC might
drop packets quickly.

The sysfs interface is just to facilitate NIC drivers. If there is no the sysfs interface,

Yes, hardware is good at preventing reorder. My method doesn't change the order in the software
layer.

Thanks Andi.


--

From: Ben Hutchings
Date: Thursday, March 12, 2009 - 7:08 am

Yes, that's exactly what they do.  This feature is sometimes called
Receive-Side Scaling (RSS) which is Microsoft's name for it.  Microsoft
requires Windows drivers performing RSS to provide the hash value to the
networking stack, so Linux drivers for the same hardware should be able

The choice of TX queue can be based on the RX hash so that configuration

Aggressive power-saving causes far greater latency than context-
switching under Linux.  I believe most 10G NICs have large RX FIFOs to
mitigate against this.  Ethernet flow control also helps to prevent
[...]

Or through the ethtool API, which already has some multiqueue control
operations.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Zhang, Yanmin
Date: Thursday, March 12, 2009 - 11:43 pm

Oh, I didn't know the background. I need to study more about networking.

I agree. I double checked the latest codes of tree net-next-2.6 and function skb_tx_hash

Yes, when the NIC is mostly free. When the NIC is busy, it wouldn't enter power-saving mode.

I guess the NIC might allocate resources evenly for all queues, at least by default. Considering
a packet sending burst with the same SRC/DST, a specific queue might fill up quickly. I
instrumented the driver and kernel to print out packet receiving and forwarding. As the latest IXGBE
driver gets a packet and forwards it immediately, I think most packets are dropped by hardware
because the cpu doesn't collect packets quickly enough when the specific receiving queue is full. By
comparing the sending speed and forwarding speed, we could get the dropping rate easily.
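
As a small worked example of that comparison, the sketch below computes the dropping rate from the two measured speeds. The function name drop_rate and its parameter names are hypothetical; the only figure taken from this thread is the "about 60% of sending speed" forwarding rate from the first message, and it assumes every packet that is not forwarded was dropped.

```python
def drop_rate(sent_pps, forwarded_pps):
    """Fraction of sent packets that never left the forwarding machine."""
    if sent_pps <= 0:
        raise ValueError("sent_pps must be positive")
    return (sent_pps - forwarded_pps) / sent_pps


# Forwarding at about 60% of the sending speed means roughly 40% is dropped.
example = drop_rate(sent_pps=100, forwarded_pps=60)
```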

My experiment shows the receiving cpu is more than 50% idle and the cpu does often collect all
packets until the specific queue is empty. I think that's because pktgen switches to a new
SRC/DST to produce another burst to fill other queues quickly.

It's hard to say cpu is slower than NIC because they work on different parts of the full

That's an alternative approach to configure it. If checking the sample patch on driver,
we can find the change is very small.

Thanks for your kind comments.

Yanmin


--

From: Tom Herbert
Date: Friday, March 13, 2009 - 10:06 am

On Thu, Mar 12, 2009 at 11:43 PM, Zhang, Yanmin

You'll definitely want to look at the hardware provided hash.  We've
been using a 10G NIC which provides a Toeplitz hash (the one defined
by Microsoft) and a software RSS-like capability to move packets from
an interrupting CPU to another for processing.  The hash could be used
to index to a set of CPUs, but we also use the hash as a connection
identifier to key into a lookup table to steer packets to the CPU
where the application is running based on the running CPU of the last
recvmsg.  Using the device provided hash in this manner is a HUGE win,
as opposed to taking cache misses to get 4-tuple from packet itself to
compute a hash.  I posted some patches a while back on our work if
you're interested.
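
The two uses of the hash described above can be sketched as follows. This is a user-space toy in Python, not the posted patches: the class name FlowSteering, the methods note_recvmsg and target_cpu, and the table size are all hypothetical. A flow whose consumer is known is steered to the cpu that last called recvmsg; anything else falls back to indexing the cpu set with the hash.

```python
class FlowSteering:
    """Toy model: steer by last-recvmsg cpu, else by hash over a cpu set."""

    def __init__(self, cpus, table_size=256):
        self.cpus = list(cpus)
        self.table_size = table_size
        self.flow_cpu = {}  # table slot -> cpu that last ran recvmsg

    def _slot(self, rx_hash):
        # Different flows can share a slot; a real table would have to
        # tolerate that, and so does this sketch.
        return rx_hash % self.table_size

    def note_recvmsg(self, rx_hash, cpu):
        """Record where the application consuming this flow last ran."""
        self.flow_cpu[self._slot(rx_hash)] = cpu

    def target_cpu(self, rx_hash):
        """Pick the cpu that should process a packet with this hash."""
        cpu = self.flow_cpu.get(self._slot(rx_hash))
        if cpu is not None:
            return cpu
        # No consumer recorded: spread flows over the cpu set by hash.
        return self.cpus[rx_hash % len(self.cpus)]
```

Because the hash comes from the device, neither path needs to touch the packet headers, which is the cache-miss saving described above.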

We are also using multiple RX queues of the 10G device in concert with
pretty good results.  We have noticed that the interrupt overheads
substantially mitigate the benefits.  In fact, I would say the
software packet steering has provided the greater benefit (and it's
very useful on our many 1G NICs that don't have multiq!).

Tom
--

From: David Miller
Date: Friday, March 13, 2009 - 11:51 am

From: Tom Herbert <therbert@google.com>

I never understood this.

If you don't let the APIC move the interrupt around, the individual
MSI-X interrupts will steer packets to individual specific CPUs and as
a result the scheduler will migrate tasks over to those cpus since the
wakeup events keep occurring there.
--

From: Tom Herbert
Date: Friday, March 13, 2009 - 2:01 pm

We are trying to follow the scheduler's decisions as opposed to leading
it.  This works on very loaded systems, with applications binding to
cpusets, with threads that are receiving on multiple sockets.  I
suppose it might be compelling if a NIC could steer packets per flow,
instead of by a hash...
--

From: Ben Hutchings
Date: Friday, March 13, 2009 - 3:10 pm

Depending on the NIC, RX queue selection may be done using a large
number of bits of the hash value and an indirection table or by matching
against specific values in the headers.  The SFC4000 supports both of
these, though limited to TCP/IPv4 and UDP/IPv4.  I think Neptune may be
more flexible.  Of course, both indirection table entries and filter
table entries will be limited resources in any NIC, so allocating these
wholly automatically is an interesting challenge.
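
A minimal sketch of the two selection mechanisms, as a user-space toy in Python. It does not model the SFC4000 or any other real NIC: the class name RxQueueSelector, its methods, and the table sizes are hypothetical. Exact-match filters are consulted first, and everything else goes through an indirection table indexed by the low bits of the hash. Both tables are deliberately small to show that entries are a limited resource.

```python
class RxQueueSelector:
    """Toy model of RX queue selection by filter match or hash indirection."""

    def __init__(self, num_queues, table_size=128, max_filters=4):
        if table_size & (table_size - 1):
            raise ValueError("table_size must be a power of two")
        self.mask = table_size - 1
        # Default layout: spread the buckets round-robin over the queues.
        self.table = [i % num_queues for i in range(table_size)]
        self.filters = {}  # exact header match -> queue
        self.max_filters = max_filters

    def add_filter(self, flow, queue):
        """Pin one flow to a queue; fails when the filter table is full."""
        if flow not in self.filters and len(self.filters) >= self.max_filters:
            return False
        self.filters[flow] = queue
        return True

    def move_bucket(self, bucket, queue):
        """Repoint one indirection entry, moving every flow hashing to it."""
        self.table[bucket] = queue

    def select(self, flow, rx_hash):
        """Return the RX queue for a packet of this flow with this hash."""
        if flow in self.filters:
            return self.filters[flow]
        return self.table[rx_hash & self.mask]
```

Repointing an indirection entry moves a whole bucket of flows at once, while a filter moves exactly one flow but consumes a scarce entry, which is why automatic allocation is not straightforward.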

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--

From: Stephen Hemminger
Date: Friday, March 13, 2009 - 3:15 pm

On Fri, 13 Mar 2009 22:10:59 +0000

The problem is that without hardware support, handing off the packet
may take more effort than processing it. Especially when a cache line
has to bounce to another CPU and the system is trying to keep up with DoS attacks.
It all depends how much processing is required, and the architecture
of the system. The tradeoff would change over time based on processing
speed and optimizing the receive/firewall code.
--

From: Zhang, Yanmin
Date: Sunday, March 15, 2009 - 8:20 pm

Your scenario is different from mine. My case is ip_forward which happens
in kernel and there is no application participating in the forwarding.

I might test the application communication on 10G NIC with my method later.


--

From: Andi Kleen
Date: Thursday, March 12, 2009 - 7:34 am

Interrupt binding is something popular for benchmarks, but most users
don't (and shouldn't need to) care. Having it work well out of the box


There's a Microsoft spec for a standard hash function that does this
on NICs and all the serious ones support it these days. The hash 
is normally used to select a MSI-X target based on the input header.
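
The hash in question is the Toeplitz hash mentioned elsewhere in this thread. Below is a minimal sketch of it in Python, assuming the usual TCP/IPv4 input layout of source address, destination address, source port, destination port. EXAMPLE_KEY is an arbitrary 40-byte value chosen for illustration, not the key of any real NIC, and the helper names flow_key and toeplitz_hash are hypothetical.

```python
import socket
import struct

# Arbitrary 40-byte key for illustration only; real hardware is programmed
# with its own secret key.
EXAMPLE_KEY = bytes(range(1, 41))


def flow_key(src_ip, dst_ip, src_port, dst_port):
    """Concatenate the TCP/IPv4 4-tuple in network byte order (12 bytes)."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


def toeplitz_hash(key, data):
    """32-bit Toeplitz hash of data under key.

    For every set bit of the input, taken most-significant bit first, XOR in
    the 32-bit window of the key that starts at the same bit position.
    """
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than the input")
    key_bits = int.from_bytes(key, "big")
    key_len = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            window = (key_bits >> (key_len - 32 - i)) & 0xFFFFFFFF
            result ^= window
    return result
```

The hash depends only on the header fields, so every packet of a given flow gets the same value and therefore the same MSI-X target.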


Point was that any solution shouldn't add more reordering. But when an RSS
hash is used there is no reordering on stream basis.

-Andi 

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Zhang, Yanmin
Date: Friday, March 13, 2009 - 2:06 am

Thanks Andi. You tell the truth. Now I understand why David Miller is working
on auto TX selection.

One thing I want to clarify is that, with the default configuration, the processing path
still follows the current automatic selection. That means my method has little impact
on the current automatic selection with the default configuration, except a small cache miss.
Another exception is that IXGBE prefers getting one packet and sending one packet
immediately instead of using the backlog.

Even when turning on the new capability to separate packet receiving and packet
processing, TX selection still follows the current automatic selection. The difference
is that we use a different cpu. The driver could still record the RX number into the skb, which is used

RX binding depends on interrupt binding totally. If the MSI-X interrupt is sent to cpu A,
cpu A will collect the packets on the RX queue. By default, the interrupt isn't bound.

Software knows the LLC sharing of cpu A. If cpu A receives the interrupt, it couldn't just
throw packets to other cpus which share its LLC, because it doesn't know whether other cpus

Thanks for the explanation. The capability defined by the spec is to choose
an MSI-X number and provide a hint when sending a cloned packet out. Does the NIC
know how busy the cpu is? I assume not. So the hash tries to distribute packets
into RX queues evenly while also avoiding reorder.

We might say irqbalance could balance the workload, so we expect the cpu workload to be
even. My testing shows such evenly distribution of packets on all cpu isn't

There are 2 targets with my method. One is the packet collecting cpu and the other
is the packet processing cpu.

Yes.

Thanks again.

Yanmin


--
