I got some comments. Special thanks to Stephen Hemminger for teaching me what reordering is and for his other comments. Thanks also to the others who raised comments.

v2 has some improvements.
1) Add a new sysfs interface /sys/class/net/ethXXX/rx_queueXXX/processing_cpu. The admin can use it to configure the binding between an RX queue and a cpu number, so it's convenient for drivers to use the new capability.
2) Delete function netif_rx_queue;
3) Optimize ipi notification. No new notification is sent when the destination's input_pkt_alien_queue isn't empty.
4) Did lots of testing, mostly focusing on slab allocator (slab/slub/slqb) and use SLUB with big slub_max_order currently.

---

Subject: net: hand off skb list to other cpu to submit to upper layer
From: Zhang Yanmin <yanmin.zhang@linux.intel.com>

Recently, I have been investigating an ip_forward performance issue with a 10G IXGBE NIC. I run the testing on 2 machines. Every machine has 2 10G NICs. The 1st machine sends packets with pktgen. The 2nd receives the packets on one NIC and forwards them out from its 2nd NIC.

Initial testing showed that cpu cache sharing has an impact on speed. As the NICs support multi-queue, I bind the queues to different logical cpus of different physical cpus while considering cache sharing carefully. I could get about 30~40% improvement. Compared with the sending speed on the 1st machine, the forward speed is still not good, only about 60% of the sending speed.

As a matter of fact, the IXGBE driver starts NAPI when an interrupt arrives. When ip_forward=1, the receiver collects a packet and forwards it out immediately. So although IXGBE collects packets with NAPI, the forwarding really has much impact on collection. As IXGBE runs very fast, it drops packets quickly. It is better for the receiving cpu to do nothing other than collect packets.

Currently the kernel has backlog to support a similar capability, but process_backlog still runs on the receiving cpu. I enhance backlog by adding a new input_pkt_alien_queue to softnet_data. Receiving cpu ...
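The hand-off and the ipi optimization in item 3) can be modelled in user space roughly as below. This is only a sketch of the idea, not the patch itself: `AlienQueue`, `hand_off`, `process_alien_queue` and the `send_ipi` callback are hypothetical names, and a real kernel implementation would need locking and per-cpu data that this model leaves out.

```python
from collections import deque


class AlienQueue:
    """User-space stand-in for a per-cpu input_pkt_alien_queue (assumed shape)."""

    def __init__(self):
        self.packets = deque()


def hand_off(queues, dest_cpu, skbs, send_ipi):
    """Append skbs to dest_cpu's queue; notify only if the queue was empty.

    Returns True when a notification was sent. If the queue already holds
    packets, an earlier notification is assumed to be still pending.
    """
    queue = queues[dest_cpu]
    was_empty = not queue.packets
    queue.packets.extend(skbs)
    if was_empty and skbs:
        send_ipi(dest_cpu)  # hypothetical callback standing in for the ipi
        return True
    return False


def process_alien_queue(queues, cpu):
    """Drain everything queued for this cpu, as the processing cpu would."""
    queue = queues[cpu]
    drained = list(queue.packets)
    queue.packets.clear()
    return drained
```

With one destination cpu, two back-to-back calls to `hand_off` trigger a single notification; the next notification only happens after `process_alien_queue` has emptied the queue.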
Seems very inconvenient to have to configure this by hand. How about auto selecting one that shares the same LLC or somesuch? Passing data to anything with the same LLC should be cheap enough. BTW the standard idea to balance processing over multiple CPUs was to use MSI-X to multiple CPUs and just use the hash function on the NIC. Have you considered this for forwarding too? The trick here would be to try to avoid reordering inside streams as far as possible, but since the NIC hash should work on a flow basis that should be ok. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
There are 2 kinds of LLC sharing here. 1) RX/TX share the LLC; 2) All RX share the LLC of some cpus and TX share the LLC of other cpus. Item 1) is important, but sometimes item 2) is also important when the sending speed is very high and a huge amount of data is in flight, which flushes the cpu cache quickly.

Yes, when the data isn't huge. My forwarding testing currently could reach 270M bytes per

Yes. My method still depends on MSI-X and multi-queue. One difference is I just need less than

Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something

Yes. Originally, I planned to add a tx_num under the same sysfs directory, so the admin could define that all packets received from an RX queue should be sent out from a specific TX queue. So struct sk_buff->queue_mapping would be a union of 2 sub-members, rx_num and tx_num. But sk_buff->queue_mapping is just a u16, which is a small type. We might use the most-significant bit of sk_buff->queue_mapping as a flag as rx_num and tx_num wouldn't exist at the

It's not to solve the reorder issue. The starting point is that a 10G NIC is very fast. We need some cpus dedicated to packet receiving. If they work on other things, the NIC might drop packets quickly. The sysfs interface is just to facilitate NIC drivers. If there is no sysfs interface,

Yes, hardware is good at preventing reorder. My method doesn't change the order in the software layer.

Thanks Andi. --
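The idea of reusing the most-significant bit of a 16-bit queue_mapping as a flag could look roughly like this. It is only an illustration in user-space Python: the constant and function names are hypothetical, and the choice that a set bit means "TX queue number" is an assumption, since the message does not say which way the flag would go.

```python
TX_FLAG = 0x8000   # assumed meaning: set = value is a tx_num, clear = rx_num
NUM_MASK = 0x7FFF  # remaining 15 bits carry the queue number


def encode_queue_mapping(queue_num, is_tx):
    """Pack a queue number and the rx/tx flag into one 16-bit value."""
    if not 0 <= queue_num <= NUM_MASK:
        raise ValueError("queue number does not fit in 15 bits")
    return queue_num | (TX_FLAG if is_tx else 0)


def decode_queue_mapping(value):
    """Return (queue_num, is_tx) from a 16-bit queue_mapping value."""
    return value & NUM_MASK, bool(value & TX_FLAG)
```

The cost of the flag is one bit of range: queue numbers are limited to 15 bits instead of 16.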
Yes, that's exactly what they do. This feature is sometimes called Receive-Side Scaling (RSS) which is Microsoft's name for it. Microsoft requires Windows drivers performing RSS to provide the hash value to the networking stack, so Linux drivers for the same hardware should be able

The choice of TX queue can be based on the RX hash so that configuration

Aggressive power-saving causes far greater latency than context-switching under Linux. I believe most 10G NICs have large RX FIFOs to mitigate this. Ethernet flow control also helps to prevent [...]

Or through the ethtool API, which already has some multiqueue control operations.

Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Communications Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. --
Oh, I didn't know the background. I need to study more about networking.

I agree. I double checked the latest code of tree net-next-2.6 and function skb_tx_hash

Yes when NIC is free mostly. When NIC is busy, it wouldn't enter power-saving mode.

I guess NIC might allocate resources evenly for all queues, at least by default. If considering a packet sending burst with the same SRC/DST, a specific queue might be full quickly.

I instrumented the driver and kernel to print out packet receiving and forwarding. As the latest IXGBE driver gets a packet and forwards it immediately, I think most packets are dropped by hardware because the cpu doesn't collect packets quickly enough when the specific receiving queue is full. By comparing the sending speed and forwarding speed, we could get the dropping rate easily. My experiment shows receiving cpu idle is more than 50% and the cpu does often collect all packets till the specific queue is empty. I think that's because pktgen switches to a new SRC/DST to produce another burst to fill other queues quickly. It's hard to say cpu is slower than NIC because they work on different parts of the full

That's an alternative approach to configure it. If checking the sample patch on the driver, we can find the change is very small.

Thanks for your kind comments. Yanmin --
On Thu, Mar 12, 2009 at 11:43 PM, Zhang, Yanmin

You'll definitely want to look at the hardware provided hash. We've been using a 10G NIC which provides a Toeplitz hash (the one defined by Microsoft) and a software RSS-like capability to move packets from an interrupting CPU to another for processing. The hash could be used to index into a set of CPUs, but we also use the hash as a connection identifier to key into a lookup table to steer packets to the CPU where the application is running, based on the running CPU of the last recvmsg. Using the device provided hash in this manner is a HUGE win, as opposed to taking cache misses to get the 4-tuple from the packet itself to compute a hash. I posted some patches a while back on our work if you're interested. We are also using multiple RX queues of the 10G device in concert with pretty good results. We have noticed that the interrupt overheads substantially mitigate the benefits. In fact, I would say the software packet steering has provided the greater benefit (and it's very useful on our many 1G NICs that don't have multiq!). Tom --
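The two uses of the hash described above (indexing into a set of CPUs, and keying a lookup table that remembers the CPU of the last recvmsg) can be sketched in user space as follows. This is not the posted patch set: `Steering`, `note_recvmsg`, `target_cpu` and `flow_bytes` are hypothetical names, and `RSS_KEY` is an arbitrary 40-byte key picked for illustration. The Toeplitz function is computed in software here only so the example is self-contained; the point of the message is that the NIC supplies this value for free.

```python
import socket
import struct

RSS_KEY = bytes(range(1, 41))  # arbitrary illustrative key, not a real NIC's key


def toeplitz_hash(key, data):
    """32-bit Toeplitz hash: XOR a sliding 32-bit key window per set input bit."""
    if len(key) < len(data) + 4:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                # Window starts at the same bit offset as the input bit.
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result


def flow_bytes(src_ip, dst_ip, src_port, dst_port):
    """Serialize an IPv4 4-tuple in network byte order as hash input."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


class Steering:
    """Pick a processing CPU from a flow hash (hypothetical helper)."""

    def __init__(self, cpus):
        self.cpus = list(cpus)
        self.flow_cpu = {}  # flow hash -> CPU that last ran recvmsg on it

    def note_recvmsg(self, flow_hash, cpu):
        self.flow_cpu[flow_hash] = cpu

    def target_cpu(self, flow_hash):
        # Prefer the CPU where the application last read this flow;
        # otherwise spread flows over the CPU set by hash.
        fallback = self.cpus[flow_hash % len(self.cpus)]
        return self.flow_cpu.get(flow_hash, fallback)
```

Because every packet of a flow yields the same hash, both paths keep a flow on one CPU at a time, which is what avoids reordering within a stream.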
From: Tom Herbert <therbert@google.com> I never understood this. If you don't let the APIC move the interrupt around, the individual MSI-X interrupts will steer packets to individual specific CPUs and as a result the scheduler will migrate tasks over to those cpus since the wakeup events keep occurring there. --
We are trying to follow the scheduler's decisions as opposed to leading it. This works on very loaded systems, with applications binding to cpusets, with threads that are receiving on multiple sockets. I suppose it might be compelling if a NIC could steer packets per flow, instead of by a hash... --
Depending on the NIC, RX queue selection may be done using a large number of bits of the hash value and an indirection table or by matching against specific values in the headers. The SFC4000 supports both of these, though limited to TCP/IPv4 and UDP/IPv4. I think Neptune may be more flexible. Of course, both indirection table entries and filter table entries will be limited resources in any NIC, so allocating these wholly automatically is an interesting challenge. Ben. -- Ben Hutchings, Senior Software Engineer, Solarflare Communications Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked. --
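The two selection mechanisms mentioned above can be modelled together as below. This is a generic sketch and does not describe the SFC4000 or Neptune specifically: `RxQueueSelector` and its methods are hypothetical, the filter is modelled as an exact match on a header tuple, and the rule that filters take priority over the indirection table is an assumption. The fixed `max_filters` stands in for the limited table resources.

```python
class RxQueueSelector:
    """Model of RX queue selection by filter match or hash indirection."""

    def __init__(self, indirection, max_filters):
        size = len(indirection)
        if size == 0 or size & (size - 1):
            raise ValueError("indirection table size must be a power of two")
        self.indirection = list(indirection)
        self.max_filters = max_filters
        self.filters = {}  # exact header tuple -> RX queue

    def add_filter(self, match, queue):
        """Install an exact-match filter; return False if the table is full."""
        if match not in self.filters and len(self.filters) >= self.max_filters:
            return False
        self.filters[match] = queue
        return True

    def select(self, header_tuple, flow_hash):
        """Return the RX queue for a packet."""
        if header_tuple in self.filters:
            return self.filters[header_tuple]
        # Low bits of the hash index the indirection table.
        return self.indirection[flow_hash & (len(self.indirection) - 1)]
```

An automatic allocator would have to decide which flows deserve one of the few filter slots and how to populate the indirection entries, which is the challenge the message points at.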
On Fri, 13 Mar 2009 22:10:59 +0000

The problem is that without hardware support, handing off the packet may take more effort than processing it, especially when a cache line has to bounce to another CPU and the system is trying to keep up with DoS attacks. It all depends on how much processing is required, and on the architecture of the system. The tradeoff would change over time based on processing speed and on optimization of the receive/firewall code. --
Your scenario is different from mine. My case is ip_forward which happens in kernel and there is no application participating in the forwarding. I might test the application communication on 10G NIC with my method later. --
Interrupt binding is something popular for benchmarks, but most users don't (and shouldn't need to) care. Having it work well out of the box

There's a Microsoft spec for a standard hash function that does this on NICs and all the serious ones support it these days. The hash is normally used to select an MSI-X target based on the input header.

Point was that any solution shouldn't add more reordering. But when an RSS hash is used there is no reordering on a per-stream basis.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
Thanks Andi. You tell the truth. Now I understand why David Miller is working on auto TX selection. One thing I want to clarify is, with the default configuration, the processing path still goes to the current automatic selection. That means my method has little impact on the current automatic selection with the default configuration, except a small cache miss. Another exception is that IXGBE prefers getting one packet and sending one packet immediately instead of using backlog. Even when turning on the new capability to separate packet receiving and packet processing, TX selection still follows the current automatic selection. The difference is we use a different cpu. Driver still could record RX number into skb which is used

RX binding depends on interrupt binding totally. If the MSI-X interrupt is sent to cpu A, cpu A will collect the packets on the RX queue. By default, the interrupt isn't bound. Software knows the LLC sharing of cpu A. If cpu A receives the interrupt, it couldn't just throw packets to other cpus which share its LLC, because it doesn't know whether other cpus

Thanks for the explanation. The capability defined by the spec is to choose an MSI-X number and provide a hint when sending a cloned packet out. Does the NIC know how busy the cpu is? I assume not. So the hash is trying to distribute packets into RX queues evenly while also avoiding reorder. We might say irqbalance could balance workload so we expect cpu workload is even. My testing shows such even distribution of packets on all cpus isn't

There are 2 targets with my method. One is the packet collecting cpu and the other is the packet processing cpu.

Yes. Thanks again. Yanmin --
