Hi Tom,

First off: Kudos on the numbers you are seeing; they are impressive. Do you have any numbers on a forwarding path test?

My first impression when i saw the numbers was one of surprise. Back in the days when we tried to split stack processing the way you did (it was one of the experiments on early NAPI), IPIs were _damn_ expensive. What changed in the current architecture that makes this more palatable? IPIs are still synchronous AFAIK (and the more IPI receivers there are, the worse the ACK latency). Did you test this across other archs or, say, 3-4 year old machines?

cheers,
jamal
--
I don't have specific numbers, although we are using this on an application doing forwarding and numbers seem in line with what we see

No, the cost of the IPIs hasn't been an issue for us performance-wise. We are using them extensively, up to one per core per device interrupt. We're calling __smp_call_function_single, which is asynchronous in that the caller provides the call structure and there is no waiting for the IPI to complete. A flag in each call structure is set while the IPI is in progress; this prevents simultaneous use of a call structure.

I haven't seen any architecture-specific issues with the IPIs. I believe they are completing in < 2 usecs on the platforms we're running (some Opteron systems that are over 3 years old).
--
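The "flag per call structure" idea described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct call_slot`, `call_slot_init()`, `call_slot_post()`, `call_slot_run()` and `count_call()` are hypothetical names, not kernel APIs, and no real IPI is sent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a caller-provided call structure. */
struct call_slot {
	atomic_flag in_flight;	/* set while a call is pending */
	void (*func)(void *);
	void *arg;
};

static void call_slot_init(struct call_slot *s)
{
	atomic_flag_clear(&s->in_flight);
	s->func = NULL;
	s->arg = NULL;
}

/* Sender side: post work without waiting; false if the slot is busy. */
static bool call_slot_post(struct call_slot *s, void (*func)(void *), void *arg)
{
	if (atomic_flag_test_and_set(&s->in_flight))
		return false;	/* previous call has not completed yet */
	s->func = func;
	s->arg = arg;
	/* a kernel would raise the IPI here and return immediately */
	return true;
}

/* Receiver side: run the posted function, then release the slot. */
static void call_slot_run(struct call_slot *s)
{
	assert(s->func);
	s->func(s->arg);
	atomic_flag_clear(&s->in_flight);
}

/* Example callback: counts how many times it was invoked. */
static void count_call(void *arg)
{
	++*(int *)arg;
}
```

The sender never blocks: a second post while the first is still pending simply fails, which is the "prevents simultaneous use" property mentioned in the mail.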
When i get the chance i will give it a run. I have access to an i7

Ok, so you are not going across cores then? I wonder if there's some new optimization to reduce IPI latency when both sender/receiver

It is possible that is just an abstraction hiding the details.. AFAIK, IPIs are synchronous. Remote has to ack with another IPI

2 usecs aint bad (at 10G you only accumulate a few packets while stalled). I think we saw much higher values.

I was asking on different architectures because I have tried something equivalent as recent as 2 years back on a MIPS multicore and the forwarding results were horrible. IPIs flush the processor pipeline so they aint cheap - but that may vary depending on the architecture. Someone more knowledgeable should be able to give better insights. My suspicion is that with low transaction rate (with appropriate traffic patterns) you will see a very much increased latency since you will be sending more IPIs..

cheers,
jamal
--
Following up like promised: I did step #0 last night on an i7 (single Nehalem). I think more than anything i was impressed by the Nehalem's excellent caching system. Robert, I am almost tempted to say skb recycling performance will be excellent on this machine given the cost of a cache miss is much lower than previous generation hardware.

My test was simple: irq affinity on cpu0 (core 0) and rps redirection to cpu1 (core 1); tried also to redirect to different SMT threads (aka CPUs) on different cores with similar results. As a baseline i tested with no rps being used and with a kernel which didnt have any RPS config on. [BTW, I had to hand-edit the .config since i couldnt do it from menuconfig (Is there any reason for it to be so?)]

Traffic was sent from another machine into the i7 via an el-cheapo sky2 (dont know how shitty this NIC is, but it seems to know how to do MSI so probably capable of multiqueueing); the test was several sets of a ping first and then a ping -f (I will get more sophisticated in my next test likely this weekend).

Results: CPU utilization was about 20-30% higher in the case of rps. On cpu0, the cpu was being chewed highly by sky2_poll and on the redirected-to core it was always smp_call_function_single. The extra latency with RPS was (consistently) about 5 microseconds per packet on average: if i sent 1M ping -f packets, without RPS it took on average 176 seconds and with RPS it took 181 seconds to complete the round trips. Throughput didnt change but this could be attributed to the low amounts of data i was sending.

I observed that we were generating, on average, an IPI per packet even with ping -f (added an extra stat to record when we sent an IPI and counted against the number of packets sent). In my opinion it is these IPIs that contribute the most to the latency and i think it happens that the Nehalem is just highly improved in this area. I wish i had a more commonly used machine to test rps on. I expect that rps will perform worse on currently cheaper/older hardware for the traffic ...
The point of RPS is to increase parallelism, but the cost of that is more overhead per packet. If you are running a single flow, then you'll see latency increase for that flow. With more concurrent flows the benefits of parallelism kick in and latency gets better; we've seen the break even point around ten connections in our tests. Also, I don't think we've made the claim that RPS should generally perform better than multi-queue; the primary motivation for RPS is to make single queue NICs give reasonable performance. --
Yes, multiqueue is far better of course, but in case of hardware lacking multiqueue, RPS can help many workloads, where application has _some_ work to do, not only counting frames or so... RPS overhead (IPI, cache misses, ...) must be amortized by parallelization or we lose. A ping test is not an ideal candidate for RPS, since everything is done at softirq level, and should be faster without RPS... --
Agreed. So to enumerate, the benefits come in if:
a) you have many processors
b) you have single-queue nic
c) at sub-threshold traffic you dont care about a little latency
d) you have a specific cache hierarchy

Indeed. How well they can be amortized seems very cpu or board specific. I think the main challenge for my pedantic mind is missing details. Is there a paper on rps? Example for #d above, the commit log mentions that rps benefits if you have certain types of "cache hierarchy". Probably some arch with large shared L2/3 (maybe inclusive) cache will benefit. Example: it does well on Nehalem and probably opterons (as long as you dont start stacking these things on some interconnect like QPI or HT). But what happens when you have FSB sharing across cores (still a very common setup)? etc etc

ping wont do justice to the potential of rps, mostly because it generates very little traffic, i.e the part #c above. But it helps me at least boot a machine with proper setup - and it is not totally useless because i think the cost of IPI can be deduced from the results. I am going to put together some udp app with variable think-time to see what happens. Would that be a reasonable thing to test on?

It would be valuable to have something like Documentation/networking/rps to detail things a little more.

cheers,
jamal
--
On Wed, 14 Apr 2010 14:53:42 -0400 There probably needs to be better autotuning for this, there is no reason for RPS to be steering packets unless the queue is getting backed up. Some kind of high / low water mark mechanism is needed. RPS might also interact with the core turbo boost functionality on Intel chips. Newer chips will make a single core faster if other cores can be kept idle. -- --
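A high / low water mark mechanism of the kind suggested above could look roughly like the following. This is a minimal sketch, not proposed kernel code: `struct steer_state`, `steer_update()` and the threshold values in the usage are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical hysteresis state for deciding whether to steer packets. */
struct steer_state {
	unsigned int low;	/* stop steering at or below this queue depth */
	unsigned int high;	/* start steering at or above this queue depth */
	bool steering;		/* current decision */
};

/* Update the decision from the current local queue depth. */
static bool steer_update(struct steer_state *st, unsigned int qlen)
{
	assert(st->low < st->high);
	if (!st->steering && qlen >= st->high)
		st->steering = true;	/* queue is backing up */
	else if (st->steering && qlen <= st->low)
		st->steering = false;	/* queue has drained again */
	return st->steering;
}
```

Between the two marks the previous decision is kept, which limits how often steering flips on and off. It does not by itself prevent a flow from being reordered at the moment the decision changes.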
This was discussed a while ago, and Out Of Order packet delivery was the thing that frightened us a bit. Every time we change RPS to be on or off, we might have some extra noise. Maybe we already have this problem with irqbalance ? --
From: Eric Dumazet <eric.dumazet@gmail.com> irqbalance should never move network device interrupts around under normal circumstances. Arjan assured me that there is specific logic in the irqbalance daemon to not move NIC interrupts around once a target has been chosen. --
how well does it work with Linux? Sounds like all i need to do is turn on some BIOS feature. One of the negatives with multiqueue nics is because the core selection is static, you could end up overloading one core while others stay idle. This seems to steal cycle capacity from the idle cores and gives it to the busy cpus. nice. So i see it as a boost to multiqueue. cheers, jamal --
Only if more than one flow is involved. And if you have many flows, chances are they will spread across several queues... --
Over a long measurement period, true; but even with > 1 flows, it is possible that one flow is more active/intense than others (rtp vs some bulk file transfer) or more processor intensive than others (eg ipsec vs clear text) etc. BTW: just poking at intel doc on turbo boost and it seems the max a core can steal from others is 400MHz; so a core can go from 2.8GHz to 3.2GHz. I am sure there's a lot of interesting dynamics from this ;-> I think i will try turning this thing in my tests since i have an i7. cheers, jamal --
But use too many queues and the efficiency of NAPI drops and the cost of device interrupts becomes dominant, so that the overhead from additional hard interrupts can surpass the overhead of doing RPS and the IPIs. I believe we are seeing this in some of our results, which show that a combination of multi-queue and RPS can be better than just multi-queue (see rps changelog). Again, I'm not claiming that is generally true, but there are a lot of factors to consider. --
RPS can be tuned (Changli wants a finer tuning...), it would be interesting to tune multiqueue devices too. I dont know if its possible right now. On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E 10Gigabit has 16 queues. It might be good to use fewer queues according to your results on some workloads, and eventually use RPS as a second layer. --
My idea is: run a daemon in userland to monitor the softnet statistics, and tune the RPS setting if necessary. It seems that the current softnet statistics data isn't correct. A long time ago, I did a test, and the conclusion was that the call_function_single IPI was more expensive than the resched IPI, so I moved from softirq to a kernel thread for packet processing. I'll redo the test later. -- Regards, Changli Gao(xiaosuo@gmail.com) --
On Thu, 15 Apr 2010 06:51:29 +0800 The big thing is data, data, data... Performance can only be examined with real hard data with multiple different kind of hardware. Also, check for regressions in lmbench and TPC benchmarks. Yes this is hard, but papers on this would allow for rational rather than speculative choices. Adding more tuning knobs is not the answer unless you can show when the tuning helps. -- --
Agree 100%, and irqbalance is the existing daemon. It should be used and changed if necessary. Changli, my strongest argument about your patches is that our scheduler and memory affinity api (numactl driven) is bitmask oriented, giving the same weight to individual cpu or individual memory node. --
It works with the assumption: the workloads handled in non-schedulable context are less than the others. If most of work is done in non-schedulable(softirq) context, scheduler can't keep load balance. -- Regards, Changli Gao(xiaosuo@gmail.com) --
From: Eric Dumazet <eric.dumazet@gmail.com> Only NIU allows real detailed control over queue selection and stuff like that, because the hardware has a real TCAM for packet matching and packets which match in TCAM entries can steer to different collections of queues. We have ethtool interfaces for this (ETHTOOL_GRXCLS*), so you can change it. For most other chips we only have interfaces for modifying the RX hashing algorithm or what the RX hash covers, stuff like that. See also ETHTOOL_GRXFH, ETHTOOL_SRXFH, ETHTOOL_SRXNTUPLE, and ETHTOOL_GRXNTUPLE, the latter two of which were added for Intel NICs. --
Ok Eric, you seem to be running a system with two Nehalems interconnected by QPI. Is there any difference, performance-wise, between redirecting from coreX to coreY when they are on the same Nehalem vs when you are going across QPI? cheers, jamal --
For historical reasons, we use Linux-2.6.18. Our company has several products with Xeon, P4, or i7 CPUs. Some of them are SMP, multi-core and multi-threaded. We use a mechanism similar to dynamic weighted RPS. The total throughput increases nearly linearly with the number of worker threads (one worker thread per CPU). -- Regards, Changli Gao(xiaosuo@gmail.com) --
Thanks for sharing. How much more can you say? ;-> Do you have a paper Other than the i7 - have you tried to run rps on on the P4? cheers, jamal --
On a dual 4-core Xeon, we use one core for NIC in internal side, one core for NIC in the external side, one for inbound QoS, one for outbound QoS, and the CPU cycles left are used by DPI(DFA), the total No. -- Regards, Changli Gao(xiaosuo@gmail.com) --
From: jamal <hadi@cyberus.ca> It's completely transparent and should just happen without any BIOS tweaks. --
In addition to Turbo using less cores can also help to save power. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
From: Stephen Hemminger <shemminger@vyatta.com> I disagree, if the goal is to migrate the bulk of packet processing to where the app will actually sink and process the data then it should forward to RPS marked cpus regardless of local queue levels. --
From: jamal <hadi@cyberus.ca> A single-queue NIC is actually not a requirement, RPS helps also in cases where you have 'N' application threads and N is less than the number of CPUs your multi-queue NIC is distributing traffic to. Moving the bulk of the input packet processing to the cpus where the applications actually sit had a non-trivial benefit. RFS takes I think for the case where application locality is important, RPS/RFS can help regardless of cache details. --
rfs looks quite interesting;-> I think with some twist it could be

Generally true, as long as there's not much shared data across the cpus or the cost of a cache miss is reasonably tolerable. The socket layer just happens to be not sharing much with the ingress packet path, and for a single processor Nehalem, the caching system works so well that the cost of cache misses is not as important a variable. Everything is on the same die including the MM controller etc. I am speculating (didnt get any answer to the question i asked) that people running rps use such hardware;-> I speculate again that it may be too costly to run rps on something like a tigerton or intel clovertown where you have cores sharing/contending for an FSB. If I can get answers to the question: "What h/ware are people running?" i could be proven wrong.

[Note: I am not against RPS - i think it has its place; so i hope my desire to find out when to use rps doesnt show as hostility towards rps.]

cheers,
jamal
--
IPS (~= RPS) was running on shared FSB HP9000's. Now, that was also a BSD networking stack with netisrq's and the like. TOPS (~= RFS) was also run on shared FSB HP9000s, as well as CC-NUMA HP9000s and Integrity systems. TOPS was implemented in a Streams-based stack tracing its history to a common ancestor with Solaris (Mentat). rick jones --
Sounds interesting. Wikipedia information overload. Any arch description of the HP9000? Did your scheme use IPIs to message the other CPUs? cheers, jamal --
I should have been more specific - HP 9000 Model 800's :) PA-RISC based business

Netisrs were kernel processes one per CPU (back then a core, a processor and a CPU were one and the same :), and while we didn't call them IPI's, yes, it was a "soft interrupt" directed at the given processor to launch the netisr if it wasn't already running. TOPS was similar, but was with Streams and that did/does have some kernel processes not everything would happen as a kernel process.

rick jones

HP 3000 Model 900's - by and large the same PA-RISC hardware but running MPE/XL (later called MPE/iX)
HP 9000 Model 700's - PA-RISC based workstations
HP 9000 Model 300's - Moto 68K-based workstations (replaced by the 700s)
--
If you doubt the cost of smp_call_function_single(), how about trying another patch of mine, which implements something similar to RPS, but uses kernel threads instead, so no explicit IPI. http://patchwork.ozlabs.org/patch/38319/ -- Regards, Changli Gao(xiaosuo@gmail.com) --
Come on Changli. How do you wake up a thread on a remote cpu ? To answer Jamal's question, we need to measure the timing cost of IPIs. A kernel module might do this, and this could be integrated in perf bench so that we can regression test upcoming kernels. --
resched IPI, apparently. But it is async absolutely. and its IRQ handler is lighter. -- Regards, Changli Gao(xiaosuo@gmail.com) --
You still dont answer to the question, and your claims are not grounded by hard facts, but by your interpretation of code. --
My understanding of current scheduler is it does use IPIs to migrate tasks around - so thats why things may be working for Changli. i.e it is scheduler magic if you use kthreads. It is hard to say if this would work better... cheers, jamal --
It shouldn't be a lot lighter than the new fancy "queued smp_call_function" that's in the tree for a few releases. So it would surprise me if it made much difference. In the old days when there was only a single lock for s_c_f() perhaps... -Andi -- ak@linux.intel.com -- Speaking for myself only. --
So you are saying that the old implementation of IPI (likely what i tried pre-napi and as recent as 2-3 years ago) was bad because of a single lock?

BTW, I directed some questions to you earlier but didnt get a response, to quote:
---
On IPIs: Is anyone familiar with what is going on with Nehalem? Why is it this good? I expect things will get a lot nastier with other hardware like xeon based or even Nehalem with rps going across QPI.
Here's why i think IPIs are bad, please correct me if i am wrong:
- they are synchronous. i.e an IPI issuer has to wait for an ACK (which is in the form of an IPI).
- data cache has to be synced to main memory
- the instruction pipeline is flushed
- what else did i miss?
Andi?
---
Do you know any specs i could read up which will tell me a little more?

cheers,
jamal
--
Yes. The old implementation of smp_call_function. Also in the really old days there was no smp_call_function_single() so you tended to broadcast.

Nehalem is just fast. I don't know why it's fast in your specific case. It might be simply because it has lots of bandwidth everywhere.

In the hardware there's no ack, but in the Linux implementation there usually is (because we need to know when to free the stack state used to pass information). However there's also now support for queued IPI

At least on Nehalem data transfer can often be through the cache. IPIs involve APIC accesses which are not very fast (so overall it's far more than a pipeline worth of work), but it's still not an incredibly expensive operation. There's also X2APIC now which should be slightly faster, but it's

If you're just interested in IPI and cache line transfer performance it's probably best to just measure it. Some general information is always in the Intel optimization guide.

-Andi
-- ak@linux.intel.com -- Speaking for myself only. --
Nice - thanks for that info! So not only has h/ware improved, but Well, the cache architecture is nicer. The on-die MC is nice. No more shared MC hub/FSB. The 3 MC channels are nice. Intel finally beating AMD ;-> someone did a measurement of the memory timings (L1, L2, L3, MM There are tools like benchit which would give me L1,2,3,MM measurements; Thanks Andi! cheers, jamal --
Perf would be good - but even a softnet_stat cleaner than the nasty hack i use (attached) would be a good start; the ping with and without rps gives me a ballpark number. IPI is important to me because i tried it before and failed miserably. I was thinking the improvement may be due to the hardware used, but i am having a hard time getting people to tell me what hardware they used! I am old school - I need data;-> The RFS patch commit seems to have more info but is still vague, example: "The benefits of RFS are dependent on cache hierarchy, application load, and other factors" Also, what does a "simple" or "complex" benchmark mean?;-> I think it is only fair to get this info, no?

Please dont consider what i say above as being anti-RPS. 5 microsec extra latency is not bad if it can be amortized. Unfortunately, the best traffic i could generate was < 20Kpps of ping which still manages to get 1 IPI/packet on Nehalem. I am going to write up some app (lots of cycles available tomorrow). I still think it is valuable.

cheers,
jamal
+ seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
s->total, s->dropped, s->time_squeeze, 0,
0, 0, 0, 0, /* was fastroute */
- s->cpu_collision, s->received_rps);
+ s->cpu_collision, s->received_rps, s->ipi_rps);
Do you mean that received_rps is equal to ipi_rps? received_rps is the
number of IPI used by RPS. And ipi_rps is the number of IPIs sent by
function generic_exec_single(). If there isn't other user of
generic_exec_single(), received_rps should be equal to ipi_rps.
@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct
call_single_data *data, int wait)
* equipped to do the right thing...
*/
if (ipi)
+{
arch_send_call_function_single_ipi(cpu);
+ __get_cpu_var(netdev_rx_stat).ipi_rps++;
+}
--
Regards,
Changli Gao(xiaosuo@gmail.com)
--
my observation is:
s->total is the sum of all packets received by cpu (some directly from ethernet)
s->received_rps is the count of what the receiver cpu saw incoming if they were sent by another cpu.
s->ipi_rps is the number of times we tried to enq to a remote cpu but found it to be empty and had to send an IPI.
ipi_rps can be < received_rps if we receive > 1 packet without generating an IPI. What did i miss?

cheers,
jamal
--
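The "IPI only when the remote queue was found empty" behaviour that jamal describes can be modelled in a few lines. This is a toy model of his description, not of the kernel code, and `struct remote_queue`, `steer_packet()` and `drain()` are hypothetical names.

```c
#include <assert.h>

/* Toy model of one target cpu's backlog as seen by the steering cpu. */
struct remote_queue {
	unsigned int qlen;		/* packets waiting on the target cpu */
	unsigned long ipis_sent;	/* plays the role of the ipi_rps stat */
	unsigned long enqueued;		/* packets steered to the target */
};

/* Steer one packet; kick the target only if its queue was empty. */
static void steer_packet(struct remote_queue *q)
{
	if (q->qlen == 0)
		q->ipis_sent++;
	q->qlen++;
	q->enqueued++;
	/* there can never be more kicks than steered packets */
	assert(q->ipis_sent <= q->enqueued);
}

/* Target cpu processes everything queued; returns packets handled. */
static unsigned int drain(struct remote_queue *q)
{
	unsigned int n = q->qlen;

	q->qlen = 0;
	return n;
}
```

With back-to-back arrivals the IPI count falls below the packet count; with one packet per interrupt and a drain in between, as in a ping test, it stays at one IPI per packet.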
It is meaningless currently. If rps is enabled, it may be twice the number of packets received, because one packet may be counted twice: once in enqueue_to_backlog(), and once in __netif_receive_skb(). I had posted a patch to solve this problem.
http://patchwork.ozlabs.org/patch/50217/
If you don't apply my patch, you'd better refer to /proc/net/dev for

Maybe its name confused you.

/* Called from hardirq (IPI) context */
static void trigger_softirq(void *data)
{
	struct softnet_data *queue = data;
	__napi_schedule(&queue->backlog);
	__get_cpu_var(netdev_rx_stat).received_rps++;
}

the function above is called in IRQ of IPI. It counts the number of
--
Regards,
Changli Gao(xiaosuo@gmail.com)
--
You are probably right - you made me look at my collected data ;-> i will look closely later, but it seems they are accounting for different cpus, no? Example, attached are some of the stats i captured when i was running the tests redirecting from CPU0 to CPU1 1M packets at about 20Kpps (just cut to the first and last two columns):

cpu  | Total    | rps_recv | rps_ipi
-----+----------+----------+---------
cpu0 | 002dc7f1 | 00000000 | 000f4246
cpu1 | 002dc804 | 000f4240 | 00000000
-------------------------------------

So: cpu0 received 0x2dc7f1 pkts accumulated over time and redirected them to cpu1 (mostly, the extra 5 maybe due to leftovers since i clear the data) and for the test it generated an IPI 0xf4246 times. It can be seen that the running total for CPU1 is 0x2dc804 but in this one run it received 1M packets (0xf4240). i.e i dont see the double accounting..

cheers,
jamal
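The counters in this thread are printed as 8-digit hex fields, so 0xf4240 is decimal 1000000. A small helper for decoding one line of such fields might look like this; it is a sketch that assumes whitespace-separated hex columns (as the `%08x` format produces), and `parse_hex_fields()` is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse up to max whitespace-separated hex fields from one line.
 * Returns the number of fields stored in out[].
 */
static size_t parse_hex_fields(const char *line, unsigned long *out, size_t max)
{
	size_t n = 0;
	char *end;

	assert(line && out);
	while (n < max) {
		unsigned long v = strtoul(line, &end, 16);

		if (end == line)
			break;	/* nothing left that looks like a number */
		out[n++] = v;
		line = end;
	}
	return n;
}
```

Feeding it the cpu1 row from the table (with the `|` separators removed) gives 3000324 total packets and 1000000 rps_recv.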
I remember you redirected all the traffic from cpu0 to cpu1, and the data shows: a single packet is counted twice, by CPU0 and CPU1. If you change the RPS setting by: echo 1 > ..../rps_cpus you will find the total numbers are doubled. -- Regards, Changli Gao(xiaosuo@gmail.com) --
Sorry, didnt respond to you - was busy setting up before trying to think a little more about this..

If you look at the patch, I am zeroing those stats - so 0xf4240 is only one test (decimal 1M). I think there is something to what you are saying; rps_ipi on cpu0 is ambiguous because it counts the number of times cpu0 softirq was scheduled as well as the number of times cpu0 scheduled other cpus. The extra six for cpu0 turn out to be the times an ethernet interrupt

Well, the counts have different meanings; rps_ipi applies to source cpu activity and rps_recv applies to destination. Example, if cpu0 in total 6 times found some destination cpu to be empty and 2 of those happen to be on each of cpu1, cpu2, cpu3 then
cpu0: ipi_rps = 6
cpu1: rps_recv = 2
cpu2: rps_recv = 2

This is true. But IMO deserving and should be double counted. It is just more fine-grained accounting. IOW, I am not sure we need your patch because we will lose the fine-grained accounting - and mine requires more work to be less ambiguous.

cheers,
jamal
--
I did some tests on a dual quad core machine (E5450 @ 3.00GHz), not nehalem. So a 3-4 years old design.

For all tests, I use the best time of 3 runs of "ping -f -q -c 100000 192.168.0.2". Yes ping is not very good, but its available ;)

Note: I make sure all 8 cpus of target are busy, eating cpu cycles in user land. I dont want to tweak acpi or whatever smart power saving mechanisms.

When RPS off
100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms

RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
(echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms

So the cost of queueing the packet into our own queue (netif_receive_skb -> enqueue_to_backlog) is about 0.74 us (74 ms / 100000)

I personally think we should process the packet instead of queueing it, but Tom disagrees with me.

RPS on, directed on cpu1 (other socket)
(echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

So extra cost to enqueue to a remote cpu queue, IPI, softirq handling... is 3 us (308 ms / 100000). Note this cost is in case we receive a single packet.

I suspect IPI itself is in the 1.5 us range, not very far from the queueing to ourself case.

For me RPS use cases are :
1) Value added apps handling lot of TCP data, where the costs of cache misses in tcp stack easily justify to spend 3 us to gain much more.
2) Network appliance, where a single cpu is filled 100% to handle one device hardware and software/RPS interrupts, delegating all higher level works to a pool of cpus.

I'll try to do these tests on a Nehalem target.
--
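The per-packet figures above come from dividing the difference between two ping totals by the packet count. As a small worked sketch (the helper name `extra_us_per_packet()` is hypothetical):

```c
#include <assert.h>

/*
 * Extra cost per packet in microseconds, given the total run time in
 * milliseconds of a baseline run and a test run of the same length.
 */
static double extra_us_per_packet(long base_ms, long test_ms, long packets)
{
	assert(packets > 0);
	return (double)(test_ms - base_ms) * 1000.0 / (double)packets;
}
```

With the numbers from this mail, `extra_us_per_packet(4160, 4234, 100000)` is about 0.74 us for queueing to the local cpu, and `extra_us_per_packet(4234, 4542, 100000)` is about 3.08 us more for going to the remote cpu.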
You could do that, but then the packet processing becomes HOL blocking on all the packets that are being sent to other queues for processing-- remember the IPIs is only sent at the end of the NAPI. So unless the upper stack processing is <0.74us in your case, I think processing packets directly on the local queue would improve best case latency, but would increase average latency and even more likely worse --
Anyway, a big part of this 0.74 us overhead comes from get_rps_cpu() itself, computing skb->rxhash and all. We should make a review of how many cache lines we exchange per skb, and try to reduce this number. --
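For readers unfamiliar with the step being discussed: once a flow hash is available, it has to be mapped onto one of the cpus enabled for the queue. A common way to do this without a division is multiply-and-shift scaling, sketched below. Treat it as an assumption about the general technique rather than a copy of get_rps_cpu(); `pick_cpu()` and the `cpus` table are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 32-bit flow hash onto one entry of a table of candidate cpus.
 * (hash * len) >> 32 scales the hash into the range [0, len).
 */
static unsigned int pick_cpu(uint32_t hash, const unsigned int *cpus, uint32_t len)
{
	assert(len > 0);
	return cpus[((uint64_t)hash * len) >> 32];
}
```

The mapping itself is cheap; the cost mentioned in the mail comes from computing the hash, which needs the packet headers to be read.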
Tom, I am not sure what you describe is even respected for NAPI devices. (I hope you use napi devices in your company ;) ) If we enqueue a skb to backlog, we also link our backlog napi into our poll_list, if not already there. So the loop in net_rx_action() will make us handle our backlog napi a bit after this network device napi (if time limit of 2 jiffies not elapsed) and *before* sending IPIS to remote cpus anyway. --
- There is no point to enforce a time limit in process_backlog(), since
other napi instances dont follow same rule. We can exit after only one
packet processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note : /proc/sys/net/core/dev_weight can be used to tune this 64 default
value.
- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..8092f01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -264,7 +264,7 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
* queue in the local softnet handler.
*/
-DEFINE_PER_CPU(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
EXPORT_PER_CPU_SYMBOL(softnet_data);
#ifdef CONFIG_LOCKDEP
@@ -3232,7 +3232,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
{
int work = 0;
struct softnet_data *queue = &__get_cpu_var(softnet_data);
- unsigned long start_time = jiffies;
napi->weight = weight_p;
do {
@@ -3252,7 +3251,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
local_irq_enable();
__netif_receive_skb(skb);
- } while (++work < quota && jiffies == start_time);
+ } while (++work < quota);
return work;
}
--
From: Eric Dumazet <eric.dumazet@gmail.com> Yep, doing this time limit at two levels is pointless. Applied, thanks Eric! --
Eric, I thank you kind sir for going out of your way to do this - it is
I didnt keep the cpus busy. I should re-run with such a setup, any
specific app that you used to keep them busy? Keeping them busy could
have consequences; I am speculating you probably ended having greater
I should mention i turned off acpi as well in the bios; it was consuming
Sorry - I am gonna have to turn on some pedagogy and offer my
Canadian 2 cents;->
I would lean on agreeing with Tom, but maybe go one step further (sans
packet-reordering): we should never process packets to socket layer on
the demuxing cpu.
enqueue everything you receive on a different cpu - so somehow receiving
cpu becomes part of a hashing decision ...
The reason is derived from queueing theory - of which i know dangerously
little - but refer you to mr. little his-self[1] (pun fully
intended;->):
i.e fixed serving time provides more predictable results as opposed to
once in a while a spike as you receive packets destined to "our cpu".
Queueing packets and later allocating cycles to processing them adds to
variability, but is not as bad as processing to completion to socket
Good test - should be worst case scenario. But there are two other
scenarios which will give different results in my opinion.
On your setup i think each socket has two dies, each with two cores. So
my feeling is you will get different numbers if you go within same die
and across dies within same socket. If i am not mistaken, the mapping
would be something like socket0/die0{core0/2}, socket0/die1{core4/6},
socket1/die0{core1/3}, socket1/die1{core5/7}.
If you have cycles can you try the same socket+die but different cores
Which is not too bad if amortized. Were you able to check if you
processed a packet/IPI? One way to achieve that is just standard ping.
In the nehalem my number for going to a different core was in the range
of 5 microseconds effect on RTT when system was not busy. I think it
Sound about right maybe 2 us in my case. I am still ...

No, only one packet per IPI, since I setup my tg3 coalescing parameter to the minimum value, I received one packet per interrupt.

The specific app is :

Sure, lets redo a full test, taking lowest time of three ping runs

echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4151ms
echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4254ms
echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms
echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4458ms
echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms
echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4327ms
echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4571ms
echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4472ms
echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4568ms

# egrep "physical id|core|apicid" /proc/cpuinfo
physical id : 0
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
physical id : 1
core id : 0
cpu cores : 4
apicid : 4
initial apicid : 4
physical id : 0
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
physical id : 1
core id : 2
cpu cores : 4
apicid : 6
initial apicid : 6
physical id : 0
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
physical id : 1
core id : 1
cpu cores : 4
apicid : 5
initial apicid : 5
physical id : 0
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
physical id : 1
core id : 3
cpu cores : ...
Another interesting user land app would be a cpu _and_ memory
cruncher, because of the cache misses we'll get.
$ cat nloop.c
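#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ (4*1024*1024)

int main(int argc, char *argv[])
{
	int nproc = 8;
	char *buffer;

	if (argc > 1)
		nproc = atoi(argv[1]);
	/* fork until nproc processes exist; each child leaves the loop */
	while (nproc > 1) {
		if (fork() == 0)
			break;
		nproc--;
	}
	buffer = malloc(SZ);
	if (!buffer)
		return 1;
	/* keep rewriting a 4 MB buffer to burn cpu and thrash caches */
	while (1)
		memset(buffer, 0x55, SZ);
}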
$ ./nloop 8 &
echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
4861ms
echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
4981ms
echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7191ms
echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7128ms
echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7107ms
echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
5505ms
echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7125ms
echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7022ms
echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7157ms
Maximum overhead is (7191 - 4861) ms / 100000 packets = 23.3 us per packet
--
Thanks Eric. I tried to visualize your results - attached. There are 2-3 odd numbers (labelled with *) but other than that results are as expected... I did run some experiments with some udp sink server and i saw the IPIs amortized; unfortunately sky2 h/ware proved to be bottleneck (at > 750Kpps incoming, it started dropping and wasnt recording the drops, so i had to slow things down). I need to digest my results a little more - but it seems i was getting better throughput results with RPS (i.e it was able to sink more packets).. cheers, jamal
net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.
I add a flag to scan cpumask only if at least one IPI was scheduled.
Even cpumask_weight() might be expensive on some setups, where
nr_cpumask_bits could be very big (4096 for example)
Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)
Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.
In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/linux/netdevice.h | 5 +-
net/core/dev.c | 73 ++++++++++++++++--------------------
2 files changed, 38 insertions(+), 40 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..283d3ef 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1389,8 +1389,11 @@ struct softnet_data {
struct list_head poll_list;
struct sk_buff *completion_queue;
- /* Elements below can be accessed between CPUs for RPS */
#ifdef CONFIG_RPS
+ unsigned int rps_ipis_scheduled;
+ unsigned int rps_select;
+ cpumask_t rps_mask[2];
+ /* Elements below can be accessed between CPUs for RPS */
struct call_single_data csd ____cacheline_aligned_in_smp;
unsigned int input_queue_head;
#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..3e6e420 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2347,19 +2347,14 @@ done:
}
/*
- * This structure holds the per-CPU mask of CPUs for which IPIs are scheduled
+ * softnet_data holds the per-CPU mask of CPUs for which IPIs are scheduled
* to be sent to kick remote softirq processing. There are two masks since
- * the sending of IPIs must be done with interrupts enabled. The select field
+ * the sending of IPIs must be done with interrupts enabled. The rps_select field
* ...

How about using an array to save the cpu IDs. The number of CPUs, to which the IPI will be sent, should be small.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
--
Yes it should be small, yet the two arrays would be big enough to make
softnet_data first part use at least two cache lines instead of one,
even in the case we handle one cpu/IPI per net_rps_action()
As several packets can be enqueued for a given cpu, we would need to
keep bitmasks.
We would have to add one test in enqueue_to_backlog()
if (!cpu_test_and_set(cpu, mask)) {
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
array[nb++] = cpu;
}
--
rps_lock(queue);
if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
if (queue->input_pkt_queue.qlen) {
...
if (napi_schedule_prep(&queue->backlog)) {
#ifdef CONFIG_RPS
if (cpu != smp_processor_id()) {
struct rps_remote_softirq_cpus *rcpus =
&__get_cpu_var(rps_remote_softirq_cpus);
cpu_set(cpu, rcpus->mask[rcpus->select]);
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
goto enqueue;
}
#endif
__napi_schedule(&queue->backlog);
}
Only the first packet of a softnet.input_pkt_queue may trigger IPI, so
we don't need to keep bitmasks.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
--
This is not true, Changli. Please read again all previous mails about RPS, or the code.
--
Hmm, I just read again, and I now remember Tom used a single bitmap, then we had to add a second set because of a possible race. A list would be enough. --
Here is the updated patch, using a single list instead of bitmap
RFC status becomes official patch ;)
Thanks Changli for your array suggestion !
[PATCH net-next-2.6] rps: shortcut net_rps_action()
net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.
Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.
Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)
Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.
In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/linux/netdevice.h | 9 ++--
net/core/dev.c | 79 ++++++++++++++----------------------
2 files changed, 38 insertions(+), 50 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..83ab3da 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1381,17 +1381,20 @@ static inline int unregister_gifconf(unsigned int family)
}
/*
- * Incoming packets are placed on per-cpu queues so that
- * no locking is needed.
+ * Incoming packets are placed on per-cpu queues
*/
struct softnet_data {
struct Qdisc *output_queue;
struct list_head poll_list;
struct sk_buff *completion_queue;
- /* Elements below can be accessed between CPUs for RPS */
#ifdef CONFIG_RPS
+ struct softnet_data *rps_ipi_list;
+
+ /* Elements below can be accessed between CPUs for RPS */
struct call_single_data csd ____cacheline_aligned_in_smp;
+ struct softnet_data *rps_ipi_next;
+ unsigned int cpu;
unsigned int input_queue_head;
#endif
struct sk_buff_head input_pkt_queue;
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..f6ff2cf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ ...

Yes. I did some quick experiments last night and there does seem to
--
From: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

It is getting increasingly complicated to follow who enables and disables local cpu irqs in these code paths. We could combat this by adding something like "_irq_enable()" to the function names.
--
Yes I agree, we need a general cleanup in this file
Thanks David !
[PATCH net-next-2.6] rps: cleanups
struct softnet_data holds many queues, so consistent use of the "sd" name
instead of "queue" is better.
Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()
Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.
incr_input_queue_head() becomes input_queue_head_incr()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/linux/netdevice.h | 4
net/core/dev.c | 149 +++++++++++++++++++-----------------
2 files changed, 82 insertions(+), 71 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 83ab3da..3c5ed5f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1401,10 +1401,10 @@ struct softnet_data {
struct napi_struct backlog;
};
-static inline void incr_input_queue_head(struct softnet_data *queue)
+static inline void input_queue_head_incr(struct softnet_data *sd)
{
#ifdef CONFIG_RPS
- queue->input_queue_head++;
+ sd->input_queue_head++;
#endif
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 05a2b29..70df048 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -208,17 +208,17 @@ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
}
-static inline void rps_lock(struct softnet_data *queue)
+static inline void rps_lock(struct softnet_data *sd)
{
#ifdef CONFIG_RPS
- spin_lock(&queue->input_pkt_queue.lock);
+ spin_lock(&sd->input_pkt_queue.lock);
#endif
}
-static inline void rps_unlock(struct softnet_data *queue)
+static inline void rps_unlock(struct softnet_data *sd)
{
#ifdef CONFIG_RPS
- spin_unlock(&queue->input_pkt_queue.lock);
+ spin_unlock(&sd->input_pkt_queue.lock);
#endif
}
@@ -2346,63 +2346,74 @@ done:
}
/* Called from hardirq (IPI) context */
-static void trigger_softirq(void *data)
+static void ...

From: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
--
It seems you prefetch rps_ipi_next. I think it isn't necessary, as the list should be short. If you insist on this, is the macro prefetch() better? -- Regards, Changli Gao(xiaosuo@gmail.com)
Oh, I read the code again and got the answer. After the IPI is sent, this softnet will be queued by the other CPUs. We prefetch the pointer rps_ipi_next to avoid this race condition. Sorry for noise :) -- Regards, Changli Gao(xiaosuo@gmail.com) --
Speaking of prefetch business,
I partly tested the following patch; I will submit it if it happens to be a
clear win.
diff --git a/net/core/dev.c b/net/core/dev.c
index 05a2b29..fe6fc9f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2349,7 +2349,9 @@ done:
static void trigger_softirq(void *data)
{
struct softnet_data *queue = data;
+
__napi_schedule(&queue->backlog);
+ prefetch(queue->input_pkt_queue.next);
__get_cpu_var(netdev_rx_stat).received_rps++;
}
#endif /* CONFIG_RPS */
--
folks,

Thanks to everybody (Eric stands out) for your patience.
I ended mostly validating whats already been said. I have a lot of data
and can describe in details how i tested etc but it would require
patience in reading, so i will spare you;-> If you are interested let me
know and i will be happy to share.

Summary is:
-rps good, gives higher throughput for apps
-rps not so good, latency worse but gets better with higher input rate
or increasing number of flows (which translates to higher pps)
-rps works well with newer hardware that has better cache structures.
[Gives great results on my test machine a Nehalem single processor, 4
cores each with two SMT threads that has a shared L2 between threads and
a shared L3 between cores].
Your selection of what the demux cpu is and where the target cpus are is
an influencing factor in the latency results. If you have a system with
multiple sockets, you should get better numbers if you stay within the
same socket relative to going across sockets.
-rps does a better job at helping schedule apps on same cpu thus
localizing the app. The throughput results with rps are very consistent
and better whereas in non-rps case, variance is _high_.

My next step is to do some forwarding tests - probably next week. I am
concerned here because i expect the cache misses to be higher than the
app scenario (netdev structure and attributes could be touched by many
cpus)

cheers,
jamal
--
Hi Jamal I think your tests are very interesting, maybe could you publish them somehow ? (I forgot to thank you about the previous report and nice graph) perf reports would be good too to help to spot hot points. --
Thanks a lot Jamal, this is really useful

Drawback of using a fixed src ip from your generator is that all flows
share the same struct dst entry on SUT. This might explain some glitches
you noticed (ip_route_input + ip_rcv at high level on slave/application
cpus)

Also note your test is one way. If some data was replied we would see
much use of the 'flows'

I notice epoll_ctl() used a lot, are you re-arming epoll each time you
receive a datagram ?

I see slave/application cpus hit _raw_spin_lock_irqsave() and
_raw_spin_unlock_irqrestore().

Maybe a ring buffer could help (instead of a double linked queue) for
backlog, or the double queue trick, if Changli wants to respin his
patch.
--
OK, I'll post a new patch against the current tree, so Jamal can have a try. I am sorry, but I don't have a suitable computer for benchmark. -- Regards, Changli Gao(xiaosuo@gmail.com) --
yes, that would explain it ;-> I could have flows going to each cpu

In my next step i wanted to "route" these packets at app level and for this stage of testing just wanted to sink the data to reduce experiment variables. Reason: The netdev structure would hit a lot of cache misses if i started using it to both send/recv since lots of things are shared on tx/rx (example napi tx pruning could happen on either tx or receive path); same thing with qdisc path which is at netdev granularity.. I think there may be

I am using default libevent on debian. It looks very old and maybe buggy. I will try to upgrade first and if still see the same

Ok, I will have some cycles later today/tomorrow or for sure on weekend. My setup is still intact - so i can test.

cheers,
jamal
--
I read the code again, and find that we don't use spin_lock_irqsave(); we use local_irq_save() and spin_lock() instead, so _raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() should not be related to backlog. The lock may be sk_receive_queue.lock. Jamal, did you use a single socket to serve all the clients?

BTW: completion_queue and output_queue in softnet_data are both LIFO queues. For completion_queue, FIFO is better, as the last used skb is more likely in cache, and should be used first. Since slab always caches the last used memory at the head, we'd better free the skbs in FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
--
Possible. I am wondering if there's a way we can precisely nail where that is happening? is lockstat any use? Fixing _raw_spin_lock_irqsave and friends is the lowest hanging fruit.

So looking at your patch now i see it is likely there was an improvement made for non-rps case (moving out of loop some irq_enable etc). i.e my results may not be crazy after adding your patch and seeing an improvement for non-rps case. However, whatever your patch did - it did not help the rps case: call_function_single_interrupt() comes out higher in the profile, and # of IPIs seems to have gone up (although i did not measure this, I

I think it will depend on how many of those skbs are sitting in the completion queue, cache warmth etc. LIFO is always safest, you have higher probability of finding a cached skb in front.

cheers,
jamal
--
Did you apply the patch from Eric? It would reduce the number of

we call kfree_skb() to release skbs to slab allocator, then slab allocator stores them in a LIFO queue. If completion queue is also a LIFO queue, the latest unused skb will be in the front of the queue, and will be released to slab allocator at first. At the next time, we call alloc_skb(), the memory used by the skb in the end of the completion queue will be returned instead of the hot one.

However, as Eric said, new drivers don't rely on completion queue, it isn't a real problem, especially in your test case.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
--
No need to apologize, if you like I'd be happy to discuss netperf usage tips offline. That offer stands for everyone. happy benchmarking, rick jones --
On Wed, 14 Apr 2010 07:53:06 -0400 I posted a patch to use sky2 hardware hash (RSS) which should lower the cost per packet. -- --
From: jamal <hadi@cyberus.ca> The RPS config is merely an indirect dependency on SMP as we have it coded up in the Kconfig files, it's not meant to be user selectable and is intended to be unconditionally on for SMP builds. --
