[RFC] rps: shortcut net_rps_action()

From: jamal
Subject: rps: question
Date: Sunday, February 7, 2010 - 11:42 am

Hi Tom,

First off: Kudos on the numbers you are seeing; they are
impressive. Do you have any numbers on a forwarding path test?

My first impression when i saw the numbers was one of surprise.
Back in the days when we tried to split stack processing the way
you did (it was one of the experiments on early NAPI), IPIs were
_damn_ expensive. What changed in current architecture that makes
this more palatable? IPIs are still synchronous AFAIK (and the more
IPI receivers there are, the worse the ACK latency). Did you test this
across other archs or say 3-4 year old machines?

cheers,
jamal 

--

From: Tom Herbert
Date: Sunday, February 7, 2010 - 10:58 pm

I don't have specific numbers, although we are using this on
application doing forwarding and numbers seem in line with what we see

No, the cost of the IPIs hasn't been an issue for us performance-wise.
 We are using them extensively-- up to one per core per device
interrupt.

We're calling __smp_call_function_single, which is asynchronous in that
the caller provides the call structure and there is no waiting for
the IPI to complete.  A flag in each call structure is set while the
IPI is in progress; this prevents simultaneous use of a call structure.
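
The "post and return, with a busy flag on the call structure" idea can be sketched as a toy model. This is a single-threaded illustration only, not kernel code: the class and method names are made up, and a real implementation would need atomic operations on the flag.

```python
class CallSlot:
    """Toy stand-in for a per-target call structure with a busy flag."""

    def __init__(self):
        self.in_flight = False
        self._func = None
        self._arg = None

    def try_post(self, func, arg):
        """Queue func(arg) for the remote side and return without waiting.

        Returns False if the slot is still owned by an earlier call.
        """
        if self.in_flight:
            return False
        self._func, self._arg = func, arg
        self.in_flight = True
        return True

    def run_on_target(self):
        """What the remote CPU would do when the IPI arrives."""
        if not self.in_flight:
            return
        func, arg = self._func, self._arg
        func(arg)
        self.in_flight = False  # the slot may be reused from now on


if __name__ == "__main__":
    done = []
    slot = CallSlot()
    print(slot.try_post(done.append, "pkt-1"))  # accepted
    print(slot.try_post(done.append, "pkt-2"))  # refused: still in flight
    slot.run_on_target()
    print(done, slot.try_post(done.append, "pkt-3"))
```

The sender never blocks; the only thing it has to respect is that a slot cannot be reused until the target has finished with it.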

I haven't seen any architecture-specific issues with the IPIs; I
believe they are completing in < 2 usecs on platforms we're running
(some Opteron systems that are over 3yrs old).

--

From: jamal
Date: Monday, February 8, 2010 - 8:09 am

When i get the chance i will give it a run. I have access to an i7

Ok, so you are not going across cores then? I wonder if there's
some new optimization to reduce IPI latency  when both sender/receiver

It is possible that is just an abstraction hiding the details..
AFAIK, IPIs are synchronous. Remote has to ack with another IPI 

2 usecs aint bad (at 10G you only accumulate a few packets while
stalled). I think we saw much higher values.
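
As a rough check of the "few packets at 10G" estimate, here is a minimal sketch of the arithmetic. The 20 bytes of per-frame wire overhead (8 bytes preamble/SFD plus 12 bytes inter-frame gap) is standard Ethernet; the function name is made up, and the two frame sizes are just the usual extremes, not anything measured in this thread.

```python
# Back-of-the-envelope: how many frames arrive on a 10 Gb/s link
# while a CPU is stalled for a given time (e.g. a ~2 usec IPI)?

WIRE_OVERHEAD_BYTES = 20  # 8 bytes preamble/SFD + 12 bytes inter-frame gap


def frames_during_stall(stall_usecs, frame_bytes, link_bps=10e9):
    """Frames that arrive at line rate during a stall of stall_usecs."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    frames_per_sec = link_bps / bits_per_frame
    return frames_per_sec * stall_usecs * 1e-6


if __name__ == "__main__":
    for size in (64, 1518):
        print(f"{size:5d}-byte frames: "
              f"{frames_during_stall(2.0, size):.1f} per 2 usec stall")
```

With full-size frames a 2 usec stall costs one or two packets; with minimum-size frames at line rate it is closer to thirty.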
I was asking on different architectures because I have tried something
equivalent as recent as 2 years back on a MIPS multicore and the
forwarding results were horrible. 
IPIs flush the processor pipeline so they aint cheap - but that may
vary depending on the architecture. Someone more knowledgeable should
be able to give better insights.
My suspicion is that with low transaction rate (with appropriate traffic
patterns) you will see a very much increased latency since you will 
be sending more IPIs..

cheers,
jamal

--

From: jamal
Date: Wednesday, April 14, 2010 - 4:53 am

Following up like promised:


I did step #0 last night on an i7 (single Nehalem). I think more than
anything i was impressed by the Nehalem's excellent caching system.
Robert, I am almost tempted to say skb recycling performance will be
excellent on this  machine given the cost of a cache miss is much lower
than previous generation hardware.

My test was simple: irq affinity on cpu0(core0) and rps redirection to
cpu1(core 1); tried also to redirect to different SMT threads (aka CPUs)
on different cores with similar results. I base tested against no rps
being used and a kernel which didnt have any RPS config on.
[BTW, I had to hand-edit the .config since i couldnt do it from
menuconfig (Is there any reason for it to be so?)]

Traffic was sent from another machine into the i7 via an el-cheapo sky2
(dont know how shitty this NIC is, but it seems to know how to do MSI so
probably capable of multiqueueing); the test was several sets of 
a ping first and then a ping -f (I will get more sophisticated in my
next test likely this weekend).

Results:
CPU utilization was about 20-30% higher in the case of rps. On cpu0, the
cpu was being chewed highly by sky2_poll and on the redirected-to core
it was always smp_call_function_single.
Latency was (consistently) about 5 microseconds higher on average with rps.
So if i sent 1M ping -f packets, without RPS it took on average
176 seconds and with RPS it took 181 seconds to complete the round-trips.
Throughput didnt change but this could be attributed to the low amounts
of data i was sending.
I observed that we were generating, on average, an IPI per packet even
with ping -f. (added an extra stat to record when we sent an IPI and
counted against the number of packets sent).
In my opinion it is these IPIs that contribute the most to the latency
and i think it happens that the Nehalem is just highly improved in this 
area. I wish i had a more commonly used machine to test rps on.
I expect that rps will perform worse on currently cheaper/older hardware
for the traffic ...

--

From: Tom Herbert
Date: Wednesday, April 14, 2010 - 10:31 am

The point of RPS is to increase parallelism, but the cost of that is
more overhead per packet.  If you are running a single flow, then
you'll see latency increase for that flow.  With more concurrent flows
the benefits of parallelism kick in and latency gets better; we've
seen the break-even point around ten connections in our tests.  Also,
I don't think we've made the claim that RPS should generally perform
better than multi-queue; the primary motivation for RPS is to make
single-queue NICs give reasonable performance.


--

From: Eric Dumazet
Date: Wednesday, April 14, 2010 - 11:04 am

Yes, multiqueue is far better of course, but in case of hardware lacking
multiqueue, RPS can help many workloads, where application has _some_
work to do, not only counting frames or so...

RPS overhead (IPI, cache misses, ...) must be amortized by
parallelization or we lose.

A ping test is not an ideal candidate for RPS, since everything is done
at softirq level, and should be faster without RPS...



--

From: jamal
Date: Wednesday, April 14, 2010 - 11:53 am

Agreed. So to enumerate, the benefits come in if:
a) you have many processors
b) you have single-queue nic
c) at sub-threshold traffic you dont care about a little latency
d) you have a specific cache hierarchy

Indeed. 
How well they can be amortized seems very cpu or board specific.

I think the main challenge for my pedantic mind is missing details. Is
there a paper on rps? Example for #d above, the commit log mentions that
rps benefits if you have certain types of "cache hierarchy". Probably
some arch with large shared L2/3 (maybe inclusive) cache will benefit.
example: it does well on Nehalem and probably opterons (as long as you
dont start stacking these things on some interconnect like QPI or HT).
But what happens when you have FSB sharing across cores (still a very
common setup)? etc etc


ping wont do justice to the possible potential of rps mostly because it
generates very little traffic i.e the part #c above. But it helps me at
least boot a machine with proper setup - but it is not totally useless
because i think the cost of IPI can be deduced from the results.
I am going to put together some udp app with variable think-time to see
what happens. Would that be a reasonable thing to test on?

It would be valuable to have something like Documentation/networking/rps
to detail things a little more. 

cheers,
jamal

--

From: Stephen Hemminger
Date: Wednesday, April 14, 2010 - 12:44 pm

On Wed, 14 Apr 2010 14:53:42 -0400

There probably needs to be better autotuning for this; there is no reason
for RPS to be steering packets unless the queue is getting backed up.
Some kind of high / low water mark mechanism is needed.

RPS might also interact with the core turbo boost functionality on Intel chips.
Newer chips will make a single core faster if the other cores can be kept idle.
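
A high / low water mark scheme of the kind suggested here could look roughly like the sketch below. It is only an illustration of the hysteresis idea: the class name, the thresholds and the notion of feeding it a backlog length are all assumptions, not an existing kernel interface.

```python
class WatermarkSteering:
    """Toy hysteresis: start steering when the backlog reaches `high`,
    stop only once it has drained down to `low`."""

    def __init__(self, low, high):
        if not 0 <= low < high:
            raise ValueError("need 0 <= low < high")
        self.low = low
        self.high = high
        self.steering = False

    def update(self, backlog_len):
        """Feed the current backlog length; return whether to steer."""
        if not self.steering and backlog_len >= self.high:
            self.steering = True
        elif self.steering and backlog_len <= self.low:
            self.steering = False
        return self.steering


if __name__ == "__main__":
    ctl = WatermarkSteering(low=8, high=64)  # thresholds are arbitrary
    for qlen in (0, 30, 64, 40, 9, 8, 30):
        print(qlen, ctl.update(qlen))
```

The gap between the two thresholds limits how often steering flips on and off when the queue length hovers around a single value.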


-- 
--

From: Eric Dumazet
Date: Wednesday, April 14, 2010 - 12:58 pm

This was discussed a while ago, and Out Of Order packet delivery was the
thing that frightened us a bit.

Every time we change RPS to be on or off, we might have some extra
noise. Maybe we already have this problem with irqbalance ?



--

From: David Miller
Date: Thursday, April 15, 2010 - 1:51 am

From: Eric Dumazet <eric.dumazet@gmail.com>

irqbalance should never move network device interrupts around
under normal circumstances.  Arjan assured me that there is
specific logic in the irqbalance daemon to not move NIC
interrupts around once a target has been chosen.

--

From: jamal
Date: Wednesday, April 14, 2010 - 1:22 pm

how well does it work with Linux? Sounds like all i need to do is turn
on some BIOS feature. 
One of the negatives with multiqueue nics is because the core selection
is static, you could end up overloading one core while others stay idle.
This seems to steal cycle capacity from the idle cores and gives it to
the busy cpus. nice. So i see it as a boost to multiqueue.

cheers,
jamal

--

From: Eric Dumazet
Date: Wednesday, April 14, 2010 - 1:27 pm

Only if more than one flow is involved.

And if you have many flows, chances are they will spread across several queues...



--

From: jamal
Date: Wednesday, April 14, 2010 - 1:38 pm

Over long period of time measurement, true; but even with > 1 flows, it
is possible that one flow is more active/intense than others (rtp vs
some bulk file transfer) or more processor intensive than others (eg
ipsec vs clear text) etc.

BTW: just poking at intel doc on turbo boost and it seems the max a
core can steal from others is 400MHz; so a core can go from 2.8GHz
to 3.2GHz. I am sure theres a lot of interesting dynamics from this ;->
I think i will try turning this thing on in my tests since i have an i7.

cheers,
jamal

--

From: Tom Herbert
Date: Wednesday, April 14, 2010 - 1:45 pm

But use too many queues and the efficiency of NAPI drops and cost of
device interrupts becomes dominant, so that the overhead from
additional hard interrupts can surpass the overhead of doing RPS and
the IPIs.  I believe we are seeing this in some of our results,
which show that a combination of multi-queue and RPS can be better
than just multi-queue (see rps changelog).  Again, I'm not claiming
that is generally true, but there are a lot of factors to consider.
--

From: Eric Dumazet
Date: Wednesday, April 14, 2010 - 1:57 pm

RPS can be tuned (Changli wants a finer tuning...), it would be
interesting to tune multiqueue devices too. I dont know if its possible
right now.

On my Nehalem machine (16 logical cpus), its NetXtreme II BCM57711E
10Gigabit has 16 queues. It might be good to use fewer queues according
to your results on some workloads, and eventually use RPS as a second
layer.




--

From: Changli Gao
Date: Wednesday, April 14, 2010 - 3:51 pm

My idea is: run a daemon in userland to monitor the softnet
statistics, and tune the RPS setting if necessary. It seems that the
current softnet statistics data isn't correct.

A long time ago, I did a test, and the conclusion was that the
call_function_single IPI was more expensive than the resched IPI, so I
moved from softirq to kernel threads for packet processing. I'll redo
the test later.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Stephen Hemminger
Date: Wednesday, April 14, 2010 - 4:02 pm

On Thu, 15 Apr 2010 06:51:29 +0800

The big thing is data, data, data... Performance can only be examined
with real hard data with multiple different kind of hardware.  Also, check for
regressions in lmbench and TPC benchmarks. Yes this is hard, but papers
on this would allow for rational rather than speculative choices.

Adding more tuning knobs is not the answer unless you can show when
the tuning helps.

-- 
--

From: Eric Dumazet
Date: Wednesday, April 14, 2010 - 7:40 pm

Agree 100%, and irqbalance is the existing daemon. It should be used and
changed if necessary.

Changli, my strongest argument about your patches is that our scheduler
and memory affinity api (numactl driven) is bitmask oriented, giving the
same weight to individual cpu or individual memory node.



--

From: Changli Gao
Date: Wednesday, April 14, 2010 - 7:50 pm

It works with the assumption: the workloads handled in non-schedulable
context are less than the others. If most of the work is done in
non-schedulable (softirq) context, the scheduler can't keep load balance.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: David Miller
Date: Thursday, April 15, 2010 - 1:57 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Only NIU allows real detailed control over queue selection and
stuff like that, because the hardware has a real TCAM for
packet matching and packets which match in TCAM entries can
steer to different collections of queues.

We have ethtool interfaces for this (ETHTOOL_GRXCLS*), so you can
change it.

For most other chips we only have interfaces for modifying the
RX hashing algorithm or what the RX hash covers, stuff like
that.

See also ETHTOOL_GRXFH, ETHTOOL_SRXFH, ETHTOOL_SRXNTUPLE, and
ETHTOOL_GRXNTUPLE, the latter two of which were added for Intel
NICs.
--

From: jamal
Date: Thursday, April 15, 2010 - 5:10 am

Ok Eric, you seem to be running a system with two Nehalems
interconnected by QPI.
Is there any difference, performance-wise, between redirecting from
coreX to coreY when they are on the same Nehalem vs when you
are going across QPI?

cheers,
jamal

--

From: Changli Gao
Date: Thursday, April 15, 2010 - 5:32 am

For historical reasons, we use Linux-2.6.18. Our company has several
products with Xeon, P4, or i7 CPUs. Some of them are SMP, Multi-Core and
Multi-Threaded. We use a mechanism similar to dynamic weighted
RPS. The total throughput increases nearly linearly with the number
of worker threads (one worker thread per CPU).

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: jamal
Date: Thursday, April 15, 2010 - 5:50 am

Thanks for sharing. How much more can you say? ;-> Do you have a paper

Other than the i7 - have you tried to run rps on the P4?

cheers,
jamal


--

From: Changli Gao
Date: Thursday, April 15, 2010 - 4:51 pm

On a dual 4-core Xeon, we use one core for NIC in internal side, one
core for NIC in the external side, one for inbound QoS, one for
outbound QoS, and the CPU cycles left are used by DPI(DFA), the total

No.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: David Miller
Date: Thursday, April 15, 2010 - 1:51 am

From: jamal <hadi@cyberus.ca>

It's completely transparent and should just happen without any
BIOS tweaks.

--

From: Andi Kleen
Date: Wednesday, April 14, 2010 - 1:34 pm

In addition to Turbo, using fewer cores can also help to save power.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: David Miller
Date: Thursday, April 15, 2010 - 1:50 am

From: Stephen Hemminger <shemminger@vyatta.com>

I disagree, if the goal is to migrate the bulk of packet processing
to where the app will actually sink and process the data then it should
forward to RPS marked cpus regardless of local queue levels.

--

From: David Miller
Date: Thursday, April 15, 2010 - 1:48 am

From: jamal <hadi@cyberus.ca>

A single-queue NIC is actually not a requirement, RPS helps also in
cases where you have 'N' application threads and N is less than the
number of CPUs your multi-queue NIC is distributing traffic to.

Moving the bulk of the input packet processing to the cpus where
the applications actually sit had a non-trivial benefit.  RFS takes

I think for the case where application locality is important,
RPS/RFS can help regardless of cache details.
--

From: jamal
Date: Thursday, April 15, 2010 - 4:55 am

rfs looks quite interesting;-> I think with some twist it could be

Generally true, as long as there's not much shared data across the cpus
or the cost of a cache miss is reasonably tolerable. The socket layer
just happens to be not sharing much with ingress packet path and
for a single processor Nehalem, the caching system works so well that
the cost of cache misses is not as important a variable. Everything
is on the same die including the MM controller etc.
I am speculating (didnt get any answer to the question i asked) that
people running rps use such hardware;->

I speculate again that it may be too costly to run rps on something like
a tigerton or intel clovertown where you have cores sharing/contending
for an FSB. If I can get answers to the question: "What h/ware are
people running?" i could be proven wrong.
[Note: I am not against RPS - i think it has its place; so i hope my
desire to find out when to use rps doesnt show as hostility towards
rps.]

cheers,
jamal

--

From: Rick Jones
Date: Thursday, April 15, 2010 - 9:41 am

IPS (~= RPS) was running on shared FSB HP9000's.  Now, that was also a BSD 
networking stack with netisrq's and the like.  TOPS (~= RFS) was also run on 
shared FSB HP9000s, as well as CC-NUMA HP9000s and Integrity systems.  TOPS was 
implemented in a Streams-based stack tracing its history to a common ancestor 
with Solaris (Mentat).

rick jones
--

From: jamal
Date: Thursday, April 15, 2010 - 1:16 pm

Sounds interesting.
Wikipedia information overload. Any arch description of the HP9000? 
Did your scheme use IPIs to message the other CPUs?

cheers,
jamal 

--

From: Rick Jones
Date: Thursday, April 15, 2010 - 1:25 pm

I should have been more specific - HP 9000 Model 800's :) PA-RISC based business 

Netisrs were kernel processes one per CPU (back then a core, a processor and a 
CPU were one and the same :), and while we didn't call them IPI's, yes, it was a 
"soft interrupt" directed at the given processor to launch the netisr if it 
wasn't already running.

TOPS was similar, but was with Streams, and while that did/does have some
kernel processes, not everything would happen as a kernel process.

rick jones

HP 3000 Model 900's - by and large the same PA-RISC hardware but running MPE/XL 
(later called MPE/iX)
HP 9000 Model 700's - PA-RISC based workstations
HP 9000 Model 300's - Moto 68K-based workstations (replaced by the 700s)
--

From: Changli Gao
Date: Thursday, April 15, 2010 - 4:56 pm

If you doubt the cost of smp_call_function_single(), how about
trying another patch of mine, which implements something similar to RPS
but uses kernel threads instead, so no explicit IPI.

http://patchwork.ozlabs.org/patch/38319/


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Eric Dumazet
Date: Thursday, April 15, 2010 - 10:18 pm

Come on Changli.

How do you wake up a thread on a remote cpu ?

To move forward, we need to answer Jamal's question, that is, the
timing cost of IPIs.

A kernel module might do this; this could be integrated in perf bench so
that we can regression-test upcoming kernels.



--

From: Changli Gao
Date: Thursday, April 15, 2010 - 11:02 pm

resched IPI, apparently. But it is absolutely async, and its IRQ
handler is lighter.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Tom Herbert
Date: Thursday, April 15, 2010 - 11:28 pm

From: Eric Dumazet
Date: Thursday, April 15, 2010 - 11:32 pm

You still dont answer the question, and your claims are not grounded
in hard facts, but in your interpretation of code.


--

From: jamal
Date: Friday, April 16, 2010 - 6:42 am

My understanding of current scheduler is it does use IPIs to migrate
tasks around - so thats why things may be working for Changli. i.e
it is scheduler magic if you use kthreads. It is hard to say if this
would work better...

cheers,
jamal

--

From: Andi Kleen
Date: Friday, April 16, 2010 - 12:15 am

It shouldn't be a lot lighter than the new fancy "queued smp_call_function"
that's in the tree for a few releases. So it would surprise me if it made
much difference. In the old days when there was only a single lock for
s_c_f() perhaps...

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: jamal
Date: Friday, April 16, 2010 - 6:27 am

So you are saying that the old implementation of IPI (likely what i
tried pre-napi and as recent as 2-3 years ago) was bad because of a
single lock?

BTW, I directed some questions to you earlier but didnt get a response,
to quote:
---
On IPIs:
Is anyone familiar with what is going on with Nehalem? Why is it this
good? I expect things will get a lot nastier with other hardware like
xeon based or even Nehalem with rps going across QPI.
Here's why i think IPIs are bad, please correct me if i am wrong:
- they are synchronous. i.e an IPI issuer has to wait for an ACK (which
is in the form of an IPI).
- data cache has to be synced to main memory
- the instruction pipeline is flushed
- what else did i miss? Andi?
---

Do you know any specs i could read up which will tell me a little more?

cheers,
jamal


--

From: Andi Kleen
Date: Friday, April 16, 2010 - 6:37 am

Yes.

The old implementation of smp_call_function. Also in the really old
days there was no smp_call_function_single() so you tended to broadcast.


Nehalem is just fast. I don't know why it's fast in your specific
case. It might be simply because it has lots of bandwidth everywhere.

In the hardware there's no ack, but in the Linux implementation there
is usually (because need to know when to free the stack state used
to pass information)

However there's also now support for queued IPI

At least on Nehalem data transfer can be often through the cache.

IPIs involve APIC accesses which are not very fast (so overall
it's far more than a pipeline worth of work), but it's still
not an incredibly expensive operation.

There's also X2APIC now which should be slightly faster, but it's 

If you're just interested in IPI and cache line transfer performance it's
probably best to just measure it.

Some general information is always in the Intel optimization guide.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: jamal
Date: Friday, April 16, 2010 - 6:58 am

Nice - thanks for that info! So not only has h/ware improved, but

Well, the cache architecture is nicer. The on-die MC is nice. No more
shared MC hub/FSB. The 3 MC channels are nice. Intel finally beating
AMD ;-> someone did a measurement of the memory timings (L1, L2, L3, MM




There are tools like benchit which would give me L1,2,3,MM measurements;

Thanks Andi!

cheers,
jamal

--

From: jamal
Date: Friday, April 16, 2010 - 6:21 am

Perf would be good - but even softnet_stat cleaner than the nasty
hack i use (attached) would be a good start; the ping with and without
rps gives me a ballpark number.

IPI cost is important to me because i tried this before and failed
miserably. I was thinking the improvement may be due to hardware used
but i am having a hard time to get people to tell me what hardware they
used! I am old school - I need data;-> The RFS patch commit seems to
have more info but still vague, example: 
"The benefits of RFS are dependent on cache hierarchy, application
load, and other factors"
Also, what does a "simple" or "complex" benchmark mean?;->
I think it is only fair to get this info, no?

Please dont consider what i say above as being anti-RPS.
5 microsec extra latency is not bad if it can be amortized.
Unfortunately, the best traffic i could generate was < 20Kpps of
ping which still manages to get 1 IPI/packet on Nehalem. I am going
to write up some app (lots of cycles available tomorrow). I still think
it is valuable.

cheers,
jamal

--

From: Changli Gao
Date: Friday, April 16, 2010 - 6:34 am

+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);

Do you mean that received_rps is equal to ipi_rps? received_rps is the
number of IPI used by RPS. And ipi_rps is the number of IPIs sent by
function generic_exec_single(). If there isn't other user of
generic_exec_single(), received_rps should be equal to ipi_rps.
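
For anyone collecting these counters from userland, a minimal parsing sketch follows. It assumes the 11-column layout printed by the patched seq_printf() quoted above (the last column, ipi_rps, exists only with jamal's hack applied); the function name and the sample line are made up for illustration.

```python
# Column layout taken from the patched seq_printf() call quoted above.
FIELDS = ("total", "dropped", "time_squeeze",
          None, None, None, None, None,   # unused / former fastroute slots
          "cpu_collision", "received_rps", "ipi_rps")


def parse_softnet_line(line):
    """Parse one per-CPU line of 11 hex words into a dict of ints."""
    words = line.split()
    if len(words) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(words)}")
    return {name: int(word, 16)
            for name, word in zip(FIELDS, words) if name is not None}


if __name__ == "__main__":
    # Synthetic sample line, not data from this thread.
    sample = ("0000000a 00000000 00000001 00000000 00000000 00000000 "
              "00000000 00000000 00000002 00000003 00000004")
    print(parse_softnet_line(sample))
```

Comparing received_rps and ipi_rps per CPU over an interval is then a matter of parsing two snapshots and subtracting.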

@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: jamal
Date: Friday, April 16, 2010 - 6:49 am

my observation is:
s->total is the sum of all packets received by cpu (some directly from
ethernet)
s->received_rps is the count of what the receiver cpu saw incoming when
packets were sent to it by another cpu.
s->ipi_rps is the number of times we tried to enq to a remote cpu but
found its queue to be empty and had to send an IPI.
ipi_rps can be < received_rps if we receive > 1 packet without
generating an IPI. What did i miss?
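
The "IPI only when the remote queue was empty" behaviour described here can be illustrated with a toy model. It follows jamal's description of the counters rather than the actual kernel code; the class, the `burst` parameter and the drain schedule are assumptions made for the sketch.

```python
from collections import deque


class RemoteBacklog:
    """Toy model of a remote CPU's backlog: an IPI is only needed when a
    packet is enqueued onto an empty queue."""

    def __init__(self):
        self.queue = deque()
        self.ipis_sent = 0   # plays the role jamal describes for ipi_rps
        self.processed = 0   # packets the remote CPU has handled

    def enqueue(self, pkt):
        if not self.queue:
            self.ipis_sent += 1   # queue was empty: remote must be kicked
        self.queue.append(pkt)

    def drain(self):
        """Remote CPU runs and processes everything queued so far."""
        self.processed += len(self.queue)
        self.queue.clear()


def run(n_packets, burst):
    """Enqueue n_packets, letting the remote CPU drain after every `burst`."""
    backlog = RemoteBacklog()
    for i in range(1, n_packets + 1):
        backlog.enqueue(i)
        if i % burst == 0:
            backlog.drain()
    backlog.drain()
    return backlog.ipis_sent, backlog.processed


if __name__ == "__main__":
    print(run(1000, 1))    # remote keeps up: one IPI per packet
    print(run(1000, 10))   # packets arrive in bursts: IPIs are shared
```

When the remote CPU drains its queue between arrivals, as with a slow ping stream, every packet pays for its own IPI; only when packets pile up does the cost get shared.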

cheers,
jamal

--

From: Changli Gao
Date: Friday, April 16, 2010 - 7:10 am

It is meaningless currently. If rps is enabled, it may be twice the
number of packets received, because one packet may be counted twice:
once in enqueue_to_backlog(), and again in __netif_receive_skb(). I
had posted a patch to solve this problem.

http://patchwork.ozlabs.org/patch/50217/

If you don't apply my patch, you'd better refer to /proc/net/dev for

Maybe its name confused you.

/* Called from hardirq (IPI) context */
static void trigger_softirq(void *data)
{
        struct softnet_data *queue = data;
        __napi_schedule(&queue->backlog);
        __get_cpu_var(netdev_rx_stat).received_rps++;
}

the function above is called in IRQ of IPI. It counts the number of


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: jamal
Date: Friday, April 16, 2010 - 7:43 am

You are probably right - you made me look at my collected data ;->
i will look closely later, but it seems they are accounting for
different cpus, no? 
Example, attached are some of the stats i captured when i was running
the tests redirecting from CPU0 to CPU1 1M packets at about 20Kpps (just
cut to the first and last two columns):

cpu  | Total    |rps_recv |rps_ipi
-----+----------+---------+---------
cpu0 | 002dc7f1 |00000000 |000f4246
cpu1 | 002dc804 |000f4240 |00000000
-------------------------------------

So: cpu0 received 0x2dc7f1 pkts cumulatively over time and
redirected them to cpu1 (mostly, the extra 5 may be leftover since i clear
the data) and for the test 0xf4246 times it generated an IPI. It can be
seen that the running total for CPU1 is 0x2dc804 but in this one run it
received 1M packets (0xf4240).
i.e i dont see the double accounting..
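
For readers who do not think in hex, a small sketch that decodes the table above and computes the IPI-to-packet ratio. The numbers are copied from the table; the helper names are made up, and the ratio takes jamal's reading of the columns (rps_recv as packets seen on the destination) at face value.

```python
# Hex counters copied from the table above.
TABLE = {
    "cpu0": {"total": "002dc7f1", "rps_recv": "00000000", "rps_ipi": "000f4246"},
    "cpu1": {"total": "002dc804", "rps_recv": "000f4240", "rps_ipi": "00000000"},
}


def decode(table):
    """Convert every hex string in the table to an int."""
    return {cpu: {name: int(value, 16) for name, value in cols.items()}
            for cpu, cols in table.items()}


def ipis_per_packet(counters, src="cpu0", dst="cpu1"):
    """IPIs generated on src divided by the count seen on dst."""
    return counters[src]["rps_ipi"] / counters[dst]["rps_recv"]


if __name__ == "__main__":
    counters = decode(TABLE)
    for cpu, cols in counters.items():
        print(cpu, cols)
    print("IPIs per packet:", ipis_per_packet(counters))
```

The decoded values make the one-IPI-per-packet observation easy to see: the IPI count on cpu0 and the receive count on cpu1 differ by only a handful out of a million.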

cheers,
jamal

--

From: Changli Gao
Date: Friday, April 16, 2010 - 7:58 am

I remember you redirected all the traffic from cpu0 to cpu1, and the data shows:


a single packet is counted twice by CPU0 and CPU1. If you change RPS setting by:

echo 1 > ..../rps_cpus

you will find the total numbers are doubled.
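
The value written to rps_cpus is a hexadecimal bitmask of CPUs, so 1 means "CPU0 only". A minimal helper for converting between CPU lists and masks might look like this; the function names are made up, and the sketch only generates masks for CPUs 0-31 (wider masks use comma-separated 32-bit groups, which it handles on the parsing side only).

```python
def cpus_to_rps_mask(cpus):
    """Return the hex bitmask string for an iterable of CPU numbers (0-31)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPUs 0-31")
        mask |= 1 << cpu
    return format(mask, "x")


def rps_mask_to_cpus(text):
    """Inverse: parse a hex mask (commas between 32-bit groups allowed)."""
    mask = int(text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


if __name__ == "__main__":
    print(cpus_to_rps_mask([0]))        # CPU0 only
    print(cpus_to_rps_mask([1]))        # CPU1 only
    print(cpus_to_rps_mask(range(4)))   # CPUs 0-3
    print(rps_mask_to_cpus("02"))
```

Leading zeros do not matter when parsing, so "1" and "01" both select CPU0.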


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: jamal
Date: Monday, April 19, 2010 - 5:48 am

Sorry, didnt respond to you - busyed out setting up before trying
to think a little more about this..


If you look at the patch, I am zeroing those stats - so 0xf4240 is only
one test (decimal 1M). I think there is something to what you are
saying; rps_ipi on cpu0 is ambiguous because it counts the number of
times cpu0 softirq was scheduled as well as the number of times cpu0
scheduled other cpus.
The extra six for cpu0 turn out to be the times an ethernet interrupt

Well, the counts have different meanings; rps_ipi applies to source cpu
activity and rps_recv applies to destination. Example, if cpu0 in total
6 times found some destination cpu to be empty and 2 of those happen to
be on each of cpu1, cpu2, cpu3 then
cpu0: ipi_rps = 6
cpu1: rps_recv = 2
cpu2: rps_recv = 2
cpu3: rps_recv = 2

This is true. But IMO it is deserving and should be double counted.
It is just more fine-grained accounting.
IOW, I am not sure we need your patch because we will lose the
fine-grain accounting - and mine requires more work to be less ambiguous.

cheers,
jamal 

--

From: Eric Dumazet
Date: Saturday, April 17, 2010 - 12:35 am

I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
Nehalem. So a 3-4 years old design.

For all tests, I use the best time of 3 runs of "ping -f -q -c 100000
192.168.0.2". Yes ping is not very good, but its available ;)

Note: I make sure all 8 cpus of target are busy, eating cpu cycles in
user land. I dont want to tweak acpi or whatever smart power saving
mechanisms.

When RPS off
100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms

RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
(echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms

So the cost of queueing the packet into our own queue (netif_receive_skb
-> enqueue_to_backlog) is about 0.74 us  (74 ms / 100000)

I personally think we should process the packet instead of queueing it, but
Tom disagrees with me.

RPS on, directed on cpu1 (other socket)
(echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

So extra cost to enqueue to a remote cpu queue, IPI, softirq handling...
is 3 us. Note this cost is in case we receive a single packet.

I suspect IPI itself is in the 1.5 us range, not very far from the
queueing to ourself case.
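
The per-packet figures follow directly from the three ping totals above. A small sketch of the arithmetic, using only the numbers reported in this message (the dictionary keys and function names are made up):

```python
PACKETS = 100000

# Total times reported by ping for the three configurations above.
RUNS_MS = {
    "rps_off": 4160,
    "rps_local_cpu0": 4234,
    "rps_remote_cpu1": 4542,
}


def extra_cost_usecs(run, baseline, runs_ms=RUNS_MS, packets=PACKETS):
    """Per-packet cost of `run` over `baseline`, in microseconds."""
    delta_ms = runs_ms[run] - runs_ms[baseline]
    return delta_ms * 1000.0 / packets


if __name__ == "__main__":
    print("local queueing vs no RPS :", extra_cost_usecs("rps_local_cpu0", "rps_off"))
    print("remote vs local queueing :", extra_cost_usecs("rps_remote_cpu1", "rps_local_cpu0"))
    print("remote vs no RPS         :", extra_cost_usecs("rps_remote_cpu1", "rps_off"))
```

Measured against the local-queue case the remote path costs about 3.08 us per packet, which matches the "3 us" quoted; measured against RPS off it is about 3.82 us.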

For me RPS use cases are :

1) Value added apps handling a lot of TCP data, where the costs of cache
misses in tcp stack easily justify spending 3 us to gain much more.

2) Network appliance, where a single cpu is filled 100% to handle one
device hardware and software/RPS interrupts, delegating all higher level
work to a pool of cpus.

I'll try to do these tests on a Nehalem target.



--

From: Tom Herbert
Date: Saturday, April 17, 2010 - 1:43 am

You could do that, but then the packet processing becomes HOL blocking
on all the packets that are being sent to other queues for
processing-- remember the IPIs are only sent at the end of the NAPI.
So unless the upper stack processing is <0.74us in your case, I think
processing packets directly on the local queue would improve best case
latency, but would increase average latency and even more likely worse
--

From: Eric Dumazet
Date: Saturday, April 17, 2010 - 2:23 am

Anyway, a big part of this 0.74 us overhead comes from get_rps_cpu()
itself, computing skb->rxhash and all. We should make a review of how
many cache lines we exchange per skb, and try to reduce this number.
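
To make the "compute a hash, then pick a CPU" step concrete, here is a userland sketch. The CRC32 hash is a stand-in chosen for convenience and is NOT the kernel's hash; the multiply-and-shift scaling is one common way to map a 32-bit hash onto a small table and should be read as an assumption about the approach, not a copy of get_rps_cpu().

```python
import socket
import struct
import zlib


def flow_hash(saddr, daddr, sport, dport):
    """Stand-in 32-bit flow hash (CRC32 here, not the kernel's hash)."""
    key = (socket.inet_aton(saddr) + socket.inet_aton(daddr)
           + struct.pack("!HH", sport, dport))
    return zlib.crc32(key) & 0xFFFFFFFF


def pick_cpu(hash32, cpu_map):
    """Scale a 32-bit hash onto an index into cpu_map without a modulo."""
    return cpu_map[(hash32 * len(cpu_map)) >> 32]


if __name__ == "__main__":
    cpus = [1, 2, 3]  # hypothetical set of CPUs enabled for steering
    for sport in (1024, 1025, 1026, 1027):
        h = flow_hash("192.168.0.1", "192.168.0.2", sport, 80)
        print(sport, hex(h), "-> cpu", pick_cpu(h, cpus))
```

Every packet of a given flow hashes to the same value and therefore lands on the same CPU, which is what keeps a flow's packets in order.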



--

From: Eric Dumazet
Date: Saturday, April 17, 2010 - 7:27 am

Tom, I am not sure what you describe is even respected for NAPI devices.
(I hope you use napi devices in your company ;) )

If we enqueue a skb to backlog, we also link our backlog napi into our
poll_list, if not already there.

So the loop in net_rx_action() will make us handle our backlog napi a
bit after this network device napi (if the time limit of 2 jiffies has not
elapsed) and *before* sending IPIs to remote cpus anyway.




--

From: Tom Herbert
Date: Saturday, April 17, 2010 - 10:26 am

From: Eric Dumazet
Date: Saturday, April 17, 2010 - 7:17 am

- There is no point in enforcing a time limit in process_backlog(), since
other napi instances dont follow the same rule. We can exit after only one
packet processed...
The normal quota of 64 packets per napi instance should be the norm, and
net_rx_action() already has its own time limit.
Note : /proc/sys/net/core/dev_weight can be used to tune this 64 default
value.

- Use DEFINE_PER_CPU_ALIGNED for softnet_data definition.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..8092f01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -264,7 +264,7 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
  *	queue in the local softnet handler.
  */
 
-DEFINE_PER_CPU(struct softnet_data, softnet_data);
+DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
 EXPORT_PER_CPU_SYMBOL(softnet_data);
 
 #ifdef CONFIG_LOCKDEP
@@ -3232,7 +3232,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 {
 	int work = 0;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
-	unsigned long start_time = jiffies;
 
 	napi->weight = weight_p;
 	do {
@@ -3252,7 +3251,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_enable();
 
 		__netif_receive_skb(skb);
-	} while (++work < quota && jiffies == start_time);
+	} while (++work < quota);
 
 	return work;
 }
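
For readers following along, the loop shape after this patch can be sketched in userspace. This is only an illustration under stated assumptions: `struct backlog`, `backlog_pop()` and `process_backlog_sketch()` are hypothetical names, not kernel code. The point is that the per-instance loop is bounded by the quota alone, and the overall time limit is left to the caller.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-cpu input queue: just a packet count. */
struct backlog {
	size_t pending;
};

/* Take one packet off the queue; returns 0 when the queue is empty. */
static int backlog_pop(struct backlog *b)
{
	if (b->pending == 0)
		return 0;
	b->pending--;
	return 1;
}

/*
 * Drain at most 'quota' packets and report how many were handled.
 * No jiffies check here: the caller is assumed to enforce the
 * overall time limit, as net_rx_action() does in the kernel.
 */
static int process_backlog_sketch(struct backlog *b, int quota)
{
	int work = 0;

	while (work < quota && backlog_pop(b))
		work++;
	return work;
}
```

With 100 packets pending and a quota of 64, the first call handles 64 packets and a second call handles the remaining 36.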


--

From: David Miller
Date: Sunday, April 18, 2010 - 2:36 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Yep, doing this time limit at two levels is pointless.

Applied, thanks Eric!
--

From: jamal
Date: Saturday, April 17, 2010 - 10:31 am

Eric, I thank you kind sir for going out of your way to do this - it is


I didnt keep the cpus busy. I should re-run with such a setup, any
specific app that you used to keep them busy? Keeping them busy could
have consequences;  I am speculating you probably ended having greater

I should mention i turned off acpi as well in the bios; it was consuming


Sorry - I am gonna have to turn on some pedagogy and offer my
Canadian 2 cents;->
I would lean on agreeing with Tom, but maybe go one step further (sans
packet-reordering): we should never process packets to socket layer on
the demuxing cpu.
enqueue everything you receive on a different cpu - so somehow receiving
cpu becomes part of a hashing decision ...

The reason is derived from queueing theory - of which i know dangerously
little - but refer you to mr. little his-self[1] (pun fully
intended;->):
i.e fixed serving time provides more predictable results as opposed to
once in a while a spike as you receive packets destined to "our cpu".
Queueing packets and later allocating cycles to processing them adds to
variability, but is not as bad as processing to completion to socket

Good test - should be worst case scenario. But there are two other 
scenarios which will give different results in my opinion.
On your setup i think each socket has two dies, each with two cores. So
my feeling is you will get different numbers if you go within same die
and across dies within same socket. If i am not mistaken, the mapping
would be something like socket0/die0{core0/2}, socket0/die1{core4/6},
socket1/die0{core1/3}, socket1/die1{core5/7}.
If you have cycles can you try the same socket+die but different cores

Which is not too bad if amortized. Were you able to check if you
processed a packet/IPI? One way to achieve that is just standard ping.
In the nehalem my number for going to a different core was in the range
of 5 microseconds effect on RTT when system was not busy. I think it

Sounds about right, maybe 2 us in my case. I am still ...
From: Eric Dumazet
Date: Sunday, April 18, 2010 - 2:39 am

No, only one packet per IPI, since I setup my tg3 coalescing parameter
to the minimum value, I received one packet per interrupt.

The specific app is :


Sure, lets redo a full test, taking lowest time of three ping runs


echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4151ms

echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4254ms

echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms

echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4458ms

echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms

echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4327ms

echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4571ms

echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4472ms

echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4568ms
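
Each value written to rps_cpus above is a hexadecimal CPU bitmap: bit N set means CPU N may be handed packets, and 00 leaves RPS off for that queue. A minimal sketch for decoding such strings follows; `rps_mask_to_cpu()` is a hypothetical helper, and it only handles single-CPU masks that fit in an unsigned long, which is all this test uses.

```c
#include <stdlib.h>

/*
 * Return the CPU number selected by a hex rps_cpus string such as "10",
 * -1 if the mask is empty (RPS off), or -2 if more than one bit is set.
 */
static int rps_mask_to_cpu(const char *hex)
{
	unsigned long mask = strtoul(hex, NULL, 16);
	int cpu = 0;

	if (mask == 0)
		return -1;
	if (mask & (mask - 1))
		return -2;
	while (!(mask & 1)) {
		mask >>= 1;
		cpu++;
	}
	return cpu;
}
```

So the runs above go, in order, through no RPS and then cpu0 to cpu7; for example "10" selects cpu4 and "80" selects cpu7.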


# egrep "physical id|core|apicid" /proc/cpuinfo 
physical id	: 0
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0

physical id	: 1
core id		: 0
cpu cores	: 4
apicid		: 4
initial apicid	: 4

physical id	: 0
core id		: 2
cpu cores	: 4
apicid		: 2
initial apicid	: 2

physical id	: 1
core id		: 2
cpu cores	: 4
apicid		: 6
initial apicid	: 6

physical id	: 0
core id		: 1
cpu cores	: 4
apicid		: 1
initial apicid	: 1

physical id	: 1
core id		: 1
cpu cores	: 4
apicid		: 5
initial apicid	: 5

physical id	: 0
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3

physical id	: 1
core id		: 3
cpu cores	: ...
From: Eric Dumazet
Date: Sunday, April 18, 2010 - 4:34 am

Another interesting user land app would be a cpu _and_ memory
cruncher, because of the cache misses we'll get.

$ cat nloop.c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

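#define SZ (4*1024*1024)	/* 4 MB working set per process */

int main(int argc, char *argv[])
{
	int nproc = 8;
	char *buffer;

	if (argc > 1)
		nproc = atoi(argv[1]);
	/* fork nproc - 1 children; each child leaves the loop at once */
	while (nproc > 1) {
		if (fork() == 0)
			break;
		nproc--;
	}
	buffer = malloc(SZ);
	if (!buffer)
		return 1;
	/* burn cpu and thrash caches forever */
	while (1)
		memset(buffer, 0x55, SZ);
}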

$ ./nloop 8 &

echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
4861ms

echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
4981ms

echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7191ms

echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7128ms

echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7107ms

echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
5505ms

echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7125ms

echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7022ms

echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
7157ms


Maximum overhead is (7191 - 4861) ms / 100000 packets = 23.3 us per packet
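
The per-packet figure comes from dividing the difference in total ping time by the 100000 packets sent. A small sketch of that arithmetic, where `overhead_us()` is a hypothetical helper name:

```c
/*
 * Per-packet overhead in microseconds, given total ping times in
 * milliseconds for 'packets' packets with and without RPS.
 */
static double overhead_us(double rps_ms, double base_ms, double packets)
{
	return (rps_ms - base_ms) * 1000.0 / packets;
}
```

overhead_us(7191, 4861, 100000) gives 23.3 for the loaded case here. Applied to the earlier idle run (4571 ms worst case against the 4151 ms baseline) the same formula gives 4.2 us.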



--

From: jamal
Date: Sunday, April 18, 2010 - 7:09 pm

Thanks Eric. I tried to visualize your results - attached.
There are 2-3 odd numbers (labelled with *) but other
than that results are as expected...

I did run some experiments with some udp sink server
and i saw the IPIs amortized; unfortunately sky2 h/ware 
proved to be bottleneck (at > 750Kpps incoming, it started 
dropping and wasnt recording the drops, so i had to slow things down). I
need to digest my results a little more - but it seems i was getting
better throughput results with RPS (i.e it was able to sink
more packets)..

cheers,
jamal
From: Eric Dumazet
Date: Monday, April 19, 2010 - 2:37 am

net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

I add a flag to scan cpumask only if at least one IPI was scheduled.
Even cpumask_weight() might be expensive on some setups, where
nr_cpumask_bits could be very big (4096 for example)

Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    5 +-
 net/core/dev.c            |   73 ++++++++++++++++--------------------
 2 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..283d3ef 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1389,8 +1389,11 @@ struct softnet_data {
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
-	/* Elements below can be accessed between CPUs for RPS */
 #ifdef CONFIG_RPS
+	unsigned int		rps_ipis_scheduled;
+	unsigned int		rps_select;
+	cpumask_t		rps_mask[2];
+	/* Elements below can be accessed between CPUs for RPS */
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
 	unsigned int		input_queue_head;
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..3e6e420 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2347,19 +2347,14 @@ done:
 }
 
 /*
- * This structure holds the per-CPU mask of CPUs for which IPIs are scheduled
+ * sofnet_data holds the per-CPU mask of CPUs for which IPIs are scheduled
  * to be sent to kick remote softirq processing.  There are two masks since
- * the sending of IPIs must be done with interrupts enabled.  The select field
+ * the sending of IPIs must be done with interrupts enabled.  The rps_select field
  * ...
From: Changli Gao
Date: Monday, April 19, 2010 - 2:48 am

How about using a array to save the cpu IDs. The number of CPUs, to
which the IPI will be sent, should be small.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Eric Dumazet
Date: Monday, April 19, 2010 - 5:14 am

Yes it should be small, yet the two arrays would be big enough to make
softnet_data first part use at least two cache lines instead of one,
even in the case we handle one cpu/IPI per net_rps_action()

As several packets can be enqueued for a given cpu, we would need to
keep bitmasks.
We would have to add one test in enqueue_to_backlog()

if (!cpu_test_and_set(cpu, mask)) {
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	array[nb++] = cpu;
}
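
The idea of a bitmask guarding an append to an array can be sketched in plain userspace C. Everything here is an assumption made for illustration: `struct ipi_pending`, `ipi_pending_add()` and the 64-cpu limit are hypothetical and are not part of any posted patch.

```c
#include <stdint.h>

#define MAX_CPUS 64	/* assumption for the sketch: mask fits in 64 bits */

struct ipi_pending {
	uint64_t mask;			/* one bit per cpu already recorded */
	unsigned int nb;		/* number of valid entries in cpus[] */
	unsigned int cpus[MAX_CPUS];	/* cpus to send an IPI to, in order */
};

/*
 * Record that 'cpu' needs an IPI. Returns 1 if the cpu was newly added,
 * 0 if it was already pending or out of range.
 */
static int ipi_pending_add(struct ipi_pending *p, unsigned int cpu)
{
	uint64_t bit;

	if (cpu >= MAX_CPUS)
		return 0;
	bit = UINT64_C(1) << cpu;
	if (p->mask & bit)
		return 0;	/* already recorded by an earlier packet */
	p->mask |= bit;
	p->cpus[p->nb++] = cpu;
	return 1;
}
```

The array gives a short list to walk when sending IPIs, while the mask keeps a cpu from being listed twice when several packets are queued for it.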



--

From: Changli Gao
Date: Monday, April 19, 2010 - 5:28 am

rps_lock(queue);
        if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
                if (queue->input_pkt_queue.qlen) {
...
                if (napi_schedule_prep(&queue->backlog)) {
#ifdef CONFIG_RPS
                        if (cpu != smp_processor_id()) {
                                struct rps_remote_softirq_cpus *rcpus =
                                    &__get_cpu_var(rps_remote_softirq_cpus);

                                cpu_set(cpu, rcpus->mask[rcpus->select]);
                                __raise_softirq_irqoff(NET_RX_SOFTIRQ);
                                goto enqueue;
                        }
#endif
                        __napi_schedule(&queue->backlog);
                }

Only the first packet of a softnet.input_pkt_queue may trigger IPI, so
we don't need to keep bitmasks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Eric Dumazet
Date: Monday, April 19, 2010 - 6:27 am

This is not true Changli

Please read again all previous mails about RPS, or the code.



--

From: Eric Dumazet
Date: Monday, April 19, 2010 - 7:22 am

Hmm, I just read again, and I now remember Tom used a single bitmap,
then we had to add a second set because of a possible race.

A list would be enough.
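
A rough userspace sketch of what such a list could look like: an intrusive, singly linked LIFO where each remote cpu's structure carries its own link. `struct sd_sketch`, its `ipi_queued` flag and `ipi_list_push()` are hypothetical; in the kernel the "already queued" test is assumed to come from the existing NAPI scheduling state rather than a separate flag.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct softnet_data. */
struct sd_sketch {
	unsigned int cpu;
	int ipi_queued;			/* set while linked on some list */
	struct sd_sketch *ipi_next;	/* next remote cpu to kick */
};

/*
 * Push 'remote' on the LIFO list headed at *head unless it is already
 * queued. Returns 1 if it was pushed.
 */
static int ipi_list_push(struct sd_sketch **head, struct sd_sketch *remote)
{
	if (remote->ipi_queued)
		return 0;
	remote->ipi_queued = 1;
	remote->ipi_next = *head;
	*head = remote;
	return 1;
}
```

No scan over nr_cpumask_bits is needed: when nothing was queued the head is simply NULL.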



--

From: Eric Dumazet
Date: Monday, April 19, 2010 - 8:07 am

Here is the updated patch, using a single list instead of bitmap

RFC status becomes official patch ;)

Thanks Changli for your array suggestion !


[PATCH net-next-2.6] rps: shortcut net_rps_action()

net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
RPS is not active.

Tom Herbert used two bitmasks to hold information needed to send IPI,
but a single LIFO list seems more appropriate.

Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
(remove two ifdefs)

Move rps_remote_softirq_cpus into softnet_data to share its first cache
line, filling an existing hole.

In a future patch, we could call net_rps_action() from process_backlog()
to make sure we send IPI before handling this cpu backlog.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    9 ++--
 net/core/dev.c            |   79 ++++++++++++++----------------------
 2 files changed, 38 insertions(+), 50 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 649a025..83ab3da 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1381,17 +1381,20 @@ static inline int unregister_gifconf(unsigned int family)
 }
 
 /*
- * Incoming packets are placed on per-cpu queues so that
- * no locking is needed.
+ * Incoming packets are placed on per-cpu queues
  */
 struct softnet_data {
 	struct Qdisc		*output_queue;
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
-	/* Elements below can be accessed between CPUs for RPS */
 #ifdef CONFIG_RPS
+	struct softnet_data	*rps_ipi_list;
+
+	/* Elements below can be accessed between CPUs for RPS */
 	struct call_single_data	csd ____cacheline_aligned_in_smp;
+	struct softnet_data	*rps_ipi_next;
+	unsigned int		cpu;
 	unsigned int		input_queue_head;
 #endif
 	struct sk_buff_head	input_pkt_queue;
diff --git a/net/core/dev.c b/net/core/dev.c
index 7abf959..f6ff2cf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ ...
From: Tom Herbert
Date: Monday, April 19, 2010 - 9:02 am

Yes.  I did some quick experiments last night and there does seem to
--

From: David Miller
Date: Monday, April 19, 2010 - 1:21 pm

From: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

It is getting increasingly complicated to follow who enables and
disables local cpu irqs in these code paths.  We could combat
this by adding something like "_irq_enable()" to the function
names.
--

From: Eric Dumazet
Date: Tuesday, April 20, 2010 - 12:17 am

Yes I agree, we need a general cleanup in this file

Thanks David !

[PATCH net-next-2.6] rps: cleanups

struct softnet_data holds many queues, so consistently using the name
"sd" instead of "queue" is better.

Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()

Adds a _and_irq_disable suffix to net_rps_action() name, as David
suggested.

incr_input_queue_head() becomes input_queue_head_incr()

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netdevice.h |    4 
 net/core/dev.c            |  149 +++++++++++++++++++-----------------
 2 files changed, 82 insertions(+), 71 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 83ab3da..3c5ed5f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1401,10 +1401,10 @@ struct softnet_data {
 	struct napi_struct	backlog;
 };
 
-static inline void incr_input_queue_head(struct softnet_data *queue)
+static inline void input_queue_head_incr(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	queue->input_queue_head++;
+	sd->input_queue_head++;
 #endif
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 05a2b29..70df048 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -208,17 +208,17 @@ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
 }
 
-static inline void rps_lock(struct softnet_data *queue)
+static inline void rps_lock(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	spin_lock(&queue->input_pkt_queue.lock);
+	spin_lock(&sd->input_pkt_queue.lock);
 #endif
 }
 
-static inline void rps_unlock(struct softnet_data *queue)
+static inline void rps_unlock(struct softnet_data *sd)
 {
 #ifdef CONFIG_RPS
-	spin_unlock(&queue->input_pkt_queue.lock);
+	spin_unlock(&sd->input_pkt_queue.lock);
 #endif
 }
 
@@ -2346,63 +2346,74 @@ done:
 }
 
 /* Called from hardirq (IPI) context */
-static void trigger_softirq(void *data)
+static void ...
From: David Miller
Date: Tuesday, April 20, 2010 - 1:18 am

From: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
--

From: Changli Gao
Date: Monday, April 19, 2010 - 4:56 pm

It seems you prefetch rps_ipi_next. I think it isn't necessary, as the
list should be short. If you insist on this, is the macro prefetch()
better?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
From: Changli Gao
Date: Monday, April 19, 2010 - 5:32 pm

Oh, I read the code again and got the answer. After the IPI is sent,
this softnet will be queued by the other CPUs. We prefetch the pointer
rps_ipi_next to avoid this race condition.

Sorry for noise :)


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
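
The ordering described above (load the next pointer, then send the IPI) can be sketched in userspace as follows. `struct sd_node`, `ipi_list_flush()` and `kick_and_clobber()` are hypothetical names; the clobbering kick only simulates a remote cpu reusing the link once it has been kicked.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct softnet_data. */
struct sd_node {
	unsigned int cpu;
	struct sd_node *ipi_next;
};

/* Simulated kick: pretend the link is reused as soon as the IPI is sent. */
static void kick_and_clobber(struct sd_node *n)
{
	n->ipi_next = NULL;
}

/*
 * Walk the list and invoke 'kick' on every node. The next pointer is
 * loaded before the kick because the remote cpu may be re-queued (and
 * its link rewritten) as soon as the IPI is sent.
 * Returns the number of nodes kicked.
 */
static int ipi_list_flush(struct sd_node *head, void (*kick)(struct sd_node *))
{
	int sent = 0;

	while (head) {
		struct sd_node *next = head->ipi_next;	/* read first */

		kick(head);	/* after this, head->ipi_next is not ours */
		head = next;
		sent++;
	}
	return sent;
}
```

Had the loop read head->ipi_next after the kick, this simulation would stop after the first node instead of visiting all of them.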
--

From: Eric Dumazet
Date: Monday, April 19, 2010 - 10:55 pm

Speaking of prefetch business,

I partly tested following patch, I will submit it if it happens to be a
clear win.

diff --git a/net/core/dev.c b/net/core/dev.c
index 05a2b29..fe6fc9f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2349,7 +2349,9 @@ done:
 static void trigger_softirq(void *data)
 {
 	struct softnet_data *queue = data;
+
 	__napi_schedule(&queue->backlog);
+	prefetch(queue->input_pkt_queue.next);
 	__get_cpu_var(netdev_rx_stat).received_rps++;
 }
 #endif /* CONFIG_RPS */


--

From: jamal
Date: Tuesday, April 20, 2010 - 5:02 am

folks,

Thanks to everybody (Eric stands out) for your patience. 
I ended mostly validating whats already been said. I have a lot of data
and can describe in details how i tested etc but it would require
patience in reading, so i will spare you;-> If you are interested let me
know and i will be happy to share.

Summary is: 
-rps good, gives higher throughput for apps
-rps not so good, latency worse but gets better with higher input rate
or increasing number of flows (which translates to higher pps)
-rps works well with newer hardware that has better cache structures.
[Gives great results on my test machine a Nehalem single processor, 4
cores each with two SMT threads that has a shared L2 between threads and
a shared L3 between cores]. 
Your selection of what the demux cpu is and where the target cpus are is
an influencing factor in the latency results. If you have a system with
multiple sockets, you should get better numbers if you stay within the
same socket relative to going across sockets.
-rps does a better job at helping schedule apps on same cpu thus
localizing the app. The throughput results with rps are very consistent
and better whereas in non-rps case, variance is _high_.

My next step is to do some forwarding tests - probably next week. I am
concerned here because i expect the cache misses to be higher than the
app scenario (netdev structure and attributes could be touched by many
cpus)

cheers,
jamal

--

From: Eric Dumazet
Date: Tuesday, April 20, 2010 - 6:13 am

Hi Jamal

I think your tests are very interesting, maybe could you publish them
somehow ? (I forgot to thank you about the previous report and nice
graph)

perf reports would be good too to help to spot hot points.



--

From: Eric Dumazet
Date: Wednesday, April 21, 2010 - 12:01 pm

Thanks a lot Jamal, this is really useful

Drawback of using a fixed src ip from your generator is that all flows
share the same struct dst entry on SUT. This might explain some glitches
you noticed (ip_route_input + ip_rcv at high level on slave/application
cpus)
Also note your test is one way. If some data was replied we would see
much use of the 'flows'

I notice epoll_ctl() used a lot, are you re-arming epoll each time you
receive a datagram ?

I see slave/application cpus hit _raw_spin_lock_irqsave() and  
_raw_spin_unlock_irqrestore().

Maybe a ring buffer could help (instead of a double linked queue) for
backlog, or the double queue trick, if Changli wants to respin his
patch.





--

From: Changli Gao
Date: Wednesday, April 21, 2010 - 6:27 pm

OK, I'll post a new patch against the current tree, so Jamal can have
a try. I am sorry, but I don't have a suitable computer for benchmark.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: jamal
Date: Thursday, April 22, 2010 - 5:12 am

yes, that would explain it ;-> I could have flows going to each cpu

In my next step i wanted to "route" these packets at app level and for
this stage of testing just wanted to sink the data to reduce experiment
variables. Reason:
The netdev structure would hit a lot of cache misses if i started using
it to both send/recv since lots of things are shared on tx/rx (example
napi tx pruning could happen on either tx or receive path); same thing
with qdisc path which is at netdev granularity.. I think there may be

I am using default libevent on debian. It looks very old and maybe
buggy. I will try to upgrade first and if still see the same

Ok, I will have some cycles later today/tomorrow or for sure on weekend.
My setup is still intact - so i can test.

cheers,
jamal

--

From: Changli Gao
Date: Saturday, April 24, 2010 - 7:31 pm

I read the code again, and found that we don't use spin_lock_irqsave();
we use local_irq_save() and spin_lock() instead, so
_raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() should not be
related to backlog. The lock may be sk_receive_queue.lock.

Jamal, did you use a single socket to serve all the clients?

BTW:  completion_queue and output_queue in softnet_data are both LIFO
queues. For completion_queue, FIFO is better, as the last used skb is
more likely in cache, and should be used first. Since slab always
caches the last used memory at the head, we'd better free the skbs in
FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: jamal
Date: Monday, April 26, 2010 - 4:35 am

Possible.
I am wondering if there's a way we can precisely nail where that is
happening? is lockstat any use? 
Fixing _raw_spin_lock_irqsave and friend is the lowest hanging fruit.

So looking at your patch now i see it is likely there was an improvement
made for non-rps case (moving out of loop some irq_enable etc).
i.e my results may not be crazy after adding your patch and seeing an
improvement for non-rps case.
However, whatever your patch did - it did not help the rps case:
call_function_single_interrupt() comes out higher in the profile,
and # of IPIs seems to have gone up (although i did not measure this, I


I think it will depend on how many of those skbs are sitting in the
completion queue, cache warmth etc. LIFO is always safest, you have
higher probability of finding a cached skb in front.

cheers,
jamal

--

From: Changli Gao
Date: Monday, April 26, 2010 - 6:35 am

Did you apply the patch from Eric? It would reduce the number of


We call kfree_skb() to release skbs to the slab allocator, and the slab
allocator stores them in a LIFO queue. If the completion queue is also a
LIFO queue, the most recently used skb will be at the front of the queue,
and will be released to the slab allocator first. The next time we
call alloc_skb(), the memory used by the skb at the end of the
completion queue will be returned instead of the hot one.

However, as Eric said, new drivers don't rely on completion queue, it
isn't a real problem, especially in your test case.
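
The argument above can be checked with a toy model. This is a sketch under strong assumptions: the allocator cache is modelled as a plain stack (last freed, first reused), skbs are just integer ids ordered by age, and `next_alloc_after_drain()` is a hypothetical name. It says nothing about real cache behaviour, only about which buffer ends up on top.

```c
#define NSKB 3

/*
 * skb ids are numbered by age: 0 was completed first (coldest),
 * NSKB - 1 was completed last (hottest).
 * The allocator cache is modelled as a stack: the last id freed is the
 * first one handed out again. Returns the id the next allocation gets.
 */
static int next_alloc_after_drain(int completion_is_lifo)
{
	int slab_stack[NSKB];
	int top = 0;
	int i;

	if (completion_is_lifo) {
		/* LIFO queue: head is the newest skb, so free newest first */
		for (i = NSKB - 1; i >= 0; i--)
			slab_stack[top++] = i;
	} else {
		/* FIFO queue: free oldest first, newest last */
		for (i = 0; i < NSKB; i++)
			slab_stack[top++] = i;
	}
	return slab_stack[top - 1];	/* allocator returns last freed */
}
```

In this model a LIFO completion queue makes the next allocation return id 0, the coldest buffer, while a FIFO queue makes it return the hottest one.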


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--

From: Rick Jones
Date: Wednesday, April 21, 2010 - 2:53 pm

No need to apologize,  if you like I'd be happy to discuss netperf usage tips 
offline.  That offer stands for everyone.

happy benchmarking,

rick jones
--

From: Tom Herbert
Date: Friday, April 16, 2010 - 8:57 am

From: Stephen Hemminger
Date: Wednesday, April 14, 2010 - 11:53 am

On Wed, 14 Apr 2010 07:53:06 -0400

I posted a patch to use sky2 hardware hash (RSS) which should lower the
cost per packet.

-- 
--

From: David Miller
Date: Thursday, April 15, 2010 - 1:42 am

From: jamal <hadi@cyberus.ca>

The RPS config is merely an indirect dependency on SMP as we have it
coded up in the Kconfig files, it's not meant to be user selectable
and is intended to be unconditionally on for SMP builds.
--
