Re: Multicast packet loss

Previous thread: ixgbe in kernel driver question by Paweł Staszewski on Friday, January 30, 2009 - 9:11 am. (4 messages)

Next thread: Re: skge finds PCI error cmd=0x117 status=0x22b0 in 2.6.27.7 by Stephen Hemminger on Friday, January 30, 2009 - 12:15 pm. (1 message)
From: Kenny Chang
Date: Friday, January 30, 2009 - 10:49 am

Hi all,

We've been having some issues with multicast packet loss, we were wondering
if anyone knows anything about the behavior we're seeing.

Background: we use multicast messaging with lots of messages per sec for our
work. We recently transitioned many of our systems from an Ubuntu Dapper Drake
ia32 distribution to Ubuntu Hardy Heron x86_64. Since the transition, we've
noticed much more multicast packet loss, and we think it's related to the
transition. Our particular theory is that it's specifically a 32 vs 64-bit
issue.

We narrowed the problem down to the attached program (mcasttest.cc).  Run
"mcasttest server" on one machine -- it'll send 500,000 messages small message
to a multicast group, 50,000 messages per second.  If we run "mcasttest client"
on another machine, it'll receive all those messages and print a count at the
end of how many messages it sees. It almost never loses any messages. However,
if we run 4 copies of the client on the same machine, receiving the same data,
then the programs usually sees fewer than 500,000 messages. We're running with:

for i in $(seq 1 4); do (./mcasttest client &); done

We know this because the program prints a count, but dropped packets also
show up in ifconfig's "RX packets" section.

Things we're curious about: do other people see similar problems?  The tests
we've done: we've tried this program on a bunch of different machines, all of
which are running either dapper ia32 or hardy x86_64. Uniformly, the dapper
machines have no problems but on certain machines, Hardy shows significant loss. 
We did some experiments on a troubled machine, varying the OS install, 
including mixed installations where the kernel was 64-bit and the userspace was
32-bit. This is what we found:

On machines that exhibit this problem, the ksoftirqd process seems to be 
pegged to 100% CPU when receiving packets.

Note: while we're on Ubuntu, we've tried this with other distros and have seen

It's not exactly clear what exactly the problem is but ...
From: Eric Dumazet
Date: Friday, January 30, 2009 - 12:04 pm

Hi Kenny

Interesting... You forgot the mcasttest.cc program

Any chance you try a recent kernel (2.6.29-rcX) ?

Could you post "cat /proc/interrupts" results (one for working
 setup, another for non working/droping setup)


--

From: Denys Fedoryschenko
Date: Friday, January 30, 2009 - 12:17 pm

I didn't work much on multicast, but i have heavy unicast udp streaming (PEP 
for satellite).

First thing to check
net.core.wmem_max = 131071
net.core.wmem_default = 124928
net.ipv4.udp_mem = 379008       505344  758016

Usually they are too small by default.

next
netstat -s

Important part
Udp:
    1263992126 packets received
    260196 packets to unknown port received.
    627001 packet receive errors
    74235906 packets sent
    RcvbufErrors: 56683
    SndbufErrors: 4295851


in your case     SndbufErrors matter.
--

From: Neil Horman
Date: Friday, January 30, 2009 - 1:03 pm

Like Eric mentioned, I'd start with a latest kernel if at all possible.  If it
doesn't happen there, you're work is half over, you just need to figure out what
changed, and tell Canonical to backport it.

From there, you can solve this like most packet loss issues are solved:

1) Determine if its a rx or tx packet loss.  From your comments above it sounds
like this is an rx side issue

2) Look at statistics from the hardware to the application.  Use ethtool &
/proc/net/dev to get hardware packet loss stats, /proc/net/snmp netstat -s to
get core network loss stats

3) Use those stats to identify where and why packets are getting dropped.
Posting some summary of that data here is something we can help with if need be

4) Determine how to reduce the loss (i.e. code change vs. tuning)

5) Lather, rinse repeat (given that eliminating a drop cause in one location
will likely increase througput, potentially putting strain on another location
in the code path, possibly leading to more drops elsewhere. 


You had mentioned that ifconfig was showing rx drops, which indicates that your
hardware rx buffer is likely overflowing.  Usually the best way to fix that is
to:

1) modify any available interrupt coalescing parameters on the driver such that
interrupts have less latency between packet arrival and assertion

2) increase (if possible) the napi weight (I think thats still the right term)
so that each napi poll interation receives more frames on the interface,
draining that queue more quickly.

Neil

--

From: Kenny Chang
Date: Friday, January 30, 2009 - 3:29 pm

Ah, sorry, here's the test program attached.

We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the 
2.6.29.-rcX.

Right now, we are trying to step through the kernel versions until we 
see where the performance drops significantly.  We'll try 2.6.29-rc soon 
and post the result.

Neil Norman wrote:

1) Determine if its a rx or tx packet loss.  From your comments above it sounds
like this is an rx side issue

   We're pretty sure it's an rx issue.  Other machines receiving at the 
same time will
   get all the packets.

I'll gather the information mentioned and summarize in a subsequent email.

Thanks!
Kenny


From: Eric Dumazet
Date: Friday, January 30, 2009 - 3:41 pm

2.6.29-rc contains UDP receive improvements (lockless)

Problem is multicast handling was not yet updated, but could be :)


I was asking you "cat /proc/interrupts" because I believe you might
have a problem NIC interrupts being handled by one CPU only (when having problems)


--

From: Neil Horman
Date: Saturday, January 31, 2009 - 9:03 am

That would be expected (if irqbalance is running), and desireable, since
spreading high volume interrupts like NICS accross multiple cores (or more
specifically multiple L2 caches), is going increase your cache line miss rate
significantly and decrease rx throughput.

Although you do have a point here, if the system isn't running irqbalance, and
the NICS irq affinity is spread accross multiple L2 caches, that would be a
point of improvement performance-wise.  

Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo
and your stats that I asked about earlier, that would be a big help.

Regards
Neil

--

From: Kenny Chang
Date: Monday, February 2, 2009 - 9:13 am

Hi Neil,

Here's the information you requested.

Kenny

kchang@beast8:~$ uname -a
Linux beast8 2.6.24-19-server #1 SMP Wed Aug 20 18:43:06 UTC 2008 x86_64 
GNU/Linux
kchang@beast8:~$ cat /proc/cpuinfo
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5322.91
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 1
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 1
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips    : 5320.03
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

processor    : 2
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           L5430  @ 2.66GHz
stepping    : 10
cpu MHz        : 2659.999
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id    ...
From: Kenny Chang
Date: Monday, February 2, 2009 - 9:48 am

This is for a working setup.

-Kenny

kchang@fiji:~$ uname -a
Linux fiji 2.6.24-19-generic #1 SMP Wed Aug 20 17:53:40 UTC 2008 x86_64 
GNU/Linux
kchang@fiji:~$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping        : 11
cpu MHz         : 1600.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr lahf_lm
bogomips    : 4791.31
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 1
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping    : 11
cpu MHz        : 1600.000
cache size    : 4096 KB
physical id    : 0
siblings    : 4
core id        : 1
cpu cores    : 4
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx 
est tm2 ssse3 cx16 xtpr lahf_lm
bogomips    : 4788.05
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

processor    : 2
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
stepping    : 11
cpu MHz        : 1600.000
cache size    : 4096 KB
physical id    : 0
siblings    : 4
core id        : ...
From: Neil Horman
Date: Tuesday, February 3, 2009 - 4:55 am

Are these quad core systems?  Or dual core w/ hyperthreading?  I ask because in
your working setup you have 1/2 the number of cpus' and was not sure if you
removed an entire package of if you just disabled hyperthreading.


Neil

--

From: Kenny Chang
Date: Tuesday, February 3, 2009 - 8:20 am

Yeah, these are quad core systems.  The 8 cpu system is a dual-processor 
quad-core.  The other is my desktop, single cpu quad core.

Kenny

--

From: Neil Horman
Date: Tuesday, February 3, 2009 - 6:15 pm

Ok, so their separate systms then.  Did you actually experience drops on the
8-core system since the last reboot?  I ask because even when its distributed
across all 8 cores, you only have about 500 total interrupts from the NIC, and
if you did get drops, something more than just affinity is wrong.

Regardless, spreading interrupts across cores is definately a problem.  As eric
says, quad core chips are actually 2x2 cores, so you'll want to either just run
irqbalance to assign an apropriate affinity to the NIC, or manually look at each
cores physical id and sibling id, to assign affininty to a core or cores that
share an L2 cache.  If you need to, as you've found, you may need to disable msi
interrupt mode on your bnx2 driver.  That kinda stinks, but bnx2 IIRC isn't
multiqueue, so its not like msi provides you any real performance gain.

--

From: Kenny Chang
Date: Wednesday, February 4, 2009 - 9:07 am

Hi Neil,

Yeah, we've been rebooting this system left and right switch kernels.  
The results are fairly consistent.  We were able to set the irq 
affinities, and as Wes had mentioned, what we see is that if we pin the 
softirq to 1 core, and pin the app to its sibling, we see really good 
performance, but as we load up other cores, the machine reaches a 
breaking point where all hell breaks loose and we drop a bunch.  (we 
hadn't turned off msi btw..)

While we were able to tune and adjust performance like that, in the end, 
it doesn't really explain the difference between earlier and recent 
kernels, also it doesn't quite explain the difference between machines.

You mentioned it would be good to see the interrupts for each kernel, in 
light of the above information, would it still be useful for me to 
provide that?

Kenny

--

From: Wesley Chow
Date: Wednesday, February 4, 2009 - 9:46 am

Just to be clear: on the 2 x quad core system, we can run with a  
2.6.15 kernel and see no packet drops. In fact, we can run with  
2.6.19, 2.6.20, and 2.6.21 just fine. 2.6.22 is the first kernel that  
shows problems.

Kenny posted results from a working setup on a different machine.

What I would really like to know is if whatever changed between 2.6.21  
and 2.6.22 that broke things is confined just to bnx2. To make this a  
rigorous test, we would need to use the same machine with a different  
nic, which we don't have quite yet. An Intel Pro 1000 ethernet card is  
in the mail as I type this.

I also tried forward porting the bnx2 driver in 2.6.21 to 2.6.22  
(unsuccessfully), and building the most recent driver from the  
Broadcom site to Ubuntu Hardy's 2.6.24. The most recent driver with  
hardy 2.6.24 showed similar packet dropping problems. Hm, perhaps I'll  
try to build the most recent broadcom driver against 2.6.21.


Wes

--

From: Eric Dumazet
Date: Wednesday, February 4, 2009 - 11:11 am

Try oprofile session, you shall see a scheduler effect (dont want to call
this a regression, no need for another flame war).

also give us "vmstat 1" results  (number of context switches per second)

On recent kernels, scheduler might be faster than before: You get more wakeups per
second and more work to do by softirq handler (it does more calls to scheduler,
thus less cpu cycles available for draining NIC RX queue in time)

opcontrol --vmlinux=/path/vmlinux --start
<run benchmark>
opreport -l /path/vmlinux | head -n 50

Recent schedulers tend to be optimum for lower latencies (and thus, on
a high level of wakeups, you get less bandwidth because of sofirq using
a whole CPU)

For example, if you have one tread receiving data on 4 or 8 sockets, you'll
probably notice better throughput (because it will sleep less often)

Multicast receiving on N sockets, with one thread waiting on each socket
is basically a way to trigger a scheduler storm. (N wakeups per packet).
So its more a benchmark to stress scheduler than stressing network stack...


Maybe its time to change user side, and not try to find an appropriate kernel :)

If you know you have to receive N frames per 20us units, then its better to :
Use non blocking sockets, and doing such loop :

{
usleep(20); // or try to compensate if this thread is slowed too much by following code
for (i = 0 ; i < N ; i++) {
	while (revfrom(socket[N], ....) != -1)
		receive_frame(...);
	}
}

That way, you are pretty sure network softirq handler wont have to spend time trying
to wakeup 400.000 time per second one thread. All cpu cycles can be spent in NIC driver
and network stack.

Your thread will do 50.000 calls to nanosleep() per second, that is not really expensive,
then N recvfrom() per iteration. It should work on all past , current and future kernels.


--

From: Neil Horman
Date: Thursday, February 5, 2009 - 6:33 am

+1 to this idea.  Since the last oprofile traces showed significant variance in
the time spent in schedule(), it might be worthwhile to investigate the affects
of the application behavior on this.  I might also be worth adding a systemtap
probe to sys_recvmsg, to count how many times we receive frames on a working and
non-working system.  If the app is behaving differently on different kernels,
and its affecting the number of times you go to get a frame out of the stack,
that would affect your drop rates, and it would show up in sys_recvmsg

Neil

--

From: Wesley Chow
Date: Thursday, February 5, 2009 - 6:46 am

I did some work to our test program to spin on a non-blocking socket  
and it indeed seems to fix the problem, at least for 2.6.28.1, which  
was a kernel we had problems with. The number of context switches  
drastically drops -- from 200,000+ to less than 50!

I haven't done totally comprehensive tests yet, so I don't want to  
officially state any results. I'm also out today, but Kenny may get a  
chance to play with it. Spinning on the socket is looking like an  
interesting solution, but we're a bit nervous about seeing our  
processes constantly running at 100% CPU. Does C++ have a  
MachineOnFire exception?


Wes

--

From: Neil Horman
Date: Thursday, February 5, 2009 - 6:29 am

In light of what you said, I probably don't need to see it no, although if you
go through testing on all these kernels again, I would suggest you take a look
at the /proc/interrupt numbers yourself.  Like you said, its odd that this
behavior changed, since the fast receive path is fairly consistent.  It may be
that te nic driver your using (bnx2 I think?), had a change that either broke
the ability to set affinity for msi interrupts, forcing an irq spread and
killing performance, or perhaps some large inefficiency was introduced either in
the interrupt handler itself, or in the napi poll method of the driver.  Another
good analysis technique would be to grab the latest kernel (which is 'broken' I
think your chart indicated), and the nic driver from the last working kernel.
Merge the driver into the latest kernel and see if the problem persists.  If
not, thats a pretty good indicator that a change in the driver has at least some
culpability.

Best
--

From: Eric Dumazet
Date: Sunday, February 1, 2009 - 5:40 am

I tried your program on my dev machines and 2.6.29 (each machine : two quad core cpus, 32bits kernel)

With 8 clients, about 10% packet loss, 

Might be a scheduling problem, not sure... 50.000 packets per second, x 8 cpus = 400.000
wakeups per second... But at least UDP receive path seems OK.

Thing is the receiver (softirq that queues the packet) seems to fight on socket lock with
readers...

I tried to setup IRQ affinities, but it doesnt work any more on bnx2 (unless using msi_disable=1)

I tried playing with ethtool -C|c G|g params...
And /proc/net/core/rmem_max (and setsockopt(RCVBUF) to set bigger receive buffers in your program)

I can have 0% packet loss if booting with msi_disable and

echo 1 >/proc/irq/16/smp_affinities

(16 being interrupt of eth0 NIC)

then, a second run gave me errors, about 2%, oh well...


oprofile numbers without playing IRQ affinities:

CPU: Core 2, speed 2999.89 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
327928   10.1427  schedule
259625    8.0301  mwait_idle
187337    5.7943  __skb_recv_datagram
109854    3.3977  lock_sock_nested
104713    3.2387  tick_nohz_stop_sched_tick
98831     3.0568  select_nohz_load_balancer
88163     2.7268  skb_release_data
78552     2.4296  update_curr
75241     2.3272  getnstimeofday
71400     2.2084  set_next_entity
67629     2.0917  get_next_timer_interrupt
67375     2.0839  sched_clock_tick
58112     1.7974  enqueue_entity
56462     1.7463  udp_recvmsg
55049     1.7026  copy_to_user
54277     1.6788  sched_clock_cpu
54031     1.6712  __copy_skb_header
51859     1.6040  __slab_free
51786     1.6017  prepare_to_wait_exclusive
51776     1.6014  sock_def_readable
50062     1.5484  try_to_wake_up
42182     1.3047  __switch_to
41631     1.2876  read_tsc
38337     1.1857  tick_nohz_restart_sched_tick
34358     1.0627  cpu_idle
34194     1.0576  native_sched_clock
33812 ...
From: Neil Horman
Date: Monday, February 2, 2009 - 6:45 am

I'm not 100% versed on this, but IIRC, some hardware simply can't set irq
affinity when operating in msi interrupt mode.  If this is the case with this
particular bnx2 card, then I would expect some packet loss, simply due to the
constant cache misses.  It would be interesting to re-run your oprofile cases,
counting L2 cache hits/misses (if your cpu supports that class of counter) for
both bnx2 running in msi enabled mode and msi disabled mode.  It would also be
interesting to use a different card, that can set irq affinity, and compare loss
with irqbalance on, and irqbalance off with irq afninty set to all cpus.

--

From: Eric Dumazet
Date: Monday, February 2, 2009 - 9:57 am

booted with msi_disable=1, IRQ of eth0 handled by CPU0 only, so that
oprofile results sorted on CPU0 numbers.

We can see scheduler has hard time to cope with this workload with more of two CPUS

OK up to 30.000 (* 8 sockets) packets per second. 

CPU0 is 100% handling softirq (ksoftirqd/0)


CPU: Core 2, speed 3000.31 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Samples on CPU 0
Samples on CPU 1
Samples on CPU 2
Samples on CPU 3
Samples on CPU 4
Samples on CPU 5
Samples on CPU 6
Samples on CPU 7
samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        samples  %        symbol name
6152     12.5595  3         0.0098  3         0.0090  5         0.0156  1         0.0582  0              0  2         0.0065  3         0.0169  enqueue_entity
4453      9.0909  2         0.0065  3         0.0090  4         0.0125  5         0.2910  0              0  1         0.0033  2         0.0113  try_to_wake_up
3837      7.8333  3         0.0098  8         0.0241  0              0  0              0  0              0  0              0  0              0  sock_def_readable
3694      7.5414  0              0  0              0  0              0  0              0  0              0  0              0  0              0  __copy_skb_header
2320      4.7363  1         0.0033  2         0.0060  2         0.0062  1         0.0582  1         0.0028  2         0.0065  0              0  resched_task
1818      3.7115  6         0.0196  32        0.0962  0              0  0              0  0              0  0              0  0              0  sock_queue_rcv_skb
1776      3.6257  0              0  0              0  0              0  0              0  0              0  0              0  0              0  udp_queue_rcv_skb
1677      3.4236  0              0  1         0.0030  0              0  1         0.0582  1         0.0028  0              0  0           ...
From: Neil Horman
Date: Monday, February 2, 2009 - 11:22 am

This explains alot.  if the application is scheduled to run on the same cpu that
has the irq for the NIC bound to it, you get a perf boost by not having to warm
up two caches (1 for the app cpu and one for the irq & softirq work), but you
loose it and then some fighting for cpu time.  If both the app and the irq are
on the same cpu, and we spend so much time in softirq context, we will
eventually overflow higher up the network stack, as the application doesn't have
enough time to dequeue frames.

It may also speak to the need to make the bnx2 napi routine more efficient :)

Neil

--

From: Christoph Lameter
Date: Friday, February 27, 2009 - 11:40 am

When will that happen?

--

From: Eric Dumazet
Date: Friday, February 27, 2009 - 11:56 am

When proven necessary :)

Kenny problem is about scheduling storm. The extra spin_lock() in UDP 
multicast receive is not a problem.


--

From: Christoph Lameter
Date: Friday, February 27, 2009 - 12:45 pm

My tests here show that 2.6.29-rc5 still looses ~5usec vs. 2.6.22 via
UDP. This would fix a regression.....

--

From: Eric Dumazet
Date: Friday, February 27, 2009 - 1:12 pm

Could you elaborate ?

I just retried Kenny test here. As one cpu is looping in ksoftirqd, only this cpu
touches the spin_lock, so spin_lock()/spin_unlock() is free.

oprofile shows that udp stack is lightweight in this case. Problem is about wakeing up
so many threads...

CPU: Core 2, speed 3000.16 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
356857   356857        15.1789  15.1789    schedule
274028   630885        11.6557  26.8346    mwait_idle
189218   820103         8.0484  34.8829    __skb_recv_datagram
116903   937006         4.9725  39.8554    skb_release_data
103152   1040158        4.3876  44.2430    lock_sock_nested
89600    1129758        3.8111  48.0541    udp_recvmsg
74171    1203929        3.1549  51.2089    copy_to_user
72299    1276228        3.0752  54.2842    set_next_entity
60392    1336620        2.5688  56.8529    sched_clock_cpu
54026    1390646        2.2980  59.1509    __slab_free
50212    1440858        2.1358  61.2867    prepare_to_wait_exclusive
38689    1479547        1.6456  62.9323    cpu_idle
38142    1517689        1.6224  64.5547    __switch_to
36701    1554390        1.5611  66.1157    hrtick_start_fair
36673    1591063        1.5599  67.6756    dst_release
36268    1627331        1.5427  69.2183    sys_recvfrom
35052    1662383        1.4909  70.7092    kmem_cache_free
32680    1695063        1.3900  72.0992    pick_next_task_fair
31209    1726272        1.3275  73.4267    try_to_wake_up
30382    1756654        1.2923  74.7190    dequeue_task_fair
29048    1785702        1.2356  75.9545    __copy_skb_header
28801    1814503        1.2250  77.1796    sock_def_readable
28655    1843158        1.2188  78.3984    enqueue_task_fair
27232    1870390        1.1583  79.5567    update_curr
21688    1892078        0.9225  80.4792    copy_from_user
18832    1910910        0.8010  81.2802    ...
From: Eric Dumazet
Date: Friday, February 27, 2009 - 2:36 pm

My guess is commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting)
Hideo Aoki [Mon, 31 Dec 2007 08:29:24 +0000 (00:29 -0800)]

and 3ab224be6d69de912ee21302745ea45a99274dbc
[NET] CORE: Introducing new memory accounting interface.
Date:   Mon Dec 31 00:11:19 2007 -0800

are responsible for slowdown, because they add some
lock_sock()/release_sock() pairs.

function udp_recvmsg()

out_free:
+       lock_sock(sk);
        skb_free_datagram(sk, skb);
+       release_sock(sk);
 out:

I wonder why we can call __sk_mem_reclaim() when dequeing *one* UDP
frame in queue, while many others can still be in sk_receive_queue.
This defeats memory accounting, no ?

We should avoid lock_sock() if possible, or risk delaying
softirq RX in udp_queue_rcv_skb()


--

From: Eric Dumazet
Date: Monday, February 2, 2009 - 6:53 am

On latest kernels, we have a "timer_slack_ns" default of 50.000 ns, aka 50us

So usleep(20) sleeps much more than expected.

You might add in your program a call to prcrl() to setup a smaller timer_slack :

#ifndef PR_SET_TIMERSLACK
#define PR_SET_TIMERSLACK 29
#endif
/*
 * Setup a timer resolution of 1000 ns : 1 us
 */
prctl(PR_SET_TIMERSLACK, 1000); 




--

Previous thread: ixgbe in kernel driver question by Paweł Staszewski on Friday, January 30, 2009 - 9:11 am. (4 messages)

Next thread: Re: skge finds PCI error cmd=0x117 status=0x22b0 in 2.6.27.7 by Stephen Hemminger on Friday, January 30, 2009 - 12:15 pm. (1 message)