Hi all, We've been having some issues with multicast packet loss, we were wondering if anyone knows anything about the behavior we're seeing. Background: we use multicast messaging with lots of messages per sec for our work. We recently transitioned many of our systems from an Ubuntu Dapper Drake ia32 distribution to Ubuntu Hardy Heron x86_64. Since the transition, we've noticed much more multicast packet loss, and we think it's related to the transition. Our particular theory is that it's specifically a 32 vs 64-bit issue. We narrowed the problem down to the attached program (mcasttest.cc). Run "mcasttest server" on one machine -- it'll send 500,000 messages small message to a multicast group, 50,000 messages per second. If we run "mcasttest client" on another machine, it'll receive all those messages and print a count at the end of how many messages it sees. It almost never loses any messages. However, if we run 4 copies of the client on the same machine, receiving the same data, then the programs usually sees fewer than 500,000 messages. We're running with: for i in $(seq 1 4); do (./mcasttest client &); done We know this because the program prints a count, but dropped packets also show up in ifconfig's "RX packets" section. Things we're curious about: do other people see similar problems? The tests we've done: we've tried this program on a bunch of different machines, all of which are running either dapper ia32 or hardy x86_64. Uniformly, the dapper machines have no problems but on certain machines, Hardy shows significant loss. We did some experiments on a troubled machine, varying the OS install, including mixed installations where the kernel was 64-bit and the userspace was 32-bit. This is what we found: On machines that exhibit this problem, the ksoftirqd process seems to be pegged to 100% CPU when receiving packets. Note: while we're on Ubuntu, we've tried this with other distros and have seen It's not exactly clear what exactly the problem is but ...
Hi Kenny Interesting... You forgot the mcasttest.cc program Any chance you try a recent kernel (2.6.29-rcX) ? Could you post "cat /proc/interrupts" results (one for working setup, another for non working/droping setup) --
I didn't work much on multicast, but i have heavy unicast udp streaming (PEP
for satellite).
First thing to check
net.core.wmem_max = 131071
net.core.wmem_default = 124928
net.ipv4.udp_mem = 379008 505344 758016
Usually they are too small by default.
next
netstat -s
Important part
Udp:
1263992126 packets received
260196 packets to unknown port received.
627001 packet receive errors
74235906 packets sent
RcvbufErrors: 56683
SndbufErrors: 4295851
in your case SndbufErrors matter.
--
Like Eric mentioned, I'd start with a latest kernel if at all possible. If it doesn't happen there, you're work is half over, you just need to figure out what changed, and tell Canonical to backport it. From there, you can solve this like most packet loss issues are solved: 1) Determine if its a rx or tx packet loss. From your comments above it sounds like this is an rx side issue 2) Look at statistics from the hardware to the application. Use ethtool & /proc/net/dev to get hardware packet loss stats, /proc/net/snmp netstat -s to get core network loss stats 3) Use those stats to identify where and why packets are getting dropped. Posting some summary of that data here is something we can help with if need be 4) Determine how to reduce the loss (i.e. code change vs. tuning) 5) Lather, rinse repeat (given that eliminating a drop cause in one location will likely increase througput, potentially putting strain on another location in the code path, possibly leading to more drops elsewhere. You had mentioned that ifconfig was showing rx drops, which indicates that your hardware rx buffer is likely overflowing. Usually the best way to fix that is to: 1) modify any available interrupt coalescing parameters on the driver such that interrupts have less latency between packet arrival and assertion 2) increase (if possible) the napi weight (I think thats still the right term) so that each napi poll interation receives more frames on the interface, draining that queue more quickly. Neil --
Ah, sorry, here's the test program attached. We've tried 2.6.28.1, but no, we haven't tried the 2.6.28.2 or the 2.6.29.-rcX. Right now, we are trying to step through the kernel versions until we see where the performance drops significantly. We'll try 2.6.29-rc soon and post the result. Neil Norman wrote: 1) Determine if its a rx or tx packet loss. From your comments above it sounds like this is an rx side issue We're pretty sure it's an rx issue. Other machines receiving at the same time will get all the packets. I'll gather the information mentioned and summarize in a subsequent email. Thanks! Kenny
2.6.29-rc contains UDP receive improvements (lockless) Problem is multicast handling was not yet updated, but could be :) I was asking you "cat /proc/interrupts" because I believe you might have a problem NIC interrupts being handled by one CPU only (when having problems) --
That would be expected (if irqbalance is running), and desireable, since spreading high volume interrupts like NICS accross multiple cores (or more specifically multiple L2 caches), is going increase your cache line miss rate significantly and decrease rx throughput. Although you do have a point here, if the system isn't running irqbalance, and the NICS irq affinity is spread accross multiple L2 caches, that would be a point of improvement performance-wise. Kenny, if you could provide the /proc/interrupts info along with /proc/cpuinfo and your stats that I asked about earlier, that would be a big help. Regards Neil --
Hi Neil, Here's the information you requested. Kenny kchang@beast8:~$ uname -a Linux beast8 2.6.24-19-server #1 SMP Wed Aug 20 18:43:06 UTC 2008 x86_64 GNU/Linux kchang@beast8:~$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU L5430 @ 2.66GHz stepping : 10 cpu MHz : 2659.999 cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm bogomips : 5322.91 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU L5430 @ 2.66GHz stepping : 10 cpu MHz : 2659.999 cache size : 6144 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm bogomips : 5320.03 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU L5430 @ 2.66GHz stepping : 10 cpu MHz : 2659.999 cache size : 6144 KB physical id : 0 siblings : 4 core id ...
This is for a working setup. -Kenny kchang@fiji:~$ uname -a Linux fiji 2.6.24-19-generic #1 SMP Wed Aug 20 17:53:40 UTC 2008 x86_64 GNU/Linux kchang@fiji:~$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz stepping : 11 cpu MHz : 1600.000 cache size : 4096 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 4791.31 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz stepping : 11 cpu MHz : 1600.000 cache size : 4096 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 4788.05 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz stepping : 11 cpu MHz : 1600.000 cache size : 4096 KB physical id : 0 siblings : 4 core id : ...
Are these quad core systems? Or dual core w/ hyperthreading? I ask because in your working setup you have 1/2 the number of cpus' and was not sure if you removed an entire package of if you just disabled hyperthreading. Neil --
Yeah, these are quad core systems. The 8 cpu system is a dual-processor quad-core. The other is my desktop, single cpu quad core. Kenny --
Ok, so their separate systms then. Did you actually experience drops on the 8-core system since the last reboot? I ask because even when its distributed across all 8 cores, you only have about 500 total interrupts from the NIC, and if you did get drops, something more than just affinity is wrong. Regardless, spreading interrupts across cores is definately a problem. As eric says, quad core chips are actually 2x2 cores, so you'll want to either just run irqbalance to assign an apropriate affinity to the NIC, or manually look at each cores physical id and sibling id, to assign affininty to a core or cores that share an L2 cache. If you need to, as you've found, you may need to disable msi interrupt mode on your bnx2 driver. That kinda stinks, but bnx2 IIRC isn't multiqueue, so its not like msi provides you any real performance gain. --
Hi Neil, Yeah, we've been rebooting this system left and right switch kernels. The results are fairly consistent. We were able to set the irq affinities, and as Wes had mentioned, what we see is that if we pin the softirq to 1 core, and pin the app to its sibling, we see really good performance, but as we load up other cores, the machine reaches a breaking point where all hell breaks loose and we drop a bunch. (we hadn't turned off msi btw..) While we were able to tune and adjust performance like that, in the end, it doesn't really explain the difference between earlier and recent kernels, also it doesn't quite explain the difference between machines. You mentioned it would be good to see the interrupts for each kernel, in light of the above information, would it still be useful for me to provide that? Kenny --
Just to be clear: on the 2 x quad core system, we can run with a 2.6.15 kernel and see no packet drops. In fact, we can run with 2.6.19, 2.6.20, and 2.6.21 just fine. 2.6.22 is the first kernel that shows problems. Kenny posted results from a working setup on a different machine. What I would really like to know is if whatever changed between 2.6.21 and 2.6.22 that broke things is confined just to bnx2. To make this a rigorous test, we would need to use the same machine with a different nic, which we don't have quite yet. An Intel Pro 1000 ethernet card is in the mail as I type this. I also tried forward porting the bnx2 driver in 2.6.21 to 2.6.22 (unsuccessfully), and building the most recent driver from the Broadcom site to Ubuntu Hardy's 2.6.24. The most recent driver with hardy 2.6.24 showed similar packet dropping problems. Hm, perhaps I'll try to build the most recent broadcom driver against 2.6.21. Wes --
Try oprofile session, you shall see a scheduler effect (dont want to call
this a regression, no need for another flame war).
also give us "vmstat 1" results (number of context switches per second)
On recent kernels, scheduler might be faster than before: You get more wakeups per
second and more work to do by softirq handler (it does more calls to scheduler,
thus less cpu cycles available for draining NIC RX queue in time)
opcontrol --vmlinux=/path/vmlinux --start
<run benchmark>
opreport -l /path/vmlinux | head -n 50
Recent schedulers tend to be optimum for lower latencies (and thus, on
a high level of wakeups, you get less bandwidth because of sofirq using
a whole CPU)
For example, if you have one tread receiving data on 4 or 8 sockets, you'll
probably notice better throughput (because it will sleep less often)
Multicast receiving on N sockets, with one thread waiting on each socket
is basically a way to trigger a scheduler storm. (N wakeups per packet).
So its more a benchmark to stress scheduler than stressing network stack...
Maybe its time to change user side, and not try to find an appropriate kernel :)
If you know you have to receive N frames per 20us units, then its better to :
Use non blocking sockets, and doing such loop :
{
usleep(20); // or try to compensate if this thread is slowed too much by following code
for (i = 0 ; i < N ; i++) {
while (revfrom(socket[N], ....) != -1)
receive_frame(...);
}
}
That way, you are pretty sure network softirq handler wont have to spend time trying
to wakeup 400.000 time per second one thread. All cpu cycles can be spent in NIC driver
and network stack.
Your thread will do 50.000 calls to nanosleep() per second, that is not really expensive,
then N recvfrom() per iteration. It should work on all past , current and future kernels.
--
+1 to this idea. Since the last oprofile traces showed significant variance in the time spent in schedule(), it might be worthwhile to investigate the affects of the application behavior on this. I might also be worth adding a systemtap probe to sys_recvmsg, to count how many times we receive frames on a working and non-working system. If the app is behaving differently on different kernels, and its affecting the number of times you go to get a frame out of the stack, that would affect your drop rates, and it would show up in sys_recvmsg Neil --
I did some work to our test program to spin on a non-blocking socket and it indeed seems to fix the problem, at least for 2.6.28.1, which was a kernel we had problems with. The number of context switches drastically drops -- from 200,000+ to less than 50! I haven't done totally comprehensive tests yet, so I don't want to officially state any results. I'm also out today, but Kenny may get a chance to play with it. Spinning on the socket is looking like an interesting solution, but we're a bit nervous about seeing our processes constantly running at 100% CPU. Does C++ have a MachineOnFire exception? Wes --
In light of what you said, I probably don't need to see it no, although if you go through testing on all these kernels again, I would suggest you take a look at the /proc/interrupt numbers yourself. Like you said, its odd that this behavior changed, since the fast receive path is fairly consistent. It may be that te nic driver your using (bnx2 I think?), had a change that either broke the ability to set affinity for msi interrupts, forcing an irq spread and killing performance, or perhaps some large inefficiency was introduced either in the interrupt handler itself, or in the napi poll method of the driver. Another good analysis technique would be to grab the latest kernel (which is 'broken' I think your chart indicated), and the nic driver from the last working kernel. Merge the driver into the latest kernel and see if the problem persists. If not, thats a pretty good indicator that a change in the driver has at least some culpability. Best --
I tried your program on my dev machines and 2.6.29 (each machine : two quad core cpus, 32bits kernel) With 8 clients, about 10% packet loss, Might be a scheduling problem, not sure... 50.000 packets per second, x 8 cpus = 400.000 wakeups per second... But at least UDP receive path seems OK. Thing is the receiver (softirq that queues the packet) seems to fight on socket lock with readers... I tried to setup IRQ affinities, but it doesnt work any more on bnx2 (unless using msi_disable=1) I tried playing with ethtool -C|c G|g params... And /proc/net/core/rmem_max (and setsockopt(RCVBUF) to set bigger receive buffers in your program) I can have 0% packet loss if booting with msi_disable and echo 1 >/proc/irq/16/smp_affinities (16 being interrupt of eth0 NIC) then, a second run gave me errors, about 2%, oh well... oprofile numbers without playing IRQ affinities: CPU: Core 2, speed 2999.89 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples % symbol name 327928 10.1427 schedule 259625 8.0301 mwait_idle 187337 5.7943 __skb_recv_datagram 109854 3.3977 lock_sock_nested 104713 3.2387 tick_nohz_stop_sched_tick 98831 3.0568 select_nohz_load_balancer 88163 2.7268 skb_release_data 78552 2.4296 update_curr 75241 2.3272 getnstimeofday 71400 2.2084 set_next_entity 67629 2.0917 get_next_timer_interrupt 67375 2.0839 sched_clock_tick 58112 1.7974 enqueue_entity 56462 1.7463 udp_recvmsg 55049 1.7026 copy_to_user 54277 1.6788 sched_clock_cpu 54031 1.6712 __copy_skb_header 51859 1.6040 __slab_free 51786 1.6017 prepare_to_wait_exclusive 51776 1.6014 sock_def_readable 50062 1.5484 try_to_wake_up 42182 1.3047 __switch_to 41631 1.2876 read_tsc 38337 1.1857 tick_nohz_restart_sched_tick 34358 1.0627 cpu_idle 34194 1.0576 native_sched_clock 33812 ...
I'm not 100% versed on this, but IIRC, some hardware simply can't set irq affinity when operating in msi interrupt mode. If this is the case with this particular bnx2 card, then I would expect some packet loss, simply due to the constant cache misses. It would be interesting to re-run your oprofile cases, counting L2 cache hits/misses (if your cpu supports that class of counter) for both bnx2 running in msi enabled mode and msi disabled mode. It would also be interesting to use a different card, that can set irq affinity, and compare loss with irqbalance on, and irqbalance off with irq afninty set to all cpus. --
booted with msi_disable=1, IRQ of eth0 handled by CPU0 only, so that oprofile results sorted on CPU0 numbers. We can see scheduler has hard time to cope with this workload with more of two CPUS OK up to 30.000 (* 8 sockets) packets per second. CPU0 is 100% handling softirq (ksoftirqd/0) CPU: Core 2, speed 3000.31 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 Samples on CPU 0 Samples on CPU 1 Samples on CPU 2 Samples on CPU 3 Samples on CPU 4 Samples on CPU 5 Samples on CPU 6 Samples on CPU 7 samples % samples % samples % samples % samples % samples % samples % samples % symbol name 6152 12.5595 3 0.0098 3 0.0090 5 0.0156 1 0.0582 0 0 2 0.0065 3 0.0169 enqueue_entity 4453 9.0909 2 0.0065 3 0.0090 4 0.0125 5 0.2910 0 0 1 0.0033 2 0.0113 try_to_wake_up 3837 7.8333 3 0.0098 8 0.0241 0 0 0 0 0 0 0 0 0 0 sock_def_readable 3694 7.5414 0 0 0 0 0 0 0 0 0 0 0 0 0 0 __copy_skb_header 2320 4.7363 1 0.0033 2 0.0060 2 0.0062 1 0.0582 1 0.0028 2 0.0065 0 0 resched_task 1818 3.7115 6 0.0196 32 0.0962 0 0 0 0 0 0 0 0 0 0 sock_queue_rcv_skb 1776 3.6257 0 0 0 0 0 0 0 0 0 0 0 0 0 0 udp_queue_rcv_skb 1677 3.4236 0 0 1 0.0030 0 0 1 0.0582 1 0.0028 0 0 0 ...
This explains alot. if the application is scheduled to run on the same cpu that has the irq for the NIC bound to it, you get a perf boost by not having to warm up two caches (1 for the app cpu and one for the irq & softirq work), but you loose it and then some fighting for cpu time. If both the app and the irq are on the same cpu, and we spend so much time in softirq context, we will eventually overflow higher up the network stack, as the application doesn't have enough time to dequeue frames. It may also speak to the need to make the bnx2 napi routine more efficient :) Neil --
When will that happen? --
When proven necessary :) Kenny problem is about scheduling storm. The extra spin_lock() in UDP multicast receive is not a problem. --
My tests here show that 2.6.29-rc5 still looses ~5usec vs. 2.6.22 via UDP. This would fix a regression..... --
Could you elaborate ? I just retried Kenny test here. As one cpu is looping in ksoftirqd, only this cpu touches the spin_lock, so spin_lock()/spin_unlock() is free. oprofile shows that udp stack is lightweight in this case. Problem is about wakeing up so many threads... CPU: Core 2, speed 3000.16 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 356857 356857 15.1789 15.1789 schedule 274028 630885 11.6557 26.8346 mwait_idle 189218 820103 8.0484 34.8829 __skb_recv_datagram 116903 937006 4.9725 39.8554 skb_release_data 103152 1040158 4.3876 44.2430 lock_sock_nested 89600 1129758 3.8111 48.0541 udp_recvmsg 74171 1203929 3.1549 51.2089 copy_to_user 72299 1276228 3.0752 54.2842 set_next_entity 60392 1336620 2.5688 56.8529 sched_clock_cpu 54026 1390646 2.2980 59.1509 __slab_free 50212 1440858 2.1358 61.2867 prepare_to_wait_exclusive 38689 1479547 1.6456 62.9323 cpu_idle 38142 1517689 1.6224 64.5547 __switch_to 36701 1554390 1.5611 66.1157 hrtick_start_fair 36673 1591063 1.5599 67.6756 dst_release 36268 1627331 1.5427 69.2183 sys_recvfrom 35052 1662383 1.4909 70.7092 kmem_cache_free 32680 1695063 1.3900 72.0992 pick_next_task_fair 31209 1726272 1.3275 73.4267 try_to_wake_up 30382 1756654 1.2923 74.7190 dequeue_task_fair 29048 1785702 1.2356 75.9545 __copy_skb_header 28801 1814503 1.2250 77.1796 sock_def_readable 28655 1843158 1.2188 78.3984 enqueue_task_fair 27232 1870390 1.1583 79.5567 update_curr 21688 1892078 0.9225 80.4792 copy_from_user 18832 1910910 0.8010 81.2802 ...
My guess is commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting)
Hideo Aoki [Mon, 31 Dec 2007 08:29:24 +0000 (00:29 -0800)]
and 3ab224be6d69de912ee21302745ea45a99274dbc
[NET] CORE: Introducing new memory accounting interface.
Date: Mon Dec 31 00:11:19 2007 -0800
are responsible for slowdown, because they add some
lock_sock()/release_sock() pairs.
function udp_recvmsg()
out_free:
+ lock_sock(sk);
skb_free_datagram(sk, skb);
+ release_sock(sk);
out:
I wonder why we can call __sk_mem_reclaim() when dequeing *one* UDP
frame in queue, while many others can still be in sk_receive_queue.
This defeats memory accounting, no ?
We should avoid lock_sock() if possible, or risk delaying
softirq RX in udp_queue_rcv_skb()
--
On latest kernels, we have a "timer_slack_ns" default of 50.000 ns, aka 50us So usleep(20) sleeps much more than expected. You might add in your program a call to prcrl() to setup a smaller timer_slack : #ifndef PR_SET_TIMERSLACK #define PR_SET_TIMERSLACK 29 #endif /* * Setup a timer resolution of 1000 ns : 1 us */ prctl(PR_SET_TIMERSLACK, 1000); --
