I'm running into an interesting problem with joining multiple
multicast feeds. If you join multiple multicast feeds using
setsockopt(...,IP_ADD_MEMBERSHIP...) it causes packets on UNRELATED
multicast feeds to get dropped. We have a multicast feed on a rock
solid network, and we were very surprised to see dropped packets. The
cause was a different process/program being run by a different user
joining a bunch of mulitcast feeds.
I can recreate this with a fairly simple testcase (attached below.)
The problem doesn't happen with unicast UDP data, and it doesn't
happen with loopback, so you need at least two systems to run this
(and what subscriber to netdev doesn't have at least two systems.) To
recreate, run "receiver" on one system, "sender", on another, and then
"joiner" on the receiving system. You should see a message pop out
saying that packets have been dropped. I've recreated this on a few
different kernel versions (the latest being 2.6.28) and a few
different sets off hardware. I HAVEN"T recreated it if the system
doing the IP_ADD_MEMBERSHIP specifies a specific interface rather than
INADDR_ANY. I'm not sure if that is core to the issue or not. You
may also need to bump the value in
/proc/sys/net/ipv4/igmp_max_memberships (though that hasn't seemed
necessary for me.)
I poked around in igmp.c, but its mojo exceeds my threshold. If
anyone has any ideas or questions I'd be happy to hear them.
diff -uNr null/joiner.c multicast/joiner.c
--- null/joiner.c 1969-12-31 18:00:00.000000000 -0600
+++ multicast/joiner.c 2009-03-14 15:04:10.000000000 -0500
@@ -0,0 +1,44 @@
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <unistd.h>
+
+#define NUMSOCK 55
+
+int main(int argc, char **argv)
+{
+ struct ip_mreq mreq;
+ int i;
+ int sd;
+ char ipaddr[64];
+
+ for (i=0; i<NUMSOCK; i++) {
+ sd = socket(PF_INET, ...I could not reproduce the problem on my machines (bnx2 adapter), even if changing NUMSOCK from 55 to 200 in joiner.c Is your network a 100Mb one or Gigabit ? Try to slow down your joiner ? (Could be a flood of IGMP messages your router/switch cannot cope with) Please describe your "rock solid" network setup (kind of network adapters you have, kind of router...) Each time an address is added, NIC driver have to reprogram mcfilter of the device. Maybe some NIC can drop some packets at this moment... If using tcpdump to force promiscuous mode on the device also triggers packet losses ? (see also ifconfig ethX promisc|allmulti) --
Thanks for trying Eric. Based on your email I did some more testing and thus far I've only recreated this on x86_64 arches, not on i386. Which arch did you The problem originally manifest itself at work on a 24-core Dell server with 6 NICs. The network is gigabit with a Cisco 4900 switch. I recreated it in my basement on my little white-box system and a cheap netgear switch. The NIC at work is Intel e1000e driver, the one I haven't had a chance to play with promiscuous yet... -- Dave B --
I tried both, 32 and 64 bit kernels. No problems so far. --
Also, is using a third machine to start your joiner program is able to trigger packet losses too ? --
Eric, based on your inability to recreate this, I tried on some other
hardware I had lying around that has an AMD chipset built-in NIC.
I could not recreate the problem on that hardware. I'm starting to
think this is an e1000 problem. In both the e1000 and e1000e
drivers they do the following logic:
/* clear the old settings from the multicast hash table */
for (i = 0; i < mta_reg_count; i++) {
E1000_WRITE_REG_ARRAY(hw, MTA, i, 0);
E1000_WRITE_FLUSH();
}
/* load any remaining addresses into the hash table */
for (; mc_ptr; mc_ptr = mc_ptr->next) {
hash_value = e1000_hash_mc_addr(hw, mc_ptr->da_addr);
e1000_mta_set(hw, hash_value);
}
There's clearly a window where the NIC doesn't have the multicast
addresses loaded. This may just be broken-as-designed. If anyone
else happens to have some e1000 hardware and wants to see if you
can recreate this, I'd be curious.
Some other notes just FYI...
- RcvbufErrors in /proc/net/snmp doesn't get incremented when this happens
- there are no messages in dmesg
- frames get dropped when the program calls exit() and all the sockets
get closed
(and multicast joins dropped) as well as when the ADD_MEMBERSHIPs happen
- The problem happens even when adding a sleep(1) in between each of the
ADD_MEMBERSHIP calls.
--
Dave B
--
Ouch, you are probably right, this code needs a change.
tg3 for example has a loop bulding hash values in a local array,
then a write of this array on NIC.
for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
i++, mclist = mclist->next) {
crc = calc_crc (mclist->dmi_addr, ETH_ALEN);
bit = ~crc & 0x7f;
regidx = (bit & 0x60) >> 5;
bit &= 0x1f;
mc_filter[regidx] |= (1 << bit);
}
tw32(MAC_HASH_REG_0, mc_filter[0]);
tw32(MAC_HASH_REG_1, mc_filter[1]);
tw32(MAC_HASH_REG_2, mc_filter[2]);
tw32(MAC_HASH_REG_3, mc_filter[3]);
}
Other example , on bnx2, same logic :
memset(mc_filter, 0, 4 * NUM_MC_HASH_REGISTERS);
for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
i++, mclist = mclist->next) {
crc = ether_crc_le(ETH_ALEN, mclist->dmi_addr);
bit = crc & 0xff;
regidx = (bit & 0xe0) >> 5;
bit &= 0x1f;
mc_filter[regidx] |= (1 << bit);
}
for (i = 0; i < NUM_MC_HASH_REGISTERS; i++) {
REG_WR(bp, BNX2_EMAC_MULTICAST_HASH0 + (i * 4),
mc_filter[i]);
--
Interesting, this code has been there for eons (and probably this behavior) but that doesn't mean its not a problem. We are in the process of figuring out if there are any hardware corner cases to changing this code (particularly in e1000) Initial thoughts are: 1) kcalloc an array that we then populate with the hash functions, and then program every location only once (never flush) 2) only program a single hash value each time a multicast is added (bad because we can't tell the difference in the list since the last time the OS gave us the list) It really seems like this should be fixable, and I agree that the driver behavior is far from optimal, however well entrenched. Jesse --
On Wed, Mar 18, 2009 at 12:24 PM, Brandeburg, Jesse Hi Jesse, thanks for the response... If you go back in this thread I had a dead easy unprivileged user-land testcase that causes frame loss. We ran into this in a production environment (and I kind of glossed over how long it took to figure out why the hell we were dropping frames...you can only increase rmem_max so many times ;-) OTOH not that many people use multicast, and even fewer notice a few dropped frames, so the priority is probably lowish. On the other other hand, I'm working in the financial trading space these days, where Linux is pretty much king....and they're all about multicast. -- Dave B --
here is a patch proposal [RFC] only, I've just briefly tested it for e1000 parts. If you want to give it a spin I would appreciate feedback. [RFC] e1000: fix loss of multicast packets From: Jesse Brandeburg <jesse.brandeburg@intel.com> e1000 (and e1000e, igb, ixgbe, ixgb) all do a series of operations each time a multicast address is added. The flow goes something like 1) stack adds one multicast address 2) stack passes whole current list of unicast and multicast addresses to driver 3) driver clears entire list in hardware 4) driver programs each multicast address using iomem in a loop This was causing multicast packets to be lost during the reprogramming process. reference with test program: http://kerneltrap.org/mailarchive/linux-netdev/2009/3/14/5160514/thread Thanks to Dave Boutcher for his report and test program. This driver fix prepares an array all at once in memory and programs it in one shot to the hardware, not requiring an "erase" cycle. It would still be possible for packets to be dropped while the receiver is off during reprogramming. Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> CC: Dave Boutcher <daveboutcher@gmail.com> --- drivers/net/e1000/e1000_main.c | 40 +++++++++++++++++++++++++++++++--------- 1 files changed, 31 insertions(+), 9 deletions(-) diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index 26474c9..65697ab 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -2328,6 +2328,12 @@ static void e1000_set_rx_mode(struct net_device *netdev) int mta_reg_count = (hw->mac_type == e1000_ich8lan) ? E1000_NUM_MTA_REGISTERS_ICH8LAN : E1000_NUM_MTA_REGISTERS; + u32 *mcarray = kzalloc(512, GFP_ATOMIC); + + if (!mcarray) { + DPRINTK(PROBE, ERR, "memory allocation failed\n"); + return; + } if (hw->mac_type == e1000_ich8lan) rar_entries = E1000_RAR_ENTRIES_ICH8LAN; @@ -2394,22 +2400,38 @@ static void e1000_set_rx_mode(struct ...
From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com> Just do what tg3 does to fix this now, get fancy and "beautiful" later. --
