Re: IGMP Join dropping multicast packets

Previous thread: FINAL NOTIFICATION by siemens Online Promotion on Saturday, March 14, 2009 - 11:12 am. (1 message)

Next thread: 2.6.29-rc8: Reported regressions from 2.6.28 by Rafael J. Wysocki on Saturday, March 14, 2009 - 12:01 pm. (50 messages)
From: Dave Boutcher
Date: Saturday, March 14, 2009 - 1:16 pm

I'm running into an interesting problem with joining multiple
multicast feeds.  If you join multiple multicast feeds using
setsockopt(...,IP_ADD_MEMBERSHIP...) it causes packets on UNRELATED
multicast feeds to get dropped.  We have a multicast feed on a rock
solid network, and we were very surprised to see dropped packets.  The
cause was a different process/program being run by a different user
joining a bunch of mulitcast feeds.

I can recreate this with a fairly simple testcase (attached below.)
The problem doesn't happen with unicast UDP data, and it doesn't
happen with loopback, so you need at least two systems to run this
(and what subscriber to netdev doesn't have at least two systems.)  To
recreate, run "receiver" on one system, "sender", on another, and then
"joiner" on the receiving system.  You should see a message pop out
saying that packets have been dropped.  I've recreated this on a few
different kernel versions (the latest being 2.6.28) and a few
different sets off hardware.  I HAVEN"T recreated it if the system
doing the IP_ADD_MEMBERSHIP specifies a specific interface rather than
INADDR_ANY.  I'm not sure if that is core to the issue or not.  You
may also need to bump the value in
/proc/sys/net/ipv4/igmp_max_memberships (though that hasn't seemed
necessary for me.)

I poked around in igmp.c, but its mojo exceeds my threshold.  If
anyone has any ideas or questions I'd be happy to hear them.

diff -uNr null/joiner.c multicast/joiner.c
--- null/joiner.c	1969-12-31 18:00:00.000000000 -0600
+++ multicast/joiner.c	2009-03-14 15:04:10.000000000 -0500
@@ -0,0 +1,44 @@
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <unistd.h>
+
+#define NUMSOCK 55
+
+int main(int argc, char **argv)
+{
+	struct ip_mreq mreq;
+	int i;
+	int sd;
+	char ipaddr[64];
+
+	for (i=0; i<NUMSOCK; i++) {
+		sd = socket(PF_INET, ...
From: Eric Dumazet
Date: Saturday, March 14, 2009 - 7:37 pm

I could not reproduce the problem on my machines (bnx2 adapter), even if changing
NUMSOCK from 55 to 200 in joiner.c

Is your network a 100Mb one or Gigabit ?
Try to slow down your joiner ?
(Could be a flood of IGMP messages your router/switch cannot cope with)

Please describe your "rock solid" network setup (kind of network adapters you have, kind of router...)

Each time an address is added, NIC driver have to reprogram mcfilter of
the device. Maybe some NIC can drop some packets at this moment...

If using tcpdump to force promiscuous mode on the device also triggers packet losses ?

(see also ifconfig ethX promisc|allmulti)

--

From: Dave Boutcher
Date: Sunday, March 15, 2009 - 7:04 pm

Thanks for trying Eric.  Based on your email I did some more testing
and thus far I've
only recreated this on x86_64 arches, not on i386.  Which arch did you

The problem originally manifest itself at work on a 24-core Dell
server with 6 NICs.   The network
is gigabit with a Cisco 4900 switch.  I recreated it in my basement on
my little white-box
system and a cheap netgear switch.  The NIC at work is Intel e1000e
driver, the one

I haven't had a chance to play with promiscuous yet...

-- 
Dave B
--

From: Eric Dumazet
Date: Monday, March 16, 2009 - 12:01 pm

I tried both, 32 and 64 bit kernels. No problems so far.



--

From: Eric Dumazet
Date: Tuesday, March 17, 2009 - 12:08 am

Also, is using a third machine to start your joiner program is able to trigger
packet losses too ?


--

From: Dave Boutcher
Date: Tuesday, March 17, 2009 - 8:50 pm

Eric, based on your inability to recreate this, I tried on some other
hardware I had lying around that has an AMD chipset built-in NIC.
I could not recreate the problem on that hardware.  I'm starting to
think this is an e1000 problem.  In both the e1000 and e1000e
drivers they do the following logic:

      /* clear the old settings from the multicast hash table */

       for (i = 0; i < mta_reg_count; i++) {
               E1000_WRITE_REG_ARRAY(hw, MTA, i, 0);
               E1000_WRITE_FLUSH();
       }

       /* load any remaining addresses into the hash table */

       for (; mc_ptr; mc_ptr = mc_ptr->next) {
               hash_value = e1000_hash_mc_addr(hw, mc_ptr->da_addr);
               e1000_mta_set(hw, hash_value);
       }

There's clearly a window where the NIC doesn't have the multicast
addresses loaded.  This may just be broken-as-designed.  If anyone
else happens to have some e1000 hardware and wants to see if you
can recreate this, I'd be curious.

Some other notes just FYI...

- RcvbufErrors in /proc/net/snmp doesn't get incremented when this happens
- there are no messages in dmesg
- frames get dropped when the program calls exit() and all the sockets
get closed
  (and multicast joins dropped) as well as when the ADD_MEMBERSHIPs happen
- The problem happens even when adding a sleep(1) in between each of the
  ADD_MEMBERSHIP calls.

-- 
Dave B
--

From: Eric Dumazet
Date: Wednesday, March 18, 2009 - 12:38 am

Ouch, you are probably right, this code needs a change.

tg3 for example has a loop bulding hash values in a local array,
then a write of this array on NIC.

                for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
                     i++, mclist = mclist->next) {

                        crc = calc_crc (mclist->dmi_addr, ETH_ALEN);
                        bit = ~crc & 0x7f;
                        regidx = (bit & 0x60) >> 5;
                        bit &= 0x1f;
                        mc_filter[regidx] |= (1 << bit);
                }

                tw32(MAC_HASH_REG_0, mc_filter[0]);
                tw32(MAC_HASH_REG_1, mc_filter[1]);
                tw32(MAC_HASH_REG_2, mc_filter[2]);
                tw32(MAC_HASH_REG_3, mc_filter[3]);
        }

Other example , on bnx2, same logic :

               memset(mc_filter, 0, 4 * NUM_MC_HASH_REGISTERS);

                for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
                     i++, mclist = mclist->next) {

                        crc = ether_crc_le(ETH_ALEN, mclist->dmi_addr);
                        bit = crc & 0xff;
                        regidx = (bit & 0xe0) >> 5;
                        bit &= 0x1f;
                        mc_filter[regidx] |= (1 << bit);
                }

                for (i = 0; i < NUM_MC_HASH_REGISTERS; i++) {
                        REG_WR(bp, BNX2_EMAC_MULTICAST_HASH0 + (i * 4),
                               mc_filter[i]);


--

From: Brandeburg, Jesse
Date: Wednesday, March 18, 2009 - 10:24 am

Interesting, this code has been there for eons (and probably this 
behavior) but that doesn't mean its not a problem.

We are in the process of figuring out if there are any hardware corner 
cases to changing this code (particularly in e1000)

Initial thoughts are:
1) kcalloc an array that we then populate with the hash functions, and 
   then program every location only once (never flush)
2) only program a single hash value each time a multicast is added (bad 
   because we can't tell the difference in the list since the last time 
   the OS gave us the list)

It really seems like this should be fixable, and I agree that the driver 
behavior is far from optimal, however well entrenched.

Jesse
--

From: Dave Boutcher
Date: Wednesday, March 18, 2009 - 6:51 pm

On Wed, Mar 18, 2009 at 12:24 PM, Brandeburg, Jesse

Hi Jesse, thanks for the response...

If you go back in this thread I had a dead easy unprivileged user-land testcase
that causes frame loss.  We ran into this in a production environment
(and I kind
of glossed over how long it took to figure out why the hell we were dropping
frames...you can only increase rmem_max so many times ;-)  OTOH not that many
people use multicast, and even fewer notice a few dropped frames, so the
priority is probably lowish.

On the other other hand, I'm working in the financial trading space these days,
where Linux is pretty much king....and they're all about multicast.

--
Dave B
--

From: Brandeburg, Jesse
Date: Friday, March 20, 2009 - 1:36 pm

here is a patch proposal [RFC] only, I've just briefly tested it for e1000 
parts.  If you want to give it a spin I would appreciate feedback.

[RFC] e1000: fix loss of multicast packets

From: Jesse Brandeburg <jesse.brandeburg@intel.com>

e1000 (and e1000e, igb, ixgbe, ixgb) all do a series of operations each time a
multicast address is added.  The flow goes something like

1) stack adds one multicast address
2) stack passes whole current list of unicast and multicast addresses to
driver
3) driver clears entire list in hardware
4) driver programs each multicast address using iomem in a loop

This was causing multicast packets to be lost during the reprogramming
process.

reference with test program:
http://kerneltrap.org/mailarchive/linux-netdev/2009/3/14/5160514/thread

Thanks to Dave Boutcher for his report and test program.

This driver fix prepares an array all at once in memory and programs it in 
one shot to the hardware, not requiring an "erase" cycle.  It would still 
be possible for packets to be dropped while the receiver is off during 
reprogramming.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
CC: Dave Boutcher <daveboutcher@gmail.com>
---

 drivers/net/e1000/e1000_main.c |   40 +++++++++++++++++++++++++++++++---------
 1 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 26474c9..65697ab 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2328,6 +2328,12 @@ static void e1000_set_rx_mode(struct net_device *netdev)
 	int mta_reg_count = (hw->mac_type == e1000_ich8lan) ?
 				E1000_NUM_MTA_REGISTERS_ICH8LAN :
 				E1000_NUM_MTA_REGISTERS;
+	u32 *mcarray = kzalloc(512, GFP_ATOMIC);
+
+	if (!mcarray) {
+		DPRINTK(PROBE, ERR, "memory allocation failed\n");
+		return;
+	}
 
 	if (hw->mac_type == e1000_ich8lan)
 		rar_entries = E1000_RAR_ENTRIES_ICH8LAN;
@@ -2394,22 +2400,38 @@ static void e1000_set_rx_mode(struct ...
From: David Miller
Date: Wednesday, March 18, 2009 - 10:46 pm

From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>

Just do what tg3 does to fix this now, get fancy and "beautiful"
later.
--

Previous thread: FINAL NOTIFICATION by siemens Online Promotion on Saturday, March 14, 2009 - 11:12 am. (1 message)

Next thread: 2.6.29-rc8: Reported regressions from 2.6.28 by Rafael J. Wysocki on Saturday, March 14, 2009 - 12:01 pm. (50 messages)