Re: UDP multicast packet loss not reported if TX ring overrun?

Previous thread: mlx4 2.6.31-rc5: SW2HW_EQ failed. by Christoph Lameter on Monday, August 17, 2009 - 12:26 pm. (14 messages)

Next thread: [PATCH -next] phonet: fix build when PROC_FS is disabled by Randy Dunlap on Monday, August 17, 2009 - 1:35 pm. (3 messages)
From: Christoph Lameter
Date: Monday, August 17, 2009 - 1:01 pm

If I use a large send queue

(echo 2000000 >/proc/sys/net/core/wmem_default)

then lots of packet loss results if I try to send more packets than
the bandwidth of the wire (300 bytes the maximum rate is 341k pps). With a
size of 2M for the output buffer the TX ring is overrun.

But this loss is nowhere seen in any counter increments:

Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0
origin 10.2.36.111

TotalMsg   Lost SeqErr Msg/Sec   Min/us  Avg/us  Max/us StdDev  Kbytes Idle  Samples
 3118667 532979  25341  308774 1734.96 1784.72 1904.04   44.11     0.0   0       10
 3415674 585146  27642  341569 1788.90 1858.40 1905.43   34.32 102470.4    0       10
 3449840 591012  27844  341568 1941.12 1987.45 2040.16   32.73 102470.4    0        9
 3449819 591093  27993  341567 2024.34 2036.00 2044.24    6.48 102470.0    0        5
 3415693 585268  27633  341568 2010.57 2017.84 2025.10    7.27 102470.4    0        2

ifconfig
eth0      Link encap:Ethernet  HWaddr 00:21:9b:8f:a1:40
          inet addr:10.2.36.110  Bcast:10.2.36.255  Mask:255.255.255.0
          inet6 addr: fe80::221:9bff:fe8f:a140/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:174716487 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1379 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:60451720777 (56.3 GiB)  TX bytes:225841 (220.5 KiB)
          Interrupt:36 Memory:d6000000-d6012800


If I then reduce the queue size to 20k

(echo 20000 >/proc/sys/net/core/wmem_default)

the loss does either not occur or I see corresponding loss in the switch
stats:

clameter@rd-strategy3:~$ bin/mcast
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0
origin 10.2.36.111

TotalMsg   Lost SeqErr Msg/Sec   Min/us  Avg/us  Max/us StdDev  Kbytes
Idle  Samples
 3126997      0      0  ...
From: Nivedita Singhvi
Date: Monday, August 17, 2009 - 1:40 pm

Is there any chance this is getting dropped at qdisc?
Can you check with tc?

thanks,
Nivedita
--

From: Christoph Lameter
Date: Monday, August 17, 2009 - 1:46 pm

qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1
1 1 1 1 1
 Sent 3042712708 bytes 8896862 pkt (dropped 0, overlimits 0 requeues 56303)
 rate 0bit 0pps backlog 0b 0p requeues 56303




--

From: Sridhar Samudrala
Date: Monday, August 17, 2009 - 2:50 pm

What about ethtool -S ? Does it report any errors?

Thanks
Sridhar

--

From: Christoph Lameter
Date: Monday, August 17, 2009 - 3:13 pm

Neither. This is is a broadcom bnx2 NIC.

--

From: Sridhar Samudrala
Date: Monday, August 17, 2009 - 3:43 pm

Are you sure the packets are dropped at the sender?
Another place where packet drops/errors are counted is
    /proc/net/softnet_stat
It tracks some counters that could result in drops. I thought these are
all receive statistics. But looks like cpu_collision is a tx stat. The
name of the structure is netif_rx_stat and it includes cpu_collison 
counter.

Thanks
Sridhar

--

From: Christoph Lameter
Date: Monday, August 17, 2009 - 3:52 pm

Yes I am sending 400k messages from the app and the receiver only gets
341k @300 byte (which is the line rate). There is no way that the 400k get
over the line. Also if I reduce SO_SNDBUF then both receiver and
sender get down to 341k.

I added the output of ethtool -S at the end.

The mcast tool can be had from http://gentwo.org/ll or from my directory

How do I decode that information?


clameter@rd-gateway3:~$ cat /proc/net/softnet_stat
00000a2f 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
0000000c 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000018 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000

rd-gateway3:/home/clameter# ethtool -S eth0
NIC statistics:
     rx_bytes: 380526
     rx_error_bytes: 0
     tx_bytes: 6299438984
     tx_error_bytes: 0
     rx_ucast_packets: 2373
     rx_mcast_packets: 168
     rx_bcast_packets: 5
     ...
From: Christoph Lameter
Date: Monday, August 17, 2009 - 3:57 pm

Argh I guess you want the receiver. The last msg was the sender.

clameter@rd-strategy3:~$ cat /proc/net/softnet_stat
0115d345 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
0000000c 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000018 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000


--

From: Sridhar Samudrala
Date: Monday, August 17, 2009 - 5:12 pm

total dropped time_squeeze 0 0 0 0 0 cpu_collision

The first 3 are rx stats and the last one is a tx stat.
Anyway, only the first field(total packets received) seems to be non-zero 
in your softnet_stat output on both sender and receiver.
So it is possible that there is some other place in the stack where the packets 
are gettting dropped but not counted.

-Sridhar


--

From: Christoph Lameter
Date: Monday, August 17, 2009 - 5:25 pm

Is the driver responsible? I noticed no rx_drop++ in there.

--

From: Christoph Lameter
Date: Monday, August 24, 2009 - 10:40 am

Such a deed occurs in ip_push_pending_frames():

        /* Netfilter gets whole the not fragmented skb. */
        err = ip_local_out(skb);
        if (err) {
                if (err > 0)
                        err = inet->recverr ? net_xmit_errno(err) : 0;
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                if (err)
                        goto error;
        }

out:
        ip_cork_release(inet);
        return err;

error:
        IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
        goto out;


So if ip_local_out returns NET_XMIT_DROP then its simply going to be
replaced by 0. Then we check err again and there is no error!!!!

The statistics are only generated if IP_RECVERR is set.

Could we move the increment of IPSTATS_MIB_OUTDISCARDS up so that it
is incremented regardless of the setting of IP_RECVERR?

F.e?


Subject: Report TX drops

Incrementing of TX drop counters currently does not work if errors from the
network stack are suppressed (IP_RECVERR off). Increment the statistics
independently of the setting of IP_RECVERR.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 net/ipv4/ip_output.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Index: linux-2.6/net/ipv4/ip_output.c
===================================================================
--- linux-2.6.orig/net/ipv4/ip_output.c	2009-08-24 17:04:27.000000000 +0000
+++ linux-2.6/net/ipv4/ip_output.c	2009-08-24 17:32:05.000000000 +0000
@@ -1300,20 +1300,21 @@ int ip_push_pending_frames(struct sock *

 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
-	if (err) {
-		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
-		if (err)
-			goto error;
+	if (err > 0) {
+		/* The packet was dropped by the network subsystem */
+		IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+
+		/*
+		 * Errors are not passed on if the socket
+		 * does not process errors (see IP_RECVERR).
+		 * net_xmit_errno ...
From: Eric Dumazet
Date: Monday, August 24, 2009 - 3:02 pm

NET_XMIT_CN strikes again :)

Well, if ip_local_out() returns a negative error (say -EPERM for example),
 your patch disables OUTDISCARDS increments.

Maybe a simpler patch like this one ?

[PATCH] net: correctly updates OUTDISCARDS in ip_push_pending_frames()

ip_push_pending_frames() can fail to send a frame because of a congestioned
device. In this case, we increment SNMP OUTDISCARDS only if user set
IP_RECVERR, which is not RFC conformant.

Only case where we should not update OUTDISCARDS is when
ip_local_output() return value is NET_XMIT_CN (meaning
skb was xmitted but future frames might be dropped)

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..27a5b79 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1301,19 +1301,15 @@ int ip_push_pending_frames(struct sock *sk)
 	/* Netfilter gets whole the not fragmented skb. */
 	err = ip_local_out(skb);
 	if (err) {
+		if (err != NET_XMIT_CN)
+			IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
 		if (err > 0)
 			err = inet->recverr ? net_xmit_errno(err) : 0;
-		if (err)
-			goto error;
 	}
 
 out:
 	ip_cork_release(inet);
 	return err;
-
-error:
-	IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
-	goto out;
 }
 
 /*
--

From: Sridhar Samudrala
Date: Monday, August 24, 2009 - 3:36 pm

Christoph,

So are you hitting this case with your workload and does this account for all the
packet losses you are seeing? 

If we are dropping the packet and returing NET_XMIT_DROP, should
we also increment qdisc drop stats (sch->qstats.drops)?

In dev_queue_xmit(), we have

        if (q->enqueue) {
                spinlock_t *root_lock = qdisc_lock(q);

                spin_lock(root_lock);

                if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                        kfree_skb(skb);
                        rc = NET_XMIT_DROP;
                } else {
                        rc = qdisc_enqueue_root(skb, q);
                        qdisc_run(q);
                }
                spin_unlock(root_lock);

                goto out;
        }

Here, if QDISC_STATE_DEACTIVATED is true, the skb is dropped and NET_XMIT_DROP
is returned, but not accounted in qdisc drop stats.
However it is incremented when NET_XMIT_DROP is returned via qdisc_drop().

If we count these drops as qdisc drops, should we also count them as IP OUTDISCARDS?

Thanks


--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 6:48 am

I think so but I am no expert. I was surprised to not even see counter
increments at that level. But I was content to fix up the higher level

Yes.

--

From: Eric Dumazet
Date: Tuesday, August 25, 2009 - 6:48 am

Please hold on, I would like to fully understand what's happening,
and test the patch :)


--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 7:00 am

Ok. It would be good if the drops would also be somehow noted by the UDP
subsystem (one should see something with netstat -su) and may be even the
socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops?
--

From: Eric Dumazet
Date: Tuesday, August 25, 2009 - 8:32 am

This /proc/net/udp column is for rx_drops currently and was recently added...
--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 8:35 am

So lets rename it to rx_drops and then add tx_drops?

--

From: Eric Dumazet
Date: Tuesday, August 25, 2009 - 8:58 am

It wont be very nice, because it'll add yet another 32bits counter in each socket
structure, for a unlikely use. While rx_drops can happen if application is slow.

Also, tx_drops might be done later and not noticed.

Please read this old (and usefull) thread, with Alexey words...

http://oss.sgi.com/archives/netdev/2002-10/msg00612.html

http://oss.sgi.com/archives/netdev/2002-10/msg00617.html


So I bet your best choice is to set IP_RECVERR, as mentioned in 2002 by Jamal and Alexey :)
--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 9:11 am

tx_drops happen if the application sends too fast.

TX drop tracking is important due to the braindamaged throttling logic
during send. If SO_SNDBUF is less than what happens to fit in the TX ring then the
application will be throttled and no packet loss happens. If SO_SNDBUF is
set high then the TX ring will overflow and packets are dropped.

We need some way to diagnose TX drops per socket as long as we have
that mind boggling issue. TX drops means that one should reduce the size
of the sendbuffer in order to get better throttling which reduces packet

I read this just yesterday. IP_RECVERR means that the application wants to
see details on each loss. We just want some counters that give us accurate
statistics to gauge where packet loss is occurring. Applications are
usually not interested in tracking the fate of each packet.

--

From: Eric Dumazet
Date: Tuesday, August 25, 2009 - 9:27 am

Yep,  but IP_RECVERR also has the side effect of letting kernel returns -ENOBUFS error
in sending and congestion, which was your initial point :)

--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 9:36 am

The initial point was that the SNMP counters are not updated if IP_RECVERR
is not set which is clearly a bug and your and my patch addresses that.

Then Sridhar noted that there are other tx drop counters. qdisc counters
are also not updated. Wish we would maintain tx drops counters there as
well so that we can track down which NIC drops it.

Then came the wishlist of UDP counters for tx drops and socket based
tx_drop accounting for tuning and tracking down which app is sending
too fast .... ;-)

The apps could be third party apps. Just need to be able to troubleshoot
packet loss.





--

From: Eric Dumazet
Date: Tuesday, August 25, 2009 - 9:48 am

Technically speaking, the send() syscall is in error. Frame is not sent, so
there is no drop at all. Like trying to send() from a bad user buffer, or write()

Question is : should we just allow send() to return an error (-ENOBUF) regardless
of IP_RECVERR being set or not ? I dont think it would be so bad after all.
Most apps probably dont care, or already handle the error.

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..afae0cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1302,7 +1302,7 @@ int ip_push_pending_frames(struct sock *sk)
 	err = ip_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = inet->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 87f8419..a7e5f93 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1526,7 +1526,7 @@ int ip6_push_pending_frames(struct sock *sk)
 	err = ip6_local_out(skb);
 	if (err) {
 		if (err > 0)
-			err = np->recverr ? net_xmit_errno(err) : 0;
+			err = net_xmit_errno(err);
 		if (err)
 			goto error;
 	}
--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 10:01 am

Frame is submitted to the IP layer which discards it. That is the

Some applications will then start to fail because so far you can send with
impunity without getting errors. AFAICT IP_RECVERR was added to preserve
that behavior. Your patch is changing basic send() semantics.
--

From: Eric Dumazet
Date: Tuesday, August 25, 2009 - 10:08 am

Sorry ???, I guess your machines have plenty available LOWMEM then, and kmalloc() never fail then...

man P send

      ENOBUFS
              Insufficient resources were available in the system to perform the operation.

basic send() semantics are respected.

--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 10:44 am

Nope. Currently sendto() just drops the packet and returns success if the
TX ring is full. That can be done ad infinitum and at very high traffic
rates. We had one person here believing he could send 800k 300 byte

basic send() semantics are changed by your patch. The 800k pps would no
longer work without sendto() returning errors.

--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 10:53 am

Manpage for send says:

ENOBUFS
              The output queue for a network interface was full.  This generally indicates that  the  interface  has  stopped
              sending, but may be caused by transient congestion.  (Normally, this does not occur in Linux.  Packets are just
              silently dropped when a device queue overflows.)


So ENOBUFS seems to be designed to have the role that you envision. We
just need to remove the statement in (). Its still a change in behavior
though.




--

From: Sridhar Samudrala
Date: Tuesday, August 25, 2009 - 11:38 am

This patch would allow tracking drops at UDP level too via UDP_MIB_SNDBUFERRORS
that is incremented in udp_sendmsg(). Right now this happens only if IP_RECVERR
is set on the socket.

Ideally, it would be good to track the drops at qdisc, IP and UDP if a
packet is passed all the way to dev_queue_xmit() and then dropped.

Thanks

--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 6:46 am

Yes looks good. Can we get this in soon?

--

From: Eric Dumazet
Date: Monday, August 24, 2009 - 4:14 pm

# gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4.4.1/configure --enable-languages=c,c++ --prefix=/usr
Thread model: posix
gcc version 4.4.1 (GCC)
# pwd
/opt/src/lldiag-0.14
# make
gcc -Wall -omcast mcast.c -lrt -lm
mcast.c: In function ‘set_ip’:
mcast.c:121: warning: implicit declaration of function ‘htons’
mcast.c: In function ‘build_pattern_array’:
mcast.c:168: warning: implicit declaration of function ‘htonl’
/tmp/cc4sYCDr.o: In function `lock':
mcast.c:(.text+0xcad): undefined reference to `__sync_fetch_and_add_4'
collect2: ld returned 1 exit status
make: *** [mcast] Error 1


I have no idea where is defined sync_fetch_and_add

Thanks
--

From: Eric Dumazet
Date: Monday, August 24, 2009 - 11:46 pm

Doh...

Dont bother, I had to add "-march=i686" to CFLAGS to get "xadd" support

--

From: Christoph Lameter
Date: Tuesday, August 25, 2009 - 6:45 am

Its a GCC builtin for x86 sigh. Ok will put out 0.15 that disables their
use by default.

--