If I use a large send queue
(echo 2000000 >/proc/sys/net/core/wmem_default)
then lots of packet loss results if I try to send more packets than
the bandwidth of the wire (300 bytes the maximum rate is 341k pps). With a
size of 2M for the output buffer the TX ring is overrun.
But this loss is nowhere seen in any counter increments:
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0
origin 10.2.36.111
TotalMsg Lost SeqErr Msg/Sec Min/us Avg/us Max/us StdDev Kbytes Idle Samples
3118667 532979 25341 308774 1734.96 1784.72 1904.04 44.11 0.0 0 10
3415674 585146 27642 341569 1788.90 1858.40 1905.43 34.32 102470.4 0 10
3449840 591012 27844 341568 1941.12 1987.45 2040.16 32.73 102470.4 0 9
3449819 591093 27993 341567 2024.34 2036.00 2044.24 6.48 102470.0 0 5
3415693 585268 27633 341568 2010.57 2017.84 2025.10 7.27 102470.4 0 2
ifconfig
eth0 Link encap:Ethernet HWaddr 00:21:9b:8f:a1:40
inet addr:10.2.36.110 Bcast:10.2.36.255 Mask:255.255.255.0
inet6 addr: fe80::221:9bff:fe8f:a140/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:174716487 errors:0 dropped:0 overruns:0 frame:0
TX packets:1379 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:60451720777 (56.3 GiB) TX bytes:225841 (220.5 KiB)
Interrupt:36 Memory:d6000000-d6012800
If I then reduce the queue size to 20k
(echo 20000 >/proc/sys/net/core/wmem_default)
the loss does either not occur or I see corresponding loss in the switch
stats:
clameter@rd-strategy3:~$ bin/mcast
Receiver: Listening to control channel 239.0.192.1
Receiver: Subscribing to 1 MC addresses 239.0.192-254.2-254 offset 0
origin 10.2.36.111
TotalMsg Lost SeqErr Msg/Sec Min/us Avg/us Max/us StdDev Kbytes
Idle Samples
3126997 0 0 ...Is there any chance this is getting dropped at qdisc? Can you check with tc? thanks, Nivedita --
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1 Sent 3042712708 bytes 8896862 pkt (dropped 0, overlimits 0 requeues 56303) rate 0bit 0pps backlog 0b 0p requeues 56303 --
What about ethtool -S ? Does it report any errors? Thanks Sridhar --
Neither. This is is a broadcom bnx2 NIC. --
Are you sure the packets are dropped at the sender?
Another place where packet drops/errors are counted is
/proc/net/softnet_stat
It tracks some counters that could result in drops. I thought these are
all receive statistics. But looks like cpu_collision is a tx stat. The
name of the structure is netif_rx_stat and it includes cpu_collison
counter.
Thanks
Sridhar
--
Yes I am sending 400k messages from the app and the receiver only gets 341k @300 byte (which is the line rate). There is no way that the 400k get over the line. Also if I reduce SO_SNDBUF then both receiver and sender get down to 341k. I added the output of ethtool -S at the end. The mcast tool can be had from http://gentwo.org/ll or from my directory How do I decode that information? clameter@rd-gateway3:~$ cat /proc/net/softnet_stat 00000a2f 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000000c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000018 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 rd-gateway3:/home/clameter# ethtool -S eth0 NIC statistics: rx_bytes: 380526 rx_error_bytes: 0 tx_bytes: 6299438984 tx_error_bytes: 0 rx_ucast_packets: 2373 rx_mcast_packets: 168 rx_bcast_packets: 5 ...
Argh I guess you want the receiver. The last msg was the sender. clameter@rd-strategy3:~$ cat /proc/net/softnet_stat 0115d345 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000000c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000018 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 --
total dropped time_squeeze 0 0 0 0 0 cpu_collision The first 3 are rx stats and the last one is a tx stat. Anyway, only the first field(total packets received) seems to be non-zero in your softnet_stat output on both sender and receiver. So it is possible that there is some other place in the stack where the packets are gettting dropped but not counted. -Sridhar --
Is the driver responsible? I noticed no rx_drop++ in there. --
Such a deed occurs in ip_push_pending_frames():
/* Netfilter gets whole the not fragmented skb. */
err = ip_local_out(skb);
if (err) {
if (err > 0)
err = inet->recverr ? net_xmit_errno(err) : 0;
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
if (err)
goto error;
}
out:
ip_cork_release(inet);
return err;
error:
IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
goto out;
So if ip_local_out returns NET_XMIT_DROP then its simply going to be
replaced by 0. Then we check err again and there is no error!!!!
The statistics are only generated if IP_RECVERR is set.
Could we move the increment of IPSTATS_MIB_OUTDISCARDS up so that it
is incremented regardless of the setting of IP_RECVERR?
F.e?
Subject: Report TX drops
Incrementing of TX drop counters currently does not work if errors from the
network stack are suppressed (IP_RECVERR off). Increment the statistics
independently of the setting of IP_RECVERR.
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
---
net/ipv4/ip_output.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)
Index: linux-2.6/net/ipv4/ip_output.c
===================================================================
--- linux-2.6.orig/net/ipv4/ip_output.c 2009-08-24 17:04:27.000000000 +0000
+++ linux-2.6/net/ipv4/ip_output.c 2009-08-24 17:32:05.000000000 +0000
@@ -1300,20 +1300,21 @@ int ip_push_pending_frames(struct sock *
/* Netfilter gets whole the not fragmented skb. */
err = ip_local_out(skb);
- if (err) {
- if (err > 0)
- err = inet->recverr ? net_xmit_errno(err) : 0;
- if (err)
- goto error;
+ if (err > 0) {
+ /* The packet was dropped by the network subsystem */
+ IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
+
+ /*
+ * Errors are not passed on if the socket
+ * does not process errors (see IP_RECVERR).
+ * net_xmit_errno ...NET_XMIT_CN strikes again :)
Well, if ip_local_out() returns a negative error (say -EPERM for example),
your patch disables OUTDISCARDS increments.
Maybe a simpler patch like this one ?
[PATCH] net: correctly updates OUTDISCARDS in ip_push_pending_frames()
ip_push_pending_frames() can fail to send a frame because of a congestioned
device. In this case, we increment SNMP OUTDISCARDS only if user set
IP_RECVERR, which is not RFC conformant.
Only case where we should not update OUTDISCARDS is when
ip_local_output() return value is NET_XMIT_CN (meaning
skb was xmitted but future frames might be dropped)
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..27a5b79 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1301,19 +1301,15 @@ int ip_push_pending_frames(struct sock *sk)
/* Netfilter gets whole the not fragmented skb. */
err = ip_local_out(skb);
if (err) {
+ if (err != NET_XMIT_CN)
+ IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
if (err > 0)
err = inet->recverr ? net_xmit_errno(err) : 0;
- if (err)
- goto error;
}
out:
ip_cork_release(inet);
return err;
-
-error:
- IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
- goto out;
}
/*
--
Christoph,
So are you hitting this case with your workload and does this account for all the
packet losses you are seeing?
If we are dropping the packet and returing NET_XMIT_DROP, should
we also increment qdisc drop stats (sch->qstats.drops)?
In dev_queue_xmit(), we have
if (q->enqueue) {
spinlock_t *root_lock = qdisc_lock(q);
spin_lock(root_lock);
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
kfree_skb(skb);
rc = NET_XMIT_DROP;
} else {
rc = qdisc_enqueue_root(skb, q);
qdisc_run(q);
}
spin_unlock(root_lock);
goto out;
}
Here, if QDISC_STATE_DEACTIVATED is true, the skb is dropped and NET_XMIT_DROP
is returned, but not accounted in qdisc drop stats.
However it is incremented when NET_XMIT_DROP is returned via qdisc_drop().
If we count these drops as qdisc drops, should we also count them as IP OUTDISCARDS?
Thanks
--
I think so but I am no expert. I was surprised to not even see counter increments at that level. But I was content to fix up the higher level Yes. --
Please hold on, I would like to fully understand what's happening, and test the patch :) --
Ok. It would be good if the drops would also be somehow noted by the UDP subsystem (one should see something with netstat -su) and may be even the socket. I see a drops column in /proc/net/udp. rx_drops, tx_drops? --
This /proc/net/udp column is for rx_drops currently and was recently added... --
So lets rename it to rx_drops and then add tx_drops? --
It wont be very nice, because it'll add yet another 32bits counter in each socket structure, for a unlikely use. While rx_drops can happen if application is slow. Also, tx_drops might be done later and not noticed. Please read this old (and usefull) thread, with Alexey words... http://oss.sgi.com/archives/netdev/2002-10/msg00612.html http://oss.sgi.com/archives/netdev/2002-10/msg00617.html So I bet your best choice is to set IP_RECVERR, as mentioned in 2002 by Jamal and Alexey :) --
tx_drops happen if the application sends too fast. TX drop tracking is important due to the braindamaged throttling logic during send. If SO_SNDBUF is less than what happens to fit in the TX ring then the application will be throttled and no packet loss happens. If SO_SNDBUF is set high then the TX ring will overflow and packets are dropped. We need some way to diagnose TX drops per socket as long as we have that mind boggling issue. TX drops means that one should reduce the size of the sendbuffer in order to get better throttling which reduces packet I read this just yesterday. IP_RECVERR means that the application wants to see details on each loss. We just want some counters that give us accurate statistics to gauge where packet loss is occurring. Applications are usually not interested in tracking the fate of each packet. --
Yep, but IP_RECVERR also has the side effect of letting kernel returns -ENOBUFS error in sending and congestion, which was your initial point :) --
The initial point was that the SNMP counters are not updated if IP_RECVERR is not set which is clearly a bug and your and my patch addresses that. Then Sridhar noted that there are other tx drop counters. qdisc counters are also not updated. Wish we would maintain tx drops counters there as well so that we can track down which NIC drops it. Then came the wishlist of UDP counters for tx drops and socket based tx_drop accounting for tuning and tracking down which app is sending too fast .... ;-) The apps could be third party apps. Just need to be able to troubleshoot packet loss. --
Technically speaking, the send() syscall is in error. Frame is not sent, so
there is no drop at all. Like trying to send() from a bad user buffer, or write()
Question is : should we just allow send() to return an error (-ENOBUF) regardless
of IP_RECVERR being set or not ? I dont think it would be so bad after all.
Most apps probably dont care, or already handle the error.
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7d08210..afae0cb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1302,7 +1302,7 @@ int ip_push_pending_frames(struct sock *sk)
err = ip_local_out(skb);
if (err) {
if (err > 0)
- err = inet->recverr ? net_xmit_errno(err) : 0;
+ err = net_xmit_errno(err);
if (err)
goto error;
}
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 87f8419..a7e5f93 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1526,7 +1526,7 @@ int ip6_push_pending_frames(struct sock *sk)
err = ip6_local_out(skb);
if (err) {
if (err > 0)
- err = np->recverr ? net_xmit_errno(err) : 0;
+ err = net_xmit_errno(err);
if (err)
goto error;
}
--
Frame is submitted to the IP layer which discards it. That is the Some applications will then start to fail because so far you can send with impunity without getting errors. AFAICT IP_RECVERR was added to preserve that behavior. Your patch is changing basic send() semantics. --
Sorry ???, I guess your machines have plenty available LOWMEM then, and kmalloc() never fail then...
man P send
ENOBUFS
Insufficient resources were available in the system to perform the operation.
basic send() semantics are respected.
--
Nope. Currently sendto() just drops the packet and returns success if the TX ring is full. That can be done ad infinitum and at very high traffic rates. We had one person here believing he could send 800k 300 byte basic send() semantics are changed by your patch. The 800k pps would no longer work without sendto() returning errors. --
Manpage for send says:
ENOBUFS
The output queue for a network interface was full. This generally indicates that the interface has stopped
sending, but may be caused by transient congestion. (Normally, this does not occur in Linux. Packets are just
silently dropped when a device queue overflows.)
So ENOBUFS seems to be designed to have the role that you envision. We
just need to remove the statement in (). Its still a change in behavior
though.
--
This patch would allow tracking drops at UDP level too via UDP_MIB_SNDBUFERRORS that is incremented in udp_sendmsg(). Right now this happens only if IP_RECVERR is set on the socket. Ideally, it would be good to track the drops at qdisc, IP and UDP if a packet is passed all the way to dev_queue_xmit() and then dropped. Thanks --
Yes looks good. Can we get this in soon? --
# gcc -v Using built-in specs. Target: i686-pc-linux-gnu Configured with: ../gcc-4.4.1/configure --enable-languages=c,c++ --prefix=/usr Thread model: posix gcc version 4.4.1 (GCC) # pwd /opt/src/lldiag-0.14 # make gcc -Wall -omcast mcast.c -lrt -lm mcast.c: In function ‘set_ip’: mcast.c:121: warning: implicit declaration of function ‘htons’ mcast.c: In function ‘build_pattern_array’: mcast.c:168: warning: implicit declaration of function ‘htonl’ /tmp/cc4sYCDr.o: In function `lock': mcast.c:(.text+0xcad): undefined reference to `__sync_fetch_and_add_4' collect2: ld returned 1 exit status make: *** [mcast] Error 1 I have no idea where is defined sync_fetch_and_add Thanks --
Doh... Dont bother, I had to add "-march=i686" to CFLAGS to get "xadd" support --
Its a GCC builtin for x86 sigh. Ok will put out 0.15 that disables their use by default. --
