Eli,

Using the new connectx FW (2.5), I see performance drop to almost zero with ipoib datagram mode. The code that runs on these systems is ofed 1.3 and not mainline kernel, details below.

Running netperf with connected mode (64k MTU) I get about 950MB/s, whereas with datagram mode (2k MTU) I get only 20-40MB/s. I used to see about 650MB/s and above with FW 2.3 and datagram mode.

Not that it could explain the drop, but the NIC reports stateless offload support to the OS - /sys/class/net/ib1/features is 0x11423

I have opened the ipoib and mlx4 debug prints, and I don't see anything special other than that dmesg gets quite filled with

ib1: TX ring full, stopping kernel net queue

Any idea what can explain this? ibv_ud_pingpong gives about 2Gb/s, which is about five times what I see with ipoib.

Or.

git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
commit 564e9e9383272f4311fd87ff4e5447cfcebad73a

# uname -a
Linux gen2-1 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.1 (Tikanga)

_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Can you tell if changing the FW to 2.3 gives more reasonable results? I don't believe such a drop in performance would have passed the QA tests but I'll check that.

Yes, I will be able to revert to FW 2.3 later this week, but please check also with your QA
Or.

Sure.

And how much with 2.5.0?
Tziporet

Hi Or,
Trying to reproduce it i encountered the opposite result: datagram mode produced perfectly normal results (450-600 MBytes/s), connected mode produced extremely low results (20-30 MBytes/s).
run parameters: dell intel servers, sles10, fw 2.5.0, MLNX_OFED_LINUX-1.3.1
will look into it with eli.
oren

Oren, any news on this?
Or.

Sending to the correct Oren (Sela)
Tziporet

hi,

I do not manage to see a drop in datagram mode with the following params:
netperf-2.4.1, 64k message, RHAS5.0, fw 2.5.0, OFED-1.3, hermon-eagle-DDR, Intel servers.
Connected produces ~1050 MBytes/sec and datagram produces ~800 MBytes/sec.
These results agree with the ones I have in my database for both releases: 1.3+2.3.0 and 1.3.1+2.5.0.
Can you recheck all params and help us reproduce: test version, mtu, device etc.

BUT, I do see a bw drop in CONNECTED mode to 30-40 MBytes/sec with the following params:
netperf-2.4.1, 64k message, RHAS5.1, fw 2.5.0, OFED-1.3.1, hermon-eagle-DDR, AMD servers.
Eli is debugging it now.

Or,
I think this problem was introduced with the patch
ipoib_0390_restore_cm_mtu.patch in ofed. This patch attempts to
set the MTU when changing to CM mode to the max defined for CM mode
(e.g. 65520). However the change was done by setting dev->mtu which is
not how it should be done since it will not call any function
registered to be notified on MTU change. The following patch solved the
problem for me and you're welcome to try if it works for you too.
Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2008-06-30 11:37:59.000000000 +0300
+++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2008-06-30 18:53:31.000000000 +0300
@@ -1433,15 +1433,15 @@ static ssize_t set_mode(struct class_dev
                 if (ipoib_cm_max_mtu(dev) > priv->mcast_mtu)
                         ipoib_warn(priv, "mtu > %d will cause multicast packet drops.\n",
                                    priv->mcast_mtu);
-                dev->mtu = ipoib_cm_max_mtu(dev);
+                dev_set_mtu(dev, ipoib_cm_max_mtu(dev));
                 ipoib_flush_paths(dev);
                 return count;
         }
         if (!strcmp(buf, "datagram\n")) {
                 clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
-                dev->mtu = min(priv->mcast_mtu, dev->mtu);
+                dev_set_mtu(dev, min(priv->mcast_mtu, dev->mtu));
                 ipoib_flush_paths(dev);
                 if (priv->ca->flags & IB_DEVICE_IP_CSUM)
You can "fix" this problem by setting the MTU of CM mode manually from
the shell after switching mode in which case you will not need to use
this patch.
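
To illustrate the point about dev->mtu versus dev_set_mtu(), here is a toy model in Python rather than kernel C. ToyNetDevice, register_mtu_notifier and set_mtu are hypothetical names that only loosely mirror the kernel; it is a sketch of the idea (a plain assignment bypasses whoever registered for MTU changes), not the actual net_device code.

```python
class ToyNetDevice:
    """Hypothetical stand-in for a net_device; not real kernel code."""

    def __init__(self, mtu):
        self.mtu = mtu
        self._mtu_notifiers = []

    def register_mtu_notifier(self, callback):
        """Remember a callback that wants to hear about MTU changes."""
        self._mtu_notifiers.append(callback)

    def set_mtu(self, new_mtu):
        """Change the MTU and tell every registered listener about it."""
        self.mtu = new_mtu
        for callback in self._mtu_notifiers:
            callback(self, new_mtu)


seen = []
dev = ToyNetDevice(mtu=2044)
dev.register_mtu_notifier(lambda d, mtu: seen.append(mtu))

dev.mtu = 65520      # direct assignment: the value changes, listeners never hear of it
assert seen == []

dev.mtu = 2044       # put the old value back, again silently
dev.set_mtu(65520)   # going through the setter runs the notifiers
assert seen == [65520]
```

In this model the first assignment leaves every listener believing the MTU is still 2044, which is the kind of stale state the patch above is meant to avoid.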
I think I managed to narrow this down a little further: the issue seems
most notable with long messages sent over datagram mode. Could it be
an issue in the LSO engine? See this table:
============================================================
mode        mtu     size    bw (MB/s)   note
============================================================
datagram    2044    64000   30          <---------- problem
datagram    2044    2000    430
datagram    2044    2000    300         TCP_NODELAY
------------------------------------------------------------
connected   2044    64000   450
connected   2044    2000    450
connected   2044    2000    300         TCP_NODELAY
------------------------------------------------------------
connected   64000   64000   930
connected   64000   2000    930
connected   64000   2000    470         TCP_NODELAY
============================================================
notes:
- in all cases, I have set the mtu manually
- verbs tests of bidirectional bandwidth show that the HCA UD and RC engines work very well, and
  that the cables/switch etc. are operating fine, as the SDR limit is easily reached.
# qperf -li mlx4_0:2 -ri mlx4_0:1 172.25.5.77 -m 2000 -t 10 rc_bi_bw
rc_bi_bw:
bw = 1.91 GB/sec
# qperf -li mlx4_0:2 -ri mlx4_0:1 172.25.5.77 -m 2000 -t 10 ud_bi_bw
ud_bi_bw:
send_bw = 1.95 GB/sec
recv_bw = 1.95 GB/sec
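
As a rough check of the "SDR limit" remark, the arithmetic can be sketched as below. The link width (4x), the 2.5 Gb/s per-lane SDR signalling rate with 8b/10b encoding, and reading qperf's GB as 10^9 bytes are assumptions on my side; none of them is stated in this thread.

```python
def ib_data_rate_gbytes_per_s(lanes=4, lane_signal_gbps=2.5):
    """Per-direction data rate in GB/s for an 8b/10b-encoded InfiniBand link."""
    data_gbps = lanes * lane_signal_gbps * 8 / 10  # 8 payload bits per 10 signalled bits
    return data_gbps / 8                           # bits -> bytes


one_way = ib_data_rate_gbytes_per_s()  # assumed 4x SDR link
both_ways = 2 * one_way                # bidirectional test uses both directions
measured_rc_bi = 1.91                  # GB/s, from the rc_bi_bw run above

print(f"4x SDR: {one_way:.2f} GB/s per direction, {both_ways:.2f} GB/s bidirectional")
print(f"rc_bi_bw reaches {measured_rc_bi / both_ways:.1%} of that")
```

Under these assumptions the verbs numbers sit within a few percent of what the link can carry, so the fabric itself is unlikely to be the bottleneck.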
datagram mode mtu 2044
# netperf -H 10.10.0.90 -fM -l 600 -D 1, -- -m 64000
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.10.0.90 (10.10.0.90) port 0 AF_INET : demo
Interim result: 33.00 MBytes/s over 10.31 seconds
Interim result: 22.27 MBytes/s over 1.48 seconds
Interim result: 34.22 MBytes/s over 2.51 seconds
Interim result: 34.05 MBytes/s over 1.01 seconds
Interim result: 22.88 MBytes/s over 1.49 seconds
Interim result: 30.03 MBytes/s over 1.00 seconds
Interim result: 28.26 MBytes/s over 1.01 seconds
Interim result: 28.21 MBytes/s over 1.00 seconds
Interim result: 14.26 MBytes/s over 1.98 seconds
# netperf -H 10.10.0.90 -fM -l 600 -D 1, -- -m 2000
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.10.0.90 ...

Or,
Are these results after you have applied Eli's fix?
Sagi

-----Original Message-----
From: general-bounces@lists.openfabrics.org [mailto:general-bounces@lists.openfabrics.org] On Behalf Of Or Gerlitz
Sent: Tuesday, July 01, 2008 9:42 AM
To: Oren Meron
Cc: Eli Cohen; general@lists.openfabrics.org
Subject: [ofa-general] Re: performance drop for datagram mode with the

As I wrote, in all cases I have applied a manual mtu setting, for which Eli's fix is not relevant. You can note that with 2k mtu I get quite good results for small packets and very bad results for large packets.

Or.

See Eli's response in the other mail, let's keep to one thread on this subject.
sagi

-----Original Message-----
From: Or Gerlitz [mailto:ogerlitz@voltaire.com]
Sent: Tuesday, July 01, 2008 10:09 AM
To: Sagi Rotem
Cc: Oren Meron; Eli Cohen; general@lists.openfabrics.org
Subject: Re: [ofa-general] Re: performance drop for datagram mode with the new connectx FW

Or, just to be sure the kernel does not filter out MTU changes that have the same value, you can either apply the patches or make sure that when you change the value, you go through another value. For example:

ifconfig ib0 mtu 2000
ifconfig ib0 mtu 2044

I was just able to verify that the above suggestion is required, at least on my systems (SLES 10). This is of course valid for changing to CM mode too.
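
The "go through another value" trick can be sketched as a small helper that only builds the command lines and does not run them. mtu_bounce_commands and its detour_mtu parameter are made-up names, and the default detour of 44 below the target is arbitrary, chosen only so that 2044 reproduces the example above.

```python
def mtu_bounce_commands(iface, target_mtu, detour_mtu=None):
    """Return ifconfig commands that reach target_mtu via a different value."""
    if detour_mtu is None:
        detour_mtu = target_mtu - 44  # arbitrary nearby value, not a required offset
    if detour_mtu == target_mtu:
        raise ValueError("detour MTU must differ from the target MTU")
    return [
        f"ifconfig {iface} mtu {detour_mtu}",
        f"ifconfig {iface} mtu {target_mtu}",
    ]


for cmd in mtu_bounce_commands("ib0", 2044):   # datagram mode
    print(cmd)
for cmd in mtu_bounce_commands("ib0", 65520):  # connected mode
    print(cmd)
```

Passing the commands through a different value first means the second one is a real change from the kernel's point of view, whatever MTU the interface reported before.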
Eli, just to summarize, the above is fair enough for the problem you [...] on the same settings, the datagram mode bw changes from 30 MB/s to 430 MB/s depending on the message size used by netperf, so it's a different issue.

Or.

OK, it turns out that this specific parameter set passed to netperf / TCP_STREAM was the source of the performance drop I saw. Also setting the socket send buffer changes everything (all tests use datagram mode, mtu 2044, msg size 64000):

====================================================
send_buf    recv_buf    BW (MB/s)
====================================================
NA          NA          30
32000       NA          640
64000       NA          750
128000      NA          790
128000      128000      840

On this system / current settings, I get quite the same numbers for both firmware versions (2.3 and 2.5).

Or.

# netperf -H 10.10.0.90 -fM -l 20 -D 1, -- -m 64000
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.10.0.90 (10.10.0.90) port 0 AF_INET : demo
Interim result: 44.11 MBytes/s over 10.68 seconds
Interim result: 34.25 MBytes/s over 2.49 seconds
Interim result: 67.30 MBytes/s over 1.01 seconds
Interim result: 67.28 MBytes/s over 1.00 seconds
Interim result: 66.72 MBytes/s over 1.01 seconds
Interim result: 33.31 MBytes/s over 2.00 seconds
Interim result: 22.63 MBytes/s over 1.47 seconds
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

87380  16384   64000    20.66    42.95

# netperf -H 10.10.0.90 -fM -l 20 -D 1, -- -m 64000 -s 32000
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.10.0.90 (10.10.0.90) port 0 AF_INET : demo
Interim result: 643.92 MBytes/s over 1.00 seconds
Interim result: 646.37 MBytes/s over 1.00 seconds
Interim result: 644.34 MBytes/s over 1.00 seconds
Interim result: 644.39 MBytes/s over 1.00 seconds
Interim result: 643.47 MBytes/s over 1.00 seconds
Interim result: 643.22 MBytes/s over 1.00 seconds
Interim result: 643.55 MBytes/s over 1.00 seconds
Interim result: 642.41 MBytes/s over 1.00 seconds
Interim result: 640.98 MBytes/s over 1.00 seconds
Interim result: 641.88 MBytes/s over 1.00 seconds
Interim result: 642.67 MBytes/s over 1.00 seconds
Interim ...
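
For anyone who wants to repeat the sweep, a minimal sketch of how the command lines could be generated is below. netperf_cmd is a made-up helper; -s is taken from the run shown above, while using -S for the recv_buf column (the remote socket buffer size) is my assumption about how that column was produced. The bandwidth figures are the ones from the table, nothing new was measured.

```python
def netperf_cmd(host, msg_size, send_buf=None, recv_buf=None, seconds=20):
    """Build a netperf TCP_STREAM command line like the ones in this thread."""
    cmd = ["netperf", "-H", host, "-fM", "-l", str(seconds), "-D", "1,",
           "--", "-m", str(msg_size)]
    if send_buf is not None:
        cmd += ["-s", str(send_buf)]  # local socket buffer size
    if recv_buf is not None:
        cmd += ["-S", str(recv_buf)]  # remote socket buffer size (assumed mapping)
    return " ".join(cmd)


# (send_buf, recv_buf, MB/s) rows from the table above; None stands for "NA".
results = [(None, None, 30), (32000, None, 640), (64000, None, 750),
           (128000, None, 790), (128000, 128000, 840)]

baseline = results[0][2]
for send_buf, recv_buf, bw in results:
    cmd = netperf_cmd("10.10.0.90", 64000, send_buf, recv_buf)
    print(f"{bw / baseline:5.1f}x  {cmd}")
```

The ratio column makes the point of the table explicit: every row with an explicit send buffer is more than twenty times faster than the default.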
Some details on the setup / nodes:

I used the fw-25408-2_5_000-MHGH28-XTC_A1.bin (size 487388) firmware file, which was downloaded from the mellanox website, and this command

$ mstflint -d 07:00.0 -i /tmp/fw-25408-2_5_000-MHGH28-XTC_A1.bin b

to burn it (e.g. on a node with the hca at position 07:00.0).

The connectx HCAs we use here are eagle-DDR; I can double check it if this matters.

node A (internal IP 172.25.5.157)
=================================
- RH 5.1 kernel 2.6.18-53.el5 SMP x86_64
- 8GB RAM
- 2 quad core 2.6 GHz Intel CPUs

$ lspci | grep IB
07:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0)

$ mstflint -d 07:00.0 q
Image type:      ConnectX
FW Version:      2.5.0
Device ID:       25418
Chip Revision:   A0
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c90200258630 0002c90200258631 0002c90200258632 0002c90200258633
Board ID:        (MT_04A0110002)
VSD:
PSID:            MT_04A0110002

node B (internal IP 172.25.5.77)
=================================
- RH 5 kernel 2.6.18-8.el5 SMP x86_64
- 8GB RAM
- 2 dual core 2.6 GHz Intel CPUs

$ lspci | grep Inf
03:00.0 InfiniBand: Mellanox Technologies Unknown device 634a (rev a0)

$ mstflint -d 03:00.0 q
Image type:      ConnectX
FW Version:      2.5.0
Device ID:       25418
Chip Revision:   A0
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c9020025821c 0002c9020025821d 0002c9020025821e 0002c9020025821f
Board ID:        (MT_04A0110002)
VSD:
PSID:            MT_04A0110002
