Re: [ofa-general] performance drop for datagram mode with the new connectx FW

From: Or Gerlitz
Date: Monday, June 23, 2008 - 4:36 am

Eli,

Using the new connectx FW (2.5), I see performance drop to almost
zero with ipoib datagram mode. The code that runs on these systems
is ofed 1.3 and not mainline kernel, details below.

Running netperf with connected mode (64k MTU) I get about 950MB/s,
whereas with datagram mode (2k MTU) I get only 20-40MB/s. I used to
see about 650MB/s and above with FW 2.3 and datagram mode. Not that
it could explain the drop, but the NIC reports to the OS stateless
offload support - /sys/class/net/ib1/features is 0x11423
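
To make the features value above easier to read, here is a minimal sketch
that lists which bit positions are set in such a mask. The helper name
set_bits is hypothetical. The mapping from bit positions to NETIF_F_* flag
names is deliberately left out, since it depends on the kernel version and
should be checked against the netdevice.h of the running kernel.

```shell
#!/bin/sh
# List the bit positions that are set in a netdev "features" mask.
# set_bits is a hypothetical helper name, not an existing tool.
set_bits() {
    mask=$(( $1 ))          # accepts hex input such as 0x11423
    bit=0
    out=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="$out$bit "
        fi
        mask=$(( mask >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "${out% }"         # drop the trailing space
}

set_bits 0x11423            # prints: 0 1 5 10 12 16
```

On a live system the same helper could be fed directly from sysfs, e.g.
set_bits "$(cat /sys/class/net/ib1/features)".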

I have enabled the ipoib and mlx4 debug prints, and I don't see anything
special other than dmesg getting quite filled with

	ib1: TX ring full, stopping kernel net queue

any idea what can explain this? ibv_ud_pingpong gives about 2Gb/s which
is about five times what I see with ipoib.


Or.

git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
commit 564e9e9383272f4311fd87ff4e5447cfcebad73a

# uname -a
Linux gen2-1 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
From: Eli Cohen
Date: Monday, June 23, 2008 - 4:42 am

Can you tell if changing the FW to 2.3 gives more reasonable results?
I don't believe such a drop in performance would have passed the QA
tests but I'll check that.
From: Or Gerlitz
Date: Monday, June 23, 2008 - 4:52 am

Yes, I will be able to revert to FW 2.3 later this week, but please 
check also with your QA

Or.

From: Eli Cohen
Date: Monday, June 23, 2008 - 4:55 am

Sure.
From: Suresh Shelvapille
Date: Tuesday, June 24, 2008 - 9:01 am

From: Tziporet Koren
Date: Wednesday, June 25, 2008 - 2:44 am

And how much with 2.5.0?

Tziporet
From: Oren Meron
Date: Monday, June 23, 2008 - 7:56 am

Hi Or,
Trying to reproduce it, I encountered the opposite result:
datagram mode produced perfectly normal results (450-600 MBytes/s), while
connected mode produced extremely low results (20-30 MBytes/s).
Run parameters:
Dell Intel servers, SLES10, FW 2.5.0, MLNX_OFED_LINUX-1.3.1

will look into it with eli.

oren
From: Or Gerlitz
Date: Sunday, June 29, 2008 - 7:33 am

Oren,

any news on this?

Or.

From: Tziporet Koren
Date: Sunday, June 29, 2008 - 8:47 am

Sending to the correct Oren (Sela)

Tziporet

From: Oren Meron
Date: Monday, June 30, 2008 - 12:52 am

Hi,
I cannot reproduce a drop in datagram mode with the following parameters:
netperf-2.4.1, 64k message, RHAS5.0, fw 2.5.0, OFED-1.3, hermon-eagle-DDR, Intel servers.
Connected mode produces ~1050 MBytes/sec and datagram mode produces ~800 MBytes/sec.
These results agree with the ones I have in my database for both releases: 1.3+2.3.0 and 1.3.1+2.5.0.
Can you recheck all parameters and help us reproduce: test version, MTU, device, etc.?

BUT,
I do see a bandwidth drop in CONNECTED mode to 30-40 MBytes/sec with the following parameters:
netperf-2.4.1, 64k message, RHAS5.1, fw 2.5.0, OFED-1.3.1, hermon-eagle-DDR, AMD servers.
Eli is debugging it now.

From: Eli Cohen
Date: Monday, June 30, 2008 - 9:18 am

Or,

I think this problem was introduced with the patch
ipoib_0390_restore_cm_mtu.patch in ofed. This patch attempts to
set the MTU when changing to CM mode to the max defined for CM mode
(e.g. 65520). However the change was done by setting dev->mtu which is
not how it should be done since it will not call any function
registered to be notified on MTU change. The following patch solved the
problem for me and you're welcome to try if it works for you too.


Index: ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c
===================================================================
--- ofa_1_3_dev_kernel.orig/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2008-06-30 11:37:59.000000000 +0300
+++ ofa_1_3_dev_kernel/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2008-06-30 18:53:31.000000000 +0300
@@ -1433,15 +1433,15 @@ static ssize_t set_mode(struct class_dev
 		if (ipoib_cm_max_mtu(dev) > priv->mcast_mtu)
 			ipoib_warn(priv, "mtu > %d will cause multicast packet drops.\n",
 				   priv->mcast_mtu);
-		dev->mtu = ipoib_cm_max_mtu(dev);
 
+		dev_set_mtu(dev, ipoib_cm_max_mtu(dev));
 		ipoib_flush_paths(dev);
 		return count;
 	}
 
 	if (!strcmp(buf, "datagram\n")) {
 		clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
-		dev->mtu = min(priv->mcast_mtu, dev->mtu);
+		dev_set_mtu(dev, min(priv->mcast_mtu, dev->mtu));
 		ipoib_flush_paths(dev);
 
 		if (priv->ca->flags & IB_DEVICE_IP_CSUM)

You can also "fix" this problem by setting the MTU manually from the
shell after switching mode, in which case you will not need to use
this patch.
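
The manual workaround described above can be sketched as follows. This is
an illustration under assumptions, not a tested procedure: the helper name
set_ipoib_mode is hypothetical, the interface name ib0 is only an example,
and the mode file is assumed to live at /sys/class/net/<iface>/mode (the
sysfs attribute handled by set_mode() in the patch). The value 65520 is the
CM maximum quoted above. With DRY_RUN=1 (the default) the commands are only
printed.

```shell
#!/bin/sh
# Sketch: switch an IPoIB interface's mode, then set the MTU explicitly
# so that the MTU change goes through the normal kernel path.
DRY_RUN=${DRY_RUN:-1}

# usage: set_ipoib_mode IFACE MODE MTU   (hypothetical helper)
set_ipoib_mode() {
    iface=$1; mode=$2; mtu=$3
    if [ "$DRY_RUN" = "1" ]; then
        # Only show what would be executed.
        echo "echo $mode > /sys/class/net/$iface/mode"
        echo "ifconfig $iface mtu $mtu"
    else
        echo "$mode" > "/sys/class/net/$iface/mode"
        ifconfig "$iface" mtu "$mtu"
    fi
}

set_ipoib_mode ib0 connected 65520
```

Running it with DRY_RUN=0 as root would apply the change; for datagram mode
the corresponding call would be set_ipoib_mode ib0 datagram 2044.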
From: Or Gerlitz
Date: Monday, June 30, 2008 - 11:41 pm

I think I managed to narrow this down a little further: the issue seems
most notable with long messages sent over datagram mode. Could it be
an issue in the LSO engine? See this table:

====================================================
mode        mtu     size    bw (MB/s)  note
====================================================
datagram    2044    64000   30         <---------- problem
datagram    2044    2000    430
datagram    2044    2000    300        TCP_NODELAY
----------------------------------------------------
connected   2044    64000   450
connected   2044    2000    450
connected   2044    2000    300        TCP_NODELAY
----------------------------------------------------
connected   64000   64000   930
connected   64000   2000    930
connected   64000   2000    470        TCP_NODELAY
====================================================

notes:

- in all cases, I have set the mtu manually
- verbs tests of bidirectional bandwidth show that the HCA UD and RC engines work very well, and
  that the cables/switch etc. are operating fine, as the SDR limit is easily reached.


# qperf -li mlx4_0:2 -ri mlx4_0:1 172.25.5.77 -m 2000 -t 10 rc_bi_bw

rc_bi_bw:
    bw  =  1.91 GB/sec

# qperf -li mlx4_0:2 -ri mlx4_0:1 172.25.5.77 -m 2000 -t 10 ud_bi_bw
ud_bi_bw:
    send_bw  =  1.95 GB/sec
    recv_bw  =  1.95 GB/sec



datagram mode mtu 2044

# netperf -H 10.10.0.90 -fM -l 600 -D 1,  -- -m 64000
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.10.0.90 (10.10.0.90) port 0 AF_INET : demo
Interim result:   33.00 MBytes/s over 10.31 seconds
Interim result:   22.27 MBytes/s over 1.48 seconds
Interim result:   34.22 MBytes/s over 2.51 seconds
Interim result:   34.05 MBytes/s over 1.01 seconds
Interim result:   22.88 MBytes/s over 1.49 seconds
Interim result:   30.03 MBytes/s over 1.00 seconds
Interim result:   28.26 MBytes/s over 1.01 seconds
Interim result:   28.21 MBytes/s over 1.00 seconds
Interim result:   14.26 MBytes/s over 1.98 seconds

# netperf -H 10.10.0.90 -fM -l 600 -D 1,  -- -m 2000
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.10.0.90 ...
From: Sagi Rotem
Date: Tuesday, July 1, 2008 - 12:06 am

Or,
Are these results after you have applied Eli's fix ? 
Sagi


From: Or Gerlitz
Date: Tuesday, July 1, 2008 - 12:08 am

As I wrote, in all cases I applied a manual MTU setting, for which
Eli's fix is not relevant. Note that with a 2k MTU I get quite good
results for small messages and very bad results for large messages.

Or.

From: Sagi Rotem
Date: Tuesday, July 1, 2008 - 12:17 am

See Eli's response in the other mail; let's keep to one thread on this
subject.
sagi


From: Eli Cohen
Date: Tuesday, July 1, 2008 - 12:19 am

Or, just to be sure the kernel does not filter out MTU changes that
have the same value, you can either apply the patches or make sure
that when you change the value, you go through another value. For
example:

ifconfig ib0 mtu 2000
ifconfig ib0 mtu 2044


From: Eli Cohen
Date: Tuesday, July 1, 2008 - 12:45 am

I was just able to verify that the above suggestion is required, at
least on my systems (SLES 10). This is of course valid for changing to
CM mode too.
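
The "go through another value" suggestion can be wrapped in a small helper.
A minimal sketch, assuming the interface name is passed in and that any
intermediate MTU different from the target is acceptable; the helper name
bounce_mtu and the offset of 44 are illustrative choices (the offset simply
reproduces the 2000 -> 2044 sequence from the example above). With
DRY_RUN=1 (the default) the commands are only printed.

```shell
#!/bin/sh
# Sketch: force an MTU change to take effect by first setting a different
# value, so the kernel cannot treat the request as a no-op.
DRY_RUN=${DRY_RUN:-1}

# usage: bounce_mtu IFACE TARGET_MTU   (hypothetical helper)
bounce_mtu() {
    iface=$1; target=$2
    tmp=$(( target - 44 ))      # arbitrary value that differs from the target
    for mtu in "$tmp" "$target"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "ifconfig $iface mtu $mtu"
        else
            ifconfig "$iface" mtu "$mtu"
        fi
    done
}

bounce_mtu ib0 2044
```

In dry-run mode this prints the same two ifconfig commands as in the
example above.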
From: Or Gerlitz
Date: Tuesday, July 1, 2008 - 1:04 am

Eli, just to summarize: the above is fair enough for the problem you
found, but on the same settings the datagram mode bw changes from
30 MB/s to 430 MB/s depending on the message size used by netperf,
so it's a different issue.

Or.



From: Or Gerlitz
Date: Monday, July 7, 2008 - 1:20 am

OK, it turns out that this specific parameter set for netperf / TCP_STREAM was
the source of the performance drop I saw; also setting the socket send buffer
changes everything (all tests use datagram mode, mtu 2044, msg size 64000)

====================================================
	send_buf	recv_buf	BW (MB/s)
====================================================
	NA		NA		30
	32000		NA		640
	64000		NA		750
	128000		NA		790
	128000		128000		840

On this system / current settings, I get nearly the same numbers
for both firmware versions (2.3 and 2.5)
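
The buffer sweep in the table above can be sketched as a list of netperf
invocations. The base command line and the -s option are taken from the
runs in this message; mapping the recv_buf column to netperf's
test-specific -S option (remote socket buffer size) is an assumption, and
the helper name netperf_cmd is hypothetical. The sketch only prints the
commands, it does not run them.

```shell
#!/bin/sh
# Sketch: print one netperf command per row of the send/recv buffer table.
HOST=${HOST:-10.10.0.90}

# usage: netperf_cmd MSG_SIZE [SEND_BUF [RECV_BUF]]   (hypothetical helper)
netperf_cmd() {
    cmd="netperf -H $HOST -fM -l 20 -D 1, -- -m $1"
    if [ -n "${2:-}" ]; then
        cmd="$cmd -s $2"        # local socket buffer size, as used above
    fi
    if [ -n "${3:-}" ]; then
        cmd="$cmd -S $3"        # assumed: remote socket buffer size
    fi
    echo "$cmd"
}

netperf_cmd 64000                   # send_buf NA,     recv_buf NA
netperf_cmd 64000 32000             # send_buf 32000,  recv_buf NA
netperf_cmd 64000 64000             # send_buf 64000,  recv_buf NA
netperf_cmd 64000 128000            # send_buf 128000, recv_buf NA
netperf_cmd 64000 128000 128000     # send_buf 128000, recv_buf 128000
```

Piping the output to sh on a host with netperf installed would run the
sweep against the netserver at HOST.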

Or.


# netperf -H 10.10.0.90 -fM -l 20 -D 1, -- -m 64000
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.10.0.90 (10.10.0.90) port 0 AF_INET : demo
Interim result:   44.11 MBytes/s over 10.68 seconds
Interim result:   34.25 MBytes/s over 2.49 seconds
Interim result:   67.30 MBytes/s over 1.01 seconds
Interim result:   67.28 MBytes/s over 1.00 seconds
Interim result:   66.72 MBytes/s over 1.01 seconds
Interim result:   33.31 MBytes/s over 2.00 seconds
Interim result:   22.63 MBytes/s over 1.47 seconds
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    MBytes/sec

 87380  16384  64000    20.66      42.95
# netperf -H 10.10.0.90 -fM -l 20 -D 1, -- -m 64000 -s 32000
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.10.0.90 (10.10.0.90) port 0 AF_INET : demo
Interim result:  643.92 MBytes/s over 1.00 seconds
Interim result:  646.37 MBytes/s over 1.00 seconds
Interim result:  644.34 MBytes/s over 1.00 seconds
Interim result:  644.39 MBytes/s over 1.00 seconds
Interim result:  643.47 MBytes/s over 1.00 seconds
Interim result:  643.22 MBytes/s over 1.00 seconds
Interim result:  643.55 MBytes/s over 1.00 seconds
Interim result:  642.41 MBytes/s over 1.00 seconds
Interim result:  640.98 MBytes/s over 1.00 seconds
Interim result:  641.88 MBytes/s over 1.00 seconds
Interim result:  642.67 MBytes/s over 1.00 seconds
Interim ...
From: Or Gerlitz
Date: Monday, June 30, 2008 - 11:52 pm

some details on the setup / nodes

I used the fw-25408-2_5_000-MHGH28-XTC_A1.bin (size 487388) firmware file
which was downloaded from the mellanox website and this command

	$ mstflint -d 07:00.0 -i /tmp/fw-25408-2_5_000-MHGH28-XTC_A1.bin b

to burn it (e.g. on a node with the HCA at PCI address 07:00.0)


The connectx HCAs we use here are eagle-DDR, I can double check it if this matters

node A (internal IP 172.25.5.157)
=================================
- RH 5.1 kernel 2.6.18-53.el5 SMP x86_64
- 8GB RAM
- 2 quad core 2.6 GHz Intel CPUs

$ lspci | grep IB

07:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0)

$ mstflint -d 07:00.0 q

Image type:      ConnectX
FW Version:      2.5.0
Device ID:       25418
Chip Revision:   A0
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c90200258630 0002c90200258631 0002c90200258632 0002c90200258633
Board ID:         (MT_04A0110002)
VSD:
PSID:            MT_04A0110002

node B (internal IP 172.25.5.77)
=================================
- RH 5 kernel 2.6.18-8.el5 SMP x86_64
- 8GB RAM
- 2 dual core 2.6 GHz Intel CPUs

$ lspci | grep Inf
03:00.0 InfiniBand: Mellanox Technologies Unknown device 634a (rev a0)

$ mstflint -d 03:00.0 q
Image type:      ConnectX
FW Version:      2.5.0
Device ID:       25418
Chip Revision:   A0
Description:     Node             Port1            Port2            Sys image
GUIDs:           0002c9020025821c 0002c9020025821d 0002c9020025821e 0002c9020025821f
Board ID:         (MT_04A0110002)
VSD:
PSID:            MT_04A0110002





