Hi Roland,
Looking at ipoib_start_xmit, it seems that two checks do not come into play for connected
mode neighbours: the check that handles a gratuitous ARP (i.e. a difference between the remote
GID kept in the ipoib_neigh and the one present in the network stack neighbour), and the check
that handles an attempt to xmit through an ipoib_neigh created by another ipoib device (e.g.
following a bonding failover).
Is this a bug, or am I missing something?
Or.
+static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ struct ipoib_dev_priv *priv = netdev_priv(dev);
+ struct ipoib_neigh *neigh;
+ unsigned long flags;
+
...
+ if (likely(skb->dst && skb->dst->neighbour)) {
...
+ neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+ if (ipoib_cm_get(neigh)) {
+ if (ipoib_cm_up(neigh)) {
+ ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
+ goto out;
+ }
+ } else if (neigh->ah) {
+ if (unlikely((memcmp(&neigh->dgid.raw,
+ skb->dst->neighbour->ha + 4,
+ sizeof(union ib_gid))) ||
+ (neigh->dev != dev))) {
any reason not to apply these two checks on connected mode neighbours?
+ spin_lock(&priv->lock);
...
+ ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+ goto out;
+ }
_______________________________________________
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> Looking on ipoib_start_xmit, it seems that both the check that
> comes to handle a gratuitous ARP (ie a difference between the
> remote GID as kept in the ipoib_neigh to the one present in the
> network stack neighbour) and the check that comes to handle a
> situation where we attempt to xmit an ipoib_neigh created by
> another ipoib device (ie following a bonding failover) - does not
> come into play for the connected mode neighbours.
>
> Isn't it a bug, or I miss something?
Good question. The device test came straight from Moni's patch --
how much have you guys tested bonding of IPoIB CM?
The GID comparison seems a little trickier to handle -- it seems on a
neighbour GID change we need to tear down any connection we might have
in the CM case...
The test for neigh->dev != dev handles a possible race where a fail-over occurs under a high xmit rate: the deletion of the ipoib_neigh portion of the neighbour caused by the bonding fail-over has not happened yet, but since the fail-over the bonding is xmitting through a device which is not the one that created the ipoib_neigh. We have never managed to reproduce a hit on this check... anyway, I will double check how much testing was done with bonding and connected mode.
As for the GID comparison being trickier to handle: not really. When there is a hit on the GID comparison, ipoib_neigh_free() is called, which for a connected mode neighbour will invoke ipoib_cm_destroy_tx(), which will disconnect etc.
Or
Move up the code that checks whether the remote GID stored in the ipoib_neigh differs from
the one present in the neighbour (gratuitous ARP handling), or whether a bonding fail-over has
happened but the neighbour still points to an ipoib_neigh that was not created by the current
slave. This causes the driver to apply these checks to connected mode neighbours as well.
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
I have tested this patch on 2.6.24-rc1 (and testing on 2.6.24-rc8 is now in progress).
Things are basically working fine, but I do want to play more with bonding fail-overs
to make sure nothing was broken wrt gratuitous ARP etc. I will let you know.
-----
Index: linux-2.6.24-rc8/drivers/infiniband/ulp/ipoib/ipoib_main.c
===================================================================
--- linux-2.6.24-rc8.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-01-17 16:37:10.000000000 +0200
+++ linux-2.6.24-rc8/drivers/infiniband/ulp/ipoib/ipoib_main.c 2008-01-17 16:46:51.000000000 +0200
@@ -686,13 +686,8 @@ static int ipoib_start_xmit(struct sk_bu
}
neigh = *to_ipoib_neigh(skb->dst->neighbour);
-
- if (ipoib_cm_get(neigh)) {
- if (ipoib_cm_up(neigh)) {
- ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
- goto out;
- }
- } else if (neigh->ah) {
+
+ if (neigh->ah)
if (unlikely((memcmp(&neigh->dgid.raw,
skb->dst->neighbour->ha + 4,
sizeof(union ib_gid))) ||
@@ -713,6 +708,12 @@ static int ipoib_start_xmit(struct sk_bu
goto out;
}
+ if (ipoib_cm_get(neigh)) {
+ if (ipoib_cm_up(neigh)) {
+ ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
+ goto out;
+ }
+ } else if (neigh->ah) {
ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
goto out;
}
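To make the effect of the reordering in the diff above concrete, here is a minimal user-space sketch of the decision order after the patch. It is a model only, not the driver code: struct model_dev, struct model_neigh, enum xmit_path and choose_path() are hypothetical names introduced for illustration, and XMIT_REFRESH stands in for the stale-entry handling that the diff elides. The only details taken from the thread are the GID comparison at offset 4 of the hardware address and the neigh->dev != dev test.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN 16               /* an InfiniBand GID is 128 bits */
#define QPN_LEN 4                /* the GID starts at ha + 4 in the patch */
#define HA_LEN  (QPN_LEN + GID_LEN)

/* Hypothetical stand-ins for struct net_device and struct ipoib_neigh. */
struct model_dev { int ifindex; };

struct model_neigh {
	uint8_t dgid[GID_LEN];        /* remote GID cached in the entry */
	const struct model_dev *dev;  /* device that created the entry */
	bool has_ah;                  /* models neigh->ah != NULL */
	bool has_cm;                  /* models ipoib_cm_get(neigh) != NULL */
	bool cm_up;                   /* models ipoib_cm_up(neigh) */
};

enum xmit_path { XMIT_REFRESH, XMIT_CM, XMIT_UD, XMIT_QUEUE };

/* True when the cached entry no longer matches the stack's neighbour. */
static inline bool neigh_is_stale(const struct model_neigh *n,
				  const uint8_t ha[HA_LEN],
				  const struct model_dev *dev)
{
	return memcmp(n->dgid, ha + QPN_LEN, GID_LEN) != 0 || n->dev != dev;
}

/* Decision order after the patch: staleness is tested before CM vs UD. */
static inline enum xmit_path choose_path(const struct model_neigh *n,
					 const uint8_t ha[HA_LEN],
					 const struct model_dev *dev)
{
	if (n->has_ah && neigh_is_stale(n, ha, dev))
		return XMIT_REFRESH;      /* drop the entry, build a new one */

	if (n->has_cm) {
		if (n->cm_up)
			return XMIT_CM;   /* ipoib_cm_send() in the driver */
	} else if (n->has_ah) {
		return XMIT_UD;           /* ipoib_send() in the driver */
	}
	return XMIT_QUEUE;                /* not ready yet */
}
```

With the pre-patch order, a neighbour with an established connection would return XMIT_CM before the staleness test was ever reached, which is the behaviour questioned at the start of this thread.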
...I have done some more testing, but not enough to say whether fail-over under connected mode is always slow without this patch. I will be away for the rest of this week and will continue working on it next week.
Or.
OK Roland, I am now confident that this patch is needed; see the reasoning below. Please apply it to 2.6.25; later I will also send it to -stable. Here goes:
Basically, ipoib-cm is not totally broken wrt bonding AND connected mode --without-- this patch being applied, but OTOH it does not function as it should.
My setup has a client node xmitting UDP unicast to a server node, where the server node is bonded (ib0 and ib1 are enslaved by bond0). I tried three types of fail-over, each of which causes the bonding at the server node to send a gratuitous ARP; without this patch, ipoib at the client side takes no action on it:
A) using "primary slave up" (*)
B) taking an interface down
C) taking a port down
In the "primary slave up" fail-over case, since the non-active slave interface is up and running, the traffic keeps going through it, so forever at the client side there is a neighbour pointing to GID X while the traffic goes to (the QP associated with) GID Y.
In the interface down fail-over case, the going-down code closes the RX QP. Since connected mode (cm) is implemented over RC (...), this causes the HCA to generate a send completion with an IB_WC_RETRY_EXC_ERR error; ipoib_cm_handle_tx_wc calls ipoib_neigh_free, and when the next xmit is called from the stack, ipoib creates a new ipoib_neigh, this time against the correct GID.
In the port going down case, again the RC implementation causes the retry exceeded error to take place, and from here it is the same as in the previous case.
Other than all the above, gratuitous ARP is used in other HA schemes, such as a floating IP address between I/O targets. Since connected mode ignores it, such a scheme will not work without the patch.
Or
(*) the bonding HA mode enables you to select a primary slave which, once up, is moved to be the active slave.
So to cause this fail-over, I take the primary (eg ib0) down, and fail-over happens to the second slave (eg ib1); now I take the primary up and a second fail-over ...
Hi Roland,
Do you need any more clarification from me to merge this into 2.6.25?
Or
thanks, applied
The current IPoIB-UD implementation limits the IPoIB payload size to 2048 by hard-coding IPOIB_PACKET_SIZE. The implementation is designed for a kernel PAGE_SIZE of 4K or greater. If the kernel PAGE_SIZE were 2K, memory buffer allocation would fail whenever large memory buffers are lacking. However, most distros support PAGE_SIZE >= 4K, so this implementation has no problem with a 2048 payload. The implementation is simple, but it prevents HCA devices that support a 4096 payload, such as the IBM eHCA2, from making use of it.
This patch allows an IPoIB-UD MTU of up to 4092 (4K - IPOIB_ENCAP_LEN) when the HCA can support a 4K MTU. In this patch, the APIs for S/G buffer allocation in IPoIB-CM mode have been made generic so that IPoIB-UD and IPoIB-CM can share the S/G code. When PAGE_SIZE is equal to or greater than IPOIB_UD_BUF_SIZE plus the bytes of padding needed to align the IP header, only one buffer is needed for the 4K MTU buffer allocation; otherwise, two buffers must be allocated using S/G.
The node's IPoIB link MTU is the minimum of the admin-configurable MTU (set through ifconfig) and the MTU of the IPoIB default broadcast group. When the Subnet Manager enables the default broadcast group during start-up, the IPoIB link MTU of the subnet is the MTU of the default broadcast group. A node whose IB MTU is smaller than this value cannot join the IPoIB subnet. A node whose IB MTU is greater than this value will join the IPoIB subnet, and this value will be set as its IPoIB link MTU. If the Subnet Manager disables the default broadcast group during start-up, the first node brought up in the subnet creates the default IPoIB broadcast group based on negotiation with the Subnet Manager; the default is currently set to 2K according to the IPoIB RFC.
The patch will be split into two patches:
1. Make IPoIB-CM RX S/G APIs generic
2. Enable IPoIB-UD RX S/G
I am trying to make these two patches more independent so that they are easy to test. ipoib_cm_alloc_rx_skb() will be renamed in the second patch.
Please review these ...
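The sizing rules in the description above can be checked with a little arithmetic. The following is a minimal sketch, not code from the patch: the function names are hypothetical, the page size is passed as a parameter so that different platforms can be tried, and the 4 bytes of IP-header alignment padding are taken from the IPOIB_UD_BUF_SIZE macro posted later in this thread. IB_GRH_BYTES is 40 and IPOIB_ENCAP_LEN is 4, as in the kernel headers.

```c
enum {
	ENCAP_LEN    = 4,   /* IPOIB_ENCAP_LEN */
	GRH_BYTES    = 40,  /* IB_GRH_BYTES, the Global Routing Header */
	IP_ALIGN_PAD = 4    /* padding to align the IP header */
};

/* Largest IP MTU carried by one UD message of ib_mtu bytes. */
static inline unsigned int ud_mtu(unsigned int ib_mtu)
{
	return ib_mtu - ENCAP_LEN;
}

/* Receive buffer size: IB payload + GRH + alignment padding. */
static inline unsigned int ud_buf_size(unsigned int ib_mtu)
{
	return ib_mtu + GRH_BYTES + IP_ALIGN_PAD;
}

/* Number of buffers of at most one page each (rounded up). */
static inline unsigned int ud_rx_sg(unsigned int ib_mtu,
				    unsigned int page_size)
{
	return (ud_buf_size(ib_mtu) + page_size - 1) / page_size;
}

/* Link MTU: minimum of the admin MTU and the broadcast group MTU. */
static inline unsigned int link_mtu(unsigned int admin_mtu,
				    unsigned int bcast_mtu)
{
	return admin_mtu < bcast_mtu ? admin_mtu : bcast_mtu;
}
```

For a 4K IB MTU the buffer is 4096 + 40 + 4 = 4140 bytes, so ud_rx_sg(4096, 4096) is 2 on a 4K-page kernel, while ud_rx_sg(4096, 65536) is 1 on a 64K-page kernel; a 2K IB MTU needs a single buffer on either.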
Please review below patch while I am testing so I can integrate your
comments in my test immediately.
Thanks
Shirley
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 25 ++++--
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 139
++++++------------------------
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 85 +++++++++++++++++++
3 files changed, 131 insertions(+), 118 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index fe250c6..138f1a3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -138,7 +138,7 @@ struct ipoib_mcast {
struct ipoib_rx_buf {
struct sk_buff *skb;
- u64 mapping;
+ u64 mapping[IPOIB_CM_RX_SG];
};
struct ipoib_tx_buf {
@@ -189,7 +189,7 @@ enum ipoib_cm_state {
struct ipoib_cm_rx {
struct ib_cm_id *id;
struct ib_qp *qp;
- struct ipoib_cm_rx_buf *rx_ring;
+ struct ipoib_rx_buf *rx_ring;
struct list_head list;
struct net_device *dev;
unsigned long jiffies;
@@ -212,11 +212,6 @@ struct ipoib_cm_tx {
struct ib_wc ibwc[IPOIB_NUM_WC];
};
-struct ipoib_cm_rx_buf {
- struct sk_buff *skb;
- u64 mapping[IPOIB_CM_RX_SG];
-};
-
struct ipoib_cm_dev_priv {
struct ib_srq *srq;
struct ipoib_cm_rx_buf *srq_ring;
@@ -458,6 +453,22 @@ int ipoib_vlan_delete(struct net_device *pdev,
unsigned short pkey);
void ipoib_pkey_poll(struct work_struct *work);
int ipoib_pkey_dev_delay_open(struct net_device *dev);
void ipoib_drain_cq(struct net_device *dev);
+void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
+ unsigned int length, struct sk_buff *toskb);
+struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
+ int id, int frags, int head_size,
+ int pad, u64 *mapping);
+void inline ipoib_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags,
+ int head_size, u64 *mapping)
+{
+ int ...
This patch makes the IPoIB-CM RX S/G APIs more generic so that they can be
reused later for IPoIB-UD RX S/G.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 26 +++++-
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 135
++++++-------------------------
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 85 +++++++++++++++++++
3 files changed, 132 insertions(+), 114 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index fe250c6..d1d3ca2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -141,6 +141,11 @@ struct ipoib_rx_buf {
u64 mapping;
};
+struct ipoib_cm_rx_buf {
+ struct sk_buff *skb;
+ u64 mapping[IPOIB_CM_RX_SG];
+};
+
struct ipoib_tx_buf {
struct sk_buff *skb;
u64 mapping;
@@ -212,11 +217,6 @@ struct ipoib_cm_tx {
struct ib_wc ibwc[IPOIB_NUM_WC];
};
-struct ipoib_cm_rx_buf {
- struct sk_buff *skb;
- u64 mapping[IPOIB_CM_RX_SG];
-};
-
struct ipoib_cm_dev_priv {
struct ib_srq *srq;
struct ipoib_cm_rx_buf *srq_ring;
@@ -458,6 +458,22 @@ int ipoib_vlan_delete(struct net_device *pdev,
unsigned short pkey);
void ipoib_pkey_poll(struct work_struct *work);
int ipoib_pkey_dev_delay_open(struct net_device *dev);
void ipoib_drain_cq(struct net_device *dev);
+void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
+ unsigned int length, struct sk_buff *toskb);
+struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
+ int id, int frags, int head_size,
+ int pad, u64 *mapping);
+static void inline ipoib_dma_unmap_rx(struct ipoib_dev_priv *priv, int
frags,
+ int head_size, u64 *mapping)
+{
+ int i;
+ ib_dma_unmap_single(priv->ca, mapping[0], head_size, DMA_FROM_DEVICE);
+ for (i = 0; i < frags; i++)
+ ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE,
+ DMA_FROM_DEVICE);
+
+}
+
#ifdef CONFIG_INFINIBAND_IPOIB_CM
diff --git ...
This patch makes two of the IPoIB-CM RX S/G APIs generic so that they can be
reused. This patch is the same as the V1 previously submitted.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 26 +++++-
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 135
++++++-------------------------
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 85 +++++++++++++++++++
3 files changed, 132 insertions(+), 114 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index fe250c6..d1d3ca2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -141,6 +141,11 @@ struct ipoib_rx_buf {
u64 mapping;
};
+struct ipoib_cm_rx_buf {
+ struct sk_buff *skb;
+ u64 mapping[IPOIB_CM_RX_SG];
+};
+
struct ipoib_tx_buf {
struct sk_buff *skb;
u64 mapping;
@@ -212,11 +217,6 @@ struct ipoib_cm_tx {
struct ib_wc ibwc[IPOIB_NUM_WC];
};
-struct ipoib_cm_rx_buf {
- struct sk_buff *skb;
- u64 mapping[IPOIB_CM_RX_SG];
-};
-
struct ipoib_cm_dev_priv {
struct ib_srq *srq;
struct ipoib_cm_rx_buf *srq_ring;
@@ -458,6 +458,22 @@ int ipoib_vlan_delete(struct net_device *pdev,
unsigned short pkey);
void ipoib_pkey_poll(struct work_struct *work);
int ipoib_pkey_dev_delay_open(struct net_device *dev);
void ipoib_drain_cq(struct net_device *dev);
+void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
+ unsigned int length, struct sk_buff *toskb);
+struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
+ int id, int frags, int head_size,
+ int pad, u64 *mapping);
+static void inline ipoib_dma_unmap_rx(struct ipoib_dev_priv *priv, int
frags,
+ int head_size, u64 *mapping)
+{
+ int i;
+ ib_dma_unmap_single(priv->ca, mapping[0], head_size, DMA_FROM_DEVICE);
+ for (i = 0; i < frags; i++)
+ ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE,
+ DMA_FROM_DEVICE);
+
+}
+
#ifdef ...
This patch creates a couple of APIs for UD RX S/G to be used later.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 9 ++++
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 65
+++++++++++++++++++++++++++++++
2 files changed, 74 insertions(+), 0 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index fe250c6..415bf9a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -61,6 +61,10 @@ enum {
IPOIB_ENCAP_LEN = 4,
+ IPOIB_MAX_IB_MTU = 4096,
+ IPOIB_UD_HEAD_SIZE = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
+ IPOIB_UD_RX_SG = 2, /* for 4K MTU */
+
IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */
IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN,
IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
@@ -136,6 +140,11 @@ struct ipoib_mcast {
struct net_device *dev;
};
+struct ipoib_sg_rx_buf {
+ struct sk_buff *skb;
+ u64 mapping[IPOIB_UD_RX_SG];
+};
+
struct ipoib_rx_buf {
struct sk_buff *skb;
u64 mapping;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 52bc2bd..9ca3d34 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -87,6 +87,71 @@ void ipoib_free_ah(struct kref *kref)
spin_unlock_irqrestore(&priv->lock, flags);
}
+/* Adjust length of skb with fragments to match received data */
+static void ipoib_ud_skb_put_frags(struct sk_buff *skb, unsigned int
length,
+ struct sk_buff *toskb)
+{
+ unsigned int size;
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[0];
+
+ /* put header into skb */
+ size = min(length, (unsigned)IPOIB_UD_HEAD_SIZE);
+ skb->tail += size;
+ skb->len += size;
+ length -= size;
+
+ if (length == 0) {
+ /* don't need this page */
+ skb_fill_page_desc(toskb, 0, frag->page, 0, PAGE_SIZE);
+ --skb_shinfo(skb)->nr_frags;
+ } else ...
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 138f1a3..65b1159 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -56,11 +56,11 @@
/* constants */
enum {
- IPOIB_PACKET_SIZE = 2048,
- IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
-
IPOIB_ENCAP_LEN = 4,
+ IPOIB_MAX_IB_MTU = 4096, /* max ib device payload is 4096 */
+ IPOIB_UD_MAX_RX_SG = ALIGN(IPOIB_MAX_IB_MTU + IB_GRH_BYTES + 4,
PAGE_SIZE) / PAGE_SIZE, /* padding to align IP header */
+
IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */
IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN,
IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
@@ -314,6 +314,9 @@ struct ipoib_dev_priv {
struct dentry *mcg_dentry;
struct dentry *path_dentry;
#endif
+ int max_ib_mtu;
+ struct ib_sge rx_sge[IPOIB_UD_MAX_RX_SG];
+ struct ib_recv_wr rx_wr;
};
struct ipoib_ah {
@@ -354,6 +357,11 @@ struct ipoib_neigh {
struct list_head list;
};
+#define IPOIB_UD_MTU(ib_mtu) (ib_mtu - IPOIB_ENCAP_LEN)
+#define IPOIB_UD_BUF_SIZE(ib_mtu) (ib_mtu + IB_GRH_BYTES + 4) /*
padding to align IP header */
+#define IPOIB_UD_HEAD_SIZE(ib_mtu) (IPOIB_UD_BUF_SIZE(ib_mtu)) %
PAGE_SIZE
+#define IPOIB_UD_RX_SG(ib_mtu) ALIGN(IPOIB_UD_BUF_SIZE(ib_mtu),
PAGE_SIZE) / PAGE_SIZE
+
/*
* We stash a pointer to our private neighbour information after our
* hardware address in neigh->ha. The ALIGN() expression here makes
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a082466..646aeb2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -194,7 +194,7 @@ static int ipoib_change_mtu(struct net_device *dev,
int new_mtu)
return 0;
}
- if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+ if (new_mtu > ...
Found a problem in the generated patch for ipoib_verbs.c. I will fix it
tomorrow; it should be:
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -150,7 +150,7 @@ int ipoib_transport_dev_init(struct net_device *dev,
struct ib_device *ca)
.max_send_wr = ipoib_sendq_size,
.max_recv_wr = ipoib_recvq_size,
.max_send_sge = 1,
- .max_recv_sge = 1
+ .max_recv_sge =
IPOIB_UD_RX_SG(priv->max_ib_mtu)
},
.sq_sig_type = IB_SIGNAL_ALL_WR,
.qp_type = IB_QPT_UD
@@ -208,6 +208,16 @@ int ipoib_transport_dev_init(struct net_device
*dev, struct ib_device *ca)
priv->tx_wr.num_sge = 1;
priv->tx_wr.send_flags = IB_SEND_SIGNALED;
+ priv->rx_sge[0].length = IPOIB_UD_HEAD_SIZE(priv->max_ib_mtu);
+ for (i = 0; i < IPOIB_UD_RX_SG(priv->max_ib_mtu) - 1; ++i) {
+ priv->rx_sge[i].lkey = priv->mr->lkey;
+ priv->rx_sge[i + 1].length = PAGE_SIZE;
+ }
+ priv->rx_sge[i + 1].lkey = priv->mr->lkey;
+ priv->rx_wr.num_sge = IPOIB_UD_RX_SG(priv->max_ib_mtu);
+ priv->rx_wr.next = NULL;
+ priv->rx_wr.sg_list = priv->rx_sge;
+
return 0;
out_free_cq:
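As I read the hunk above, the loop exits with i equal to IPOIB_UD_RX_SG(priv->max_ib_mtu) - 1, so the assignment after it writes rx_sge[i + 1].lkey, one entry past the last one that was given a length, while that last entry itself gets no lkey. The following is a minimal user-space sketch, assuming the intent is that every entry in use receives both a length and an lkey. struct model_sge, MODEL_MAX_RX_SG and model_init_rx_sge() are hypothetical names for illustration, not driver code.

```c
#include <stdint.h>

/* Hypothetical cap, playing the role of IPOIB_UD_MAX_RX_SG. */
#define MODEL_MAX_RX_SG 2

/* Hypothetical stand-in for the length and lkey fields of struct ib_sge. */
struct model_sge {
	uint32_t length;
	uint32_t lkey;
};

/*
 * Fill sge[0..n-1]: entry 0 describes the head buffer, each later
 * entry describes one full page. Returns the value to store in
 * num_sge, or -1 if n is out of range.
 */
static inline int model_init_rx_sge(struct model_sge *sge, int n,
				    uint32_t head_size, uint32_t page_size,
				    uint32_t lkey)
{
	int i;

	if (n < 1 || n > MODEL_MAX_RX_SG)
		return -1;

	for (i = 0; i < n; ++i) {
		sge[i].length = (i == 0) ? head_size : page_size;
		sge[i].lkey = lkey;
	}
	return n;
}
```

Setting length and lkey in the same iteration keeps the index in range for any n, including the single-buffer case where the original loop body never runs.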
This patch sets up all IPoIB-UD RX S/G related parameters.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 13 +++++++++++++
drivers/infiniband/ulp/ipoib/ipoib_main.c | 19
++++++++++++++-----
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 +--
drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 14 ++++++++++++--
4 files changed, 40 insertions(+), 9 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index d1d3ca2..004a80b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -61,6 +61,10 @@ enum {
IPOIB_ENCAP_LEN = 4,
+ IPOIB_MAX_IB_MTU = 4096,
+ IPOIB_UD_MAX_RX_SG = ALIGN(IPOIB_MAX_IB_MTU + IB_GRH_BYTES + 4,
+ PAGE_SIZE) / PAGE_SIZE, /* padding to align IP header */
+
IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */
IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN,
IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
@@ -319,6 +323,9 @@ struct ipoib_dev_priv {
struct dentry *mcg_dentry;
struct dentry *path_dentry;
#endif
+ int max_ib_mtu;
+ struct ib_sge rx_sge[IPOIB_UD_MAX_RX_SG];
+ struct ib_recv_wr rx_wr;
};
struct ipoib_ah {
@@ -359,6 +366,12 @@ struct ipoib_neigh {
struct list_head list;
};
+#define IPOIB_UD_MTU(ib_mtu) (ib_mtu - IPOIB_ENCAP_LEN)
+/* padding to align IP header */
+#define IPOIB_UD_BUF_SIZE(ib_mtu) (ib_mtu + IB_GRH_BYTES + 4)
+#define IPOIB_UD_HEAD_SIZE(ib_mtu) (IPOIB_UD_BUF_SIZE(ib_mtu)) %
PAGE_SIZE
+#define IPOIB_UD_RX_SG(ib_mtu) ALIGN(IPOIB_UD_BUF_SIZE(ib_mtu),
PAGE_SIZE) / PAGE_SIZE
+
/*
* We stash a pointer to our private neighbour information after our
* hardware address in neigh->ha. The ALIGN() expression here makes
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a082466..242591f 100644
--- ...
My unix mail is down. Here is the updated patch. I will need to resend it
when my unix mail is back.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 13 +++++++++++++
drivers/infiniband/ulp/ipoib/ipoib_main.c | 19 ++++++++++++++-----
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 +--
drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 14 ++++++++++++--
4 files changed, 40 insertions(+), 9 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index d1d3ca2..004a80b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -61,6 +61,10 @@ enum {
IPOIB_ENCAP_LEN = 4,
+ IPOIB_MAX_IB_MTU = 4096,
+ IPOIB_UD_MAX_RX_SG = ALIGN(IPOIB_MAX_IB_MTU + IB_GRH_BYTES + 4,
+ PAGE_SIZE) / PAGE_SIZE, /* padding to align IP header */
+
IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */
IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN,
IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
@@ -319,6 +323,9 @@ struct ipoib_dev_priv {
struct dentry *mcg_dentry;
struct dentry *path_dentry;
#endif
+ int max_ib_mtu;
+ struct ib_sge rx_sge[IPOIB_UD_MAX_RX_SG];
+ struct ib_recv_wr rx_wr;
};
struct ipoib_ah {
@@ -359,6 +366,12 @@ struct ipoib_neigh {
struct list_head list;
};
+#define IPOIB_UD_MTU(ib_mtu) (ib_mtu - IPOIB_ENCAP_LEN)
+/* padding to align IP header */
+#define IPOIB_UD_BUF_SIZE(ib_mtu) (ib_mtu + IB_GRH_BYTES + 4)
+#define IPOIB_UD_HEAD_SIZE(ib_mtu) (IPOIB_UD_BUF_SIZE(ib_mtu)) % PAGE_SIZE
+#define IPOIB_UD_RX_SG(ib_mtu) ALIGN(IPOIB_UD_BUF_SIZE(ib_mtu), PAGE_SIZE) / PAGE_SIZE
+
/*
* We stash a pointer to our private neighbour ...
This patch is the same as the previously submitted version (V1).
This patch gets IPoIB-UD RX S/G ready.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index d1d3ca2..004a80b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -61,6 +61,10 @@ enum {
IPOIB_ENCAP_LEN = 4,
+ IPOIB_MAX_IB_MTU = 4096,
+ IPOIB_UD_MAX_RX_SG = ALIGN(IPOIB_MAX_IB_MTU + IB_GRH_BYTES + 4,
+ PAGE_SIZE) / PAGE_SIZE, /* padding to align IP header */
+
IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */
IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN,
IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
@@ -319,6 +323,9 @@ struct ipoib_dev_priv {
struct dentry *mcg_dentry;
struct dentry *path_dentry;
#endif
+ int max_ib_mtu;
+ struct ib_sge rx_sge[IPOIB_UD_MAX_RX_SG];
+ struct ib_recv_wr rx_wr;
};
struct ipoib_ah {
@@ -359,6 +366,12 @@ struct ipoib_neigh {
struct list_head list;
};
+#define IPOIB_UD_MTU(ib_mtu) (ib_mtu - IPOIB_ENCAP_LEN)
+/* padding to align IP header */
+#define IPOIB_UD_BUF_SIZE(ib_mtu) (ib_mtu + IB_GRH_BYTES + 4)
+#define IPOIB_UD_HEAD_SIZE(ib_mtu) (IPOIB_UD_BUF_SIZE(ib_mtu)) %
PAGE_SIZE
+#define IPOIB_UD_RX_SG(ib_mtu) ALIGN(IPOIB_UD_BUF_SIZE(ib_mtu),
PAGE_SIZE) / PAGE_SIZE
+
/*
* We stash a pointer to our private neighbour information after our
* hardware address in neigh->ha. The ALIGN() expression here makes
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a082466..242591f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -194,7 +194,7 @@ static int ipoib_change_mtu(struct net_device *dev,
int new_mtu)
return 0;
}
- if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+ if (new_mtu > IPOIB_UD_MTU(priv->max_ib_mtu))
return -EINVAL;
...
Signed-off-by: Shirley Ma <xma@us.ibm.com>
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 138f1a3..65b1159 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -56,11 +56,11 @@
/* constants */
enum {
- IPOIB_PACKET_SIZE = 2048,
- IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
-
IPOIB_ENCAP_LEN = 4,
+ IPOIB_MAX_IB_MTU = 4096, /* max ib device payload is 4096 */
+ IPOIB_UD_MAX_RX_SG = ALIGN(IPOIB_MAX_IB_MTU + IB_GRH_BYTES + 4,
PAGE_SIZE) / PAGE_SIZE, /* padding to align IP header */
+
IPOIB_CM_MTU = 0x10000 - 0x10, /* padding to align header to 16 */
IPOIB_CM_BUF_SIZE = IPOIB_CM_MTU + IPOIB_ENCAP_LEN,
IPOIB_CM_HEAD_SIZE = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
@@ -314,6 +314,9 @@ struct ipoib_dev_priv {
struct dentry *mcg_dentry;
struct dentry *path_dentry;
#endif
+ int max_ib_mtu;
+ struct ib_sge rx_sge[IPOIB_UD_MAX_RX_SG];
+ struct ib_recv_wr rx_wr;
};
struct ipoib_ah {
@@ -354,6 +357,11 @@ struct ipoib_neigh {
struct list_head list;
};
+#define IPOIB_UD_MTU(ib_mtu) (ib_mtu - IPOIB_ENCAP_LEN)
+#define IPOIB_UD_BUF_SIZE(ib_mtu) (ib_mtu + IB_GRH_BYTES + 4) /*
padding to align IP header */
+#define IPOIB_UD_HEAD_SIZE(ib_mtu) (IPOIB_UD_BUF_SIZE(ib_mtu)) %
PAGE_SIZE
+#define IPOIB_UD_RX_SG(ib_mtu) ALIGN(IPOIB_UD_BUF_SIZE(ib_mtu),
PAGE_SIZE) / PAGE_SIZE
+
/*
* We stash a pointer to our private neighbour information after our
* hardware address in neigh->ha. The ALIGN() expression here makes
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a082466..646aeb2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -194,7 +194,7 @@ static int ipoib_change_mtu(struct net_device *dev,
int new_mtu)
return 0;
}
- if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+ if (new_mtu > IPOIB_UD_MTU(priv->max_ib_mtu))
return -EINVAL;
priv->admin_mtu ...
Define and set several UD RX S/G parameters to be used later.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 16 ++++++++++++++++
drivers/infiniband/ulp/ipoib/ipoib_main.c | 19
++++++++++++++-----
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 +--
drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 13 ++++++++++++-
4 files changed, 43 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 415bf9a..6b5e108 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -328,6 +328,9 @@ struct ipoib_dev_priv {
struct dentry *mcg_dentry;
struct dentry *path_dentry;
#endif
+ int max_ib_mtu;
+ struct ib_sge rx_sge[IPOIB_UD_RX_SG];
+ struct ib_recv_wr rx_wr;
};
struct ipoib_ah {
@@ -368,6 +371,19 @@ struct ipoib_neigh {
struct list_head list;
};
+#define IPOIB_UD_MTU(ib_mtu) (ib_mtu - IPOIB_ENCAP_LEN)
+#define IPOIB_UD_BUF_SIZE(ib_mtu) (ib_mtu + IB_GRH_BYTES)
+static inline int ipoib_ud_need_sg(int ib_mtu)
+{
+ return (IPOIB_UD_BUF_SIZE(ib_mtu) > PAGE_SIZE) ? 1 : 0;
+}
+static inline void ipoib_sg_dma_unmap_rx(struct ipoib_dev_priv *priv,
+ u64 mapping[IPOIB_UD_RX_SG])
+{
+ ib_dma_unmap_single(priv->ca, mapping[0], IPOIB_UD_HEAD_SIZE,
DMA_FROM_DEVICE);
+ ib_dma_unmap_single(priv->ca, mapping[1], PAGE_SIZE, DMA_FROM_DEVICE);
+}
+
/*
* We stash a pointer to our private neighbour information after our
* hardware address in neigh->ha. The ALIGN() expression here makes
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a082466..242591f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -194,7 +194,7 @@ static int ipoib_change_mtu(struct net_device *dev,
int new_mtu)
return 0;
}
- if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+ if (new_mtu > ...
The patchset has been tested on an Intel platform with a 2K MTU. Here is the updated
patch:
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 16 ++++++++++++++++
drivers/infiniband/ulp/ipoib/ipoib_main.c | 19
++++++++++++++-----
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 +--
drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 16 +++++++++++++++-
4 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 415bf9a..6b5e108 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -328,6 +328,9 @@ struct ipoib_dev_priv {
struct dentry *mcg_dentry;
struct dentry *path_dentry;
#endif
+ int max_ib_mtu;
+ struct ib_sge rx_sge[IPOIB_UD_RX_SG];
+ struct ib_recv_wr rx_wr;
};
struct ipoib_ah {
@@ -368,6 +371,19 @@ struct ipoib_neigh {
struct list_head list;
};
+#define IPOIB_UD_MTU(ib_mtu) (ib_mtu - IPOIB_ENCAP_LEN)
+#define IPOIB_UD_BUF_SIZE(ib_mtu) (ib_mtu + IB_GRH_BYTES)
+static inline int ipoib_ud_need_sg(int ib_mtu)
+{
+ return (IPOIB_UD_BUF_SIZE(ib_mtu) > PAGE_SIZE) ? 1 : 0;
+}
+static inline void ipoib_sg_dma_unmap_rx(struct ipoib_dev_priv *priv,
+ u64 mapping[IPOIB_UD_RX_SG])
+{
+ ib_dma_unmap_single(priv->ca, mapping[0], IPOIB_UD_HEAD_SIZE,
DMA_FROM_DEVICE);
+ ib_dma_unmap_single(priv->ca, mapping[1], PAGE_SIZE, DMA_FROM_DEVICE);
+}
+
/*
* We stash a pointer to our private neighbour information after our
* hardware address in neigh->ha. The ALIGN() expression here makes
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a082466..242591f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -194,7 +194,7 @@ static int ipoib_change_mtu(struct net_device *dev,
int new_mtu)
return 0;
}
- if (new_mtu > IPOIB_PACKET_SIZE - ...
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 65b1159..969955e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -463,9 +463,9 @@ int ipoib_pkey_dev_delay_open(struct net_device
*dev);
void ipoib_drain_cq(struct net_device *dev);
void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
unsigned int length, struct sk_buff *toskb);
-struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
- int id, int frags, int head_size,
- int pad, u64 *mapping);
+struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev,
+ int id, int frags, int head_size,
+ int pad, u64 *mapping);
void inline ipoib_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags,
int head_size, u64 *mapping)
{
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index c7d42ea..a9af796 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -283,11 +283,10 @@ static int ipoib_cm_nonsrq_init_rx(struct
net_device *dev, struct ib_cm_id *cm_i
spin_unlock_irq(&priv->lock);
for (i = 0; i < ipoib_recvq_size; ++i) {
- rx->rx_ring[i].skb = ipoib_cm_alloc_rx_skb(dev, i,
- IPOIB_CM_RX_SG - 1,
- IPOIB_CM_HEAD_SIZE,
- 12,
- rx->rx_ring[i].mapping);
+ rx->rx_ring[i].skb = ipoib_alloc_rx_skb(dev, i,
+ IPOIB_CM_RX_SG - 1,
+ IPOIB_CM_HEAD_SIZE, 12,
+ rx->rx_ring[i].mapping);
if (!rx->rx_ring[i].skb) {
ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
ret = -ENOMEM;
@@ -491,8 +490,8 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev,
struct ib_wc *wc)
frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
(unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE;
- newskb = ipoib_cm_alloc_rx_skb(dev, wr_id, frags, IPOIB_CM_HEAD_SIZE,
- ...
This patch enables IPoIB-UD RX to allocate S/G buffers for payloads of up to 4096
bytes. The IPoIB link MTU can then be up to 4K - 4 (4092) bytes.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 14 +----
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 25 ++++----
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 95
+++++++++++-------------------
3 files changed, 50 insertions(+), 84 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 004a80b..57d33d5 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -56,9 +56,6 @@
/* constants */
enum {
- IPOIB_PACKET_SIZE = 2048,
- IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
-
IPOIB_ENCAP_LEN = 4,
IPOIB_MAX_IB_MTU = 4096,
@@ -142,11 +139,6 @@ struct ipoib_mcast {
struct ipoib_rx_buf {
struct sk_buff *skb;
- u64 mapping;
-};
-
-struct ipoib_cm_rx_buf {
- struct sk_buff *skb;
u64 mapping[IPOIB_CM_RX_SG];
};
@@ -198,7 +190,7 @@ enum ipoib_cm_state {
struct ipoib_cm_rx {
struct ib_cm_id *id;
struct ib_qp *qp;
- struct ipoib_cm_rx_buf *rx_ring;
+ struct ipoib_rx_buf *rx_ring;
struct list_head list;
struct net_device *dev;
unsigned long jiffies;
@@ -223,7 +215,7 @@ struct ipoib_cm_tx {
struct ipoib_cm_dev_priv {
struct ib_srq *srq;
- struct ipoib_cm_rx_buf *srq_ring;
+ struct ipoib_rx_buf *srq_ring;
struct ib_cm_id *id;
struct list_head passive_ids; /* state: LIVE */
struct list_head rx_error_list; /* state: ERROR */
@@ -473,7 +465,7 @@ int ipoib_pkey_dev_delay_open(struct net_device
*dev);
void ipoib_drain_cq(struct net_device *dev);
void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
unsigned int length, struct sk_buff *toskb);
-struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
+struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev,
int id, int frags, int ...
This patch keeps the existing 2K MTU IPoIB-UD implementation, which is used for both
2K MTU and 4K MTU without S/G. RX S/G is used for 4K MTU only when necessary.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 28 ++++-----
drivers/infiniband/ulp/ipoib/ipoib_cm.c | 10 ++--
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 108
++++++++++++++++++++++---------
3 files changed, 95 insertions(+), 51 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 004a80b..6c33d7d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -56,9 +56,6 @@
/* constants */
enum {
- IPOIB_PACKET_SIZE = 2048,
- IPOIB_BUF_SIZE = IPOIB_PACKET_SIZE + IB_GRH_BYTES,
-
IPOIB_ENCAP_LEN = 4,
IPOIB_MAX_IB_MTU = 4096,
@@ -140,12 +137,7 @@ struct ipoib_mcast {
struct net_device *dev;
};
-struct ipoib_rx_buf {
- struct sk_buff *skb;
- u64 mapping;
-};
-
-struct ipoib_cm_rx_buf {
+struct ipoib_sg_rx_buf {
struct sk_buff *skb;
u64 mapping[IPOIB_CM_RX_SG];
};
@@ -198,7 +190,7 @@ enum ipoib_cm_state {
struct ipoib_cm_rx {
struct ib_cm_id *id;
struct ib_qp *qp;
- struct ipoib_cm_rx_buf *rx_ring;
+ struct ipoib_sg_rx_buf *rx_ring;
struct list_head list;
struct net_device *dev;
unsigned long jiffies;
@@ -223,7 +215,7 @@ struct ipoib_cm_tx {
struct ipoib_cm_dev_priv {
struct ib_srq *srq;
- struct ipoib_cm_rx_buf *srq_ring;
+ struct ipoib_sg_rx_buf *srq_ring;
struct ib_cm_id *id;
struct list_head passive_ids; /* state: LIVE */
struct list_head rx_error_list; /* state: ERROR */
@@ -294,7 +286,7 @@ struct ipoib_dev_priv {
unsigned int admin_mtu;
unsigned int mcast_mtu;
- struct ipoib_rx_buf *rx_ring;
+ struct ipoib_sg_rx_buf *rx_ring;
spinlock_t tx_lock;
struct ipoib_tx_buf *tx_ring;
@@ -367,10 +359,14 @@ struct ipoib_neigh {
};
#define ...
This patch enables IPoIB-UD 4K MTU support. If PAGE_SIZE is smaller than 4K MTU + GRH
head + IPoIB head, then two buffers are allocated; otherwise one buffer is
used.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 7 +--
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 90
+++++++++++++++++++------------
2 files changed, 56 insertions(+), 41 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 6b5e108..faee740 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -145,11 +145,6 @@ struct ipoib_sg_rx_buf {
u64 mapping[IPOIB_UD_RX_SG];
};
-struct ipoib_rx_buf {
- struct sk_buff *skb;
- u64 mapping;
-};
-
struct ipoib_tx_buf {
struct sk_buff *skb;
u64 mapping;
@@ -299,7 +294,7 @@ struct ipoib_dev_priv {
unsigned int admin_mtu;
unsigned int mcast_mtu;
- struct ipoib_rx_buf *rx_ring;
+ struct ipoib_sg_rx_buf *rx_ring;
spinlock_t tx_lock;
struct ipoib_tx_buf *tx_ring;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 9ca3d34..93025d3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -155,29 +155,22 @@ partial_error:
static int ipoib_ib_post_receive(struct net_device *dev, int id)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ib_sge list;
- struct ib_recv_wr param;
struct ib_recv_wr *bad_wr;
int ret;
- list.addr = priv->rx_ring[id].mapping;
- list.length = IPOIB_BUF_SIZE;
- list.lkey = priv->mr->lkey;
-
- param.next = NULL;
- param.wr_id = id | IPOIB_OP_RECV;
- param.sg_list = &list;
- param.num_sge = 1;
-
- ret = ib_post_recv(priv->qp, &param, &bad_wr);
+ priv->rx_wr.wr_id = id | IPOIB_OP_RECV;
+ ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
if (unlikely(ret)) {
+ if (ipoib_ud_need_sg(priv->max_ib_mtu))
+ ipoib_sg_dma_unmap_rx(priv, ...
This patchset has been tested for 2K MTU on the Intel platform with mthca.
Here is the updated one:
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 7 +--
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 91
+++++++++++++++++++------------
2 files changed, 58 insertions(+), 40 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 6b5e108..faee740 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -145,11 +145,6 @@ struct ipoib_sg_rx_buf {
u64 mapping[IPOIB_UD_RX_SG];
};
-struct ipoib_rx_buf {
- struct sk_buff *skb;
- u64 mapping;
-};
-
struct ipoib_tx_buf {
struct sk_buff *skb;
u64 mapping;
@@ -299,7 +294,7 @@ struct ipoib_dev_priv {
unsigned int admin_mtu;
unsigned int mcast_mtu;
- struct ipoib_rx_buf *rx_ring;
+ struct ipoib_sg_rx_buf *rx_ring;
spinlock_t tx_lock;
struct ipoib_tx_buf *tx_ring;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 9ca3d34..81a517b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -155,29 +155,25 @@ partial_error:
static int ipoib_ib_post_receive(struct net_device *dev, int id)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ib_sge list;
- struct ib_recv_wr param;
struct ib_recv_wr *bad_wr;
int ret;
- list.addr = priv->rx_ring[id].mapping;
- list.length = IPOIB_BUF_SIZE;
- list.lkey = priv->mr->lkey;
+ priv->rx_wr.wr_id = id | IPOIB_OP_RECV;
+ priv->rx_sge[0].addr = priv->rx_ring[id].mapping[0];
+ priv->rx_sge[1].addr = priv->rx_ring[id].mapping[1];
- param.next = NULL;
- param.wr_id = id | IPOIB_OP_RECV;
- param.sg_list = &list;
- param.num_sge = 1;
-
- ret = ib_post_recv(priv->qp, &param, &bad_wr);
+ ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
if (unlikely(ret)) {
+ if ...
I have fixed a bug found in the 4K MTU test. Here is the new patch. I am
running a stress test tonight.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 7 +--
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 93
+++++++++++++++++++-----------
2 files changed, 60 insertions(+), 40 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 6b5e108..faee740 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -145,11 +145,6 @@ struct ipoib_sg_rx_buf {
u64 mapping[IPOIB_UD_RX_SG];
};
-struct ipoib_rx_buf {
- struct sk_buff *skb;
- u64 mapping;
-};
-
struct ipoib_tx_buf {
struct sk_buff *skb;
u64 mapping;
@@ -299,7 +294,7 @@ struct ipoib_dev_priv {
unsigned int admin_mtu;
unsigned int mcast_mtu;
- struct ipoib_rx_buf *rx_ring;
+ struct ipoib_sg_rx_buf *rx_ring;
spinlock_t tx_lock;
struct ipoib_tx_buf *tx_ring;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 9ca3d34..dfb5cc2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -155,29 +155,25 @@ partial_error:
static int ipoib_ib_post_receive(struct net_device *dev, int id)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ib_sge list;
- struct ib_recv_wr param;
struct ib_recv_wr *bad_wr;
int ret;
- list.addr = priv->rx_ring[id].mapping;
- list.length = IPOIB_BUF_SIZE;
- list.lkey = priv->mr->lkey;
+ priv->rx_wr.wr_id = id | IPOIB_OP_RECV;
+ priv->rx_sge[0].addr = priv->rx_ring[id].mapping[0];
+ priv->rx_sge[1].addr = priv->rx_ring[id].mapping[1];
- param.next = NULL;
- param.wr_id = id | IPOIB_OP_RECV;
- param.sg_list = &list;
- param.num_sge = 1;
-
- ret = ib_post_recv(priv->qp, &param, &bad_wr);
+ ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
if (unlikely(ret)) {
+ if ...
Hi Shirley,
your patches cannot be applied cleanly. It seems that your email client wraps long lines. Can you please check whether this is the case?
Thanks Eli. Too bad :(. I have struggled with my email for a while. Let me send you an attachment file for the whole patch built against OFED-1.3-RC3 kernel here first. I will work on my email client tomorrow. I am too tired today. Let me know right away if there is any problem. Shirley
Go to sleep :) I'll get along with the wrapped lines. I am now reviewing your patches against Roland's tree. After that I'll look at the attachments.

-----Original Message-----
From: Shirley Ma [mailto:mashirle@us.ibm.com]
Sent: Sunday, 03 February 2008 02:22
To: Eli Cohen
Cc: Roland Dreier; general@lists.openfabrics.org
Subject: Re: [UPDATE] [V3] [PATCH 3/3] ib/ipoib: IPoIB-UD RX S/G support for 4K MTU

Thanks Eli. Too bad :(. I have struggled with my email for a while. Let me send you an attachment file for the whole patch built against OFED-1.3-RC3 kernel here first. I will work on my email client tomorrow. I am too tired today. Let me know right away if there is any problem.
Shirley
Thank you so much, Eli! I have done 4 different implementations in the last few days to make it possible for this to be included in OFED-1.3 as well as the distros. I am totally exhausted. If there are any issues, let me know. It's quite possible for me to make mistakes when working like this. I will run a stress test overnight on both Intel (mthca, 2K MTU) and ppc (ehca, 4K MTU).
Thanks
Shirley
Hi Shirley,
I have reviewed the patches against Roland's tree and have the following comments:
1. I see that there are a few if statements added on the fast path and I am concerned they might hurt performance of slow UDP messages. Unfortunately I have not been able to test with an SM defining the broadcast group to 4K MTU (currently opensm uses 2K).
2. The usage of ipoib_ud_skb_put_frags() seems to be redundant and will only hurt performance since you would never reuse anything from the old SKB. This is because the headlen is 40 bytes for GRH and the rest of the data is in the first (and only) fragment.
3. I think it would be better to allocate room for real data in the head of the SKB since the tcp/ip stack seems to have less overhead if the headers are in the linear data.
4. I would consider using a pre-allocated buffer for the GRH of all received data (not as part of the SKB).
general-bounces@lists.openfabrics.org wrote on 02/03/2008 09:07:29 AM:

> Hi Shirley,
>
> I have reviewed the patches against Roland's tree and have the following
> comments:

Appreciate your quick review.

> 1. I see that there are a few if statements added on the fast path and I
> am concerned they might hurt performance of slow UDP messages.
> Unfortunately I have not been able to test with an SM defining the
> broadcast group to 4K MTU (currently opensm uses 2K).

What kind of parameters do you prefer for me to test this patch with? I can test it right away when you send me your recommendations.

> 2. The usage of ipoib_ud_skb_put_frags() seems to be redundant and will
> only hurt performance since you would never reuse anything from the old
> SKB. This is because the headlen is 40 bytes for GRH and the rest of the
> data is in the first (and only) fragment.

The header is 44 bytes; the IP payload data is in the first fragment.

> 3. I think it would be better to allocate room for real data in the head
> of the SKB since the tcp/ip stack seems to have less ...

Comments 2, 3 and 4 can be combined into one if we use a pre-allocated buffer for GRH+IPoIB-head for all IP payload data, right? This is a performance enhancement, if any. I think this could be done after this patch is checked in, and I will fix it before RC4 is out. Do you agree?

Thanks
Shirley
Is your recommendation the same as Roland's earlier one? I hope it's not; otherwise it doesn't work, since the first buffer is GRH + IPoIB HEAD = 44 bytes, not 40 bytes. If we put all skb data in the first frag, then the IP header is not aligned to 16 bytes. I am copying Roland's comments regarding this approach:

---------
However, I now realize that my earlier idea of allocating a scratch
buffer for the GRH and just allocating a 4096 byte skb doesn't work,
because the skb_shinfo ends up being allocated along with the buffer,
so trying to allocate a 4096-byte skb will bloat the data past a
single page, which is what we're trying to avoid.

So how about the following? When using a UD MTU of 4096 with a page
size of 4096, allocate an skb of size 44 for the GRH and ethertype,
and then allocate a single page for the fragment list. This means
that the IP packet will start nicely 16-byte aligned for free, and
all the bookkeeping is very simple.
-------

thanks
Shirley
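The alignment argument above comes down to simple arithmetic, which the following userspace sketch spells out. It is only an illustration, not code from the patch: the names GRH_BYTES, ip_offset_in_page() and ip_is_16_byte_aligned() are hypothetical, and it assumes the page fragment itself starts on an address that is at least 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Header sizes quoted in the thread: 40-byte GRH, 4-byte IPoIB header. */
enum {
    GRH_BYTES       = 40,
    IPOIB_ENCAP_LEN = 4,
    IP_ALIGN_BYTES  = 16,
};

/*
 * Offset of the IP header inside the page fragment when the first
 * head_len bytes of the received packet land in the skb head and the
 * rest spills into the page. Hypothetical helper for illustration.
 */
static size_t ip_offset_in_page(size_t head_len)
{
    size_t hdr = GRH_BYTES + IPOIB_ENCAP_LEN;   /* 44 bytes in total */

    assert(head_len <= hdr);    /* the head holds headers only */
    return hdr - head_len;
}

/* Non-zero when the IP header starts on a 16-byte boundary of the page. */
static int ip_is_16_byte_aligned(size_t head_len)
{
    return ip_offset_in_page(head_len) % IP_ALIGN_BYTES == 0;
}
```

With a 40-byte head the 4-byte IPoIB header spills into the page and the IP header starts at offset 4; with a 44-byte head it starts at offset 0, which is the layout Roland proposed.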
What I am actually saying is: let's allocate, for example, 128 bytes in the linear data and then a 4K page. The first 128 bytes will be used for the GRH, for the encapsulation header, and for the IP and TCP/UDP headers. The following 4K fragment will have enough space to contain the rest of the packet.

Another thing to consider is to use a 3-entry receive scatter list:
1. The first entry will point to a generic 40-byte buffer (allocated once per netdevice). All receive buffers will point to this buffer. As Roland suggested before, this will save us the skb_pull on the GRH.
2. A 128-byte buffer which comes from the linear part of the SKB - we can align this buffer to ensure IP is aligned at a 16-byte boundary.
3. A 4K page in the first fragment. When the packet is received, we can then check whether the overall packet length is small enough that it did not touch the page. If it did not, we can use this page for the newly posted buffer.

** The 128-byte value above can be a macro, and we can determine what the correct value is.
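As a rough userspace sketch of the three-entry scatter list idea, the helper below splits a completed receive length across the three buffers in order and reports whether the page was touched. The names (rx_split, split_rx, page_is_reusable) and the fixed sizes are assumptions made for illustration; none of this is taken from the posted patches.

```c
#include <assert.h>
#include <stddef.h>

enum {
    GRH_BYTES  = 40,    /* entry 1: shared per-device GRH buffer */
    HEAD_BYTES = 128,   /* entry 2: linear part of the skb (tunable) */
    PAGE_BYTES = 4096,  /* entry 3: page fragment */
};

/* Number of bytes that landed in each scatter entry. */
struct rx_split {
    size_t grh;
    size_t head;
    size_t page;
};

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Receive data fills the scatter entries in order, so split byte_len
 * front to back across the three buffers. */
static struct rx_split split_rx(size_t byte_len)
{
    struct rx_split s;

    /* A receive can never exceed the posted buffers. */
    assert(byte_len <= (size_t)GRH_BYTES + HEAD_BYTES + PAGE_BYTES);

    s.grh = min_size(byte_len, GRH_BYTES);
    byte_len -= s.grh;
    s.head = min_size(byte_len, HEAD_BYTES);
    byte_len -= s.head;
    s.page = byte_len;
    return s;
}

/* The page can be reposted as-is when the packet never reached it. */
static int page_is_reusable(size_t byte_len)
{
    return split_rx(byte_len).page == 0;
}
```

For example, a 140-byte completion leaves the page untouched, while a full 4136-byte completion (40 + 4 + 4092) puts 3968 bytes into the page.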
Hello Eli,
Are you saying we should also do this for 2K MTU? Otherwise the if condition check can't be avoided. And I don't know how much performance we would gain from this approach.
Thanks
Shirley
Hi Shirley,
I think we can do it for 2K MTU as well and avoid all the if statements. But first let's get this into ofed 1.3 and then work on the changes. Unfortunately you'll have to rebuild your patches on top of the current ofed tree. Can you do it today?
Thanks Eli. I will pull the git tree and do it today. I will limit the need-S/G check to a single one in the fast path.
thanks
Shirley
Thanks.

-----Original Message-----
From: Hal Rosenstock [mailto:hrosenstock@xsigo.com]
Sent: Monday, 04 February 2008 19:39
To: Eli Cohen
Cc: Shirley Ma; Roland Dreier; general@lists.openfabrics.org; sashak@voltaire.com
Subject: Re: [ofa-general] RE: [UPDATE] [V3] [PATCH 3/3] ib/ipoib: IPoIB-UD RX S/G support for 4K MTU

Eli,
The default is 2K (mtu=4). You can get opensm to make it 4K if you want as follows:

/etc/ofa/opensm-partitions.conf:
Default=0x7fff,ipoib,mtu=5:ALL=full;
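For readers unfamiliar with the mtu=4 and mtu=5 values, they are the InfiniBand MTU codes, where code 1 is 256 bytes and each step doubles the size. A small sketch of the mapping follows; the helper name ib_mtu_code_to_bytes() is hypothetical and is not an opensm or kernel function.

```c
#include <assert.h>

/*
 * Convert an InfiniBand MTU code (1..5) to a size in bytes:
 * 1 -> 256, 2 -> 512, 3 -> 1024, 4 -> 2048, 5 -> 4096.
 * Hypothetical helper for illustration only.
 */
static int ib_mtu_code_to_bytes(int code)
{
    assert(code >= 1 && code <= 5);     /* only five codes are defined */
    return 128 << code;
}
```

So mtu=4 in the partition file corresponds to the 2K default and mtu=5 to the 4K MTU discussed in this thread.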
Hello Tziporet,

I have done 4 different approaches for the IPoIB-UD 4K MTU implementation. I have tested and validated three of them and I didn't see any performance difference among these implementations for either 2K MTU or 4K MTU. However, I picked the V3 patch since it is based on Eli's and Roland's review comments: keep the existing 2K MTU implementation, and don't merge IPoIB-UD RX S/G and IPoIB-CM RX S/G. It uses 2 buffers for 4K MTU when PAGE_SIZE is not big enough for 4K MTU + HEAD: one buffer is HEAD = GRH + IPoIB-head = 44 bytes, and one buffer is 4K for data.

I have tested and validated this patch with both the mthca driver on an Intel-based platform and the ehca driver on a ppc platform. The stress test has passed a whole night without any problem on the Intel-based platform for 2K MTU validation against the 2.6.24 kernel for the OFED-1.3-RC3 tree + Pradeep's noSRQ patch.

The attachment is the patch built against OFED-1.3-RC3. One line is needed for backporting to other kernels: ++dev->stats vs. ++priv->stats. Please review it for OFED-1.3 inclusion. If there are any issues, please let me know.

(See attached file: ipoib-4kmtu-rc3-2.6.24.patch)

Thanks
Shirley
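The one-buffer versus two-buffer decision described above can be sketched in userspace as follows. It mirrors the IPOIB_UD_MTU(), IPOIB_UD_BUF_SIZE() and ipoib_ud_need_sg() definitions quoted earlier in the thread, but the function names here and the explicit page_size parameter (in place of the kernel's PAGE_SIZE) are assumptions made so the example can stand alone.

```c
#include <assert.h>
#include <stddef.h>

enum {
    IB_GRH_BYTES    = 40,   /* GRH placed in front of every UD receive */
    IPOIB_ENCAP_LEN = 4,    /* IPoIB encapsulation header */
};

/* IPoIB link MTU for a given IB MTU (mirrors IPOIB_UD_MTU). */
static size_t ud_mtu(size_t ib_mtu)
{
    assert(ib_mtu > IPOIB_ENCAP_LEN);
    return ib_mtu - IPOIB_ENCAP_LEN;
}

/* Receive buffer size for a given IB MTU (mirrors IPOIB_UD_BUF_SIZE). */
static size_t ud_buf_size(size_t ib_mtu)
{
    return ib_mtu + IB_GRH_BYTES;
}

/* Number of receive buffers: two when the buffer does not fit in one
 * page (the S/G case), otherwise one. */
static int ud_rx_buffers(size_t ib_mtu, size_t page_size)
{
    return ud_buf_size(ib_mtu) > page_size ? 2 : 1;
}
```

With 4K pages, a 2K IB MTU needs a single 2088-byte buffer, while a 4K IB MTU needs 4136 bytes and therefore two buffers; with 64K pages a single buffer is enough in both cases.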
This is the updated patch. This patchset has been tested for both 2K MTU and
4K MTU. It fixes a typo in the 4K MTU code.
Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
drivers/infiniband/ulp/ipoib/ipoib.h | 7 +--
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 91
+++++++++++++++++++------------
2 files changed, 58 insertions(+), 40 deletions(-)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 6b5e108..faee740 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -145,11 +145,6 @@ struct ipoib_sg_rx_buf {
u64 mapping[IPOIB_UD_RX_SG];
};
-struct ipoib_rx_buf {
- struct sk_buff *skb;
- u64 mapping;
-};
-
struct ipoib_tx_buf {
struct sk_buff *skb;
u64 mapping;
@@ -299,7 +294,7 @@ struct ipoib_dev_priv {
unsigned int admin_mtu;
unsigned int mcast_mtu;
- struct ipoib_rx_buf *rx_ring;
+ struct ipoib_sg_rx_buf *rx_ring;
spinlock_t tx_lock;
struct ipoib_tx_buf *tx_ring;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 9ca3d34..81a517b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -155,29 +155,25 @@ partial_error:
static int ipoib_ib_post_receive(struct net_device *dev, int id)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ib_sge list;
- struct ib_recv_wr param;
struct ib_recv_wr *bad_wr;
int ret;
- list.addr = priv->rx_ring[id].mapping;
- list.length = IPOIB_BUF_SIZE;
- list.lkey = priv->mr->lkey;
+ priv->rx_wr.wr_id = id | IPOIB_OP_RECV;
+ priv->rx_sge[0].addr = priv->rx_ring[id].mapping[0];
+ priv->rx_sge[1].addr = priv->rx_ring[id].mapping[1];
- param.next = NULL;
- param.wr_id = id | IPOIB_OP_RECV;
- param.sg_list = &list;
- param.num_sge = 1;
-
- ret = ib_post_recv(priv->qp, &param, &bad_wr);
+ ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
if (unlikely(ret)) {
+ if ...
Just to make sure, this patch is a candidate for upstream inclusion (which you also want to be present in ofed 1.3) and hence is based against Roland's tree, correct?
Or
Yes. I forgot to mention these patches are created against Roland's 2.6.25 tree.
Thanks
Shirley
I see, but I want to make sure whether these patches are the ones you want to merge into the kernel, or whether this is more of a work in progress which you want to be included in this experimental testbed called ofed.
If it is a candidate for upstream inclusion, I find it hard to review since there is no per-patch change-log.
Or.
Hello Or,
I will create the patch for OFED-1.3-RC3 separately. I wouldn't call it experimental code since these APIs have been tested along with IPoIB-CM.
Thanks for the advice, I thought one change log was enough. If not, I will resubmit these patches along with a one-line change-log for each.
Thanks
Shirley
I meant to say that ofed is an experimental testbed; this is becoming more and more clear to more and more people. I did not address your

If you think that a one-line change log for each patch is enough for a reviewer, let it be, but if I were you, I would validate this assumption again. The thing is that when you send an RFC, most of the documentation is often in the virtual 0/N patch, but remember that this documentation does not go into the git change-log. So in your case, since you want this to be merged, you have to work harder and document both in the 0/N and also in the 1/N, 2/N ... N/N postings.
Or
That's a good suggestion. It seems ipoib_cm.c has been changed in the past few hours. I am having trouble applying the patches. I am cleaning my local tree and redoing all patches with change-logs.
thanks
Shirley
Hi Shirley,
Just to make sure, can you confirm that this patch set is not dependent on the below patch, which is part of ofed but was never submitted to the upstream ipoib driver for inclusion?
Also, can you share which SM you checked this with? Did you have to patch it or run it with non-default parameters? Moreover, what was the configuration: specifically, what switch was used, and what instrumentation did you make to the switch FW? Thanks.
No, this patchset is not dependent on any OFED patches. It's a pure patch set for the 2.6.25 kernel. I have another version of this patchset which is built against OFED-1.3-RC2. I will update it to OFED-1.3-RC3.

I hope I can get a quick ack for this patchset from the maintainers to agree with this approach. I can see around 1.5-2 times better performance when using 4K MTU for IPoIB-UD. I will resubmit this patchset tomorrow. You should wait for the new patchset since I have found some

One of the reasons this patchset could not be submitted earlier was the switch (SW) support. I couldn't do a full test without a SW that supports 4K MTU. The SW firmware needs to be updated to allow the IPoIB broadcast group to be created with a 4096 MTU size. There are two requirements on the switch from the SW perspective:
1. SW ports can be configured to a 4096 MTU size.
2. The SW default IPoIB broadcast group can be configured to a 4096 MTU size.

The default IPoIB broadcast group MTU can't exceed the SW ports' MTU size. The way to enable IPoIB 4K MTU is:
1. Set the SW ports to 4K MTU.
2. Set the SM default IPoIB broadcast group MTU size to 4K.

You can disable or enable the IPoIB broadcast group when starting the SM. If you don't enable the IPoIB default broadcast group when starting the SM, the first node in the subnet to come up will create a broadcast group with 2K MTU for this subnet. This makes sense since the node doesn't know the link MTU size of the whole subnet, so it's better to create a default 2K MTU group. If you enable the IPoIB default broadcast group when starting the SM and its MTU size is 2K, then all nodes in the cluster can join the subnet and the IPoIB subnet link MTU size will be set to 2K. If the broadcast group MTU size is 4K, then only nodes with 4K MTU can join this IPoIB subnet.

I am not sure that's what you are looking for. Let me know if anything is unclear.
thanks
Shirley
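The broadcast group rules described above can be summarized in a short userspace sketch. It is only an illustration of the stated rules, with MTU values in bytes; the names IPOIB_DEFAULT_GROUP_MTU, broadcast_group_mtu() and node_can_join() are hypothetical and do not come from the driver or from opensm.

```c
#include <assert.h>

/* Hypothetical constant: MTU of the group that the first node creates
 * when the SM has no default broadcast group (2K per the description). */
enum { IPOIB_DEFAULT_GROUP_MTU = 2048 };

/*
 * MTU of the IPoIB broadcast group for a subnet.
 * If the SM was started with a default broadcast group, that group's MTU
 * is used and must not exceed the switch port MTU; otherwise the first
 * node that comes up creates a 2K group.
 */
static int broadcast_group_mtu(int sm_default_group_enabled,
                               int sm_group_mtu, int switch_port_mtu)
{
    if (!sm_default_group_enabled)
        return IPOIB_DEFAULT_GROUP_MTU;

    assert(sm_group_mtu <= switch_port_mtu);
    return sm_group_mtu;
}

/* A node can join only if its IB MTU is at least the group MTU. */
static int node_can_join(int node_ib_mtu, int group_mtu)
{
    return node_ib_mtu >= group_mtu;
}
```

Under these rules a 2K node is locked out of a 4K group, while a 4K node can still join a 2K group and simply runs with the smaller link MTU.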
Hi Shirley,
my comments are:
1. The first patch (1/3) is malformed. I suggest you try to apply it
before sending.
2. Make sure they compile before submitting - patch 1/3 for example
changes ipoib_rx_buf
struct ipoib_rx_buf {
struct sk_buff *skb;
u64 mapping;
+ u64 mapping[IPOIB_CM_RX_SG];
};
but does not change code in the UD flow to align with these changes.
3. Please put an explanation in the changelog.
Thanks Eli, I found it last night too. I made a mistake: my local tree was somehow not clean. I have been working too much over the past few days on OFED-1.3 validation, and when you are tired it's easy to make mistakes. Sorry about that. I have already sent out an email about checking today's patch.

Somehow I didn't receive your email in time. It looks like some of my emails got a warning saying that the email needs to be approved since it matches spam content. Do you have any idea why it's blocked?

Thanks
Shirley
Maybe the list administrator can help with this issue.
Hi. Every once in a while, perfectly normal mail trips the spam filter. If this is really a problem, I can look into it. However, training the filter is kind of a black art so I'm not sure how successful I'll be.
Hello Roland,

I finally cleaned up my local git tree and compiled this patchset against the 2.6.25 kernel. The patchset has been split into three patches; each patch can be built separately, in sequence, so it's easy to test.

1/3. Make IPoIB-CM RX S/G APIs generic
2/3. Set IPoIB-UD RX S/G parameters
3/3. Enable IPoIB-UD RX S/G

Please review these patches as soon as possible so we can meet the OFED-1.3-RC4 schedule. I appreciate your timely help.

The current IPoIB-UD implementation limits the IPoIB payload size to 2048 by hard-coding IPOIB_PACKET_SIZE. The implementation is designed for a kernel PAGE_SIZE equal to or greater than 4K. If the kernel PAGE_SIZE is equal to 2K, buffer allocation will fail when large memory buffers are scarce. However, most distros support PAGE_SIZE >= 4K, so this implementation has no problem with a 2048 payload. The implementation is simple, but it prevents HCA devices that support a 4096 payload, like IBM eHCA2, from performing.

This patch allows an IPoIB-UD MTU of up to 4092 (4K - IPOIB_ENCAP_LEN) when the HCA supports 4K MTU. In this patch, the APIs for S/G buffer allocation in IPoIB-CM mode have been made generic so IPoIB-UD and IPoIB-CM can share the S/G code. When PAGE_SIZE is equal to or greater than IPOIB_UD_BUF_SIZE plus the padding bytes that align the IP header, only one buffer is needed for a 4K MTU buffer allocation; otherwise, two buffers are needed in the S/G list.

The node IPoIB link MTU size is the minimum of the admin-configurable MTU (set through ifconfig) and the IPoIB default broadcast group MTU size. When the Subnet Manager enables the default broadcast group during startup, this subnet's IPoIB link MTU will be the default broadcast group MTU size. A node whose IB MTU is smaller than this value can't join this IPoIB subnet. A node whose IB MTU is greater than this value will join this IPoIB subnet, and this value will be set as its IPoIB link MTU. If Subnet Manager disables default broadcast group during start up, ...
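Editorial sketch: the MTU rules described in the message above can be written down as two small helpers. This is only an illustration, not code from the patchset. The function names are hypothetical, the value 4 for IPOIB_ENCAP_LEN is taken from the "4092 (4K - IPOIB_ENCAP_LEN)" figure, and two points are assumptions: that a node whose IB MTU equals the group MTU can join, and that the encapsulation header is subtracted from the group MTU before taking the minimum.

```c
#include <assert.h>
#include <stdbool.h>

/* IPoIB encapsulation header length, per "4092 (4K - IPOIB_ENCAP_LEN)". */
#define IPOIB_ENCAP_LEN 4u

/*
 * Hypothetical helper: a node can join the IPoIB subnet only if its IB MTU
 * is not smaller than the broadcast group MTU. Treating "equal" as able to
 * join is an assumption; the message only covers smaller and greater.
 */
bool node_can_join(unsigned int node_ib_mtu, unsigned int group_ib_mtu)
{
    return node_ib_mtu >= group_ib_mtu;
}

/*
 * Hypothetical helper: the IPoIB link MTU is the minimum of the
 * admin-configured MTU and the broadcast group MTU. Subtracting the
 * encapsulation header first is an assumption made for this sketch.
 */
unsigned int ipoib_link_mtu(unsigned int admin_mtu, unsigned int group_ib_mtu)
{
    unsigned int group_mtu;

    assert(group_ib_mtu > IPOIB_ENCAP_LEN);
    group_mtu = group_ib_mtu - IPOIB_ENCAP_LEN;
    return admin_mtu < group_mtu ? admin_mtu : group_mtu;
}
```

Under these assumptions, a 2K node is refused by a 4K broadcast group, while a 4K node joining a 2K group ends up with the group's smaller link MTU.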
> The current IPoIB-UD implementation is limited IPoIB payload size to
> 2048 through hard coding IPOIB_PACKET_SIZE. The implementation is
> designed for kernel PAGE_SIZE equals or greater than 4K. If the kernel
> PAGE_SIZE is equals to 2K, memory buffer allocation will be failure when
> lack of large buffer of memory. However most of the Distros does support
> PAGE_SIZE >= 4K. So this implementation has no problem for 2048 payload.
> This implementation is simple but it prevents HCA device who does
> support 4096 payload from performing, like IBM eHCA2.

Not sure I understand this. Is there any possible configuration of
any architecture where Linux runs where PAGE_SIZE < 4096?

> This patch allows IPoIB-UD MTU up to 4092 (4K - IPOIB_ENCAP_LEN) when
> HCA can support 4K MTU. In this patch, APIs for S/G buffer allocation in
> IPoIB-CM mode has been made generic so IPoIB-UD and IPoIB-CM can share
> the S/G code.

This approach seems overly complex to me, since it ends up going
through all the CM buffer fragment bookkeeping for the simple UD path.

However, I now realize that my earlier idea of allocating a scratch
buffer for the GRH and just allocating a 4096 byte skb doesn't work,
because the skb_shinfo ends up being allocated along with the buffer,
so trying to allocate a 4096-byte skb will bloat the data past a
single page, which is what we're trying to avoid.

So how about the following? When using a UD MTU of 4096 with a page
size of 4096, allocate an skb of size 44 for the GRH and ethertype,
and then allocate a single page for the fragment list. This means
that the IP packet will start nicely 16-byte aligned for free, and all
the bookkeeping is very simple.

 - R.
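Editorial sketch: the arithmetic behind the proposal above, as a rough illustration only. The 44-byte head is the 40-byte IB GRH plus the 4-byte IPoIB encapsulation header (the "ethertype"). The struct and function names are hypothetical and nothing here is taken from the actual patches. With a 4096-byte IB MTU, up to 4096 + 40 = 4136 bytes arrive, so after the 44-byte head at most 4092 bytes of IP packet remain, and they begin at the start of the page fragment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IB_GRH_BYTES    40u  /* InfiniBand Global Route Header */
#define IPOIB_ENCAP_LEN  4u  /* IPoIB encapsulation header (ethertype + reserved) */
#define UD_HEAD_BYTES   (IB_GRH_BYTES + IPOIB_ENCAP_LEN)  /* the 44-byte skb head */

/* Hypothetical description of how one received UD message is split. */
struct ud_rx_split {
    size_t head_len;  /* bytes placed in the small linear skb */
    size_t frag_len;  /* bytes placed in the page fragment (the IP packet) */
};

/* Split a received message (GRH included) into the head and the page fragment. */
struct ud_rx_split ud_rx_split_len(size_t wire_len)
{
    struct ud_rx_split s;

    assert(wire_len >= UD_HEAD_BYTES);
    s.head_len = UD_HEAD_BYTES;
    s.frag_len = wire_len - UD_HEAD_BYTES;
    return s;
}

/* The scheme only works if the IP packet fits in a single page. */
bool ud_frag_fits_page(size_t frag_len, size_t page_size)
{
    return frag_len <= page_size;
}
```

Because the IP packet starts at offset 0 of a freshly allocated page, it is page aligned and therefore also 16-byte aligned, which is the "for free" alignment mentioned above.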
Hello, Roland,

Thanks for your quick review.

> Not sure I understand this.  Is there any possible configuration of
> any architecture where Linux runs where PAGE_SIZE < 4096?

Technically it's a problem; practically it's not, since there is no architecture I can think of that has PAGE_SIZE < 4096.

> > This patch allows IPoIB-UD MTU up to 4092 (4K - IPOIB_ENCAP_LEN) when
> > HCA can support 4K MTU. In this patch, APIs for S/G buffer allocation in
> > IPoIB-CM mode has been made generic so IPoIB-UD and IPoIB-CM can share
> > the S/G code.
>
> This approach seems overly complex to me, since it ends up going
> through all the CM buffer fragment bookkeeping for the simple UD path.

No, it's not complex: only one buffer is allocated if the page size is big enough and if the MTU is 2K.

> So how about the following?  When using a UD MTU of 4096 with a page
> size of 4096, allocate an skb of size 44 for the GRH and ethertype,
> and then allocate a single page for the fragment list.  This means
> that the IP packet will start nicely 16-byte aligned for free, and all
> the bookkeeping is very simple.

It has a 44-byte head plus another 4K page, without an if-condition check of MTU size and page size.

Please look at the patches for details.

Thanks
Shirley
Hello Roland,

> So how about the following?  When using a UD MTU of 4096 with a page
> size of 4096, allocate an skb of size 44 for the GRH and ethertype,
> and then allocate a single page for the fragment list.  This means
> that the IP packet will start nicely 16-byte aligned for free, and all
> the bookkeeping is very simple.

My patch has passed the stress test on both PPC and Intel architectures against the OFED-1.3-RC2 bits for a couple of days, and I didn't see a performance impact for 2K MTU. But I thought about your suggestion again last night. I can modify my patch to match your idea by keeping the current implementation for 2K MTU and using an if-condition check for the new code. I will submit a new version of the patchset today for review. Since I only have two days for my patch to be integrated into OFED-1.3-RC3 for the distros to pick up, I would like to see your ack for this approach as soon as possible. I will compare the performance of the two implementations.

Thanks for your input. I appreciate your prompt response.

Thanks
Shirley
> My patch has been passed the stress test for both PPC and Intel
> architechture against OFED-1.3-RC2 bit for a couple days. And I didn't see
> performance imapct for 2K mtu. But I rethink about your suggestion here
> yesterday night. I can modify my patch to meet your thoughts here by
> keeping current implementation of 2K mtu and using if condition check for
> the new code. I will submit a new version of patchset today for review.
> Since I only have two days for my patch to be integred into OFED-1.3-RC3
> for Distros to pick up. I would like to see your ack here for this approach
> as soon as possible. I will compare two different implementation's
> performance.

Sorry, I've kind of lost the plot here with so many versions of the
patches flying around.

In any case this is not something I am going to pick up for 2.6.25.
I don't have any control over OFED or distros, although I would
probably hold off on adding a feature at this late stage of the
release process; but the OFED maintainers don't seem to be as
conservative as I am.

 - R.
Thanks Roland. We can review this patch for upstream later. Eli has reviewed it. This patch is going into OFED-1.3. I am now testing the current OFED-1.3 git tree plus this patch. Everything seems to have worked well for a few hours. I will leave the test running overnight and check for any issues tomorrow.

Thanks
Shirley
Same goes for me on both points. I was totally lost among all the posts you have made, which prevents me from reviewing the patches. Also, the integration into OFED of patches which were --not reviewed-- (nor accepted) for upstream inclusion is totally unclear to me. To remove doubt: this policy has been present in OFED from day one, so it's not specific to your patches.

Or.
Hello Roland,

Please review the approach below as early as you can. Thanks.

This patch is based on Eli's and Roland's input. The idea is to keep the current IPoIB-UD 2K MTU implementation and to allow an IPoIB-UD link MTU of up to 4092 (4K - IPOIB_ENCAP_LEN) when the HCAs support 4K MTU. For IPoIB-UD 4K MTU, if PAGE_SIZE is greater than IB MTU + GRH head + 4, then no S/G is needed and the IPoIB-UD 2K MTU implementation is used; if PAGE_SIZE is smaller, then two buffers need to be used. One of the IPoIB-CM RX S/G APIs has been made more generic so it can be reused.

This patchset includes three patches:
1. Make one IPoIB-CM RX S/G API generic.
2. Set up IPoIB-UD RX S/G ready.
3. Enable IPoIB-UD RX S/G when needed.

Shirley
Hello Roland,

This patchset is based on your previous review comments. It uses the current IPoIB-UD 2K MTU implementation when 4K MTU + GRH head + 4 is less than PAGE_SIZE; if it's greater, it allocates two buffers: one for the GRH + IPoIB head, and one for the data.

Please compare this approach with the V2 patchset and provide feedback as soon as you can, so I can concentrate on testing and backport the one we agree on to OFED-1.3 RC3.

Thanks
Shirley
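Editorial sketch: the buffer decision described in this message, as a minimal illustration under stated assumptions rather than the patch itself. The function name is hypothetical. The extra 4 bytes are taken from the "+ 4" in the message and are assumed to be alignment padding. Treating "exactly equal to PAGE_SIZE" as still fitting in one buffer is also an assumption, since the message only covers the "less than" and "greater" cases.

```c
#include <assert.h>
#include <stddef.h>

#define IB_GRH_BYTES  40u  /* InfiniBand Global Route Header */
#define UD_RX_PAD      4u  /* the "+ 4" from the message; assumed to be padding */

/*
 * Hypothetical helper: how many receive buffers a UD receive needs.
 * Returns 1 when IB MTU + GRH + 4 fits in one page (the existing 2K-style
 * path), otherwise 2 (one buffer for GRH + IPoIB head, one for the data).
 */
unsigned int ud_rx_buffers_needed(size_t page_size, size_t ib_mtu)
{
    size_t need = ib_mtu + IB_GRH_BYTES + UD_RX_PAD;

    assert(ib_mtu > 0);
    return need <= page_size ? 1u : 2u;
}
```

With 4K pages, a 2K MTU stays on the single-buffer path and a 4K MTU needs two buffers; with 64K pages, as found on some PPC configurations, a 4K MTU still fits in one buffer.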
