[PATCH 1/4] [NET_SCHED] explict hold dev tx lock

Previous thread: [RESEND] Implement batching skb API and support in IPoIB by Krishna Kumar2 on Friday, September 14, 2007 - 1:52 am. (1 message)

Next thread: [PATCH] Cleanup calling netdev notifiers by Pavel Emelyanov on Friday, September 14, 2007 - 3:34 am. (2 messages)
From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:00 am

This set of patches implements the batching xmit capability, and adds support
for batching in IPoIB and E1000 (E1000 driver changes is ported, thanks to
changes taken from Jamal's code from an old kernel).

List of changes from previous revision:
----------------------------------------
1. [Dave] Enable batching as default (change in register_netdev).
2. [Randy] Update documentation (however ethtool cmd to get/set batching is
	not implemented, hence I am guessing the usage).
3. [KK] When changing tx_batch_skb, qdisc xmits need to be blocked since
	qdisc_restart() drops queue_lock before calling driver xmit, and
	driver could find blist change under it.
4. [KK] sched: requeue could wrongly requeue skb already put in the batching
	list (in case a single skb was sent to the device but not sent as the
	device was full, resulting in the skb getting added to blist). This
	also results in slight optimization of batching behavior where for
	getting skbs #2 onwards don't require to check for gso_skb as that
	is the first skb that is processed.
4. [KK] Change documentation to explain this behavior.
5. [KK] sched: Fix panic when GSO is enabled in driver.
6. [KK] IPoIB: Small optimization in ipoib_ib_handle_tx_wc
7. [KK] netdevice: Needed to change NETIF_F_GSO_SHIFT/NETIF_F_GSO_MASK as
	BATCH_SKBS is now defined as 65536 (earlier it was using 8192 which
	was taken up by NETIF_F_NETNS_LOCAL).


Will submit in the next 1-2 days:
---------------------------------
1. [Auke] Enable batching in e1000e.


Extras that I can do later:
---------------------------
1. [Patrick] Use skb_blist statically in netdevice. This could also be used
	to integrate GSO and batching.
2. [Evgeniy] Useful to splice lists dev_add_skb_to_blist (and this can be
	done for regular xmit's of GSO skbs too for #1 above).

Patches are described as:
		 Mail 0/10:  This mail
		 Mail 1/10:  HOWTO documentation
		 Mail 2/10:  Introduce skb_blist, NETIF_F_BATCH_SKBS, use
		 	     single API for batching/no-batching, ...
From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:01 am

Add Documentation describing batching skb xmit capability.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 batching_skb_xmit.txt |  107 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 107 insertions(+)

diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt
--- org/Documentation/networking/batching_skb_xmit.txt	1970-01-01 05:30:00.000000000 +0530
+++ new/Documentation/networking/batching_skb_xmit.txt	2007-09-14 10:25:36.000000000 +0530
@@ -0,0 +1,107 @@
+		 HOWTO for batching skb xmit support
+		 -----------------------------------
+
+Section 1: What is batching skb xmit
+Section 2: How batching xmit works vs the regular xmit
+Section 3: How drivers can support batching
+Section 4: Nitty gritty details for drivers
+Section 5: How users can work with batching
+
+
+Introduction: Kernel support for batching skb
+----------------------------------------------
+
+A new capability to support xmit of multiple skbs is provided in the netdevice
+layer. Drivers which enable this capability should be able to process multiple
+skbs in a single call to their xmit handler.
+
+
+Section 1: What is batching skb xmit
+-------------------------------------
+
+	This capability is optionally enabled by a driver by setting the
+	NETIF_F_BATCH_SKBS bit in dev->features. The prerequisite for a
+	driver to use this capability is that it should have a reasonably-
+	sized hardware queue that can process multiple skbs.
+
+
+Section 2: How batching xmit works vs the regular xmit
+-------------------------------------------------------
+
+	The network stack gets called from upper layer protocols with a single
+	skb to transmit. This skb is first enqueued and an attempt is made to
+	transmit it immediately (via qdisc_run). However, events like tx lock
+	contention, tx queue stopped, etc., can result in the skb not getting
+	sent out and it remains in the queue. When the next xmit is called or
+	when the ...
From: Randy Dunlap
Date: Friday, September 14, 2007 - 11:37 am

and then
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>

Thanks,
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

From: Krishna Kumar2
Date: Sunday, September 16, 2007 - 9:10 pm

Hi Randy,


Since this was a new section that I added to the documentation, this error
creeped up. Thanks for catching it, and review comments/ack-off :)

thanks,

- KK

-

From: Jeff Garzik
Date: Sunday, September 16, 2007 - 9:13 pm

Please remove me from the CC list.

I get this via netdev, and not having said a single thing in the thread, 
I don't feel the need to be CC'd on every email.

The CC list is pretty massive as it is, anyway.

	Jeff



-

From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:01 am

Introduce skb_blist, NETIF_F_BATCH_SKBS, use single API for
batching/no-batching, etc.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/netdevice.h |    8 ++++++--
 net/core/dev.c            |   29 ++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 5 deletions(-)

diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h	2007-09-14 10:26:21.000000000 +0530
@@ -439,10 +439,11 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BATCH_SKBS	65536	/* Driver supports multiple skbs/xmit */
 
 	/* Segmentation offload features */
-#define NETIF_F_GSO_SHIFT	16
-#define NETIF_F_GSO_MASK	0xffff0000
+#define NETIF_F_GSO_SHIFT	17
+#define NETIF_F_GSO_MASK	0xfffe0000
 #define NETIF_F_TSO		(SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
 #define NETIF_F_UFO		(SKB_GSO_UDP << NETIF_F_GSO_SHIFT)
 #define NETIF_F_GSO_ROBUST	(SKB_GSO_DODGY << NETIF_F_GSO_SHIFT)
@@ -548,6 +549,9 @@ struct net_device
 	/* Partially transmitted GSO packet. */
 	struct sk_buff		*gso_skb;
 
+	/* List of batch skbs (optional, used if driver supports skb batching */
+	struct sk_buff_head	*skb_blist;
+
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2007-09-14 10:24:27.000000000 +0530
+++ new/net/core/dev.c	2007-09-14 10:25:36.000000000 +0530
@@ -953,6 +953,16 @@ void netdev_state_change(struct net_devi
 	}
 }
 
+static void free_batching(struct net_device *dev)
+{
+	if (dev->skb_blist) {
+		if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+		kfree(dev->skb_blist);
+		dev->skb_blist = NULL;
+	}
+}
+
 /**
  *	dev_load 	- load a network ...
From: Evgeniy Polyakov
Date: Friday, September 14, 2007 - 5:46 am

A nitpick is that you should use sizeof(struct ...) and I think it

-- 
	Evgeniy Polyakov
-

From: Krishna Kumar2
Date: Sunday, September 16, 2007 - 8:51 pm

Hi Evgeniy,


I thought it is better to use *var name in case the name of the structure
changes. Also, the flag is not cleared since I could try to enable batching
later, and it could succeed at that time. When skb_blist is allocated, then
batching is enabled otherwise it is disabled (while features flag just
indicates that driver supports batching).

Thanks,

- KK

-

From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:01 am

Modify qdisc_run() to support batching. Modify callers of qdisc_run to
use batching, modify qdisc_restart to implement batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/netdevice.h |    2 
 include/net/pkt_sched.h   |   17 +++++--
 net/core/dev.c            |   45 ++++++++++++++++++
 net/sched/sch_generic.c   |  109 ++++++++++++++++++++++++++++++++++++----------
 4 files changed, 145 insertions(+), 28 deletions(-)

diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/net/pkt_sched.h	2007-09-14 10:25:36.000000000 +0530
@@ -80,13 +80,24 @@ extern struct qdisc_rate_table *qdisc_ge
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
-extern void __qdisc_run(struct net_device *dev);
+static inline void qdisc_block(struct net_device *dev)
+{
+	while (test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
+		yield();
+}
+
+static inline void qdisc_unblock(struct net_device *dev)
+{
+	clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
+}
+
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
 
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	if (!netif_queue_stopped(dev) &&
 	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		__qdisc_run(dev);
+		__qdisc_run(dev, blist);
 }
 
 extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h	2007-09-14 10:26:21.000000000 +0530
@@ -1013,6 +1013,8 @@ extern int		dev_set_mac_address(struct n
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct ...
From: Evgeniy Polyakov
Date: Friday, September 14, 2007 - 5:15 am

Hi Krishna.


Could it be list_move()-like function for skb lists?
I'm pretty sure if you change first and the last skbs and ke of the
queue in one shot, result will be the same.


-- 
	Evgeniy Polyakov
-

From: Krishna Kumar2
Date: Sunday, September 16, 2007 - 8:49 pm

Hi Evgeniy,


I have to do a bit more like update count, etc, but I agree it is do-able.
I had mentioned in my PATCH 0/10 that I will later try this suggestion

It depends, eg when the tx lock is not got, I get batching of upto 8-10
skbs (assuming that tx lock was not got quite a few times). But when the
queue gets blocked, I have seen batching upto 4K skbs (if tx_queue_len

This is the gso code which has segmented 'skb' to skb'1-n', and those
skb'1-n' are sent out and freed by driver, which means the dummy 'skb'
(without any data) remains to be freed.

Thanks,

- KK

-

From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:02 am

Add ethtool support to enable/disable batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 include/linux/ethtool.h   |    2 ++
 include/linux/netdevice.h |    2 ++
 net/core/dev.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/ethtool.c        |   27 +++++++++++++++++++++++++++
 4 files changed, 75 insertions(+)

diff -ruNp org/include/linux/ethtool.h new/include/linux/ethtool.h
--- org/include/linux/ethtool.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/ethtool.h	2007-09-14 10:25:36.000000000 +0530
@@ -440,6 +440,8 @@ struct ethtool_ops {
 #define ETHTOOL_SFLAGS		0x00000026 /* Set flags bitmap(ethtool_value) */
 #define ETHTOOL_GPFLAGS		0x00000027 /* Get driver-private flags bitmap */
 #define ETHTOOL_SPFLAGS		0x00000028 /* Set driver-private flags bitmap */
+#define ETHTOOL_GBATCH		0x00000029 /* Get Batching (ethtool_value) */
+#define ETHTOOL_SBATCH		0x00000030 /* Set Batching (ethtool_value) */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h	2007-09-14 10:26:21.000000000 +0530
@@ -1331,6 +1331,8 @@ extern void		dev_set_promiscuity(struct 
 extern void		dev_set_allmulti(struct net_device *dev, int inc);
 extern void		netdev_state_change(struct net_device *dev);
 extern void		netdev_features_change(struct net_device *dev);
+extern int		dev_change_tx_batch_skb(struct net_device *dev,
+						unsigned long new_batch_skb);
 /* Load a device via the kmod */
 extern void		dev_load(struct net *net, const char *name);
 extern void		dev_mcast_init(void);
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c	2007-09-14 10:24:27.000000000 +0530
+++ new/net/core/dev.c	2007-09-14 10:25:36.000000000 +0530
@@ -963,6 +963,50 @@ void free_batching(struct net_dev
 	}
 }
 
+int dev_change_tx_batch_skb(struct ...
From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:02 am

IPoIB header file changes to use batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib.h |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h new/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib.h	2007-09-14 10:25:36.000000000 +0530
@@ -271,8 +271,8 @@ struct ipoib_dev_priv {
 	struct ipoib_tx_buf *tx_ring;
 	unsigned             tx_head;
 	unsigned             tx_tail;
-	struct ib_sge        tx_sge;
-	struct ib_send_wr    tx_wr;
+	struct ib_sge        *tx_sge;
+	struct ib_send_wr    *tx_wr;
 
 	struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -367,8 +367,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+		      u32 qpn, int wr_num);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn);
+		struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_flush_paths(struct net_device *dev);
-

From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:03 am

IPoIB CM & Multicast changes based on header file changes.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_cm.c        |   13 +++++++++----
 ipoib_multicast.c |    4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c new/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-09-14 10:25:36.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr             = addr;
-	priv->tx_sge.length           = len;
+	priv->tx_sge[0].addr          = addr;
+	priv->tx_sge[0].length        = len;
+
+	priv->tx_wr[0].wr_id 	      = wr_id;
 
-	priv->tx_wr.wr_id 	      = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-09-14 10:25:36.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, ...
From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:03 am

IPoIB verb changes to use batching.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_verbs.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-09-14 10:25:36.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
 		.qp_type     = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	priv->tx_sge.lkey 	= priv->mr->lkey;
-
-	priv->tx_wr.opcode 	= IB_WR_SEND;
-	priv->tx_wr.sg_list 	= &priv->tx_sge;
-	priv->tx_wr.num_sge 	= 1;
-	priv->tx_wr.send_flags 	= IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey		= priv->mr->lkey;
+		priv->tx_wr[i].opcode		= IB_WR_SEND;
+		priv->tx_wr[i].sg_list		= &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge		= 1;
+		priv->tx_wr[i].send_flags	= 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next		= next_wr;
+		next_wr				= &priv->tx_wr[i];
+	}
 
 	return 0;
 
-

From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:03 am

IPoIB internal post and work completion handler changes.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_ib.c |  212 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 168 insertions(+), 44 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-09-14 10:25:36.000000000 +0530
@@ -242,6 +242,8 @@ repost:
 static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int num_completions, to_process;
+	unsigned int tx_ring_index;
 	unsigned int wr_id = wc->wr_id;
 	struct ipoib_tx_buf *tx_req;
 	unsigned long flags;
@@ -255,18 +257,51 @@ static void ipoib_ib_handle_tx_wc(struct
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	/* Get first WC to process (no one can update tx_tail at this time) */
+	tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
 
-	ib_dma_unmap_single(priv->ca, tx_req->mapping,
-			    tx_req->skb->len, DMA_TO_DEVICE);
+	/* Find number of WC's to process */
+	num_completions = wr_id - tx_ring_index + 1;
+	if (unlikely(num_completions <= 0))
+		num_completions += ipoib_sendq_size;
+	to_process = num_completions;
 
-	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
+	/*
+	 * Handle WC's from earlier (possibly multiple) post_sends in this
+	 * iteration as we move from tx_tail to wr_id, since if the last WR
+	 * (which is the one which requested completion notification) failed
+	 * to be sent for any of those earlier request(s), no completion
+	 * notification is generated for successful WR's of those earlier
+	 * request(s). Use a infinite loop to handle the regular case of
+	 * one skb processing faster.
+	 */
+	tx_req = &priv->tx_ring[tx_ring_index];
+	while (1) {
+		if (likely(tx_req->skb)) ...
From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:04 am

IPoIB: implement the new batching API.

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 ipoib_main.c |  248 +++++++++++++++++++++++++++++++++++++++--------------------
 1 files changed, 168 insertions(+), 80 deletions(-)

diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c new/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-09-14 10:25:36.000000000 +0530
@@ -563,7 +563,8 @@ static void neigh_add_path(struct sk_buf
 				goto err_drop;
 			}
 		} else
-			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, path->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 	} else {
 		neigh->ah  = NULL;
 
@@ -643,7 +644,7 @@ static void unicast_arp_send(struct sk_b
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
@@ -657,105 +658,163 @@ static void unicast_arp_send(struct sk_b
 	spin_unlock(&priv->lock);
 }
 
+#define	XMIT_PROCESSED_SKBS()						\
+	do {								\
+		if (wr_num) {						\
+			ipoib_send(dev, NULL, old_neigh->ah, old_qpn,	\
+				   wr_num);				\
+			wr_num = 0;					\
+		}							\
+	} while (0)
+
 static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ipoib_neigh *neigh;
+	struct sk_buff_head *blist;
+	int max_skbs, wr_num = 0;
+	u32 qpn, old_qpn = 0;
+	struct ipoib_neigh *neigh, *old_neigh = NULL;
 	unsigned long flags;
 
 	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
 		return NETDEV_TX_LOCKED;
 
-	/*
-	 * Check if our queue is stopped.  ...
From: Krishna Kumar
Date: Friday, September 14, 2007 - 2:04 am

E1000: Implement batching capability (ported thanks to changes taken from
	Jamal).

Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
 e1000_main.c |  104 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 75 insertions(+), 29 deletions(-)

diff -ruNp org/drivers/net/e1000/e1000_main.c new/drivers/net/e1000/e1000_main.c
--- org/drivers/net/e1000/e1000_main.c	2007-09-14 10:30:57.000000000 +0530
+++ new/drivers/net/e1000/e1000_main.c	2007-09-14 10:31:02.000000000 +0530
@@ -990,7 +990,7 @@ e1000_probe(struct pci_dev *pdev,
 	if (pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;
 
-	netdev->features |= NETIF_F_LLTX;
+	netdev->features |= NETIF_F_LLTX | NETIF_F_BATCH_SKBS;
 
 	adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw);
 
@@ -3092,6 +3092,17 @@ e1000_tx_map(struct e1000_adapter *adapt
 	return count;
 }
 
+static void e1000_kick_DMA(struct e1000_adapter *adapter,
+			   struct e1000_tx_ring *tx_ring, int i)
+{
+	wmb();
+
+	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
+	/* we need this if more than one processor can write to our tail
+	 * at a time, it syncronizes IO on IA64/Altix systems */
+	mmiowb();
+}
+
 static void
 e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
                int tx_flags, int count)
@@ -3138,13 +3149,7 @@ e1000_tx_queue(struct e1000_adapter *ada
 	 * know there are new descriptors to fetch.  (Only
 	 * applicable for weak-ordered memory model archs,
 	 * such as IA-64). */
-	wmb();
-
 	tx_ring->next_to_use = i;
-	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
-	/* we need this if more than one processor can write to our tail
-	 * at a time, it syncronizes IO on IA64/Altix systems */
-	mmiowb();
 }
 
 /**
@@ -3251,22 +3256,23 @@ static int e1000_maybe_stop_tx(struct ne
 }
 
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
+
+#define NETDEV_TX_DROPPED	-5
+
 static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device ...
From: Evgeniy Polyakov
Date: Friday, September 14, 2007 - 5:47 am

This changes could actually go as own patch, although not sure it is

Why is it put at the end?


-- 
	Evgeniy Polyakov
-

From: Krishna Kumar2
Date: Sunday, September 16, 2007 - 8:56 pm

Hi Evgeniy,


Since this flag is new and useful only for batching, I feel it is OK to

There is a bug that I had explained in rev4 (see XXX below) resulting
in sending out skbs out of order. The fix is that if the driver gets
called with a skb but there are older skbs already in the batch list
(which failed to get sent out), send those skbs first before this one.

Thanks,

- KK

[XXX] Dave had suggested to use batching only in the net_tx_action case.
When I implemented that in earlier revisions, there were lots of TCP
retransmissions (about 18,000 to every 1 in regular code). I found the
reason
for part of that problem as: skbs get queue'd up in dev->qdisc (when tx
lock
was not got or queue blocked); when net_tx_action is called later, it
passes
the batch list as argument to qdisc_run and this results in skbs being
moved
to the batch list; then batching xmit also fails due to tx lock failure;
the
next many regular xmit of a single skb will go through the fast path (pass
NULL batch list to qdisc_run) and send those skbs out to the device while
previous skbs are cooling their heels in the batch list.

The first fix was to not pass NULL/batch-list to qdisc_run() but to always
check whether skbs are present in batch list when trying to xmit. This
reduced
retransmissions by a third (from 18,000 to around 12,000), but led to
another
problem while testing - iperf transmit almost zero data for higher # of
parallel flows like 64 or more (and when I run iperf for a 2 min run, it
takes about 5-6 mins, and reports that it ran 0 secs and the amount of data
transfered is a few MB's). I don't know why this happens with this being
the
only change (any ideas is very appreciated).

The second fix that resolved this was to revert back to Dave's suggestion
to
use batching only in net_tx_action case, and modify the driver to see if
skbs
are present in batch list and to send them out first before sending the
current skb. I still see huge retransmission for IPoIB (but not for ...
From: Kok, Auke
Date: Tuesday, November 13, 2007 - 2:28 pm

this doesn't apply anymore and it would help if you could re-spin this for e1000e.
I don't know what the status for merging of the batched xmit patches is right now
but it would help if you could rewrite them against e1000e, which I assume is what
most people want to test with. There are also significant changes upstream right
now in jgarzik/netdev-2.6 #upstream...

I'm still very interested in these patches BTW.

-

From: Krishna Kumar2
Date: Wednesday, November 14, 2007 - 1:30 am

Hi Auke,


I will make a latest version and test it out for some numbers and try to
send it this week.

Thanks,

- KK

-

From: Evgeniy Polyakov
Date: Friday, September 14, 2007 - 5:49 am

Hi Krishna.


And what about latency for this patchset?

-- 
	Evgeniy Polyakov
-

From: David Miller
Date: Sunday, September 16, 2007 - 4:17 pm

From: Krishna Kumar <krkumar2@in.ibm.com>

The only major complaint I have about this patch series is that
the IPoIB part should just be one big changeset.  Otherwise the
tree is not bisectable, for example the initial ipoib header file
change breaks the build.

The tree must compile and work properly after every single patch.

On a lower priority, I question the indirection of skb_blist by making
it a pointer.  For what?  Saving 12 bytes on 64-bit?  That kmalloc()'d
thing is a nearly guarenteed cache and/or TLB miss.  Just inline the
thing, we generally don't do crap like this anywhere else.
-

From: jamal
Date: Sunday, September 16, 2007 - 5:29 pm

Dave, you do realize that i have been investing my time working on
batching as well, right? 

cheers,
jamal


-

From: jamal
Date: Sunday, September 23, 2007 - 10:53 am

I had plenty of time this weekend so i have been doing a _lot_ of
testing.  My next emails will send a set of patches:
 
Patch 1: Introduces explicit tx locking
Patch 2: Introduces batching interface
Patch 3: Core uses batching interface
Patch 4: get rid of dev->gso_skb

Testing
-------
Each of these patches has been performance tested and the results
are in the logs on a per-patch basis. 
My system under test hardware is a 2xdual core opteron with a couple of 
tg3s. 
My test tool generates udp traffic of different sizes for upto 60 
seconds per run or a total of 30M packets. I have 4 threads each 
running on a specific CPU which keep all the CPUs as busy as they can 
sending packets targetted at a directly connected box's udp discard
port.

All 4 CPUs target a single tg3 to send. The receiving box has a tc rule 
which counts and drops all incoming udp packets to discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each
packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I have not run testing on patch #4 because i had to let the machine
go, but will have some access to it tommorow early morning where i can
run some tests.

Comments
--------
Iam trying to kill ->hard_batch_xmit() but it would be tricky to do
without it for LLTX drivers. Anything i try will require a few extra
checks. OTOH, I could kill LLTX for the drivers i am using that
are LLTX and then drop that interface or I could say "no support
for LLTX". I am in a dilema.

Dave please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM benefit just in case i
misunderstood. 
Herbert, if you can look at at least patch 4 i will appreaciate it.

More patches to follow  - i didnt want to overload people by dumping 
too many patches. Most of these patches below are ready to go; some ...
From: jamal
Date: Sunday, September 23, 2007 - 10:56 am

I have submitted this before; but here it is again.
Against net-2.6.24 from yesterday for this and all following patches. 


cheers,
jamal

From: jamal
Date: Sunday, September 23, 2007 - 10:58 am

This patch introduces the netdevice interface for batching.

cheers,
jamal

From: jamal
Date: Sunday, September 23, 2007 - 11:00 am

This patch adds the usage of batching within the core.

cheers,
jamal

From: jamal
Date: Sunday, September 23, 2007 - 11:02 am

This patch removes dev->gso_skb as it is no longer necessary with
batching code.

cheers,
jamal

From: jamal
Date: Sunday, September 30, 2007 - 11:53 am

This patch removes dev->gso_skb as it is no longer necessary with
batching code.

cheers,
jamal


From: jamal
Date: Sunday, October 7, 2007 - 11:39 am

This patch removes dev->gso_skb as it is no longer necessary with
batching code.

cheers,
jamal
From: jamal
Date: Sunday, September 30, 2007 - 11:52 am

This patch adds the usage of batching within the core.

cheers,
jamal



From: Bill Fink
Date: Sunday, September 30, 2007 - 9:11 pm

Have you done performance comparisons for the case of using 9000-byte
jumbo frames?

						-Bill
-

From: jamal
Date: Monday, October 1, 2007 - 6:30 am

I havent, but will try if any of the gige cards i have support it.

As a side note: I have not seen any useful gains or losses as the packet
size approaches even 1500B MTU. For example, post about 256B neither the
batching nor the non-batching give much difference in either throughput
or cpu use. Below 256B, theres a noticeable gain for batching.
Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and
so the occupancy of both the qdisc queue(s) and ethernet ring is
constantly high. For example at 512B, the app is 80% idle on all 4 CPUs
and we are hitting in the range of wire speed. We are at 90% idle at
1024B. This is the case with or without batching.  So my suspicion is
that with that trend a 9000B packet will just follow the same pattern.


cheers,
jamal

-

From: Bill Fink
Date: Monday, October 1, 2007 - 9:25 pm

One reason I ask, is that on an earlier set of alternative batching
xmit patches by Krishna Kumar, his performance testing showed a 30 %
performance hit for TCP for a single process and a size of 4 KB, and
a performance hit of 5 % for a single process and a size of 16 KB
(a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
time to inquire further about it, but it would be a major potential
concern for me in my 10-GigE network testing with 9000-byte jumbo
frames.  Of course the single process and 4 KB or larger size was
the only case that showed a significant performance hit in Krishna
Kumar's latest reported test results, so it might be acceptable to
just have a switch to disable the batching feature for that specific
usage scenario.  So it would be useful to know if your xmit batching
changes would have similar issues.

Also for your xmit batching changes, I think it would be good to see
performance comparisons for TCP and IP forwarding in addition to your
UDP pktgen tests, including various packet sizes up to and including
9000-byte jumbo frames.

						-Bill
-

From: jamal
Date: Tuesday, October 2, 2007 - 6:20 am

There were many times while testing that i noticed inconsistencies and
in each case when i analysed[1], i found it to be due to some variable
other than batching which needed some resolving, always via some
parametrization or other. I suspect what KK posted is in the same class.
To give you an example, with UDP, batching was giving worse results at
around 256B compared to 64B or 512B; investigating i found that the
receiver just wasnt able to keep up and the udp layer dropped a lot of
packets so both iperf and netperf reported bad numbers. Fixing the
receiver ended up with consistency coming back. On why 256B was the one
that overwhelmed the receiver more than 64B(which sent more pps)? On
some limited investigation, it seemed to me to be the effect of the
choice of the tg3 driver's default tx mitigation parameters as well tx
ring size; which is something i plan to revisit (but neutralizing it
helps me focus on just batching). In the end i dropped both netperf and
iperf for similar reasons and wrote my own app. What i am trying to
achieve is demonstrate if batching is a GoodThing. In experimentation
like this, it is extremely valuable to reduce the variables. Batching
may expose other orthogonal issues - those need to be resolved or fixed
as they are found. I hope that sounds sensible.

Back to the >=9K packet size you raise above:
I dont have a 10Gige card so iam theorizing. Given that theres an
observed benefit to batching for a saturated link with "smaller" packets
(in my results "small" is anything below 256B which maps to about
380Kpps anything above that seems to approach wire speed and the link is
the bottleneck); then i theorize that 10Gige with 9K jumbo frames if
already achieving wire rate, should continue to do so. And sizes below
that will see improvements if they were not already hitting wire rate.
So i would say that with 10G NICS, there will be more observed
improvements with batching with apps that do bulk transfers (assuming
those apps are not seeing wire ...
From: Bill Fink
Date: Tuesday, October 2, 2007 - 10:29 pm

It does sound sensible.  My own decidedly non-expert speculation
was that the big 30 % performance hit right at 4 KB may be related
to memory allocation issues or having to split the skb across
multiple 4 KB pages.  And perhaps it only affected the single
process case because with multiple processes lock contention may
be a bigger issue and the xmit batching changes would presumably
help with that.  I am admittedly a novice when it comes to the
detailed internals of TCP/skb processing, although I have been
slowly slogging my way through parts of the TCP kernel code to
try and get a better understanding, so I don't know if these
thoughts have any merit.

BTW does anyone know of a good book they would recommend that has
substantial coverage of the Linux kernel TCP code, that's fairly
up-to-date and gives both an overall view of the code and packet
flow as well as details on individual functions and algorithms,
and hopefully covers basic issues like locking and synchronization,
concurrency of different parts of the stack, and memory allocation.
I have several books already on Linux kernel and networking internals,
but they seem to only cover the IP (and perhaps UDP) portions of the
network stack, and none have more than a cursory reference to TCP.  
The most useful documentation on the Linux TCP stack that I have
found thus far is some of Dave Miller's excellent web pages and
a few other web references, but overall it seems fairly skimpy

It would be good to see some empirical evidence that there aren't
any unforeseen gotchas for larger packet sizes, that at least the
same level of performance can be obtained with no greater CPU


As you have written previously, and I heartily agree with, this is a
very good practice for developing performance enhancement patches.

						-Thanks

						-Bill
-

From: jamal
Date: Wednesday, October 3, 2007 - 6:42 am

plausible. But i also worry it could be 10 other things; example, could
it be the driver used? I noted in my udp test the oddity that turned out
to be tx coal parameter related.

You do bring up issues that need to be looked into and i will run those
tests.
Note, the effectiveness of batching becomes evident as the number of
flows grows. Actually, scratch that: It becomes evident if you can keep
the tx path busyed out to which multiple users running contribute. If i
can have a user per CPU with lots of traffic to send, i can create that
condition. It's a little boring in the scenario where the bottleneck is

Reading books or magazines may end up busying you out with some small
gains of knowledge at the end. They tend to be outdated fast. My advice
is if you start with a focus on one thing, watch the patches that fly
around on that area and learn that way. Read the code to further
understand things then ask questions when its not clear. Other folks may
have different views. The other way to do it is pick yourself some task

Reasonable - I will try with 9K after i move over to the new tree from
Dave and make sure nothing else broke in the previous tests.

To give you a perspective, the results i posted were each run 10
iterations per packet size per kernel. Each run is 60 seconds long. I
think i am past that stage for resolving or fixing anything for UDP or
pktgen, but i need to keep checking for any new regressions when Dave
updates his tree. Now multiply that by 5 packet sizes (I am going to add
2 more) and multiply that by 3-4 kernels. Then add the time it takes to
sift through the data and collect it then analyze it and go back to the
drawing table when something doesnt look right.  Essentially, it needs a
weekend ;->

cheers,
jamal

-

From: Patrick McHardy
Date: Monday, October 1, 2007 - 3:42 am

->requeue queues at the head, so this looks like it would reverse
the order of the skbs.


-

From: jamal
Date: Monday, October 1, 2007 - 6:21 am

Excellent catch!  thanks; i will fix.

As a side note: Any batching driver should _never_ have to requeue; if
it does it is buggy. And the non-batching ones if they ever requeue will
be a single packet, so not much reordering.

Thanks again Patrick.

cheers,
jamal

-

From: Krishna Kumar2
Date: Sunday, October 7, 2007 - 10:03 pm

On the contrary, batching LLTX drivers (if that is not ruled out) will very
often requeue resulting in heavy reordering. Fix looks good though.

- KK

-

From: jamal
Date: Monday, October 8, 2007 - 6:17 am

Two things: 
one, LLTX is deprecated (I think i saw a patch which says no more new
drivers should do LLTX) and i plan if nobody else does to kill 
LLTX in e1000 RSN. So for that reason i removed all code that existed to
support LLTX.
two, there should _never_ be any requeueing even if LLTX in the previous
patches when i supported them; if there is, it is a bug. This is because
we dont send more than what the driver asked for via xmit_win. So if it
asked for more than it can handle, that is a bug. If its available space
changes while we are sending to it, that too is a bug.

cheers,
jamal

-

From: Krishna Kumar2
Date: Monday, October 8, 2007 - 8:09 pm

Driver might ask for 10 and we send 10, but LLTX driver might fail to get
lock and return TX_LOCKED. I haven't seen your code in greater detail, but
don't you requeue in that case too?

- KK

-

From: jamal
Date: Tuesday, October 9, 2007 - 6:10 am

For others drivers that are non-batching and LLTX, it is possible - at
the moment in my patch i whine that the driver is buggy. I will fix this
up so it checks for NETIF_F_BTX. Thanks for pointing the above use case.

cheers,
jamal

-

From: jamal
Date: Sunday, October 7, 2007 - 11:38 am

This patch adds the usage of batching within the core.

cheers,
jamal



From: jamal
Date: Sunday, September 30, 2007 - 11:51 am

This patch introduces the netdevice interface for batching.

cheers,
jamal


From: jamal
Date: Sunday, September 30, 2007 - 11:54 am

Fixed subject - should be 1/3 not 1/4


-

From: jamal
Date: Sunday, October 7, 2007 - 11:36 am

This patch introduces the netdevice interface for batching.

cheers,
jamal


From: Krishna Kumar2
Date: Monday, October 8, 2007 - 2:59 am

Hi Jamal,

If you don't mind, I am trying to run your approach vs mine to get some
results
for comparison.

For starters, I am having issues with iperf when using your infrastructure
code with
my IPoIB driver - about 100MB is sent and then everything stops for some
reason.
The changes in the IPoIB driver that I made to support batching is to set
BTX, set
xmit_win, and dynamically reduce xmit_win on every xmit and increase
xmit_win on
every xmit completion. Is there anything else that is required from the
driver?

thanks,

- KK


-

From: jamal
Date: Monday, October 8, 2007 - 6:49 am

Please provide an analysis when you get the results. IOW, explain why

I havent tested with iperf in a while.
Can you post the netstat on both sides when the driver stops?
---
This variable should be set during xmit path shutdown(netif_stop),
wakeup(netif_wake) and ->hard_end_xmit(). In the case of the first
one the value is set to 1 and in the other two it is set to whatever
the driver deems to be available space on the ring.

Your driver needs to also support wake thresholding.
I will post the driver howto later today.

cheers,
jamal

-

From: Waskiewicz Jr, Peter P
Date: Monday, September 24, 2007 - 12:12 pm

Hi Jamal,
	I've been (slowly) working on resurrecting the original design
of my multiqueue patches to address this exact issue of the queue_lock
being a hot item.  I added a queue_lock to each queue in the subqueue
struct, and in the enqueue and dequeue, just lock that queue instead of
the global device queue_lock.  The only two issues to overcome are the
QDISC_RUNNING state flag, since that also serializes entry into the
qdisc_restart() function, and the qdisc statistics maintenance, which
needs to be serialized.  Do you think this work along with your patch
will benefit from one another?  I apologize for not having working
patches right now, but I am working on them slowly as I have some blips
of spare time.

Thanks,
-PJ Waskiewicz
-

From: jamal
Date: Monday, September 24, 2007 - 3:51 pm

The one thing that seems obvious is to use dev->hard_prep_xmit() in the
patches i posted to select the xmit ring. You should be able to do
figure out the txmit ring without holding any lock. 

I lost track of how/where things went since the last discussion; so i
need to wrap my mind around it to make sensisble suggestions - I know
the core patches are in the kernel but havent paid attention to details
and if you look at my second patch youd see a comment in
dev_batch_xmit() which says i need to scrutinize multiqueue more. 

cheers,
jamal

-

From: Waskiewicz Jr, Peter P
Date: Monday, September 24, 2007 - 3:57 pm

I've looked at that as a candidate to use.  The lock for enqueue would
be needed when actually placing the skb into the appropriate software

No worries.  I'll try to get things together on my end and provide some
patches to add a per-queue lock.  In the meantime, I'll take a much
closer look at the batching code, since I've stopped looking at the
patches in-depth about a month ago.  :-(

Thanks,
-PJ Waskiewicz
-

From: jamal
Date: Monday, September 24, 2007 - 4:38 pm

The enqueue is easy to comprehend. The single device queue lock should
suffice. The dequeue is interesting:
Maybe you can point me to some doc or describe to me the dequeue aspect;
are you planning to have an array of txlocks per, one per ring?
How is the policy to define the qdisc queues locked/mapped to tx rings? 

cheers,
jamal

-

From: David Miller
Date: Sunday, October 7, 2007 - 9:51 pm

From: jamal <hadi@cyberus.ca>

For these high performance 10Gbit cards it's a load balancing
function, really, as all of the transmit queues go out to the same
physical port so you could:

1) Load balance on CPU number.
2) Load balance on "flow"
3) Load balance on destination MAC

etc. etc. etc.

It's something that really sits logically between the qdisc and the
card, not something that is a qdisc thing.

In some ways it's similar to bonding, but using anything similar to
bonding's infrastructure (stacking devices) is way overkill for this.

And then we have the virtualization network devices where the queue
selection has to be made precisely, in order for the packet to
reach the proper destination, rather than a performance improvement.
It is also a situation where the TX queue selection is something
to be made between qdisc activity and hitting the device.

I think we will initially have to live with taking the centralized
qdisc lock for the device, get in and out of that as fast as possible,
then only take the TX queue lock of the queue selected.

After we get things that far we can try to find some clever lockless
algorithm for handling the qdisc to get rid of that hot spot.

These queue selection schemes want a common piece of generic code.  A
set of load balancing algorithms, a "select TX queue by MAC with a
default fallback on no match" for virtualization, and interfaces for
both drivers and userspace to change the queue selection scheme.

-

From: jamal
Date: Monday, October 8, 2007 - 6:34 am

The brain-block i am having is the parallelization aspect of it.
Whatever scheme it is - it needs to ensure the scheduler works as
expected. For example, if it was a strict prio scheduler i would expect
that whatever goes out is always high priority first and never ever
allow a low prio packet out at any time theres something high prio
needing to go out. If i have the two priorities running on two cpus,
then i cant guarantee that effect.
IOW, i see the scheduler/qdisc level as not being split across parallel
cpus. Do i make any sense?

The rest of my understanding hinges on the above, so let me stop here.

cheers,
jamal

-

From: Jeff Garzik
Date: Monday, October 8, 2007 - 7:22 am

Any chance the NIC hardware could provide that guarantee?

8139cp, for example, has two TX DMA rings, with hardcoded 
characteristics:  one is a high prio q, and one a low prio q.  The logic 
is pretty simple:   empty the high prio q first (potentially starving 
low prio q, in worst case).


In terms of overall parallelization, both for TX as well as RX, my gut 
feeling is that we want to move towards an MSI-X, multi-core friendly 
model where packets are LIKELY to be sent and received by the same set 
of [cpus | cores | packages | nodes] that the [userland] processes 
dealing with the data.

There are already some primitive NUMA bits in skbuff allocation, but 
with modern MSI-X and RX/TX flow hashing we could do a whole lot more, 
along the lines of better CPU scheduling decisions, directing flows to 
clusters of cpus, and generally doing a better job of maximizing cache 
efficiency in a modern multi-thread environment.

IMO the current model where each NIC's TX completion and RX processes 
are both locked to the same CPU is outmoded in a multi-core world with 
modern NICs.  :)

But I readily admit general ignorance about the kernel process 
scheduling stuff, so my only idea about a starting point was to see how 
far to go with the concept of "skb affinity" -- a mask in sk_buff that 
is a hint about which cpu(s) on which the NIC should attempt to send and 
receive packets.  When going through bonding or netfilter, it is trivial 
to 'or' together affinity masks.  All the various layers of net stack 
should attempt to honor the skb affinity, where feasible (requires 
interaction with CFS scheduler?).

Or maybe skb affinity is a dumb idea.  I wanted to get people thinking 
on the bigger picture.  Parallelization starts at the user process.

	Jeff


-

From: jamal
Date: Monday, October 8, 2007 - 8:18 am

If you can get the scheduling/dequeuing to run on one CPU (as we do
today) it should work; alternatively you can totaly bypass the qdisc
subystem and go direct to the hardware for devices that are capable
and that would work but would require huge changes. 
My fear is there's a mini-scheduler pieces running on multi cpus which

sounds like strict prio scheduling to me which says "if low prio starves

Does putting things in the same core help? But overall i agree with your

I think i see the receive with a lot of clarity, i am still foggy on the

Infact even with status quo theres a case that can be made to not bind
to interupts.
In my recent experience with batching, due to the nature of my test app,
if i let the interupts float across multiple cpus i benefit.
My app runs/binds a thread per CPU and so benefits from having more
juice to send more packets per unit of time - something i wouldnt get if
i was always running on one cpu.
But when i do this i found that just because i have bound a thread to
cpu3 doesnt mean that thread will always run on cpu3. If netif_wakeup
happens on cpu1, scheduler will put the thread on cpu1 if it is to be

There would be cache benefits if you can free the packet on the same cpu
it was allocated; so the idea of skb affinity is useful in the minimal
in that sense if you can pull it. Assuming hardware is capable, even if
you just tagged it on xmit to say which cpu it was sent out on, and made
sure thats where it is freed, that would be a good start. 

Note: The majority of the packet processing overhead is _still_ the
memory subsystem latency; in my tests with batched pktgen improving the 
xmit subsystem meant the overhead on allocing and freeing the packets
went to something > 80%.
So something along the lines of parallelizing based on a split of alloc
free of sksb IMO on more cpus than where xmit/receive run would see


cheers,
jamal

-

From: David Miller
Date: Monday, October 8, 2007 - 2:11 pm

From: Jeff Garzik <jeff@garzik.org>

The problem is that the packet schedulers want global guarantees
on packet ordering, not flow centric ones.

That is the issue Jamal is concerned about.

The more I think about it, the more inevitable it seems that we really
might need multiple qdiscs, one for each TX queue, to pull this full
parallelization off.

But the semantics of that don't smell so nice either.  If the user
attaches a new qdisc to "ethN", does it go to all the TX queues, or
what?

All of the traffic shaping technology deals with the device as a unary
object.  It doesn't fit to multi-queue at all.
-

From: jamal
Date: Monday, October 8, 2007 - 3:30 pm

If you let only one CPU at a time access the "xmit path" you solve all
the reordering. If you want to be more fine grained you make the
serialization point as low as possible in the stack - perhaps in the
driver.
But I think even what we have today with only one cpu entering the
dequeue/scheduler region, _for starters_, is not bad actually ;->  What
i am finding (and i can tell you i have been trying hard;->) is that a
sufficiently fast cpu doesnt sit in the dequeue area for "too long" (and
batching reduces the time spent further). Very quickly there are no more
packets for it to dequeue from the qdisc or the driver is stoped and it
has to get out of there. If you dont have any interupt tied to a
specific cpu then you can have many cpus enter and leave that region all
the time. 

cheers,
jamal

-

From: David Miller
Date: Monday, October 8, 2007 - 3:33 pm

From: jamal <hadi@cyberus.ca>

With the lock shuttling back and forth between those cpus, which is
what we're trying to avoid.

Multiply whatever effect you think you might be able to measure due to
that on your 2 or 4 way system, and multiple it up to 64 cpus or so
for machines I am using.  This is where machines are going, and is
going to become the norm.
-

From: Waskiewicz Jr, Peter P
Date: Monday, October 8, 2007 - 3:35 pm

That along with speeds going to 10 GbE with multiple Tx/Rx queues (with
40 and 100 GbE under discussion now), where multiple CPU's hitting the
driver are needed to push line rate without cratering the entire
machine.

-PJ Waskiewicz
-

From: jamal
Date: Monday, October 8, 2007 - 4:42 pm

Yes, i keep forgetting that ;-> I need to train my brain to remember
that.

cheers,
jamal



-

From: Jeff Garzik
Date: Monday, October 8, 2007 - 6:53 pm

Oh, absolutely.

I think, fundamentally, any amount of cross-flow resource management 
done in software is an obstacle to concurrency.

That's not a value judgement, just a statement of fact.

"traffic cops" are intentional bottlenecks we add to the process, to 
enable features like priority flows, filtering, or even simple socket 
fairness guarantees.  Each of those bottlenecks serves a valid purpose, 
but at the end of the day, it's still a bottleneck.

So, improving concurrency may require turning off useful features that 

Well the easy solutions to networking concurrency are

* use virtualization to carve up the machine into chunks

* use multiple net devices

Since new NIC hardware is actively trying to be friendly to 
multi-channel/virt scenarios, either of these is reasonably 
straightforward given the current state of the Linux net stack.  Using 
multiple net devices is especially attractive because it works very well 
with the existing packet scheduling.

Both unfortunately impose a burden on the developer and admin, to force 
their apps to distribute flows across multiple [VMs | net devs].


The third alternative is to use a single net device, with SMP-friendly 
packet scheduling.  Here you run into the problems you described "device 
as a unary object" etc. with the current infrastructure.

With multiple TX rings, consider that we are pushing the packet 
scheduling from software to hardware...  which implies
* hardware-specific packet scheduling
* some TC/shaping features not available, because hardware doesn't 
support it

	Jeff




-

From: David Miller
Date: Monday, October 8, 2007 - 2:05 pm

From: jamal <hadi@cyberus.ca>

Picture it like N tubes you stick packets into, and the tubes are
processed using DRR.

So packets within a tube won't be reordered, but reordering amongst
tubes is definitely possible.
-

From: Waskiewicz Jr, Peter P
Date: Monday, September 24, 2007 - 4:47 pm

We should make sure we're symmetric with the locking on enqueue to
dequeue.  If we use the single device queue lock on enqueue, then
dequeue will also need to check that lock in addition to the individual
queue lock.  The details of this are more trivial than the actual

The dequeue locking would be pushed into the qdisc itself.  This is how
I had it originally, and it did make the code more complex, but it was
successful at breaking the heavily-contended queue_lock apart.  I have a
subqueue structure right now in netdev, which only has queue_state (for
netif_{start|stop}_subqueue).  This state is checked in sch_prio right
now in the dequeue for both prio and rr.  My approach is to add a
queue_lock in that struct, so each queue allocated by the driver would
have a lock per queue.  Then in dequeue, that lock would be taken when
the skb is about to be dequeued.

The skb->queue_mapping field also maps directly to the queue index
itself, so it can be unlocked easily outside of the context of the
dequeue function.  The policy would be to use a spin_trylock() in
dequeue, so that dequeue can still do work if enqueue or another dequeue
is busy.  And the allocation of qdisc queues to device queues is assumed
to be one-to-one (that's how the qdisc behaves now).

I really just need to put my nose to the grindstone and get the patches
together and to the list...stay tuned.

Thanks,
-PJ Waskiewicz
-

From: jamal
Date: Tuesday, September 25, 2007 - 6:08 am

more locks implies degraded performance. If only one processor can enter
that region, presumably after acquiring the outer lock , why this

So there could be a parallel cpu dequeueing at the same time?
Wouldnt this have implications depending on what the scheduling
algorithm used? If for example i was doing priority queueing i would
want to make sure the highest priority is being dequeued first AND by
all means goes out first to the driver; i dont want a parallell cpu

Ok, that brings back the discussion we had; my thinking was something
like dev->hard_prep_xmit() would select the ring and i think you
staticly already map the ring to a qdisc queue. So i dont think 
dev->hard_prep_xmit() is useful to you.
In any case, there is nothing the batching patches do that interfere or
prevent you from going the path you intend to. instead of dequeueing one
packet, you dequeue several and instead of sending to the driver one
packet, you send several. And using the xmit_win, you should never ever
have to requeue.

cheers,
jamal

-

From: Stephen Hemminger
Date: Monday, September 24, 2007 - 5:14 pm

On Mon, 24 Sep 2007 16:47:06 -0700


Since we are redoing this, is there any way to make the whole TX path
more lockless?  The existing model seems to be more of a monitor than
a real locking model.
-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-

From: jamal
Date: Tuesday, September 25, 2007 - 6:15 am

What do you mean it is "more of a monitor"?

On the challenge of making it lockless:
About every NAPI driver combines the tx prunning with rx polling. If you
are dealing with tx resources on receive thread as well as tx thread,
_you need_ locking. The only other way we can do avoid it is to separate
the rx path interupts from ones on tx related resources; the last NAPI
driver that did that was tulip; i think the e1000 for a short period in
its life did the same as well. But that has been frowned on and people
have evolved away from it.

cheers,
jamal

-

From: Stephen Hemminger
Date: Tuesday, September 25, 2007 - 8:24 am

On Tue, 25 Sep 2007 09:15:38 -0400


The transmit code path is locked as a code region, rather than just object locking
on the transmit queue or other fine grained object. This leads to moderately long

If we went to finer grain locking it would also mean changes to all network
devices using the new locking model. My assumption is that we would use
something like the features flag to do the transition for backward compatibility.

Take this as a purely "what if" or "it would be nice if" kind of suggestion
not a requirement or some grand plan.


-- 
Stephen Hemminger <shemminger@linux-foundation.org>
-

From: jamal
Date: Tuesday, September 25, 2007 - 3:14 pm

It will be pretty tricky to optimize that path given the dependencies
between the queues, classifiers, and actions in enqueues; schedulers in
dequeues as well as their config/queries from user space which could
happen concurently on all "N" CPUs. 
The txlock optimization i added in patch1 intends to let go of the queue
lock when we enter the dequeue region sooner to reduce the contention.

A further optimization i made was to reduce the time it takes to hold
the tx lock at the driver by moving gunk that doesnt need lock-holding

Ok, hopefully someone would demonstrate how to achieve it; seems a hard
thing to achieve.

cheers,
jamal

-

From: jamal
Date: Tuesday, September 25, 2007 - 3:43 pm

Sorry, that should have read dev->hard_prep_xmit()

cheers,
jamal

-

From: Waskiewicz Jr, Peter P
Date: Monday, September 24, 2007 - 5:31 pm

That seems quite reasonable.  I will certainly see what I can do.

Thanks Stephen,

-PJ Waskiewicz
-

From: jamal
Date: Sunday, September 30, 2007 - 11:50 am

Latest net-2.6.24 breaks the patches i posted last week; so this is an
update to resolve that. If you are receiving these emails and are
finding them overloading, please give me a shout and i will remove your
name.

Please provide feedback on the code and/or architecture.
Last time i posted them i received none. They are now updated to 
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

I have decided i will kill ->hard_batch_xmit() and not support any
more LLTX drivers. This is the last of patches that will have
->hard_batch_xmit() as i am supporting an e1000 that is LLTX.

Dave please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM benefit just in case i
misunderstood. 
Herbert, if you can look at at least patch 3 i will appreaciate it
(since it kills dev->gso_skb that you introduced).

More patches to follow later if i get some feedback - i didnt want to 
overload people by dumping too many patches. Most of these patches 
mentioned below are ready to go; some need some re-testing and others 
need a little porting from an earlier kernel: 
- tg3 driver (tested and works well, but dont want to send 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

Theres also a driver-howto i wrote that was posted on netdev last week
as well as one that describes the architectural decisions made.

Each of these patches has been performance tested (last with DaveM's
tree from last weekend) and the results are in the logs on a per-patch 
basis.  My system under test hardware is a 2xdual core opteron with a 
couple of tg3s. I have not re-run the tests with this morning's tree
but i suspect not much difference.
My test tool generates udp traffic of different sizes for upto 60 
seconds per run or a total of 30M ...
From: jamal
Date: Sunday, September 30, 2007 - 12:19 pm

And heres a patch that provides a sample of the usage for batching with
tg3. 
Requires patch "[TG3]Some cleanups" i posted earlier. 

cheers,
jamal

From: jamal
Date: Sunday, October 7, 2007 - 11:34 am

Please provide feedback on the code and/or architecture.
Last time i posted them i received little. They are now updated to 
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

What has changed since i posted last:
1) Fix a bug eyeballed by Patrick McHardy on requeue reordering.
2) Killed ->hard_batch_xmit() 
3) I am going one step back and making this set of patches even simpler
so i can make it easier to review.I am therefore killing dev->hard_prep_xmit()
and focussing just on batching. I plan to re-introduce dev->hard_prep_xmit()
but from now on i will make that a separate effort. (it seems to be creating
confusion in relation to the general work).

Dave please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM benefit just in case i misunderstood. 
Herbert, if you can look at at least patch 3 i will appreaciate it
(since it kills dev->gso_skb that you introduced).

UPCOMING PATCHES
---------------
As before:
More patches to follow later if i get some feedback - i didnt want to 
overload people by dumping too many patches. Most of these patches 
mentioned below are ready to go; some need some re-testing and others 
need a little porting from an earlier kernel: 
- tg3 driver 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

Theres also a driver-howto i wrote that was posted on netdev last week
as well as one that describes the architectural decisions made.

PERFORMANCE TESTING
--------------------
I started testing since yesterday, but these tests take a long time
so i will post results probably at the end of the day sometime and
may stop running more tests and just comparing batch vs non-batch results.
I have optimized the kernel-config so i expect my overall performance
numbers to look better ...
From: Evgeniy Polyakov
Date: Monday, October 8, 2007 - 5:51 am

Hi Jamal.


it looks like you and Krishna use the same requeueing methods - get one
from qdisk, queue it into blist, get next from qdisk, queue it,
eventually start transmit, where you dequeue it one-by-one and send (or
prepare and commit). This is not the 100% optimal approach, but if you
proved it does not hurt usual network processing, it is ok.
Number of comments dusted to very small - that's a sign, but I'm a bit
lost - did you and Krishna create the competing approaches, or they can
co-exist together, in the former case I doubt you can push, until all
problematic places are resolved, in the latter case, this is probably
ready.

-- 
	Evgeniy Polyakov
-

From: jamal
Date: Monday, October 8, 2007 - 7:05 am

There are probably other bottlenecks that hide the need to optimize

Thanks. I would like to make one more cleanup and get rid of the
temporary pkt list in qdisc restart; now that i have defered the skb
pre-format interface it is unnecessary.  I have a day off today, so i
will make changes, re-run tests and post again.

I dont see something from Krishna's approach that i can take and reuse.
This maybe because my old approaches have evolved from the same path.
There is a long list but as a sample: i used to do a lot more work while
holding the queue lock which i have now moved post queue lock; i dont
have any speacial interfaces/tricks just for batching, i provide hints
to the core of how much the driver can take etc etc. I have offered
Krishna co-authorship if he makes the IPOIB driver to work on my
patches, that offer still stands if he chooses to take it. 

cheers,
jamal

-

From: Krishna Kumar2
Date: Tuesday, October 9, 2007 - 1:14 am

My feeling is that since the approaches are very different, it would
be a good idea to test the two for performance. Do you mind me doing
that? Ofcourse others and/or you are more than welcome to do the same.

I had sent a note to you yesterday about this, please let me know
either way.

******************* Previous mail ******************

Hi Jamal,

If you don't mind, I am trying to run your approach vs mine to get some
results
for comparison.

For starters, I am having issues with iperf when using your infrastructure
code with
my IPoIB driver - about 100MB is sent and then everything stops for some
reason.
The changes in the IPoIB driver that I made to support batching is to set
BTX, set
xmit_win, and dynamically reduce xmit_win on every xmit and increase
xmit_win on
every xmit completion. Is there anything else that is required from the
driver?

thanks,

- KK

-

From: jamal
Date: Tuesday, October 9, 2007 - 6:25 am

My concern is the approaches are different only for short periods of
time. For example, I do requeueing, have xmit_win, have ->end_xmit,
do batching from core etc; if you see value in any of these concepts,
they will appear in your patches and this goes on a loop. Perhaps what

Which i dont mind as long as it has an analysis that goes with it.
If all you post is "heres what netperf showed", it is not useful at all.
There are also a lot of affecting variables. For example, is the
receiver a bottleneck? To make it worse, I could demonstrate to you that
if i slowed down the driver and allowed more packets to queue up on the
qdisc, batching will do well. In the past my feeling is you glossed over

I responded to you - but it may have been lost in the noise; heres a
copy:
http://marc.info/?l=linux-netdev&m=119185137124008&w=2

cheers,
jamal

-

From: Jeff Garzik
Date: Sunday, September 23, 2007 - 11:19 am

You should post at least a couple driver patches to see how its used on 
Real Hardware(tm)...   :)

The batching idea has always seemed like a no-brainer to me, so I'm very 
interested to see how this turns out.

	Jeff


-

From: jamal
Date: Sunday, September 23, 2007 - 12:11 pm

This is the tg3 patch i used for the testing - against whats in Daves
net-2.6.24 tree. Patch may be a bit hard to read.
For an example of an LLTX version look at the e1000 in the older git
tree at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

If the intel folks will accept the patch i'd really like to kill 
the e1000 LLTX interface.
The tg3 in that tree used the old style batch_xmit() interface.

cheers,
jamal
From: jamal
Date: Monday, September 24, 2007 - 3:54 pm

I have updated the driver howto to match the patches i posted yesterday.
attached. 

cheers,
jamal
From: Randy Dunlap
Date: Tuesday, September 25, 2007 - 1:16 pm

Thanks for sending this.

This is an early draft, right?

I'll fix some typos etc. in it (patch attached) and add some whitespace.
Please see RD: in the patch for more questions/comments.

IMO it needs some changes to eliminate words like "we", "you",
and "your" (words that personify code).  Those words are OK
when talking about yourself.


---
~Randy
Phaedrus says that Quality is about caring.
From: jamal
Date: Tuesday, September 25, 2007 - 3:28 pm

Its a third revision - but you could call it early. When it is done, i


The narrative intent is supposed to be i (or someone doing the
description) sitting there with a pen and paper and maybe a laptop and
walking through the details with someone who needs to understand those
details. If you think it is important to make it formal, then by all
means be my guest.

Again, thanks for taking the time.

cheers,
jamal

-

From: Kok, Auke
Date: Sunday, September 23, 2007 - 12:36 pm

please be reminded that we're going to strip down e1000 and most of the features
should go into e1000e, which has much less hardware workarounds. I'm still
reluctant to putting in new stuff in e1000 - I really want to chop it down first ;)

AUke
-

From: jamal
Date: Sunday, September 23, 2007 - 2:20 pm

sure - the question then is, will you take those changes if i use
e1000e? theres a few cleanups that have nothing to do with batching;
take a look at the modified e1000 on the git tree.

cheers,
jamal

-

From: Kok, Auke
Date: Monday, September 24, 2007 - 12:00 am

that's bad to begin with :) - please send those separately so I can fasttrack them
into e1000e and e1000 where applicable.

But yes, I'm very inclined to merge more features into e1000e than e1000. I intend
to put multiqueue support into e1000e, as *all* of the hardware that it will
support has multiple queues. Putting in any other performance feature like tx
batching would absolutely be interesting.

Auke
-

From: jamal
Date: Monday, September 24, 2007 - 3:38 pm

Ive been CCing you ;-> Most of the changes are readability and

I looked at the e1000e and it is very close to e1000 so i should be able
to move the changes easily. Most importantly, can i kill LLTX?
For tx batching, we have to wait to see how Dave wants to move forward;
i will have the patches but it is not something you need to push until
we see where that is going.

cheers,
jamal

-

From: Kok, Auke
Date: Monday, September 24, 2007 - 3:52 pm

hmm, I though I already removed that, but now I see some remnants from that.

By all means, please send a separate patch for that!

Auke
-

From: Jeff Garzik
Date: Monday, September 24, 2007 - 5:15 pm

If I understood DaveM correctly, it is sounding like we want to 
deprecate all of use LLTX on "real" hardware?  If so, several such 
projects might be considered, as well as possibly simplifying TX 
batching work perhaps.

Also, WRT e1000 specifically, I was hoping to minimize changes, and 
focus people on e1000e.

e1000e replaces (deprecates) large portions of e1000, namely the support 
for the PCI Express modern chips.  When e1000e has proven itself in the 
field, we can potentially look at several e1000 simplifications, during 
the large scale code removal that becomes possible.

	Jeff


-

From: David Miller
Date: Sunday, September 16, 2007 - 6:02 pm

From: jamal <hadi@cyberus.ca>

I do.

And I'm reviewing and applying several hundred patches a day.

What's the point? :-)
-

From: jamal
Date: Sunday, September 16, 2007 - 7:14 pm

Reading the commentary made me think you were about to swallow that with
one more change by the time i wake up;->
I still think this work - despite my vested interest - needs more
scrutiny from a performance perspective.
I tend to send a url to my work, but it may be time to start posting
patches.

cheers,
jamal

-

From: David Miller
Date: Sunday, September 16, 2007 - 7:25 pm

From: jamal <hadi@cyberus.ca>

Absolutely.

There are tertiary issues I'm personally interested in, for example
how well this stuff works when we enable software GSO on a non-TSO
capable card.

In such a case the GSO segment should be split right before we hit the
driver and then all the sub-segments of the original GSO frame batched
in one shot down to the device driver.

In this way you'll get a large chunk of the benefit of TSO without
explicit hardware support for the feature.

There are several cards (some even 10GB) that will benefit immensely
from this.
-

From: jamal
Date: Sunday, September 16, 2007 - 8:01 pm

I think GSO is still useful on top of this.
In my patches anything with gso gets put into the batch list and shot
down the driver. Ive never considered checking whether the nic is TSO
capable, that may be something worth checking into. The netiron allows
you to shove upto 128 skbs utilizing one tx descriptor, which makes for

indeed - ive always wondered if batching this way would make the NICs
behave differently from the way TSO does.

On a side note: My observation is that with large packets on a very busy
system; bulk transfer type app, one approaches wire speed; with or
without batching, the apps are mostly idling (Ive seen upto 90% idle
time polling at the socket level for write to complete with a really
busy system). This is the case with or without batching. cpu seems a
little better with batching. As the aggregation of the apps gets more
aggressive (achievable by reducing their packet sizes), one can achieve
improved throughput and reduced cpu utilization. This all with UDP; i am
still studying tcp. 
 
cheers,
jamal

-

From: David Miller
Date: Sunday, September 16, 2007 - 8:13 pm

From: jamal <hadi@cyberus.ca>

We're talking past each other, but I'm happy to hear that for
sure your code does the right thing :-)

Right now only TSO capable hardware sets the TSO capable bit,
except perhaps for the XEN netfront driver.

What Herbert and I want to do is basically turn on TSO for
devices that can't do it in hardware, and rely upon the GSO
framework to do the segmenting in software right before we
hit the device.

This only makes sense for devices which can 1) scatter-gather
and 2) checksum on transmit.  Otherwise we make too many
copies and/or passes over the data.

And we can only get the full benefit if we can pass all the
sub-segments down to the driver in one ->hard_start_xmit()

UDP apps spraying data tend to naturally batch well and load balance
amongst themselves because each socket fills up to it's socket send
buffer limit, then sleeps, and we then get a stream from the next UDP
socket up to it's limit, and so on and so forth.

UDP is too easy a test case in fact :-)
-

From: jamal
Date: Monday, September 17, 2007 - 5:51 am

If you have knowledge there are enough descriptors in the driver to
cover all skbs you are passing, do you need to have #1? 
Note i dont touch fragments, i am assuming the driver is smart enough to

I didnt understand this last bit - you are still going to go over the
list regardless of whether you call ->hard_start_xmit() once or
multiple times over the same list, no? In the later case i am assuming

I learnt a lot about the behavior out of doing udp (and before that with
pktgen); theres a lot of driver habits that may need to be tuned before
batching becomes really effective - which is easier to see with udp than
with tcp.

cheers,
jamal

-

From: David Miller
Date: Monday, September 17, 2007 - 9:37 am

From: jamal <hadi@cyberus.ca>

Yes, because you can have multiple descriptors per SKB
because we have the head part in skb->data and the rest
in the page vector.

Thus the device must be able to handle multiple descriptors

If the device can't checksum, we have to pass over the data to
compute the checksum and stick it into the headers.

If the device can't scatter-gather, we have to allocate and
copy into a linear buffer.

Otherwise it's just bumping page reference counts and adjusting
offsets, no data touching at all.
-

From: Krishna Kumar2
Date: Sunday, September 16, 2007 - 9:46 pm

[Removing Jeff as requested from thread :) ]

Hi Dave,


I have tried this on ehca which does not support TSO. I added GSO flag at
the ipoib layer (and that resulted in a panic/fix that is mentioned in
this patchset). I will re-run tests for this and submit results.

Thanks,

- KK

-

From: Krishna Kumar2
Date: Sunday, September 16, 2007 - 9:08 pm

Hi Dave,



The intention was to avoid having two flags (one that driver supports
batching and second to indicate that batching is on/off). So I could test
skb_blist as an indication of whether batching is on/off. But your point
on cache miss is absolutely correct, and I will change this part to be
inline.

thanks,

- KK

-

Previous thread: [RESEND] Implement batching skb API and support in IPoIB by Krishna Kumar2 on Friday, September 14, 2007 - 1:52 am. (1 message)

Next thread: [PATCH] Cleanup calling netdev notifiers by Pavel Emelyanov on Friday, September 14, 2007 - 3:34 am. (2 messages)