This set of patches implements the batching xmit capability, and adds support for batching in IPoIB and E1000 (E1000 driver changes is ported, thanks to changes taken from Jamal's code from an old kernel). List of changes from previous revision: ---------------------------------------- 1. [Dave] Enable batching as default (change in register_netdev). 2. [Randy] Update documentation (however ethtool cmd to get/set batching is not implemented, hence I am guessing the usage). 3. [KK] When changing tx_batch_skb, qdisc xmits need to be blocked since qdisc_restart() drops queue_lock before calling driver xmit, and driver could find blist change under it. 4. [KK] sched: requeue could wrongly requeue skb already put in the batching list (in case a single skb was sent to the device but not sent as the device was full, resulting in the skb getting added to blist). This also results in slight optimization of batching behavior where for getting skbs #2 onwards don't require to check for gso_skb as that is the first skb that is processed. 4. [KK] Change documentation to explain this behavior. 5. [KK] sched: Fix panic when GSO is enabled in driver. 6. [KK] IPoIB: Small optimization in ipoib_ib_handle_tx_wc 7. [KK] netdevice: Needed to change NETIF_F_GSO_SHIFT/NETIF_F_GSO_MASK as BATCH_SKBS is now defined as 65536 (earlier it was using 8192 which was taken up by NETIF_F_NETNS_LOCAL). Will submit in the next 1-2 days: --------------------------------- 1. [Auke] Enable batching in e1000e. Extras that I can do later: --------------------------- 1. [Patrick] Use skb_blist statically in netdevice. This could also be used to integrate GSO and batching. 2. [Evgeniy] Useful to splice lists dev_add_skb_to_blist (and this can be done for regular xmit's of GSO skbs too for #1 above). Patches are described as: Mail 0/10: This mail Mail 1/10: HOWTO documentation Mail 2/10: Introduce skb_blist, NETIF_F_BATCH_SKBS, use single API for batching/no-batching, ...
Add Documentation describing batching skb xmit capability. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> --- batching_skb_xmit.txt | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 107 insertions(+) diff -ruNp org/Documentation/networking/batching_skb_xmit.txt new/Documentation/networking/batching_skb_xmit.txt --- org/Documentation/networking/batching_skb_xmit.txt 1970-01-01 05:30:00.000000000 +0530 +++ new/Documentation/networking/batching_skb_xmit.txt 2007-09-14 10:25:36.000000000 +0530 @@ -0,0 +1,107 @@ + HOWTO for batching skb xmit support + ----------------------------------- + +Section 1: What is batching skb xmit +Section 2: How batching xmit works vs the regular xmit +Section 3: How drivers can support batching +Section 4: Nitty gritty details for drivers +Section 5: How users can work with batching + + +Introduction: Kernel support for batching skb +---------------------------------------------- + +A new capability to support xmit of multiple skbs is provided in the netdevice +layer. Drivers which enable this capability should be able to process multiple +skbs in a single call to their xmit handler. + + +Section 1: What is batching skb xmit +------------------------------------- + + This capability is optionally enabled by a driver by setting the + NETIF_F_BATCH_SKBS bit in dev->features. The prerequisite for a + driver to use this capability is that it should have a reasonably- + sized hardware queue that can process multiple skbs. + + +Section 2: How batching xmit works vs the regular xmit +------------------------------------------------------- + + The network stack gets called from upper layer protocols with a single + skb to transmit. This skb is first enqueued and an attempt is made to + transmit it immediately (via qdisc_run). However, events like tx lock + contention, tx queue stopped, etc., can result in the skb not getting + sent out and it remains in the queue. When the next xmit is called or + when the ...
and then Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Thanks, --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** -
Hi Randy, Since this was a new section that I added to the documentation, this error creeped up. Thanks for catching it, and review comments/ack-off :) thanks, - KK -
Please remove me from the CC list. I get this via netdev, and not having said a single thing in the thread, I don't feel the need to be CC'd on every email. The CC list is pretty massive as it is, anyway. Jeff -
Introduce skb_blist, NETIF_F_BATCH_SKBS, use single API for
batching/no-batching, etc.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/linux/netdevice.h | 8 ++++++--
net/core/dev.c | 29 ++++++++++++++++++++++++++---
2 files changed, 32 insertions(+), 5 deletions(-)
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h 2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h 2007-09-14 10:26:21.000000000 +0530
@@ -439,10 +439,11 @@ struct net_device
#define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */
#define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */
#define NETIF_F_LRO 32768 /* large receive offload */
+#define NETIF_F_BATCH_SKBS 65536 /* Driver supports multiple skbs/xmit */
/* Segmentation offload features */
-#define NETIF_F_GSO_SHIFT 16
-#define NETIF_F_GSO_MASK 0xffff0000
+#define NETIF_F_GSO_SHIFT 17
+#define NETIF_F_GSO_MASK 0xfffe0000
#define NETIF_F_TSO (SKB_GSO_TCPV4 << NETIF_F_GSO_SHIFT)
#define NETIF_F_UFO (SKB_GSO_UDP << NETIF_F_GSO_SHIFT)
#define NETIF_F_GSO_ROBUST (SKB_GSO_DODGY << NETIF_F_GSO_SHIFT)
@@ -548,6 +549,9 @@ struct net_device
/* Partially transmitted GSO packet. */
struct sk_buff *gso_skb;
+ /* List of batch skbs (optional, used if driver supports skb batching */
+ struct sk_buff_head *skb_blist;
+
/* ingress path synchronizer */
spinlock_t ingress_lock;
struct Qdisc *qdisc_ingress;
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c 2007-09-14 10:24:27.000000000 +0530
+++ new/net/core/dev.c 2007-09-14 10:25:36.000000000 +0530
@@ -953,6 +953,16 @@ void netdev_state_change(struct net_devi
}
}
+static void free_batching(struct net_device *dev)
+{
+ if (dev->skb_blist) {
+ if (!skb_queue_empty(dev->skb_blist))
+ skb_queue_purge(dev->skb_blist);
+ kfree(dev->skb_blist);
+ dev->skb_blist = NULL;
+ }
+}
+
/**
* dev_load - load a network ...A nitpick is that you should use sizeof(struct ...) and I think it -- Evgeniy Polyakov -
Hi Evgeniy, I thought it is better to use *var name in case the name of the structure changes. Also, the flag is not cleared since I could try to enable batching later, and it could succeed at that time. When skb_blist is allocated, then batching is enabled otherwise it is disabled (while features flag just indicates that driver supports batching). Thanks, - KK -
Modify qdisc_run() to support batching. Modify callers of qdisc_run to
use batching, modify qdisc_restart to implement batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/linux/netdevice.h | 2
include/net/pkt_sched.h | 17 +++++--
net/core/dev.c | 45 ++++++++++++++++++
net/sched/sch_generic.c | 109 ++++++++++++++++++++++++++++++++++++----------
4 files changed, 145 insertions(+), 28 deletions(-)
diff -ruNp org/include/net/pkt_sched.h new/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h 2007-09-13 09:11:09.000000000 +0530
+++ new/include/net/pkt_sched.h 2007-09-14 10:25:36.000000000 +0530
@@ -80,13 +80,24 @@ extern struct qdisc_rate_table *qdisc_ge
struct rtattr *tab);
extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
-extern void __qdisc_run(struct net_device *dev);
+static inline void qdisc_block(struct net_device *dev)
+{
+ while (test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
+ yield();
+}
+
+static inline void qdisc_unblock(struct net_device *dev)
+{
+ clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
+}
+
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
{
if (!netif_queue_stopped(dev) &&
!test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
- __qdisc_run(dev);
+ __qdisc_run(dev, blist);
}
extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h 2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h 2007-09-14 10:26:21.000000000 +0530
@@ -1013,6 +1013,8 @@ extern int dev_set_mac_address(struct n
struct sockaddr *);
extern int dev_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev);
+extern int dev_add_skb_to_blist(struct ...Hi Krishna. Could it be list_move()-like function for skb lists? I'm pretty sure if you change first and the last skbs and ke of the queue in one shot, result will be the same. -- Evgeniy Polyakov -
Hi Evgeniy, I have to do a bit more like update count, etc, but I agree it is do-able. I had mentioned in my PATCH 0/10 that I will later try this suggestion It depends, eg when the tx lock is not got, I get batching of upto 8-10 skbs (assuming that tx lock was not got quite a few times). But when the queue gets blocked, I have seen batching upto 4K skbs (if tx_queue_len This is the gso code which has segmented 'skb' to skb'1-n', and those skb'1-n' are sent out and freed by driver, which means the dummy 'skb' (without any data) remains to be freed. Thanks, - KK -
Add ethtool support to enable/disable batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
include/linux/ethtool.h | 2 ++
include/linux/netdevice.h | 2 ++
net/core/dev.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
net/core/ethtool.c | 27 +++++++++++++++++++++++++++
4 files changed, 75 insertions(+)
diff -ruNp org/include/linux/ethtool.h new/include/linux/ethtool.h
--- org/include/linux/ethtool.h 2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/ethtool.h 2007-09-14 10:25:36.000000000 +0530
@@ -440,6 +440,8 @@ struct ethtool_ops {
#define ETHTOOL_SFLAGS 0x00000026 /* Set flags bitmap(ethtool_value) */
#define ETHTOOL_GPFLAGS 0x00000027 /* Get driver-private flags bitmap */
#define ETHTOOL_SPFLAGS 0x00000028 /* Set driver-private flags bitmap */
+#define ETHTOOL_GBATCH 0x00000029 /* Get Batching (ethtool_value) */
+#define ETHTOOL_SBATCH 0x00000030 /* Set Batching (ethtool_value) */
/* compatibility with older code */
#define SPARC_ETH_GSET ETHTOOL_GSET
diff -ruNp org/include/linux/netdevice.h new/include/linux/netdevice.h
--- org/include/linux/netdevice.h 2007-09-13 09:11:09.000000000 +0530
+++ new/include/linux/netdevice.h 2007-09-14 10:26:21.000000000 +0530
@@ -1331,6 +1331,8 @@ extern void dev_set_promiscuity(struct
extern void dev_set_allmulti(struct net_device *dev, int inc);
extern void netdev_state_change(struct net_device *dev);
extern void netdev_features_change(struct net_device *dev);
+extern int dev_change_tx_batch_skb(struct net_device *dev,
+ unsigned long new_batch_skb);
/* Load a device via the kmod */
extern void dev_load(struct net *net, const char *name);
extern void dev_mcast_init(void);
diff -ruNp org/net/core/dev.c new/net/core/dev.c
--- org/net/core/dev.c 2007-09-14 10:24:27.000000000 +0530
+++ new/net/core/dev.c 2007-09-14 10:25:36.000000000 +0530
@@ -963,6 +963,50 @@ void free_batching(struct net_dev
}
}
+int dev_change_tx_batch_skb(struct ...IPoIB header file changes to use batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib.h | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h new/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib.h 2007-09-14 10:25:36.000000000 +0530
@@ -271,8 +271,8 @@ struct ipoib_dev_priv {
struct ipoib_tx_buf *tx_ring;
unsigned tx_head;
unsigned tx_tail;
- struct ib_sge tx_sge;
- struct ib_send_wr tx_wr;
+ struct ib_sge *tx_sge;
+ struct ib_send_wr *tx_wr;
struct ib_wc ibwc[IPOIB_NUM_WC];
@@ -367,8 +367,11 @@ static inline void ipoib_put_ah(struct i
int ipoib_open(struct net_device *dev);
int ipoib_add_pkey_attr(struct net_device *dev);
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_dev_priv *priv, struct ipoib_ah *address,
+ u32 qpn, int wr_num);
void ipoib_send(struct net_device *dev, struct sk_buff *skb,
- struct ipoib_ah *address, u32 qpn);
+ struct ipoib_ah *address, u32 qpn, int num_skbs);
void ipoib_reap_ah(struct work_struct *work);
void ipoib_flush_paths(struct net_device *dev);
-
IPoIB CM & Multicast changes based on header file changes.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_cm.c | 13 +++++++++----
ipoib_multicast.c | 4 ++--
2 files changed, 11 insertions(+), 6 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c new/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-09-14 10:25:36.000000000 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
unsigned int wr_id,
u64 addr, int len)
{
+ int ret;
struct ib_send_wr *bad_wr;
- priv->tx_sge.addr = addr;
- priv->tx_sge.length = len;
+ priv->tx_sge[0].addr = addr;
+ priv->tx_sge[0].length = len;
+
+ priv->tx_wr[0].wr_id = wr_id;
- priv->tx_wr.wr_id = wr_id;
+ priv->tx_wr[0].next = NULL;
+ ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+ priv->tx_wr[0].next = &priv->tx_wr[1];
- return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+ return ret;
}
void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-09-14 10:25:36.000000000 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
sizeof (union ib_gid))) {
priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
- priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+ priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
}
if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
}
}
- ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+ ipoib_send(dev, skb, ...IPoIB verb changes to use batching.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_verbs.c | 23 ++++++++++++++---------
1 files changed, 14 insertions(+), 9 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-09-14 10:25:36.000000000 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
.max_send_sge = 1,
.max_recv_sge = 1
},
- .sq_sig_type = IB_SIGNAL_ALL_WR,
+ .sq_sig_type = IB_SIGNAL_REQ_WR, /* 11.2.4.1 */
.qp_type = IB_QPT_UD
};
-
- int ret, size;
+ struct ib_send_wr *next_wr = NULL;
+ int i, ret, size;
priv->pd = ib_alloc_pd(priv->ca);
if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
priv->dev->dev_addr[2] = (priv->qp->qp_num >> 8) & 0xff;
priv->dev->dev_addr[3] = (priv->qp->qp_num ) & 0xff;
- priv->tx_sge.lkey = priv->mr->lkey;
-
- priv->tx_wr.opcode = IB_WR_SEND;
- priv->tx_wr.sg_list = &priv->tx_sge;
- priv->tx_wr.num_sge = 1;
- priv->tx_wr.send_flags = IB_SEND_SIGNALED;
+ for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+ priv->tx_sge[i].lkey = priv->mr->lkey;
+ priv->tx_wr[i].opcode = IB_WR_SEND;
+ priv->tx_wr[i].sg_list = &priv->tx_sge[i];
+ priv->tx_wr[i].num_sge = 1;
+ priv->tx_wr[i].send_flags = 0;
+
+ /* Link the list properly for provider to use */
+ priv->tx_wr[i].next = next_wr;
+ next_wr = &priv->tx_wr[i];
+ }
return 0;
-
IPoIB internal post and work completion handler changes.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_ib.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 168 insertions(+), 44 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c new/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-09-14 10:25:36.000000000 +0530
@@ -242,6 +242,8 @@ repost:
static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
+ int num_completions, to_process;
+ unsigned int tx_ring_index;
unsigned int wr_id = wc->wr_id;
struct ipoib_tx_buf *tx_req;
unsigned long flags;
@@ -255,18 +257,51 @@ static void ipoib_ib_handle_tx_wc(struct
return;
}
- tx_req = &priv->tx_ring[wr_id];
+ /* Get first WC to process (no one can update tx_tail at this time) */
+ tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
- ib_dma_unmap_single(priv->ca, tx_req->mapping,
- tx_req->skb->len, DMA_TO_DEVICE);
+ /* Find number of WC's to process */
+ num_completions = wr_id - tx_ring_index + 1;
+ if (unlikely(num_completions <= 0))
+ num_completions += ipoib_sendq_size;
+ to_process = num_completions;
- ++priv->stats.tx_packets;
- priv->stats.tx_bytes += tx_req->skb->len;
+ /*
+ * Handle WC's from earlier (possibly multiple) post_sends in this
+ * iteration as we move from tx_tail to wr_id, since if the last WR
+ * (which is the one which requested completion notification) failed
+ * to be sent for any of those earlier request(s), no completion
+ * notification is generated for successful WR's of those earlier
+ * request(s). Use a infinite loop to handle the regular case of
+ * one skb processing faster.
+ */
+ tx_req = &priv->tx_ring[tx_ring_index];
+ while (1) {
+ if (likely(tx_req->skb)) ...IPoIB: implement the new batching API.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
ipoib_main.c | 248 +++++++++++++++++++++++++++++++++++++++--------------------
1 files changed, 168 insertions(+), 80 deletions(-)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c new/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-13 09:10:58.000000000 +0530
+++ new/drivers/infiniband/ulp/ipoib/ipoib_main.c 2007-09-14 10:25:36.000000000 +0530
@@ -563,7 +563,8 @@ static void neigh_add_path(struct sk_buf
goto err_drop;
}
} else
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+ ipoib_send(dev, skb, path->ah,
+ IPOIB_QPN(skb->dst->neighbour->ha), 1);
} else {
neigh->ah = NULL;
@@ -643,7 +644,7 @@ static void unicast_arp_send(struct sk_b
ipoib_dbg(priv, "Send unicast ARP to %04x\n",
be16_to_cpu(path->pathrec.dlid));
- ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+ ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
} else if ((path->query || !path_rec_start(dev, path)) &&
skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
/* put pseudoheader back on for next time */
@@ -657,105 +658,163 @@ static void unicast_arp_send(struct sk_b
spin_unlock(&priv->lock);
}
+#define XMIT_PROCESSED_SKBS() \
+ do { \
+ if (wr_num) { \
+ ipoib_send(dev, NULL, old_neigh->ah, old_qpn, \
+ wr_num); \
+ wr_num = 0; \
+ } \
+ } while (0)
+
static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
- struct ipoib_neigh *neigh;
+ struct sk_buff_head *blist;
+ int max_skbs, wr_num = 0;
+ u32 qpn, old_qpn = 0;
+ struct ipoib_neigh *neigh, *old_neigh = NULL;
unsigned long flags;
if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
return NETDEV_TX_LOCKED;
- /*
- * Check if our queue is stopped. ...E1000: Implement batching capability (ported thanks to changes taken from
Jamal).
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
e1000_main.c | 104 ++++++++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 75 insertions(+), 29 deletions(-)
diff -ruNp org/drivers/net/e1000/e1000_main.c new/drivers/net/e1000/e1000_main.c
--- org/drivers/net/e1000/e1000_main.c 2007-09-14 10:30:57.000000000 +0530
+++ new/drivers/net/e1000/e1000_main.c 2007-09-14 10:31:02.000000000 +0530
@@ -990,7 +990,7 @@ e1000_probe(struct pci_dev *pdev,
if (pci_using_dac)
netdev->features |= NETIF_F_HIGHDMA;
- netdev->features |= NETIF_F_LLTX;
+ netdev->features |= NETIF_F_LLTX | NETIF_F_BATCH_SKBS;
adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw);
@@ -3092,6 +3092,17 @@ e1000_tx_map(struct e1000_adapter *adapt
return count;
}
+static void e1000_kick_DMA(struct e1000_adapter *adapter,
+ struct e1000_tx_ring *tx_ring, int i)
+{
+ wmb();
+
+ writel(i, adapter->hw.hw_addr + tx_ring->tdt);
+ /* we need this if more than one processor can write to our tail
+ * at a time, it syncronizes IO on IA64/Altix systems */
+ mmiowb();
+}
+
static void
e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
int tx_flags, int count)
@@ -3138,13 +3149,7 @@ e1000_tx_queue(struct e1000_adapter *ada
* know there are new descriptors to fetch. (Only
* applicable for weak-ordered memory model archs,
* such as IA-64). */
- wmb();
-
tx_ring->next_to_use = i;
- writel(i, adapter->hw.hw_addr + tx_ring->tdt);
- /* we need this if more than one processor can write to our tail
- * at a time, it syncronizes IO on IA64/Altix systems */
- mmiowb();
}
/**
@@ -3251,22 +3256,23 @@ static int e1000_maybe_stop_tx(struct ne
}
#define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
+
+#define NETDEV_TX_DROPPED -5
+
static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device ...This changes could actually go as own patch, although not sure it is Why is it put at the end? -- Evgeniy Polyakov -
Hi Evgeniy, Since this flag is new and useful only for batching, I feel it is OK to There is a bug that I had explained in rev4 (see XXX below) resulting in sending out skbs out of order. The fix is that if the driver gets called with a skb but there are older skbs already in the batch list (which failed to get sent out), send those skbs first before this one. Thanks, - KK [XXX] Dave had suggested to use batching only in the net_tx_action case. When I implemented that in earlier revisions, there were lots of TCP retransmissions (about 18,000 to every 1 in regular code). I found the reason for part of that problem as: skbs get queue'd up in dev->qdisc (when tx lock was not got or queue blocked); when net_tx_action is called later, it passes the batch list as argument to qdisc_run and this results in skbs being moved to the batch list; then batching xmit also fails due to tx lock failure; the next many regular xmit of a single skb will go through the fast path (pass NULL batch list to qdisc_run) and send those skbs out to the device while previous skbs are cooling their heels in the batch list. The first fix was to not pass NULL/batch-list to qdisc_run() but to always check whether skbs are present in batch list when trying to xmit. This reduced retransmissions by a third (from 18,000 to around 12,000), but led to another problem while testing - iperf transmit almost zero data for higher # of parallel flows like 64 or more (and when I run iperf for a 2 min run, it takes about 5-6 mins, and reports that it ran 0 secs and the amount of data transfered is a few MB's). I don't know why this happens with this being the only change (any ideas is very appreciated). The second fix that resolved this was to revert back to Dave's suggestion to use batching only in net_tx_action case, and modify the driver to see if skbs are present in batch list and to send them out first before sending the current skb. I still see huge retransmission for IPoIB (but not for ...
this doesn't apply anymore and it would help if you could re-spin this for e1000e. I don't know what the status for merging of the batched xmit patches is right now but it would help if you could rewrite them against e1000e, which I assume is what most people want to test with. There are also significant changes upstream right now in jgarzik/netdev-2.6 #upstream... I'm still very interested in these patches BTW. -
Hi Auke, I will make a latest version and test it out for some numbers and try to send it this week. Thanks, - KK -
Hi Krishna. And what about latency for this patchset? -- Evgeniy Polyakov -
From: Krishna Kumar <krkumar2@in.ibm.com> The only major complaint I have about this patch series is that the IPoIB part should just be one big changeset. Otherwise the tree is not bisectable, for example the initial ipoib header file change breaks the build. The tree must compile and work properly after every single patch. On a lower priority, I question the indirection of skb_blist by making it a pointer. For what? Saving 12 bytes on 64-bit? That kmalloc()'d thing is a nearly guarenteed cache and/or TLB miss. Just inline the thing, we generally don't do crap like this anywhere else. -
Dave, you do realize that i have been investing my time working on batching as well, right? cheers, jamal -
I had plenty of time this weekend so i have been doing a _lot_ of
testing. My next emails will send a set of patches:
Patch 1: Introduces explicit tx locking
Patch 2: Introduces batching interface
Patch 3: Core uses batching interface
Patch 4: get rid of dev->gso_skb
Testing
-------
Each of these patches has been performance tested and the results
are in the logs on a per-patch basis.
My system under test hardware is a 2xdual core opteron with a couple of
tg3s.
My test tool generates udp traffic of different sizes for upto 60
seconds per run or a total of 30M packets. I have 4 threads each
running on a specific CPU which keep all the CPUs as busy as they can
sending packets targetted at a directly connected box's udp discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule
which counts and drops all incoming udp packets to discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each
packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.
I have not run testing on patch #4 because i had to let the machine
go, but will have some access to it tommorow early morning where i can
run some tests.
Comments
--------
Iam trying to kill ->hard_batch_xmit() but it would be tricky to do
without it for LLTX drivers. Anything i try will require a few extra
checks. OTOH, I could kill LLTX for the drivers i am using that
are LLTX and then drop that interface or I could say "no support
for LLTX". I am in a dilema.
Dave please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM benefit just in case i
misunderstood.
Herbert, if you can look at at least patch 4 i will appreaciate it.
More patches to follow - i didnt want to overload people by dumping
too many patches. Most of these patches below are ready to go; some ...I have submitted this before; but here it is again. Against net-2.6.24 from yesterday for this and all following patches. cheers, jamal
This patch introduces the netdevice interface for batching. cheers, jamal
This patch adds the usage of batching within the core. cheers, jamal
This patch removes dev->gso_skb as it is no longer necessary with batching code. cheers, jamal
This patch removes dev->gso_skb as it is no longer necessary with batching code. cheers, jamal
This patch removes dev->gso_skb as it is no longer necessary with batching code. cheers, jamal
This patch adds the usage of batching within the core. cheers, jamal
Have you done performance comparisons for the case of using 9000-byte jumbo frames? -Bill -
I havent, but will try if any of the gige cards i have support it. As a side note: I have not seen any useful gains or losses as the packet size approaches even 1500B MTU. For example, post about 256B neither the batching nor the non-batching give much difference in either throughput or cpu use. Below 256B, theres a noticeable gain for batching. Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and so the occupancy of both the qdisc queue(s) and ethernet ring is constantly high. For example at 512B, the app is 80% idle on all 4 CPUs and we are hitting in the range of wire speed. We are at 90% idle at 1024B. This is the case with or without batching. So my suspicion is that with that trend a 9000B packet will just follow the same pattern. cheers, jamal -
One reason I ask, is that on an earlier set of alternative batching xmit patches by Krishna Kumar, his performance testing showed a 30 % performance hit for TCP for a single process and a size of 4 KB, and a performance hit of 5 % for a single process and a size of 16 KB (a size of 8 KB wasn't tested). Unfortunately I was too busy at the time to inquire further about it, but it would be a major potential concern for me in my 10-GigE network testing with 9000-byte jumbo frames. Of course the single process and 4 KB or larger size was the only case that showed a significant performance hit in Krishna Kumar's latest reported test results, so it might be acceptable to just have a switch to disable the batching feature for that specific usage scenario. So it would be useful to know if your xmit batching changes would have similar issues. Also for your xmit batching changes, I think it would be good to see performance comparisons for TCP and IP forwarding in addition to your UDP pktgen tests, including various packet sizes up to and including 9000-byte jumbo frames. -Bill -
There were many times while testing that i noticed inconsistencies and in each case when i analysed[1], i found it to be due to some variable other than batching which needed some resolving, always via some parametrization or other. I suspect what KK posted is in the same class. To give you an example, with UDP, batching was giving worse results at around 256B compared to 64B or 512B; investigating i found that the receiver just wasnt able to keep up and the udp layer dropped a lot of packets so both iperf and netperf reported bad numbers. Fixing the receiver ended up with consistency coming back. On why 256B was the one that overwhelmed the receiver more than 64B(which sent more pps)? On some limited investigation, it seemed to me to be the effect of the choice of the tg3 driver's default tx mitigation parameters as well tx ring size; which is something i plan to revisit (but neutralizing it helps me focus on just batching). In the end i dropped both netperf and iperf for similar reasons and wrote my own app. What i am trying to achieve is demonstrate if batching is a GoodThing. In experimentation like this, it is extremely valuable to reduce the variables. Batching may expose other orthogonal issues - those need to be resolved or fixed as they are found. I hope that sounds sensible. Back to the >=9K packet size you raise above: I dont have a 10Gige card so iam theorizing. Given that theres an observed benefit to batching for a saturated link with "smaller" packets (in my results "small" is anything below 256B which maps to about 380Kpps anything above that seems to approach wire speed and the link is the bottleneck); then i theorize that 10Gige with 9K jumbo frames if already achieving wire rate, should continue to do so. And sizes below that will see improvements if they were not already hitting wire rate. So i would say that with 10G NICS, there will be more observed improvements with batching with apps that do bulk transfers (assuming those apps are not seeing wire ...
It does sound sensible. My own decidedly non-expert speculation was that the big 30 % performance hit right at 4 KB may be related to memory allocation issues or having to split the skb across multiple 4 KB pages. And perhaps it only affected the single process case because with multiple processes lock contention may be a bigger issue and the xmit batching changes would presumably help with that. I am admittedly a novice when it comes to the detailed internals of TCP/skb processing, although I have been slowly slogging my way through parts of the TCP kernel code to try and get a better understanding, so I don't know if these thoughts have any merit. BTW does anyone know of a good book they would recommend that has substantial coverage of the Linux kernel TCP code, that's fairly up-to-date and gives both an overall view of the code and packet flow as well as details on individual functions and algorithms, and hopefully covers basic issues like locking and synchronization, concurrency of different parts of the stack, and memory allocation. I have several books already on Linux kernel and networking internals, but they seem to only cover the IP (and perhaps UDP) portions of the network stack, and none have more than a cursory reference to TCP. The most useful documentation on the Linux TCP stack that I have found thus far is some of Dave Miller's excellent web pages and a few other web references, but overall it seems fairly skimpy It would be good to see some empirical evidence that there aren't any unforeseen gotchas for larger packet sizes, that at least the same level of performance can be obtained with no greater CPU As you have written previously, and I heartily agree with, this is a very good practice for developing performance enhancement patches. -Thanks -Bill -
plausible. But i also worry it could be 10 other things; example, could it be the driver used? I noted in my udp test the oddity that turned out to be tx coal parameter related. You do bring up issues that need to be looked into and i will run those tests. Note, the effectiveness of batching becomes evident as the number of flows grows. Actually, scratch that: It becomes evident if you can keep the tx path busyed out to which multiple users running contribute. If i can have a user per CPU with lots of traffic to send, i can create that condition. It's a little boring in the scenario where the bottleneck is Reading books or magazines may end up busying you out with some small gains of knowledge at the end. They tend to be outdated fast. My advice is if you start with a focus on one thing, watch the patches that fly around on that area and learn that way. Read the code to further understand things then ask questions when its not clear. Other folks may have different views. The other way to do it is pick yourself some task Reasonable - I will try with 9K after i move over to the new tree from Dave and make sure nothing else broke in the previous tests. To give you a perspective, the results i posted were each run 10 iterations per packet size per kernel. Each run is 60 seconds long. I think i am past that stage for resolving or fixing anything for UDP or pktgen, but i need to keep checking for any new regressions when Dave updates his tree. Now multiply that by 5 packet sizes (I am going to add 2 more) and multiply that by 3-4 kernels. Then add the time it takes to sift through the data and collect it then analyze it and go back to the drawing table when something doesnt look right. Essentially, it needs a weekend ;-> cheers, jamal -
->requeue queues at the head, so this looks like it would reverse the order of the skbs. -
Excellent catch! thanks; i will fix. As a side note: Any batching driver should _never_ have to requeue; if it does it is buggy. And the non-batching ones if they ever requeue will be a single packet, so not much reordering. Thanks again Patrick. cheers, jamal -
On the contrary, batching LLTX drivers (if that is not ruled out) will very often requeue resulting in heavy reordering. Fix looks good though. - KK -
Two things: one, LLTX is deprecated (I think i saw a patch which says no more new drivers should do LLTX) and i plan if nobody else does to kill LLTX in e1000 RSN. So for that reason i removed all code that existed to support LLTX. two, there should _never_ be any requeueing even if LLTX in the previous patches when i supported them; if there is, it is a bug. This is because we dont send more than what the driver asked for via xmit_win. So if it asked for more than it can handle, that is a bug. If its available space changes while we are sending to it, that too is a bug. cheers, jamal -
Driver might ask for 10 and we send 10, but LLTX driver might fail to get lock and return TX_LOCKED. I haven't seen your code in greater detail, but don't you requeue in that case too? - KK -
For others drivers that are non-batching and LLTX, it is possible - at the moment in my patch i whine that the driver is buggy. I will fix this up so it checks for NETIF_F_BTX. Thanks for pointing the above use case. cheers, jamal -
This patch adds the usage of batching within the core. cheers, jamal
This patch introduces the netdevice interface for batching. cheers, jamal
This patch introduces the netdevice interface for batching. cheers, jamal
Hi Jamal, If you don't mind, I am trying to run your approach vs mine to get some results for comparison. For starters, I am having issues with iperf when using your infrastructure code with my IPoIB driver - about 100MB is sent and then everything stops for some reason. The changes in the IPoIB driver that I made to support batching is to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and increase xmit_win on every xmit completion. Is there anything else that is required from the driver? thanks, - KK -
Please provide an analysis when you get the results. IOW, explain why I havent tested with iperf in a while. Can you post the netstat on both sides when the driver stops? --- This variable should be set during xmit path shutdown(netif_stop), wakeup(netif_wake) and ->hard_end_xmit(). In the case of the first one the value is set to 1 and in the other two it is set to whatever the driver deems to be available space on the ring. Your driver needs to also support wake thresholding. I will post the driver howto later today. cheers, jamal -
Hi Jamal, I've been (slowly) working on resurrecting the original design of my multiqueue patches to address this exact issue of the queue_lock being a hot item. I added a queue_lock to each queue in the subqueue struct, and in the enqueue and dequeue, just lock that queue instead of the global device queue_lock. The only two issues to overcome are the QDISC_RUNNING state flag, since that also serializes entry into the qdisc_restart() function, and the qdisc statistics maintenance, which needs to be serialized. Do you think this work along with your patch will benefit from one another? I apologize for not having working patches right now, but I am working on them slowly as I have some blips of spare time. Thanks, -PJ Waskiewicz -
The one thing that seems obvious is to use dev->hard_prep_xmit() in the patches i posted to select the xmit ring. You should be able to do figure out the txmit ring without holding any lock. I lost track of how/where things went since the last discussion; so i need to wrap my mind around it to make sensisble suggestions - I know the core patches are in the kernel but havent paid attention to details and if you look at my second patch youd see a comment in dev_batch_xmit() which says i need to scrutinize multiqueue more. cheers, jamal -
I've looked at that as a candidate to use. The lock for enqueue would be needed when actually placing the skb into the appropriate software No worries. I'll try to get things together on my end and provide some patches to add a per-queue lock. In the meantime, I'll take a much closer look at the batching code, since I've stopped looking at the patches in-depth about a month ago. :-( Thanks, -PJ Waskiewicz -
The enqueue is easy to comprehend. The single device queue lock should suffice. The dequeue is interesting: Maybe you can point me to some doc or describe to me the dequeue aspect; are you planning to have an array of txlocks per, one per ring? How is the policy to define the qdisc queues locked/mapped to tx rings? cheers, jamal -
From: jamal <hadi@cyberus.ca> For these high performance 10Gbit cards it's a load balancing function, really, as all of the transmit queues go out to the same physical port so you could: 1) Load balance on CPU number. 2) Load balance on "flow" 3) Load balance on destination MAC etc. etc. etc. It's something that really sits logically between the qdisc and the card, not something that is a qdisc thing. In some ways it's similar to bonding, but using anything similar to bonding's infrastructure (stacking devices) is way overkill for this. And then we have the virtualization network devices where the queue selection has to be made precisely, in order for the packet to reach the proper destination, rather than a performance improvement. It is also a situation where the TX queue selection is something to be made between qdisc activity and hitting the device. I think we will initially have to live with taking the centralized qdisc lock for the device, get in and out of that as fast as possible, then only take the TX queue lock of the queue selected. After we get things that far we can try to find some clever lockless algorithm for handling the qdisc to get rid of that hot spot. These queue selection schemes want a common piece of generic code. A set of load balancing algorithms, a "select TX queue by MAC with a default fallback on no match" for virtualization, and interfaces for both drivers and userspace to change the queue selection scheme. -
The brain-block i am having is the parallelization aspect of it. Whatever scheme it is - it needs to ensure the scheduler works as expected. For example, if it was a strict prio scheduler i would expect that whatever goes out is always high priority first and never ever allow a low prio packet out at any time theres something high prio needing to go out. If i have the two priorities running on two cpus, then i cant guarantee that effect. IOW, i see the scheduler/qdisc level as not being split across parallel cpus. Do i make any sense? The rest of my understanding hinges on the above, so let me stop here. cheers, jamal -
Any chance the NIC hardware could provide that guarantee? 8139cp, for example, has two TX DMA rings, with hardcoded characteristics: one is a high prio q, and one a low prio q. The logic is pretty simple: empty the high prio q first (potentially starving low prio q, in worst case). In terms of overall parallelization, both for TX as well as RX, my gut feeling is that we want to move towards an MSI-X, multi-core friendly model where packets are LIKELY to be sent and received by the same set of [cpus | cores | packages | nodes] that the [userland] processes dealing with the data. There are already some primitive NUMA bits in skbuff allocation, but with modern MSI-X and RX/TX flow hashing we could do a whole lot more, along the lines of better CPU scheduling decisions, directing flows to clusters of cpus, and generally doing a better job of maximizing cache efficiency in a modern multi-thread environment. IMO the current model where each NIC's TX completion and RX processes are both locked to the same CPU is outmoded in a multi-core world with modern NICs. :) But I readily admit general ignorance about the kernel process scheduling stuff, so my only idea about a starting point was to see how far to go with the concept of "skb affinity" -- a mask in sk_buff that is a hint about which cpu(s) on which the NIC should attempt to send and receive packets. When going through bonding or netfilter, it is trivial to 'or' together affinity masks. All the various layers of net stack should attempt to honor the skb affinity, where feasible (requires interaction with CFS scheduler?). Or maybe skb affinity is a dumb idea. I wanted to get people thinking on the bigger picture. Parallelization starts at the user process. Jeff -
If you can get the scheduling/dequeuing to run on one CPU (as we do today) it should work; alternatively you can totaly bypass the qdisc subystem and go direct to the hardware for devices that are capable and that would work but would require huge changes. My fear is there's a mini-scheduler pieces running on multi cpus which sounds like strict prio scheduling to me which says "if low prio starves Does putting things in the same core help? But overall i agree with your I think i see the receive with a lot of clarity, i am still foggy on the Infact even with status quo theres a case that can be made to not bind to interupts. In my recent experience with batching, due to the nature of my test app, if i let the interupts float across multiple cpus i benefit. My app runs/binds a thread per CPU and so benefits from having more juice to send more packets per unit of time - something i wouldnt get if i was always running on one cpu. But when i do this i found that just because i have bound a thread to cpu3 doesnt mean that thread will always run on cpu3. If netif_wakeup happens on cpu1, scheduler will put the thread on cpu1 if it is to be There would be cache benefits if you can free the packet on the same cpu it was allocated; so the idea of skb affinity is useful in the minimal in that sense if you can pull it. Assuming hardware is capable, even if you just tagged it on xmit to say which cpu it was sent out on, and made sure thats where it is freed, that would be a good start. Note: The majority of the packet processing overhead is _still_ the memory subsystem latency; in my tests with batched pktgen improving the xmit subsystem meant the overhead on allocing and freeing the packets went to something > 80%. So something along the lines of parallelizing based on a split of alloc free of sksb IMO on more cpus than where xmit/receive run would see cheers, jamal -
From: Jeff Garzik <jeff@garzik.org> The problem is that the packet schedulers want global guarantees on packet ordering, not flow centric ones. That is the issue Jamal is concerned about. The more I think about it, the more inevitable it seems that we really might need multiple qdiscs, one for each TX queue, to pull this full parallelization off. But the semantics of that don't smell so nice either. If the user attaches a new qdisc to "ethN", does it go to all the TX queues, or what? All of the traffic shaping technology deals with the device as a unary object. It doesn't fit to multi-queue at all. -
If you let only one CPU at a time access the "xmit path" you solve all the reordering. If you want to be more fine grained you make the serialization point as low as possible in the stack - perhaps in the driver. But I think even what we have today with only one cpu entering the dequeue/scheduler region, _for starters_, is not bad actually ;-> What i am finding (and i can tell you i have been trying hard;->) is that a sufficiently fast cpu doesnt sit in the dequeue area for "too long" (and batching reduces the time spent further). Very quickly there are no more packets for it to dequeue from the qdisc or the driver is stoped and it has to get out of there. If you dont have any interupt tied to a specific cpu then you can have many cpus enter and leave that region all the time. cheers, jamal -
From: jamal <hadi@cyberus.ca> With the lock shuttling back and forth between those cpus, which is what we're trying to avoid. Multiply whatever effect you think you might be able to measure due to that on your 2 or 4 way system, and multiple it up to 64 cpus or so for machines I am using. This is where machines are going, and is going to become the norm. -
That along with speeds going to 10 GbE with multiple Tx/Rx queues (with 40 and 100 GbE under discussion now), where multiple CPU's hitting the driver are needed to push line rate without cratering the entire machine. -PJ Waskiewicz -
Oh, absolutely. I think, fundamentally, any amount of cross-flow resource management done in software is an obstacle to concurrency. That's not a value judgement, just a statement of fact. "traffic cops" are intentional bottlenecks we add to the process, to enable features like priority flows, filtering, or even simple socket fairness guarantees. Each of those bottlenecks serves a valid purpose, but at the end of the day, it's still a bottleneck. So, improving concurrency may require turning off useful features that Well the easy solutions to networking concurrency are * use virtualization to carve up the machine into chunks * use multiple net devices Since new NIC hardware is actively trying to be friendly to multi-channel/virt scenarios, either of these is reasonably straightforward given the current state of the Linux net stack. Using multiple net devices is especially attractive because it works very well with the existing packet scheduling. Both unfortunately impose a burden on the developer and admin, to force their apps to distribute flows across multiple [VMs | net devs]. The third alternative is to use a single net device, with SMP-friendly packet scheduling. Here you run into the problems you described "device as a unary object" etc. with the current infrastructure. With multiple TX rings, consider that we are pushing the packet scheduling from software to hardware... which implies * hardware-specific packet scheduling * some TC/shaping features not available, because hardware doesn't support it Jeff -
From: jamal <hadi@cyberus.ca> Picture it like N tubes you stick packets into, and the tubes are processed using DRR. So packets within a tube won't be reordered, but reordering amongst tubes is definitely possible. -
We should make sure we're symmetric with the locking on enqueue to
dequeue. If we use the single device queue lock on enqueue, then
dequeue will also need to check that lock in addition to the individual
queue lock. The details of this are more trivial than the actual
The dequeue locking would be pushed into the qdisc itself. This is how
I had it originally, and it did make the code more complex, but it was
successful at breaking the heavily-contended queue_lock apart. I have a
subqueue structure right now in netdev, which only has queue_state (for
netif_{start|stop}_subqueue). This state is checked in sch_prio right
now in the dequeue for both prio and rr. My approach is to add a
queue_lock in that struct, so each queue allocated by the driver would
have a lock per queue. Then in dequeue, that lock would be taken when
the skb is about to be dequeued.
The skb->queue_mapping field also maps directly to the queue index
itself, so it can be unlocked easily outside of the context of the
dequeue function. The policy would be to use a spin_trylock() in
dequeue, so that dequeue can still do work if enqueue or another dequeue
is busy. And the allocation of qdisc queues to device queues is assumed
to be one-to-one (that's how the qdisc behaves now).
I really just need to put my nose to the grindstone and get the patches
together and to the list...stay tuned.
Thanks,
-PJ Waskiewicz
-
more locks implies degraded performance. If only one processor can enter that region, presumably after acquiring the outer lock , why this So there could be a parallel cpu dequeueing at the same time? Wouldnt this have implications depending on what the scheduling algorithm used? If for example i was doing priority queueing i would want to make sure the highest priority is being dequeued first AND by all means goes out first to the driver; i dont want a parallell cpu Ok, that brings back the discussion we had; my thinking was something like dev->hard_prep_xmit() would select the ring and i think you staticly already map the ring to a qdisc queue. So i dont think dev->hard_prep_xmit() is useful to you. In any case, there is nothing the batching patches do that interfere or prevent you from going the path you intend to. instead of dequeueing one packet, you dequeue several and instead of sending to the driver one packet, you send several. And using the xmit_win, you should never ever have to requeue. cheers, jamal -
On Mon, 24 Sep 2007 16:47:06 -0700 Since we are redoing this, is there any way to make the whole TX path more lockless? The existing model seems to be more of a monitor than a real locking model. -- Stephen Hemminger <shemminger@linux-foundation.org> -
What do you mean it is "more of a monitor"? On the challenge of making it lockless: About every NAPI driver combines the tx prunning with rx polling. If you are dealing with tx resources on receive thread as well as tx thread, _you need_ locking. The only other way we can do avoid it is to separate the rx path interupts from ones on tx related resources; the last NAPI driver that did that was tulip; i think the e1000 for a short period in its life did the same as well. But that has been frowned on and people have evolved away from it. cheers, jamal -
On Tue, 25 Sep 2007 09:15:38 -0400 The transmit code path is locked as a code region, rather than just object locking on the transmit queue or other fine grained object. This leads to moderately long If we went to finer grain locking it would also mean changes to all network devices using the new locking model. My assumption is that we would use something like the features flag to do the transition for backward compatibility. Take this as a purely "what if" or "it would be nice if" kind of suggestion not a requirement or some grand plan. -- Stephen Hemminger <shemminger@linux-foundation.org> -
It will be pretty tricky to optimize that path given the dependencies between the queues, classifiers, and actions in enqueues; schedulers in dequeues as well as their config/queries from user space which could happen concurently on all "N" CPUs. The txlock optimization i added in patch1 intends to let go of the queue lock when we enter the dequeue region sooner to reduce the contention. A further optimization i made was to reduce the time it takes to hold the tx lock at the driver by moving gunk that doesnt need lock-holding Ok, hopefully someone would demonstrate how to achieve it; seems a hard thing to achieve. cheers, jamal -
Sorry, that should have read dev->hard_prep_xmit() cheers, jamal -
That seems quite reasonable. I will certainly see what I can do. Thanks Stephen, -PJ Waskiewicz -
Latest net-2.6.24 breaks the patches i posted last week; so this is an update to resolve that. If you are receiving these emails and are finding them overloading, please give me a shout and i will remove your name. Please provide feedback on the code and/or architecture. Last time i posted them i received none. They are now updated to work with the latest net-2.6.24 from a few hours ago. Patch 1: Introduces batching interface Patch 2: Core uses batching interface Patch 3: get rid of dev->gso_skb I have decided i will kill ->hard_batch_xmit() and not support any more LLTX drivers. This is the last of patches that will have ->hard_batch_xmit() as i am supporting an e1000 that is LLTX. Dave please let me know if this meets your desires to allow devices which are SG and able to compute CSUM benefit just in case i misunderstood. Herbert, if you can look at at least patch 3 i will appreaciate it (since it kills dev->gso_skb that you introduced). More patches to follow later if i get some feedback - i didnt want to overload people by dumping too many patches. Most of these patches mentioned below are ready to go; some need some re-testing and others need a little porting from an earlier kernel: - tg3 driver (tested and works well, but dont want to send - tun driver - pktgen - netiron driver - e1000 driver (LLTX) - e1000e driver (non-LLTX) - ethtool interface - There is at least one other driver promised to me Theres also a driver-howto i wrote that was posted on netdev last week as well as one that describes the architectural decisions made. Each of these patches has been performance tested (last with DaveM's tree from last weekend) and the results are in the logs on a per-patch basis. My system under test hardware is a 2xdual core opteron with a couple of tg3s. I have not re-run the tests with this morning's tree but i suspect not much difference. My test tool generates udp traffic of different sizes for upto 60 seconds per run or a total of 30M ...
And heres a patch that provides a sample of the usage for batching with tg3. Requires patch "[TG3]Some cleanups" i posted earlier. cheers, jamal
Please provide feedback on the code and/or architecture. Last time i posted them i received little. They are now updated to work with the latest net-2.6.24 from a few hours ago. Patch 1: Introduces batching interface Patch 2: Core uses batching interface Patch 3: get rid of dev->gso_skb What has changed since i posted last: 1) Fix a bug eyeballed by Patrick McHardy on requeue reordering. 2) Killed ->hard_batch_xmit() 3) I am going one step back and making this set of patches even simpler so i can make it easier to review.I am therefore killing dev->hard_prep_xmit() and focussing just on batching. I plan to re-introduce dev->hard_prep_xmit() but from now on i will make that a separate effort. (it seems to be creating confusion in relation to the general work). Dave please let me know if this meets your desires to allow devices which are SG and able to compute CSUM benefit just in case i misunderstood. Herbert, if you can look at at least patch 3 i will appreaciate it (since it kills dev->gso_skb that you introduced). UPCOMING PATCHES --------------- As before: More patches to follow later if i get some feedback - i didnt want to overload people by dumping too many patches. Most of these patches mentioned below are ready to go; some need some re-testing and others need a little porting from an earlier kernel: - tg3 driver - tun driver - pktgen - netiron driver - e1000 driver (LLTX) - e1000e driver (non-LLTX) - ethtool interface - There is at least one other driver promised to me Theres also a driver-howto i wrote that was posted on netdev last week as well as one that describes the architectural decisions made. PERFORMANCE TESTING -------------------- I started testing since yesterday, but these tests take a long time so i will post results probably at the end of the day sometime and may stop running more tests and just comparing batch vs non-batch results. I have optimized the kernel-config so i expect my overall performance numbers to look better ...
Hi Jamal. it looks like you and Krishna use the same requeueing methods - get one from qdisk, queue it into blist, get next from qdisk, queue it, eventually start transmit, where you dequeue it one-by-one and send (or prepare and commit). This is not the 100% optimal approach, but if you proved it does not hurt usual network processing, it is ok. Number of comments dusted to very small - that's a sign, but I'm a bit lost - did you and Krishna create the competing approaches, or they can co-exist together, in the former case I doubt you can push, until all problematic places are resolved, in the latter case, this is probably ready. -- Evgeniy Polyakov -
There are probably other bottlenecks that hide the need to optimize Thanks. I would like to make one more cleanup and get rid of the temporary pkt list in qdisc restart; now that i have defered the skb pre-format interface it is unnecessary. I have a day off today, so i will make changes, re-run tests and post again. I dont see something from Krishna's approach that i can take and reuse. This maybe because my old approaches have evolved from the same path. There is a long list but as a sample: i used to do a lot more work while holding the queue lock which i have now moved post queue lock; i dont have any speacial interfaces/tricks just for batching, i provide hints to the core of how much the driver can take etc etc. I have offered Krishna co-authorship if he makes the IPOIB driver to work on my patches, that offer still stands if he chooses to take it. cheers, jamal -
My feeling is that since the approaches are very different, it would be a good idea to test the two for performance. Do you mind me doing that? Ofcourse others and/or you are more than welcome to do the same. I had sent a note to you yesterday about this, please let me know either way. ******************* Previous mail ****************** Hi Jamal, If you don't mind, I am trying to run your approach vs mine to get some results for comparison. For starters, I am having issues with iperf when using your infrastructure code with my IPoIB driver - about 100MB is sent and then everything stops for some reason. The changes in the IPoIB driver that I made to support batching is to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and increase xmit_win on every xmit completion. Is there anything else that is required from the driver? thanks, - KK -
My concern is the approaches are different only for short periods of time. For example, I do requeueing, have xmit_win, have ->end_xmit, do batching from core etc; if you see value in any of these concepts, they will appear in your patches and this goes on a loop. Perhaps what Which i dont mind as long as it has an analysis that goes with it. If all you post is "heres what netperf showed", it is not useful at all. There are also a lot of affecting variables. For example, is the receiver a bottleneck? To make it worse, I could demonstrate to you that if i slowed down the driver and allowed more packets to queue up on the qdisc, batching will do well. In the past my feeling is you glossed over I responded to you - but it may have been lost in the noise; heres a copy: http://marc.info/?l=linux-netdev&m=119185137124008&w=2 cheers, jamal -
You should post at least a couple driver patches to see how its used on Real Hardware(tm)... :) The batching idea has always seemed like a no-brainer to me, so I'm very interested to see how this turns out. Jeff -
This is the tg3 patch i used for the testing - against whats in Daves net-2.6.24 tree. Patch may be a bit hard to read. For an example of an LLTX version look at the e1000 in the older git tree at: git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git If the intel folks will accept the patch i'd really like to kill the e1000 LLTX interface. The tg3 in that tree used the old style batch_xmit() interface. cheers, jamal
Thanks for sending this. This is an early draft, right? I'll fix some typos etc. in it (patch attached) and add some whitespace. Please see RD: in the patch for more questions/comments. IMO it needs some changes to eliminate words like "we", "you", and "your" (words that personify code). Those words are OK when talking about yourself. --- ~Randy Phaedrus says that Quality is about caring.
Its a third revision - but you could call it early. When it is done, i The narrative intent is supposed to be i (or someone doing the description) sitting there with a pen and paper and maybe a laptop and walking through the details with someone who needs to understand those details. If you think it is important to make it formal, then by all means be my guest. Again, thanks for taking the time. cheers, jamal -
please be reminded that we're going to strip down e1000 and most of the features should go into e1000e, which has much less hardware workarounds. I'm still reluctant to putting in new stuff in e1000 - I really want to chop it down first ;) AUke -
sure - the question then is, will you take those changes if i use e1000e? theres a few cleanups that have nothing to do with batching; take a look at the modified e1000 on the git tree. cheers, jamal -
that's bad to begin with :) - please send those separately so I can fasttrack them into e1000e and e1000 where applicable. But yes, I'm very inclined to merge more features into e1000e than e1000. I intend to put multiqueue support into e1000e, as *all* of the hardware that it will support has multiple queues. Putting in any other performance feature like tx batching would absolutely be interesting. Auke -
Ive been CCing you ;-> Most of the changes are readability and I looked at the e1000e and it is very close to e1000 so i should be able to move the changes easily. Most importantly, can i kill LLTX? For tx batching, we have to wait to see how Dave wants to move forward; i will have the patches but it is not something you need to push until we see where that is going. cheers, jamal -
hmm, I though I already removed that, but now I see some remnants from that. By all means, please send a separate patch for that! Auke -
If I understood DaveM correctly, it is sounding like we want to deprecate all of use LLTX on "real" hardware? If so, several such projects might be considered, as well as possibly simplifying TX batching work perhaps. Also, WRT e1000 specifically, I was hoping to minimize changes, and focus people on e1000e. e1000e replaces (deprecates) large portions of e1000, namely the support for the PCI Express modern chips. When e1000e has proven itself in the field, we can potentially look at several e1000 simplifications, during the large scale code removal that becomes possible. Jeff -
From: jamal <hadi@cyberus.ca> I do. And I'm reviewing and applying several hundred patches a day. What's the point? :-) -
Reading the commentary made me think you were about to swallow that with one more change by the time i wake up;-> I still think this work - despite my vested interest - needs more scrutiny from a performance perspective. I tend to send a url to my work, but it may be time to start posting patches. cheers, jamal -
From: jamal <hadi@cyberus.ca> Absolutely. There are tertiary issues I'm personally interested in, for example how well this stuff works when we enable software GSO on a non-TSO capable card. In such a case the GSO segment should be split right before we hit the driver and then all the sub-segments of the original GSO frame batched in one shot down to the device driver. In this way you'll get a large chunk of the benefit of TSO without explicit hardware support for the feature. There are several cards (some even 10GB) that will benefit immensely from this. -
I think GSO is still useful on top of this. In my patches anything with gso gets put into the batch list and shot down the driver. Ive never considered checking whether the nic is TSO capable, that may be something worth checking into. The netiron allows you to shove upto 128 skbs utilizing one tx descriptor, which makes for indeed - ive always wondered if batching this way would make the NICs behave differently from the way TSO does. On a side note: My observation is that with large packets on a very busy system; bulk transfer type app, one approaches wire speed; with or without batching, the apps are mostly idling (Ive seen upto 90% idle time polling at the socket level for write to complete with a really busy system). This is the case with or without batching. cpu seems a little better with batching. As the aggregation of the apps gets more aggressive (achievable by reducing their packet sizes), one can achieve improved throughput and reduced cpu utilization. This all with UDP; i am still studying tcp. cheers, jamal -
From: jamal <hadi@cyberus.ca> We're talking past each other, but I'm happy to hear that for sure your code does the right thing :-) Right now only TSO capable hardware sets the TSO capable bit, except perhaps for the XEN netfront driver. What Herbert and I want to do is basically turn on TSO for devices that can't do it in hardware, and rely upon the GSO framework to do the segmenting in software right before we hit the device. This only makes sense for devices which can 1) scatter-gather and 2) checksum on transmit. Otherwise we make too many copies and/or passes over the data. And we can only get the full benefit if we can pass all the sub-segments down to the driver in one ->hard_start_xmit() UDP apps spraying data tend to naturally batch well and load balance amongst themselves because each socket fills up to it's socket send buffer limit, then sleeps, and we then get a stream from the next UDP socket up to it's limit, and so on and so forth. UDP is too easy a test case in fact :-) -
If you have knowledge there are enough descriptors in the driver to cover all skbs you are passing, do you need to have #1? Note i dont touch fragments, i am assuming the driver is smart enough to I didnt understand this last bit - you are still going to go over the list regardless of whether you call ->hard_start_xmit() once or multiple times over the same list, no? In the later case i am assuming I learnt a lot about the behavior out of doing udp (and before that with pktgen); theres a lot of driver habits that may need to be tuned before batching becomes really effective - which is easier to see with udp than with tcp. cheers, jamal -
From: jamal <hadi@cyberus.ca> Yes, because you can have multiple descriptors per SKB because we have the head part in skb->data and the rest in the page vector. Thus the device must be able to handle multiple descriptors If the device can't checksum, we have to pass over the data to compute the checksum and stick it into the headers. If the device can't scatter-gather, we have to allocate and copy into a linear buffer. Otherwise it's just bumping page reference counts and adjusting offsets, no data touching at all. -
[Removing Jeff as requested from thread :) ] Hi Dave, I have tried this on ehca which does not support TSO. I added GSO flag at the ipoib layer (and that resulted in a panic/fix that is mentioned in this patchset). I will re-run tests for this and submit results. Thanks, - KK -
Hi Dave, The intention was to avoid having two flags (one that driver supports batching and second to indicate that batching is on/off). So I could test skb_blist as an indication of whether batching is on/off. But your point on cache miss is absolutely correct, and I will change this part to be inline. thanks, - KK -
