Hi,
This patch will remove napi in tx path and do Tx compleiton processing in
interrupt context. This makes Tx completion processing simpler without loss of
performance.
Thanks
Ajit
Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
---
drivers/net/benet/be.h | 3 +-
drivers/net/benet/be_main.c | 114 ++++++++++++++++++++++---------------------
2 files changed, 60 insertions(+), 57 deletions(-)
diff --git a/drivers/net/benet/be.h b/drivers/net/benet/be.h
index b4bb06f..6d149a4 100644
--- a/drivers/net/benet/be.h
+++ b/drivers/net/benet/be.h
@@ -159,8 +159,6 @@ struct be_eq_obj {
u16 min_eqd; /* in usecs */
u16 max_eqd; /* in usecs */
u16 cur_eqd; /* in usecs */
-
- struct napi_struct napi;
};
struct be_tx_obj {
@@ -179,6 +177,7 @@ struct be_rx_page_info {
};
struct be_rx_obj {
+ struct napi_struct napi;
struct be_queue_info q;
struct be_queue_info cq;
struct be_rx_page_info page_info_tbl[RX_Q_LEN];
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index ae2f6b5..640337c 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -991,7 +991,7 @@ static void be_tx_compl_process(struct be_adapter *adapter, u16 last_index)
atomic_sub(num_wrbs, &txq->used);
- kfree_skb(sent_skb);
+ dev_kfree_skb_any(sent_skb);
}
static void be_rx_q_clean(struct be_adapter *adapter)
@@ -1216,20 +1216,66 @@ static int event_handle(struct be_ctrl_info *ctrl,
/* We can see an interrupt and no event */
be_eq_notify(ctrl, eq_obj->q.id, true, true, num);
- if (num)
- napi_schedule(&eq_obj->napi);
return num;
}
+static int rx_event_handle(struct be_adapter *adapter)
+{
+ int evts;
+
+ evts = event_handle(&adapter->ctrl, &adapter->rx_eq);
+ if (evts)
+ napi_schedule(&adapter->rx_obj.napi);
+
+ return evts;
+}
+
+static int tx_event_handle(struct be_adapter *adapter)
+{
+ struct be_tx_obj *tx_obj = &adapter->tx_obj;
+ struct be_queue_info *tx_cq = ...From: Ajit Khaparde <ajitk@serverengines.com> This is different from how every other NAPI driver does this. You should have a single NAPI context, that handles both TX and RX processing. Except, that for TX processing, no work budget adjustments are made. You simply unconditionally process all pending TX work without accounting it into the POLL call budget. I have no idea why this driver tried to split the RX and TX work like this, it accomplishes nothing but add overhead. Simply add the TX completion code to the RX poll handler and that's all you need to do. Also, make sure to run TX polling work before RX polling work, this makes fresh SKBs available for responses generated by RX packet processing. I bet this is why you really saw performance problems, rather than something to do with running it directly in interrupt context. There should be zero gain from that if you do the TX poll work properly in the RX poll handler. When you free TX packets in hardware interrupt context using dev_kfree_skb_any() that just schedules a software interrupt to do the actual SKB free, which adds just more overhead for TX processing work. You aren't avoiding software IRQ work by doing TX processing in the hardware interrupt handler, in fact you theoretically are doing more. So the only conclusion I can come to is that what is important is doing the TX completion work before the RX packets get processed in the NAPI poll handler, and you accomplish that more efficiently and more properly by simply moving the TX completion work to the top of the RX poll handler code. You also failed to provide any indication of what kind of performance change occurs due to this patch, you just say that it does occur. This gives people very little confidence in your changes, nor does it give them an idea of the magnitude of the performance changes caused by this change. The kind of analysis I did above is the kind of things I expect to find in a commit message that makes delicate changes like ...
Thanks David for this analysis I would like to point a scalability problem we currently have with non multiqueue devices, and multi core host with the schem you described/advocated. (this has nothing to do with the be2net patch, please forgive me for jumping in) When a lot of network trafic is handled by one device, we enter in a ksofirqd/napi mode, where one cpu is almost dedicated in handling both TX completions and RX completions, while other cpus run application code (and some parts of TCP/UDP stack ) Thats really expensive because of many cache line ping pongs occurring. In that case, it would make sense to transfert most part of the TX completion work to the other cpus (cpus that order the xmits actually). skb freeing of course, and sock_wfree() callbacks... So maybe some NIC device drivers could let their ndo_start_xmit() do some cleanup work of previously sent skbs. If correctly done, we could lower number of cache line ping pongs. This would give a breath to the cpu that would only take care of RX completions, and probably give better throughput. Some machines out there want to transmit lot of frames, while receiving few ones... There is also a minor latency problem with current schem : Taking care of TX completion takes some time and delay RX handling, increasing latencies of incoming trafic. --
From: Eric Dumazet <dada1@cosmosbay.com> Yes and that kind of idea can be combined with the SW multiqueue That's another idea. However the ordering necessary to do this correctly on some chips might make the cost of it prohibitive. For example, it might only be safe to check the consumer pointer value DMA's by a device into the status block after an IRQ is received unless some expensive synchronization (f.e. a register read) is One thing that one must understand is that deferring any SKB freeing increases the size of the working set of memory that the CPU has to access. Buffer reuse is absolutely essential to keep the working set of unfree'd data under control. This working set bloating effect is also, unfortunately, a hallmark of RCU. Especially before we had softint based RCU available. --
Comments inline. Thanks Ajit Splitting RX and TX interrupts helps with performance in certain work loads. TX completion processing load is not significant in TCP traffic due to TSO. RX and TX completion processing load on same CPU limits throughput dev_kfree_skb_any() scheduling a software interrupt was something that we We did not see any change in throughput or CPU utilization by doing TX completion processing in interrupt context instead of the NAPI context. The objective of the patch was to simplify the TX completion processing. --
From: Ajit Khaparde <ajitk@serverengines.com> Thank you. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing |
