Re: [net-next-2.6 PATCH][be2net] remove napi in the tx path and do tx completion processing in interrupt context

Previous thread: Re: tcp vegas ssthresh bugfix by David Miller on Tuesday, May 19, 2009 - 2:48 am. (1 message)

Next thread: [PATCH net-next-2.6] cleanup: remove unused parameter from fill method in fib_rules_ops. by Rami Rosen on Tuesday, May 19, 2009 - 5:54 am. (5 messages)

Hi,
This patch will remove napi in tx path and do Tx compleiton processing in 
interrupt context.  This makes Tx completion processing simpler without loss of 
performance.

Thanks
Ajit

Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
---
 drivers/net/benet/be.h      |    3 +-
 drivers/net/benet/be_main.c |  114 ++++++++++++++++++++++---------------------
 2 files changed, 60 insertions(+), 57 deletions(-)

diff --git a/drivers/net/benet/be.h b/drivers/net/benet/be.h
index b4bb06f..6d149a4 100644
--- a/drivers/net/benet/be.h
+++ b/drivers/net/benet/be.h
@@ -159,8 +159,6 @@ struct be_eq_obj {
 	u16 min_eqd;		/* in usecs */
 	u16 max_eqd;		/* in usecs */
 	u16 cur_eqd;		/* in usecs */
-
-	struct napi_struct napi;
 };
 
 struct be_tx_obj {
@@ -179,6 +177,7 @@ struct be_rx_page_info {
 };
 
 struct be_rx_obj {
+	struct napi_struct napi;
 	struct be_queue_info q;
 	struct be_queue_info cq;
 	struct be_rx_page_info page_info_tbl[RX_Q_LEN];
diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index ae2f6b5..640337c 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -991,7 +991,7 @@ static void be_tx_compl_process(struct be_adapter *adapter, u16 last_index)
 
 	atomic_sub(num_wrbs, &txq->used);
 
-	kfree_skb(sent_skb);
+	dev_kfree_skb_any(sent_skb);
 }
 
 static void be_rx_q_clean(struct be_adapter *adapter)
@@ -1216,20 +1216,66 @@ static int event_handle(struct be_ctrl_info *ctrl,
 
 	/* We can see an interrupt and no event */
 	be_eq_notify(ctrl, eq_obj->q.id, true, true, num);
-	if (num)
-		napi_schedule(&eq_obj->napi);
 
 	return num;
 }
 
+static int rx_event_handle(struct be_adapter *adapter)
+{
+	int evts;
+
+	evts = event_handle(&adapter->ctrl, &adapter->rx_eq);
+	if (evts)
+		napi_schedule(&adapter->rx_obj.napi);
+
+	return evts;
+}
+
+static int tx_event_handle(struct be_adapter *adapter)
+{
+	struct be_tx_obj *tx_obj = &adapter->tx_obj;
+	struct be_queue_info *tx_cq = ...

From: Ajit Khaparde <ajitk@serverengines.com>

This is different from how every other NAPI driver does this.

You should have a single NAPI context, that handles both TX and RX
processing.  Except, that for TX processing, no work budget
adjustments are made.  You simply unconditionally process all pending
TX work without accounting it into the POLL call budget.

I have no idea why this driver tried to split the RX and TX
work like this, it accomplishes nothing but add overhead.
Simply add the TX completion code to the RX poll handler
and that's all you need to do.  Also, make sure to run TX
polling work before RX polling work, this makes fresh SKBs
available for responses generated by RX packet processing.

I bet this is why you really saw performance problems, rather than
something to do with running it directly in interrupt context.  There
should be zero gain from that if you do the TX poll work properly in
the RX poll handler.  When you free TX packets in hardware interrupt
context using dev_kfree_skb_any() that just schedules a software
interrupt to do the actual SKB free, which adds just more overhead for
TX processing work.  You aren't avoiding software IRQ work by doing TX
processing in the hardware interrupt handler, in fact you
theoretically are doing more.

So the only conclusion I can come to is that what is important is
doing the TX completion work before the RX packets get processed in
the NAPI poll handler, and you accomplish that more efficiently and
more properly by simply moving the TX completion work to the top of
the RX poll handler code.

You also failed to provide any indication of what kind of
performance change occurs due to this patch, you just say
that it does occur.  This gives people very little confidence
in your changes, nor does it give them an idea of the magnitude
of the performance changes caused by this change.

The kind of analysis I did above is the kind of things I expect
to find in a commit message that makes delicate changes like ...

Thanks David for this analysis

I would like to point a scalability problem we currently have with non
multiqueue devices, and multi core host with the schem you described/advocated.

(this has nothing to do with the be2net patch, please forgive me for jumping in)

When a lot of network trafic is handled by one device, we enter in a
ksofirqd/napi mode, where one cpu is almost dedicated in handling
both TX completions and RX completions, while other cpus
run application code (and some parts of TCP/UDP stack )

Thats really expensive because of many cache line ping pongs occurring.

In that case, it would make sense to transfert most part of the TX completion work
to the other cpus (cpus that order the xmits actually). skb freeing of course,
and sock_wfree() callbacks...

So maybe some NIC device drivers could let their ndo_start_xmit()
do some cleanup work of previously sent skbs. If correctly done,
we could lower number of cache line ping pongs.

This would give a breath to the cpu that would only take care of RX completions,
and probably give better throughput. Some machines out there want to transmit
lot of frames, while receiving few ones...


There is also a minor latency problem with current schem :
Taking care of TX completion takes some time and delay RX handling, increasing latencies
of incoming trafic.


--


From: Eric Dumazet <dada1@cosmosbay.com>

Yes and that kind of idea can be combined with the SW multiqueue

That's another idea.  However the ordering necessary to do this
correctly on some chips might make the cost of it prohibitive.  For
example, it might only be safe to check the consumer pointer value
DMA's by a device into the status block after an IRQ is received
unless some expensive synchronization (f.e. a register read) is

One thing that one must understand is that deferring any SKB freeing
increases the size of the working set of memory that the CPU has
to access.  Buffer reuse is absolutely essential to keep the working
set of unfree'd data under control.

This working set bloating effect is also, unfortunately, a hallmark of
RCU.  Especially before we had softint based RCU available.
--


Comments inline.
 
Thanks
Ajit

Splitting RX and TX interrupts helps with performance in certain work loads.
TX completion processing load is not significant in TCP traffic due to
TSO.  RX and TX completion processing load on same CPU limits throughput
dev_kfree_skb_any() scheduling a software interrupt was something that we
We did not see any change in throughput or CPU utilization by doing TX
completion processing in interrupt context instead of the NAPI context.
The objective of the patch was to simplify the TX completion processing.

--


From: Ajit Khaparde <ajitk@serverengines.com>

Thank you.
--

Previous thread: Re: tcp vegas ssthresh bugfix by David Miller on Tuesday, May 19, 2009 - 2:48 am. (1 message)

Next thread: [PATCH net-next-2.6] cleanup: remove unused parameter from fill method in fib_rules_ops. by Rami Rosen on Tuesday, May 19, 2009 - 5:54 am. (5 messages)