We have done some optimizations in bnx2x_start_xmit() and bnx2x_tx_int(), which
in my opinion can lead into some theoretical race conditions.
I can be pretty wrong here, but if so, we have to optimize some other drivers,
which use memory barriers/locking schema from that patch (like tg3, bnx2).
Memory barriers here IMHO, prevent to make queue permanently stopped when on one
cpu bnx2x_tx_int() make queue empty, whereas on other cpu bnx2x_start_xmit() see
it full and make stop it, such cause queue will be stopped forever.
I'm not quite sure what for is __netif_tx_lock, but other drivers use it.
diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
index 5adf2a0..ca91aa8 100644
--- a/drivers/net/bnx2x_main.c
+++ b/drivers/net/bnx2x_main.c
@@ -893,7 +893,10 @@ static inline u16 bnx2x_tx_avail(struct bnx2x_fastpath *fp)
u16 prod;
u16 cons;
- barrier(); /* Tell compiler that prod and cons can change */
+ /* prod and cons can change on other cpu, want to see
+ consistend available space and queue (stop/running) state */
+ smp_mb();
+
prod = fp->tx_bd_prod;
cons = fp->tx_bd_cons;
@@ -957,21 +960,23 @@ static int bnx2x_tx_int(struct bnx2x_fastpath *fp)
fp->tx_pkt_cons = sw_cons;
fp->tx_bd_cons = bd_cons;
+ /* Need to make the tx_bd_cons update visible to start_xmit()
+ * before checking for netif_tx_queue_stopped(). Without the
+ * memory barrier, there is a small possibility that start_xmit()
+ * will miss it and cause the queue to be stopped forever. This
+ * can happen when we make queue empty here, when on other cpu
+ * start_xmit() still see it becoming full and stop.
+ */
+ smp_mb();
+
/* TBD need a thresh? */
if (unlikely(netif_tx_queue_stopped(txq))) {
-
- /* Need to make the tx_bd_cons update visible to start_xmit()
- * before checking for netif_tx_queue_stopped(). Without the
- * memory barrier, there is a small possibility that
- * start_xmit() will miss it and cause the queue to be stopped
- * ...From: Stanislaw Gruszka <sgruszka@redhat.com> Instead of having an opinion, please show the exact sequence of events that can lead to this situation. With such facts inhand, you will have no need for an opinion :-) --
On Thu, 25 Feb 2010 02:18:22 -0800 (PST)
Ok, here is the story:
Queue (4000 elements) is almost full: fp->tx_bd_prod = 3980, fp->tx_bd_cons = 0,
Packets ware already transmitted by device, but that was not reported to
the driver because interrupts where disabled. We are transmitting new skb
with 20 fragments.
cpu0: (in bnx2x_poll) cpu1: (transferring data)
bnx2x_tx_int(): bnx2x_start_xmit():
local fp->tx_bd_cons = 3980; send more data to device
return
bnx2x_tx_int(): local fp->tx_bd_prod = 4000
local fp->tx_bd_cons = 4000; local fp->tx_bd_cons still 0
queue not stopped no avail space in queue
return; stop queue
smp_mb() - not paired, cpu1 does not "see" cpu0 caches changes
local fp->tx_bd_cons still 0
no wake
Finally queue is sopped and device will not generate interrupt nor
call bnx2x_tx_int() from bnx2x_poll() since bnx2x_has_tx_work() will
return false.
--
In bnx2x_start_xmit(): right after the "stop queue" there is an smp_mb(), which will bring the cpu0 cache and a fresh fp->tx_bd_cons value to cpu1 and the following if() will return true and the queue will be released from bnx2x_start_xmit() flow. No deadlock here. --
On Thu, 25 Feb 2010 07:49:48 -0800
If I understand correctly what is written in Documentation/memory-barriers.txt
this smp_mb() need to have another "paired" smp_{w}mb() on cpu0 to make value
be updated on cpu1, which is missing.
--
From: Stanislaw Gruszka <sgruszka@redhat.com> The invocation of ->hard_start_xmit() creates an implicit memory barrier because all such invocations take the netdev spinlock. --
On Thu, 25 Feb 2010 08:06:24 -0800 (PST)
But barrier is missing on cpu which call bnx2x_pull(), how spinlock
before bnx2x_start_xmit() helps with that? Below again the whole
picture:
cpu0: (in bnx2x_poll) cpu1: (transferring data)
bnx2x_tx_int(): bnx2x_start_xmit():
local fp->tx_bd_cons = 3980; send more data to device
return
bnx2x_tx_int(): local fp->tx_bd_prod = 4000
local fp->tx_bd_cons = 4000; local fp->tx_bd_cons still 0
queue not stopped no avail space in queue
return; stop queue
smp_mb() - not paired, cpu1 does not "see" cpu0 caches changes
local fp->tx_bd_cons still 0
no wake
--
Correct. There a missing barrier in bnx2x_tx_int(). I'll create a proper patch shortly. --
Stanislaw, there is no need for memory barrier in bnx2x_tx_avail() as long as they are taken outside the function the way we minimize the possibility of calls for that barrier only to the rare cases, when the ring is going to be closed (XOFF). The same is with the smp_mb() in bnx2x_tx_int() - we don't want to call it unless we really needed and this is only in the case we need to XON the queue which is currently XOFFed. The tx_lock() is not needed there as well. There is a slight and very hypothetical possibility for the case when we might release the queue, when it's full and then there is a possibility that bnx2x_start_xmit() will return NETIF_TX_BUSY. And even if it happens (while nobody has reported about such a case) it's not fatal. Dave, I'm running stress tests on the patch, which will eliminate even that hypothetical issue, I've mentioned before. Once we fill confident with this patch we'll send it to u. Thanks, --
I suspect that this isn't what you want. The barrier() didn't tell the compiler that fp->tx_bd_prod and fp->tx_bd_cons could change. What it did was to say that the accesses to those two variables must be performed after all the other accesses issued by that CPU prior to the barrier - at least as far as the compiler is concerned. You don't need to separate the reads of tx_bd_prod and tx_bd_cons above with a memory barrier. They aren't ever altered in the same place. What you want is something more like the following pseudocode. To insert into a circular buffer: bd_prod = fp->tx_bd_prod; bd_cons = fp->tx_bd_cons; if (CIRC_SPACE(bd_cons, bd_prod, NUM_TX_BD) <= 0) goto no_space; /* get a tx_buf and first BD */ tx_start_bd = &fp->tx_desc_ring[bd_prod].start_bd; tx_start_bd->bd_flags.as_bitfield = ETH_TX_BD_FLAGS_START_BD; tx_start_bd->general_data = (UNICAST_ADDRESS << ETH_TX_START_BD_ETH_ADDR_TYPE_SHIFT); tx_start_bd->general_data |= (1 << ETH_TX_START_BD_HDR_NBDS_SHIFT); smp_wmb(); /* commit buffer contents before incrementing index */ fp->tx_bd_prod = TX_BD(bd_prod + 1); To read from a circular buffer: bd_prod = fp->tx_bd_prod; bd_cons = fp->tx_bd_cons; smp_read_barrier_depends(); /* read index before reading contents */ if (CIRC_CNT(bd_cons, bd_prod, NUM_TX_BD) <= 0) goto no_data; tx_start_bd = &fp->tx_desc_ring[bd_cons].start_bd; munge_descriptor(tx_start_bd); smp_mb(); /* finish reading descriptor before incrementing index */ fp->tx_bd_cons = TX_BD(bd_cons + 1); At least, I'm fairly certain that's correct. David --
From: David Howells <dhowells@redhat.com> barrier() has a "memory" asm clobber which says that all memory could have changed. --
Indeed, but only as far as the compiler is concerned. However, the problem is almost certainly not this, but that the item to be read from the buffer may not have been written yet by the producing CPU, even though it's written the head pointer. David --
Yes, I realized that. I posted other patches, removing you from CC, since thought you are not interested. http://patchwork.ozlabs.org/patch/47172/ http://patchwork.ozlabs.org/patch/47173/ http://patchwork.ozlabs.org/patch/47174/ I removed there barrier() and not use smp_mb(), with explanation why. Stanislaw --
Having said that, you might need a memory barrier before reading tx_bd_prod in the consumer if the producer waggles a flag in memory to indicate to the consumer that it should consume, and a memory barrier in the producer before waggling that flag: [producer] ... smp_wmb(); /* commit buffer contents before incrementing index */ fp->tx_bd_prod = TX_BD(bd_prod + 1); smp_wmb(); /* commit increment index before prodding consumer */ prod_consumer(); [consumer] check_prod_flag(); smp_rmb(); /* read producer index after checking prod flag */ bd_prod = fp->tx_bd_prod; bd_cons = fp->tx_bd_cons; smp_read_barrier_depends(); /* read index before reading contents */ David --
