The line:
if ((bd == ugeth->txBd[txQ]) && (netif_queue_stopped(dev) == 0))
break;
in ucc_geth_tx() didn not make sense to me. Rework & cleanup
this logic to something understandable.
---
Reworked the patch according to Antons comments.
drivers/net/ucc_geth.c | 66 +++++++++++++++++++++++++++--------------------
1 files changed, 38 insertions(+), 28 deletions(-)
diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index 7fc91aa..465de3a 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -161,6 +161,17 @@ static struct ucc_geth_info ugeth_primary_info = {
static struct ucc_geth_info ugeth_info[8];
+
+static inline u32 __iomem *bd2buf(u8 __iomem *bd)
+{
+ return (u32 __iomem *)(bd+4);
+}
+
+static inline u32 __iomem *bd2status(u8 __iomem *bd)
+{
+ return (u32 __iomem *)bd;
+}
+
#ifdef DEBUG
static void mem_disp(u8 *addr, int size)
{
@@ -3048,6 +3059,7 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, struct net_device *dev)
u8 __iomem *bd; /* BD pointer */
u32 bd_status;
u8 txQ = 0;
+ int tx_ind;
ugeth_vdbg("%s: IN", __func__);
@@ -3057,21 +3069,19 @@ static int ucc_geth_start_xmit(struct sk_buff *skb, struct net_device *dev)
/* Start from the next BD that should be filled */
bd = ugeth->txBd[txQ];
- bd_status = in_be32((u32 __iomem *)bd);
+ bd_status = in_be32(bd2status(bd));
/* Save the skb pointer so we can free it later */
- ugeth->tx_skbuff[txQ][ugeth->skb_curtx[txQ]] = skb;
+ tx_ind = ugeth->skb_curtx[txQ];
+ ugeth->tx_skbuff[txQ][tx_ind] = skb;
/* Update the current skb pointer (wrapping if this was the last) */
- ugeth->skb_curtx[txQ] =
- (ugeth->skb_curtx[txQ] +
- 1) & TX_RING_MOD_MASK(ugeth->ug_info->bdRingLenTx[txQ]);
+ tx_ind = (tx_ind + 1) & TX_RING_MOD_MASK(ugeth->ug_info->bdRingLenTx[txQ]);
+ ugeth->skb_curtx[txQ] = tx_ind;
/* set up the buffer descriptor */
- out_be32(&((struct qe_bd __iomem *)bd)->buf,
- dma_map_single(&ugeth->dev->dev, ...Spaces around "+"... Also, now it's obvious that we're just reading status or buf. So instead of these inlines we could use struct qe_bd as described in asm/qe.h. I.e. struct qe_bd *bd = ...; in_be32(&bd->buf); in_be16(&bd->status); Oh, wait. We can't, ucc_geth assumes status is u32, which includes the length field, i.e. ucc_fast.h defines: The cleanup work surely desires a separate patch, so bd2buf and bd2status are OK, for now. I'll test the patch as soon as I'll get some QE board back on my table (actually I have one QE board handy, but it's a 1GHz one, while I'd like to test the patch on a slow machine, where we'll actually see performance regressions). Line over 80 columns. Ditto. Please make sure your patches pass scripts/checkpatch.pl. Thanks, -- Anton Vorontsov email: cbouatmailru@gmail.com irc://irc.freenode.net/bd2 --
Exactly, I did look at using struct qe_bd *bd, but that will affect lots of stuff. Reading the status and len in one go is Good, I have more coming but it touches the same code so I really want to sort out this one first. Jocke --
On Fri, Mar 27, 2009 at 1:44 AM, Joakim Tjernlund When the driver was reviewed for upstream merge, I was told to remove this kind of thing in favor of direct struct qe_bd access. So do we really have a guideline on this? Or it's just a matter of personal Why does this make more sense? The BD with a NULL buffer means the Now I see why you depend on the buffer to judge if the BD ring is full. If you want to do this, make sure it's well documented in code, or others will be confused. And as Anton already commented, in/out access is slow. I think the original way can be fixed if you think --
This is a bit better than the current type casts. Moving over to qe_bd accesses is a bigger cleanup that will have to be a seperate patch. I am not sure it is an all win as you probably loose the ability The code could use a better comment, but I really think this method is superior and more robust. The problem is that in/out functions is much slower than it has to be for QE IO memory. The original way is broken and I tired to fix it but it always broke as soon one applied some load. Jocke --
On Fri, Mar 27, 2009 at 6:23 PM, Joakim Tjernlund The only difference I see between your method and the original code is that you update the cleanup progress on every BD. The original code updates progress only after all sent BDs are processed. You can try move ugeth->confBd[txQ] = bd; into the loop then it will be the same as your proposed code. - Leo --
Already tried that+lots of other combinatios. Doesn't work --
Not if you define the struct to have one status_len field (or a union of that with separate fields), as gianfar does. -Scott --
gianfar does not seem to use in_/out_ functions for the BDs. Works just fine that too it seems. I always felt that the in_/out_ functions was extra baggage that the Freescale CPUs don't need. There is something in between that is missing. Jocke --
It does now that it has explicit barriers in a few places. Before they were added, it would sometimes fail under load. That was due to a compiler reordering, but CPU reordering was possible as well. -Scott --
Does not the CPU skip reordering if the guarded bit is set? Could we not use the same in ucc_geth as well? Then people would not need to worry about "performance issues". Jocke --
The guarded bit is typically not set for DMA buffers. ucc_geth is a bit different since descriptors are in MURAM which is ioremap()ed -- though switching to a cacheable mapping with barriers should be a performance improvement. -Scott --
I always thought that MURAM was very fast. The whole reason to have BDs in MURAM is that it is faster than normal RAM, at least that is what I thought. Jocke --
Yeah, on second thought it probably wouldn't be worth it. There's also the question of under what circumstances the QE's MURAM accesses will be cache-coherent. As for the CPU not reordering guarded+cache inhibited accesses, that initially seemed to be true for the new arch stuff (book3e/book3s, but not really, see below), but the classic arch documentation only guarantees stores to such regions to be in-order (and the explicitly-specified operation of eieio on I+G accesses wouldn't make much sense if they were already guaranteed to be in-order). Then I looked at the EREF to see what older book E documents had to say on the issue, and it suggests that when the architecture document says "out of order", it really means "speculative" (and reading the arch doc's definition of "out of order" seems to confirm this -- redefining terms is bad, m'kay?). So it seems that the simple answer is no, guarded storage is not guaranteed to be in order, unless the only thing that can cause an out-of-order access is speculative execution. -Scott --
I am a bit confused, what isn't worth it? Currently MURAM isn't used by ucc_geth, but it is easy to change. Swap MEM_PART_SYSTEM to MEM_PART_MURAM, however, just tried that and the driver stopped working. I known this worked earlier because I tried it and I even think I sent a patch to Leo. What choices do we have, I see three: 1) MEM_PART_SYSTEM, as today. 2) MEM_PART_MURAM. I guess this should be uncacheable memory? 3) as gianfar, dma_alloc_coherent(). I presume this is uncacheable memory? My guess would be 2 or 3. Do they have the same synchronization Very informative, thanks. --
On Tue, Mar 31, 2009 at 5:07 PM, Joakim Tjernlund 1 and 3 are the same. All of them use cacheable memory as we have a hardware coherency module to take care of the cache coherency problem. However it might be better to use dma_alloc_coherent() for the code to be more readible. Thanks. - Leo --
Enabling cacheing on MURAM, at least when used for buffer descriptors. The cache line ping-pong would probably outweigh the cost of the Hmm. I looked in the driver and saw numerous muram allocations, but I didn't try to follow the driver enough to ensure that they were for the It would be uncacheable on systems without coherent DMA, but I don't No, unfortunately. PowerPC sync instructions are a bit complicated. For example, you can use eieio to sync between reading the interrupt status register and checking the ring buffer, if they're both mapped I+G, but not if the former is I+G and the latter is cacheable (you need a full sync in that case). -Scott --
I noticed that in gianfar these memory access is not protected by "volatile". Can this be the reason why the compiler did some unwanted optimization? - Leo --
