Re: [PATCH 2/3] dma: override "dma_flags_set_dmaflush" for sn-ia64

Previous thread: [PATCH 1/3] dma: introduce no-op stub "dma_flags_set_dmaflush" by akepner on Friday, August 17, 2007 - 8:27 pm. (1 message)

Next thread: [PATCH 3/3] dma: use dma_flags_set_dmaflush in ib_umem_get by akepner on Friday, August 17, 2007 - 8:27 pm. (1 message)
To: <linux-kernel@...>
Cc: <jes@...>, <rdreier@...>
Date: Friday, August 17, 2007 - 8:27 pm

Define ARCH_DOES_POSTED_DMA, and override the dma_flags_set_dmaflush
function. Also define dma_flags_get_direction, dma_flags_get_dmaflush
to retrieve the direction and "dmaflush attribute" from flags
passed to the sn_dma_map_* functions.

Signed-off-by: Arthur Kepner <akepner@sgi.com>
--
arch/ia64/sn/pci/pci_dma.c | 35 ++++++++++++++++++++++++++---------
include/asm-ia64/sn/io.h | 26 ++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d79ddac..754240b 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -153,7 +153,7 @@ EXPORT_SYMBOL(sn_dma_free_coherent);
* @dev: device to map for
* @cpu_addr: kernel virtual address of the region to map
* @size: size of the region
- * @direction: DMA direction
+ * @flags: DMA direction, and arch-specific attributes
*
* Map the region pointed to by @cpu_addr for DMA and return the
* DMA address.
@@ -167,17 +167,23 @@ EXPORT_SYMBOL(sn_dma_free_coherent);
* figure out how to save dmamap handle so can use two step.
*/
dma_addr_t sn_dma_map_single(struct device *dev, void *cpu_addr, size_t size,
- int direction)
+ int flags)
{
dma_addr_t dma_addr;
unsigned long phys_addr;
struct pci_dev *pdev = to_pci_dev(dev);
struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
+ int dmaflush = dma_flags_get_dmaflush(flags);

BUG_ON(dev->bus != &pci_bus_type);

phys_addr = __pa(cpu_addr);
- dma_addr = provider->dma_map(pdev, phys_addr, size, SN_DMA_ADDR_PHYS);
+ if (dmaflush)
+ dma_addr = provider->dma_map_consistent(pdev, phys_addr, size,
+ SN_DMA_ADDR_PHYS);
+ else
+ dma_addr = provider->dma_map(pdev, phys_addr, size,
+ SN_DMA_ADDR_PHYS);
if (!dma_addr) {
printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__);
return 0;
@@ -240,18 +246,20 @@ EXPORT_SYMBOL(sn_dma_unmap_sg);
* @dev: device to map fo...

To: <akepner@...>
Cc: linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Monday, August 20, 2007 - 4:24 am

Hi Arthur,

I'm a little concerned about changing the API for the dma_ foo
functions, which are defined cross platform. If you want to change
that, I think it will require updating the documentation explaining
it.

Regarding ARCH_DOES_POSTED_DMA, is that sufficient as a #define or
does it have to be run-time tested? (does it do anything at this
stage). I ask since not all ia64 platforms do posted DMA.

Cheers,
Jes
-

To: Jes Sorensen <jes@...>
Cc: linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Tuesday, August 21, 2007 - 3:35 pm

What do you think of the following? (And is there anyone else
I should be cc-ing for review?)

Document semantics of dma_flags_set_dmaflush()

Signed-off-by: Arthur Kepner <akepner@sgi.com>
--
DMA-API.txt | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index cc7a8c3..e117b72 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -392,6 +392,28 @@ Notes: You must do this:

See also dma_map_single().

+int
+dma_flags_set_dmaflush(int dir)
+
+Amend dir (one of the enum dma_data_direction values), with a platform-
+specific "dmaflush" attribute. Unless the platform supports "posted DMA"
+this is a no-op.
+
+On platforms that support posted DMA, dma_flags_set_dmaflush() causes
+device writes to the associated memory region to flush in-flight DMA.
+This can be important, for example, when (DMA) writes to the memory
+region indicate that DMA of data is complete. If DMA of data and DMA of
+the completion indication race, as they can do when the platform supports
+posted DMA, then the completion indication may arrive in host memory
+ahead of some data.
+
+To prevent this, you might map the memory region used for completion
+indications as follows:
+
+ int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL);
+ .....
+ count = dma_map_sg(dev, sglist, nents, flags);
+

Part II - Advanced dma_ usage
-----------------------------
--
Arthur

-

To: <akepner@...>
Cc: Jes Sorensen <jes@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Tuesday, August 21, 2007 - 4:16 pm

So, let me try to understand ... your hardware allows writes from the
device to pass other writes from the device? Doesn't that violate the
PCI spec? I'm thinking about this (page 43 of PCI 2.3):

Posted memory writes moving in the same direction through a bridge
will complete on the destination bus in the same order they complete
on the originating bus. Even if a single burst on the originating bus
is terminated with Disconnect on the destination bus so that it is
broken into multiple transactions, those transactions must not allow
the data phases to complete on the destination bus in any order other

So any device driver used on your hardware has to add a call to this new
function, or it'll see data corruption? Not acceptable, IMO.

If this is a performance optimisation, then by all means add a function
drivers can call to say "it's OK, I know about this brokenness, and I
don't depend on it", but safety first.

--
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
-

To: Matthew Wilcox <matthew@...>
Cc: <akepner@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 3:44 am

Hi Matthew,

Yes I believe this behavior on sn2 is 'bending' the PCI spec a bit. The
problem is that the NUMA fabric is a routed network in itself so there
can be multiple paths between the device and the physical memory it DMAs
to/from.

Cheers,
Jes
-

To: Matthew Wilcox <matthew@...>
Cc: Jes Sorensen <jes@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Tuesday, August 21, 2007 - 5:37 pm

I should have stated this more carefully, but it's not a PCI reordering
that's being addressed here, it's a reordering that can occur within
the NUMA-interconnect.

--
Arthur

-

To: <akepner@...>
Cc: Jes Sorensen <jes@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>, <James.Bottomley@...>
Date: Tuesday, August 21, 2007 - 4:05 pm

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

To: Randy Dunlap <randy.dunlap@...>
Cc: <akepner@...>, Jes Sorensen <jes@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Tuesday, August 21, 2007 - 4:55 pm

Almost every platform supports posted DMA ... its a property of most PCI

This isn't possible on most platforms. PCI write posting can only be
flushed by a read transaction on the device (or sometimes any device on
the bridge). Either this interface is misnamed and misdescribed, or it

James

-

To: James Bottomley <James.Bottomley@...>
Cc: Randy Dunlap <randy.dunlap@...>, Jes Sorensen <jes@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Tuesday, August 21, 2007 - 8:34 pm

The term "posted DMA" is used to describe this behavior in the Altix
Device Driver Writer's Guide, but it may be confusing things here.
Maybe a better term will suggest itself if I can clarify....

On Altix, DMA from a device isn't guaranteed to arrive in host memory
in the order it was sent from the device. This reordering can happen

Clearly it wasn't described adequately...

A read transaction on the device will flush pending writes to the
device. But I'm worried about DMA from the device to host memory.
On Altix, there are two mechanisms that flush all in-flight DMA
to host memory: 1) an interrupt, and 2) a write to a memory region
which has a "barrier" attribute set. Obviously option 1 isn't
viable for performance reasons. This new interface is about making
"option 2" generally available. (As it is now, the only way to get
memory with the "barrier" attribute is to allocate it with
dma_alloc_coherent().)

--
Arthur

-

To: <akepner@...>
Cc: Randy Dunlap <randy.dunlap@...>, Jes Sorensen <jes@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Tuesday, August 21, 2007 - 9:14 pm

OK, but posted DMA has a pretty specific meaning in terms of PCI, hence

Which sounds exactly what mmiowb does ... is there a need for a new API;
can't you just use mmiowb()?

James

-

To: James Bottomley <James.Bottomley@...>
Cc: <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 3:39 am

I believe it's the same problem, except this time it's when exposing
structures to userland.

Cheers,
Jes
-

To: Jes Sorensen <jes@...>
Cc: James Bottomley <James.Bottomley@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Thursday, August 23, 2007 - 1:58 am

Actually, it's a different problem, but with a similar cause.

mmiowb() is for the case PIO (or MMIO) write order from two different CPUs
can invert somewhere in the NL fabric.

read_relaxed() is a performance optimization to avoid the flush that's
necessary to avoid inversion in order between a PIO (or MMIO) read and
a DMA write.

What Arthur's trying to do here is avoid inversion in the order of two
DMA writes.

jeremy
-

To: Jes Sorensen <jes@...>
Cc: <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 10:02 am

Hmm, so how does another kernel API exposing mmiowb in a different way
help with this? Surely, if the driver is exporting something to user
space, it's simply up to the driver to call mmiowb when the user says
it's done?

James

-

To: James Bottomley <James.Bottomley@...>
Cc: Jes Sorensen <jes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 12:03 pm

mmiowb() is for PIO->device. This interface is for DMA->memory (see akepner's
other mail).

The problem is a DMA write (say to a completion queue) from a device may imply
something about another DMA write from the same device (say the actual data).
If the completion queue write arrives first (which can happen on sn2), the
driver must ensure that the rest of the outstanding DMA is complete prior to
looking at the completion queue status. It can either use a regular PIO read
to do this (i.e. a non-relaxed one) or set a flag on the completion queue DMA
address that makes it act as a barrier wrt other DMA, which is what akepner's
patch does (which should be much more efficient that using a PIO read to
guarantee DMA writes have completed).

Jesse
-

To: Jesse Barnes <jbarnes@...>
Cc: Jes Sorensen <jes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 12:44 pm

This is a violation of the PCI spec, isn't it, like Matthew pointed out?
The only time a device->host DMA transaction shouldn't follow strict
ordering is when the device sets the relaxed hint in its PCI registers.

James

-

To: James Bottomley <James.Bottomley@...>
Cc: Jes Sorensen <jes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 12:51 pm

Yeah, it is. Whether its allowed in PCIe depends on how you read the spec
(but either way it would need to be explicitly enabled).

For better or for worse, Altix hardware always behaves this way (well mostly
for the better, since most device protocols don't care as they involve PIO,
and out of order completion is *much* faster on Altix than strict ordering).

Arthur's patch is pretty straightfoward though, so unless someone can think of
a better way of hiding this architectural detail in lower level code it's
probably a good thing to add (especially given that future revs of PCIe will
probably allow this behavior, and hopefully less ambiguously than the current
spec).

Jesse
-

To: Jesse Barnes <jbarnes@...>
Cc: Jes Sorensen <jes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 1:04 pm

The spec isn't ambiguous ... it says if the device and bridge agree on
relaxed ordering, then it *may* be observed in the transaction. If
there's a disagreement or neither wishes to support relaxed ordering
then the transaction *must* be strict.

I really don't think a work around for a PCI spec violation belongs in
the generic DMA code, do you? The correct fix for this should be to set
the device hints to strict ordering, which presumably altix respects?
In which case, it sounds like what needs exposing are access to the PCI
device hints. I believe both PCI-X and PCIe have these hints as
optional specifiers in the bridges, so it should be in a current Rev of
the PCI spec. Or are you proposing adding an additional PCI API that
allows transaction flushes to be inserted into the stream for devices
and bridges that have already negotiated relaxed ordering? ... in which
case we need to take this to the PCI list.

James

-

To: James Bottomley <James.Bottomley@...>
Cc: Jes Sorensen <jes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 1:17 pm

Well, the Altix hw by itself won't honor device hints (I'm not even sure if
there are devices that honor order hints like you outline above). However,
Altix could track per-device ordering as long as arch code was called from

I think it would have to be the latter, since otherwise it would be hard to
setup just completion queue requests to be ordered.

Jesse

-

To: Jesse Barnes <jbarnes@...>
Cc: Jes Sorensen <jes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 2:13 pm

Well ... I don't have one either. However, Grant Grundler did a
presentation to OLS about relaxed ordering, and I went over it pretty
thoroughly with him a while ago. The bottom line is that the default is

OK ... I think this is definitely a PCI specific API ... and probably a
generic one for requesting order flushes in devices that have negotiated
relaxed ordering. Do you want to start a new thread on linux-pci and cc
me?

James

-

To: James Bottomley <James.Bottomley@...>
Cc: Jesse Barnes <jbarnes@...>, Jes Sorensen <jes@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 2:44 pm

I'll do this.

--
Arthur

-

To: James Bottomley <James.Bottomley@...>
Cc: Jesse Barnes <jbarnes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 1:03 pm

James,

I don't believe it respects those hints - I agree, it's a pita, but
thats the state of the situation. Even if it did, it would make
performance suck as Jesse also pointed out.

As I pointed out in my email to Willy is that the NUMA fabric is routed,
there's not one path through the system, which is what makes this
happen.

Jes
-

To: Jes Sorensen <jes@...>
Cc: Jesse Barnes <jbarnes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 2:10 pm

Hmm, didn't see the email ... but I'm probably not cc'd on all the
thread. However ... it isn't that you couldn't do it ... it's that you
don't want to do it because it's faster to violate the spec ... like all
those nice ATA devices that lie about having a cache and then let you
power down with uncommitted data still in it ... they work much faster
for HDIO tests ... and who ever switches their box off?

James

-

To: James Bottomley <James.Bottomley@...>
Cc: Jesse Barnes <jbarnes@...>, <akepner@...>, Randy Dunlap <randy.dunlap@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Thursday, August 23, 2007 - 4:45 am

James,

I didn't do it, I don't know who did it, but sure we can try and track
them down and line them up outside .....

Point is that this is how the chips were done and they are out there in
numbers. It makes things a *lot* faster to violate the spec in such a
system, yes it sucks, but thats how things are. It wouldn't be the first
time someone violated the PCI spec in their implementation and I am
pretty sure it won't be the last.

Cheers,
Jes
-

To: James Bottomley <James.Bottomley@...>
Cc: Randy Dunlap <randy.dunlap@...>, Jes Sorensen <jes@...>, linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Wednesday, August 22, 2007 - 11:54 am

No, this is different.

This problem here has do with ordering writes from the device
to host memory. Specifically this problem can be manifested with
Infiniband - when a Completion Queue Entry is written by the IB
device, it indicates that data is available in host memory. But
the write to the Completion Queue can race with DMA of data.

(Completion Queues can be allocated by the kernel or in userspace.
The race described above can only happen when they are allocated
in userspace - kernel allocations of CQs use dma_alloc_coherent()
and so get the "barrier" attribute that's needed to prevent the
race. The proposed new interface would allow CQs, or anything else,
allocated with plain old malloc(), to set the barrier attribute.)

--
Arthur

-

To: Jes Sorensen <jes@...>
Cc: linux-kernel <linux-kernel@...>, <rdreier@...>, linux-ia64 <linux-ia64@...>
Date: Monday, August 20, 2007 - 12:07 pm

I think a #define is exactly what we want here.

The newly #define-ed function (for now, and maybe forever) does
nothing except on IA64_SGI_SN2, which does posted DMA.

--
Arthur

-

Previous thread: [PATCH 1/3] dma: introduce no-op stub "dma_flags_set_dmaflush" by akepner on Friday, August 17, 2007 - 8:27 pm. (1 message)

Next thread: [PATCH 3/3] dma: use dma_flags_set_dmaflush in ib_umem_get by akepner on Friday, August 17, 2007 - 8:27 pm. (1 message)