Define ARCH_DOES_POSTED_DMA, and override the dma_flags_set_dmaflush
function. Also define dma_flags_get_direction, dma_flags_get_dmaflush
to retrieve the direction and "dmaflush attribute" from flags
passed to the sn_dma_map_* functions.Signed-off-by: Arthur Kepner <akepner@sgi.com>
--
arch/ia64/sn/pci/pci_dma.c | 35 ++++++++++++++++++++++++++---------
include/asm-ia64/sn/io.h | 26 ++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 9 deletions(-)diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d79ddac..754240b 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -153,7 +153,7 @@ EXPORT_SYMBOL(sn_dma_free_coherent);
* @dev: device to map for
* @cpu_addr: kernel virtual address of the region to map
* @size: size of the region
- * @direction: DMA direction
+ * @flags: DMA direction, and arch-specific attributes
*
* Map the region pointed to by @cpu_addr for DMA and return the
* DMA address.
@@ -167,17 +167,23 @@ EXPORT_SYMBOL(sn_dma_free_coherent);
* figure out how to save dmamap handle so can use two step.
*/
dma_addr_t sn_dma_map_single(struct device *dev, void *cpu_addr, size_t size,
- int direction)
+ int flags)
{
dma_addr_t dma_addr;
unsigned long phys_addr;
struct pci_dev *pdev = to_pci_dev(dev);
struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
+ int dmaflush = dma_flags_get_dmaflush(flags);BUG_ON(dev->bus != &pci_bus_type);
phys_addr = __pa(cpu_addr);
- dma_addr = provider->dma_map(pdev, phys_addr, size, SN_DMA_ADDR_PHYS);
+ if (dmaflush)
+ dma_addr = provider->dma_map_consistent(pdev, phys_addr, size,
+ SN_DMA_ADDR_PHYS);
+ else
+ dma_addr = provider->dma_map(pdev, phys_addr, size,
+ SN_DMA_ADDR_PHYS);
if (!dma_addr) {
printk(KERN_ERR "%s: out of ATEs\n", __FUNCTION__);
return 0;
@@ -240,18 +246,20 @@ EXPORT_SYMBOL(sn_dma_unmap_sg);
* @dev: device to map fo...
Hi Arthur,
I'm a little concerned about changing the API for the dma_ foo
functions, which are defined cross platform. If you want to change
that, I think it will require updating the documentation explaining
it.Regarding ARCH_DOES_POSTED_DMA, is that sufficient as a #define or
does it have to be run-time tested? (does it do anything at this
stage). I ask since not all ia64 platforms do posted DMA.Cheers,
Jes
-
What do you think of the following? (And is there anyone else
I should be cc-ing for review?)Document semantics of dma_flags_set_dmaflush()
Signed-off-by: Arthur Kepner <akepner@sgi.com>
--
DMA-API.txt | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+)diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index cc7a8c3..e117b72 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -392,6 +392,28 @@ Notes: You must do this:See also dma_map_single().
+int
+dma_flags_set_dmaflush(int dir)
+
+Amend dir (one of the enum dma_data_direction values), with a platform-
+specific "dmaflush" attribute. Unless the platform supports "posted DMA"
+this is a no-op.
+
+On platforms that support posted DMA, dma_flags_set_dmaflush() causes
+device writes to the associated memory region to flush in-flight DMA.
+This can be important, for example, when (DMA) writes to the memory
+region indicate that DMA of data is complete. If DMA of data and DMA of
+the completion indication race, as they can do when the platform supports
+posted DMA, then the completion indication may arrive in host memory
+ahead of some data.
+
+To prevent this, you might map the memory region used for completion
+indications as follows:
+
+ int count, flags = dma_flags_set_dmaflush(DMA_BIDIRECTIONAL);
+ .....
+ count = dma_map_sg(dev, sglist, nents, flags);
+Part II - Advanced dma_ usage
-----------------------------
--
Arthur-
So, let me try to understand ... your hardware allows writes from the
device to pass other writes from the device? Doesn't that violate the
PCI spec? I'm thinking about this (page 43 of PCI 2.3):Posted memory writes moving in the same direction through a bridge
will complete on the destination bus in the same order they complete
on the originating bus. Even if a single burst on the originating bus
is terminated with Disconnect on the destination bus so that it is
broken into multiple transactions, those transactions must not allow
the data phases to complete on the destination bus in any order otherSo any device driver used on your hardware has to add a call to this new
function, or it'll see data corruption? Not acceptable, IMO.If this is a performance optimisation, then by all means add a function
drivers can call to say "it's OK, I know about this brokenness, and I
don't depend on it", but safety first.--
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
-
Hi Matthew,
Yes I believe this behavior on sn2 is 'bending' the PCI spec a bit. The
problem is that the NUMA fabric is a routed network in itself so there
can be multiple paths between the device and the physical memory it DMAs
to/from.Cheers,
Jes
-
I should have stated this more carefully, but it's not a PCI reordering
that's being addressed here, it's a reordering that can occur within
the NUMA-interconnect.--
Arthur-
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-
Almost every platform supports posted DMA ... its a property of most PCI
This isn't possible on most platforms. PCI write posting can only be
flushed by a read transaction on the device (or sometimes any device on
the bridge). Either this interface is misnamed and misdescribed, or itJames
-
The term "posted DMA" is used to describe this behavior in the Altix
Device Driver Writer's Guide, but it may be confusing things here.
Maybe a better term will suggest itself if I can clarify....On Altix, DMA from a device isn't guaranteed to arrive in host memory
in the order it was sent from the device. This reordering can happenClearly it wasn't described adequately...
A read transaction on the device will flush pending writes to the
device. But I'm worried about DMA from the device to host memory.
On Altix, there are two mechanisms that flush all in-flight DMA
to host memory: 1) an interrupt, and 2) a write to a memory region
which has a "barrier" attribute set. Obviously option 1 isn't
viable for performance reasons. This new interface is about making
"option 2" generally available. (As it is now, the only way to get
memory with the "barrier" attribute is to allocate it with
dma_alloc_coherent().)--
Arthur-
OK, but posted DMA has a pretty specific meaning in terms of PCI, hence
Which sounds exactly what mmiowb does ... is there a need for a new API;
can't you just use mmiowb()?James
-
I believe it's the same problem, except this time it's when exposing
structures to userland.Cheers,
Jes
-
Actually, it's a different problem, but with a similar cause.
mmiowb() is for the case PIO (or MMIO) write order from two different CPUs
can invert somewhere in the NL fabric.read_relaxed() is a performance optimization to avoid the flush that's
necessary to avoid inversion in order between a PIO (or MMIO) read and
a DMA write.What Arthur's trying to do here is avoid inversion in the order of two
DMA writes.jeremy
-
Hmm, so how does another kernel API exposing mmiowb in a different way
help with this? Surely, if the driver is exporting something to user
space, it's simply up to the driver to call mmiowb when the user says
it's done?James
-
mmiowb() is for PIO->device. This interface is for DMA->memory (see akepner's
other mail).The problem is a DMA write (say to a completion queue) from a device may imply
something about another DMA write from the same device (say the actual data).
If the completion queue write arrives first (which can happen on sn2), the
driver must ensure that the rest of the outstanding DMA is complete prior to
looking at the completion queue status. It can either use a regular PIO read
to do this (i.e. a non-relaxed one) or set a flag on the completion queue DMA
address that makes it act as a barrier wrt other DMA, which is what akepner's
patch does (which should be much more efficient that using a PIO read to
guarantee DMA writes have completed).Jesse
-
This is a violation of the PCI spec, isn't it, like Matthew pointed out?
The only time a device->host DMA transaction shouldn't follow strict
ordering is when the device sets the relaxed hint in its PCI registers.James
-
Yeah, it is. Whether its allowed in PCIe depends on how you read the spec
(but either way it would need to be explicitly enabled).For better or for worse, Altix hardware always behaves this way (well mostly
for the better, since most device protocols don't care as they involve PIO,
and out of order completion is *much* faster on Altix than strict ordering).Arthur's patch is pretty straightfoward though, so unless someone can think of
a better way of hiding this architectural detail in lower level code it's
probably a good thing to add (especially given that future revs of PCIe will
probably allow this behavior, and hopefully less ambiguously than the current
spec).Jesse
-
The spec isn't ambiguous ... it says if the device and bridge agree on
relaxed ordering, then it *may* be observed in the transaction. If
there's a disagreement or neither wishes to support relaxed ordering
then the transaction *must* be strict.I really don't think a work around for a PCI spec violation belongs in
the generic DMA code, do you? The correct fix for this should be to set
the device hints to strict ordering, which presumably altix respects?
In which case, it sounds like what needs exposing are access to the PCI
device hints. I believe both PCI-X and PCIe have these hints as
optional specifiers in the bridges, so it should be in a current Rev of
the PCI spec. Or are you proposing adding an additional PCI API that
allows transaction flushes to be inserted into the stream for devices
and bridges that have already negotiated relaxed ordering? ... in which
case we need to take this to the PCI list.James
-
Well, the Altix hw by itself won't honor device hints (I'm not even sure if
there are devices that honor order hints like you outline above). However,
Altix could track per-device ordering as long as arch code was called fromI think it would have to be the latter, since otherwise it would be hard to
setup just completion queue requests to be ordered.Jesse
-
Well ... I don't have one either. However, Grant Grundler did a
presentation to OLS about relaxed ordering, and I went over it pretty
thoroughly with him a while ago. The bottom line is that the default isOK ... I think this is definitely a PCI specific API ... and probably a
generic one for requesting order flushes in devices that have negotiated
relaxed ordering. Do you want to start a new thread on linux-pci and cc
me?James
-
I'll do this.
--
Arthur-
James,
I don't believe it respects those hints - I agree, it's a pita, but
thats the state of the situation. Even if it did, it would make
performance suck as Jesse also pointed out.As I pointed out in my email to Willy is that the NUMA fabric is routed,
there's not one path through the system, which is what makes this
happen.Jes
-
Hmm, didn't see the email ... but I'm probably not cc'd on all the
thread. However ... it isn't that you couldn't do it ... it's that you
don't want to do it because it's faster to violate the spec ... like all
those nice ATA devices that lie about having a cache and then let you
power down with uncommitted data still in it ... they work much faster
for HDIO tests ... and who ever switches their box off?James
-
James,
I didn't do it, I don't know who did it, but sure we can try and track
them down and line them up outside .....Point is that this is how the chips were done and they are out there in
numbers. It makes things a *lot* faster to violate the spec in such a
system, yes it sucks, but thats how things are. It wouldn't be the first
time someone violated the PCI spec in their implementation and I am
pretty sure it won't be the last.Cheers,
Jes
-
No, this is different.
This problem here has do with ordering writes from the device
to host memory. Specifically this problem can be manifested with
Infiniband - when a Completion Queue Entry is written by the IB
device, it indicates that data is available in host memory. But
the write to the Completion Queue can race with DMA of data.(Completion Queues can be allocated by the kernel or in userspace.
The race described above can only happen when they are allocated
in userspace - kernel allocations of CQs use dma_alloc_coherent()
and so get the "barrier" attribute that's needed to prevent the
race. The proposed new interface would allow CQs, or anything else,
allocated with plain old malloc(), to set the barrier attribute.)--
Arthur-
I think a #define is exactly what we want here.
The newly #define-ed function (for now, and maybe forever) does
nothing except on IA64_SGI_SN2, which does posted DMA.--
Arthur-
| Thomas Gleixner | Re: Linux 2.6.21-rc1 |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| James Bottomley | [Ksummit-2008-discuss] Fixing the Kernel Janitors project |
| James Morris | Re: LSM conversion to static interface |
git: | |
| Natalie Protasevich | [BUG] New Kernel Bugs |
| Christoph Hellwig | Re: [PATCH 06/32] IGET: Mark iget() and read_inode() as being obsolete [try #2] |
| Linus Torvalds | Re: [GIT]: Networking |
| Jarek Poplawski | [PATCH take 2] pkt_sched: Protect gen estimators under est_lock. |
