Re: [PATCHv8 00/12] Contiguous Memory Allocator

Previous thread: [PATCHv8 03/12] lib: genalloc: Generic allocator improvements by Michal Nazarewicz on Wednesday, December 15, 2010 - 1:34 pm. (1 message)

Next thread: [PATCH 0/5] UAS updates by Matthew Wilcox on Wednesday, December 15, 2010 - 1:41 pm. (6 messages)
From: Michal Nazarewicz
Date: Wednesday, December 15, 2010 - 1:34 pm

Hello everyone,

This is yet another version of CMA, this time stripped of a lot of
code and with a working migration implementation.

   The Contiguous Memory Allocator (CMA) makes it possible for
   device drivers to allocate big contiguous chunks of memory after
   the system has booted.

For more information see the 7th patch in the set.


This version fixes some things Kamezawa suggested plus it separates
code that uses MIGRATE_CMA from the rest of the code.  This I hope
will help to grasp the overall idea of CMA.


The current version is just an allocator that handles allocation of
contiguous memory blocks.  The differences between this patchset and
Kamezawa's alloc_contig_pages() are:

1. alloc_contig_pages() requires MAX_ORDER alignment of allocations
   which may be unsuitable for embedded systems where a few MiBs are
   required.

   Lack of the requirement on the alignment means that several threads
   might try to access the same pageblock/page.  To prevent this from
   happening CMA uses a mutex so that only one cm_alloc()/cm_free()
   function may run at one point.

2. CMA may use its own migratetype (MIGRATE_CMA) which behaves
   similarly to ZONE_MOVABLE but can be put in arbitrary places.

   This is required for us since we need to define two disjoint memory
   ranges inside system RAM.  (i.e. in two memory banks (do not confuse
   with nodes)).

3. alloc_contig_pages() scans memory in search of a range that could be
   migrated.  CMA on the other hand maintains its own allocator to
   decide where to allocate memory for device drivers and then tries
   to migrate pages from that part if needed.  This is not strictly
   required but I somehow feel it might be faster.
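
Points 1 and 3 above can be illustrated with a minimal user-space sketch.
It is only a model under stated assumptions, not the allocator from the
patch set: the names region_alloc(), region_free() and NR_PAGES are
hypothetical, the reserved region is tracked with one flag per page, and
page migration is reduced to a comment.  A single mutex serializes
allocation and freeing, as described for cm_alloc()/cm_free().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a reserved region of NR_PAGES pages, one flag per page. */
#define NR_PAGES 64

static bool page_used[NR_PAGES];
static pthread_mutex_t region_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Return the index of the first page of a free run of `count` pages whose
 * start is a multiple of `align` pages, or -1 if no such run exists.
 * The alignment is arbitrary here; it is not tied to MAX_ORDER.
 */
static long region_alloc(size_t count, size_t align)
{
    long found = -1;

    if (count == 0 || count > NR_PAGES || align == 0 || align > NR_PAGES)
        return -1;

    /* Only one allocation or free may run at a time. */
    pthread_mutex_lock(&region_lock);
    for (size_t start = 0; start + count <= NR_PAGES; start += align) {
        size_t i = 0;

        while (i < count && !page_used[start + i])
            i++;
        if (i == count) {
            /* A real implementation would first try to migrate any
             * movable pages out of this range. */
            for (i = 0; i < count; i++)
                page_used[start + i] = true;
            found = (long)start;
            break;
        }
    }
    pthread_mutex_unlock(&region_lock);
    return found;
}

/* Give back a run previously returned by region_alloc(). */
static void region_free(long start, size_t count)
{
    assert(start >= 0 && (size_t)start + count <= NR_PAGES);

    pthread_mutex_lock(&region_lock);
    for (size_t i = 0; i < count; i++)
        page_used[(size_t)start + i] = false;
    pthread_mutex_unlock(&region_lock);
}
```

In this model the allocator picks the range first and only then deals
with whatever currently occupies it, which is the ordering point 3
describes; a scan for already-migratable ranges would invert that order.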


Links to previous versions of the patchset:
v7: <http://article.gmane.org/gmane.linux.kernel.mm/55626>
v6: <http://article.gmane.org/gmane.linux.kernel.mm/55626>
v5: (intentionally left out as CMA v5 was identical to CMA v4)
v4: <http://article.gmane.org/gmane.linux.kernel.mm/52010>
v3: ...
From: Kyungmin Park
Date: Thursday, December 23, 2010 - 2:30 am

Hi Andrew,

Any comments? What's the next step to merge it for the 2.6.38 kernel? We
want to use this feature in the mainline kernel.

Any ideas and comments are welcome.

Thank you,
Kyungmin Park

On Thu, Dec 16, 2010 at 5:34 AM, Michal Nazarewicz
--

From: Russell King - ARM Linux
Date: Thursday, December 23, 2010 - 3:06 am

Has anyone addressed my issue with it that this is wide-open for
abuse by allocating large chunks of memory, and then remapping
them in some way with different attributes, thereby violating the
ARM architecture specification?

In other words, do we _actually_ have a use for this which doesn't
involve doing something like allocating 32MB of memory from it,
remapping it so that it's DMA coherent, and then performing DMA
on the resulting buffer?
--

From: Marek Szyprowski
Date: Thursday, December 23, 2010 - 3:58 am

Hello,


Actually this contiguous memory allocator is a better replacement for
alloc_pages() which is used by dma_alloc_coherent(). It is a generic

This is an ARM-specific problem, also related to the dma_alloc_coherent()
allocator. To be 100% conformant with the ARM specification we would
probably need to unmap all pages used by the dma_coherent allocator
from the LOW MEM area. This is doable, but completely not related
to the CMA and this patch series.

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center


--

From: Russell King - ARM Linux
Date: Thursday, December 23, 2010 - 5:19 am

... which is open to abuse.  What I'm trying to find out is - if it
can't be used for DMA, what is it to be used for?


You've already been told why we can't unmap pages from the kernel
direct mapping.

Okay, so I'm just going to assume that CMA has _no_ _business_ being
used on ARM, and is not something that should interest anyone in the
ARM community.
--

From: Marek Szyprowski
Date: Thursday, December 23, 2010 - 6:09 am

Hello,


We are trying to get something that really works and SOLVES some of the

It requires some amount of work but I see no reason why we shouldn't be
able to unmap those pages to stay 100% conformant with the ARM spec.

Please notice that there are also use cases where the memory will not be
accessed by the CPU at all (like DMA transfers between multimedia devices

Go ahead! Remember to remove dma_coherent because it also breaks the spec. :)
Oh, I forgot. We can also remove all device drivers that might use DMA. :)



Merry Christmas and Happy New Year for everyone! :)

Best regards
--
Marek Szyprowski
Samsung Poland R&D Center

--

From: Russell King - ARM Linux
Date: Thursday, December 23, 2010 - 6:44 am

I have considered - and tried - to do that with the dma_alloc_coherent()
spec, but it is NOT POSSIBLE to do so - too many factors stand in the
way of making it work, such as the need to bring the system to a complete
halt to modify all the L1 page tables and broadcast the TLB operations
to invalidate the old mappings.  None of that can be done from all the

Rubbish - if you think that, then you have very little understanding of
modern CPUs.  Modern CPUs speculatively access _any_ memory which is
visible to them, and as the ARM architecture progresses, the speculative
prefetching will become more aggressive.  So if you have memory mapped
in the kernel direct map, then you _have_ to assume that the CPU will

The only solution I've come up with for dma_alloc_coherent() is to reserve
the entire coherent DMA region at boot time, taking it out of the
kernel's view of available memory and thereby preventing it from ever
being mapped or the kernel using that memory for any other purpose.
That's about the best we can realistically do for ARM to conform to the
spec.

Every time I've brought this issue up with you, you've brushed it aside.
So if you feel that the right thing to do is to ignore such issues, you
won't be surprised if I keep opposing your efforts to get this into
mainline.

If you're serious about making this work, then provide some proper code
which shows how to use this for DMA on ARM systems without violating
the architecture specification.  Until you do, I see no hope that CMA
will ever be suitable for use on ARM.
--

From: Tomasz Fujak
Date: Thursday, December 23, 2010 - 6:35 am

Dear Mr. King,

AFAIK the CMA is the fourth attempt since 2008 taken to solve the
multimedia memory allocation issue on some embedded devices. Most
notably on ARM, that happens to be present in the SoCs we care about
along with the IOMMU-incapable multimedia IPs.

I understand that you have your guidelines taken from the ARM
specification, but this approach is not helping us. The mainline kernel
is server- and desktop-centric for various reasons I am not going to
dwell on. We're trying hard to solve the physical memory fragmentation
issue for some time now, only to hear "this is not acceptable, go
somewhere else". So we did - the CMA is targeted towards mm, NOT the
ARM. While I do not exactly know how you see your role in ARM kernel
development, we have shown a few times that this issue is important for
us, and we'd like to solve it. So if you could give a glimpse of what is
acceptable, given the existing circumstances, we could possibly help
developing that solution. Namely:
1. ARM-compatible SoC
2. Multimedia IP blocks requiring large amounts of contiguous memory
3. No IOMMU or SG in said blocks
4. Unused memory reserved for said multimedia drivers should be used by
the kernel
5. Multimedia allocation scenarios must always be working (under some
constraints of course), within sane time limit
6. The solution shall have minimal delta to upstream linux (none?)

While the obvious CMA uses are the ones you'd mostly like to avoid, we
haven't tried to post anything like that along.
This way no obvious spec abuse is made, and we minimize the delta to the
upstream - it's even better than current state, when you have dma
coherent memory doing exactly what you claim is forbidden (unpredictable
results could possibly happen).

As the feedback from the first CMA patches confirms, the issue we're
trying to solve here is real. Yet no real solution exists to my
knowledge. I understand the ARM holding may try to just wait till all the
relevant chips do have an IOMMU, but here and now there ...
From: Russell King - ARM Linux
Date: Thursday, December 23, 2010 - 6:48 am

I'm sorry you feel like that, but I'm living in reality.  If we didn't
have these architecture restrictions then we wouldn't have this problem
in the first place.

What I'm trying to do here is to ensure that we remain _legal_ to the
architecture specification - which for this issue means that we avoid
corrupting people's data.

Maybe you like having a system which randomly corrupts people's data?
I most certainly don't.  But that's the way CMA is heading at the moment
on ARM.

It is not up to me to solve these problems - that's for the proposer of
the new API to do so.  So, please, don't try to lump this problem on
my shoulders.  It's not my problem to sort out.
--

From: Tomasz Fujak
Date: Thursday, December 23, 2010 - 7:04 am

Has this been experienced? I had some ARM-compatible boards on my desk
(xscale, v6 and v7) and none of them crashed due to this behavior. And
Just great. Nothing short of spectacular - this way the IA32 is going to
take the embedded market piece by piece once the big two advance their
foundry processes.
Despite having the translator, so much burden in the legacy ISA and the
fact that most of the embedded engineers from the high end are
accustomed to the ARM.

In other words, should we take your response as yet another NAK?
Or would you try harder and at least point us to some direction that
would not doom the effort from the very beginning.
I understand that the role of an oracle is so much easier, but the time
is running and devising subsequent solutions is not the use of
engineers' time.

Best regards
---

--

From: Russell King - ARM Linux
Date: Thursday, December 23, 2010 - 7:16 am

Yes.  We have seen CPUs which lockup or crash as a result of mismatched

See my other comment in an earlier email.  See the patch which prevents
ioremap() being used on system memory.  There is active movement at the
present time towards sorting these violations out and finding solutions for
them.


Xscale doesn't suffer from the problem.  V6 doesn't aggressively speculate.

Look, I've been pointing out this problem ever since the very _first_
CMA patches were posted to the list, yet the CMA proponents have decided
to brush those problems aside each and every time I've raised them.

So, you should be asking _why_ the CMA proponents are choosing to ignore
this issue completely, rather than working to resolve it.


What the fsck do you think I've been doing?  This is NOT THE FIRST time
I've raised this issue.  I gave up raising it after the first couple
of attempts because I wasn't being listened to.

You say about _me_ not being very helpful.  How about the CMA proponents
start taking the issue I've raised seriously, and try to work out how
to solve it?  And how about blaming them for the months of wasted time
on this issue _because_ _they_ have chosen to ignore it?
--

From: Felipe Contreras
Date: Thursday, December 23, 2010 - 7:42 am

On Thu, Dec 23, 2010 at 4:16 PM, Russell King - ARM Linux

I've also raised the issue for ARM. However, I don't see what is the
big problem.

A generic solution (that I think I already proposed) would be to
reserve a chunk of memory for the CMA that can be removed from the
normally mapped kernel memory through memblock at boot time. The size
of this memory region would be configurable through kconfig. Then, the
CMA would have a "dma" flag or something, and take chunks out of it
until there's no more, and then return errors. That would work for
ARM.
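
The proposal above could be modelled roughly as follows.  This is a
user-space sketch under stated assumptions, not code from any posted
patch: struct reserved_pool and pool_take() are hypothetical names, the
pool is described by offsets rather than real memory, and POOL_SIZE
stands in for whatever a Kconfig option would provide.  Chunks are
carved off sequentially until the pool is exhausted, after which the
call returns an error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for a size that would come from Kconfig (assumption). */
#define POOL_SIZE ((size_t)16 * 1024 * 1024)

/* Hypothetical descriptor of a region removed from the kernel's view
 * of memory at boot time. */
struct reserved_pool {
    size_t size;   /* total bytes reserved */
    size_t next;   /* offset of the first byte not yet handed out */
};

/*
 * Carve `len` bytes aligned to `align` (a power of two) out of the pool.
 * On success store the chunk's offset in *out and return 0.
 * Return -EINVAL for a bad request and -ENOMEM once the pool is used up.
 */
static int pool_take(struct reserved_pool *p, size_t len, size_t align,
                     size_t *out)
{
    size_t start;

    assert(p != NULL && out != NULL);

    if (len == 0 || align == 0 || (align & (align - 1)) != 0)
        return -EINVAL;

    /* Round the next free offset up to the requested alignment. */
    start = (p->next + align - 1) & ~(align - 1);
    if (start > p->size || len > p->size - start)
        return -ENOMEM;

    p->next = start + len;
    *out = start;
    return 0;
}
```

A pool initialised as { .size = POOL_SIZE, .next = 0 } never gives
memory back in this sketch, which also makes the drawback visible:
whatever is reserved is unavailable to the rest of the system.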

Cheers.

-- 
Felipe Contreras
--

From: Michal Nazarewicz
Date: Thursday, December 23, 2010 - 8:02 am

Having exactly that usage in mind, in v8 I've added the notion of private
CMA contexts which can be used for DMA coherent RAM as well as memory

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--
--

From: David Brown
Date: Thursday, December 23, 2010 - 11:04 am

That sounds an awful lot like the Android kernel's pmem implementation.

Solving this problem is important for us as well, but, I'm not sure I
see a better solution than something like Felipe suggests.

The disadvantage, of course, being that the memory isn't available for
the system when the user isn't doing the multi-media.

David

-- 
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
--

From: Michal Nazarewicz
Date: Thursday, December 23, 2010 - 6:41 am

Huge pages.

Also, don't treat it as coherent memory and just flush/clear/invalidate
cache before and after each DMA transaction.  I never understood what's
wrong with that approach.

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--
--

From: Russell King - ARM Linux
Date: Thursday, December 23, 2010 - 6:51 am

If you've ever used an ARM system with a VIVT cache, you'll know what's
wrong with this approach.

ARM systems with VIVT caches have extremely poor task switching
performance because they flush the entire data cache at every task switch
- to the extent that it makes system performance drop dramatically when
they become loaded.

Doing that for every DMA operation will kill the advantage we've gained
from having VIPT caches and ASIDs stone dead.
--

From: Tomasz Fujak
Date: Thursday, December 23, 2010 - 7:08 am

This statement effectively means: don't map dma-able memory to the CPU

--

From: Russell King - ARM Linux
Date: Thursday, December 23, 2010 - 7:20 am

I'll give you another solution to the problem - lobby ARM Ltd to have
this restriction lifted from the architecture specification, which
will probably result in the speculative prefetching also having to be
removed.

That would be my preferred solution if I had the power to do so, but
I have to live with what ARM Ltd (and their partners such as yourselves)
decide should end up in the architecture specification.
--

From: Tomasz Fujak
Date: Thursday, December 23, 2010 - 8:35 am

Isn't disabling Speculative Accesses forwarding to the AXI bus the
solution to our woes?
At least on the A8, which happens to be paired with non-IOMMU capable
IPs on our SoCs.
On A9 the bit is gone (or has it moved?), but we have IOMMU here so the
CMA isn't needed.

http://infocenter.arm.com/
Cortex-A8 Technical Reference Manual    Revision: r3p2
3.2.26. c1, Auxiliary Control Register

--

From: Michał Nazarewicz
Date: Tuesday, January 4, 2011 - 9:59 am

IIRC both apply.
--

From: Johan MOSSBERG
Date: Tuesday, January 4, 2011 - 9:23 am

I seem to have missed the previous discussion about this issue.
Where in the specification (preferably ARMv7) can I find
information about this? Is the problem that it is simply
forbidden to map an address multiple times with different cache
setting and if this is done the hardware might start failing? Or
is the problem that having an address mapped cached means that
speculative pre-fetch can read it into the cache at any time,
possibly causing problems if an un-cached mapping exists? In my
opinion option number two can be handled and I've made an attempt
at doing that in hwmem (posted on linux-mm a while ago), look in
cache_handler.c. Hwmem currently does not use cma but the next
version probably will.

/Johan Mossberg
--

From: Russell King - ARM Linux
Date: Tuesday, January 4, 2011 - 10:19 am

Here are the extracts from the architecture reference manual:

* If the same memory locations are marked as having different
  cacheability attributes, for example by the use of aliases in a
  virtual to physical address mapping, behavior is UNPREDICTABLE.

A3.5.7 Memory access restrictions

Behavior is UNPREDICTABLE if the same memory location:
* is marked as Shareable Normal and Non-shareable Normal
* is marked as having different memory types (Normal, Device, or
  Strongly-ordered)
* is marked as having different cacheability attributes
* is marked as being Shareable Device and Non-shareable Device memory.

Such memory marking contradictions can occur, for example, by the use of
aliases in a virtual to physical address mapping.

Glossary:
UNPREDICTABLE
Means the behavior cannot be relied upon. UNPREDICTABLE behavior must not
represent security holes.  UNPREDICTABLE behavior must not halt or hang
the processor, or any parts of the system. UNPREDICTABLE behavior must not

Given the extract from the architecture reference manual, do you want
to run a system where you can't predict what the behaviour will be if
you have two mappings present, one which is cacheable and one which is
non-cacheable, and you're relying on the non-cacheable mapping to never
return data from the cache?

What if during your testing, it appears to work correctly, but out in
the field, someone's loaded a different application to your setup
resulting in different memory access patterns, causing cache lines to
appear in the non-cacheable mapping, and then the CPU hits them on
subsequent accesses corrupting data...

You can't say that will never happen if you're relying on this
unpredictable behaviour.
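
The restriction quoted from A3.5.7 can be expressed as a simple check
over a set of mappings.  The sketch below is a simplified model, not
kernel code: struct mapping, enum mem_type and has_unpredictable_alias()
are hypothetical names, and the attributes are reduced to the three
properties named in the extract (memory type, cacheability and
shareability).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum mem_type { MT_NORMAL, MT_DEVICE, MT_STRONGLY_ORDERED };

/* Hypothetical description of one virtual mapping of physical memory. */
struct mapping {
    uint64_t phys_start;   /* first physical address covered */
    uint64_t phys_len;     /* length in bytes */
    enum mem_type type;
    bool cacheable;
    bool shareable;
};

/* True if the two mappings cover at least one common physical byte. */
static bool overlaps(const struct mapping *a, const struct mapping *b)
{
    return a->phys_start < b->phys_start + b->phys_len &&
           b->phys_start < a->phys_start + a->phys_len;
}

/* True if the attribute pair falls under one of the quoted bullets. */
static bool attrs_conflict(const struct mapping *a, const struct mapping *b)
{
    if (a->type != b->type)
        return true;               /* different memory types */
    if (a->cacheable != b->cacheable)
        return true;               /* different cacheability attributes */
    if (a->type != MT_STRONGLY_ORDERED && a->shareable != b->shareable)
        return true;               /* Shareable vs Non-shareable */
    return false;
}

/* True if any two mappings alias the same memory with conflicting attributes. */
static bool has_unpredictable_alias(const struct mapping *maps, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (overlaps(&maps[i], &maps[j]) &&
                attrs_conflict(&maps[i], &maps[j]))
                return true;
    return false;
}
```

Under this model, a cacheable direct mapping of a range together with a
non-cacheable remapping of part of that range is reported as a conflict,
which is the case under discussion.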
--

From: Santosh Shilimkar
Date: Tuesday, January 4, 2011 - 10:31 am

Just to add to Russell's point, we did land up in un-traceable
CPU deadlocks while running the kernel which was violating some of
the rules set by ARM ARM.
The use case used to work ~98% of the time.

Regards,
Santosh
--
