Hello everyone,

This is yet another version of CMA, this time stripped of a lot of code and with a working migration implementation.

The Contiguous Memory Allocator (CMA) makes it possible for device drivers to allocate big contiguous chunks of memory after the system has booted. For more information see the 7th patch in the set.

This version fixes some things Kamezawa suggested, plus it separates the code that uses MIGRATE_CMA from the rest of the code. This, I hope, will help to grasp the overall idea of CMA.

The current version is just an allocator that handles allocation of contiguous memory blocks. The differences between this patchset and Kamezawa's alloc_contig_pages() are:

1. alloc_contig_pages() requires MAX_ORDER alignment of allocations, which may be unsuitable for embedded systems where a few MiBs are required. Lack of the requirement on the alignment means that several threads might try to access the same pageblock/page. To prevent this from happening CMA uses a mutex so that only one cm_alloc()/cm_free() function may run at one point.

2. CMA may use its own migratetype (MIGRATE_CMA) which behaves similarly to ZONE_MOVABLE but can be put in arbitrary places. This is required for us since we need to define two disjoint memory ranges inside system RAM (i.e. in two memory banks; do not confuse these with nodes).

3. alloc_contig_pages() scans memory in search of a range that could be migrated. CMA on the other hand maintains its own allocator to decide where to allocate memory for device drivers and then tries to migrate pages from that part if needed. This is not strictly required but I somehow feel it might be faster.

Links to previous versions of the patchset:
v7: <http://article.gmane.org/gmane.linux.kernel.mm/55626>
v6: <http://article.gmane.org/gmane.linux.kernel.mm/55626>
v5: (intentionally left out as CMA v5 was identical to CMA v4)
v4: <http://article.gmane.org/gmane.linux.kernel.mm/52010>
v3: ...
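To make the serialization in point 1 concrete, here is a minimal userspace sketch, not the code from the patchset. It assumes a fixed region tracked with a per-page "used" flag and a first-fit search; the names sketch_alloc(), sketch_free() and REGION_PAGES are hypothetical, and the page migration step that CMA performs is not modelled at all. The only thing it illustrates is that one mutex covers both the allocation and the free path, so two callers can never work on the same pages at once, even though allocations carry no alignment requirement.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define REGION_PAGES 64	/* hypothetical region size, in pages */

/* One flag per page of the region; true means the page is handed out. */
static bool page_used[REGION_PAGES];

/* Single lock shared by the allocation and the free path. */
static pthread_mutex_t region_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * First-fit search for npages consecutive free pages.
 * Returns the index of the first page, or -1 if no run is long enough.
 */
long sketch_alloc(size_t npages)
{
	long found = -1;
	size_t run = 0;

	if (npages == 0 || npages > REGION_PAGES)
		return -1;

	pthread_mutex_lock(&region_lock);
	for (size_t i = 0; i < REGION_PAGES; i++) {
		/* Length of the free run ending at page i. */
		run = page_used[i] ? 0 : run + 1;
		if (run == npages) {
			size_t start = i + 1 - npages;

			for (size_t j = start; j <= i; j++)
				page_used[j] = true;
			found = (long)start;
			break;
		}
	}
	pthread_mutex_unlock(&region_lock);
	return found;
}

/* Give back a range previously returned by sketch_alloc(). */
void sketch_free(long start, size_t npages)
{
	assert(start >= 0 && (size_t)start + npages <= REGION_PAGES);

	pthread_mutex_lock(&region_lock);
	for (size_t j = 0; j < npages; j++)
		page_used[(size_t)start + j] = false;
	pthread_mutex_unlock(&region_lock);
}
```

A single coarse lock is the simplest way to get this guarantee; whether it is fast enough depends on how often drivers allocate, which this sketch does not try to answer.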
Hi Andrew, any comments? What is the next step to get this merged for the 2.6.38 kernel? We want to use this feature in the mainline kernel. Any ideas and comments are welcome. Thank you, Kyungmin Park On Thu, Dec 16, 2010 at 5:34 AM, Michal Nazarewicz --
Has anyone addressed my issue with it that this is wide-open for abuse by allocating large chunks of memory, and then remapping them in some way with different attributes, thereby violating the ARM architecture specification? In other words, do we _actually_ have a use for this which doesn't involve doing something like allocating 32MB of memory from it, remapping it so that it's DMA coherent, and then performing DMA on the resulting buffer? --
Hello, Actually this contiguous memory allocator is a better replacement for alloc_pages() which is used by dma_alloc_coherent(). It is a generic This is an arm specific problem, also related to dma_alloc_coherent() allocator. To be 100% conformant with ARM specification we would probably need to unmap all pages used by the dma_coherent allocator from the LOW MEM area. This is doable, but completely not related to the CMA and this patch series. Best regards -- Marek Szyprowski Samsung Poland R&D Center --
... which is open to abuse. What I'm trying to find out is - if it can't be used for DMA, what is it to be used for? You've already been told why we can't unmap pages from the kernel direct mapping. Okay, so I'm just going to assume that CMA has _no_ _business_ being used on ARM, and is not something that should interest anyone in the ARM community. --
Hello, We are trying to get something that really works and SOLVES some of the It requires some amount of work but I see no reason why we shouldn't be able to unmap those pages to stay 100% conformant with ARM spec. Please notice that there are also use cases where the memory will not be accessed by the CPU at all (like DMA transfers between multimedia devices Go ahead! Remember to remove dma_coherent because it also breaks the spec. :) Oh, I forgot. We can also remove all device drivers that might use DMA. :) Merry Christmas and Happy New Year to everyone! :) Best regards -- Marek Szyprowski Samsung Poland R&D Center --
I have considered - and tried - to do that with the dma_alloc_coherent() spec, but it is NOT POSSIBLE to do so - too many factors stand in the way of making it work, such as the need to bring the system to a complete halt to modify all the L1 page tables and broadcast the TLB operations to invalidate the old mappings. None of that can be done from all the Rubbish - if you think that, then you have very little understanding of modern CPUs. Modern CPUs speculatively access _any_ memory which is visible to them, and as the ARM architecture progresses, the speculative prefetching will become more aggressive. So if you have memory mapped in the kernel direct map, then you _have_ to assume that the CPU will The only solution I've come up with for dma_alloc_coherent() is to reserve the entire coherent DMA region at boot time, taking it out of the kernel's view of available memory and thereby preventing it from ever being mapped or the kernel using that memory for any other purpose. That's about the best we can realistically do for ARM to conform to the spec. Every time I've brought this issue up with you, you've brushed it aside. So if you feel that the right thing to do is to ignore such issues, you won't be surprised if I keep opposing your efforts to get this into mainline. If you're serious about making this work, then provide some proper code which shows how to use this for DMA on ARM systems without violating the architecture specification. Until you do, I see no hope that CMA will ever be suitable for use on ARM. --
Dear Mr. King, AFAIK the CMA is the fourth attempt since 2008 taken to solve the multimedia memory allocation issue on some embedded devices. Most notably on ARM, that happens to be present in the SoCs we care about along with the IOMMU-incapable multimedia IPs. I understand that you have your guidelines taken from the ARM specification, but this approach is not helping us. The mainline kernel is server- and desktop-centric for various reasons I am not going to go into. We've been trying hard to solve the physical memory fragmentation issue for some time now, only to hear "this is not acceptable, go somewhere else". So we did - the CMA is targeted towards mm, NOT the ARM. While I do not exactly know how you see your role in ARM kernel development, we have shown a few times that this issue is important for us, and we'd like to solve it. So if you could give a glimpse of what is acceptable, given the existing circumstances, we could possibly help develop that solution. Namely:
1. ARM-compatible SoC
2. Multimedia IP blocks requiring large amounts of contiguous memory
3. No IOMMU or SG in said blocks
4. Unused memory reserved for said multimedia drivers should be used by the kernel
5. Multimedia allocation scenarios must always be working (under some constraints of course), within sane time limit
6. The solution shall have minimal delta to upstream linux (none?)
While the obvious CMA uses are the ones you'd mostly like to avoid, we haven't tried to post anything like that along. This way no obvious spec abuse is made, and we minimize the delta to the upstream - it's even better than current state, when you have dma coherent memory doing exactly what you claim is forbidden (unpredictable results could possibly happen). As the feedback from the first CMA patches confirms, the issue we're trying to solve here is real. Yet no real solution exists to my knowledge.
I understand the ARM holding my try to just wait till all the relevant chips do have an IOMMU, but here and now there ...
I'm sorry you feel like that, but I'm living in reality. If we didn't have these architecture restrictions then we wouldn't have this problem in the first place. What I'm trying to do here is to ensure that we remain _legal_ to the architecture specification - which for this issue means that we avoid corrupting people's data. Maybe you like having a system which randomly corrupts people's data? I most certainly don't. But that's the way CMA is heading at the moment on ARM. It is not up to me to solve these problems - that's for the proposer of the new API to do so. So, please, don't try to lump this problem on my shoulders. It's not my problem to sort out. --
Has this been experienced? I had some ARM-compatible boards on my desk (xscale, v6 and v7) and none of them crashed due to this behavior. And Just great. Nothing short of spectacular - this way the IA32 is going to take the embedded market piece by piece once the big two advance their foundry processes. Despite having the translator, so much burden in the legacy ISA and the fact that most of the embedded engineers from the high end are accustomed to the ARM. In other words, should we take your response as yet another NAK? Or would you try harder and at least point us to some direction that would not doom the effort from the very beginning. I understand that the role of an oracle is so much easier, but the time is running and devising subsequent solutions is not the use of engineers' time. Best regards --- --
Yes. We have seen CPUs which lock up or crash as a result of mismatched See my other comment in an earlier email. See the patch which prevents ioremap() being used on system memory. There is active movement at the present time to sort these violations out and find solutions for them. Xscale doesn't suffer from the problem. V6 doesn't aggressively speculate. Look, I've been pointing out this problem ever since the very _first_ CMA patches were posted to the list, yet the CMA proponents have decided to brush those problems aside each and every time I've raised them. So, you should be asking _why_ the CMA proponents are choosing to ignore this issue completely, rather than working to resolve it. What the fsck do you think I've been doing? This is NOT THE FIRST time I've raised this issue. I gave up raising it after the first couple of attempts because I wasn't being listened to. You say about _me_ not being very helpful. How about the CMA proponents start taking the issue I've raised seriously, and try to work out how to solve it? And how about blaming them for the months of wasted time on this issue _because_ _they_ have chosen to ignore it? --
On Thu, Dec 23, 2010 at 4:16 PM, Russell King - ARM Linux I've also raised the issue for ARM. However, I don't see what is the big problem. A generic solution (that I think I already proposed) would be to reserve a chunk of memory for the CMA that can be removed from the normally mapped kernel memory through memblock at boot time. The size of this memory region would be configurable through kconfig. Then, the CMA would have a "dma" flag or something, and take chunks out of it until there's no more, and then return errors. That would work for ARM. Cheers. -- Felipe Contreras --
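For illustration only, the "take chunks out of it until there's no more, and then return errors" part of this proposal could look roughly like the userspace sketch below. Everything in it is an assumption made for the example: POOL_SIZE stands in for the Kconfig-configured size, pool_take() is a hypothetical name, the pool is modelled as a plain offset counter rather than memory actually removed via memblock, and chunks are never given back.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical pool size; in the proposal this would come from Kconfig. */
#define POOL_SIZE ((size_t)32 * 1024 * 1024)

/* Offset of the first byte of the pool that has not been handed out. */
static size_t pool_next;

/*
 * Carve size bytes off the front of the remaining pool.
 * On success, stores the chunk's offset in *offset and returns 0.
 * Returns -EINVAL for a zero-sized request and -ENOMEM once the
 * pool cannot satisfy the request any more.
 */
int pool_take(size_t size, size_t *offset)
{
	assert(offset != NULL);

	if (size == 0)
		return -EINVAL;
	if (size > POOL_SIZE - pool_next)
		return -ENOMEM;

	*offset = pool_next;
	pool_next += size;
	return 0;
}
```

A real implementation would also need a way to free chunks and to honour alignment requirements; the sketch only shows the exhaustion behaviour described above.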
Having exactly that usage in mind, in v8 I've added the notion of private CMA contexts which can be used for DMA coherent RAM as well as memory -- Best regards, _ _ .o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o ..o | Computer Science, Michal "mina86" Nazarewicz (o o) ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo-- --
That sounds an awful lot like the Android kernel's pmem implementation. Solving this problem is important for us as well, but I'm not sure I see a better solution than something like what Felipe suggests. The disadvantage, of course, being that the memory isn't available for the system when the user isn't doing the multi-media. David -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum. --
Huge pages. Also, don't treat it as coherent memory and just flush/clear/invalidate cache before and after each DMA transaction. I never understood what's wrong with that approach. -- Best regards, _ _ .o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o ..o | Computer Science, Michal "mina86" Nazarewicz (o o) ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo-- --
If you've ever used an ARM system with a VIVT cache, you'll know what's wrong with this approach. ARM systems with VIVT caches have extremely poor task switching performance because they flush the entire data cache at every task switch - to the extent that it makes system performance drop dramatically when they become loaded. Doing that for every DMA operation will kill the advantage we've gained from having VIPT caches and ASIDs stone dead. --
This statement effectively means: don't map dma-able memory to the CPU --
I'll give you another solution to the problem - lobby ARM Ltd to have this restriction lifted from the architecture specification, which will probably result in the speculative prefetching also having to be removed. That would be my preferred solution if I had the power to do so, but I have to live with what ARM Ltd (and their partners such as yourselves) decide should end up in the architecture specification. --
Isn't disabling Speculative Accesses forwarding to the AXI bus the solution to our woes? At least on the A8, which happens to be paired with non-IOMMU capable IPs on our SoCs. On A9 the bit is gone (or has it moved?), but we have IOMMU here so the CMA isn't needed. http://infocenter.arm.com/ Cortex-A8 Technical Reference Manual Revision: r3p2 3.2.26. c1, Auxiliary Control Register --
I seem to have missed the previous discussion about this issue. Where in the specification (preferably ARMv7) can I find information about this? Is the problem that it is simply forbidden to map an address multiple times with different cache setting and if this is done the hardware might start failing? Or is the problem that having an address mapped cached means that speculative pre-fetch can read it into the cache at any time, possibly causing problems if an un-cached mapping exists? In my opinion option number two can be handled and I've made an attempt at doing that in hwmem (posted on linux-mm a while ago), look in cache_handler.c. Hwmem currently does not use cma but the next version probably will. /Johan Mossberg --
Here's the extracts from the architecture reference manual:

* If the same memory locations are marked as having different cacheability attributes, for example by the use of aliases in a virtual to physical address mapping, behavior is UNPREDICTABLE.

A3.5.7 Memory access restrictions
Behavior is UNPREDICTABLE if the same memory location:
* is marked as Shareable Normal and Non-shareable Normal
* is marked as having different memory types (Normal, Device, or Strongly-ordered)
* is marked as having different cacheability attributes
* is marked as being Shareable Device and Non-shareable Device memory.
Such memory marking contradictions can occur, for example, by the use of aliases in a virtual to physical address mapping.

Glossary: UNPREDICTABLE
Means the behavior cannot be relied upon. UNPREDICTABLE behavior must not represent security holes. UNPREDICTABLE behavior must not halt or hang the processor, or any parts of the system. UNPREDICTABLE behavior must not

Given the extract from the architecture reference manual, do you want to run a system where you can't predict what the behaviour will be if you have two mappings present, one which is cacheable and one which is non-cacheable, and you're relying on the non-cacheable mapping to never return data from the cache? What if during your testing, it appears to work correctly, but out in the field, someone's loaded a different application to your setup resulting in different memory access patterns, causing cache lines to appear in the non-cacheable mapping, and then the CPU hits them on subsequent accesses corrupting data... You can't say that will never happen if you're relying on this unpredictable behaviour. --
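The four conditions quoted from A3.5.7 can be written down as a small predicate. This is only a toy model of the rule for illustration, not kernel code: struct mapping, enum mem_type and attrs_conflict() are hypothetical names, and a real check would have to walk the page tables rather than compare two hand-built descriptors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Memory types named in the quoted extract. */
enum mem_type { MEM_NORMAL, MEM_DEVICE, MEM_STRONGLY_ORDERED };

/* Hypothetical description of one virtual mapping of a physical page. */
struct mapping {
	uintptr_t virt;		/* virtual address of the mapping */
	uint64_t phys_page;	/* physical page number it points at */
	enum mem_type type;
	bool cacheable;
	bool shareable;
};

/*
 * Returns true if a and b alias the same physical page with attributes
 * that the quoted A3.5.7 text lists as giving UNPREDICTABLE behavior.
 */
bool attrs_conflict(const struct mapping *a, const struct mapping *b)
{
	if (a->phys_page != b->phys_page)
		return false;		/* not aliases of one location */
	if (a->type != b->type)
		return true;		/* different memory types */
	if (a->cacheable != b->cacheable)
		return true;		/* different cacheability */
	/* Shareable vs Non-shareable is listed for Normal and Device only. */
	if (a->type != MEM_STRONGLY_ORDERED && a->shareable != b->shareable)
		return true;
	return false;
}
```

In these terms, the case under discussion is a cacheable Normal mapping in the kernel direct map and a second, non-cacheable mapping of the same page: attrs_conflict() returns true for that pair.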
Just to add to Russell's point, we did end up with untraceable CPU deadlocks while running a kernel which was violating some of the rules set by the ARM ARM. The use case used to work ~98% of the time. Regards, Santosh --
