Hello, While debugging a Linux driver on my ARM platform (AT91), I switched on the 2.6.22-rc5 kernel. While reconfiguring it I selected CONFIG_SLUB as a SLAB allocator. The sd/mmc driver I tried to run is vanilla driver which never, until now, produces Oops (at91_mci.c). The oops seems to occur after a page unmapping using dma_unmap_page() followed by a flush_dcache_page() (in at91mci_post_dma_read()). Here is the oops (driver verbose output included to show pages used for dma sg list) : Starting kernel ... Linux version 2.6.22-rc5 (user@location) (gcc version 3.4.2) #8 Thu Jun 21 11:05:18 CEST 2007 CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00053177 Machine: Atmel AT91SAM9263-EK [..] Memory: 64MB = 64MB total Memory: 62012KB available (2492K code, 246K data, 116K init) SLUB: Genslabs=23, HWalign=32, Order=0-1, MinObjects=4, CPUs=1, Nodes=1 [..] Probe MCI devices pre dma read Using transfer index 0 sg = c3d0be68 dma address = 23C54968, length = 8 Nothing left to transfer (index = 1) pre dma read done setting ier to 00000040 MCI irq: status = 0000C0ED, C07F0040, 00000040 Receive has ended post dma read finishing index 0 Unmapping page 23C54968 buffer = c3c54968, length = 8 post dma read done [^ this transfert is ok] [..] pre dma read Using transfer index 0 sg = c3d0be68 dma address = 203AFBC0, length = 64 Nothing left to transfer (index = 1) pre dma read done setting ier to 00000040 MCI irq: status = 0000C0ED, C07F0040, 00000040 Receive has ended post dma read finishing index 0 Unmapping page 203AFBC0 buffer = c03afbc0, length = 64 Unable to handle kernel NULL pointer dereference at virtual address 0000000d pgd = c0004000 [0000000d] *pgd=00000000 Internal error: Oops: 1 [#1] Modules linked in: CPU: 0 PC is at get_index+0x1c/0x50 LR is at prio_tree_next+0x60/0x194 pc : [<c01098ac>] lr : [<c0109fc4>] Not tainted sp : c3d0bca8 ip : ffffffdd fp : c3d0bcb4 r10: 00000040 r9 : c3d0be78 r8 : c3d0bcc4 r7 : c3d0bcc0 r6 : ...
Hello Nicolas! Good news! I think I've found the problem, this seems to work (tested with SLUB and SLAB). If you're in the cleanup stage, I think the whole kmap and kunmap can be in the 'if (cpu_is_at91rm9200())', we have no reason to kmap data we don't touch :-D Regards Marc --- drivers/mmc/host/at91_mci.c-2.6.22-rc5.orig 2007-06-21 16:27:31.000000000 +0200 +++ drivers/mmc/host/at91_mci.c-2.6.22-rc5 2007-06-21 16:42:48.000000000 +0200 @@ -164,7 +164,7 @@ static inline void at91mci_sg_to_dma(str else memcpy(dmabuf, sgbuffer, amount); - kunmap_atomic(sgbuffer, KM_BIO_SRC_IRQ); + kunmap_atomic(sgbuffer - sg->offset, KM_BIO_SRC_IRQ); if (size == 0) break; @@ -293,7 +293,7 @@ static void at91mci_post_dma_read(struct buffer[index] = swab32(buffer[index]); } - kunmap_atomic(buffer, KM_BIO_SRC_IRQ); + kunmap_atomic(buffer - sg->buffer, KM_BIO_SRC_IRQ); flush_dcache_page(sg->page); } -
please use this patch, sorry for the later Regards Marc --- drivers/mmc/host/at91_mci.c-2.6.22-rc5.orig 2007-06-21 16:27:31.000000000 +0200 +++ drivers/mmc/host/at91_mci.c-2.6.22-rc5 2007-06-21 16:42:48.000000000 +0200 @@ -164,7 +164,7 @@ static inline void at91mci_sg_to_dma(str else memcpy(dmabuf, sgbuffer, amount); - kunmap_atomic(sgbuffer, KM_BIO_SRC_IRQ); + kunmap_atomic(sgbuffer - sg->offset, KM_BIO_SRC_IRQ); if (size == 0) break; @@ -293,7 +293,7 @@ static void at91mci_post_dma_read(struct buffer[index] = swab32(buffer[index]); } - kunmap_atomic(buffer, KM_BIO_SRC_IRQ); + kunmap_atomic(buffer - sg->offset, KM_BIO_SRC_IRQ); flush_dcache_page(sg->page); } -
My eyes are too tired or this patch is the same as the previous one :-\ Cheers, -- Nicolas Ferre -
from: Marc Pignat <marc.pignat@hevs.ch> kunmap must be called on the pointer returned by kmap. Signed-off-by: Marc Pignat <marc.pignat@hevs.ch> --- N.B: This is the same patch as yesterday, with proper Signed-off-by and more comments. The buffer variable is used this way: buffer = kmap_atomic(sg->page, KM_BIO_SRC_IRQ) + sg->offset; So we can't do kunmap_atomic(buffer, KM_BIO_SRC_IRQ); we must do kunmap_atomic(buffer - sg->offset, KM_BIO_SRC_IRQ); Strangely this misuse showed no problem using CONFIG_SLAB, but oops using CONFIG_SLUB, (lkml ''Oops in a driver while using SLUB as a SLAB allocator"). Regards Marc --- drivers/mmc/host/at91_mci.c-2.6.22-rc5.orig 2007-06-21 16:27:31.000000000 +0200 +++ drivers/mmc/host/at91_mci.c-2.6.22-rc5 2007-06-21 16:42:48.000000000 +0200 @@ -164,7 +164,7 @@ static inline void at91mci_sg_to_dma(str else memcpy(dmabuf, sgbuffer, amount); - kunmap_atomic(sgbuffer, KM_BIO_SRC_IRQ); + kunmap_atomic(sgbuffer - sg->offset, KM_BIO_SRC_IRQ); if (size == 0) break; @@ -293,7 +293,7 @@ static void at91mci_post_dma_read(struct buffer[index] = swab32(buffer[index]); } - kunmap_atomic(buffer, KM_BIO_SRC_IRQ); + kunmap_atomic(buffer - sg->offset, KM_BIO_SRC_IRQ); flush_dcache_page(sg->page); } -
Aren't you just guessing there? Those kunmap_atomics in at91_mci.c may look wrong to you, but they're not incorrect (so long as sg->offset falls within the page, as it must do here to make sense of the page). Especially not on ARM, where kunmap_atomic actually has no interest in the argument passed. And the oops was in the flush_dcache_page. If you actually reproduced Nicolas' problem on ARM, and verified that your patch then fixes it, please let us know: that will be remarkably interesting. I believe I posted the correct fix last night (or at least a safe fix for now: Christoph Lameter may prefer to undo it and change ARM's dma_mapping at leisure later), checking PageSlab in page_mapping; and Linus has already put that one into his git tree. But it would be good to hear from Nicolas whether that indeed fixes his oops: I couldn't actually try my patch on ARM either. -
Patch tested without success. Indeed, I always see the Oops with Here: Ok tested on Atmel ARM AT91 (at91sam9263), this fixes this oops. Thanks a lot to all of you for your precise help. Regards, -- Nicolas Ferre -
Not at all, thank you for reporting and testing and reporting back, in time for 2.6.22: that's all more valuable than the easy fix. Hugh -
So, it's really interesting, it worked really for me (oops without patch) and no oops with it. My hardware is a custom board with an at91rm9200. I had a look the the kunmap_atomic function, and I _really_ don't understand how this patch can do something :-/ And last but not least, my patch is completly wrong... void *kmap_atomic(...); unsigned int *buffer = kmap_atomic(...) + sg->offset; // addition in u8* kunmap_atomic(buffer - sg->offset); // <- sub in u32* -> please forget my patch Regards Marc -
Just tested and working on my atmel at91rm9200. (at91sam9263 core is ARM926EJ-S and at91rm9200 core is ARM920T). Thanks! Regards Marc -
kunmap_atomic() should align the given buffer to the start of the page anyway, so your patch wont change behaviour. -- Jens Axboe -
Indeed, my eyes where too tired ;-) Sorry for the trouble. -- Nicolas Ferre -
That looks serious: thanks a lot for reporting it. (I see Marc has sent you a patch or two for the driver end: I didn't see their relevance, maybe they skirt around the problem somehow; but unless I'm mistaken, what you've found goes beyond that particular driver.) Seems a little odd that it's gone throughout 2.6.22-rc unnoticed until now - nobody else trying SLUB on ARM or PA-RISC yet perhaps. As I understand it, you're not doing anything wrong (disclaimer: I'm no expert on dma_mapping things), but SLUB's reuse of struct page fields has collided with what flush_dcache_page is expecting. Here's a patch: I'm not convinced it's necessarily the best one (most uses of page_mapping will never see a slab page, it's a pity to be cluttering up that inline even further); but in case nobody else can provide a better... [PATCH] page_mapping must avoid slub pages Nicolas Ferre reports oops from flush_dcache_page() on ARM when using SLUB: which reuses page->mapping as page->slab. The page_mapping() function, used by ARM and PA-RISC flush_dcache_page() implementations, must not confuse SLUB pages with those which have page->mapping set. Signed-off-by: Hugh Dickins <hugh@veritas.com> --- That #ifdef I've put in there is not essential: it's perhaps more of a comment or an accusation ;) include/linux/mm.h | 4 ++++ 1 file changed, 4 insertions(+) --- 2.6.22-rc5/include/linux/mm.h 2007-05-26 07:16:48.000000000 +0100 +++ linux/include/linux/mm.h 2007-06-21 15:52:47.000000000 +0100 @@ -603,6 +603,10 @@ static inline struct address_space *page if (unlikely(PageSwapCache(page))) mapping = &swapper_space; +#ifdef CONFIG_SLUB + else if (unlikely(PageSlab(page))) + mapping = NULL; +#endif else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON)) mapping = NULL; return mapping; -
Well one may be better off allocating pages using the page allocator instead of the slab allocator. I removed these things from i386 but I did not check ARM. -
They may or may not be: I think that's a matter to discuss with rmk. You keep on forcing the outside world to revolve around your needs within slub.c: that is a good way to keep slub lean, and may be justified; but it's at least questionable to be enforcing such restrictions years after people have grown accustomed to more freedom from their slabs. Hugh -
The coherent case on ARM is broken in more ways, not only because it uses kmalloc, but it also takes no notice of the DMA mask. However, AT91 isn't a coherent ARM architecture, so arch_is_coherent() should be false. Therefore, we should never be allocating pages for DMA from SLAB/SLUB for AT91 platforms. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: -
This work is a cleanup of the VM code and part of the slab cleanup that was done. Having slab objects on the LRU and other components of the VM that are supposed to work on page sized objects is plainly wrong and can lead to surprising results. -
Maybe this will address the issue on ARM?
ARM: Allocate dma pages via the page allocator and not via the slab allocator
Slab allocations are not guaranteed to be page aligned and slab allocators
may use the page structs for their own purposes. Using the page allocator
yields a properly aligned page and also makes the page flushing logic work right.
Passing a kmalloced "page" to a flushing function will not work reliably.
This will hopefully address the issue with SLUB on ARM. SLUB uses the
page->mapping field which is also checked by the flushing logic. The
flushing logic expects a normal page and not a slab page.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
arch/arm/mm/consistent.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6/arch/arm/mm/consistent.c
===================================================================
--- linux-2.6.orig/arch/arm/mm/consistent.c 2007-06-21 18:18:15.000000000 -0700
+++ linux-2.6/arch/arm/mm/consistent.c 2007-06-21 18:29:16.000000000 -0700
@@ -277,7 +277,7 @@ dma_alloc_coherent(struct device *dev, s
if (arch_is_coherent()) {
void *virt;
- virt = kmalloc(size, gfp);
+ virt = get_free_pages(gfp, get_order(size));
if (!virt)
return NULL;
*handle = virt_to_dma(dev, virt);
@@ -364,7 +364,7 @@ void dma_free_coherent(struct device *de
WARN_ON(irqs_disabled());
if (arch_is_coherent()) {
- kfree(cpu_addr);
+ free_pages((unsigned long)cpu_addr, get_order(size));
return;
}
-
Looks like it would indeed address the immediate issue on ARM - IF they've no particular reason to be using kmalloc there. However... what gives you confidence that flush_dcache_page is never applied to other slab pages? This looks to me like a clean way forward to try in -mm; but that 2.6.22 should go with the safety PageSlab test in page_mapping. -
Flush dcache page is supposed to run on pages not objects of varying length. It is suprising that this has not lead to earlier problems. Objects allocated this way may straddle a page boundary under some conditions and in that case virt_to_page may not lead to a page that covers the complete object that is supposed to be flushed. Hopefully the 2.6.22 cleans up these issues and this one follows in those footsteps. The reason for the introduction of the quicklist f.e. was to have a clear separation between page sized allocation and the variable allocations through slab allocators. -
No, that's the wrong way round. Neither ARM nor PA-RISC expects flush_dcache_page to flush any dcache when given a slab allocation: they just expect it to pass through, not to oops. Hugh -
I think the right thing to do is do both of these things. I already applied Hugh's patch - it seemed like a total nobrainer to do at this stage in the 2.6.22 -rc series. But that doesn't mean that we should not _also_ look at "flush_dcache_page()" users. I do think that even just the name (the ".._page()" part) makes it obvious that it was designed for page-level allocations, not kmalloc(). So I think Christoph's patch makes sense in that context. At the same time, I do think that the whole notion of flushing the D$ is certainly something that makes sense for kmalloc() allocations also, and maybe people do actually do small DMA allocations and Christophs patch would break that. End result: for 2.6.22, I think the patch from Hugh that I already applied is the right thing. But as to 2.6.23 and onward.. I dunno. Linus -
Hugh's patch not address the complete issue. It only works right now because the size of the allocation is page size and fits right into a slab page. If debugging is enabled then the slab size will increase and the "pages" will be misaligned which will lead to other sorts of funky behavior. kmalloc allocations are only guaranteed to be aligned to ARCH_KMALLOC_MINALIGN which is 4 to 8 bytes. If one must have a page aligned entity out of a slab allocator then a custom slab needs to be created. -
Add VM_BUG_ON in case someone uses page_mapping on a slab page Detect slab objects being passed to the page oriented functions of the VM. It is not sufficient to simply return NULL because the functions calling page_mapping may depend on other items of the page_struct also to be setup properly. Moreover the slab object may not be properly aligned. The page orientedfunctions of the VM expect to operate on page aligned, page sized objects. operations on objects straddling page boundaries may only affect the objects partially which may lead to surprising results. It is better to detect eventually remaining uses and eliminate them. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- include/linux/mm.h | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) Index: linux-2.6/include/linux/mm.h =================================================================== --- linux-2.6.orig/include/linux/mm.h 2007-06-22 10:33:27.000000000 -0700 +++ linux-2.6/include/linux/mm.h 2007-06-22 10:34:09.000000000 -0700 @@ -603,10 +603,7 @@ static inline struct address_space *page if (unlikely(PageSwapCache(page))) mapping = &swapper_space; -#ifdef CONFIG_SLUB - else if (unlikely(PageSlab(page))) - mapping = NULL; -#endif + VM_BUG_ON(PageSlab(page)); else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON)) mapping = NULL; return mapping; -
I'm quite happy with this approach for 2.6.23-rc, along with your ARM dma_map patch which (if I understood aright) rmk approved. Except, haven't you misplaced that VM_BUG_ON between an if and its else? And I'd much prefer you to make it an outright BUG_ON, because many testers have those VM_BUG_ONs configured out. -
Thought about it but doing so would create quite a load of BUG_ONs in the
VM given the frequent use of that particular inline function. And AFAIK
we are only aware of one other potential call site that could cause
trouble. Many arches have run SLUB now for awhile and would certainly have
shown issues if they would do strange things with slab allocs. Even with
SLAB they would have to be very careful in order to make this work. So I
think its rather unlikely that this is going to be triggered. Its
primarily useful for debugging if strange things start to happen.
The VM_BUG_ON could stay there for good to make sure development does not
result in similar issues in the future.
Fixed up patch:
Add VM_BUG_ON in case someone uses page_mapping on a slab page
Detect slab objects being passed to the page oriented functions of the VM.
It is not sufficient to simply return NULL because the functions calling
page_mapping may depend on other items of the page_struct also to be setup
properly. Moreover slab object may not be properly aligned. The page oriented
functions of the VM expect to operate on page aligned, page sized objects.
Operations on object straddling page boundaries may only affect the objects
partially which may lead to surprising results.
It is better to detect eventually remaining uses and eliminate them.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mm.h | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2007-06-22 10:33:27.000000000 -0700
+++ linux-2.6/include/linux/mm.h 2007-06-22 11:44:10.000000000 -0700
@@ -601,12 +601,9 @@ static inline struct address_space *page
{
struct address_space *mapping = page->mapping;
+ VM_BUG_ON(PageSlab(page));
if (unlikely(PageSwapCache(page)))
mapping = &swapper_space;
-#ifdef CONFIG_SLUB
- else if ...Okay; and I was overlooking that (as in this case) we'd probably get an easily debuggable oops instead of the explicit BUG when it is Acked-by: Hugh Dickins <hugh@veritas.com> -
We need to fix any remaining weird slab object uses right now. Your check leaves a lot of holes open. 2.6.22 removes all other such strange slab uses in other arches. It would be inconsistent if we left these things in ARM (and maybe PA-RISC). -
As I understand it, that driver used to work right with SLAB, then oopsed with SLUB, and now works okay again with the page_mapping patch? I'm unclear how it comes about that you removed "all other such strange slab uses in other arches", yet missed this? That suggests there may be further unexpected uses. It worries me that any use which catches you by surprise has to be fixed up in the caller, rather than in slub itself: slab/slub is a service, not a master. But I'm rather repeating myself. Hugh -
Try to enable debugging then it may fail again despite your patch. You just scratched the surface with this and are enabling a dangerous usage There could be other uses that were missed. I looked for slabs created by kmem_cache_create. The trouble is that any kmalloc could also be used for engineer these weird things (as seen here) and there are gazillions of kmallocs. That is why a VM_BUG_ON is useful. However, it requires some effort even with SLAB to create these things and--given that we have tested extensively on lots of arches--I am hopeful that SLUB used to implement the same special casing as SLAB did (which results in the fragile scenarios in which the above is possible). But we made the decision to clean up the slab interface and we dropped the emulation of the SLAB frills from SLUB. Messy and problematic code like this should be removed. That improves the quality of the kernel. The removal is a straightfoward process. And the cases that we are discussing here are in remote corners of the kernel. -
Here is the corresponding PA-RISC piece. Its as straightforward as
the other one since the only way to make this work properly in the past
was if the sizes passed to the dma alloc functions are page size based.
PA-RISC: Use page allocator instead of slab allocator.
Slab pages obtained via kmalloc are not cachline aligned. Nor is it advisable
to perform VM operations designed for page allocator pages on
memory obtained via kmalloc.
So replace the page sized allocations in kernel/pci-dma.c with page allocator
pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
arch/parisc/kernel/pci-dma.c | 4 ++--
include/linux/mm.h | 5 +----
2 files changed, 3 insertions(+), 6 deletions(-)
Index: linux-2.6/arch/parisc/kernel/pci-dma.c
===================================================================
--- linux-2.6.orig/arch/parisc/kernel/pci-dma.c 2007-06-22 13:02:32.000000000 -0700
+++ linux-2.6/arch/parisc/kernel/pci-dma.c 2007-06-22 13:06:39.000000000 -0700
@@ -572,7 +572,7 @@ static void *pa11_dma_alloc_noncoherent(
void *addr = NULL;
/* rely on kmalloc to be cacheline aligned */
- addr = kmalloc(size, flag);
+ addr = (void *)__get_free_pages(flag, get_order(size));
if(addr)
*dma_handle = (dma_addr_t)virt_to_phys(addr);
@@ -582,7 +582,7 @@ static void *pa11_dma_alloc_noncoherent(
static void pa11_dma_free_noncoherent(struct device *dev, size_t size,
void *vaddr, dma_addr_t iova)
{
- kfree(vaddr);
+ free_pages((unsigned long)vaddr, get_order(size));
return;
}
-
* From: Christoph Lameter * Newsgroups: linux.kernel,linux.ports.arm.kernel Seems like comment must be removed. ____ -
I didn't approve it. Please re-read my reply - there are still some unanswered questions in it which _really_ need answering. The report talks about the AT91 machines. These machines do not have cache coherent DMA. Therefore, the code being patched should be optimised away by the compiler. *Or* we have even bigger problems. Please forward the original problem report. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: -
Okay, that seems to back up my suspicions - it's definitely AT91-based. Since AT91-based machines do not have a DMA coherent cache, arch_is_coherent() must be defined to '0'. The only way that kmalloc could be reached is if that were defined to something other than '0', and if that's done on a machine with DMA incoherent caches, that will lead to data corruption. I think we need to wait for Nicolas to respond on this issue before running headlong into applying a sticky plaster for something which is actually a deeper issue. However, the arch_is_coherent() path _is_ buggy as it stands, but in more than the way identified thus far. Eg, it doesn't set __GFP_DMA appropriately for various DMA masks, so it might return DMA inaccessible memory. If we're after a simple fix for 2.6.22, the _easiest_ solution would be to delete the entire arch_is_coherent() branches in arch/arm/mm/consistent.c; that will result in a working solution for everyone, albiet at a slightly lower performance for the DMA-coherent CPUs. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: -
No need for Nicolas to respond, I think I've found what's "wrong": I expect you're right, but that's a separate issue. I had thought you were approving Christoph's ARM patch because both you and he seemed to agree that kmalloc was inappropriate for use in dma_alloc_coherent, whatever additional issues you saw with it. I still don't see why kmalloc is wrong there myself: for a while I bought Christoph's alignment argument, but now I don't see why (more than long) alignment is important to it. But I'm easily The fix for 2.6.22 is my PageSlab test in page_mapping which Linus already put into -git. And I now rather think that needs to stay, not be replaced by the VM_BUG_ON Christoph was proposing for 2.6.23 (which earlier I acked). Christoph responded to my page_mapping patch by looking at arch/arm, and there finding a kmalloc in dma_alloc_coherent which he didn't like; but you're right, it's entirely irrelevant to Nicolas' oops. The slub allocation which gives rise to Nicolas' oops in not in ARM, but (I'm guessing) in drivers/mmc/core/sd.c: one of those status = kmalloc(64, GFP_KERNEL); where status is passed down for the response from mmc_sd_switch. And what is wrong with using kmalloc there? Why should that be changed to allocate a whole page? How many other such cases might there be? And the flush_dcache_page in at91mci_post_dma_read looks correct to me too: it has just filled and perhaps also swabbed a buffer, that buffer might in some cases be mapped into userspace, so it calls flush_dcache_page. In the kmalloc case it's not mapped into userspace: flush_dcache_page should detect that and do nothing, as it does with slab; but slub was reusing page->mapping for something else, so we oopsed. Hugh -
AT91 _is_ defined with the arch_is_coherent() switch set to 0 (in Ok it is bad but, in AT91, all memory is accessible with DMA. True, the oops appears after a mmc switch command (#6 command). [..] Not sure I can add much to what Hugh has said. If you need some more precision anyway, I will try to answer. Regards, -- Nicolas Ferre -
If that is the case then what we really want is a flush_dcache_range not the above. flush_dcache_range does not take a page struct as an argument and it will work on memory that has no struct page backing it. Is flush_dcache_range available in all platforms? I see some drivers using it: drivers/net/fec.c drivers/serial/mpsc.c drivers/char/agp/uninorth-agp.c flush_dcache_page is implemented by sparc64 Uses mapping sh Ok. Only uses PG_mapped arm Uses mapping in the mmu case frv Does a kmap_atomic ?? Otherwise looks ok. ppc Clears PG_arch_1 mips Uses mapping sh64 No page struct use parisc Uses mapping xtensa Uses mapping powerpc Handles page flags PG_arch_1 ia64 Clears PG_arch_1 sparc Calculates address based on page struct addr. blackfin Does an immediate page_address(page) m68k Does an immediate page_address(page) In many situations the page struct passed to flush_dcache_page is simply used to calculate the virtual address. So its mostly harmless. Trouble starts when page attributes like the mapping is used. So the problem platforms are sparc64 arm mips parisc xtensa If we indeed do these weird things then I think the general fix should be to use flush_dcache_range() but that is too late for 2.6.22. The VM_BUG_ON will be useful to detect these scenarios. Maybe we need to replace that with a WARN_ON or something if the usage is frequent? There are a large number of platforms on which flush_dcache_range has no effect or an effect that is negligible. A kmalloc slab object (even 64 byte) may be crossing a page boundary with a ARCH_KMALLOC_MINALIGN of 4 or 8. So I think that flush_dcache_range *must* be used rather than flush_dcache_page. flush_dcache_page(virt_to_page(object)) takes the starting address of the object and flushes the page in which the object started. It may not be the complete object. This usually works fine with 64 byte objects because they neatly fit into a slab page. Again if CONFIG_SLAB_DEBUG f.e. is enabled ...
Mostly harmless indeed. I don't understand why you insist on trying to complicate the situation. flush_dcache_page is only expected to do something on pages mapped into userspace (correct me if I'm wrong there), it's expected to do nothing on kmalloc'ed pages. It's been working that way for years, and will continue to work that way with slub, providing either page_mapping or flush_dcache_page checks PageSlab to avoid oopsing on page->mapping. 2.6.22-rc6 has page_mapping making that check: we could argue about which is the better site for it, there are good arguments both ways (page_mapping is the correct place, flush_dcache_page is the more Why???? All we require of flush_dcache_page is that it not oops on the first page in the range: we don't need to change over to We have the PageSlab check in page_mapping now, I don't see what further change is needed. Hugh -
It is definitely intended to work. Otherwise we would not have code like this: christoph@fly:~/linux-2.6$ find . -name "*.c" | xargs grep "flush_dcache_page"|grep virt ./drivers/scsi/scsi_tgt_if.c: flush_dcache_page(virt_to_page(ev)); Ok. I think your patch is fine as a quick fix for 2.6.22. I am a bit uneasy with that given that its in such a broadly used function while its only use is to enable flush_dcache_page to work. But we need the general As explained about: There are corner cases in which it does not work. You seem to assume that flush_dcache_page can become a no op. That may not be true on platforms that need explicit cache flushing for a DMA engine to access a data structure. The above listed use suggests that the caller expects flushing to occur correctly. -
I didn't claim that flush_dcache_page(virt_to_page(virt)) is not expected to work. I claim that flush_dcache_page is expected to be a noop rather The scsi_tgt_if.c use you show above? That's not dealing with kmalloced pages, is it? True, the pages it is dealing with don't have page->mapping set, so those architectures which use page->mapping to find what to do in their flush_dcache_page, won't do anything there in their flush_dcache_page. Whether that's a bug or not, I wouldn't pretend to know; but it's nothing to do with the present discussion. Please see Documentation/cachetlb.txt: flush_dcache_page is about pagecache pages mapped into userspace. We don't use kmalloc for those, but we do sometimes need to flush_dcache_page in places which commonly deal with pagecache pages, but sometimes handle kmalloc'ed buffers too. Luckily we don't have to deal with buffers in which the first page is kmalloced and the next comes from pagecache. Hugh -
There are no kmalloced pages. There is only kmalloced memory. You allocate pages from the page allocator. Its a layering violation to expect a page struct operation on a slab object to work. It is not okay to use a page cache function on a slab object. The slab object does not fulfill the alignment requirements of a page cache page nor does it have a compatible page struct content as a page cache page. This can only cause trouble. The problem is that the code is allocating some slab memory and then determines the page struct pointer and then hands it back to the DMA This gets crazier and crazier. flush_dcache_page is for pages not for allocated buffers via kmalloc. So this has nothing to do with any need to flush slab objects? Sometimes we do this and then we do that? So someone played loose ball with the slab, was successful and that makes it right now? We need the VM_BUG_ON in include/linux/mm.h to stop this nonsense. The "sometimes we have kmalloced buffers" locations need to be fixed. -
I've said enough, I'd better leave it to others to deter you or not from fiddling around pointlessly here. But Andrew, please cancel my Acked-by to mm's add-vm_bug_on-in-case-someone-uses-page_mapping-on-a-slab-page.patch Thanks, Hugh -
Are there any locations left after the two fixes to pa-risc and arm? If the ARM fix solves the issue then we may not need that special casing anymore. Maybe the sometimes has become never? -
Please reread mails of the last 20 hours. Your "ARM fix" may or may not be a good thing, I don't know. But it had nothing to do with this oops, which (we believe) came from a kmalloc in drivers/mmc/core/sd.c. There are likely to be many other drivers which pass down kmalloc'ed buffers to routines which handle both kmalloc'ed buffers and page buffers. Though very few of those needing to flush_dcache_page. Hugh -
If that is so then those functions may want to check PageSlab before calling flush_dcache_page. -
Note, however, that the use of flush_dcache_page() on allocations derived from dma_alloc_coherent() is undefined and unpredictable. dma_alloc_coherent() may return a remapped virtual address which would be invalid for things like virt_to_page() to operate on. So the fact that dma_alloc_coherent() returning something that was kmalloc'd and then causing flush_dcache_page() to oops is actually a sign that there's far deeper problems. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: -
The impact is only on a subset of ARM machines. PA_RISC? It looks like they run their own flushing function for byte ranges called flush_kernel_dache_range. That does not use the page struct. -
PA-RISC does have a function of that name, and I'm guessing that you came across it in looking at the PA-RISC dma_map_single. But PA-RISC also has a function called flush_dcache_page, which uses page_mapping and expects a struct address_space * from it. If that can ever be get applied to a SLOB page (which is not so clear as in the ARM case, but cannot easily be ruled out completely), we're in trouble without a PageSlab test within page_mapping. Hugh -
A page comes from the page allocator. Access to a slab allocators "page" is an access to subsystem internals. Those internals vary (including the representation of objects in the "page") and depend on the slab allocator selected, the debug options in effect and parameters with which the slab was created etc etc. If flush_dcache_page is applied to a slab object then we need to do a similar change to PA-RISC as for ARM. -
