Re: Oops in a driver while using SLUB as a SLAB allocator

Previous thread: how about mutual compatibility between Linux's GPLv2 and GPLv3? by Alexandre Oliva on Thursday, June 21, 2007 - 2:39 am. (68 messages)

Next thread: [RFC] mm-controller by Peter Zijlstra on Thursday, June 21, 2007 - 2:32 am. (12 messages)
From: Nicolas Ferre
Date: Thursday, June 21, 2007 - 2:30 am

Hello,

While debugging a Linux driver on my ARM platform 
(AT91), I switched on the 2.6.22-rc5 kernel. While 
reconfiguring it I selected CONFIG_SLUB as a SLAB allocator.

The sd/mmc driver I tried to run is vanilla driver 
which never, until now, produces Oops (at91_mci.c).

The oops seems to occur after a page unmapping using 
dma_unmap_page() followed by a flush_dcache_page() (in 
at91mci_post_dma_read()).

Here is the oops (driver verbose output included to show 
pages used for dma sg list) :

Starting kernel ...

Linux version 2.6.22-rc5 (user@location) (gcc version 3.4.2) #8 Thu Jun 21 11:05:18 CEST 2007
CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00053177
Machine: Atmel AT91SAM9263-EK
[..]
Memory: 64MB = 64MB total
Memory: 62012KB available (2492K code, 246K data, 116K init)
SLUB: Genslabs=23, HWalign=32, Order=0-1, MinObjects=4, CPUs=1, Nodes=1
[..]
Probe MCI devices
pre dma read
Using transfer index 0
sg = c3d0be68
dma address = 23C54968, length = 8
Nothing left to transfer (index = 1)
pre dma read done
setting ier to 00000040
MCI irq: status = 0000C0ED, C07F0040, 00000040
Receive has ended
post dma read
finishing index 0
Unmapping page 23C54968
buffer = c3c54968, length = 8
post dma read done
[^ this transfert is ok]
[..]
pre dma read
Using transfer index 0
sg = c3d0be68
dma address = 203AFBC0, length = 64
Nothing left to transfer (index = 1)
pre dma read done
setting ier to 00000040
MCI irq: status = 0000C0ED, C07F0040, 00000040
Receive has ended
post dma read
finishing index 0
Unmapping page 203AFBC0
buffer = c03afbc0, length = 64
Unable to handle kernel NULL pointer dereference at virtual address 0000000d
pgd = c0004000
[0000000d] *pgd=00000000
Internal error: Oops: 1 [#1]
Modules linked in:
CPU: 0
PC is at get_index+0x1c/0x50
LR is at prio_tree_next+0x60/0x194
pc : [<c01098ac>]    lr : [<c0109fc4>]    Not tainted
sp : c3d0bca8  ip : ffffffdd  fp : c3d0bcb4
r10: 00000040  r9 : c3d0be78  r8 : c3d0bcc4
r7 : c3d0bcc0  r6 : ...
From: Marc Pignat
Date: Thursday, June 21, 2007 - 7:54 am

Hello Nicolas!

Good news!

I think I've found the problem, this seems to work (tested with SLUB and SLAB).

If you're in the cleanup stage, I think the whole kmap and kunmap can be in the 'if (cpu_is_at91rm9200())',
we have no reason to kmap data we don't touch :-D

Regards

Marc

--- drivers/mmc/host/at91_mci.c-2.6.22-rc5.orig	2007-06-21 16:27:31.000000000 +0200
+++ drivers/mmc/host/at91_mci.c-2.6.22-rc5	2007-06-21 16:42:48.000000000 +0200
@@ -164,7 +164,7 @@ static inline void at91mci_sg_to_dma(str
 		else
 			memcpy(dmabuf, sgbuffer, amount);

-		kunmap_atomic(sgbuffer, KM_BIO_SRC_IRQ);
+		kunmap_atomic(sgbuffer - sg->offset, KM_BIO_SRC_IRQ);

 		if (size == 0)
 			break;
@@ -293,7 +293,7 @@ static void at91mci_post_dma_read(struct
 				buffer[index] = swab32(buffer[index]);
 		}

-		kunmap_atomic(buffer, KM_BIO_SRC_IRQ);
+		kunmap_atomic(buffer - sg->buffer, KM_BIO_SRC_IRQ);
 		flush_dcache_page(sg->page);
 	}

-

From: Marc Pignat
Date: Thursday, June 21, 2007 - 7:57 am

please use this patch, sorry for the later

Regards

Marc

--- drivers/mmc/host/at91_mci.c-2.6.22-rc5.orig	2007-06-21 16:27:31.000000000 +0200
+++ drivers/mmc/host/at91_mci.c-2.6.22-rc5	2007-06-21 16:42:48.000000000 +0200
@@ -164,7 +164,7 @@ static inline void at91mci_sg_to_dma(str
 		else
 			memcpy(dmabuf, sgbuffer, amount);

-		kunmap_atomic(sgbuffer, KM_BIO_SRC_IRQ);
+		kunmap_atomic(sgbuffer - sg->offset, KM_BIO_SRC_IRQ);

 		if (size == 0)
 			break;
@@ -293,7 +293,7 @@ static void at91mci_post_dma_read(struct
 				buffer[index] = swab32(buffer[index]);
 		}

-		kunmap_atomic(buffer, KM_BIO_SRC_IRQ);
+		kunmap_atomic(buffer - sg->offset, KM_BIO_SRC_IRQ);
 		flush_dcache_page(sg->page);
 	}

-

From: Nicolas Ferre
Date: Thursday, June 21, 2007 - 8:54 am

My eyes are too tired or this patch is the same as the previous one :-\

Cheers,
-- 
Nicolas Ferre


-

From: Marc Pignat
Date: Thursday, June 21, 2007 - 11:28 pm

from: Marc Pignat <marc.pignat@hevs.ch>

kunmap must be called on the pointer returned by kmap.

Signed-off-by: Marc Pignat <marc.pignat@hevs.ch>

---

N.B: This is the same patch as yesterday, with proper Signed-off-by and more 
comments.

The buffer variable is used this way:
		buffer = kmap_atomic(sg->page, KM_BIO_SRC_IRQ) + sg->offset;
So we can't do
		kunmap_atomic(buffer, KM_BIO_SRC_IRQ);
we must do
		kunmap_atomic(buffer - sg->offset, KM_BIO_SRC_IRQ);
		
Strangely this misuse showed no problem using CONFIG_SLAB, but oops using
CONFIG_SLUB, (lkml ''Oops in a driver while using SLUB as a SLAB allocator").

Regards

Marc

--- drivers/mmc/host/at91_mci.c-2.6.22-rc5.orig	2007-06-21 16:27:31.000000000 +0200
+++ drivers/mmc/host/at91_mci.c-2.6.22-rc5	2007-06-21 16:42:48.000000000 +0200
@@ -164,7 +164,7 @@ static inline void at91mci_sg_to_dma(str
 		else
 			memcpy(dmabuf, sgbuffer, amount);

-		kunmap_atomic(sgbuffer, KM_BIO_SRC_IRQ);
+		kunmap_atomic(sgbuffer - sg->offset, KM_BIO_SRC_IRQ);

 		if (size == 0)
 			break;
@@ -293,7 +293,7 @@ static void at91mci_post_dma_read(struct
 				buffer[index] = swab32(buffer[index]);
 		}

-		kunmap_atomic(buffer, KM_BIO_SRC_IRQ);
+		kunmap_atomic(buffer - sg->offset, KM_BIO_SRC_IRQ);
 		flush_dcache_page(sg->page);
 	}

-

From: Hugh Dickins
Date: Friday, June 22, 2007 - 5:00 am

Aren't you just guessing there?  Those kunmap_atomics in at91_mci.c
may look wrong to you, but they're not incorrect (so long as sg->offset
falls within the page, as it must do here to make sense of the page).
Especially not on ARM, where kunmap_atomic actually has no interest
in the argument passed.  And the oops was in the flush_dcache_page.

If you actually reproduced Nicolas' problem on ARM, and verified
that your patch then fixes it, please let us know: that will be
remarkably interesting.

I believe I posted the correct fix last night (or at least a safe fix
for now: Christoph Lameter may prefer to undo it and change ARM's
dma_mapping at leisure later), checking PageSlab in page_mapping;
and Linus has already put that one into his git tree.

But it would be good to hear from Nicolas whether that indeed fixes
his oops: I couldn't actually try my patch on ARM either.

-

From: Nicolas Ferre
Date: Friday, June 22, 2007 - 6:34 am

Patch tested without success. Indeed, I always see the Oops with 

Here:

Ok tested on Atmel ARM AT91 (at91sam9263), this fixes this oops.

Thanks a lot to all of you for your precise help.

Regards,
-- 
Nicolas Ferre


-

From: Hugh Dickins
Date: Friday, June 22, 2007 - 6:46 am

Not at all, thank you for reporting and testing and reporting back,
in time for 2.6.22: that's all more valuable than the easy fix.

Hugh
-

From: Marc Pignat
Date: Friday, June 22, 2007 - 7:21 am

So, it's really interesting, it worked really for me (oops without patch) and
no oops with it. My hardware is a custom board with an at91rm9200.

I had a look the the kunmap_atomic function, and I _really_ don't understand
how this patch can do something :-/

And last but not least, my patch is completly wrong...
void *kmap_atomic(...);
unsigned int *buffer = kmap_atomic(...) + sg->offset; // addition in u8*
kunmap_atomic(buffer - sg->offset); // <- sub in u32*

-> please forget my patch

Regards

Marc


-

From: Marc Pignat
Date: Friday, June 22, 2007 - 7:58 am

Just tested and working on my atmel at91rm9200.
(at91sam9263 core is ARM926EJ-S and at91rm9200 core is ARM920T).

Thanks!

Regards

Marc

-

From: Jens Axboe
Date: Friday, June 22, 2007 - 12:00 pm

kunmap_atomic() should align the given buffer to the start of the page
anyway, so your patch wont change behaviour.

-- 
Jens Axboe

-

From: Nicolas Ferre
Date: Friday, June 22, 2007 - 2:09 am

Indeed, my eyes where too tired ;-) Sorry for the trouble.

-- 
Nicolas Ferre


-

From: Hugh Dickins
Date: Thursday, June 21, 2007 - 3:27 pm

That looks serious: thanks a lot for reporting it.

(I see Marc has sent you a patch or two for the driver end:
I didn't see their relevance, maybe they skirt around the
problem somehow; but unless I'm mistaken, what you've found
goes beyond that particular driver.)

Seems a little odd that it's gone throughout 2.6.22-rc unnoticed
until now - nobody else trying SLUB on ARM or PA-RISC yet perhaps.
As I understand it, you're not doing anything wrong (disclaimer:
I'm no expert on dma_mapping things), but SLUB's reuse of struct
page fields has collided with what flush_dcache_page is expecting.

Here's a patch: I'm not convinced it's necessarily the best one
(most uses of page_mapping will never see a slab page, it's a pity
to be cluttering up that inline even further); but in case nobody
else can provide a better...


[PATCH] page_mapping must avoid slub pages

Nicolas Ferre reports oops from flush_dcache_page() on ARM when using
SLUB: which reuses page->mapping as page->slab.  The page_mapping()
function, used by ARM and PA-RISC flush_dcache_page() implementations,
must not confuse SLUB pages with those which have page->mapping set.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
That #ifdef I've put in there is not essential:
it's perhaps more of a comment or an accusation ;)

 include/linux/mm.h |    4 ++++
 1 file changed, 4 insertions(+)

--- 2.6.22-rc5/include/linux/mm.h	2007-05-26 07:16:48.000000000 +0100
+++ linux/include/linux/mm.h	2007-06-21 15:52:47.000000000 +0100
@@ -603,6 +603,10 @@ static inline struct address_space *page
 
 	if (unlikely(PageSwapCache(page)))
 		mapping = &swapper_space;
+#ifdef CONFIG_SLUB
+	else if (unlikely(PageSlab(page)))
+		mapping = NULL;
+#endif
 	else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON))
 		mapping = NULL;
 	return mapping;

-

From: Christoph Lameter
Date: Thursday, June 21, 2007 - 6:01 pm

Well one may be better off allocating pages using the page allocator 
instead of the slab allocator. I removed these things from i386 but I did 
not check ARM.

-

From: Hugh Dickins
Date: Thursday, June 21, 2007 - 9:26 pm

They may or may not be: I think that's a matter to discuss with rmk.

You keep on forcing the outside world to revolve around your needs
within slub.c: that is a good way to keep slub lean, and may be
justified; but it's at least questionable to be enforcing such
restrictions years after people have grown accustomed to more
freedom from their slabs.

Hugh
-

From: Russell King
Date: Friday, June 22, 2007 - 12:00 am

The coherent case on ARM is broken in more ways, not only because
it uses kmalloc, but it also takes no notice of the DMA mask.

However, AT91 isn't a coherent ARM architecture, so arch_is_coherent()
should be false.  Therefore, we should never be allocating pages for
DMA from SLAB/SLUB for AT91 platforms.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-

From: Christoph Lameter
Date: Thursday, June 21, 2007 - 10:13 pm

This work is a cleanup of the VM code and part of the slab cleanup that 
was done. Having slab objects on the LRU and other components of the VM 
that are supposed to work on page sized objects is plainly wrong and can 
lead to surprising results.


-

From: Christoph Lameter
Date: Thursday, June 21, 2007 - 6:36 pm

Maybe this will address the issue on ARM?


ARM: Allocate dma pages via the page allocator and not via the slab allocator

Slab allocations are not guaranteed to be page aligned and slab allocators
may use the page structs for their own purposes. Using the page allocator
yields a properly aligned page and also makes the page flushing logic work right.

Passing a kmalloced "page" to a flushing function will not work reliably.

This will hopefully address the issue with SLUB on ARM. SLUB uses the
page->mapping field which is also checked by the flushing logic. The
flushing logic expects a normal page and not a slab page.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/arm/mm/consistent.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/arm/mm/consistent.c
===================================================================
--- linux-2.6.orig/arch/arm/mm/consistent.c	2007-06-21 18:18:15.000000000 -0700
+++ linux-2.6/arch/arm/mm/consistent.c	2007-06-21 18:29:16.000000000 -0700
@@ -277,7 +277,7 @@ dma_alloc_coherent(struct device *dev, s
 	if (arch_is_coherent()) {
 		void *virt;
 
-		virt = kmalloc(size, gfp);
+		virt = get_free_pages(gfp, get_order(size));
 		if (!virt)
 			return NULL;
 		*handle =  virt_to_dma(dev, virt);
@@ -364,7 +364,7 @@ void dma_free_coherent(struct device *de
 	WARN_ON(irqs_disabled());
 
 	if (arch_is_coherent()) {
-		kfree(cpu_addr);
+		free_pages((unsigned long)cpu_addr, get_order(size));
 		return;
 	}
 
-

From: Hugh Dickins
Date: Thursday, June 21, 2007 - 9:40 pm

Looks like it would indeed address the immediate issue on ARM -
IF they've no particular reason to be using kmalloc there.

However... what gives you confidence that flush_dcache_page is
never applied to other slab pages?

This looks to me like a clean way forward to try in -mm; but that
2.6.22 should go with the safety PageSlab test in page_mapping.

-

From: Christoph Lameter
Date: Thursday, June 21, 2007 - 10:10 pm

Flush dcache page is supposed to run on pages not objects of varying 
length. It is suprising that this has not lead to earlier problems.
Objects allocated this way may straddle a page boundary under some 
conditions and in that case virt_to_page may not lead to a page that 
covers the complete object that is supposed to be flushed. Hopefully the 

2.6.22 cleans up these issues and this one follows in those footsteps.

The reason for the introduction of the quicklist f.e. was to have a clear 
separation between page sized allocation and the variable allocations 
through slab allocators.
-

From: Hugh Dickins
Date: Thursday, June 21, 2007 - 10:37 pm

No, that's the wrong way round.  Neither ARM nor PA-RISC expects
flush_dcache_page to flush any dcache when given a slab allocation:
they just expect it to pass through, not to oops.

Hugh
-

From: Linus Torvalds
Date: Friday, June 22, 2007 - 9:40 am

I think the right thing to do is do both of these things. I already 
applied Hugh's patch - it seemed like a total nobrainer to do at this 
stage in the 2.6.22 -rc series. But that doesn't mean that we should not 
_also_ look at "flush_dcache_page()" users.

I do think that even just the name (the ".._page()" part) makes it obvious 
that it was designed for page-level allocations, not kmalloc(). So I think 
Christoph's patch makes sense in that context.

At the same time, I do think that the whole notion of flushing the D$ is 
certainly something that makes sense for kmalloc() allocations also, and 
maybe people do actually do small DMA allocations and Christophs patch 
would break that.

End result: for 2.6.22, I think the patch from Hugh that I already applied 
is the right thing. 

But as to 2.6.23 and onward.. I dunno.

		Linus
-

From: Christoph Lameter
Date: Friday, June 22, 2007 - 10:26 am

Hugh's patch not address the complete issue. It only works right now 
because the size of the allocation is page size and fits right into a slab 
page. If debugging is enabled then the slab size will increase and the 
"pages" will be misaligned which will lead to other sorts of funky 
behavior. kmalloc allocations are only guaranteed to be aligned to 
ARCH_KMALLOC_MINALIGN which is 4 to 8 bytes. If one must have a 
page aligned entity out of a slab allocator then a custom slab needs to be 
created.
-

From: Christoph Lameter
Date: Friday, June 22, 2007 - 10:41 am

Add VM_BUG_ON in case someone uses page_mapping on a slab page

Detect slab objects being passed to the page oriented functions of the VM.

It is not sufficient to simply return NULL because the functions calling
page_mapping may depend on other items of the page_struct also to be setup
properly. Moreover the slab object may not be properly aligned. The page 
orientedfunctions of the VM expect to operate on page aligned, page sized 
objects. operations on objects straddling page boundaries may only affect 
the objects partially which may lead to surprising results.

It is better to detect eventually remaining uses and eliminate them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm.h |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2007-06-22 10:33:27.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2007-06-22 10:34:09.000000000 -0700
@@ -603,10 +603,7 @@ static inline struct address_space *page
 
 	if (unlikely(PageSwapCache(page)))
 		mapping = &swapper_space;
-#ifdef CONFIG_SLUB
-	else if (unlikely(PageSlab(page)))
-		mapping = NULL;
-#endif
+	VM_BUG_ON(PageSlab(page));
 	else if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON))
 		mapping = NULL;
 	return mapping;

-

From: Hugh Dickins
Date: Friday, June 22, 2007 - 11:39 am

I'm quite happy with this approach for 2.6.23-rc, along with your ARM
dma_map patch which (if I understood aright) rmk approved.  Except,
haven't you misplaced that VM_BUG_ON between an if and its else?
And I'd much prefer you to make it an outright BUG_ON, because
many testers have those VM_BUG_ONs configured out.

-

From: Christoph Lameter
Date: Friday, June 22, 2007 - 11:51 am

Thought about it but doing so would create quite a load of BUG_ONs in the 
VM given the frequent use of that particular inline function. And AFAIK
we are only aware of one other potential call site that could cause 
trouble. Many arches have run SLUB now for awhile and would certainly have 
shown issues if they would do strange things with slab allocs. Even with 
SLAB they would have to be very careful in order to make this work. So I 
think its rather unlikely that this is going to be triggered. Its 
primarily useful for debugging if strange things start to happen.
The VM_BUG_ON could stay there for good to make sure development does not
result in similar issues in the future.

Fixed up patch:




Add VM_BUG_ON in case someone uses page_mapping on a slab page

Detect slab objects being passed to the page oriented functions of the VM.

It is not sufficient to simply return NULL because the functions calling
page_mapping may depend on other items of the page_struct also to be setup
properly. Moreover slab object may not be properly aligned. The page oriented
functions of the VM expect to operate on page aligned, page sized objects.
Operations on object straddling page boundaries may only affect the objects
partially which may lead to surprising results.

It is better to detect eventually remaining uses and eliminate them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mm.h |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2007-06-22 10:33:27.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2007-06-22 11:44:10.000000000 -0700
@@ -601,12 +601,9 @@ static inline struct address_space *page
 {
 	struct address_space *mapping = page->mapping;
 
+	VM_BUG_ON(PageSlab(page));
 	if (unlikely(PageSwapCache(page)))
 		mapping = &swapper_space;
-#ifdef CONFIG_SLUB
-	else if ...
From: Hugh Dickins
Date: Friday, June 22, 2007 - 12:01 pm

Okay; and I was overlooking that (as in this case) we'd probably get
an easily debuggable oops instead of the explicit BUG when it is

Acked-by: Hugh Dickins <hugh@veritas.com>
-

From: Christoph Lameter
Date: Friday, June 22, 2007 - 12:11 pm

We need to fix any remaining weird slab object uses right now. Your check 
leaves a lot of holes open. 2.6.22 removes all other such strange slab 
uses in other arches. It would be inconsistent if we left these things in 
ARM (and maybe PA-RISC).





-

From: Hugh Dickins
Date: Friday, June 22, 2007 - 1:21 pm

As I understand it, that driver used to work right with SLAB, then
oopsed with SLUB, and now works okay again with the page_mapping patch?

I'm unclear how it comes about that you removed "all other such strange
slab uses in other arches", yet missed this?  That suggests there may
be further unexpected uses.

It worries me that any use which catches you by surprise has to be
fixed up in the caller, rather than in slub itself: slab/slub is a
service, not a master.  But I'm rather repeating myself.

Hugh
-

From: Christoph Lameter
Date: Friday, June 22, 2007 - 3:54 pm

Try to enable debugging then it may fail again despite your patch. You 
just scratched the surface with this and are enabling a dangerous usage 

There could be other uses that were missed. I looked for slabs created by 
kmem_cache_create. The trouble is that any kmalloc could also be used for 
engineer these weird things (as seen here) and there are gazillions of 
kmallocs. That is why a VM_BUG_ON is useful. However, it requires some 
effort even with SLAB to create these things and--given that we have 
tested extensively on lots of arches--I am hopeful that 

SLUB used to implement the same special casing as SLAB did (which results 
in the fragile scenarios in which the above is possible). But we made the 
decision to clean up the slab interface and we dropped the emulation of 
the SLAB frills from SLUB.

Messy and problematic code like this should be removed. That improves the 
quality of the kernel. The removal is a straightfoward process. And the 
cases that we are discussing here are in remote corners of the kernel.
-

From: Christoph Lameter
Date: Friday, June 22, 2007 - 1:15 pm

Here is the corresponding PA-RISC piece. Its as straightforward as 
the other one since the only way to make this work properly in the past 
was if the sizes passed to the dma alloc functions are page size based.



PA-RISC: Use page allocator instead of slab allocator.

Slab pages obtained via kmalloc are not cachline aligned. Nor is it advisable
to perform VM operations designed for page allocator pages on
memory obtained via kmalloc.

So replace the page sized allocations in kernel/pci-dma.c with page allocator
pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/parisc/kernel/pci-dma.c |    4 ++--
 include/linux/mm.h           |    5 +----
 2 files changed, 3 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/parisc/kernel/pci-dma.c
===================================================================
--- linux-2.6.orig/arch/parisc/kernel/pci-dma.c	2007-06-22 13:02:32.000000000 -0700
+++ linux-2.6/arch/parisc/kernel/pci-dma.c	2007-06-22 13:06:39.000000000 -0700
@@ -572,7 +572,7 @@ static void *pa11_dma_alloc_noncoherent(
 	void *addr = NULL;
 
 	/* rely on kmalloc to be cacheline aligned */
-	addr = kmalloc(size, flag);
+	addr = (void *)__get_free_pages(flag, get_order(size));
 	if(addr)
 		*dma_handle = (dma_addr_t)virt_to_phys(addr);
 
@@ -582,7 +582,7 @@ static void *pa11_dma_alloc_noncoherent(
 static void pa11_dma_free_noncoherent(struct device *dev, size_t size,
 					void *vaddr, dma_addr_t iova)
 {
-	kfree(vaddr);
+	free_pages((unsigned long)vaddr, get_order(size));
 	return;
 }
 

-

From: Oleg Verych
Date: Saturday, June 23, 2007 - 3:40 am

* From: Christoph Lameter
* Newsgroups: linux.kernel,linux.ports.arm.kernel

Seems like comment must be removed.
____
-

From: Russell King
Date: Sunday, June 24, 2007 - 1:38 am

I didn't approve it.  Please re-read my reply - there are still some
unanswered questions in it which _really_ need answering.

The report talks about the AT91 machines.  These machines do not have
cache coherent DMA.  Therefore, the code being patched should be
optimised away by the compiler. *Or* we have even bigger problems.

Please forward the original problem report.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-

From: Hugh Dickins
Date: Sunday, June 24, 2007 - 3:24 am

Done.
Hugh
-

From: Russell King
Date: Sunday, June 24, 2007 - 3:51 am

Okay, that seems to back up my suspicions - it's definitely AT91-based.
Since AT91-based machines do not have a DMA coherent cache,
arch_is_coherent() must be defined to '0'.  The only way that kmalloc
could be reached is if that were defined to something other than '0',
and if that's done on a machine with DMA incoherent caches, that will
lead to data corruption.

I think we need to wait for Nicolas to respond on this issue before
running headlong into applying a sticky plaster for something which is
actually a deeper issue.

However, the arch_is_coherent() path _is_ buggy as it stands, but in
more than the way identified thus far.  Eg, it doesn't set __GFP_DMA
appropriately for various DMA masks, so it might return DMA inaccessible
memory.

If we're after a simple fix for 2.6.22, the _easiest_ solution would be
to delete the entire arch_is_coherent() branches in arch/arm/mm/consistent.c;
that will result in a working solution for everyone, albiet at a slightly
lower performance for the DMA-coherent CPUs.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-

From: Hugh Dickins
Date: Sunday, June 24, 2007 - 5:25 pm

No need for Nicolas to respond, I think I've found what's "wrong":

I expect you're right, but that's a separate issue.  I had thought
you were approving Christoph's ARM patch because both you and he seemed
to agree that kmalloc was inappropriate for use in dma_alloc_coherent,
whatever additional issues you saw with it.

I still don't see why kmalloc is wrong there myself: for a while
I bought Christoph's alignment argument, but now I don't see why
(more than long) alignment is important to it.  But I'm easily

The fix for 2.6.22 is my PageSlab test in page_mapping which Linus
already put into -git.

And I now rather think that needs to stay, not be replaced by the
VM_BUG_ON Christoph was proposing for 2.6.23 (which earlier I acked).

Christoph responded to my page_mapping patch by looking at arch/arm,
and there finding a kmalloc in dma_alloc_coherent which he didn't
like; but you're right, it's entirely irrelevant to Nicolas' oops.

The slub allocation which gives rise to Nicolas' oops in not in
ARM, but (I'm guessing) in drivers/mmc/core/sd.c: one of those
	status = kmalloc(64, GFP_KERNEL);
where status is passed down for the response from mmc_sd_switch.

And what is wrong with using kmalloc there?
Why should that be changed to allocate a whole page?
How many other such cases might there be?

And the flush_dcache_page in at91mci_post_dma_read looks correct
to me too: it has just filled and perhaps also swabbed a buffer,
that buffer might in some cases be mapped into userspace, so it
calls flush_dcache_page.

In the kmalloc case it's not mapped into userspace: flush_dcache_page
should detect that and do nothing, as it does with slab; but slub was
reusing page->mapping for something else, so we oopsed.

Hugh
-

From: Nicolas Ferre
Date: Monday, June 25, 2007 - 6:55 am

AT91 _is_ defined with the arch_is_coherent() switch set to 0 (in 

Ok it is bad but, in AT91, all memory is accessible with DMA.


True, the oops appears after a mmc switch command (#6 command).

[..]

Not sure I can add much to what Hugh has said. If you need some more 
precision anyway, I will try to answer.

Regards,
-- 
Nicolas Ferre


-

From: Christoph Lameter
Date: Monday, June 25, 2007 - 7:07 am

If that is the case then what we really want is a flush_dcache_range
not the above. flush_dcache_range does not take a page struct as an 
argument and it will work on memory that has no struct page backing it.

Is flush_dcache_range available in all platforms? I see some drivers
using it:

drivers/net/fec.c
drivers/serial/mpsc.c
drivers/char/agp/uninorth-agp.c


flush_dcache_page is implemented by

sparc64		Uses mapping
sh		Ok. Only uses PG_mapped
arm		Uses mapping in the mmu case
frv		Does a kmap_atomic ?? Otherwise looks ok.
ppc		Clears PG_arch_1
mips		Uses mapping
sh64		No page struct use
parisc		Uses mapping
xtensa		Uses mapping
powerpc		Handles page flags PG_arch_1
ia64		Clears PG_arch_1
sparc		Calculates address based on page struct addr.
blackfin	Does an immediate page_address(page)
m68k		Does an immediate page_address(page)

In many situations the page struct passed to flush_dcache_page is
simply used to calculate the virtual address. So its mostly harmless.
Trouble starts when page attributes like the mapping is used.

So the problem platforms are

sparc64 arm mips parisc xtensa

If we indeed do these weird things then I think the general fix should
be to use flush_dcache_range() but that is too late for 2.6.22. The 
VM_BUG_ON will be useful to detect these scenarios. Maybe we need
to replace that with a WARN_ON or something if the usage is frequent?
There are a large number of platforms on which flush_dcache_range has
no effect or an effect that is negligible.

A kmalloc slab object (even 64 byte) may be crossing a page boundary 
with a ARCH_KMALLOC_MINALIGN of 4 or 8. So I think that 
flush_dcache_range *must* be used rather than flush_dcache_page. 
flush_dcache_page(virt_to_page(object)) takes the starting address of 
the object and flushes the page in which the object started. It may
not be the complete object. This usually works fine with 64 byte objects
because they neatly fit into a slab page. Again if CONFIG_SLAB_DEBUG
f.e. is enabled ...
From: Hugh Dickins
Date: Monday, June 25, 2007 - 9:42 am

Mostly harmless indeed.  I don't understand why you insist on trying
to complicate the situation.  flush_dcache_page is only expected to
do something on pages mapped into userspace (correct me if I'm wrong
there), it's expected to do nothing on kmalloc'ed pages.  It's
been working that way for years, and will continue to work that way
with slub, providing either page_mapping or flush_dcache_page checks
PageSlab to avoid oopsing on page->mapping.

2.6.22-rc6 has page_mapping making that check: we could argue about
which is the better site for it, there are good arguments both ways
(page_mapping is the correct place, flush_dcache_page is the more

Why????  All we require of flush_dcache_page is that it not oops on
the first page in the range: we don't need to change over to

We have the PageSlab check in page_mapping now,
I don't see what further change is needed.

Hugh
-

From: Christoph Lameter
Date: Monday, June 25, 2007 - 10:00 am

It is definitely intended to work. Otherwise we would not have code 
like this:

christoph@fly:~/linux-2.6$ find . -name "*.c" | xargs grep "flush_dcache_page"|grep virt
./drivers/scsi/scsi_tgt_if.c:   flush_dcache_page(virt_to_page(ev));

Ok. I think your patch is fine as a quick fix for 2.6.22. I am a bit 
uneasy with that given that its in such a broadly used function while its 
only use is to enable flush_dcache_page to work. But we need the general 

As explained about: There are corner cases in which it does not work. You 
seem to assume that flush_dcache_page can become a no op. That may not be 
true on platforms that need explicit cache flushing for a DMA engine to 
access a data structure. The above listed use suggests that the caller 
expects flushing to occur correctly.



-

From: Hugh Dickins
Date: Monday, June 25, 2007 - 10:23 am

I didn't claim that flush_dcache_page(virt_to_page(virt)) is not expected
to work.  I claim that flush_dcache_page is expected to be a noop rather


The scsi_tgt_if.c use you show above?  That's not dealing with
kmalloced pages, is it?  True, the pages it is dealing with don't
have page->mapping set, so those architectures which use page->mapping
to find what to do in their flush_dcache_page, won't do anything there
in their flush_dcache_page.  Whether that's a bug or not, I wouldn't
pretend to know; but it's nothing to do with the present discussion.

Please see Documentation/cachetlb.txt: flush_dcache_page is about
pagecache pages mapped into userspace.  We don't use kmalloc for those,
but we do sometimes need to flush_dcache_page in places which commonly
deal with pagecache pages, but sometimes handle kmalloc'ed buffers too.
Luckily we don't have to deal with buffers in which the first page is
kmalloced and the next comes from pagecache.

Hugh
-

From: Christoph Lameter
Date: Monday, June 25, 2007 - 11:23 am

There are no kmalloced pages. There is only kmalloced memory. You allocate 
pages from the page allocator. Its a layering violation to expect a page
struct operation on a slab object to work.

It is not okay to use a page cache function on a slab object. The slab 
object does not fulfill the alignment requirements of a page cache page
nor does it  have a compatible page struct content as a page cache page. 
This can only cause trouble.

The problem is that the code is allocating some slab memory and then 
determines the page struct pointer and then hands it back to the DMA 


This gets crazier and crazier. flush_dcache_page is for pages not for
allocated buffers via kmalloc. So this has nothing to do with any need
to flush slab objects?

Sometimes we do this and then we do that? So someone played loose
ball with the slab, was successful and that makes it right now?

We need the VM_BUG_ON in include/linux/mm.h to stop this nonsense.

The "sometimes we have kmalloced buffers" locations need to be fixed.
-

From: Hugh Dickins
Date: Monday, June 25, 2007 - 11:43 am

I've said enough, I'd better leave it to others to deter you or not
from fiddling around pointlessly here.

But Andrew, please cancel my Acked-by to mm's
add-vm_bug_on-in-case-someone-uses-page_mapping-on-a-slab-page.patch

Thanks,
Hugh
-

From: Christoph Lameter
Date: Monday, June 25, 2007 - 11:50 am

Are there any locations left after the two fixes to pa-risc and arm?

If the ARM fix solves the issue then we may not need that special casing 
anymore. Maybe the sometimes has become never?



-

From: Hugh Dickins
Date: Monday, June 25, 2007 - 12:04 pm

Please reread mails of the last 20 hours.  Your "ARM fix" may or may not
be a good thing, I don't know.  But it had nothing to do with this oops,
which (we believe) came from a kmalloc in drivers/mmc/core/sd.c.  There
are likely to be many other drivers which pass down kmalloc'ed buffers
to routines which handle both kmalloc'ed buffers and page buffers.
Though very few of those needing to flush_dcache_page.

Hugh
-

From: Christoph Lameter
Date: Tuesday, June 26, 2007 - 11:09 am

If that is so then those functions may want to check PageSlab before 
calling flush_dcache_page.

-

From: Russell King
Date: Friday, June 22, 2007 - 1:18 pm

Note, however, that the use of flush_dcache_page() on allocations
derived from dma_alloc_coherent() is undefined and unpredictable.

dma_alloc_coherent() may return a remapped virtual address which
would be invalid for things like virt_to_page() to operate on.

So the fact that dma_alloc_coherent() returning something that was
kmalloc'd and then causing flush_dcache_page() to oops is actually
a sign that there's far deeper problems.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-

From: Christoph Lameter
Date: Thursday, June 21, 2007 - 6:41 pm

The impact is only on a subset of ARM machines.

PA_RISC? It looks like they run their own flushing function for byte 
ranges called flush_kernel_dache_range. That does not use the page struct.


-

From: Hugh Dickins
Date: Thursday, June 21, 2007 - 9:46 pm

PA-RISC does have a function of that name, and I'm guessing that
you came across it in looking at the PA-RISC dma_map_single.

But PA-RISC also has a function called flush_dcache_page, which uses
page_mapping and expects a struct address_space * from it.  If that
can ever be get applied to a SLOB page (which is not so clear as in
the ARM case, but cannot easily be ruled out completely), we're
in trouble without a PageSlab test within page_mapping.

Hugh
-

From: Christoph Lameter
Date: Thursday, June 21, 2007 - 10:31 pm

A page comes from the page allocator. Access to a slab allocators "page" 
is an access to subsystem internals. Those internals vary 
(including the representation of objects in the "page") and depend on the 
slab allocator selected, the debug options in effect and parameters 
with which the slab was created etc etc.

If flush_dcache_page is applied to a slab object then we need to do a 
similar change to PA-RISC as for ARM.
-

Previous thread: how about mutual compatibility between Linux's GPLv2 and GPLv3? by Alexandre Oliva on Thursday, June 21, 2007 - 2:39 am. (68 messages)

Next thread: [RFC] mm-controller by Peter Zijlstra on Thursday, June 21, 2007 - 2:32 am. (12 messages)