Previous thread: Re: Config NO_BOOTMEM breaks my amd64 box by H. Peter Anvin on Wednesday, March 31, 2010 - 8:16 pm. (2 messages)

Next thread: [ANNOUNCE] Git 1.7.0.4 by Junio C Hamano on Wednesday, March 31, 2010 - 9:50 pm. (1 message)
From: TAO HU
Date: Wednesday, March 31, 2010 - 9:05 pm

Hi, all

We got a panic on our ARM (OMAP) based HW.
Our code is based on the 2.6.29 kernel (last commit for mm/page_alloc.c is
cc2559bccc72767cb446f79b071d96c30c26439b)

After checking vmlinux, it appears to crash while walking pcp->list in
buffered_rmqueue() of mm/page_alloc.c.
The faulting address "00100100" is LIST_POISON1, which in my view suggests
a race condition between list_add() and list_del().
However, we have not yet found any locking problem around page.lru.

Are there any known race conditions in mm/page_alloc.c?
Any other hints are highly appreciated.

 /* Find a page of the appropriate migrate type */
                if (cold) {
                   ... ...
                } else {
                        list_for_each_entry(page, &pcp->list, lru)
                                if (page_private(page) == migratetype)
                                        break;
                }

<1>[120898.805267] Unable to handle kernel paging request at virtual
address 00100100
<1>[120898.805633] pgd = c1560000
<1>[120898.805786] [00100100] *pgd=897b3031, *pte=00000000, *ppte=00000000
<4>[120898.806457] Internal error: Oops: 17 [#1] PREEMPT
... ...
<4>[120898.807861] CPU: 0    Not tainted  (2.6.29-omap1 #1)
<4>[120898.808044] PC is at get_page_from_freelist+0x1d0/0x4b0
<4>[120898.808227] LR is at get_page_from_freelist+0xc8/0x4b0
<4>[120898.808563] pc : [<c00a600c>]    lr : [<c00a5f04>]    psr: 800000d3
<4>[120898.808563] sp : c49fbd18  ip : 00000000  fp : c49fbd74
<4>[120898.809020] r10: 00000000  r9 : 001000e8  r8 : 00000002
<4>[120898.809204] r7 : 001200d2  r6 : 60000053  r5 : c0507c4c  r4 : c49fa000
<4>[120898.809509] r3 : 001000e8  r2 : 00100100  r1 : c0507c6c  r0 : 00000001
<4>[120898.809844] Flags: Nzcv  IRQs off  FIQs off  Mode SVC_32  ISA
ARM  Segment kernel
<4>[120898.810028] Control: 10c5387d  Table: 82160019  DAC: 00000017
<4>[120898.948425] Backtrace:
<4>[120898.948760] [<c00a5e3c>] (get_page_from_freelist+0x0/0x4b0)
from [<c00a6398>] ...
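
For reference, the register values in the oops above fit the LIST_POISON1
reading. A minimal user-space sketch of the arithmetic follows; it is not
kernel code. The sim_ names and the struct sim_page layout are assumptions
made for illustration, chosen so that lru sits at offset 0x18, which is what
the difference between r2 (00100100) and r3/r9 (001000e8) suggests for this
build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value list_del() writes into ->next of a removed entry (include/linux/poison.h). */
#define SIM_LIST_POISON1 0x00100100u

/* 32-bit stand-ins so the layout is identical on any host (assumed layout). */
struct sim_list_head {
	uint32_t next;
	uint32_t prev;
};

struct sim_page {
	uint32_t flags;
	uint32_t count;
	uint32_t mapcount;
	uint32_t private_data;
	uint32_t mapping;
	uint32_t index;
	struct sim_list_head lru;	/* offset 0x18 in this model */
};

/* Same arithmetic as list_entry(ptr, struct page, lru) on a 32-bit address. */
uint32_t sim_page_from_lru(uint32_t lru_addr)
{
	return lru_addr - (uint32_t)offsetof(struct sim_page, lru);
}

/* Address the list walk reads next: &page->lru.next. */
uint32_t sim_next_load_addr(uint32_t page_addr)
{
	return page_addr + (uint32_t)offsetof(struct sim_page, lru)
			 + (uint32_t)offsetof(struct sim_list_head, next);
}

/* Replays the values from the oops: r2 = 00100100, r3 = r9 = 001000e8. */
void sim_check_oops_registers(void)
{
	uint32_t page = sim_page_from_lru(SIM_LIST_POISON1);

	assert(page == 0x001000e8u);
	assert(sim_next_load_addr(page) == 0x00100100u);
}
```

If that reading is right, the numbers are consistent with the list walk
following a ->next pointer that list_del() had already poisoned, computing a
struct page address of 001000e8 from it, and faulting on the next load from
00100100.
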
From: TAO HU
Date: Thursday, April 1, 2010 - 8:51 pm

2 patches related to page_alloc.c were applied.
Does anyone see a connection between the 2 patches and the panic?
NOTE: the full patches are attached.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a596bfd..34a29e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2551,6 +2551,20 @@ static inline unsigned long wait_table_bits(unsigned long size)
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))

 /*
+ * Check if a pageblock contains reserved pages
+ */
+static int pageblock_is_reserved(unsigned long start_pfn)
+{
+	unsigned long end_pfn = start_pfn + pageblock_nr_pages;
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++)
+		if (PageReserved(pfn_to_page(pfn)))
+			return 1;
+	return 0;
+}
+
+/*
  * Mark a number of pageblocks as MIGRATE_RESERVE. The number
  * of blocks reserved is based on zone->pages_min. The memory within the
  * reserve will tend to store contiguous free pages. Setting min_free_kbytes
@@ -2579,7 +2593,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 			continue;

 		/* Blocks with reserved pages will never free, skip them. */
-		if (PageReserved(page))
+		if (pageblock_is_reserved(pfn))
 			continue;

 		block_migratetype = get_pageblock_migratetype(page);
-- 
1.5.4.3

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c44ed4..a596bfd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -119,6 +119,7 @@ static char * const zone_names[MAX_NR_ZONES] = {
 };

 int min_free_kbytes = 1024;
+int min_free_order_shift = 1;

 unsigned long __meminitdata nr_kernel_pages;
 unsigned long __meminitdata nr_all_pages;
@@ -1256,7 +1257,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		free_pages -= z->free_area[o].nr_free << o;

 		/* Require fewer higher order pages to be free */
-		min >>= 1;
+		min >>= min_free_order_shift;

 		if (free_pages <= min)
 			return 0;
-- 


From: KOSAKI Motohiro
Date: Thursday, April 1, 2010 - 10:03 pm

I think the two patches you attached are completely unrelated to your problem.

"mm: Add min_free_order_shift tunable." seems to make zero sense. I don't think this patch
needs to be merged.

But "mm: Check if any page in a pageblock is reserved before marking it MIGRATE_RESERVE"
handles strange hardware correctly, I think. If Mel acks it, I hope it gets merged.



--

From: TAO HU
Date: Thursday, April 1, 2010 - 10:19 pm

Hi, KOSAKI Motohiro

I'm glad to know you're considering the patch "mm: Check if any ..."
though that was not my original purpose :)

cc: Arve Hjønnevåg, who is the author


On Fri, Apr 2, 2010 at 1:03 PM, KOSAKI Motohiro
--

From: Mel Gorman
Date: Friday, April 2, 2010 - 2:48 am

Agreed. It's unlikely that there is a race as such in the page
allocator. In buffered_rmqueue, which you initially mentioned, the lists
being manipulated are per-cpu lists. About the only way to corrupt them
is if you had an NMI handler that called the page allocator. I really hope
your platform is not doing anything like that.

A double free of page->lru is a possibility. You could try reproducing

It makes a marginal amount of sense. Basically, what it does is allow
high-order allocations to go much further below their watermarks than is
currently allowed. If the platform in question is doing a lot of high-order
allocations, this patch could be seen to "fix" the problem, but you wouldn't
touch mainline with it with a barge pole. It would be more stable to fix
the drivers to not use high-order allocations or to use a mempool.
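
The effect described here can be seen in a small user-space model of the
order loop in zone_watermark_ok(). This is a sketch under stated assumptions,
not the kernel function: it ignores alloc_flags and lowmem_reserve, and the
sim_ names and the free-list numbers are made up for illustration. The
order_shift parameter stands in for min_free_order_shift from the patch.

```c
#include <assert.h>

#define SIM_MAX_ORDER 11

/*
 * Simplified model of the watermark check: nr_free[o] is the number of free
 * blocks of order o, mark is the watermark in pages.
 * Returns 1 if an allocation of the given order may proceed, 0 otherwise.
 */
int sim_watermark_ok(const unsigned long nr_free[SIM_MAX_ORDER], int order,
		     long mark, int order_shift)
{
	long min = mark;
	long free_pages = 0;
	int o;

	/* Total free pages across all orders, minus the request itself. */
	for (o = 0; o < SIM_MAX_ORDER; o++)
		free_pages += (long)(nr_free[o] << o);
	free_pages -= (1L << order) - 1;

	if (free_pages <= min)
		return 0;

	for (o = 0; o < order; o++) {
		/* At the next order, this order's pages become unavailable. */
		free_pages -= (long)(nr_free[o] << o);

		/* Require fewer higher-order pages to be free. */
		min >>= order_shift;

		if (free_pages <= min)
			return 0;
	}
	return 1;
}

void sim_watermark_demo(void)
{
	/* Hypothetical zone: 100 order-0, 10 order-1 and 2 order-2 blocks free. */
	const unsigned long nr_free[SIM_MAX_ORDER] = { 100, 10, 2 };

	assert(sim_watermark_ok(nr_free, 2, 64, 1) == 0);	/* stock shift */
	assert(sim_watermark_ok(nr_free, 2, 64, 4) == 1);	/* relaxed shift */
}
```

With a mark of 64 pages, the order-2 request in this made-up zone fails with
the stock shift of 1 but passes with a shift of 4, which is the "much further
below their watermarks" behaviour in miniature.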


This patch is interesting and I am surprised it is required. Is it really the
case that page blocks near the start of a zone are dominated by PageReserved
pages but the first one happens to be free? I guess it's conceivable on ARM
where memmap can be freed at boot time.

There is a theoretical problem with the patch but it is easily resolved.
A PFN walker like this must call pfn_valid_within() before calling
pfn_to_page(). If it does not, it's possible to get complete garbage
for the page, resulting in a bad dereference. In this particular case,
it would be a kernel oops rather than memory corruption though.

If that was fixed, I'd see no problem with Acking the patch.
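
One way to follow that suggestion can be sketched in user space. The model
below is an illustration under assumptions, not kernel code: the memmap is a
hypothetical array of pointers in which a NULL slot stands for a hole with no
struct page behind it, and the sim_ names are invented here. The point is only
the ordering: validity is tested before the page is looked at, and a block
containing a hole is treated as unsuitable.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_PAGEBLOCK_NR_PAGES 8UL	/* tiny pageblock for the model */

struct sim_page {
	int reserved;
};

/* Model of pfn_valid_within(): a NULL slot means there is no page to inspect. */
static int sim_pfn_valid_within(const struct sim_page *const *memmap,
				unsigned long pfn)
{
	return memmap[pfn] != NULL;
}

/* Returns 1 if the block starting at start_pfn has a hole or a reserved page. */
int sim_pageblock_is_reserved(const struct sim_page *const *memmap,
			      unsigned long start_pfn)
{
	unsigned long end_pfn = start_pfn + SIM_PAGEBLOCK_NR_PAGES;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		/* Check validity first; only then touch the page. */
		if (!sim_pfn_valid_within(memmap, pfn))
			return 1;
		if (memmap[pfn]->reserved)
			return 1;
	}
	return 0;
}

void sim_pageblock_demo(void)
{
	static const struct sim_page free_page = { 0 }, reserved_page = { 1 };
	const struct sim_page *memmap[24];
	unsigned long pfn;

	for (pfn = 0; pfn < 24; pfn++)
		memmap[pfn] = &free_page;
	memmap[3] = NULL;		/* hole inside block 0 */
	memmap[21] = &reserved_page;	/* reserved page inside block 2 */

	assert(sim_pageblock_is_reserved(memmap, 0) == 1);
	assert(sim_pageblock_is_reserved(memmap, 8) == 0);
	assert(sim_pageblock_is_reserved(memmap, 16) == 1);
}
```

Note that block 2 in the demo starts with a free page, so a check of only the
first page would have accepted it.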


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Arve Hjønnevåg
Date: Friday, April 2, 2010 - 5:59 pm

The high order allocation that caused problems was the first level
page table for each process. Each time a new process started the
kernel would empty the entire page cache to create contiguous free
memory. With the reserved pageblock mostly full (fixed by the second
patch) this contiguous memory would then almost immediately get used
for low order allocations, so the same problem starts again when the
next process starts. I agree this patch does not fix the problem, but
it does improve things when the problem hits. I have not seen a device
in this situation with the second patch applied, but I did not remove

I think this happens by default on arm. The kernel starts at offset
0x8000 to leave room for boot parameters, and in recent kernel
versions (>~2.6.26-29) this memory is freed.

I can fix this if you want the patch in mainline. I was not sure it
was acceptable since it will slow down boot on all systems, even where it



-- 
Arve Hjønnevåg
--

From: KOSAKI Motohiro
Date: Sunday, April 4, 2010 - 3:45 pm

I would like to merge the second patch first. If the same problem still occurs, please

Boot-up code is not a fast path, so a small slowdown is OK, I think.
I'm looking forward to your new version of the patch.



--

From: Mel Gorman
Date: Monday, April 5, 2010 - 3:14 am

Out of curiosity, how big is that allocation? Is it specific to
android? If it is, I guess it can be let slide but if it's common, it
would be worth thinking of an arch-hook that tells the VM that a
particular high-order is very common. For example, one possibility would
be to ask kswapd to always reclaim at a given order even if the


This is a little outside what I expected the reserved pageblock was
intended for. I expected it to be used for high-order short-lived
allocations such as required by some wireless drivers. Pagetables are a


It will not be noticeable. Only a few pageblocks are scanned per zone
and the full zone gets walked for a variety of reasons during boot
anyway. If it ever became absolutely necessary, the lowest suitable
pageblock could be identified when the bootmem allocator is being torn

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Monday, April 5, 2010 - 3:49 am

Hi, Mel and Arve.


It is specific to ARM. You can refer to get_pgd_slow() in arch/arm/mm/pgd.c.
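
As a rough worked example of the size involved: assuming the classic ARM
first-level table of 4096 entries of 4 bytes each and 4 KiB pages (both are
assumptions about this platform, not something stated in this thread), the
table is 16 KiB, i.e. an order-2 allocation. The helper name sim_get_order is
invented for this sketch.

```c
#include <assert.h>

#define SIM_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Smallest order such that (1 << order) pages cover size bytes. */
unsigned int sim_get_order(unsigned long size)
{
	unsigned long pages =
		(size + (1UL << SIM_PAGE_SHIFT) - 1) >> SIM_PAGE_SHIFT;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

void sim_pgd_order_demo(void)
{
	/* Assumed first-level table: 4096 entries * 4 bytes = 16 KiB. */
	unsigned long pgd_size = 4096UL * 4UL;

	assert(pgd_size == 16384UL);
	assert(sim_get_order(pgd_size) == 2);	/* 4 contiguous pages */
}
```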

Just out of curiosity, too.

Normally, embedded systems don't have a fork-bomb workload.
But I think Android's case is somewhat different.
That's because Dalvik (JVM) itself keeps as much memory as it can,
as anon pages, for byte codes.
So the system is always short of memory.
In addition, most embedded systems don't have swap, which makes
things worse, too.
So the current reclaimer can't work well.

I am not sure about my assumption.
Arve, is my guess right?
If so, does Dalvik have to solve this problem?
For example, AFAIK, the Android kernel has a low memory killer.
If the kernel signals memory pressure, Dalvik would have to discard some
anon pages that hold byte codes for executables.

This is just my guess about Android. If I have misunderstood Android,

Maybe it was because the system has lots of anon pages but no swap.

-- 
Kind regards,
Minchan Kim
--

From: Arve Hjønnevåg
Date: Monday, April 5, 2010 - 8:09 pm

This fixes a problem where the first pageblock got marked MIGRATE_RESERVE even
though it only had a few free pages. This in turn caused no contiguous memory
to be reserved and frequent kswapd wakeups that emptied the caches to get more
contiguous memory.

Signed-off-by: Arve Hjønnevåg <arve@android.com>
---
 mm/page_alloc.c |   16 +++++++++++++++-
 1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fb7df1d..46ade16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2860,6 +2860,20 @@ static inline unsigned long wait_table_bits(unsigned long size)
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
 /*
+ * Check if a pageblock contains reserved pages
+ */
+static int pageblock_is_reserved(unsigned long start_pfn)
+{
+	unsigned long end_pfn = start_pfn + pageblock_nr_pages;
+	unsigned long pfn;
+
+	for (pfn = start_pfn; pfn < end_pfn; pfn++)
+		if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
+			return 1;
+	return 0;
+}
+
+/*
  * Mark a number of pageblocks as MIGRATE_RESERVE. The number
  * of blocks reserved is based on min_wmark_pages(zone). The memory within
  * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
@@ -2898,7 +2912,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 			continue;
 
 		/* Blocks with reserved pages will never free, skip them. */
-		if (PageReserved(page))
+		if (pageblock_is_reserved(pfn))
 			continue;
 
 		block_migratetype = get_pageblock_migratetype(page);
-- 
1.6.5.1

--

From: Minchan Kim
Date: Monday, April 5, 2010 - 9:15 pm

It would be better to add the following description from your previous mail
to the patch description. It can help others understand it in the future.

On Fri, Apr 02, 2010 at 05:59:00PM -0700, Arve Hjønnevåg wrote:
...
"I think this happens by default on arm. The kernel starts at offset
0x8000 to leave room for boot parameters, and in recent kernel
versions (>~2.6.26-29) this memory is freed."


-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Tuesday, April 6, 2010 - 8:11 am

I would have used pageblock_reserve_suitable because what you are really
checking is "is this page block suitable for use by MIGRATE_RESERVE?".
The definition was "is the first page PageReserved" and you are changing it to
"does the page block have any memory holes or PageReserved pages?"

No biggie though. Change it if you like before upstreaming. Either way.

Acked-by: Mel Gorman <mel@csn.ul.ie>



-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KAMEZAWA Hiroyuki
Date: Thursday, April 1, 2010 - 10:04 pm

On Fri, 2 Apr 2010 11:51:33 +0800

I don't think there is any relationship between the patches and your panic.

BTW, there is another possible cause of this backtrace besides a race in
alloc_pages() itself. If someone calls list_del(&page->lru) on a page that has
already been freed, you'll see the same backtrace later.
So I suspect a use-after-free rather than a complicated race.
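
The scenario can be modelled in user space. The sketch below is illustrative
only: the sim_ names are invented, and the checked delete is a hypothetical
helper in the spirit of CONFIG_DEBUG_LIST, not the kernel's implementation. It
shows that a second delete of the same entry meets the poison values left by
the first one.

```c
#include <assert.h>
#include <stddef.h>

struct sim_list_head {
	struct sim_list_head *next, *prev;
};

/* Same numeric values as LIST_POISON1/2; never dereferenced here. */
#define SIM_LIST_POISON1 ((struct sim_list_head *)0x00100100)
#define SIM_LIST_POISON2 ((struct sim_list_head *)0x00200200)

void sim_list_init(struct sim_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert node right after head, like list_add(). */
void sim_list_add(struct sim_list_head *node, struct sim_list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Hypothetical checked delete: returns -1 instead of following poison. */
int sim_list_del_checked(struct sim_list_head *entry)
{
	if (entry->next == SIM_LIST_POISON1 || entry->prev == SIM_LIST_POISON2)
		return -1;	/* entry was already deleted once */

	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = SIM_LIST_POISON1;
	entry->prev = SIM_LIST_POISON2;
	return 0;
}

void sim_double_del_demo(void)
{
	struct sim_list_head head, a, b;

	sim_list_init(&head);
	sim_list_add(&a, &head);
	sim_list_add(&b, &head);	/* list is now head -> b -> a */

	assert(sim_list_del_checked(&a) == 0);
	assert(a.next == SIM_LIST_POISON1);	/* what a stale user would read */
	assert(sim_list_del_checked(&a) == -1);	/* second delete is caught */
	assert(head.next == &b && b.next == &head);
}
```

An unchecked second delete would instead write through the poison pointers,
and a stale walker that still holds the deleted entry would load 00100100 as
its next element.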

Thanks,

--

From: Minchan Kim
Date: Thursday, April 1, 2010 - 10:15 pm

On Fri, Apr 2, 2010 at 2:04 PM, KAMEZAWA Hiroyuki

That makes sense.
Please grep for page handling in your out-of-mainline code.
If you find any, please post it.

-- 
Kind regards,
Minchan Kim
--

From: TAO HU
Date: Friday, April 2, 2010 - 12:00 am

Hi, kamezawa hiroyu

Thanks for the hint!

Hi, Minchan Kim

Sorry, I'm not exactly sure what you mean by <grep "page handling">.
Below is the result of $ grep -n -r "list_del(&page->lru)" * in our source tree:

arch/s390/mm/pgtable.c:83:	list_del(&page->lru);
arch/s390/mm/pgtable.c:226:		list_del(&page->lru);
arch/x86/mm/pgtable.c:60:	list_del(&page->lru);
drivers/xen/balloon.c:154:	list_del(&page->lru);
drivers/virtio/virtio_balloon.c:143:		list_del(&page->lru);
fs/cifs/file.c:1780:		list_del(&page->lru);
fs/btrfs/extent_io.c:2584:		list_del(&page->lru);
fs/mpage.c:388:		list_del(&page->lru);
include/linux/mm_inline.h:37:	list_del(&page->lru);
include/linux/mm_inline.h:47:	list_del(&page->lru);
kernel/kexec.c:391:		list_del(&page->lru);
kernel/kexec.c:711:			list_del(&page->lru);
mm/migrate.c:69:		list_del(&page->lru);
mm/migrate.c:695: 		list_del(&page->lru);
mm/hugetlb.c:467:			list_del(&page->lru);
mm/hugetlb.c:509:			list_del(&page->lru);
mm/hugetlb.c:836:		list_del(&page->lru);
mm/hugetlb.c:844:			list_del(&page->lru);
mm/hugetlb.c:900:			list_del(&page->lru);
mm/hugetlb.c:1130:			list_del(&page->lru);
mm/hugetlb.c:1809:		list_del(&page->lru);
mm/vmscan.c:597:		list_del(&page->lru);
mm/vmscan.c:1148:			list_del(&page->lru);
mm/vmscan.c:1246:		list_del(&page->lru);
mm/slub.c:827:	list_del(&page->lru);
mm/slub.c:1249:	list_del(&page->lru);
mm/slub.c:1263:		list_del(&page->lru);
mm/slub.c:2419:			list_del(&page->lru);
mm/slub.c:2809:				list_del(&page->lru);
mm/readahead.c:65:		list_del(&page->lru);
mm/readahead.c:100:		list_del(&page->lru);
mm/page_alloc.c:532:		list_del(&page->lru);
mm/page_alloc.c:679:		list_del(&page->lru);
mm/page_alloc.c:741:		list_del(&page->lru);
mm/page_alloc.c:820:			list_del(&page->lru);
mm/page_alloc.c:1107:		list_del(&page->lru);
mm/page_alloc.c:4784:		list_del(&page->lru);

--

From: Minchan Kim
Date: Friday, April 2, 2010 - 12:22 am

That's not enough.
Those are the normal callers.
I expected that some bogus out-of-mainline driver uses pages directly
without enough review.

Is your kernel working well apart from this bug?
Do you see the same oops call trace (in the page allocator) whenever the
kernel panics?

I mean, if something other than the page allocator corrupts memory, you would
see other symptoms, and then we could suspect something else (H/W, another
subsystem).

-- 
Kind regards,
Minchan Kim
--

From: Minchan Kim
Date: Thursday, April 1, 2010 - 10:13 pm

They seem unrelated to the problem.
I haven't seen this problem before.

Could you git-bisect to find out which patch introduces the bug?
Is it reproducible?
Can I reproduce it in QEMU-goldfish?

-- 
Kind regards,
Minchan Kim
--

From: TAO HU
Date: Thursday, April 1, 2010 - 11:48 pm

Hi, Minchan Kim

It is hard to reproduce the problem.
We have only observed it twice in the past month.
And it randomly occurred a few more times before.

So I'm afraid neither git-bisect nor QEMU-goldfish would help.

--

From: Daniel Mack
Date: Friday, April 2, 2010 - 12:06 am

I'm sure this is just memory corruption which is unrelated to code in
the memory management area. The code there just happens to trigger
it as it is called frequently and is very sensitive to bogus data

Did you see the other thread I started off yesterday?

  http://lkml.indiana.edu/hypermail/linux/kernel/1004.0/00157.html

We could well see the same problem here. Not sure though as any kind of
memory corruption ends up in Ooopses like the ones you see, but it could
be a hint.

Daniel

--
