[BUGFIX][PATCH] memory hotplug: fix notiier chain return value (Was Re: 2.6.28-rc4 mem_cgroup_charge_common panic)

Previous thread: [PATCH] Use single threaded work queue for hid_compat by Andi Kleen on Monday, November 10, 2008 - 5:51 pm. (2 messages)

Next thread: Disabling hpilo (see: sharing interrupt between PCI device) by Altobelli, David on Monday, November 10, 2008 - 5:56 pm. (1 message)
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Monday, November 10, 2008 - 5:43 pm

Hi KAME,

Thank you for the fix for online/offline page_cgroup panic.

While running memory offline/online tests ran into another
mem_cgroup panic.

Thanks,
Badari

Unable to handle kernel paging request for data at address 0x00000020
Faulting instruction address: 0xc0000000001055e4
Oops: Kernel access of bad area, sig: 11 [#2]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: c0000000001055e4 LR: c00000000010557c CTR: c0000000000bfb74
REGS: c0000000f6c7f1b0 TRAP: 0300 Tainted: G D (2.6.28-rc4)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 44044422 XER: 20000018
DAR: 0000000000000020, DSISR: 0000000042000000
TASK = c0000000f6c56cc0[4610] 'crash' THREAD: c0000000f6c7c000 CPU: 0
GPR00: c0000000e910b560 c0000000f6c7f430 c000000000b36fc0 0000000000000001
GPR04: c000000005355278 0000000000000001 0000000000000000 0000000000000000
GPR08: c000000005355290 0000000000000018 c0000000e910b558 c0000000e910b548
GPR12: 0000000000000000 c000000000b58300 00000400001ca30a 0000000000000000
GPR16: 0000000000000000 0000000000000006 c0000000d43cb5c0 c0000000e66d0b88
GPR20: 0000000000000004 0000000000000000 c0000000e64c6180 0000000000000000
GPR24: 00000000000000d0 0000000000000005 c000000000bac418 0000000000000001
GPR28: c0000000e910b538 c000000005355278 c000000000aacad8 c0000000f6c7f430
NIP [c0000000001055e4] .mem_cgroup_charge_common+0x26c/0x330
LR [c00000000010557c] .mem_cgroup_charge_common+0x204/0x330
Call Trace:
[c0000000f6c7f430] [c00000000010557c] .mem_cgroup_charge_common+0x204/0x330 (unreliable)
[c0000000f6c7f4f0] [c000000000105c70] .mem_cgroup_cache_charge+0x130/0x154
[c0000000f6c7f590] [c0000000000c29bc] .add_to_page_cache_locked+0x64/0x18c
[c0000000f6c7f640] [c0000000000c2b64] .add_to_page_cache_lru+0x80/0xe4
[c0000000f6c7f6e0] [c000000000144348] .mpage_readpages+0xc8/0x170
[c0000000f6c7f810] [c000000000182e68] .reiserfs_readpages+0x50/0x78
[c0000000f6c7f8b0] [c0000000000cee80] .__do_page_cache_readahead+0x174/0x280
[c0000000f6c7f980] [c0000000000cf6e0] .do_page_...

To: Badari Pulavarty <pbadari@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Thursday, November 13, 2008 - 7:27 am

Badari, I think you used SLUB. If so, page_cgroup's notifier callback was not
called and newly allocated page's page_cgroup wasn't allocated.
This is a fix. (notifier saw STOP_HERE flag added by slub's notifier.)

I'm now testing modified kernel, which does alloc/free page_cgroup by notifier.
(Usually, all page_cgroups are from bootmem and not freed.
so, modified a bit for test)

And I cannot reproduce panic. I think you do "real" memory hotplug other than
online/offline and saw panic caused by this.

Is this slub's behavior intentional ? page_cgroup's notifier has lower priority
than slub, now.

Thanks,
-Kame
==
notifier callback's notifier_from_errno() just works well in error
route. (It adds mask for "stop here")

Hanlder should return NOTIFY_OK in explict way.

Signed-off-by:KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/page_cgroup.c | 5 ++++-
mm/slub.c | 6 ++++--
2 files changed, 8 insertions(+), 3 deletions(-)

Index: mmotm-2.6.28-Nov10/mm/slub.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/slub.c
+++ mmotm-2.6.28-Nov10/mm/slub.c
@@ -3220,8 +3220,10 @@ static int slab_memory_callback(struct n
case MEM_CANCEL_OFFLINE:
break;
}
-
- ret = notifier_from_errno(ret);
+ if (ret)
+ ret = notifier_from_errno(ret);
+ else
+ ret = NOTIFY_OK;
return ret;
}

Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
@@ -216,7 +216,10 @@ static int page_cgroup_callback(struct n
break;
}

- ret = notifier_from_errno(ret);
+ if (ret)
+ ret = notifier_from_errno(ret);
+ else
+ ret = NOTIFY_OK;

return ret;
}

--

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Thursday, November 13, 2008 - 1:11 pm

No. I wasn't using SLUB.

# egrep "SLUB|SLAB" .config
CONFIG_SLAB=y
# CONFIG_SLUB is not set
CONFIG_SLABINFO=y
# CONFIG_DEBUG_SLAB is not set

I can test the patch and let you know.

Thanks,

--

To: Badari Pulavarty <pbadari@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Monday, November 10, 2008 - 9:14 pm

On Mon, 10 Nov 2008 13:43:28 -0800

Hm, should I avoid freeing mem_cgroup at memory Offline ?
(memmap is also not free AFAIK.)

Anyway, I'll dig this. thanks.

--

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Badari Pulavarty <pbadari@...>, Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Monday, November 10, 2008 - 10:09 pm

On Tue, 11 Nov 2008 10:14:40 +0900
it seems not the same kind of bug..

Could you give me disassemble of mem_cgroup_charge_common() ?
(I'm not sure I can read ppc asm but I want to know what is "0x20"
of fault address....)

As first impression, it comes from page migration..
rc4's page migration handler of memcg handles *usual* path but not so good.

new migration code of memcg in mmotm is much better, I think.
Could you try mmotm if you have time ?

Thanks,

--

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, November 12, 2008 - 6:02 pm

I tried mmtom. Its even worse :(

Ran into following quickly .. Sorry!!

Thanks,
Badari

Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: c000000000109e50 LR: c000000000109de8 CTR: c0000000000c2414
REGS: c0000000e66d72d0 TRAP: 0300 Not tainted (2.6.28-rc4-mm1)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 44008424 XER: 20000010
DAR: 0000000000000008, DSISR: 0000000042000000
TASK = c0000000e7dfa050[4835] 'drmgr' THREAD: c0000000e66d4000 CPU: 0
GPR00: 0000000000000020 c0000000e66d7550 c000000000b472d8 c0000000e910d958
GPR04: c00000000010a730 c0000000000b96a8 c0000000e66d7660 0000000000000000
GPR08: c000000005520440 0000000000000000 c0000000e910d960 c0000000e910d948
GPR12: 0000000024000422 c000000000b68300 00000000200957bc 0000000000000000
GPR16: 0000000000000000 c0000000e66d78f8 0000000000000000 c000000000aeea70
GPR20: 000000000000006d 0000000000000000 c000000003709e78 00000000000e6000
GPR24: 0000000000000000 0000000000000000 0000000000000001 c0000000e910d938
GPR28: c000000005520428 0000000000000001 c000000000abc220 c0000000e66d7550
NIP [c000000000109e50] .__mem_cgroup_add_list+0x98/0xec
LR [c000000000109de8] .__mem_cgroup_add_list+0x30/0xec
Call Trace:
[c0000000e66d7550] [c000000000109de8] .__mem_cgroup_add_list+0x30/0xec (unreliable)
[c0000000e66d75f0] [c00000000010a730] .__mem_cgroup_commit_charge+0x108/0x154
[c0000000e66d7690] [c00000000010adf8] .mem_cgroup_end_migration+0xb4/0x130
[c0000000e66d7730] [c000000000108c84] .migrate_pages+0x460/0x62c
[c0000000e66d7880] [c000000000106760] .offline_pages+0x398/0x5ac
[c0000000e66d7990] [c0000000001069b8] .remove_memory+0x44/0x60
[c0000000e66d7a20] [c00000000040758c] .memory_block_change_state+0x198/0x230
[c0000000e66d7ad0] [c000000000407cac] .store_mem_state+0xcc/0x144
[c0000000e66d7b70] [c0000000003fa8b0] .sysdev_store+0x74/0xa4
[c0000000e66d7c10] [c00000000017b084] .sysfs_write_file+0x128/0x1a4
[c0000000e66d7cd0] [c00000000010fb7c] .vfs_write+0xf0/0x1c4
[c0000000e66...

To: Badari Pulavarty <pbadari@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, November 12, 2008 - 10:17 pm

On Wed, 12 Nov 2008 14:02:56 -0800

the reason doesn't seem to be different from the one you saw in rc4.

We'do add_list() hear, so (maybe) used page_cgroup is zero-cleared, I think.
We usually do migration test on cpuset and confirmed this works with migration.

Hmm...I susupect following. could you try ?

Sorry.
-Kame
==

Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
@@ -166,7 +166,7 @@ int online_page_cgroup(unsigned long sta
end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);

for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
- if (!pfn_present(pfn))
+ if (!pfn_valid(pfn))
continue;
fail = init_section_page_cgroup(pfn);
}

--

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Thursday, November 13, 2008 - 2:53 pm

I tried mmtom + startpfn fix + this fix + notifier fix. Didn't help.
I am not using SLUB (using SLAB). Yes. I am testing "real" memory
remove (not just offline/online), since it executes more code of
freeing memmap etc.

Code that is panicing is list_add() in mem_cgroup_add_list().
I will debug it further.

Thanks,
Badari

Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: c000000000109e50 LR: c000000000109de8 CTR: c0000000000c2414
REGS: c0000000e653f2d0 TRAP: 0300 Not tainted (2.6.28-rc4-mm1)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 44008484 XER: 20000018
DAR: 0000000000000007, DSISR: 0000000042000000
TASK = c0000000e7db9950[4927] 'drmgr' THREAD: c0000000e653c000 CPU: 0
GPR00: 0000000000000020 c0000000e653f550 c000000000b472d8 c0000000e910d558
GPR04: c00000000010a730 c0000000000b96a8 c0000000e653f660 0000000000000000
GPR08: c000000005432358 ffffffffffffffff c0000000e910d560 c0000000e910d548
GPR12: 0000000024000482 c000000000b68300 00000000200957bc 0000000000000000
GPR16: 0000000000000000 c0000000e653f8f8 0000000000000000 c000000000aeea70
GPR20: 0000000000000056 0000000000000000 c00000000370c300 00000000000e6000
GPR24: 0000000000000000 0000000000000000 0000000000000001 c0000000e910d538
GPR28: c000000005432340 0000000000000001 c000000000abc220 c0000000e653f550
NIP [c000000000109e50] .__mem_cgroup_add_list+0x98/0xec
LR [c000000000109de8] .__mem_cgroup_add_list+0x30/0xec
Call Trace:
[c0000000e653f550] [c000000000109de8] .__mem_cgroup_add_list+0x30/0xec (unreliable)
[c0000000e653f5f0] [c00000000010a730] .__mem_cgroup_commit_charge+0x108/0x154
[c0000000e653f690] [c00000000010adf8] .mem_cgroup_end_migration+0xb4/0x130
[c0000000e653f730] [c000000000108c84] .migrate_pages+0x460/0x62c
[c0000000e653f880] [c000000000106760] .offline_pages+0x398/0x5ac
[c0000000e653f990] [c0000000001069b8] .remove_memory+0x44/0x60
[c0000000e653fa20] [c000000000407590] .memory_block_change_state+0x198/0x230
[c0000000e653fad0] [c000000000407c...

To: Badari Pulavarty <pbadari@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Friday, November 14, 2008 - 12:10 am

On Thu, 13 Nov 2008 10:53:24 -0800

Considering difference between "real" memory hotplug and logical ones,
I found this. I hope this fixes the bug.
But I myself can't do test this..

Thanks,
-Kame

==
Fixes for memcg/memory hotplug.

While memory hotplug allocate/free memmap, page_cgroup doesn't free
page_cgroup at OFFLINE when page_cgroup is allocated via bootomem.
(Because freeing bootmem requires special care.)

Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated
by memory hotplug, page_cgroup->page == page is no longer true and
we have to update that.

But current MEM_ONLINE handler doesn't check it and update page_cgroup->page
if it's not necessary to allocate page_cgroup.

And I noticed that MEM_ONLINE can be called against "part of section".
So, freeing page_cgroup at CANCEL_ONLINE will cause trouble.
(freeing used page_cgroup)
Don't rollback at CANCEL.

One more, current memory hotplug notifier is stopped by slub
because it sets NOTIFY_STOP_MASK to return vaule. So, page_cgroup's callback
never be called. (low priority than slub now.)

I think this slub's behavior is not intentional(BUG). and fixes it.

Another way to be considered about page_cgroup allocation:
- free page_cgroup at OFFLINE even if it's from bootmem
and remove specieal handler. But it requires more changes.

Signed-off-by: KAMEZAWA Hiruyoki <kamezawa.hiroyu@jp.fujitsu.com>

---
mm/page_cgroup.c | 39 +++++++++++++++++++++++++++------------
mm/slub.c | 6 ++++--
2 files changed, 31 insertions(+), 14 deletions(-)

Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
@@ -104,18 +104,30 @@ int __meminit init_section_page_cgroup(u
unsigned long table_size;
int nid, index;

- if (section->page_cgroup)
- return 0;
+ if (!section->page_cgroup) {

- nid = page_to_nid(pfn_to_page...

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Monday, November 17, 2008 - 5:30 pm

Kame,

With this patch I am able to run tests without any issues.

Sorry for delayed response, I wanted to make sure test runs fine over
the weekend.

Tested-by: Badari Pulavarty <pbadari@us.ibm.com>

Thanks,

--

To: Badari Pulavarty <pbadari@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Monday, November 17, 2008 - 9:08 pm

On Mon, 17 Nov 2008 13:30:08 -0800
Wow, Thank you!

--

To: Badari Pulavarty <pbadari@...>
Cc: Andrew Morton <akpm@...>, linux-mm <linux-mm@...>, linux-kernel <linux-kernel@...>
Date: Wednesday, November 12, 2008 - 9:09 pm

On Wed, 12 Nov 2008 14:02:56 -0800
Hmm...mmotm + start_pfn fix ?

disassemble is also helpful.

Thanks,

--

Previous thread: [PATCH] Use single threaded work queue for hid_compat by Andi Kleen on Monday, November 10, 2008 - 5:51 pm. (2 messages)

Next thread: Disabling hpilo (see: sharing interrupt between PCI device) by Altobelli, David on Monday, November 10, 2008 - 5:56 pm. (1 message)