The mm-of-the-moment snapshot 2010-04-15-14-42 has been uploaded to http://userweb.kernel.org/~akpm/mmotm/ and will soon be available at git://zen-kernel.org/kernel/mmotm.git It contains the following patches against ...
From: Randy Dunlap <randy.dunlap@oracle.com>
Fix vmstat.c to build when CONFIG_PROC_FS is disabled
but CONFIG_DEBUG_FS is enabled.
Fixes around 25 errors.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Mel Gorman <mel@csn.ul.ie>
---
mm/vmstat.c | 119 ++++++++++++++++++++++++--------------------------
1 file changed, 59 insertions(+), 60 deletions(-)
--- mmotm-2010-0415-1442.orig/mm/vmstat.c
+++ mmotm-2010-0415-1442/mm/vmstat.c
@@ -16,6 +16,7 @@
#include <linux/cpu.h>
#include <linux/vmstat.h>
#include <linux/sched.h>
+#include <linux/seq_file.h>
#include <linux/math64.h>
#ifdef CONFIG_VM_EVENT_COUNTERS
@@ -380,18 +381,57 @@ void zone_statistics(struct zone *prefer
}
#endif
-#ifdef CONFIG_PROC_FS
-#include <linux/proc_fs.h>
-#include <linux/seq_file.h>
-
-static char * const migratetype_names[MIGRATE_TYPES] = {
- "Unmovable",
- "Reclaimable",
- "Movable",
- "Reserve",
- "Isolate",
+struct contig_page_info {
+ unsigned long free_pages;
+ unsigned long free_blocks_total;
+ unsigned long free_blocks_suitable;
};
+/* Walk all the zones in a node and print using a callback */
+static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
+ void (*print)(struct seq_file *m, pg_data_t *, struct zone *))
+{
+ struct zone *zone;
+ struct zone *node_zones = pgdat->node_zones;
+ unsigned long flags;
+
+ for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
+ if (!populated_zone(zone))
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ print(m, pgdat, zone);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
+}
+
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int __fragmentation_index(unsigned int order, struct ...
memcg-move-charge-of-file-pages.patch:
when CONFIG_SHMFS is not enabled:
mm/shmem.c:2721: error: implicit declaration of function 'SHMEM_I'
mm/shmem.c:2721: warning: initialization makes pointer from integer without a cast
mm/shmem.c:2726: error: dereferencing pointer to incomplete type
mm/shmem.c:2727: error: implicit declaration of function 'shmem_swp_entry'
mm/shmem.c:2727: warning: assignment makes pointer from integer without a cast
mm/shmem.c:2734: error: implicit declaration of function 'shmem_swp_unmap'
mm/shmem.c:2735: error: dereferencing pointer to incomplete type
However, adding (needed) #include <linux/spinlock.h> to that source file
does not fix the build error.
Should CGROUP_MEM_RES_CTLR depend on SHMFS or anything else?
kernel config attached.
thanks,
---
~Randy
Thank you very much for your report.
I attach a fix patch.
===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
build fix for !CONFIG_SHMEM case.
CC mm/shmem.o
mm/shmem.c: In function 'mem_cgroup_get_shmem_target':
mm/shmem.c:2721: error: implicit declaration of function 'SHMEM_I'
mm/shmem.c:2721: warning: initialization makes pointer from integer without a cast
mm/shmem.c:2726: error: dereferencing pointer to incomplete type
mm/shmem.c:2727: error: implicit declaration of function 'shmem_swp_entry'
mm/shmem.c:2727: warning: assignment makes pointer from integer without a cast
mm/shmem.c:2734: error: implicit declaration of function 'shmem_swp_unmap'
mm/shmem.c:2735: error: dereferencing pointer to incomplete type
make[1]: *** [mm/shmem.o] Error 1
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
mm/shmem.c | 99 +++++++++++++++++++++++++++++++++++++----------------------
1 files changed, 62 insertions(+), 37 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index cb87365..6f183ef 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2568,6 +2568,43 @@ out4:
return error;
}
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/**
+ * mem_cgroup_get_shmem_target - find a page or entry assigned to the shmem file
+ * @inode: the inode to be searched
+ * @pgoff: the offset to be searched
+ * @pagep: the pointer for the found page to be stored
+ * @ent: the pointer for the found swap entry to be stored
+ *
+ * If a page is found, refcount of it is incremented. Callers should handle
+ * these refcount.
+ */
+void mem_cgroup_get_shmem_target(struct inode *inode, pgoff_t pgoff,
+ struct page **pagep, swp_entry_t *ent)
+{
+ swp_entry_t entry = { .val = 0 }, *ptr;
+ struct page *page = NULL;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+
+ if ((pgoff << PAGE_CACHE_SHIFT) >= i_size_read(inode))
+ goto out;
+
+ spin_lock(&info->lock);
+ ptr = shmem_swp_entry(info, pgoff, NULL);
+ if (ptr ...
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
--
~Randy
--
mmotm 2010-04-15-14-42
When I tried
# echo 0 > /proc/sys/vm/compaction
I saw the following.
My environment was
2.6.34-rc4-mm1+ (2010-04-15-14-42) (x86-64) CPUx8
allocating tons of hugepages and reducing free memory.
What I did was:
# echo 0 > /proc/sys/vm/compact_memory
Hmm, this is the first time I have seen this kind of error at migration.
my.config is attached. Hmm... ?
(I'm sorry, I'll be offline soon.)
-Kame
==
Apr 19 18:55:04 localhost kernel: BUG: unable to handle kernel paging request at ffff8806213ff000
Apr 19 18:55:04 localhost kernel: IP: [<ffffffff812ae3a5>] copy_page_c+0x5/0x10
Apr 19 18:55:04 localhost kernel: PGD 1a43063 PUD 50d5067 PMD 51df067 PTE 80000006213ff160
Apr 19 18:55:04 localhost kernel: Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Apr 19 18:55:04 localhost kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:1d.3/usb5/devnum
Apr 19 18:55:04 localhost kernel: CPU 1
Apr 19 18:55:04 localhost kernel: Modules linked in: sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 uinput e1000e bnx2 shpchp i5000_edac edac_core i2c_i801 i2c_core ppdev i5k_amb parport_pc parport ioatdma dca iTCO_wdt iTCO_vendor_support pcspkr kvm_intel kvm dm_multipath megaraid_sas [last unloaded: microcode]
Apr 19 18:55:04 localhost kernel:
Apr 19 18:55:04 localhost kernel: Pid: 2427, comm: bash Tainted: G W 2.6.34-rc4-mm1+ #1 D2519/PRIMERGY
Apr 19 18:55:04 localhost kernel: RIP: 0010:[<ffffffff812ae3a5>] [<ffffffff812ae3a5>] copy_page_c+0x5/0x10
Apr 19 18:55:04 localhost kernel: RSP: 0018:ffff88061c025b70 EFLAGS: 00010286
Apr 19 18:55:04 localhost kernel: RAX: ffff880000000000 RBX: ffffea0003801180 RCX: 0000000000000200
Apr 19 18:55:04 localhost kernel: RDX: 6db6db6db6db6db7 RSI: ffff880100050000 RDI: ffff8806213ff000
Apr 19 18:55:04 localhost kernel: RBP: ffff88061c025b98 R08: 0000000000000048 R09: 0000000000000001
Apr 19 ...
That's ok, thank you for the report.
I'm afraid I made little progress as I spent most of the day on other bugs
but I do have something for you.
First, I reproduced the problem using your .config. However, the problem does
not manifest with the .config I normally use which is derived from the distro
kernel configuration (Debian Lenny). So, there is something in your .config
that triggers the problem. I very strongly suspect this is an interaction
between migration, compaction and page allocation debug. Compaction takes
pages directly off the buddy list and I bet you a shiny penny they are still
unmapped when the copy takes place resulting in your oops.
I'll verify the theory tomorrow but it's a plausible explanation.
On a different note, where did config options like the following come from?
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
I don't think they are a factor but I'm curious.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
I unexpectedly had the time to dig into this. Does the following patch fix
your problem? It Worked For Me.
==== CUT HERE ====
mm,compaction: Map free pages in the address space after they get split for compaction
split_free_page() is a helper function which takes a free page from the
buddy lists and splits it into order-0 pages. It is used by memory
compaction to build a list of destination pages. If
CONFIG_DEBUG_PAGEALLOC is set, a kernel paging request bug is triggered
because split_free_page() did not call the arch-allocation hooks or map
the page into the kernel address space.
This patch does not update split_free_page() as it is called with
interrupts disabled. Instead it documents that callers of split_free_page()
are responsible for calling the arch hooks and mapping the page, and it
fixes compaction accordingly.
This is a fix to the patch mm-compaction-memory-compaction-core.patch.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/compaction.c | 6 ++++++
mm/page_alloc.c | 3 +++
2 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 8f4c518..6218e03 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -184,6 +184,12 @@ static void isolate_freepages(struct zone *zone,
}
spin_unlock_irqrestore(&zone->lock, flags);
+ /* split_free_page does not map the pages */
+ list_for_each_entry(page, freelist, lru) {
+ arch_alloc_page(page, 0);
+ kernel_map_pages(page, 1, 1);
+ }
+
cc->free_pfn = high_pfn;
cc->nr_freepages = nr_freepages;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 53442fd..b2af4d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1211,6 +1211,9 @@ void split_page(struct page *page, unsigned int order)
/*
* Similar to split_page except the page is already free. As this is only
* being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_page() ...
On Mon, 19 Apr 2010 20:39:19 +0100
Ok, works for me, too.
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Thank you.
--
Dumb question.
Why can't we call arch_alloc_page and kernel_map_pages with interrupts
disabled? Is it a deadlock issue or a latency issue?
I couldn't find any comment about it. A comment should have been added
around those functions. :)
Also, right now only compaction uses split_free_page, yet it is exposed
through mm.h. I think it would be better to map the pages inside
split_free_page if it is exported to others (i.e., make it a generic
function).
If we can't do that, how about making split_free_page a static function
and using it only in compaction?
--
Kind regards,
Minchan Kim
--
On Tue, 20 Apr 2010 11:39:46 +0900
I guess it's for the same reason as vfree(), which can't be called with
irqs disabled.
Both of them have to flush the TLB of all cpus. To flush the TLB of other cpus, a cpu has
to send an IPI via smp_call_function. What I know from old stories is below.
When sending an IPI, the usual sequence is the following. (This may be old.)
spin_lock(&ipi_lock);
set up cpu mask for getting notification from other cpus declaring
"I received IPI and finished my own work".
spin_unlock(&ipi_lock);
Then,
CPU0                        CPU1
irq_disable (somewhere)     spin_lock
                            send IPI and wait for notification.
spin_lock()
deadlock. Seeing the description of kernel/smp.c::smp_call_function_many(), it says
this function should not be called with irqs disabled.
(Maybe the same kind of spin-wait deadlock can happen.)
Thanks,
-Kame
--
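The deadlock condition described above can be written down as a small userspace model. This is only a sketch of the reasoning, not kernel code: `struct cpu_state`, `ipi_deadlock()` and `table_scenario_deadlocks()` are hypothetical names invented for illustration, and the model assumes (as in the "old stories" sequence) that the sender keeps the lock until the receiver acknowledges the IPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state, for illustration only. */
struct cpu_state {
    bool irqs_disabled;     /* cannot take the IPI interrupt */
    bool holds_ipi_lock;    /* owns the (hypothetical) ipi_lock */
    bool spins_on_ipi_lock; /* busy-waiting for ipi_lock */
    bool waits_for_ack;     /* sent an IPI, waiting for the receiver */
};

/*
 * True when neither CPU can make progress: the sender keeps ipi_lock
 * until the receiver acknowledges the IPI, while the receiver spins on
 * ipi_lock with interrupts off and therefore never sees the IPI.
 */
static bool ipi_deadlock(const struct cpu_state *sender,
                         const struct cpu_state *receiver)
{
    assert(sender != NULL && receiver != NULL);

    bool sender_stuck = sender->holds_ipi_lock && sender->waits_for_ack;
    bool receiver_stuck = receiver->irqs_disabled &&
                          receiver->spins_on_ipi_lock;

    return sender_stuck && receiver_stuck;
}

/* The table scenario: CPU1 sends the IPI, CPU0 then spins on the lock. */
static bool table_scenario_deadlocks(bool cpu0_irqs_disabled)
{
    struct cpu_state cpu1 = { .holds_ipi_lock = true, .waits_for_ack = true };
    struct cpu_state cpu0 = { .irqs_disabled = cpu0_irqs_disabled,
                              .spins_on_ipi_lock = true };

    return ipi_deadlock(&cpu1, &cpu0);
}
```

In this model `table_scenario_deadlocks(true)` reports a deadlock, while `table_scenario_deadlocks(false)` does not, because a CPU that spins with interrupts enabled can still service the IPI and send its acknowledgement.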
On Tue, Apr 20, 2010 at 12:07 PM, KAMEZAWA Hiroyuki
Thanks for the kind explanation.
Actually I guessed it was a TLB issue, but I can't find any glue point that
connects the tlb flush to smp_call_function_xxx. :(
Now look at __native_flush_tlb_global. It just reads and writes cr4 with
X86_CR4_PGE masked off.
So I don't know how this connects to smp_schedule_xxxx.
Hmm, maybe APIC?
Sorry for the dumb question.
--
Kind regards,
Minchan Kim
--
On Tue, 20 Apr 2010 12:58:43 +0900
Hmm...seeing again, arch/x86/mm/pageattr.c::kernel_map_pages() says:
/*
 * We should perform an IPI and flush all tlbs,
 * but that can deadlock->flush only current cpu:
 */
Wow. It just flushes only the local cpu. Then, no IPI.
Hmm...do all other archs do the same thing?
If so, kernel_map_pages() can be called with irqs disabled.
The author of kernel_map_pages() is aware that this can be called with irqs disabled.
Hmm...
Thanks,
-Kame
--
In theory, it isn't known what arch_alloc_page is going to do but more
practically kernel_map_pages() is updating mappings and should be flushing
all the TLBs. It can't do that with interrupts disabled.
I checked X86 and it should be fine but only because it flushes the
local CPU and appears to just hope for the best that this doesn't cause
I'm not aware of any. arch_alloc_page() is only used by s390 so it's not
well known. kernel_map_pages() is only active for a rarely used
I considered that and it would not be ideal. It would have to disable and
reenable interrupts as each page is taken from the list or alternatively
require that the caller not have the zone lock taken. The latter of these
options is more reasonable but would still result in more interrupt enabling
and disabling.
split_free_page() is extremely specialised and requires knowledge of the
page allocator internals to call properly. There is little pressure to
It pretty much has to be in page_alloc.c because it uses internal
functions of the page allocator - e.g. rmv_page_order. I could move it to
mm/internal.h because whatever about split_page, I can't imagine why
anyone else would need to call split_free_page.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
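The approach taken by the fix (collect the pages while the lock is held, map them only after the lock is dropped) can be sketched as a userspace toy model. This is not the kernel code: every `demo_*` name is hypothetical, the lock is reduced to a flag, and `demo_map_page()` merely stands in for the arch_alloc_page() and kernel_map_pages() calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_DEMO_PAGES 4

/* Stand-ins for kernel state; all names here are hypothetical. */
struct demo_page {
    bool isolated;
    bool mapped;
};

static bool demo_irqs_disabled;  /* models "zone->lock held, irqs off" */
static int demo_maps_under_lock; /* counts mappings done with irqs off */

static void demo_lock_irqsave(void)      { demo_irqs_disabled = true; }
static void demo_unlock_irqrestore(void) { demo_irqs_disabled = false; }

/* Models arch_alloc_page() + kernel_map_pages() for a single page. */
static void demo_map_page(struct demo_page *page)
{
    if (demo_irqs_disabled)
        demo_maps_under_lock++; /* would be the problematic case */
    page->mapped = true;
}

/* Models the fixed isolate_freepages(): isolate first, map afterwards. */
static size_t demo_isolate_freepages(struct demo_page *pages, size_t n)
{
    size_t i, nr = 0;

    assert(pages != NULL || n == 0);

    demo_lock_irqsave();
    for (i = 0; i < n; i++) {
        pages[i].isolated = true; /* split_free_page() does not map */
        nr++;
    }
    demo_unlock_irqrestore();

    /* Mapping happens only once interrupts are enabled again. */
    for (i = 0; i < n; i++)
        if (pages[i].isolated)
            demo_map_page(&pages[i]);

    return nr;
}
```

Calling `demo_isolate_freepages()` on an array of `NR_DEMO_PAGES` zero-initialised pages leaves every page mapped while `demo_maps_under_lock` stays at zero, which is the property the real patch aims for by walking the free list after spin_unlock_irqrestore().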
Yes. Then, let's add a comment like split_page. :)
/*
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.
 */
--
Kind regards,
Minchan Kim
--
I can, but the comment that was there says it's like split_page except the
page is already free. This also covers not using it in a driver.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
I see. In addition, you already mentioned "As this is only being used for
migration". I missed that. I have no objection to it.
Will you repost a v2 that moves split_free_pages out of compaction.c?
Anyway, feel free to add my Reviewed-by.
Thanks, Mel.
--
Kind regards,
Minchan Kim
--
I don't understand your suggestion. split_free_pages is already out of
Thanks
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
Ahh. Sorry, it's my fault. I was confused. Please forget it.
--
Kind regards,
Minchan Kim
--
On Mon, 19 Apr 2010 20:39:19 +0100
Sorry, I think I hit another? error again. (sorry, no log.)
What I did was...
Running 2 shells.
while true; do make -j 16;make clean;done
and
while true; do echo 0 > /proc/sys/vm/compact_memory;done
Using the same config.
Apr 21 17:27:47 localhost kernel: ------------[ cut here ]------------
Apr 21 17:27:47 localhost kernel: kernel BUG at include/linux/swapops.h:105!
Apr 21 17:27:47 localhost kernel: invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
Apr 21 17:27:47 localhost kernel: last sysfs file: /sys/devices/virtual/net/br0/statistics/collisions
Apr 21 17:27:47 localhost kernel: CPU 3
Apr 21 17:27:47 localhost kernel: Modules linked in: fuse sit tunnel4 ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 dm_multipath uinput ioatdma ppdev parport_pc i5000_edac bnx2 iTCO_wdt edac_core iTCO_vendor_support shpchp parport e1000e kvm_intel dca kvm i2c_i801 i2c_core i5k_amb pcspkr megaraid_sas [last unloaded: microcode]
Apr 21 17:27:47 localhost kernel:
Apr 21 17:27:47 localhost kernel: Pid: 27892, comm: cc1 Tainted: G W 2.6.34-rc4-mm1+ #4 D2519/PRIMERGY
Apr 21 17:27:47 localhost kernel: RIP: 0010:[<ffffffff8114e9cf>] [<ffffffff8114e9cf>] migration_entry_wait+0x16f/0x180
Apr 21 17:27:47 localhost kernel: RSP: 0000:ffff88008d9efe08 EFLAGS: 00010246
Apr 21 17:27:47 localhost kernel: RAX: ffffea0000000000 RBX: ffffea0000241100 RCX: 0000000000000001
Apr 21 17:27:47 localhost kernel: RDX: 000000000000a4e0 RSI: ffff880621a4ab00 RDI: 000000000149c03e
Apr 21 17:27:47 localhost kernel: RBP: ffff88008d9efe38 R08: 0000000000000000 R09: 0000000000000000
Apr 21 17:27:47 localhost kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff880621a4aae8
Apr 21 17:27:47 localhost kernel: R13: 00000000bf811000 R14: 000000000149c03e R15: 0000000000000000
Apr 21 17:27:47 localhost kernel: FS: ...
On Wed, 21 Apr 2010 17:28:38 +0900
It seems that this is a new error.
static inline struct page *migration_entry_to_page(swp_entry_t entry)
{
struct page *p = pfn_to_page(swp_offset(entry));
/*
* Any use of migration entries may only occur while the
* corresponding page is locked
*/
BUG_ON(!PageLocked(p));
return p;
}
This hits the BUG_ON()....so the page the migration_entry points to is unlocked.
But we always do
lock_page(old_page);
unmap(old_page);
remap(new_page);
unlock_page(old_page);
So....some pte wasn't updated at remap ?
Hmm.
-Kame
--
I'm working on reproducing the problem. I've hit it only once. My stress
tests were using dd instead of make like yours did and my
compilation-orientated test would not have been hitting compaction as hard.
The theory I'm working on is that it's a PageSwapCache page that was
unmapped and not remapped (remap_swapcache == 0) in move_to_new_page(). In
this case, the page would be migrated, left in place and unlocked. Later
when a swap fault occurred, the migration PTE is found and the bug_on
triggers i.e. the bug check is no longer valid because it is possible for
an unlocked migration pte to be left behind.
Trying to reproduce with some instrumentation in place documenting pages
left behind but haven't managed to trigger it a second time yet.
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
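The theory above, which is still unverified at this point in the thread, can be illustrated with a toy userspace model. This is only a sketch of the suspected sequence, not the kernel's migration code: the `mig_*` names are hypothetical, and each step is reduced to flipping a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct page and a page table entry. */
struct mig_page {
    bool locked;
};

struct mig_pte {
    bool is_migration_entry;
    struct mig_page *page;
};

/* Models the BUG_ON(!PageLocked(p)) check in migration_entry_to_page(). */
static bool mig_fault_hits_bug(const struct mig_pte *pte)
{
    return pte->is_migration_entry && !pte->page->locked;
}

/* Models the suspected path through unmap_and_move(). */
static void mig_unmap_and_move(struct mig_page *page, struct mig_pte *pte,
                               bool remap_swapcache)
{
    assert(!page->locked);

    page->locked = true;             /* lock_page() */
    pte->is_migration_entry = true;  /* try_to_unmap() installs the entry */
    pte->page = page;

    /* ... the page contents would be copied here ... */

    if (remap_swapcache)
        pte->is_migration_entry = false; /* migration ptes are removed */

    page->locked = false;            /* unlock_page() */
}

/* A fault that arrives after migration has finished and unlocked. */
static bool mig_later_fault_hits_bug(bool remap_swapcache)
{
    struct mig_page page = { .locked = false };
    struct mig_pte pte = { .is_migration_entry = false, .page = &page };

    mig_unmap_and_move(&page, &pte, remap_swapcache);
    return mig_fault_hits_bug(&pte);
}
```

In the model, `mig_later_fault_hits_bug(false)` is true: with the remap skipped, a migration entry outlives the page lock, so a later fault finds an unlocked page behind it. With `remap_swapcache` set, the entry is gone before the unlock and the check cannot fire.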
Hi, Mel.
Hmm. How about this situation?
CPU A                                   CPU B
1. unmap_and_move
2. lock_page
3. PageAnon && !page_mapped
   && PageSwapCache
                                        3' do_fork
4. remap_swapcache = 0
                                        4' pte lock, page_dup_rmap
                                           <- race happens
5. try_to_unmap
   - make migration entry by 4'
6. move_to_newpage
7. don't call remove_migration
   due to 4
                                        8. do_swap_page
                                        9. migration_entry_wait
                                        10. goto out
                                        11. fault!
In this case, the process on CPU B will be killed although it passes the
PageLocked check. So I think we have to find another method.
I might be wrong since I'm nearly falling asleep. :(
--
Kind regards,
Minchan Kim
--
Yes. I was wrong. I seem to have missed detach_vma before unmap_region.
Sorry, please ignore this. :(
--
Kind regards,
Minchan Kim
--
On Mon, 19 Apr 2010 19:14:42 +0100
Hmm ? arch/x86/Kconfig.
config ARCH_HWEIGHT_CFLAGS
string
default "-fcall-saved-ecx -fcall-saved-edx" if X86_32
default "-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11" if X86_64
Seems to be from
patches/x86-add-optimized-popcnt-variants.patch
Thanks,
-Kame
--
