Re: [PATCH 53 of 66] add numa awareness to hugepage allocations

Previous thread: [PATCH 46 of 66] add x86 32bit support by Andrea Arcangeli on Wednesday, November 3, 2010 - 8:28 am. (1 message)

Next thread: [PATCH 49 of 66] mprotect: pass vma down to page table walkers by Andrea Arcangeli on Wednesday, November 3, 2010 - 8:28 am. (1 message)
From: Andrea Arcangeli
Date: Wednesday, November 3, 2010 - 8:28 am

From: Andrea Arcangeli <aarcange@redhat.com>

It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
introducing alloc_pages_vma. khugepaged needs special handling as the
allocation has to happen inside collapse_huge_page where the vma is known and
an error has to be returned to the outer loop to sleep alloc_sleep_millisecs in
case of failure. But it retains the more efficient logic of handling allocation
failures in khugepaged in case of CONFIG_NUMA=n.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -330,14 +330,17 @@ alloc_pages(gfp_t gfp_mask, unsigned int
 {
 	return alloc_pages_current(gfp_mask, order);
 }
-extern struct page *alloc_page_vma(gfp_t gfp_mask,
+extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_pages_vma(gfp_mask, order, vma, addr)	\
+	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
+#define alloc_page_vma(gfp_mask, vma, addr)	\
+	alloc_pages_vma(gfp_mask, 0, vma, addr)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -620,11 +620,26 @@ static int __do_huge_pmd_anonymous_page(
 	return ret;
 }
 
+static inline gfp_t alloc_hugepage_gfpmask(int defrag)
+{
+	return GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT);
+}
+
+static inline struct page *alloc_hugepage_vma(int defrag,
+					      struct vm_area_struct *vma,
+					      unsigned long haddr)
+{
+	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag),
+			       HPAGE_PMD_ORDER, vma, ...
From: Daisuke Nishimura
Date: Sunday, November 28, 2010 - 10:38 pm

I think this should be:

	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
#ifdef CONFIG_NUMA
		put_page(new_page);
#endif
		goto out;
	}

Thanks,
--

From: Andrea Arcangeli
Date: Monday, November 29, 2010 - 9:11 am

Hmm no, the change you suggest would generate memory corruption with
use after free.
--

From: Daisuke Nishimura
Date: Monday, November 29, 2010 - 5:38 pm

On Mon, 29 Nov 2010 17:11:03 +0100

I'm sorry if I miss something, "new_page" will be reused in !CONFIG_NUMA case
as you say, but, in CONFIG_NUMA case, it is allocated in this function
(collapse_huge_page()) by alloc_hugepage_vma(), and is not freed when memcg's
charge failed.
Actually, we do in collapse_huge_page():
	if (unlikely(!isolated)) {
		...
#ifdef CONFIG_NUMA
		put_page(new_page);
#endif
		goto out;
	}
later. I think we need a similar logic in memcg's failure path too.

Thanks,
Daisuke Nishimura.
--

From: Andrea Arcangeli
Date: Tuesday, November 30, 2010 - 12:01 pm

Apologies, you really found a minor memleak in case of memcg
accounting failure.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1726,7 +1726,7 @@ static void collapse_huge_page(struct mm
 	}
 #endif
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
-		goto out;
+		goto out_put_page;
 
 	anon_vma_lock(vma->anon_vma);
 
@@ -1755,10 +1755,7 @@ static void collapse_huge_page(struct mm
 		spin_unlock(&mm->page_table_lock);
 		anon_vma_unlock(vma->anon_vma);
 		mem_cgroup_uncharge_page(new_page);
-#ifdef CONFIG_NUMA
-		put_page(new_page);
-#endif
-		goto out;
+		goto out_put_page;
 	}
 
 	/*
@@ -1799,6 +1796,13 @@ static void collapse_huge_page(struct mm
 	khugepaged_pages_collapsed++;
 out:
 	up_write(&mm->mmap_sem);
+	return;
+
+out_put_page:
+#ifdef CONFIG_NUMA
+	put_page(new_page);
+#endif
+	goto out;
 }
 
 static int khugepaged_scan_pmd(struct mm_struct *mm,



I was too optimistic that there wasn't really a bug, I thought it was
some confusion about the hpage usage that differs with numa and not
numa.

On a side note, the CONFIG_NUMA case will later change further to move
the allocation out of the mmap_sem write mode to make the fs
submitting I/O from userland and doing memory allocations in the I/O
paths happier.
--

Previous thread: [PATCH 46 of 66] add x86 32bit support by Andrea Arcangeli on Wednesday, November 3, 2010 - 8:28 am. (1 message)

Next thread: [PATCH 49 of 66] mprotect: pass vma down to page table walkers by Andrea Arcangeli on Wednesday, November 3, 2010 - 8:28 am. (1 message)