This patchkit supports GB pages for hugetlb on x86-64 in addition to 2MB pages. This is the successor of an earlier, much simpler patchkit that allowed setting the hugepagesz globally at boot to 1GB pages. The advantage of this more complex patchkit is that it allows 2MB page users and 1GB page users to coexist (although not on the same hugetlbfs mount points).

It first adds some straightforward infrastructure to hugetlbfs to support multiple page sizes. Then it uses that infrastructure to implement support for huge pages > MAX_ORDER (which can be allocated at boot with bootmem only). Then the x86-64 port is extended to support 1GB pages on CPUs that support them (AMD Quad Cores).

There is no support for i386 because GB pages are only available in long mode.

The variable page size support is currently limited to the specific use case of the single additional 1GB page size. Using it for more page sizes (especially those < MAX_ORDER) would require some more work, although the basic infrastructure is all in place and the incremental work will be small. But I didn't bother to implement some corner cases not needed for the GB page case. I usually added comments so they should be easy to find (and fix) later, however :)

I also hacked in cpuset support. It would be good if Paul double checked that.

GB pages are only intended to be used in special situations, like dedicated databases where complicated configuration does not matter. That is why they have some limitations:
- Can only be allocated at boot (using hugepagesz=1G hugepages=...)
- Can't be freed at runtime
- One hugetlbfs mount per page size (using the pagesize=... mount option). This is a little awkward, but greatly simplified the code.
- No IPC SHM support currently (would not be very hard to do, but it is unclear what the best API for this is. Suggestions welcome)

Some of this would be fixable later.

Known issues:
- GB pages are not reported in total memory, which gives confusing free(1) ...
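To make the MAX_ORDER remark in the cover letter concrete, here is a small sketch of the order arithmetic. It assumes 4KB base pages (a page shift of 12) and a MAX_ORDER of 11, which were the usual x86-64 defaults at the time; the constants and helper names are illustrative and do not come from the patchkit.

```c
#include <stdbool.h>

/* Assumed values, typical for x86-64 in this era; not from the patchkit. */
#define SKETCH_PAGE_SHIFT 12   /* 4KB base pages */
#define SKETCH_MAX_ORDER  11   /* buddy allocator serves orders 0..10 */

/* Order of a huge page: log2 of how many base pages it spans. */
static unsigned sketch_huge_order(unsigned huge_shift)
{
	return huge_shift - SKETCH_PAGE_SHIFT;
}

/* True when the buddy allocator could hand out a page of this order. */
static bool sketch_buddy_can_allocate(unsigned order)
{
	return order < SKETCH_MAX_ORDER;
}
```

Under these assumptions a 2MB page (shift 21) is order 9 and can come from the buddy allocator, while a 1GB page (shift 30) is order 18 and has to be carved out of bootmem at boot.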
On Mon, 2008-03-17 at 02:58 +0100, Andi Kleen wrote:

I bet copy_hugetlb_page_range() is causing your complaints. It takes the dest_mm->page_table_lock followed by src_mm->page_table_lock inside a loop and hasn't yet been converted to call spin_lock_nested().

I am not sure how well LTP is tracking mainline development in this area. How do these patches do with the libhugetlbfs test suite? We are adding support for ginormous pages (1GB, 16GB, etc) but it is not complete. Should run fine with 2M pages though. Before you ask, here is the link: http://libhugetlbfs.ozlabs.org/snapshots/libhugetlbfs-dev-20080310.tar.gz
-- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center --
Yes. Looking at the warning I'm not sure why lockdep doesn't filter it out automatically. I cannot think of a legitimate case where a "possible recursive lock" with different lock addresses would be a genuine bug. I wasn't aware of that one. -Andi --
Libhugetlbfs comes with a rigorous functional test suite. It has test cases for specific bugs that have since been fixed. I ran it on your patches and got an oops around hugetlb_overcommit_handler() when running the 'counters' test. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center --
Andi,
Are all the "interesting" cpuset related changes in patch:
[PATCH] [1/18] Convert hugeltlb.c over to pass global state around in a structure
?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
That one and Add basic support for more than one hstate in hugetlbfs and partly Add support to have individual hstates for each hugetlbfs mount It all builds on each other. Ideally look at the end result of the whole series. -Andi --
Ok. Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I just updated to 2.6.25-rc6 base on ftp://firstfloor.org/pub/ak/gbpages/patches/ and gave it a quick test. So you can use that one too. It only had a single easy reject. -Andi --
What kernel version is this patchset against ... apparently not 2.6.25-rc5-mm1.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Large, but rather mechanical patch that converts most of the hugetlb.c globals into structure members and passes them around. Right now there is only a single global hstate structure, but most of the infrastructure to extend it is there.
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/ia64/mm/hugetlbpage.c | 2
arch/powerpc/mm/hugetlbpage.c | 2
arch/sh/mm/hugetlbpage.c | 2
arch/sparc64/mm/hugetlbpage.c | 2
arch/x86/mm/hugetlbpage.c | 2
fs/hugetlbfs/inode.c | 45 +++---
include/linux/hugetlb.h | 70 +++++++++
ipc/shm.c | 3
mm/hugetlb.c | 295 ++++++++++++++++++++++--------------------
mm/memory.c | 2
mm/mempolicy.c | 10 -
mm/mmap.c | 3
12 files changed, 269 insertions(+), 169 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -22,30 +22,24 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
-static unsigned long nr_huge_pages, free_huge_pages, resv_huge_pages;
-static unsigned long surplus_huge_pages;
-static unsigned long nr_overcommit_huge_pages;
unsigned long max_huge_pages;
unsigned long sysctl_overcommit_huge_pages;
-static struct list_head hugepage_freelists[MAX_NUMNODES];
-static unsigned int nr_huge_pages_node[MAX_NUMNODES];
-static unsigned int free_huge_pages_node[MAX_NUMNODES];
-static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-static int hugetlb_next_nid;
+
+struct hstate global_hstate;
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
static DEFINE_SPINLOCK(hugetlb_lock);
-static void clear_huge_page(struct page *page, unsigned long addr)
+static void clear_huge_page(struct page *page, ...
I didn't see anything fundamentally wrong with this... In fact it is looking really nice notwithstanding the minor nits below. Could you define a macro for (1 << huge_page_order(h))? It is used at least 4 times. How about something like pages_per_huge_page(h) or something? I think that would convey the meaning more clearly. Whitespace? Whitespace? -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center --
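The helper suggested above can be sketched as follows. This is a minimal illustration, not the code from the patch: the structure is reduced to the single order field, the names carry a sketch_ prefix to mark them as hypothetical, and 4KB base pages are assumed.

```c
/* Assumed base page shift (4KB pages); illustrative only. */
#define SKETCH_PAGE_SHIFT 12

/* Minimal stand-in for the per-page-size state; the real structure
 * in the patch carries counters and free lists as well. */
struct sketch_hstate {
	unsigned int order;   /* log2(base pages per huge page) */
};

static inline unsigned int sketch_huge_page_order(const struct sketch_hstate *h)
{
	return h->order;
}

/* Huge page size in bytes. */
static inline unsigned long sketch_huge_page_size(const struct sketch_hstate *h)
{
	return 1UL << (h->order + SKETCH_PAGE_SHIFT);
}

/* The helper suggested in review: number of base pages per huge page. */
static inline unsigned long sketch_pages_per_huge_page(const struct sketch_hstate *h)
{
	return 1UL << sketch_huge_page_order(h);
}
```

A loop bound such as i < sketch_pages_per_huge_page(h) then reads more clearly than an open-coded shift or a sz/PAGE_SIZE division.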
hstate isn't a particularly informative name as it's the state of what?
At a glance, someone may think it's a per-mount state where I am
expecting that multiple mounts using the same pagesize will share the
same pool.
This is not the patch to do it but it'll be worth looking at moving
hugetlb_lock into hstate later for workloads using different pagesizes
hpage_size instead of sz to match the old define HPAGE_SIZE but to reflect
it is potentially no longer a constant?
That said, when calling clear_huge_page(), the caller has the VMA and could
pass a struct hstate * instead of sz here, more on what that may be useful
If you passed the hstate, and had a helper like
static inline int basepages_per_hpage(struct hstate *h)
{
	return 1 << huge_page_order(h);
}
you could have i < basepages_per_hpage(h) here and use it in a number of
places throughout the patch. (suggestions on a better name are welcome)
sz/PAGE_SIZE is not very self-explanatory (hpage_size is a little easier)
hmm, when there are multiple struct hstates later, you are going to need to
distinguish between them otherwise pages of the wrong size will end up on the
wrong pool. As you are getting a compound page, I am guessing you distinguish
based on size. This patch in isolation, it's fine but needs to be watched
for as if it is overlooked, it'll cause oopses or memory corruption when
Similar comment to free_huge_page(), if more than two hugepage sizes exist,
the boot parameters will need to distinguish which pool is being referred to.
global_hstate would be replaced by the default_hugepage_pool here I would
Unwritten assumption here that HPAGE_SIZE != PAGE_SIZE. Probably a safe
assumption though.
hmm, unrelated to this patch but that printk() is misleading. The language
implies it is size in bytes but the value is in pages. As you are changing the
code anyway, do you care to print out the size of the pages being allocated
Similar comments to free_huge_page(), will need to ...
- Convert hstates to an array
- Add a first default entry covering the standard huge page size
- Add functions for architectures to register new hstates
- Add basic iterators over hstates
Signed-off-by: Andi Kleen <ak@suse.de>
---
include/linux/hugetlb.h | 10 +++++++++-
mm/hugetlb.c | 46 +++++++++++++++++++++++++++++++++++++---------
2 files changed, 46 insertions(+), 10 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -27,7 +27,15 @@ unsigned long sysctl_overcommit_huge_pag
static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
unsigned long hugepages_treat_as_movable;
-struct hstate global_hstate;
+static int max_hstate = 1;
+
+struct hstate hstates[HUGE_MAX_HSTATE];
+
+/* for command line parsing */
+struct hstate *parsed_hstate __initdata = &global_hstate;
+
+#define for_each_hstate(h) \
+ for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
@@ -474,15 +482,11 @@ static struct page *alloc_huge_page(stru
return page;
}
-static int __init hugetlb_init(void)
+static int __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
- struct hstate *h = &global_hstate;
- if (HPAGE_SHIFT == 0)
- return 0;
-
- if (!h->order) {
+ if (h == &global_hstate && !h->order) {
h->order = HPAGE_SHIFT - PAGE_SHIFT;
h->mask = HPAGE_MASK;
}
@@ -497,11 +501,34 @@ static int __init hugetlb_init(void)
break;
}
max_huge_pages = h->free_huge_pages = h->nr_huge_pages = i;
- printk("Total HugeTLB memory allocated, %ld\n", h->free_huge_pages);
+
+ printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
+ h->free_huge_pages,
+ 1 << (h->order + PAGE_SHIFT - 20));
return 0;
}
+
+static int __init hugetlb_init(void)
+{
+ if (HPAGE_SHIFT == 0)
+ return 0;
+ return hugetlb_init_hstate(&global_hstate);
+}
...
I'd like to avoid assuming the huge page size is some multiple of MB. PowerPC will have a 64KB huge page. Granted, you do fix this in a later patch, so as long as the whole series goes together this shouldn't cause
Since mask can always be derived from order, is there a reason we don't always calculate it? I guess it boils down to storage cost vs. calculation cost and I don't feel too strongly either way.
-- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center --
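Both review points can be illustrated with a short sketch: deriving the mask from the order on demand, and printing a page size without assuming it is a multiple of 1MB. The function names and the 4KB base page size are assumptions for illustration; neither function is part of the patch.

```c
#include <stdio.h>

#define SKETCH_PAGE_SHIFT 12  /* assumed 4KB base pages */

/* Mask that clears the offset bits within a huge page of this order. */
static unsigned long sketch_mask_from_order(unsigned int order)
{
	return ~((1UL << (order + SKETCH_PAGE_SHIFT)) - 1);
}

/* Format a page size without assuming it is a multiple of 1MB. */
static int sketch_format_size(char *buf, size_t len, unsigned long bytes)
{
	if (bytes >= (1UL << 30) && (bytes & ((1UL << 30) - 1)) == 0)
		return snprintf(buf, len, "%luGB", bytes >> 30);
	if (bytes >= (1UL << 20) && (bytes & ((1UL << 20) - 1)) == 0)
		return snprintf(buf, len, "%luMB", bytes >> 20);
	return snprintf(buf, len, "%luKB", bytes >> 10);
}
```

Computing the mask costs one shift, one subtraction and one negation, which is the storage versus calculation trade-off mentioned above.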
No the later patch only supports GB and MB. If you want KB you have to do it yourself. If there was a reason I forgot it. Doesn't really matter much either way. -Andi --
global_hstate becomes a misleading name in this patch. default_hstate
Why is there no need for if (huge_page_shift(h) == 0) return 0; ?
Ah, you partially fix up my whinge from the previous patch here. page_alloc.c has a helper called K() for conversions. Perhaps move it to internal.h and add one for M instead of the - 20 here? Not a big deal as
It's not clear in this patch what parsed_hstate is for as it is not used elsewhere. I've made a note to check if parsed_hstate makes an unwritten
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Hi Andi
this function is called once for each boot parameter, right?
If so, this function causes a panic when a stupid user writes many
hugepagesz boot parameters.
Why don't you use the following check?
	if (max_hstate >= HUGE_MAX_HSTATE) {
		printk("hoge hoge");
		return;
	}
- kosaki
--
A later patch fixes that up by looking up the hstate explicitly. Also it is bisect safe because the callers are only added later. -Andi --
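The two ideas in this exchange, a bounds check on the table and an explicit lookup before registering, can be combined as in the sketch below. It is not the later patch referred to above; the table size, the names and the NULL return on overflow are all assumptions made for illustration.

```c
#include <stddef.h>

#define SKETCH_MAX_HSTATE 2   /* assumed capacity; stands in for HUGE_MAX_HSTATE */

struct sketch_hstate {
	unsigned int order;
};

static struct sketch_hstate sketch_hstates[SKETCH_MAX_HSTATE];
static int sketch_nr_hstates;

/* Iterate over the registered entries only. */
#define sketch_for_each_hstate(h) \
	for ((h) = sketch_hstates; (h) < &sketch_hstates[sketch_nr_hstates]; (h)++)

/* Return the entry with this order, or NULL when none is registered. */
static struct sketch_hstate *sketch_lookup_order(unsigned int order)
{
	struct sketch_hstate *h;

	sketch_for_each_hstate(h)
		if (h->order == order)
			return h;
	return NULL;
}

/*
 * Register a page size. Duplicates return the existing entry and a
 * full table is refused, so repeated boot parameters cannot overflow.
 */
static struct sketch_hstate *sketch_add_hstate(unsigned int order)
{
	struct sketch_hstate *h = sketch_lookup_order(order);

	if (h)
		return h;
	if (sketch_nr_hstates >= SKETCH_MAX_HSTATE)
		return NULL;
	h = &sketch_hstates[sketch_nr_hstates++];
	h->order = order;
	return h;
}
```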
I chose to just report the numbers in a row, in the hope
to minimize breakage of existing software. The "compat" page size
is always the first number.
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/hugetlb.c | 59 +++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 39 insertions(+), 20 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -683,37 +683,56 @@ int hugetlb_overcommit_handler(struct ct
#endif /* CONFIG_SYSCTL */
+static int dump_field(char *buf, unsigned field)
+{
+ int n = 0;
+ struct hstate *h;
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5lu", *(unsigned long *)((char *)h + field));
+ buf[n++] = '\n';
+ return n;
+}
+
int hugetlb_report_meminfo(char *buf)
{
- struct hstate *h = &global_hstate;
- return sprintf(buf,
- "HugePages_Total: %5lu\n"
- "HugePages_Free: %5lu\n"
- "HugePages_Rsvd: %5lu\n"
- "HugePages_Surp: %5lu\n"
- "Hugepagesize: %5lu kB\n",
- h->nr_huge_pages,
- h->free_huge_pages,
- h->resv_huge_pages,
- h->surplus_huge_pages,
- 1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
+ struct hstate *h;
+ int n = 0;
+ n += sprintf(buf + 0, "HugePages_Total:");
+ n += dump_field(buf + n, offsetof(struct hstate, nr_huge_pages));
+ n += sprintf(buf + n, "HugePages_Free: ");
+ n += dump_field(buf + n, offsetof(struct hstate, free_huge_pages));
+ n += sprintf(buf + n, "HugePages_Rsvd: ");
+ n += dump_field(buf + n, offsetof(struct hstate, resv_huge_pages));
+ n += sprintf(buf + n, "HugePages_Surp: ");
+ n += dump_field(buf + n, offsetof(struct hstate, surplus_huge_pages));
+ n += sprintf(buf + n, "Hugepagesize: ");
+ for_each_hstate (h)
+ n += sprintf(buf + n, " %5u", huge_page_size(h) / 1024);
+ n += sprintf(buf + n, " kB\n");
+ return n;
}
int hugetlb_report_node_meminfo(int nid, char *buf)
{
- struct hstate *h = &global_hstate;
- return ...
Glancing through the libhugetlbfs code, it appears to take the first value after Hugepagesize: as the "huge pagesize" so I suspect you're
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
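The "numbers in a row" format and its effect on a parser that reads only the first value can be sketched in userspace C. This is a simplified model under stated assumptions: the structure has two counters only, the pool array is passed in explicitly, the caller's buffer is assumed large enough, and sketch_first_value is a hypothetical stand-in for what an existing tool might do.

```c
#include <stddef.h>
#include <stdio.h>

struct sketch_hstate {
	unsigned long nr_huge_pages;
	unsigned long free_huge_pages;
};

/*
 * Append one counter per pool, in registration order, followed by a
 * newline. 'field' is the byte offset of the counter in the structure.
 * Returns the number of characters written. The buffer is assumed to
 * be large enough for the whole line.
 */
static int sketch_dump_field(char *buf, size_t len,
			     const struct sketch_hstate *pools, int nr,
			     size_t field)
{
	int n = 0;
	int i;

	for (i = 0; i < nr; i++) {
		const char *base = (const char *)&pools[i];
		unsigned long v = *(const unsigned long *)(base + field);

		n += snprintf(buf + n, len - n, " %5lu", v);
	}
	n += snprintf(buf + n, len - n, "\n");
	return n;
}

/* What an older parser would see: only the first number on the line. */
static unsigned long sketch_first_value(const char *line)
{
	unsigned long v = 0;

	sscanf(line, "%*[^:]: %lu", &v);
	return v;
}
```

Because the default pool is always printed first, a tool that stops after the first number keeps seeing the value it saw before the change.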
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/hugetlb.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -550,26 +550,33 @@ static unsigned int cpuset_mems_nr(unsig
#ifdef CONFIG_SYSCTL
#ifdef CONFIG_HIGHMEM
-static void try_to_free_low(unsigned long count)
+static void do_try_to_free_low(struct hstate *h, unsigned long count)
{
- struct hstate *h = &global_hstate;
int i;
for (i = 0; i < MAX_NUMNODES; ++i) {
struct page *page, *next;
struct list_head *freel = &h->hugepage_freelists[i];
list_for_each_entry_safe(page, next, freel, lru) {
- if (count >= nr_huge_pages)
+ if (count >= h->nr_huge_pages)
return;
if (PageHighMem(page))
continue;
list_del(&page->lru);
- update_and_free_page(page);
+ update_and_free_page(h, page);
h->free_huge_pages--;
h->free_huge_pages_node[page_to_nid(page)]--;
}
}
}
+
+static void try_to_free_low(unsigned long count)
+{
+ struct hstate *h;
+ for_each_hstate (h) {
+ do_try_to_free_low(h, count);
+ }
+}
#else
static inline void try_to_free_low(unsigned long count)
{
--
With this patch you will call try_to_free_low on all registered page sizes. As written, when a user reduces the number of one page size, all page sizes could be affected. I don't think that's what you want to do. Perhaps just call do_try_to_free_low() on the hstate in question. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center --
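The suggested behaviour, shrinking only the pool in question, can be modelled with plain counters. This is a deliberately simplified sketch: it ignores the lowmem/highmem distinction that try_to_free_low() exists for, and the structure and function names are hypothetical.

```c
struct sketch_pool {
	unsigned long nr_huge_pages;
	unsigned long free_huge_pages;
};

/*
 * Release free pages from one pool until it holds at most 'count'
 * pages. Pages in use cannot be released, so the pool may stay above
 * the target. Returns the number of pages released.
 */
static unsigned long sketch_shrink_pool(struct sketch_pool *p,
					unsigned long count)
{
	unsigned long released = 0;

	while (p->nr_huge_pages > count && p->free_huge_pages > 0) {
		p->nr_huge_pages--;
		p->free_huge_pages--;
		released++;
	}
	return released;
}
```

The caller passes the one pool being resized, so the other pools are never touched.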
Andi,
Seems to me that both patches 2/18 and 4/18 are called:
Add basic support for more than one hstate in hugetlbfs
You probably want to change this detail.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Fixed thanks. Indeed description went wrong on 4/18 2/ was the correct one. -Andi --
Missing leader and the subject is misleading as to what the patch is doing. Am assuming this is an accident. hmm, so this is freeing 'count' pages from all pools. I doubt that's what you really want to be doing here. If someone is using the proc entries to shrink a pool size, I imagine they want to shrink X pages of size Y from a single pool, not shrink X pages from all pools. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
- I didn't bother with hugetlb_shm_group and treat_as_movable,
these are still single global.
- Also improve error propagation for the sysctl handlers a bit
Signed-off-by: Andi Kleen <ak@suse.de>
---
include/linux/hugetlb.h | 5 +++--
kernel/sysctl.c | 2 +-
mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++------------
3 files changed, 35 insertions(+), 15 deletions(-)
Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -32,8 +32,6 @@ int hugetlb_fault(struct mm_struct *mm,
int hugetlb_reserve_pages(struct inode *inode, long from, long to);
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
-extern unsigned long max_huge_pages;
-extern unsigned long sysctl_overcommit_huge_pages;
extern unsigned long hugepages_treat_as_movable;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;
@@ -258,6 +256,9 @@ static inline unsigned huge_page_shift(s
return h->order + PAGE_SHIFT;
}
+extern unsigned long max_huge_pages[HUGE_MAX_HSTATE];
+extern unsigned long sysctl_overcommit_huge_pages[HUGE_MAX_HSTATE];
+
#else
struct hstate {};
#define hstate_file(f) NULL
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c
+++ linux/kernel/sysctl.c
@@ -935,7 +935,7 @@ static struct ctl_table vm_table[] = {
{
.procname = "nr_hugepages",
.data = &max_huge_pages,
- .maxlen = sizeof(unsigned long),
+ .maxlen = sizeof(max_huge_pages),
.mode = 0644,
.proc_handler = &hugetlb_sysctl_handler,
.extra1 = (void *)&hugetlb_zero,
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -22,8 +22,8 @@
#include "internal.h"
const unsigned long hugetlb_zero = 0, ...
I cannot imagine why either of those would be per-pool anyway. Potentially shm_group could become a per-mount value which is both outside the scope of this patchset and not per-pool so unsuitable for
Any particular reason for moving them? Also, offhand it's not super-clear why max_huge_pages is not part of
hmm ok, it looks a little weird to be working out h - hstates multiple times
This looks like we are assuming there is only ever one other parsed_hstate. For the purposes of what you aim to achieve in this set, it's not important but a comment over parsed_hstate about this
hmm, this is saying when I write 10 to nr_hugepages, I am asking for 10
I'm failing to see how the error handling is improved when set_max_huge_pages() is not updating err. Maybe it happens in another
Similar to the other sysctl here, the overcommit value is being set for
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
They need to be a separate array for the sysctl parsing function. -Andi --
D'oh, of course. Pointing that out answers my other questions in relation to how writing single values to a proc entry affects multiple pools as well. I was still thinking of max_huge_pages as a single value instead of an array. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
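The point resolved here, that the sysctl value is a vector with one slot per pool, can be sketched with a small userspace parser. This is an illustration of the idea only, not the kernel's sysctl code; the function name and the fixed pool count are assumptions.

```c
#include <stdlib.h>

#define SKETCH_MAX_HSTATE 2  /* assumed number of pools */

/*
 * Parse up to SKETCH_MAX_HSTATE whitespace-separated numbers into
 * 'vals', one slot per pool in registration order. Slots without a
 * matching number are left untouched. Returns how many were parsed.
 */
static int sketch_parse_vector(const char *s,
			       unsigned long vals[SKETCH_MAX_HSTATE])
{
	int n = 0;

	while (n < SKETCH_MAX_HSTATE) {
		char *end;
		unsigned long v = strtoul(s, &end, 10);

		if (end == s)   /* no more digits */
			break;
		vals[n++] = v;
		s = end;
	}
	return n;
}
```

In this model, writing a single number updates only the first (default) pool, and a second number on the same line addresses the second pool.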
- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- Set up pointers to a suitable hstate for the set page size option
to the super block and the inode and the vma.
- Change the hstate accessors to use this information
- Add code to the hstate init function to set parsed_hstate for command
line processing
- Handle duplicated hstate registrations to make the command line user proof
Signed-off-by: Andi Kleen <ak@suse.de>
---
fs/hugetlbfs/inode.c | 50 ++++++++++++++++++++++++++++++++++++++----------
include/linux/hugetlb.h | 12 ++++++++---
mm/hugetlb.c | 22 +++++++++++++++++----
3 files changed, 67 insertions(+), 17 deletions(-)
Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -134,6 +134,7 @@ struct hugetlbfs_config {
umode_t mode;
long nr_blocks;
long nr_inodes;
+ struct hstate *hstate;
};
struct hugetlbfs_sb_info {
@@ -142,12 +143,14 @@ struct hugetlbfs_sb_info {
long max_inodes; /* inodes allowed */
long free_inodes; /* inodes free */
spinlock_t stat_lock;
+ struct hstate *hstate;
};
struct hugetlbfs_inode_info {
struct shared_policy policy;
struct inode vfs_inode;
+ struct hstate *hstate;
};
static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
@@ -212,6 +215,7 @@ struct hstate {
};
void __init huge_add_hstate(unsigned order);
+struct hstate *huge_lookup_hstate(unsigned long pagesize);
#ifndef HUGE_MAX_HSTATE
#define HUGE_MAX_HSTATE 1
@@ -223,17 +227,19 @@ extern struct hstate hstates[HUGE_MAX_HS
static inline struct hstate *hstate_vma(struct vm_area_struct *vma)
{
- return &global_hstate;
+ return (struct hstate *)vma->vm_private_data;
}
static inline struct hstate *hstate_file(struct file *f)
{
- return &global_hstate;
+ struct dentry *d = f->f_dentry;
+ struct inode *i = ...
FWIW, I think this approach is definitely the way to go for supporting multiple huge page sizes.
-- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center --
I'm somewhat surprised it is necessary for the hstate to be on a per-inode basis when it's already in the hugetlbfs_sb_info. Would lookup_hstate_pagesize() maybe? The name as-is told me nothing about what HUGETLBFS_SB(HUGETLBFS_I(i)->i_sb)->hstate ? Pretty fugly I'll admit, but it's contained in a helper and keeps the -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
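The alternative raised in this review, resolving the hstate through the superblock instead of keeping a copy in every inode, can be sketched as below. All structures here are hypothetical stand-ins for the hugetlbfs ones, the pool table is hard-coded to 2MB and 1GB, and 4KB base pages are assumed.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12  /* assumed 4KB base pages */

struct sketch_hstate {
	unsigned int order;
};

/* Two registered pools: 2MB (order 9) and 1GB (order 18). */
static struct sketch_hstate sketch_hstates[] = { { 9 }, { 18 } };

/* Hypothetical stand-ins for the superblock info and inode. */
struct sketch_sb_info {
	struct sketch_hstate *hstate;   /* set once at mount time */
};

struct sketch_inode {
	struct sketch_sb_info *sb;
};

/* Find the pool whose page size matches the pagesize= option. */
static struct sketch_hstate *sketch_lookup_pagesize(unsigned long pagesize)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_hstates) / sizeof(sketch_hstates[0]); i++) {
		struct sketch_hstate *h = &sketch_hstates[i];

		if ((1UL << (h->order + SKETCH_PAGE_SHIFT)) == pagesize)
			return h;
	}
	return NULL;
}

/* Resolve an inode's pool through its superblock, with no per-inode copy. */
static struct sketch_hstate *sketch_hstate_inode(const struct sketch_inode *i)
{
	return i->sb->hstate;
}
```

The trade-off is one extra pointer dereference per lookup against one pointer saved per inode.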
Need this as a separate function for a future patch.
No behaviour change.
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/hugetlb.c | 37 ++++++++++++++++++++++---------------
1 file changed, 22 insertions(+), 15 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -219,6 +219,27 @@ static struct page *alloc_fresh_huge_pag
return page;
}
+/*
+ * Use a helper variable to find the next node and then
+ * copy it back to hugetlb_next_nid afterwards:
+ * otherwise there's a window in which a racer might
+ * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * But we don't need to use a spin_lock here: it really
+ * doesn't matter if occasionally a racer chooses the
+ * same nid as we do. Move nid forward in the mask even
+ * if we just successfully allocated a hugepage so that
+ * the next caller gets hugepages on the next node.
+ */
+static int huge_next_node(struct hstate *h)
+{
+ int next_nid;
+ next_nid = next_node(h->hugetlb_next_nid, node_online_map);
+ if (next_nid == MAX_NUMNODES)
+ next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = next_nid;
+ return next_nid;
+}
+
static int alloc_fresh_huge_page(struct hstate *h)
{
struct page *page;
@@ -232,21 +253,7 @@ static int alloc_fresh_huge_page(struct
page = alloc_fresh_huge_page_node(h, h->hugetlb_next_nid);
if (page)
ret = 1;
- /*
- * Use a helper variable to find the next node and then
- * copy it back to hugetlb_next_nid afterwards:
- * otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
- * But we don't need to use a spin_lock here: it really
- * doesn't matter if occasionally a racer chooses the
- * same nid as we do. Move nid forward in the mask even
- * if we just successfully allocated a hugepage so that
- * the next caller gets hugepages on the next node.
- */
- next_nid ...
Maybe if you moved this beside patch 1, they could both be tested in isolation as a fairly reasonable cleanup that does not alter
hmm, I'm not seeing where next_nid gets declared locally here as it should have been removed in an earlier patch. Maybe it's reintroduced later but if you do reshuffle the patchset so that the cleanups can be
Other than the possible gotcha with next_nid declared locally, the move seems fine.
Acked-by: Mel Gorman <mel@csn.ul.ie>
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
No there was no earlier patch touching this, so the old next_nid is still there. -Andi --
ah yes, my bad. I thought it went away in patch 1/18. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
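The round-robin logic that the patch moves into huge_next_node() can be modelled in userspace with a plain bitmask in place of node_online_map. The names, the 8-node limit and the bitmask representation are assumptions for illustration; the kernel uses its nodemask helpers instead.

```c
#define SKETCH_MAX_NODES 8   /* assumed; stands in for MAX_NUMNODES */

/* First set bit strictly after 'nid', or SKETCH_MAX_NODES if none. */
static int sketch_next_node(int nid, unsigned int online_mask)
{
	int n;

	for (n = nid + 1; n < SKETCH_MAX_NODES; n++)
		if (online_mask & (1u << n))
			return n;
	return SKETCH_MAX_NODES;
}

/*
 * Advance the cursor to the next online node, wrapping to the first
 * one. The result is computed in a local and stored in one step, so
 * the cursor never holds the out-of-range end marker. Assumes at
 * least one node is online.
 */
static int sketch_huge_next_node(int *cursor, unsigned int online_mask)
{
	int next = sketch_next_node(*cursor, online_mask);

	if (next == SKETCH_MAX_NODES)
		next = sketch_next_node(-1, online_mask);
	*cursor = next;
	return next;
}
```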
Straightforward variant of the existing __alloc_bootmem_node; the only
difference is that it doesn't panic on failure.
Signed-off-by: Andi Kleen <ak@suse.de>
---
include/linux/bootmem.h | 4 ++++
mm/bootmem.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -471,6 +471,18 @@ void * __init __alloc_bootmem_node(pg_da
return __alloc_bootmem(size, align, goal);
}
+void * __init __alloc_bootmem_node_nopanic(pg_data_t *pgdat, unsigned long size,
+ unsigned long align, unsigned long goal)
+{
+ void *ptr;
+
+ ptr = __alloc_bootmem_core(pgdat->bdata, size, align, goal, 0);
+ if (ptr)
+ return ptr;
+
+ return __alloc_bootmem_nopanic(size, align, goal);
+}
+
#ifndef ARCH_LOW_ADDRESS_LIMIT
#define ARCH_LOW_ADDRESS_LIMIT 0xffffffffUL
#endif
Index: linux/include/linux/bootmem.h
===================================================================
--- linux.orig/include/linux/bootmem.h
+++ linux/include/linux/bootmem.h
@@ -90,6 +90,10 @@ extern void *__alloc_bootmem_node(pg_dat
unsigned long size,
unsigned long align,
unsigned long goal);
+extern void *__alloc_bootmem_node_nopanic(pg_data_t *pgdat,
+ unsigned long size,
+ unsigned long align,
+ unsigned long goal);
extern unsigned long init_bootmem_node(pg_data_t *pgdat,
unsigned long freepfn,
unsigned long startpfn,
--
Straight-forward. Mildly irritating that there are multiple variants that only differ by whether they panic on allocation failure or not. Probably should be a separate removal of duplicated code there but outside the
Acked-by: Mel Gorman <mel@csn.ul.ie>
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
hugetlb will need to get compound pages from bootmem to handle
the case of them being larger than MAX_ORDER. Export
the constructor function needed for this.
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/internal.h | 2 ++
mm/page_alloc.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -13,6 +13,8 @@
#include <linux/mm.h>
+extern void prep_compound_page(struct page *page, unsigned long order);
+
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_count, v);
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -272,7 +272,7 @@ static void free_compound_page(struct pa
__free_pages_ok(page, compound_order(page));
}
-static void prep_compound_page(struct page *page, unsigned long order)
+void prep_compound_page(struct page *page, unsigned long order)
{
int i;
int nr_pages = 1 << order;
--
Needed to avoid code duplication in follow up patches.
This happens to fix a minor bug. When alloc_bootmem_node returns
a fallback node on a different node than passed the old code
would have put it into the free lists of the wrong node.
Now it would end up in the freelist of the correct node.
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/hugetlb.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -200,6 +200,17 @@ static int adjust_pool_surplus(struct hs
return ret;
}
+static void huge_new_page(struct hstate *h, struct page *page)
+{
+ unsigned nid = pfn_to_nid(page_to_pfn(page));
+ set_compound_page_dtor(page, free_huge_page);
+ spin_lock(&hugetlb_lock);
+ h->nr_huge_pages++;
+ h->nr_huge_pages_node[nid]++;
+ spin_unlock(&hugetlb_lock);
+ put_page(page); /* free it into the hugepage allocator */
+}
+
static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
{
struct page *page;
@@ -207,14 +218,8 @@ static struct page *alloc_fresh_huge_pag
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
huge_page_order(h));
- if (page) {
- set_compound_page_dtor(page, free_huge_page);
- spin_lock(&hugetlb_lock);
- h->nr_huge_pages++;
- h->nr_huge_pages_node[nid]++;
- spin_unlock(&hugetlb_lock);
- put_page(page); /* free it into the hugepage allocator */
- }
+ if (page)
+ huge_new_page(h, page);
return page;
}
--
We do not usually preface functions in mm/hugetlb.c with "huge" and the name you have chosen doesn't seem that clear to me anyway. Could we rename it to prep_new_huge_page() or something similar? -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center --
It fixes a real bug for sure. It may be possible with that bug to leak pages onto a linked list with bogus counters. Possibly another candidate patch to move to the start of the series so prep_new_huge_page() as it has a similar responsibility to prep_new_page() ? Just at a glance, huge_new_page() implies to me that -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Without this fix bootmem can return unaligned addresses when the start of a
node is not aligned to the align value. Needed for reliably allocating
gigabyte pages.
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/bootmem.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -197,6 +197,7 @@ __alloc_bootmem_core(struct bootmem_data
{
unsigned long offset, remaining_size, areasize, preferred;
unsigned long i, start = 0, incr, eidx, end_pfn;
+ unsigned long pfn;
void *ret;
if (!size) {
@@ -239,12 +240,13 @@ __alloc_bootmem_core(struct bootmem_data
preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
areasize = (size + PAGE_SIZE-1) / PAGE_SIZE;
incr = align >> PAGE_SHIFT ? : 1;
+ pfn = PFN_DOWN(bdata->node_boot_start);
restart_scan:
for (i = preferred; i < eidx; i += incr) {
unsigned long j;
i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
- i = ALIGN(i, incr);
+ i = ALIGN(pfn + i, incr) - pfn;
if (i >= eidx)
break;
if (test_bit(i, bdata->node_bootmem_map))
--
Seems like something that should be fixed anyway independently of your patchset. If moved to the start of the set, it can be treated in batch with
hmm, preferred has already been aligned above and it appears that "offset" was meant to handle the situation you are dealing with here. Is the caller passing in "goal" (to avoid DMA32 for example) and messing up how "offset"
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
> node_boot_start is not page aligned? It is, but it is not necessarily GB aligned and without this change sometimes alloc_bootmem when requesting GB alignment doesn't return GB aligned memory. This was a nasty problem that took some time to track down. -Andi --
Or does preferred have some problem? preferred = PFN_DOWN(ALIGN(preferred, align)) + offset; YH --
When node_boot_start is 512M aligned and align is 1024M, offset could be 512M. It seems i = ALIGN(i, incr) needs to do something with offset... YH --
It's possible that there are better fixes for this, but at least my simple patch seems to work here. I admit I was banging my head against this for some time and when I did the fix I just wanted the bug to go away and didn't really go for subtleness. The bootmem allocator is quite spaghetti in fact, it could really use some general clean up (although it's not quite as bad yet as page_alloc.c) -Andi --
i = ALIGN(i+offset, incr) - offset; also the one in fail_block... This only happens when align is larger than the alignment of node_boot_start. YH --
> only happen when align is large than alignment of node_boot_start.
Here's an updated version of the patch with this addressed.
Please review. The patch is somewhat more complicated, but
actually makes the code a little cleaner now.
-Andi
Fix alignment bug in bootmem allocator
Without this fix bootmem can return unaligned addresses when the start of a
node is not aligned to the align value. Needed for reliably allocating
gigabyte pages.
I removed the offset variable because all tests should align themselves correctly
now. Slight drawback might be that the bootmem allocator will spend
some more time skipping bits in the bitmap initially, but that shouldn't
be a big issue.
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/bootmem.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
Index: linux/mm/bootmem.c
===================================================================
--- linux.orig/mm/bootmem.c
+++ linux/mm/bootmem.c
@@ -195,8 +195,9 @@ void * __init
__alloc_bootmem_core(struct bootmem_data *bdata, unsigned long size,
unsigned long align, unsigned long goal, unsigned long limit)
{
- unsigned long offset, remaining_size, areasize, preferred;
- unsigned long i, start = 0, incr, eidx, end_pfn;
+ unsigned long remaining_size, areasize, preferred;
+ unsigned long i, start, incr, eidx, end_pfn;
+ unsigned long pfn;
void *ret;
if (!size) {
@@ -218,10 +219,6 @@ __alloc_bootmem_core(struct bootmem_data
end_pfn = limit;
eidx = end_pfn - PFN_DOWN(bdata->node_boot_start);
- offset = 0;
- if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
- offset = align - (bdata->node_boot_start & (align - 1UL));
- offset = PFN_DOWN(offset);
/*
* We try to allocate bootmem pages above 'goal'
@@ -236,15 +233,18 @@ __alloc_bootmem_core(struct bootmem_data
} else
preferred = 0;
- preferred = PFN_DOWN(ALIGN(preferred, align)) + offset;
+ start = bdata->node_boot_start;
+ preferred = ...
How about creating a local node_boot_start and node_bootmem_map that make sure node_boot_start has a bigger alignment than the align input? YH --
Please don't use v2... it doesn't work. YH --
Please check the one against -mm and x86.git ---
No, offset is not enough because it is still relative to the zone start. I'm preparing an updated patch. -Andi --
This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
not practical to enlarge MAX_ORDER to 1GB.
Instead the 1GB pages are only allocated at boot using the bootmem
allocator using the hugepages=... option.
These 1G bootmem pages are never freed. In theory it would be possible
to implement that with some complications, but since it would be a one-way
street (> MAX_ORDER pages cannot be allocated later) I decided not to currently.
The > MAX_ORDER code is not ifdef'ed per architecture. It is not very big
and the ifdef ugliness did not seem to be worth it.
Known problems: /proc/meminfo and "free" do not display the memory
allocated for gb pages in "Total". This is a little confusing for the
user.
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/hugetlb.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 62 insertions(+), 2 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/bootmem.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -153,7 +154,7 @@ static void free_huge_page(struct page *
INIT_LIST_HEAD(&page->lru);
spin_lock(&hugetlb_lock);
- if (h->surplus_huge_pages_node[nid]) {
+ if (h->surplus_huge_pages_node[nid] && h->order <= MAX_ORDER) {
update_and_free_page(h, page);
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
@@ -215,6 +216,9 @@ static struct page *alloc_fresh_huge_pag
{
struct page *page;
+ if (h->order > MAX_ORDER)
+ return NULL;
+
page = alloc_pages_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|__GFP_NOWARN,
huge_page_order(h));
@@ -271,6 +275,9 @@ static struct page *alloc_buddy_huge_pag
struct page *page;
unsigned int nid;
+ if (h->order > MAX_ORDER)
+ return NULL;
+
/*
* Assume we will ...
Should this print out a KERN_INFO message to the effect that pages of
Ah, scratch the comment on an earlier patch where I said I cannot see
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
This looks like an off-by-one error here and in the code below -- it
should be ">= MAX_ORDER" not "> MAX_ORDER". Cf alloc_pages() in gfp.h:
if (unlikely(order >= MAX_ORDER))
-Andrew Hastings
Cray Inc.
--
True, good point. Although it will only matter if some architecture has MAX_ORDER-sized huge pages :) x86-64 definitely doesn't. I passed this code over to Nick so he'll hopefully incorporate the fix. -Andi --
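The boundary case discussed above can be sketched in a few lines of plain C. This is an illustration only: the MAX_ORDER value of 11 is an assumed default (it is configurable per architecture), and both helper names are made up for this example. The buddy allocator serves orders 0 to MAX_ORDER - 1, so an order equal to MAX_ORDER must already take the bootmem path.

```c
#include <stdbool.h>

/* Assumed value for illustration; the real one depends on the kernel config. */
#define MAX_ORDER 11

/* Mirrors the alloc_pages() check quoted above: orders >= MAX_ORDER fail. */
static bool buddy_can_allocate(unsigned int order)
{
	return order < MAX_ORDER;
}

/* Corrected form of the patch's test: whatever the buddy allocator
 * cannot serve has to come from bootmem. */
static bool needs_bootmem(unsigned int order)
{
	return order >= MAX_ORDER;
}
```

With 4K base pages a 2MB huge page is order 9 and a 1GB huge page is order 18, so on x86-64 both forms of the comparison give the same answer, as noted in the reply.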
Signed-off-by: Andi Kleen <ak@suse.de>
---
include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 23 ++++++++++++++++++-----
2 files changed, 19 insertions(+), 5 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -552,19 +552,23 @@ static int __init hugetlb_init_hstate(st
{
unsigned long i;
- for (i = 0; i < MAX_NUMNODES; ++i)
- INIT_LIST_HEAD(&h->hugepage_freelists[i]);
+ /* Don't reinitialize lists if they have been already init'ed */
+ if (!h->hugepage_freelists[0].next) {
+ for (i = 0; i < MAX_NUMNODES; ++i)
+ INIT_LIST_HEAD(&h->hugepage_freelists[i]);
- h->hugetlb_next_nid = first_node(node_online_map);
+ h->hugetlb_next_nid = first_node(node_online_map);
+ }
- for (i = 0; i < max_huge_pages[h - hstates]; ++i) {
+ while (h->parsed_hugepages < max_huge_pages[h - hstates]) {
if (h->order > MAX_ORDER) {
if (!alloc_bm_huge_page(h))
break;
} else if (!alloc_fresh_huge_page(h))
break;
+ h->parsed_hugepages++;
}
- max_huge_pages[h - hstates] = h->free_huge_pages = h->nr_huge_pages = i;
+ max_huge_pages[h - hstates] = h->parsed_hugepages;
printk(KERN_INFO "Total HugeTLB memory allocated, %ld %dMB pages\n",
h->free_huge_pages,
@@ -602,6 +606,15 @@ static int __init hugetlb_setup(char *s)
unsigned long *mhp = &max_huge_pages[parsed_hstate - hstates];
if (sscanf(s, "%lu", mhp) <= 0)
*mhp = 0;
+ /*
+ * Global state is always initialized later in hugetlb_init.
+ * But we need to allocate > MAX_ORDER hstates here early to still
+ * use the bootmem allocator.
+ * If you add additional hstates <= MAX_ORDER you'll need
+ * to fix that.
+ */
+ if (parsed_hstate != &global_hstate)
+ hugetlb_init_hstate(parsed_hstate);
return 1;
}
__setup("hugepages=", hugetlb_setup);
Index: linux/include/linux/hugetlb.h
===================================================================
--- ...
hmm, it's not very clear to me how hugetlb_init_hstate() would get called twice for the same hstate. Should it be VM_BUG_ON() if a hstate
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
It is called from a __setup function and the user can specify the option multiple times. Also, when the user has already specified HPAGE_SIZE and it got set up, it should not be called again. -Andi --
Ok, that is a fair explanation. Thanks. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
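For readers following along, the "safe to call twice" behaviour can be sketched in userspace roughly like this. All names here (hstate_sketch, init_hstate_sketch, MAX_NODES) are hypothetical stand-ins, and the allocation step is faked by a counter. The guard relies on the structure starting out zeroed, as static kernel data does.

```c
#include <stddef.h>

#define MAX_NODES 4	/* hypothetical, stands in for MAX_NUMNODES */

struct list_head { struct list_head *next, *prev; };

struct hstate_sketch {
	struct list_head freelists[MAX_NODES];
	unsigned long parsed_hugepages;
};

static void init_list_head(struct list_head *l)
{
	l->next = l;
	l->prev = l;
}

static void list_add_sketch(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Safe to call repeatedly: the lists are set up only on the first call,
 * and "allocation" resumes from the count already reached. */
static void init_hstate_sketch(struct hstate_sketch *h, unsigned long wanted)
{
	int i;

	/* A zeroed list head has next == NULL, so this runs only once. */
	if (!h->freelists[0].next) {
		for (i = 0; i < MAX_NODES; i++)
			init_list_head(&h->freelists[i]);
	}
	while (h->parsed_hugepages < wanted)
		h->parsed_hugepages++;	/* stands in for one successful allocation */
}
```

A second call with a larger count only tops up the difference and leaves pages already on the free lists alone.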
- Reword sentence to clarify meaning with multiple options
- Add support for using GB prefixes for the page size
- Add extra printk to delayed > MAX_ORDER allocation code
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/hugetlb.c | 33 ++++++++++++++++++++++++++++++---
1 file changed, 30 insertions(+), 3 deletions(-)
Index: linux/mm/hugetlb.c
===================================================================
--- linux.orig/mm/hugetlb.c
+++ linux/mm/hugetlb.c
@@ -510,6 +510,15 @@ static struct page *alloc_huge_page(stru
return page;
}
+static __init char *memfmt(char *buf, unsigned long n)
+{
+ if (n >= (1UL << 30))
+ sprintf(buf, "%lu GB", n >> 30);
+ else
+ sprintf(buf, "%lu MB", n >> 20);
+ return buf;
+}
+
static __initdata LIST_HEAD(huge_boot_pages);
struct huge_bm_page {
@@ -536,14 +545,28 @@ static int __init alloc_bm_huge_page(str
/* Put bootmem huge pages into the standard lists after mem_map is up */
static int __init huge_init_bm(void)
{
+ unsigned long pages = 0;
struct huge_bm_page *m;
+ struct hstate *h = NULL;
+ char buf[32];
+
list_for_each_entry (m, &huge_boot_pages, list) {
struct page *page = virt_to_page(m);
- struct hstate *h = m->hstate;
+ h = m->hstate;
__ClearPageReserved(page);
prep_compound_page(page, h->order);
huge_new_page(h, page);
+ pages++;
}
+
+ /*
+ * This only prints for a single hstate. This works for x86-64,
+ * but if you do multiple > MAX_ORDER hstates you'll need to fix it.
+ */
+ if (pages > 0)
+ printk(KERN_INFO "HugeTLB pre-allocated %ld %s pages\n",
+ h->free_huge_pages,
+ memfmt(buf, huge_page_size(h)));
return 0;
}
__initcall(huge_init_bm);
@@ -551,6 +574,8 @@ __initcall(huge_init_bm);
static int __init hugetlb_init_hstate(struct hstate *h)
{
unsigned long i;
+ char buf[32];
+ unsigned long pages = 0;
/* Don't reinitialize lists if they have been already init'ed */
if (!h->hugepage_freelists[0].next) {
@@ -567,12 +592,14 @@ static ...
Scratch earlier comments about this printk. If the printk fix was broken out, it could be moved to the start of the set so it can be tested/merged separately. The remainder of this patch could then be
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
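The memfmt() helper from the patch is easy to try outside the kernel. The version below is a userspace variant, not the patch's code: it takes an explicit buffer length and uses snprintf() instead of sprintf(), and it widens the size to unsigned long long so it also builds on 32-bit hosts. Those changes are assumptions of this sketch.

```c
#include <stdio.h>

/* Format a byte count as whole GB when it is at least 1GB, else as whole MB.
 * Returns buf so the call can be used directly as a printf argument. */
static char *memfmt(char *buf, size_t len, unsigned long long n)
{
	if (n >= (1ULL << 30))
		snprintf(buf, len, "%llu GB", n >> 30);
	else
		snprintf(buf, len, "%llu MB", n >> 20);
	return buf;
}
```

Note that sizes are truncated, not rounded, so anything below 1MB prints as "0 MB".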
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/x86/mm/hugetlbpage.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
Index: linux/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/x86/mm/hugetlbpage.c
+++ linux/arch/x86/mm/hugetlbpage.c
@@ -133,9 +133,14 @@ pte_t *huge_pte_alloc(struct mm_struct *
pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
if (pud) {
- if (pud_none(*pud))
- huge_pmd_share(mm, addr, pud);
- pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ if (sz == PUD_SIZE) {
+ pte = (pte_t *)pud;
+ } else {
+ BUG_ON(sz != PMD_SIZE);
+ if (pud_none(*pud))
+ huge_pmd_share(mm, addr, pud);
+ pte = (pte_t *) pmd_alloc(mm, pud, addr);
+ }
}
BUG_ON(pte && !pte_none(*pte) && !pte_huge(*pte));
@@ -151,8 +156,11 @@ pte_t *huge_pte_offset(struct mm_struct
pgd = pgd_offset(mm, addr);
if (pgd_present(*pgd)) {
pud = pud_offset(pgd, addr);
- if (pud_present(*pud))
+ if (pud_present(*pud)) {
+ if (pud_large(*pud))
+ return (pte_t *)pud;
pmd = pmd_offset(pud, addr);
+ }
}
return (pte_t *) pmd;
}
--
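As a rough illustration of what the huge_pte_alloc() change above decides, the sketch below maps a huge page size to the page-table level that holds its entry and computes the index within that table. The constants are the x86-64 values for 4K base pages; the enum and function names are hypothetical and exist only in this example.

```c
#define PMD_SHIFT 21			/* one PMD entry maps 2MB */
#define PUD_SHIFT 30			/* one PUD entry maps 1GB */
#define PMD_SIZE (1ULL << PMD_SHIFT)
#define PUD_SIZE (1ULL << PUD_SHIFT)
#define PTRS_PER_TABLE 512ULL		/* entries per table at each level */

enum huge_level { LEVEL_NONE, LEVEL_PMD, LEVEL_PUD };

/* Which level holds the "pte" for a huge page of this size? */
static enum huge_level huge_level_for_size(unsigned long long sz)
{
	if (sz == PUD_SIZE)
		return LEVEL_PUD;
	if (sz == PMD_SIZE)
		return LEVEL_PMD;
	return LEVEL_NONE;	/* the patch uses BUG_ON() for this case */
}

/* Index of the entry covering addr within its table at the given level. */
static unsigned long huge_entry_index(unsigned long long addr,
				      enum huge_level lvl)
{
	unsigned int shift = (lvl == LEVEL_PUD) ? PUD_SHIFT : PMD_SHIFT;

	return (unsigned long)((addr >> shift) & (PTRS_PER_TABLE - 1));
}
```

For a 1GB page the walk stops one level earlier, which is why the patch returns the pud pointer cast to pte_t * instead of calling pmd_alloc().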
Straightforward extensions for huge pages located in the PUD
instead of the PMD.
Signed-off-by: Andi Kleen <ak@suse.de>
---
arch/ia64/mm/hugetlbpage.c | 6 ++++++
arch/powerpc/mm/hugetlbpage.c | 5 +++++
arch/sh/mm/hugetlbpage.c | 5 +++++
arch/sparc64/mm/hugetlbpage.c | 5 +++++
arch/x86/mm/hugetlbpage.c | 25 ++++++++++++++++++++++++-
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 9 +++++++++
7 files changed, 59 insertions(+), 1 deletion(-)
Index: linux/include/linux/hugetlb.h
===================================================================
--- linux.orig/include/linux/hugetlb.h
+++ linux/include/linux/hugetlb.h
@@ -45,7 +45,10 @@ struct page *follow_huge_addr(struct mm_
int write);
struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
pmd_t *pmd, int write);
+struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
+ pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
+int pud_huge(pud_t pud);
void hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
@@ -112,8 +115,10 @@ static inline unsigned long hugetlb_tota
#define hugetlb_report_meminfo(buf) 0
#define hugetlb_report_node_meminfo(n, buf) 0
#define follow_huge_pmd(mm, addr, pmd, write) NULL
+#define follow_huge_pud(mm, addr, pud, write) NULL
#define prepare_hugepage_range(addr,len) (-EINVAL)
#define pmd_huge(x) 0
+#define pud_huge(x) 0
#define is_hugepage_only_range(mm, addr, len) 0
#define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; })
#define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; })
Index: linux/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/ia64/mm/hugetlbpage.c
+++ linux/arch/ia64/mm/hugetlbpage.c
@@ -106,6 +106,12 @@ int pmd_huge(pmd_t pmd)
{
return 0;
}
+
+int pud_huge(pud_t pud)
+{
+ return ...
mm/memory.c seems to have already gained some knowledge about huge pages,
in particular in get_user_pages. Fix that code up to support huge
puds.
Signed-off-by: Andi Kleen <ak@suse.de>
---
mm/memory.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -931,7 +931,13 @@ struct page *follow_page(struct vm_area_
pud = pud_offset(pgd, address);
if (pud_none(*pud) || unlikely(pud_bad(*pud)))
goto no_page_table;
-
+
+ if (pud_huge(*pud)) {
+ BUG_ON(flags & FOLL_GET);
+ page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
+ goto out;
+ }
+
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto no_page_table;
@@ -1422,6 +1428,8 @@ static int apply_to_pmd_range(struct mm_
unsigned long next;
int err;
+ BUG_ON(pud_huge(*pud));
+
pmd = pmd_alloc(mm, pud, addr);
if (!pmd)
return -ENOMEM;
--
Add a hugepagesz=... option similar to IA64, PPC etc. to x86-64.
This finally allows selecting GB pages for hugetlbfs on x86 now
that all the infrastructure is in place.
Signed-off-by: Andi Kleen <ak@suse.de>
---
Documentation/kernel-parameters.txt | 11 +++++++++--
arch/x86/mm/hugetlbpage.c | 17 +++++++++++++++++
include/asm-x86/page.h | 2 ++
3 files changed, 28 insertions(+), 2 deletions(-)
Index: linux/arch/x86/mm/hugetlbpage.c
===================================================================
--- linux.orig/arch/x86/mm/hugetlbpage.c
+++ linux/arch/x86/mm/hugetlbpage.c
@@ -421,3 +421,20 @@ hugetlb_get_unmapped_area(struct file *f
#endif /*HAVE_ARCH_HUGETLB_UNMAPPED_AREA*/
+#ifdef CONFIG_X86_64
+static __init int setup_hugepagesz(char *opt)
+{
+ unsigned long ps = memparse(opt, &opt);
+ if (ps == PMD_SIZE) {
+ huge_add_hstate(PMD_SHIFT - PAGE_SHIFT);
+ } else if (ps == PUD_SIZE && cpu_has_gbpages) {
+ huge_add_hstate(PUD_SHIFT - PAGE_SHIFT);
+ } else {
+ printk(KERN_ERR "hugepagesz: Unsupported page size %lu M\n",
+ ps >> 20);
+ return 0;
+ }
+ return 1;
+}
+__setup("hugepagesz=", setup_hugepagesz);
+#endif
Index: linux/include/asm-x86/page.h
===================================================================
--- linux.orig/include/asm-x86/page.h
+++ linux/include/asm-x86/page.h
@@ -21,6 +21,8 @@
#define HPAGE_MASK (~(HPAGE_SIZE - 1))
#define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT)
+#define HUGE_MAX_HSTATE 2
+
/* to align the pointer to the (next) page boundary */
#define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)
Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt
+++ linux/Documentation/kernel-parameters.txt
@@ -726,8 +726,15 @@ and is between 256 and 4096 characters.
hisax= [HW,ISDN]
See Documentation/isdn/README.HiSax.
- hugepages= [HW,X86-32,IA-64] ...
Andi wrote:
+ hugepages= [HW,X86-32,IA-64] HugeTLB pages to allocate at boot.
+ hugepagesz= [HW,IA-64,PPC,X86-64] The size of the HugeTLB pages.
+ On x86 this option can be specified multiple times
+ interleaved with hugepages= to reserve huge pages
+ of different sizes. Valid page sizes on x86-64
+ are 2M (when the CPU supports "pse") and 1G (when the
+ CPU supports the "pdpe1gb" cpuinfo flag)
+ Note that 1GB pages can only be allocated at boot time
+ using hugepages= and not freed afterwards.
This seems to say that hugepages are required for hugepagesz to be
useful, but hugepagesz is supported on PPC, whereas hugepages is not
supported on PPC ...odd.
Should those two HW lists be the same (and sorted in the same order,
for ease of reading)?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Yes, but that was already there before. I didn't change it. I agree it should be fixed, but I would prefer not to mix PPC specific patches into my patchkit so I hope someone
Not all architectures support hugepagesz=, in particular i386 does not and possibly others. It is implemented by arch specific code.
-Andi --
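To show what the x86-64 hugepagesz= handler has to do with a value such as "2M" or "1G", here is a small userspace sketch. parse_size() is a hypothetical stand-in for the kernel's memparse() and only understands K, M and G suffixes; size_to_order() is likewise made up for this example and assumes 4K base pages.

```c
#include <stdlib.h>

#define PAGE_SHIFT 12

/* Hypothetical stand-in for memparse(): number plus optional K/M/G suffix. */
static unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long v = strtoull(s, &end, 0);
	unsigned int shift = 0;

	switch (*end) {
	case 'G': case 'g': shift = 30; break;
	case 'M': case 'm': shift = 20; break;
	case 'K': case 'k': shift = 10; break;
	default: break;
	}
	return v << shift;
}

/* Page order for a power-of-two size, or -1 if it cannot be a page size. */
static int size_to_order(unsigned long long size)
{
	int order = 0;

	if (size < (1ULL << PAGE_SHIFT) || (size & (size - 1)))
		return -1;
	while ((1ULL << (order + PAGE_SHIFT)) < size)
		order++;
	return order;
}
```

The resulting orders, 9 for 2M and 18 for 1G, correspond to PMD_SHIFT - PAGE_SHIFT and PUD_SHIFT - PAGE_SHIFT in the patch.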
Ok - good plan.
Do you know offhand what would be the correct HW list for hugepages and
hugepagesz?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Well, from what I can see, Ken Chen wrote the code that deals with
constraints on hugetlb allocation. So I'll copy him on this reply,
along with the other two subject matter experts I know of in this area,
Christoph Lameter and Adam Litke.
The following is the only cpuset related change I saw in this
patchset. It looks pretty obvious to me ... just changing the code to
adapt to Andi's new 'struct hstate' for holding what had been global
hugetlb state.
@@ -1228,18 +1252,18 @@ static int hugetlb_acct_memory(long delt
* semantics that cpuset has.
*/
if (delta > 0) {
- if (gather_surplus_pages(delta) < 0)
+ if (gather_surplus_pages(h, delta) < 0)
goto out;
- if (delta > cpuset_mems_nr(free_huge_pages_node)) {
- return_unused_surplus_pages(delta);
+ if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
+ return_unused_surplus_pages(h, delta);
goto out;
}
}
Andi claimed, in one of his replies earlier on this thread, that there
were further interactions with cpusets and later patches in the set
that "Add basic support for more than one hstate in hugetlbfs
and partly Add support to have individual hstates for each hugetlbfs
mount", but I'm not understanding what that interaction is yet.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
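As a reading aid for the hunk quoted above, the cpuset check amounts to summing a per-node counter over the nodes the current task may use and comparing it with the requested reservation. The sketch below models that in userspace with a plain bitmask; the names and the MAX_NODES value are hypothetical, and the real code walks the task's allowed node mask instead.

```c
#define MAX_NODES 8	/* hypothetical, stands in for MAX_NUMNODES */

/* Sum a per-node counter over the nodes whose bit is set in allowed_mask. */
static unsigned int mems_nr_sketch(const unsigned int *per_node,
				   unsigned int allowed_mask)
{
	unsigned int node, nr = 0;

	for (node = 0; node < MAX_NODES; node++)
		if (allowed_mask & (1u << node))
			nr += per_node[node];
	return nr;
}

/* Returns 0 if delta pages fit in the allowed nodes' free pages, else -1. */
static int reserve_check_sketch(const unsigned int *free_per_node,
				unsigned int allowed_mask, unsigned int delta)
{
	return delta > mems_nr_sketch(free_per_node, allowed_mask) ? -1 : 0;
}
```

With per-hstate counters the only change is which array gets passed in, which matches the mechanical h-> conversion in the quoted hunk.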
