The following two patches add support for printing the page size used for hugepage-backed regions. A user can use this to verify that a hugepage-aware application is using the expected page sizes. The first patch should not be too contentious as it is highly unlikely to break any parsers. There is a possibility that the second patch will break parsers, but arguably only ones that are already broken. More details are in the patches themselves.

 fs/proc/task_mmu.c | 29 +++++++++++++++++++++-------
 include/linux/hugetlb.h | 13 +++++++++++++
 2 files changed, 34 insertions(+), 8 deletions(-)
--
It is useful to verify that a hugepage-aware application is using the expected
pagesizes in each of its memory regions. This patch reports the pagesize
backing the VMA in /proc/pid/smaps. This should not break any sensible
parser as the file format is multi-line and it should skip information it
does not recognise.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
fs/proc/task_mmu.c | 6 ++++--
include/linux/hugetlb.h | 13 +++++++++++++
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 73d1891..81a3f91 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -394,7 +394,8 @@ static int show_smap(struct seq_file *m, void *v)
"Private_Clean: %8lu kB\n"
"Private_Dirty: %8lu kB\n"
"Referenced: %8lu kB\n"
- "Swap: %8lu kB\n",
+ "Swap: %8lu kB\n"
+ "PageSize: %8lu kB\n",
(vma->vm_end - vma->vm_start) >> 10,
mss.resident >> 10,
(unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
@@ -403,7 +404,8 @@ static int show_smap(struct seq_file *m, void *v)
mss.private_clean >> 10,
mss.private_dirty >> 10,
mss.referenced >> 10,
- mss.swap >> 10);
+ mss.swap >> 10,
+ vma_page_size(vma) >> 10);
return ret;
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 32e0ef0..0c83445 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -231,6 +231,19 @@ static inline unsigned long huge_page_size(struct hstate *h)
 	return (unsigned long)PAGE_SIZE << h->order;
 }
+static inline unsigned long vma_page_size(struct vm_area_struct *vma)
+{
+	struct hstate *hstate;
+
+	if (!is_vm_hugetlb_page(vma))
+		return PAGE_SIZE;
+
+	hstate = hstate_vma(vma);
+	VM_BUG_ON(!hstate);
+
+	return 1UL << (hstate->order + PAGE_SHIFT);
+}
+
 static inline unsigned long huge_page_mask(struct hstate *h)
 {
 	return h->mask;
--
1.5.6.5
--
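As a userspace illustration of the arithmetic in vma_page_size() and of the ">> 10" conversion in show_smap(), here is a minimal sketch. It is not part of the patch: EXAMPLE_PAGE_SHIFT and the example_* helpers are hypothetical stand-ins for the kernel's PAGE_SHIFT and hstate->order, and a 4K base page is assumed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's PAGE_SHIFT (4K base pages assumed). */
#define EXAMPLE_PAGE_SHIFT 12UL

/* Same arithmetic as the patch: bytes in one page of the given order. */
static unsigned long example_page_size(unsigned int order)
{
	/* Guard against shifting past the width of unsigned long. */
	assert(order + EXAMPLE_PAGE_SHIFT < 8 * sizeof(unsigned long));
	return 1UL << (order + EXAMPLE_PAGE_SHIFT);
}

/* smaps reports values in kB, hence the ">> 10" in show_smap(). */
static unsigned long example_page_size_kb(unsigned int order)
{
	return example_page_size(order) >> 10;
}
```

Under these assumptions, order 0 corresponds to a 4 kB base page and order 9 to a 2048 kB huge page. The shift form used in vma_page_size() gives the same result as the "PAGE_SIZE << order" form used in huge_page_size().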
CONFIG_HUGETLB_PAGE=n? What did you hope to gain by inlining this? --
Only that it sits alongside similar helper functions in the header, but that is the wrong thing to do in this case, which is obvious now that it has been pointed out. It's too large and is called from multiple places. I'll revise the patch.

-- Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
Time to play devil's advocate. :)

To be fair, this doesn't return the MMU pagesize backing the VMA. It returns the pagesize that hugetlb reports *or* the kernel's base PAGE_SIZE.

The ppc64 case where we have a 64k PAGE_SIZE but no hardware 64k support means that we'll have a 4k MMU pagesize that we're pretending is a 64k MMU page. That might confuse someone seeing 16x the number of TLB misses they expect.

This also doesn't work if, in the future, we get multiple page sizes mapped under one VMA. But I guess that all only matters if you worry about how the kernel is treating the pages vs. the MMU hardware.

-- Dave
--
True. In the vast majority of cases, this is the MMU size with ppc64 on

The corollary is that someone running with a 64K base page kernel may be surprised that the pagesize is always 4K. However I'll check if there is

Will deal with that problem if and when we encounter it. It may be a case that VMAs split or that we could report how many pages of each MMU size are in that VMA.

Thanks

-- Mel Gorman
--
Sure. If it isn't easy, the best thing to do is probably just to document the "interesting" behavior. -- Dave --
Dave, please let me know whether getpagesize() returns 4k or 64k on ppc64. I think the PageSize line in /proc/pid/smaps and the getpagesize() result should match; otherwise end users may be confused.
--
To distinguish between the two, I now report the kernel pagesize and the MMU pagesize like so:

KernelPageSize: 64 kB
MMUPageSize: 4 kB

This is running a kernel with a 64K base pagesize on a PPC970MP, which does not support 64K hardware pagesizes. Does this make sense?

-- Mel Gorman
--
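For illustration, a tool reading smaps could pick out the two fields as below. This is a minimal sketch, not code from the thread: smaps_field_kb() is a hypothetical helper, and it assumes the "Name: value kB" line layout shown above. Lines that do not match the requested key are skipped, which is the behaviour the patch description expects from a sensible parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * If `line` starts with "<key>:", store the kB value that follows in *kb
 * and return 1. Otherwise return 0 so the caller can skip the line.
 */
static int smaps_field_kb(const char *line, const char *key, unsigned long *kb)
{
	size_t len;

	assert(line && key && kb);
	len = strlen(key);

	/* Require an exact key match followed directly by the colon. */
	if (strncmp(line, key, len) != 0 || line[len] != ':')
		return 0;

	/* %lu skips the padding between the colon and the number. */
	return sscanf(line + len + 1, "%lu kB", kb) == 1;
}
```

A caller would feed each line of /proc/pid/smaps to this helper once per key it cares about, and compare the two values to spot the ppc64 case discussed here.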
Hmm, who wants this information?

I agree that
- an administrator wants to know whether these pages are normal or huge, and
- an administrator wants to know the hugepage size
  (e.g. x86_64 has two hugepage sizes, 2M and 1G),
but the ppc64 case above seems to be deeply implementation-dependent
information that nobody wants to know.
It looks like an obstacle to future enhancement.

So I disagree with
- showing both KernelPageSize and MMUPageSize for normal pages.

I would prefer one of the following two choices:

1) For normal pages, show PAGE_SIZE,
   because every userland application works as if pagesize == PAGE_SIZE
   on the current powerpc architecture:

   fs/binfmt_elf.c
   ------------------------------
   static int
   create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
                   unsigned long load_addr, unsigned long interp_load_addr)
   {
   (snip)
           NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
           NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE); /* pass ELF_EXEC_PAGESIZE to libc */

   include/asm-powerpc/elf.h
   -----------------------------
   #define ELF_EXEC_PAGESIZE PAGE_SIZE

2) For normal pages, display no page size at all;
   display the page size only for hugepages,
   because an administrator only wants the hugepage size (AFAICS).

Thoughts?
--
Someone doing performance analysis on POWER may want it. If they switched to a large base page size without using hugetlbfs at all and saw the same number of TLB misses, it could be explained by the lower MMU pagesize. Admittedly, it's ppc64-specific.

In the latest patch series, I made this a separate patch so that it could be readily dropped again for this reason. Maybe an alternative would be to display MMUPageSize *only* where it differs

I'm ok with this option and dropping the MMUPageSize patch as the user should already be able to identify that the hardware does not support 64K base pagesizes. I will leave the name as KernelPageSize so that it is still

I prefer option 1 as it's easier to parse the presence of information than to infer from the absence of it.

-- Mel Gorman
--
I would also think that any arch implementing fallback from large to small pages in a hugetlbfs area (Adam needs to post his patches :) would also use this. -- Dave --
Fair point. Maybe the thing to do is backburner this patch for the moment and reintroduce it when/if an architecture supports demotion? The KernelPageSize reporting in smaps and the hpagesize in maps are still useful though, I believe. Any comment?

(future stuff from here on)

In the future, if demotion does happen, then the MMUPageSize information may be genuinely useful instead of just a curious oddity on ppc64. As you point out, Adam (added to cc) has worked on this area (starting with x86 demotion) in the past, but I believe it will be a while before it is considered for merging.

That aside, more would need to be done with the page size reporting then anyway. For example, it may indicate how much of each pagesize is in a VMA, or indicate that KernelPageSize is what is being requested but in reality it is mixed, like

KernelPageSize: 2048 kB (mixed)

or

KernelPageSize: 2048 kB * 5, 4096 kB * 20

-- Mel Gorman
--
I'd kinda prefer to see it normalized into a single place rather than sprinkle it in each smaps file. We should be able to figure out which mount the file is from and, from there, maybe we need some per-mount

Looks a bit verbose, but I agree with the sentiment.

-- Dave
--
I don't get what you mean by it being sprinkled in each smaps file. How

Per-mount information is already exported and you can infer the data about huge pagesizes. For example, if you know the default huge pagesize (from /proc/meminfo), and the file is on hugetlbfs (read maps, then /proc/mounts) and there is no pagesize= mount option (mounts again), you could guess what hugepage size is backing a VMA. Shared memory segments are a little harder but again, you can infer the information if you look around for long enough.

However, this is awkward and not very user-friendly. With the patches (minus MMUPageSize, as I think we've agreed to postpone that), it's easy to see at a glance what pagesize is being used. Without it, you need to know a fair bit

Grand, I'll keep note of this to revisit it in the future when/if pagesizes get mixed in a VMA.

Thanks

-- Mel Gorman
--
1. figure out what the file path is from smaps
2. look up the mount

I agree completely. But, if we consider this a user ABI thing, then we're stuck with it for a long time, and we better make it flexible enough to at least contain the gunk we're planning on adding in a small number of years, like the fallback.

We don't want to be adding this stuff if it isn't going to be stable.

-- Dave
--
You should be able to do that today but it's not a particularly friendly task. I expect that without decent knowledge of how hugepages work you'll get it wrong. A userspace tool could do this of course, and would likely use stat on the file to get the blocksize if it was on hugetlbfs instead of consulting mounts. It's just not as user-friendly. Consider "cat smaps" as opposed to

What's wrong with

KernelPageSize: X kB

now, which a parser can easily handle, and later

KernelPageSize: X kB * nX Y kB * nY

where X is a pagesize and nX is the number of pages of that size in a VMA? The second format should not break a naive parser.

-- Mel Gorman
--
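To make the compatibility argument concrete, here is a minimal sketch of a parser that accepts both the current single-value form and the extended form floated in this thread. The extended format is only a proposal here, so its exact syntax is an assumption; parse_pagesizes() and struct pagesize_count are hypothetical names. A naive parser that reads only the first "X kB" sees the same value in both cases.

```c
#include <assert.h>
#include <stdio.h>

/* One "size * count" entry; count is 0 when the line does not state one. */
struct pagesize_count {
	unsigned long size_kb;
	unsigned long count;
};

/*
 * Parse the value part of a KernelPageSize line. Accepts "X kB" and the
 * proposed "X kB * nX Y kB * nY" form, with or without commas between
 * entries. Returns the number of entries stored in out[].
 */
static int parse_pagesizes(const char *value, struct pagesize_count *out, int max)
{
	int n = 0;

	assert(value && out);
	while (n < max) {
		int used = 0;

		/* %n is only written if the literal "kB" matched as well. */
		if (sscanf(value, " %lu kB%n", &out[n].size_kb, &used) != 1 || used == 0)
			break;
		value += used;

		/* The "* count" part is optional. */
		out[n].count = 0;
		used = 0;
		if (sscanf(value, " * %lu%n", &out[n].count, &used) == 1 && used > 0)
			value += used;

		/* Skip an optional separator before the next entry. */
		while (*value == ' ' || *value == ',')
			value++;
		n++;
	}
	return n;
}
```

With this shape, out[0].size_kb is the same for "2048 kB" and for "2048 kB * 5, 4096 kB * 20", which is why the later format should not upset a reader that stops after the first value.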
This could also be done as

KernelPageSize == Kernel page size that is ideally used in this VMA

and later

MixedPageSize == Breakdown of the pagesizes that are used in the VMA

-- Mel Gorman
--
OK. I'll review and test your latest patch without the MMUPageSize part (maybe around midnight today, or tomorrow). Thanks!
--
This patch adds a new field for hugepage-backed memory regions to show the pagesize in /proc/pid/maps. While the information is available in smaps, maps is more human-readable and does not incur the significant cost of calculating Pss. An example of /proc/self/maps output for an application using hugepages with this patch applied is:

08048000-0804c000 r-xp 00000000 03:01 49135 /bin/cat
0804c000-0804d000 rw-p 00003000 03:01 49135 /bin/cat
08400000-08800000 rw-p 00000000 00:10 4055 /mnt/libhugetlbfs.tmp.QzPPTJ (deleted) (hpagesize=4096kB)
b7daa000-b7dab000 rw-p b7daa000 00:00 0
b7dab000-b7ed2000 r-xp 00000000 03:01 116846 /lib/tls/i686/cmov/libc-2.3.6.so
b7ed2000-b7ed7000 r--p 00127000 03:01 116846 /lib/tls/i686/cmov/libc-2.3.6.so
b7ed7000-b7ed9000 rw-p 0012c000 03:01 116846 /lib/tls/i686/cmov/libc-2.3.6.so
b7ed9000-b7edd000 rw-p b7ed9000 00:00 0
b7ee1000-b7ee8000 r-xp 00000000 03:01 49262 /root/libhugetlbfs-git/obj32/libhugetlbfs.so
b7ee8000-b7ee9000 rw-p 00006000 03:01 49262 /root/libhugetlbfs-git/obj32/libhugetlbfs.so
b7ee9000-b7eed000 rw-p b7ee9000 00:00 0
b7eed000-b7f02000 r-xp 00000000 03:01 119345 /lib/ld-2.3.6.so
b7f02000-b7f04000 rw-p 00014000 03:01 119345 /lib/ld-2.3.6.so
bf8ef000-bf903000 rwxp bffeb000 00:00 0 [stack]
bf903000-bf904000 rw-p bffff000 00:00 0
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

To be predictable for parsers, the patch adds the notion of reporting VMA attributes by adding fields that look like "(attribute[=value])". This already happens when a file is deleted and the user sees (deleted) after the filename. The expectation is that existing parsers will not break, as those that read the filename should be reading forward after the inode number and stopping when they see something that is not part of the filename. Parsers that assume everything after / is a filename will get confused by (hpagesize=XkB) but are already broken due to (deleted).

Signed-off-by: Mel Gorman ...
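As an illustration of how a tool might read the proposed attribute, here is a minimal sketch that walks the trailing "(attribute[=value])" fields of one maps line from right to left. It is not from the patch: maps_hpagesize_kb() is a hypothetical helper, and it assumes attributes are separated from the filename and from each other by a single space, as in the example output. A filename that itself ends in " (something)" would still be ambiguous, which is the same problem (deleted) already has.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the value of a trailing "(hpagesize=XkB)" attribute in kB,
 * or 0 if the line carries no such attribute.
 */
static unsigned long maps_hpagesize_kb(const char *line)
{
	char buf[8192];
	size_t len;

	assert(line);

	/* Work on a bounded private copy so attributes can be cut off one by one. */
	snprintf(buf, sizeof(buf), "%s", line);
	len = strcspn(buf, "\n");
	buf[len] = '\0';

	while (len > 0 && buf[len - 1] == ')') {
		char *open = strrchr(buf, '(');
		unsigned long kb;

		/* An attribute must be preceded by a space. */
		if (!open || open == buf || open[-1] != ' ')
			break;
		if (sscanf(open, "(hpagesize=%lukB)", &kb) == 1)
			return kb;

		/* Not the one we want: drop it and look at the previous field. */
		open[-1] = '\0';
		len = strlen(buf);
	}
	return 0;
}
```

Reading from the end of the line means (deleted) and (hpagesize=...) can appear in either order, and lines without attributes, such as [stack] or anonymous mappings, fall through to 0.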
