Re: [PATCH 1/2] Report the pagesize backing a VMA in /proc/pid/smaps

From: Mel Gorman
Date: Sunday, September 21, 2008 - 6:38 pm

The following two patches add support for printing the size used for
hugepage-backed regions. This can be used by a user to verify that a
hugepage-aware application is using the expected page sizes.

The first patch should not be considered too contentious as it is highly
unlikely to break any parsers. There is a possibility that the second patch
will break parsers that arguably are already broken. More details are in
the patches themselves.

 fs/proc/task_mmu.c      |   29 +++++++++++++++++++++--------
 include/linux/hugetlb.h |   13 +++++++++++++
 2 files changed, 34 insertions(+), 8 deletions(-)

--

From: Mel Gorman
Date: Sunday, September 21, 2008 - 6:38 pm

It is useful to verify that a hugepage-aware application is using the expected
pagesizes in each of its memory regions. This patch reports the pagesize
backing the VMA in /proc/pid/smaps. This should not break any sensible
parser as the file format is multi-line and it should skip information it
does not recognise.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/proc/task_mmu.c      |    6 ++++--
 include/linux/hugetlb.h |   13 +++++++++++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 73d1891..81a3f91 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -394,7 +394,8 @@ static int show_smap(struct seq_file *m, void *v)
 		   "Private_Clean:  %8lu kB\n"
 		   "Private_Dirty:  %8lu kB\n"
 		   "Referenced:     %8lu kB\n"
-		   "Swap:           %8lu kB\n",
+		   "Swap:           %8lu kB\n"
+		   "PageSize:       %8lu kB\n",
 		   (vma->vm_end - vma->vm_start) >> 10,
 		   mss.resident >> 10,
 		   (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
@@ -403,7 +404,8 @@ static int show_smap(struct seq_file *m, void *v)
 		   mss.private_clean >> 10,
 		   mss.private_dirty >> 10,
 		   mss.referenced >> 10,
-		   mss.swap >> 10);
+		   mss.swap >> 10,
+		   vma_page_size(vma) >> 10);
 
 	return ret;
 }
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 32e0ef0..0c83445 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -231,6 +231,19 @@ static inline unsigned long huge_page_size(struct hstate *h)
 	return (unsigned long)PAGE_SIZE << h->order;
 }
 
+static inline unsigned long vma_page_size(struct vm_area_struct *vma)
+{
+	struct hstate *hstate;
+
+	if (!is_vm_hugetlb_page(vma))
+		return PAGE_SIZE;
+
+	hstate = hstate_vma(vma);
+	VM_BUG_ON(!hstate);
+
+	return 1UL << (hstate->order + PAGE_SHIFT);
+}
+
 static inline unsigned long huge_page_mask(struct hstate *h)
 {
 	return h->mask;
-- 
1.5.6.5
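
The changelog above relies on smaps parsers looking fields up by name and skipping lines they do not recognise. A minimal userspace sketch of such a parser follows. The function name smaps_field_kb is hypothetical, and the sketch assumes each field is formatted as "Key: <n> kB" as in the format string in this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up "Key:   <n> kB" in a multi-line smaps entry held in `text`.
 * Lines whose key does not match are skipped, so fields added by newer
 * kernels are simply ignored. Returns 1 and stores the value in *kb on
 * success, or 0 if the key is absent.
 */
static int smaps_field_kb(const char *text, const char *key, unsigned long *kb)
{
	const char *line = text;
	size_t keylen;

	assert(text != NULL && key != NULL && kb != NULL);
	keylen = strlen(key);

	while (*line != '\0') {
		const char *eol = strchr(line, '\n');

		/* The key must match at the start of the line, up to ':'. */
		if (strncmp(line, key, keylen) == 0 && line[keylen] == ':' &&
		    sscanf(line + keylen + 1, "%lu kB", kb) == 1)
			return 1;

		if (eol == NULL)
			break;
		line = eol + 1;
	}
	return 0;
}
```

A caller would read one VMA's worth of smaps output into a buffer and call smaps_field_kb(buf, "PageSize", &kb). An older kernel that does not print the field just produces a return value of 0.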

--

From: Andrew Morton
Date: Monday, September 22, 2008 - 1:30 am

CONFIG_HUGETLB_PAGE=n?

What did you hope to gain by inlining this?
--

From: Mel Gorman
Date: Monday, September 22, 2008 - 9:17 am

It was inlined to sit alongside similar helper functions in the header, but
that is the wrong thing to do in this case, which is obvious once pointed out.
It's too large and called from multiple places. I'll revise the patch.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Hansen
Date: Monday, September 22, 2008 - 8:55 am

Time to play devil's advocate. :)

To be fair, this doesn't return the MMU pagesize backing the VMA.  It
returns the pagesize that hugetlb reports *or* the kernel's base PAGE_SIZE.

The ppc64 case where we have a 64k PAGE_SIZE, but no hardware 64k
support means that we'll have a 4k MMU pagesize that we're pretending is
a 64k MMU page.  That might confuse someone seeing 16x the number of TLB
misses they expect.

This also doesn't work if, in the future, we get multiple page sizes
mapped under one VMA.  But, I guess that all only matters if you worry
about how the kernel is treating the pages vs. the MMU hardware.

-- Dave
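
The 16x figure above is plain division: a 64k kernel page that the hardware maps with 4k MMU pages needs 64 / 4 = 16 translations. A small sketch of that arithmetic, with a hypothetical helper name and both sizes given in bytes:

```c
#include <assert.h>

/*
 * Number of MMU pages needed to map one kernel page. Both sizes are in
 * bytes; the kernel page size is assumed to be a multiple of the MMU
 * page size.
 */
static unsigned long mmu_pages_per_kernel_page(unsigned long kernel_page_size,
					       unsigned long mmu_page_size)
{
	assert(mmu_page_size != 0);
	assert(kernel_page_size % mmu_page_size == 0);

	return kernel_page_size / mmu_page_size;
}
```

For the ppc64 case described here, mmu_pages_per_kernel_page(64UL << 10, 4UL << 10) evaluates to 16.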

--

From: Mel Gorman
Date: Monday, September 22, 2008 - 9:21 am

True. In the vast majority of cases, this is the MMU size with ppc64 on

The corollary is that someone running with a 64K base page kernel may be
surprised that the pagesize is always 4K. However I'll check if there is

Will deal with that problem if and when we encounter it. It may be a
case that VMAs split or that we could report how many pages of each MMU
size are in that VMA.

Thanks


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Hansen
Date: Monday, September 22, 2008 - 9:48 am

Sure.  If it isn't easy, the best thing to do is probably just to
document the "interesting" behavior.

-- Dave

--

From: KOSAKI Motohiro
Date: Tuesday, September 23, 2008 - 5:15 am

Dave, please let me know whether getpagesize() returns 4k or 64k on ppc64.
I think the PageSize line of /proc/pid/smaps and the getpagesize() result should match.

Otherwise, end users may be confused.



--

From: Mel Gorman
Date: Tuesday, September 23, 2008 - 12:46 pm

To distinguish between the two, I now report the kernel pagesize and the
mmu pagesize like so

KernelPageSize:       64 kB
MMUPageSize:           4 kB

This is running a kernel with a 64K base pagesize on a PPC970MP which
does not support 64K hardware pagesizes.

Does this make sense?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KOSAKI Motohiro
Date: Wednesday, September 24, 2008 - 5:32 am

Hmmm, who wants this information?

I agree that
  - An administrator wants to know whether these pages are normal or huge.
  - An administrator wants to know the hugepage size.
    (e.g. x86_64 has two hugepage sizes (2M and 1G))

but the ppc64 case above seems to be deeply implementation-dependent
information that nobody wants to know.

It also seems likely to become an obstacle to future enhancement.

So I disagree with
  - showing both KernelPageSize and MMUPageSize for normal pages.


I like the following two choices:


1) for normal pages, show PAGE_SIZE

because any userland application works as if pagesize==PAGE_SIZE
on the current powerpc architecture.

This follows from:

fs/binfmt_elf.c
------------------------------
static int
create_elf_tables(struct linux_binprm *bprm, struct elfhdr *exec,
                unsigned long load_addr, unsigned long interp_load_addr)
{
(snip)
        NEW_AUX_ENT(AT_HWCAP, ELF_HWCAP);
        NEW_AUX_ENT(AT_PAGESZ, ELF_EXEC_PAGESIZE); /* pass ELF_EXEC_PAGESIZE to libc */

include/asm-powerpc/elf.h
-----------------------------
#define ELF_EXEC_PAGESIZE       PAGE_SIZE 


2) for normal pages, display no page size at all;
   display the page size only in the hugepage case.

because an administrator wants only the hugepage size. (AFAICS)



Thoughts?
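
The AT_PAGESZ entry quoted above is how the kernel's PAGE_SIZE reaches userspace. A minimal sketch, assuming a Linux system whose C library provides getauxval() (glibc 2.16 or later), reads both the auxiliary-vector value and the value the C library reports so the two can be compared. The helper names are hypothetical.

```c
#include <assert.h>
#include <sys/auxv.h>
#include <unistd.h>

/* Page size the kernel passed in the ELF auxiliary vector (AT_PAGESZ). */
static unsigned long auxv_page_size(void)
{
	return getauxval(AT_PAGESZ);
}

/* Page size as reported to applications by the C library. */
static unsigned long libc_page_size(void)
{
	long sz = sysconf(_SC_PAGESIZE);

	assert(sz > 0);
	return (unsigned long)sz;
}
```

On a kernel built with a 64K base page size both helpers would be expected to return 65536 regardless of the MMU page size, which is the point being made about what applications see.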


--

From: Mel Gorman
Date: Wednesday, September 24, 2008 - 8:41 am

Someone doing performance analysis on POWER may want it. If they switched to
a large base page size without using hugetlbfs at all and saw the same number
of TLB misses, it could be explained by the lower MMU pagesize. Admittedly,

I admit it's ppc64-specific. In the latest patch series, I made this a
separate patch so that it could be readily dropped again for this reason.
Maybe an alternative would be to display MMUPageSize *only* where it differs


I'm ok with this option and dropping the MMUPageSize patch as the user
should already be able to identify that the hardware does not support 64K
base pagesizes. I will leave the name as KernelPageSize so that it is still

I prefer option 1 as it's easier to parse the presence of information
than infer from the absence of it.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Hansen
Date: Wednesday, September 24, 2008 - 9:06 am

I would also think that any arch implementing fallback from large to
small pages in a hugetlbfs area (Adam needs to post his patches :) would
also use this.

-- Dave

--

From: Mel Gorman
Date: Wednesday, September 24, 2008 - 10:10 am

Fair point. Maybe the thing to do is backburner this patch for the moment and
reintroduce it when/if an architecture supports demotion? The KernelPageSize
reporting in smaps and the hpagesize in maps are still useful though,
I believe. Any comment?

(future stuff from here on)

In the future if demotion does happen then the MMUPageSize information may
be genuinely useful instead of just a curious oddity on ppc64. As you point
out, Adam (added to cc) has worked on this area (starting with x86 demotion)
in the past but it'll be a while before it is considered for merging, I believe.

That aside, more would need to be done with the page size reporting then
anyway. For example, it may indicate how much of each pagesize is in a VMA
or indicate that KernelPageSize is what is being requested but in reality
it is mixed, like:

KernelPageSize:		2048 kB (mixed)

or

KernelPageSize:		2048 kB * 5, 4096 kB * 20


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Hansen
Date: Wednesday, September 24, 2008 - 11:59 am

I'd kinda prefer to see it normalized into a single place rather than
sprinkle it in each smaps file.  We should be able to figure out which
mount the file is from and, from there, maybe we need some per-mount

Looks a bit verbose, but I agree with the sentiment.

-- Dave

--

From: Mel Gorman
Date: Wednesday, September 24, 2008 - 12:11 pm

I don't get what you mean by it being sprinkled in each smaps file. How

Per-mount information is already exported and you can infer the data about
huge pagesizes. For example, if you know the default huge pagesize (from
/proc/meminfo), and the file is on hugetlbfs (read maps, then /proc/mounts)
and there is no pagesize= mount option (mounts again), you could guess which
hugepage size is backing a VMA. Shared memory segments are a little harder
but again, you can infer the information if you look around for long enough.

However, this is awkward and not very user-friendly. With the patches (minus
MMUPageSize as I think we've agreed to postpone that), it's easy to see what
pagesize is being used at a glance. Without it, you need to know a fair bit

Grand, I'll keep note of this to revisit it in the future when/if
pagesizes get mixed in a VMA. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Hansen
Date: Wednesday, September 24, 2008 - 12:23 pm

1. figure out what the file path is from smaps
2. look up the mount

I agree completely.  But, if we consider this a user ABI thing, then
we're stuck with it for a long time, and we better make it flexible
enough to at least contain the gunk we're planning on adding in a small
number of years, like the fallback.  We don't want to be adding this
stuff if it isn't going to be stable.

-- Dave

--

From: Mel Gorman
Date: Wednesday, September 24, 2008 - 4:39 pm

You should be able to do that today but it's not a particularly friendly
task. I expect without decent knowledge of how hugepages work that you'll get
it wrong. A userspace tool could do this of course and likely would use stat
on the file to get the blocksize if it was hugetlbfs instead of consulting
mounts. It's just not as user-friendly. Consider "cat smaps" as opposed to

What's wrong with

KernelPageSize: X kB 

now, which a parser can easily handle, and later

KernelPageSize: X kB * nX Y kB * nY

where X is a pagesize and nX is the number of pages of that size in a VMA?
The second format should not break a naive parser.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Wednesday, September 24, 2008 - 4:42 pm

This could also be done as

KernelPageSize == Kernel page size that is ideally used in this VMA

and later

MixedPageSize == Breakdown of the pagesizes that are used in the VMA

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
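
To illustrate why the extended value format suggested in this subthread should not trip a naive parser, here is a minimal sketch that reads the value part of such a line. The "size kB * count" breakdown is only a proposal in this discussion, not an existing kernel format, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

struct page_count {
	unsigned long size_kb;	/* page size in kB */
	unsigned long count;	/* pages of that size, 0 if not reported */
};

/*
 * Parse the value part of a line such as
 *   "2048 kB"
 *   "2048 kB * 5, 4096 kB * 20"
 * Returns the number of entries stored in out[], at most max. Anything
 * that does not look like "<n> kB" ends the parse, so trailing text
 * such as "(mixed)" is ignored.
 */
static int parse_page_sizes(const char *value, struct page_count *out, int max)
{
	int n = 0;

	assert(value != NULL && out != NULL);

	while (n < max) {
		int used = 0;

		/* %n is only stored if the literal "kB" matched as well. */
		if (sscanf(value, "%lu kB%n", &out[n].size_kb, &used) != 1 ||
		    used == 0)
			break;
		value += used;

		/* The "* count" part is optional. */
		out[n].count = 0;
		used = 0;
		if (sscanf(value, " * %lu%n", &out[n].count, &used) == 1 &&
		    used > 0)
			value += used;
		n++;

		/* Skip separators before the next pair, if any. */
		while (*value == ' ' || *value == ',')
			value++;
	}
	return n;
}
```

A parser that only does sscanf(value, "%lu kB", &kb) picks up the first size from either form, which is the sense in which the second format stays compatible.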
--

From: KOSAKI Motohiro
Date: Thursday, September 25, 2008 - 5:23 am

OK.

I'll review and test your latest patch without the MMUPageSize part.
(maybe around midnight tonight, or tomorrow)

Thanks!



--

From: Mel Gorman
Date: Sunday, September 21, 2008 - 6:38 pm

This patch adds a new field for hugepage-backed memory regions to show the
pagesize in /proc/pid/maps.  While the information is available in smaps,
maps is more human-readable and does not incur the significant cost of
calculating Pss. An example of a /proc/self/maps output for an application
using hugepages with this patch applied is;

08048000-0804c000 r-xp 00000000 03:01 49135      /bin/cat
0804c000-0804d000 rw-p 00003000 03:01 49135      /bin/cat
08400000-08800000 rw-p 00000000 00:10 4055       /mnt/libhugetlbfs.tmp.QzPPTJ (deleted) (hpagesize=4096kB)
b7daa000-b7dab000 rw-p b7daa000 00:00 0
b7dab000-b7ed2000 r-xp 00000000 03:01 116846     /lib/tls/i686/cmov/libc-2.3.6.so
b7ed2000-b7ed7000 r--p 00127000 03:01 116846     /lib/tls/i686/cmov/libc-2.3.6.so
b7ed7000-b7ed9000 rw-p 0012c000 03:01 116846     /lib/tls/i686/cmov/libc-2.3.6.so
b7ed9000-b7edd000 rw-p b7ed9000 00:00 0
b7ee1000-b7ee8000 r-xp 00000000 03:01 49262      /root/libhugetlbfs-git/obj32/libhugetlbfs.so
b7ee8000-b7ee9000 rw-p 00006000 03:01 49262      /root/libhugetlbfs-git/obj32/libhugetlbfs.so
b7ee9000-b7eed000 rw-p b7ee9000 00:00 0
b7eed000-b7f02000 r-xp 00000000 03:01 119345     /lib/ld-2.3.6.so
b7f02000-b7f04000 rw-p 00014000 03:01 119345     /lib/ld-2.3.6.so
bf8ef000-bf903000 rwxp bffeb000 00:00 0          [stack]
bf903000-bf904000 rw-p bffff000 00:00 0
ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]

To be predictable for parsers, the patch adds the notion of reporting
VMA attributes by adding fields that look like "(attribute[=value])". This
already happens when a file is deleted and the user sees (deleted) after the
filename. The expectation is that existing parsers will not break as those
that read the filename should be reading forward after the inode number
and stopping when it sees something that is not part of the filename.
Parsers that assume everything after / is a filename will get confused by
(hpagesize=XkB) but are already broken due to (deleted).

Signed-off-by: Mel Gorman ...
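
As an illustration of the "(attribute[=value])" convention described above, the following sketch extracts the huge page size from a maps line. It assumes the "(hpagesize=<n>kB)" spelling shown in the example output of this patch, which is a proposed format, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the huge page size in kB from a maps line carrying a
 * "(hpagesize=<n>kB)" attribute, or 0 if no such attribute is present.
 * The last occurrence on the line is used, since attributes follow the
 * filename. A filename that itself ends in such text remains ambiguous,
 * just as one ending in " (deleted)" does.
 */
static unsigned long maps_hpagesize_kb(const char *line)
{
	static const char tag[] = "(hpagesize=";
	const char *p, *last = NULL;
	unsigned long kb = 0;
	int used = 0;

	assert(line != NULL);

	for (p = strstr(line, tag); p != NULL; p = strstr(p + 1, tag))
		last = p;
	if (last == NULL)
		return 0;

	/* %n is only stored if the closing "kB)" matched. */
	if (sscanf(last + strlen(tag), "%lukB)%n", &kb, &used) != 1 ||
	    used == 0)
		return 0;

	return kb;
}
```

Applied to the libhugetlbfs.tmp line in the example output it would return 4096, and 0 for every other line shown.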