Re: [PATCH] hugepage: support ZERO_PAGE()

Previous thread: [PATCH -next] sysctl: simplify ->strategy by Alexey Dobriyan on Wednesday, August 27, 2008 - 10:02 pm. (1 message)

Next thread: x86: move dir es7000 to es7000_32.c by Yinghai Lu on Wednesday, August 27, 2008 - 11:01 pm. (2 messages)
From: KOSAKI Motohiro
Date: Wednesday, August 27, 2008 - 10:24 pm

Now, hugepage's vma has VM_RESERVED flag because it cannot be swapped.

and VM_RESERVED vma isn't core dumped because its flag often be used for
kernel internal vma (e.g. vmalloc, sound related).

So, hugepage is never dumped and it indicate hugepages's program can't be debugged easily.

In these days, demand on making use of hugepage is increasing.
IMO, native support for coredump of hugepage is useful.


I think VM_RESERVED default dumping bahavior is good,
then I'd like to add coredump_filter mask.

This patch doesn't change dafault behavior.


I tested by following method.

# ulimit -c unlimited
# echo 0x23 > /proc/self/coredump_filter
# ./hugepage_dump
# gdb ./hugepage_dump core


hugepage_dump.c
------------------------------------------------
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

#define HUGEPAGE_SIZE (256*1024*1024)

int main(int argc, char** argv)
{
	int err;
	int shmid;
	int *pval;
	int shm_flags = 0666;

	if ((argc >= 2) && (strcmp(argv[1], "-h")==0))
		shm_flags |= SHM_HUGETLB;

	err = shmid = shmget(IPC_PRIVATE, HUGEPAGE_SIZE, shm_flags);
	if (err < 0) {
		perror("shmget");
		exit(1);
	}

	pval = shmat(shmid, 0, 0);
	if (pval == (void*)-1) {
		perror("shmat");
		exit(1);
	}

	*pval = 1;

	*(int*)0 = 1;

	exit(0);
}
-----------------------------------------------------


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Kawai, Hidehiro <hidehiro.kawai.ez@hitachi.com>
CC: Hugh Dickins <hugh@veritas.com>
CC: William Irwin <wli@holomorphy.com>
CC: Adam Litke <agl@us.ibm.com>

---
 Documentation/filesystems/proc.txt |    3 ++-
 fs/binfmt_elf.c                    |    7 ++++++-
 include/linux/sched.h              |    3 ++-
 3 files changed, 10 insertions(+), 3 deletions(-)

Index: ...
From: Peter Teoh
Date: Wednesday, August 27, 2008 - 10:38 pm

Issuing the following:

git pull git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

result in the following output:

remote: Counting objects: 354, done.
remote: Compressing objects: 100% (220/220), done.
remote: Total 354 (delta 173), reused 278 (delta 134)
Receiving objects: 100% (354/354), 603.00 KiB | 10 KiB/s, done.
Resolving deltas: 100% (173/173), done.
* non-commit: keep
  a72c497614f3fec7453b826b33ab0303c3c087fc of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
* committish: 4c246edd2550304df5b766cc841584b2bb058843
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
fatal: Error in line 1: keep    not-for-merge
a72c497614f3fec7453b826b33ab0303c3c087fc of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6

I cannot find anything in the Internet for the above error, anyone can help?

Thank you very much.

-- 
Regards,
Peter Teoh
--

From: Adam Litke
Date: Thursday, August 28, 2008 - 7:48 am

I have tested this using both the provided C program and with a program
that used libhugetlbfs to back its text and data segments with huge
pages.  This seems to work as expected.  From a hugetlbfs perspective,

Tested-by: Adam Litke <agl@us.ibm.com>
-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--

From: KOSAKI Motohiro
Date: Thursday, August 28, 2008 - 7:59 am

Thank you, Adam!

Yes, libhugetlbfs user increased in these days.
--

From: Hugh Dickins
Date: Thursday, August 28, 2008 - 9:38 am

This seems very reasonable to me
(though I've little use for coredumps or hugepages myself).

One caution though: how well does it behave when coredumping a large
area of hugepages which have not actually been instantiated prior to
the coredump?  We have that funny FOLL_ANON ZERO_PAGE code in
follow_page() to avoid wasting memory on large uninstantiated anon
areas, but hugepages won't go that way.  If the dump hangs waiting for
memory to be freed, or OOMkills other processes, that wouldn't be good;
whereas if hugepage reservations (I've not followed what happens with
them) or whatever just make it skip when no more, that should be okay.

Hugh
--

From: KOSAKI Motohiro
Date: Thursday, August 28, 2008 - 4:35 pm

I think hugepage reservation pages always exist when hugepage COW happend.
Then, hugepage access never cause try_to_free_pages() nor OOM.

Adam, if I talk lie, please tell us truth.




--

From: Adam Litke
Date: Friday, August 29, 2008 - 8:28 am

(Mel, since you wrote the private reservation hugetlb code, would you
care to verify the following:)

Well, reserved huge pages _almost_ always exist.  The notable exception
happens when a process creates a MAP_PRIVATE hugetlb mapping and then
forks.  No guarantees are made to the children for access to that
hugetlb mapping.  So if such a child were to core dump an unavailable
huge page, follow_hugetlb_page() would fail.  I think that case is
harmless since it looks like elf_coredump() will replace it with a
zeroed page?

The part of Hugh's email that does deserve more attention is the part
about FOLL_ANON and the ZERO_PAGE.  It seems like an awful waste to zero
out and instantiate huge pages just for a core dump.  I think it would
be worth adding a flag to follow_hugetlb_page() so that it can be
instructed to not fault in un-instantiated huge pages.  This would take
some investigating as to whether it is even valid for
follow_hugetlb_page() to return the ZERO_PAGE().

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--

From: KOSAKI Motohiro
Date: Monday, September 1, 2008 - 6:21 pm

Adam, Thank you precious explain.

Honestly, I can't imazine non-zero-page-support cause terrible things.
Can you explain when happend the terrible things?
I don't know its problem is big issue or not.


Anyway, I made hugepage's zero page patch.
Could you please see it?



=======================================================================================
Subject: hugepage: supoort ZERO_PAGE()

Now, hugepage doesn't use zero page at all because zero page is almost used for coredumping only
and it isn't supported ago.

But now, we implemented hugepage coredumping and we should implement the zero page of hugepage.
The patch do that.


Implementation note:
-------------------------------------------------------------
o Why do we only check VM_SHARED for zero page?
  normal page checked as ..

	static inline int use_zero_page(struct vm_area_struct *vma)
	{
	        if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
	                return 0;
	
	        return !vma->vm_ops || !vma->vm_ops->fault;
	}

First, hugepages never mlock()ed. we don't need concern to VM_LOCKED.

Second, hugetlbfs is pseudo filesystem, not real filesystem and it doesn't have any file backing.
Then, ops->fault checking is meaningless.


o Why don't we use zero page if !pte.

!pte indicate {pud, pmd} doesn't exist or any error happend.
So, We shouldn't return zero page if any error happend.



test method
-------------------------------------------------------
console 1:

	# su
	# echo 100 >/proc/sys/vm/nr_hugepages
	# mount -t hugetlbfs none /hugetlbfs/
	# watch -n1 cat /proc/meminfo

console 2:
	% gcc -g -Wall crash_hugepage.c -o crash_hugepage -lhugetlbfs
	% ulimit -c unlimited
	% echo 0x23 >/proc/self/coredump_filter
	% HUGETLB_MORECORE=yes ./crash_hugepage 50
		-> segmentation fault
	% gdb

crash_hugepage.c
----------------------
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define HUGEPAGE_SIZE (2*1024*1024)

int main(int argc, char** argv){
	char* ...
From: Mel Gorman
Date: Tuesday, September 2, 2008 - 7:22 am

Also when MAP_NORESERVE is specified, there is not guarantee the huge

I believe the impact is that core dumps could take longer and be of a larger
size than necessary if uninstantiated pages are not skipped.  However, if the
zero page was used for anything other than core dumps, you have to be sure
that only the base pages worth of data is being read. I'm not convinced your
patch is doing that. For example, what happens if you ptrace an application


Offhand, I'm not 100% certain but I think it's because a shared mapping
should always fault the page in case another process sharing the mapping

It's not a bit deal but as you link against libhugetlbfs, you could have

This does not look safe in the ptrace case at all. If I ptrace the app
to read a hugetlbfs-backed region, get_user_pages() gets called and then
this. In that case, it would appear that a 4K page would be put in place

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Tuesday, September 2, 2008 - 8:13 am

D'oh, what was I thinking. It's only read in PAGE_SIZE portions so it will
not be reading beyond the boundary. It should be safe in the ptrace and

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Tuesday, September 2, 2008 - 9:27 am

Here is a second go at reviewing this after spending a bit more time on

I checked and this appears to be ok for both gdb and direct-io at least.

Why does the signature need to change? You have the VMA and could check

Calculate pfn_offset within the if statement here so that the change
below is unnecessary. The zeropage_ok ? 0 : pfn_offset is trickier to

For direct-IO on NUMA, do we care that we are now calling get_page() and
bumping the reference count on the zero page instead of a struct page
that could be local? I suspect the answer is "no" because the same

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Adam Litke
Date: Tuesday, September 2, 2008 - 10:27 am

In addition to the comments from Mel, I'd like to see this function
collapsed a bit...

if (!ptep || write || shared)
	return 0;
else
	return huge_pte_none(huge_ptep_get(ptep));

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

--

From: Hidehiro Kawai
Date: Sunday, August 31, 2008 - 11:00 pm

Hi Kosaki-san,


Hugetlb VMAs fall also into one of the lowest 4 bit case.
If you introduce the hugetlb bit (bit 5), you'd better describe that
the hugetlb bit takes precedence over the lowest 4 bits.

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: KOSAKI Motohiro
Date: Monday, September 1, 2008 - 7:18 pm

Thank you good review.
I updated that documentations.


=====================================================================
Subject: [PATCH v2] coredump_filter: add hugepage core dumping

Now, hugepage's vma has VM_RESERVED flag in order not to being swapped.
But VM_RESERVED vma isn't core dumped because this flag is often used for
some kernel vmas (e.g. vmalloc, sound related).

Then hugepage is never dumped and it can't be debugged easily.
Many developers want hugepages to be included into core-dump.

I think current VM_RESERVED default dumping bahavior is good,
then I'd like to add coredump_filter mask.


I tested by following method.

# ulimit -c unlimited
# echo 0x23 > /proc/self/coredump_filter
# ./hugepage_dump
# gdb ./hugepage_dump core


hugepage_dump.c
------------------------------------------------
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

#define HUGEPAGE_SIZE (256*1024*1024)

int main(int argc, char** argv)
{
	int err;
	int shmid;
	int *pval;
	int shm_flags = 0666;

	if ((argc >= 2) && (strcmp(argv[1], "-h")==0))
		shm_flags |= SHM_HUGETLB;

	err = shmid = shmget(IPC_PRIVATE, HUGEPAGE_SIZE, shm_flags);
	if (err < 0) {
		perror("shmget");
		exit(1);
	}

	pval = shmat(shmid, 0, 0);
	if (pval == (void*)-1) {
		perror("shmat");
		exit(1);
	}

	*pval = 1;

	*(int*)0 = 1;

	exit(0);
}
-----------------------------------------------------


Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Kawai, Hidehiro <hidehiro.kawai.ez@hitachi.com>
CC: Hugh Dickins <hugh@veritas.com>
CC: William Irwin <wli@holomorphy.com>
Tested-by: Adam Litke <agl@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>

---
 Documentation/filesystems/proc.txt |    6 +++++-
 fs/binfmt_elf.c                    |    6 +++++-
 include/linux/sched.h              |    3 ++-
 3 files changed, 12 insertions(+), 3 ...
From: Mel Gorman
Date: Tuesday, September 2, 2008 - 6:48 am

The meaning of VM_RESERVED is a bit more complicated than that, but


That aside, it is not always safe to read a VM_RESERVED area without

I wonder about this and if that is the right thing to do. By default,
core dumps include anonymous private and shared memory. Strictly speaking,
hugetlbfs is a file-backed mapping but it is always a ram-based file and a
private mapping is likely to contain information that would normally be in
a private anonymous mapping. It feels like that information should be core
dumped by default. Would it be better to

1. Distinguish between private and shared hugetlbfs mappings
2. Default dump MAP_PRIVATE hugetlbfs mappings


It's not your fault, but the meaning of bit 4 appears to be


Otherwise, it appears ok.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KOSAKI Motohiro
Date: Friday, September 5, 2008 - 1:06 am

make much sense.
ok, I'll make again under your design.

Thank you very much.


--

From: Hidehiro Kawai
Date: Sunday, September 7, 2008 - 6:51 pm

[Added CC to Roland McGrath]


I think it was just forgotten to be updated.  Bit 4 was introduced
by Roland McGrath, and it means elf header pages in file-backed
private VMAs are dumped even if bit 2 is cleared.

Thanks,


Subject: [PATCH] coredump_filter: add description of bit 4

There is no description of bit 4 of coredump_filter in the
documentation.  This patch adds it.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
CC: Roland McGrath <roland@redhat.com>
---
 Documentation/filesystems/proc.txt |    2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6.27-rc5/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.27-rc5.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.27-rc5/Documentation/filesystems/proc.txt
@@ -2394,6 +2394,8 @@ The following 4 memory types are support
   - (bit 1) anonymous shared memory
   - (bit 2) file-backed private memory
   - (bit 3) file-backed shared memory
+  - (bit 4) ELF header pages in file-backed private memory areas (it is
+            effective only if the bit 2 is cleared)
 
   Note that MMIO pages such as frame buffer are never dumped and vDSO pages
   are always dumped regardless of the bitmask status.



--

From: KOSAKI Motohiro
Date: Tuesday, September 9, 2008 - 4:20 am

Very thanks, kawai-san.

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
--

From: Roland McGrath
Date: Tuesday, September 9, 2008 - 11:04 pm

It was certainly not meant to be excluded from any documentation.  

To make the description clear, MMF_DUMP_ELF_HEADERS applies when a
file-backed private mapping would otherwise be elided (based on the other
bits and whether it's been modified).  If the vma maps offset 0 of the
file, and that page is readable and starts with ELFMAG, then just that
first page will be dumped.  (So p_filesz will be PAGE_SIZE rather than 0
for an elided mapping, or p_memsz for a mapping dumped whole.)


Thanks,
Roland
--

From: KOSAKI Motohiro
Date: Tuesday, September 9, 2008 - 11:53 pm

Could you please make the patch.
Then, I'll ack it with presure.

Thanks.
--

Previous thread: [PATCH -next] sysctl: simplify ->strategy by Alexey Dobriyan on Wednesday, August 27, 2008 - 10:02 pm. (1 message)

Next thread: x86: move dir es7000 to es7000_32.c by Yinghai Lu on Wednesday, August 27, 2008 - 11:01 pm. (2 messages)