Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT

Previous thread: [PATCH] ieee1394: dv1394: move deprecation message from module init to file open by Stefan Richter on Tuesday, February 3, 2009 - 9:54 am. (1 message)

Next thread: [PATCH] firewire: normalize a variable name by Stefan Richter on Tuesday, February 3, 2009 - 9:55 am. (1 message)
From: Eli Cohen
Date: Tuesday, February 3, 2009 - 9:49 am

Drivers may want to take advantage of the large pages used for memory obtained
from hugetlbfs. One example is mlx4_ib which can use much less MTT entries (in
the order of HPAGE_SIZE / PAGE_SIZE) when registering such memory, thus scale
significantly better when registering larger memory regions. Other drivers
could also benefit from this.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
---
 arch/powerpc/mm/hash_utils_64.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 8d5b475..6cff8c7 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -104,6 +104,7 @@ int mmu_highuser_ssize = MMU_SEGSIZE_256M;
 u16 mmu_slb_size = 64;
 #ifdef CONFIG_HUGETLB_PAGE
 unsigned int HPAGE_SHIFT;
+EXPORT_SYMBOL(HPAGE_SHIFT);
 #endif
 #ifdef CONFIG_PPC_64K_PAGES
 int mmu_ci_restrictions;
-- 
1.6.1

--

From: Roland Dreier
Date: Tuesday, February 3, 2009 - 6:08 pm

Forwarding Eli's patch below, since PowerPC guys may have missed it.  I
guess the question for Ben et al is whether there is any issue with
exporting HPAGE_SHIFT for modules (can be EXPORT_SYMBOL_GPL if you feel
it's an internal detail).  It would probably make sense to roll this
change into the mlx4 change that Eli alludes to below and merge through
my tree (with ppc maintainer acks of course), rather than splitting this
patch out and introducing cross-tree dependencies (and also separating
the rationale for the change from the change itself).

Thanks,
  Roland


Drivers may want to take advantage of the large pages used for memory obtained
from hugetlbfs. One example is mlx4_ib which can use much less MTT entries (in
the order of HPAGE_SIZE / PAGE_SIZE) when registering such memory, thus scale
significantly better when registering larger memory regions. Other drivers
could also benefit from this.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
---
 arch/powerpc/mm/hash_utils_64.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 8d5b475..6cff8c7 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -104,6 +104,7 @@ int mmu_highuser_ssize = MMU_SEGSIZE_256M;
 u16 mmu_slb_size = 64;
 #ifdef CONFIG_HUGETLB_PAGE
 unsigned int HPAGE_SHIFT;
+EXPORT_SYMBOL(HPAGE_SHIFT);
 #endif
 #ifdef CONFIG_PPC_64K_PAGES
 int mmu_ci_restrictions;
-- 
1.6.1


--

From: Benjamin Herrenschmidt
Date: Tuesday, February 3, 2009 - 6:50 pm

Except that we support multiple large page sizes nowadays ... I think
the size can be specified per mountpoint of hugetlbfs no ? Thus things
like mellanox would have to query the page size used for a given
mapping.

Do the generic hugetlbfs code provides such an API ? If not, we may need
to add one.

Cheers,

--

From: Andrew Morton
Date: Tuesday, February 3, 2009 - 10:13 pm

I think it's something like

	huge_page_size(page_hstate(page))

--

From: Nick Piggin
Date: Tuesday, February 3, 2009 - 10:31 pm

That would work if you have a page, yes. If you want to query which hugepage
sizes are available, then you probably want for_each_hstate() (which is only
within mm/hugetlb.c at the moment, but I have no objections to exporting it
and symbols it requires).

--

From: wli
Date: Tuesday, February 3, 2009 - 11:17 pm

Exciting things have happened in mm/hugetlb.c while I was out sick.
I've got quite some catching up to do.


-- wli
--

From: Roland Dreier
Date: Tuesday, February 3, 2009 - 11:16 pm

> I think it's something like
 > 
 > 	huge_page_size(page_hstate(page))

That would suit.  I assume the intention is for that to be usable by
driver modules on any architecture?

 - R.
--

From: Andrew Morton
Date: Tuesday, February 3, 2009 - 11:26 pm

erm, you overestimate the amount of planning and forethought which goes
into these things ;)

The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint.

--

From: Roland Dreier
Date: Wednesday, February 4, 2009 - 12:11 pm

> >  > 	huge_page_size(page_hstate(page))

 > > That would suit.  I assume the intention is for that to be usable by
 > > driver modules on any architecture?

 > erm, you overestimate the amount of planning and forethought which goes
 > into these things ;)

 > The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint.

Heh.  Looking into the implementation, it seems that I could actually do

	PAGE_SIZE << compound_order(page)

directly (since there's no reason to go from size to hstate and back to
size.  I don't know all the details of these VM internals, but that
seems to only work on the first (small) page of a giant page?  Which
causes problems for what we're trying to do here...

To summarize the goal, we are mapping user memory to a device that has
its own page tables, where the device's page tables can also use
multiple page sizes.  Using big pages on the device leads to similar
efficiencies as hugetlb pages do on the CPU, and in fact if a user has
used hugetlb pages for the memory they're giving to the device, that's a
very strong hint that the device should use big pages too.

But one valid situation we have to handle in the driver is if, say,
userspace has a hugetlb mapped at virtual address 0x200000, and wants to
map 0x80000 bytes at 0x280000 to the device.  In that case, we're going
to do essentially

	get_user_pages(..., 0x280000, 0x80000 / PAGE_SIZE, ...)

and get_user_pages() is going to give us a bunch of normal PAGE_SIZE
pages starting at offset 0x800000 within the compound page that makes up
the huge page mapped at 0x200000.

get_user_pages() also gives us the vma back, and we can see from
is_vm_hugetlb_page() (-- BTW can I just say that a function
is_xxx_page() that operates on vmas is horribly misnamed --) that these
pages all come from a hugetlb mapping, but figuring out the size of that
mapping is I guess a challenge.

 - R.
--

From: wli
Date: Wednesday, February 4, 2009 - 2:00 pm

You should be able to find the head of a compound page using the
compound_head() inline, so try

	PAGE_SIZE << compound_order(compound_head(page))


-- wli
--

From: Roland Dreier
Date: Wednesday, February 4, 2009 - 2:31 pm

> You should be able to find the head of a compound page using the
 > compound_head() inline, so try

 > 	PAGE_SIZE << compound_order(compound_head(page))

Thanks!  Looks like that should be exactly what we need.

 - R.
--

From: Andrew Morton
Date: Wednesday, February 4, 2009 - 2:23 pm

On Wed, 04 Feb 2009 11:11:22 -0800

compound_head() will convert any page* inside a hugepage into a pointer
to the head page.  It should work OK for regular pages as well as
CONFIG_HUGETLB=n.

So..

	PAGE_SIZE << compound_order(compound_head(page))

?
--

From: Benjamin Herrenschmidt
Date: Wednesday, February 4, 2009 - 4:55 pm

Note that g_u_p() has all sort of shortcommings... we were discussing
some of that recently due to bugs reported from the field.

The problem mostly is that you cannot guarantee that the physical page
will remain mapped to that virtual address in the process. For example,
if your code is part of some library used by an application, and that
application somewhere does a fork/exec (for example, a system() call to
run a shell helper), copy-on-write will hit, and you may end up with
the child process getting the original physical page and the original
process getting the copy...

So your HW will still DMA to a valid page (ie, it's count will have
been incremented) but it's not going to be the one the application
uses any more.

There are similar issues that can be cause, afaik, by madvise, etc...

We've been discussing that at KS with various people, Linus says g_u_p()
sucks, don't do that :-) Most of the time, the other approach should be
used, ie, the driver allocates memory, and userspace mmap's it, in which
case you get access to the VMA to set flags such as don't copy on fork.

An option possibly would be to make fork() pre-COW pages with an
elevated count to ensure that at least the original process is the one
to keep the original physical page... but that has other potential side
effects or performance issues.

A can of worms..

Cheers,
Ben.


--

From: Roland Dreier
Date: Wednesday, February 4, 2009 - 10:10 pm

> Note that g_u_p() has all sort of shortcommings... we were discussing
 > some of that recently due to bugs reported from the field.
 > 
 > The problem mostly is that you cannot guarantee that the physical page
 > will remain mapped to that virtual address in the process. For example,
 > if your code is part of some library used by an application, and that
 > application somewhere does a fork/exec (for example, a system() call to
 > run a shell helper), copy-on-write will hit, and you may end up with
 > the child process getting the original physical page and the original
 > process getting the copy...

Believe me, I'm well aware of that.  We added VM_DONTCOPY to allow apps
to fork() without the child triggering COW, but that means only the
parent process can use the mapped memory (and the app has to worry about
page boundaries etc).

 > We've been discussing that at KS with various people, Linus says g_u_p()
 > sucks, don't do that :-) Most of the time, the other approach should be
 > used, ie, the driver allocates memory, and userspace mmap's it, in which
 > case you get access to the VMA to set flags such as don't copy on fork.

Yes, but unfortunately MPI says apps can allocate memory however they
damn well please... in any case these issues are all-too-well-known in
the RDMA world for quite a while.

Thanks,
  Roland

--

From: Benjamin Herrenschmidt
Date: Wednesday, February 4, 2009 - 10:24 pm

Right, but then you need to set that in the VMA's, and thus gone is your

Yup. What do you think of the idea of pre-COWing pages with an elevated
count at fork time ?

Cheers,
Ben.


--

From: Roland Dreier
Date: Wednesday, February 4, 2009 - 10:33 pm

> Right, but then you need to set that in the VMA's, and thus gone is your
 > nice fast g_u_p() that doesn't touch VMAs :-)

Registering memory is a slow path thing in the RDMA world.  Speeding it
up is nice, so we make userspace do the madvise(VM_DONTCOPY) if it cares
but if it doesn't it can leave it out.

 > > Yes, but unfortunately MPI says apps can allocate memory however they
 > > damn well please... in any case these issues are all-too-well-known in
 > > the RDMA world for quite a while.

 > Yup. What do you think of the idea of pre-COWing pages with an elevated
 > count at fork time ?

Super-duper sucks if the first thing the child does is exec() :)
Also if the parent has registered > half the memory in the system then
it's instant OOM.  So not that useful for the RDMA case :)

The one thing that might make sense is to pre-COW any partial pages that
the parent has registered -- ie if half a page can be used by the child,
at least pre-COW that, but leave all the full pages with VM_DONTCOPY.

 - R.
--

Previous thread: [PATCH] ieee1394: dv1394: move deprecation message from module init to file open by Stefan Richter on Tuesday, February 3, 2009 - 9:54 am. (1 message)

Next thread: [PATCH] firewire: normalize a variable name by Stefan Richter on Tuesday, February 3, 2009 - 9:55 am. (1 message)