Re: [PATCH] x86: fix /proc/meminfo DirectMap

From: Hugh Dickins
Date: Friday, August 15, 2008 - 5:58 am

Do we actually want these DirectMap lines in the x86 /proc/meminfo?
I can see they're interesting to CPA developers and TLB optimizers,
but they don't fit its usual "where has all my memory gone?" usage.
If they are to stay, here are some fixes.

1. On x86_32 without PAE, they're not 2M but 4M pages: no need to
   mess with the internal enum, but show the right name to users.

2. Many machines can never show anything but 0 for DirectMap1G,
   so suppress that line unless direct_gbpages are really enabled.

3. The unit in /proc/meminfo is kB not number of pages: HugePages
   messed that up, but they're an example to regret not to follow.

4. Once we use kB, it's easy to see that 1GB has gone missing (which
   explains why CONFIG_CPA_DEBUG=y soon wraps DirectMap2M negative):
   because head_64.S's level2_ident_pgt entries were not counted.
   My fix is not ideal, but works for more and for less than 1G,
   and avoids interfering with early bootup pagetable contortions.
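
Points 1 to 3 above can be sketched in a few lines of user-space C. This is
only an illustration under stated assumptions, not the code in the patch: the
names sketch_level, sketch_cfg, put_line and format_directmap are hypothetical,
and the counters are passed in as a plain array rather than read from kernel
state. The shifts rely only on the page sizes themselves (4 kB = 1 << 2 kB,
2 MB = 1 << 11 kB, 4 MB = 1 << 12 kB, 1 GB = 1 << 20 kB).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical names for illustration; not the kernel's own symbols. */
enum sketch_level { SK_4K, SK_LARGE, SK_1G, SK_NUM };

struct sketch_cfg {
	int large_is_4m;	/* x86_32 without PAE: large pages are 4M, not 2M */
	int gbpages;		/* 1G direct mappings really enabled */
};

/* Append one "Name: <value> kB" line; returns the new offset into buf. */
static size_t put_line(char *buf, size_t len, size_t off,
		       const char *name, unsigned long kb)
{
	int n;

	if (off >= len)
		return off;
	n = snprintf(buf + off, len - off, "%-13s %8lu kB\n", name, kb);
	return n < 0 ? off : off + (size_t)n;
}

/* Format the DirectMap lines in kB from per-level page counts. */
size_t format_directmap(char *buf, size_t len,
			const unsigned long count[SK_NUM],
			const struct sketch_cfg *cfg)
{
	size_t off = 0;

	if (len)
		buf[0] = '\0';

	/* Point 3: report kB, not pages. A 4k page is 4 kB, hence << 2. */
	off = put_line(buf, len, off, "DirectMap4k:", count[SK_4K] << 2);

	/* Point 1: show the name that matches the real large-page size. */
	if (cfg->large_is_4m)
		off = put_line(buf, len, off, "DirectMap4M:",
			       count[SK_LARGE] << 12);
	else
		off = put_line(buf, len, off, "DirectMap2M:",
			       count[SK_LARGE] << 11);

	/* Point 2: suppress the 1G line unless gbpages are in use. */
	if (cfg->gbpages)
		off = put_line(buf, len, off, "DirectMap1G:",
			       count[SK_1G] << 20);

	return off;
}
```

With the values in kB the lines can be summed and compared against the amount
of memory in the machine, which is how a gap such as the one in point 4
becomes visible.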

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
You might prefer me to split these up?

Should we really be using level2_ident_pgt (which needs to avoid NX)
for the final direct map (which wants to use NX)?  But my attempt
to build up a fresh pagetable there failed miserably to boot!

 arch/x86/mm/init_64.c  |    6 +++++-
 arch/x86/mm/pageattr.c |   18 ++++++++++++------
 2 files changed, 17 insertions(+), 7 deletions(-)

--- 2.6.27-rc3/arch/x86/mm/init_64.c	2008-07-29 04:24:15.000000000 +0100
+++ linux/arch/x86/mm/init_64.c	2008-08-13 16:37:41.000000000 +0100
@@ -60,7 +60,7 @@ static unsigned long dma_reserve __initd
 
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
 
-int direct_gbpages __meminitdata
+int direct_gbpages
 #ifdef CONFIG_DIRECT_GBPAGES
 				= 1
 #endif
@@ -314,6 +314,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned 
 {
 	unsigned long pages = 0;
 	unsigned long last_map_addr = end;
+	unsigned long start = address;
 
 	int i = pmd_index(address);
 
@@ -334,6 +335,9 @@ ...

From: Andi Kleen
Date: Friday, August 15, 2008 - 6:15 am

I made them unconditional to minimize the risk of some dumb
parser not being able to deal with them. Longer term there
will be more and more machines that support them.

Admittedly that's not a very strong argument.

-Andi
--

From: Hugh Dickins
Date: Friday, August 15, 2008 - 6:32 am

Yes, that's what I meant by the TLB optimizers.  But it's going to
be a fractional effect, isn't it, when you're trying to get the last
1% out of the machine?  And in such a case, you might wonder more
what all the 4k ones are actually being used for (no problem at all
if they've ended up behind vmalloced module text).

Hugh
--

From: Andi Kleen
Date: Friday, August 15, 2008 - 6:36 am

Depending on the workload it can be much more than that.

-Andi
--

From: Ingo Molnar
Date: Friday, August 15, 2008 - 6:45 am

i cannot see any performance difference myself between 2MB and 1GB TLBs.

There are measurements that Andi Kleen did originally in this commit:

 commit 8346ea17aa20e9864b0f7dc03d55f3cd5620b8c1
 Author: Andi Kleen <andi@firstfloor.org>
 Date:   Wed Mar 12 03:53:32 2008 +0100

    x86: split large page mapping for AMD TSEG

    [lower is better]
                  no split stddev         split  stddev    delta
    Elapsed Time   87.146 (0.727516)     84.296 (1.09098)  -3.2%
    User Time     274.537 (4.05226)     273.692 (3.34344)  -0.3%
    System Time    34.907 (0.42492)      34.508 (0.26832)  -1.1%
    Percent CPU   322.5   (38.3007)     326.5   (44.5128)  +1.2%

    => About 3.2% improvement in elapsed time for kernbench.
 [...]

meanwhile i have Barcelona class hardware myself and i cannot reproduce 
these claimed improvements in kernbench performance. gbpages versus 
no-gbpages results are dead on the same, within statistical noise.
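
For reference, the delta column of the quoted table and a very crude noise
check can be reproduced with the sketch below. The helper names percent_delta
and exceeds_noise are hypothetical, and comparing the difference against the
sum of the two standard deviations is an assumption made here for
illustration, not a method used in the quoted commit or in this thread.

```c
/* Relative change from base to changed, in percent (negative = lower). */
double percent_delta(double base, double changed)
{
	return (changed - base) / base * 100.0;
}

/*
 * Crude heuristic (an assumption, not a proper significance test):
 * report 1 if the absolute difference of the two means is larger than
 * the sum of their standard deviations, 0 otherwise.
 */
int exceeds_noise(double base, double base_sd,
		  double changed, double changed_sd)
{
	double diff = changed - base;

	if (diff < 0)
		diff = -diff;
	return diff > base_sd + changed_sd;
}
```

For example, percent_delta(87.146, 84.296) gives the elapsed-time change from
the table, and exceeds_noise(87.146, 0.727516, 84.296, 1.09098) applies the
heuristic to the same row.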

( i'm sure it could make some difference in synthetic user-space 
  workloads - but gbpages are not exposed to user-space anyway. )

	Ingo
--

From: Ingo Molnar
Date: Friday, August 15, 2008 - 6:28 am

i think they are borderline useful - so i've applied your fixes to 


hm, exactly what change have you tried? (patch?)

	Ingo
--

From: Hugh Dickins
Date: Friday, August 15, 2008 - 7:30 am

As soon as that kernel failed to boot, I chucked the patch away and
erased it from my mind: much better to leave such a change to the
people who are intimate with this sequence and can debug it.

It wasn't anything much, the page to use has already been set aside
for alloc_low_page, I thought it was just a matter of breaking the
association with level2_ident_pgt at the right level then letting
phys_pmd_init do its usual setup from scratch.

Maybe it didn't work because I got it slightly wrong, or maybe
it didn't work for more subtle reasons e.g. I was then building up
the first 1GB of direct map 2MB by 2MB: if direct map is actually
used in there and falls out of TLB, I'd certainly be in trouble.

Hugh
--
