Re: [patch] mm: reduce pagetable-freeing latencies

Previous thread: 2.6.23-rc1 sky2 boot crash in sky2_mac_intr by Florian Lohoff on Tuesday, July 24, 2007 - 1:22 am. (6 messages)

Next thread: dm: device mapper code cleanup. by Dmitry Monakhov on Tuesday, July 24, 2007 - 1:51 am. (10 messages)
From: Ingo Molnar
Date: Tuesday, July 24, 2007 - 1:38 am

the patch below has been tested in -rt for a long time - lets get it 
upstream please. It fixes some bad latencies on PREEMPT too.

	Ingo

----------------------->
From: Hugh Dickins <hugh@veritas.com>
Subject: mm: reduce pagetable-freeing latencies

2.6.15-rc1 moved the unlinking of a vma from its prio_tree and anon_vma
into free_pgtables: so the vma is hidden from rmap and vmtruncate before
freeing its page tables, allowing safe descent without page table lock.
But free_pgtables is still called with preemption disabled, and Lee
Revell has now detected high latency there.

The right fix will be to rework the mmu_gathering, not to need preemption
disabled; but for now an ugly CONFIG_PREEMPT block in free_pgtables, to
make an initial unlinking pass with preemption enabled - made uglier by
CONFIG_IA64 definitions (only ia64 actually uses the start and end given
to tlb_finish_mmu, and our floor and ceiling don't quite work for those).
These CONFIG choices being to minimize the additional TLB flushing.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--

 mm/memory.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Index: linux-rt-rebase.q/mm/memory.c
===================================================================
--- linux-rt-rebase.q.orig/mm/memory.c
+++ linux-rt-rebase.q/mm/memory.c
@@ -264,18 +264,48 @@ void free_pgd_range(struct mmu_gather **
 		flush_tlb_pgtables((*tlb)->mm, start, end);
 }
 
+#ifdef CONFIG_IA64
+#define tlb_start_addr(tlb)	(tlb)->start_addr
+#define tlb_end_addr(tlb)	(tlb)->end_addr
+#else
+#define tlb_start_addr(tlb)	0UL	/* only ia64 really uses it */
+#define tlb_end_addr(tlb)	0UL	/* only ia64 really uses it */
+#endif
+
 void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
 		unsigned long floor, unsigned long ceiling)
 {
+#ifdef CONFIG_PREEMPT
+	struct vm_area_struct *unlink = vma;
+	int fullmm = (*tlb)->fullmm;
+
+	if (!vma)	/* Sometimes when ...
From: Andrew Morton
Date: Tuesday, July 24, 2007 - 1:54 am

What a truly putrid patch.  I am suspecting that this was a quick
get-you-out-of-trouble thing, which then got forgotten about.

We have two months to do the "right fix".  Please?
-

From: Benjamin Herrenschmidt
Date: Tuesday, July 24, 2007 - 2:40 am

Working on it... 

Ben.


-

From: Andi Kleen
Date: Tuesday, July 24, 2007 - 5:13 am

Ideally the patch would DTRT even on non preemptible kernels,
aka do cond_resched()s when needed.

-Andi
-

From: Benjamin Herrenschmidt
Date: Tuesday, July 24, 2007 - 2:29 pm

First is to rework the batch structure to make it more manageable. That
is, patch #1 will keep the page list in per-cpu (and thus non-preempt),
but the batch "head" will be on the stack.

Now, there are two approaches regarding getting rid of the
get_cpu/put_cpu:

 - One is to have a small number of entries for the page list in the
batch structure on the stack, and attempt to gfp' a page for more. If
that fails, we can still free, though with less batching, using only the
few entries in the batch struct itself. That's Hugh initial appraoch
iirc.

 - Another is to hook up with those folks who've been asking for a
notifier that we are being preempted/scheduled out. In this case, I can
happily access the per-cpu list, and just trigger a batch flush if we
happen to be scheduled out.

I tend to prefer the former solution though, gfp should be fast, and
there is no need to force a flush if we get scheduled out. It would be
rare to hit the worst case scenario of falling back to the few page
heads in the batch itself. On the other hand, that solution has the
problem of bloating the stack a bit (with the few page pointers) even in
the case where I plan to use the extended batch outside of zap_*, such
as fork, mprotect, ....

So I'll first do patch #1, which will not fix the problem, but will make
the fix easier to fit in, in the meantime, please provide feedback of
your preferred solution for avoiding the get/put_cpu of the 2 above,
unless you find a good 3rd one.

Cheers,
Ben.

-

From: Peter Zijlstra
Date: Tuesday, July 24, 2007 - 11:44 pm

I too would prefer the former solution. I think preemption notifiers are
a particular iffy hack.

You could perhaps use C99 variable length arrays to avoid the stack
waste when not needed, however Andi once told me that generates rather
dubious code.



-

From: Andi Kleen
Date: Wednesday, July 25, 2007 - 2:46 am

It generates frame pointers, but that's not that bad.  I'm not 
aware of any other bad side effects. Ok the compiler will limit
your goto usage, but that's more a good thing.

But since you always have to strictly limit the array in kernel code anyways
you could as well just allocate the fixed limit.

-Andi
-

From: Benjamin Herrenschmidt
Date: Wednesday, July 25, 2007 - 3:08 am

Plan is fixed array or 4 or maybe 8 entries (pointers), that shouldn't
be -too- bad. The code path I'm a bit worried about is
unmap_mapping_ranges() which goes into zapping page tables from deep
within filesystems.

At worst, I can reduce the fixed array to 1 entry. That means that if
the batch can't manage to get a page to use for the page list, it will
end up doing the flush for each page :-) But that should rarely happen,
in fact, I would expect it to be able to get a page the next time around
because it just freed one...

Ben.


-

From: Andi Kleen
Date: Wednesday, July 25, 2007 - 3:26 am

Your aim is to conserve stack space? 
The worst case has to work without overflow anyways, so using VLAs don't 
help you. Just allocate the largest size you can safely afford.

This is also easier to test; if it is too large it is better when
the overflow triggers always.

They are more useful in user space with larger stacks when you don't 
know the maximum size.

-Andi

-

From: Benjamin Herrenschmidt
Date: Wednesday, July 25, 2007 - 3:46 am

In fact, i discussed with peter and I think the best is to only have one
entry in the stack mmu_gather structure. If we fail to allocate a page
to batch entries, we flush that one and steal it :-) Then we have a page
to gather more.

I'll give that a go tomorrow, I got delayed by some stupid HW issues
today.

Cheers,
Ben.


-

From: Hugh Dickins
Date: Thursday, July 26, 2007 - 10:11 am

Four or eight would be much nicer than one, but you're right there's
a stackdepth worry: good spotting of those unmap_mapping_range cases,
I'd certainly been thinking we were fairly near the top of the stack

That will often be the case for anonymous pages, but not for pagecache
pages.  I had wondered whether to make exit_mmap unmap the stack first
(rather than text first as usually happens), but seems rather a hack.

Hugh
-

From: Benjamin Herrenschmidt
Date: Thursday, July 26, 2007 - 2:35 pm

Yup and defeated by HIGHMEM ,so I'm back to per-cpu page lists for now
and we'll see how to move to the next step once I've ironed out other
issues such as sparc64/ia64 related bits.

Ben.


-

From: Benjamin Herrenschmidt
Date: Friday, July 27, 2007 - 6:36 pm

As I'm sweeping through arch code etc... preparing the ground for the
proper mmu_gather surgery, I've been thinking about the way to deal with
that per-cpu page list and finally came up with the idea that the best
we can do is around the lines of trying to allocate the list via gfp,
and if that fails, fallback to a (smaller than now) per-cpu. I'm
reworking the interfaces such that the higher level code doesn't have to
care whether preemption is enabled or disabled at a given point.

Ben.


-

From: Hugh Dickins
Date: Friday, July 27, 2007 - 10:54 pm

That doesn't sound like the best way to me at all.  Using two means
of buffering, one with preemption enabled and the other not, seems
complex and prone to error (perhaps not while you're working on it,
but later on).  We do already have that problem (the i_mmap_lock case
versus the others), but it's not a complication I'd want to extend.

The onstack array seems fine to me, even if you do end up deciding on
an array of one.  Is there any evidence that it's a problem getting a
page for the freeing (other than in circumstances that are already
badly slowed down)?  It's obvious that we need a fallback route,
but optimizing throughput on that route seems premature.

Hugh
-

From: Benjamin Herrenschmidt
Date: Saturday, July 28, 2007 - 3:36 pm

Hrm, no evidence of that so far indeed. I'm worried by the stack usage
of the unmap_mapping_ranges() but appart from that, no.

Appart from that, yeah, I suppose we can have a macro defining how many
on-stack backup we have and adjust it if we see that being a problem.
I'm not fan of dynamic on-stack allocations.

Ben.


-

From: Ingo Molnar
Date: Tuesday, July 24, 2007 - 4:22 am

sure. Just wanted to bring attention to this issue - the regression got 
introduced in v2.6.15. I'd be happy to test-drive Ben's patch in -rt.

	Ingo
-

From: Benjamin Herrenschmidt
Date: Tuesday, July 24, 2007 - 2:40 am

Note that I'm working on the "right way" (reworking mmu gathering)

Patches hopefully this week.


-

Previous thread: 2.6.23-rc1 sky2 boot crash in sky2_mac_intr by Florian Lohoff on Tuesday, July 24, 2007 - 1:22 am. (6 messages)

Next thread: dm: device mapper code cleanup. by Dmitry Monakhov on Tuesday, July 24, 2007 - 1:51 am. (10 messages)