the patch below has been tested in -rt for a long time - lets get it
upstream please. It fixes some bad latencies on PREEMPT too.
Ingo
----------------------->
From: Hugh Dickins <hugh@veritas.com>
Subject: mm: reduce pagetable-freeing latencies
2.6.15-rc1 moved the unlinking of a vma from its prio_tree and anon_vma
into free_pgtables: so the vma is hidden from rmap and vmtruncate before
freeing its page tables, allowing safe descent without page table lock.
But free_pgtables is still called with preemption disabled, and Lee
Revell has now detected high latency there.
The right fix will be to rework the mmu_gathering, not to need preemption
disabled; but for now an ugly CONFIG_PREEMPT block in free_pgtables, to
make an initial unlinking pass with preemption enabled - made uglier by
CONFIG_IA64 definitions (only ia64 actually uses the start and end given
to tlb_finish_mmu, and our floor and ceiling don't quite work for those).
These CONFIG choices being to minimize the additional TLB flushing.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
mm/memory.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
Index: linux-rt-rebase.q/mm/memory.c
===================================================================
--- linux-rt-rebase.q.orig/mm/memory.c
+++ linux-rt-rebase.q/mm/memory.c
@@ -264,18 +264,48 @@ void free_pgd_range(struct mmu_gather **
flush_tlb_pgtables((*tlb)->mm, start, end);
}
+#ifdef CONFIG_IA64
+#define tlb_start_addr(tlb) (tlb)->start_addr
+#define tlb_end_addr(tlb) (tlb)->end_addr
+#else
+#define tlb_start_addr(tlb) 0UL /* only ia64 really uses it */
+#define tlb_end_addr(tlb) 0UL /* only ia64 really uses it */
+#endif
+
void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
unsigned long floor, unsigned long ceiling)
{
+#ifdef CONFIG_PREEMPT
+ struct vm_area_struct *unlink = vma;
+ int fullmm = (*tlb)->fullmm;
+
+ if (!vma) /* Sometimes when ...What a truly putrid patch. I am suspecting that this was a quick get-you-out-of-trouble thing, which then got forgotten about. We have two months to do the "right fix". Please? -
Ideally the patch would DTRT even on non preemptible kernels, aka do cond_resched()s when needed. -Andi -
First is to rework the batch structure to make it more manageable. That is, patch #1 will keep the page list in per-cpu (and thus non-preempt), but the batch "head" will be on the stack. Now, there are two approaches regarding getting rid of the get_cpu/put_cpu: - One is to have a small number of entries for the page list in the batch structure on the stack, and attempt to gfp' a page for more. If that fails, we can still free, though with less batching, using only the few entries in the batch struct itself. That's Hugh initial appraoch iirc. - Another is to hook up with those folks who've been asking for a notifier that we are being preempted/scheduled out. In this case, I can happily access the per-cpu list, and just trigger a batch flush if we happen to be scheduled out. I tend to prefer the former solution though, gfp should be fast, and there is no need to force a flush if we get scheduled out. It would be rare to hit the worst case scenario of falling back to the few page heads in the batch itself. On the other hand, that solution has the problem of bloating the stack a bit (with the few page pointers) even in the case where I plan to use the extended batch outside of zap_*, such as fork, mprotect, .... So I'll first do patch #1, which will not fix the problem, but will make the fix easier to fit in, in the meantime, please provide feedback of your preferred solution for avoiding the get/put_cpu of the 2 above, unless you find a good 3rd one. Cheers, Ben. -
I too would prefer the former solution. I think preemption notifiers are a particular iffy hack. You could perhaps use C99 variable length arrays to avoid the stack waste when not needed, however Andi once told me that generates rather dubious code. -
It generates frame pointers, but that's not that bad. I'm not aware of any other bad side effects. Ok the compiler will limit your goto usage, but that's more a good thing. But since you always have to strictly limit the array in kernel code anyways you could as well just allocate the fixed limit. -Andi -
Plan is fixed array or 4 or maybe 8 entries (pointers), that shouldn't be -too- bad. The code path I'm a bit worried about is unmap_mapping_ranges() which goes into zapping page tables from deep within filesystems. At worst, I can reduce the fixed array to 1 entry. That means that if the batch can't manage to get a page to use for the page list, it will end up doing the flush for each page :-) But that should rarely happen, in fact, I would expect it to be able to get a page the next time around because it just freed one... Ben. -
Your aim is to conserve stack space? The worst case has to work without overflow anyways, so using VLAs don't help you. Just allocate the largest size you can safely afford. This is also easier to test; if it is too large it is better when the overflow triggers always. They are more useful in user space with larger stacks when you don't know the maximum size. -Andi -
In fact, i discussed with peter and I think the best is to only have one entry in the stack mmu_gather structure. If we fail to allocate a page to batch entries, we flush that one and steal it :-) Then we have a page to gather more. I'll give that a go tomorrow, I got delayed by some stupid HW issues today. Cheers, Ben. -
Four or eight would be much nicer than one, but you're right there's a stackdepth worry: good spotting of those unmap_mapping_range cases, I'd certainly been thinking we were fairly near the top of the stack That will often be the case for anonymous pages, but not for pagecache pages. I had wondered whether to make exit_mmap unmap the stack first (rather than text first as usually happens), but seems rather a hack. Hugh -
Yup and defeated by HIGHMEM ,so I'm back to per-cpu page lists for now and we'll see how to move to the next step once I've ironed out other issues such as sparc64/ia64 related bits. Ben. -
As I'm sweeping through arch code etc... preparing the ground for the proper mmu_gather surgery, I've been thinking about the way to deal with that per-cpu page list and finally came up with the idea that the best we can do is around the lines of trying to allocate the list via gfp, and if that fails, fallback to a (smaller than now) per-cpu. I'm reworking the interfaces such that the higher level code doesn't have to care whether preemption is enabled or disabled at a given point. Ben. -
That doesn't sound like the best way to me at all. Using two means of buffering, one with preemption enabled and the other not, seems complex and prone to error (perhaps not while you're working on it, but later on). We do already have that problem (the i_mmap_lock case versus the others), but it's not a complication I'd want to extend. The onstack array seems fine to me, even if you do end up deciding on an array of one. Is there any evidence that it's a problem getting a page for the freeing (other than in circumstances that are already badly slowed down)? It's obvious that we need a fallback route, but optimizing throughput on that route seems premature. Hugh -
Hrm, no evidence of that so far indeed. I'm worried by the stack usage of the unmap_mapping_ranges() but appart from that, no. Appart from that, yeah, I suppose we can have a macro defining how many on-stack backup we have and adjust it if we see that being a problem. I'm not fan of dynamic on-stack allocations. Ben. -
sure. Just wanted to bring attention to this issue - the regression got introduced in v2.6.15. I'd be happy to test-drive Ben's patch in -rt. Ingo -
Note that I'm working on the "right way" (reworking mmu gathering) Patches hopefully this week. -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir |
