git commit 34feb2c83beb3bdf13535a36770f7e50b47ef299 started using quicklists
for freeing page table pages and removed the usage of tlb_remove_page()And looking at quicklist_free() and quicklist_free_page(), on a NUMA platform,
this can potentially free the page before the corresponding TLB caches
are flushed.Essentially quicklist free routines are doing something like
__quicklist_free()
...
if (unlikely(nid != numa_node_id())) {
__free_page(page);
...
}....
Now this will potentially cause a problem, if a cpu in someother node starts
using this page, while the corresponding TLB entries are still alive
in the original cpu which is still freeing the page table pages.This violates the guideline documented in
http://developer.intel.com/design/processor/applnots/317080.pdfThis potentially can cause SW failures and hard to debug issues like
http://www.ussg.iu.edu/hypermail/linux/kernel/0205.2/1254.htmlCan we revert this commit for 2.6.23 and look at this code post 2.6.23?
thanks,
suresh
-
Cannot get to this page:
Not Found
The requested URL /hypermail/linux/kernel/0205.2/1254.htm was not found on
We can remove this piece alone since it was an optimization.
---
include/linux/quicklist.h | 7 -------
1 file changed, 7 deletions(-)Index: linux-2.6/include/linux/quicklist.h
===================================================================
--- linux-2.6.orig/include/linux/quicklist.h 2007-09-21 11:46:44.000000000 -0700
+++ linux-2.6/include/linux/quicklist.h 2007-09-21 11:47:01.000000000 -0700
@@ -58,13 +58,6 @@ static inline void __quicklist_free(int
struct quicklist *q;
int nid = page_to_nid(page);- if (unlikely(nid != numa_node_id())) {
- if (dtor)
- dtor(p);
- __free_page(page);
- return;
- }
-
q = &get_cpu_var(quicklist)[nr];
*(void **)p = q->page;
q->page = p;
-
Hmm? We only add them to the quicklists in the exact same places where we
*used* to just free them entirely. So I don't see why semantics would haveI'll happily revert it, but I want to understand this better, so more of
an explanation of the codepath that actually does something wrong, please.Linus
-
The quicklists collect pages while we gather pages for TLB flushing.
These pages must be kept until the actual TLB flush has occurred. The
optimization of releasing off node pages early is therefore not valid.-
That should be the "mmu_gather" structure, not the quicklists.
Linus
-
Oh, and I see what's wrong: you not only switched "free_page()" to
"quicklist_free()", you *also* switched "tlb_remove_page()" to
"quicklist_free()".Ok, that commit is totally and utterly broken. Will revert.
Linus
-
On x86_64 the quicklists partially replace the role of the mmu_gather
list because we would otherwise create more overhead by having to move
pages between the mmu gather and the quicklists lists.
lists.The final flush does only check if there are too many pages on the
quicklists. Otherwise the quicklist is not dumped/freed (unlike the
mmu_gather list) but kept for the following page table page allocations
since we have cache hot cpu objects there.The only problem is that we cannot release any pages before the TLB flush
has happened. The optimization of releasing off node pages in order to
keep only node local pages is therefore not valid.---
include/linux/quicklist.h | 8 --------
1 file changed, 8 deletions(-)Index: linux-2.6/include/linux/quicklist.h
===================================================================
--- linux-2.6.orig/include/linux/quicklist.h 2007-09-21 11:46:44.000000000 -0700
+++ linux-2.6/include/linux/quicklist.h 2007-09-21 11:55:01.000000000 -0700
@@ -56,14 +56,6 @@ static inline void __quicklist_free(int
struct page *page)
{
struct quicklist *q;
- int nid = page_to_nid(page);
-
- if (unlikely(nid != numa_node_id())) {
- if (dtor)
- dtor(p);
- __free_page(page);
- return;
- }q = &get_cpu_var(quicklist)[nr];
*(void **)p = q->page;
-
Yeah, and the whole thing seems totally bogus. It totally depends on
mmu_gather doing everything right (which very much includes the dependency
on mmu gathering disabling preempt).For exmaple, if we were to go back to the original small tlb_gather with a
simple quicklist on the stack, rather than the per-cpu datastructure, the
quicklists would immediately break horribly - simply because they are
incorrectly now depending on the internal semantics of that tlb-gather.As it is, the quicklists try to be something separate, but by virtue of
being separate, they will always be buggy.The only way to fix it would be to integrate the quicklist stuff *with*
the mmu_gather stuff, so that these kinds of implementation issues are
explicitly shown in the relationship, instead of havign two "independent"
pieces of code where one piece very subtly depends on the exact
implementation of the other.Linus
-
I've been mostly offline since KS (since a bit before in fact), so I
missed some of those discussions but so that you know, I'm toying a bit
with mmu gather and page table accessors at the moment, and one of the
things I've been contemplating is just that ... integrating a quicklist
in the gather to handle just that (and possibly other issues that have
been overlooked on some archs).I'd suggest just reverting the patch for now (well, I see from the
commit list that you did just that) and I'll try to come up with
something better.Christoph, I'd be happy if you didn't start butchering mmu_gather just
right now since I'm doing just that and it will collide all over the
place :-) Or if you want something specific done, please throw
ideas/patches at me and I'll integrate that in my serie.Cheers,
Ben.-
That would be great. Note that the reversal of the x86_64 quicklist
support patch does not address the issue that is (as pointed out by
Siddha) in core code and not in x86_64 arch dode. The fix that I postedAny "butchering" was done for 2.6.22. 2.6.23 only added arch support for
x86_64 since Andi forgot to merge it for .22. I'd be glad if you could
look into this and make quicklists interoperate nicely withh mmu gather.
-
I'll try, I'll let you know. Give me a few days to finish catching up
with backlog of things before i get back to it though.Cheers,
Ben.-
The quicklists have been existing for a long time in this relationship
and were actually developed initially (by Dave Miller I believe) to work
this way. The generic TLB flushing thing may have developed in parallel onHmmmm.. Right the integration of the approaches that have now diverged on
I guess we need to re-join what was separated by developments on different
Right. But will that not mean that quicklists would have to be used on all
platforms in a generic way?
-
No. Each architecture can already choose how it does its own "mmu_gather".
*IF* the architecture uses a per-cpu variable for this, *THEN* the
architecture could also embed a "quicklist" in that per-cpu variable, and
make use of that. If you were to do it that way, it would all work fine.So for example, if you actually want to use quicklists in the "generic"
TLB-gather implementation, you should replace thestruct page *pages[FREE_PTE_NR];
entry in the "struct mmu_gather" with a set of quicklists instead.
Or, alternatively, you could decide to not use the generic version at all,
and instead do a per-architecture one that uses the quicklists.The problem with that commit that I just reverted was that it mixed the
two, but not completely. It still kept them separate.Linus
-
The quicklists have a long history and this bug has therefore also been in
IA64 for a long time and it also likely exists on sparc64, sh and sh64. We
need the patch that I posted to fix the other platforms. And with this fix
there would be nothing amiss on x86_64 either.---
include/linux/quicklist.h | 8 --------
1 file changed, 8 deletions(-)Index: linux-2.6/include/linux/quicklist.h
===================================================================
--- linux-2.6.orig/include/linux/quicklist.h 2007-09-21 11:46:44.000000000 -0700
+++ linux-2.6/include/linux/quicklist.h 2007-09-21 11:55:01.000000000 -0700
@@ -56,14 +56,6 @@ static inline void __quicklist_free(int
struct page *page)
{
struct quicklist *q;
- int nid = page_to_nid(page);
-
- if (unlikely(nid != numa_node_id())) {
- if (dtor)
- dtor(p);
- __free_page(page);
- return;
- }q = &get_cpu_var(quicklist)[nr];
*(void **)p = q->page;
-
IA64 doesn't do a pgd>pud>pmd>pte table walk in h/w. The VHPT leaps
straight to the pte page (through the virtual mapping of the VHPT). So
this wasn't a problem for IA64.-Tony
-
But this is concerning a page being freed before the TLB flush has been
performed. Another process may then reuse the page and may rely on the
fact that there is no TLB entry installed for the page that was just
allocated. But there is still one there. The issues that result from this
may be depend on the nature of each MMU.Granted we usually install another TLB entry for the page mapping the page
into a different address space and never use the old TLB. Which is likely
the reason why we have never seen an issue before and also why this is
probably difficult to cause any issues in the first place.Plus this can only occur for off node pages, meaning this must involve a
remote processor and the other processor is likely to have a completely
different set of TLB entries.
-
Side note: this would obviously require that the interfaces to the
"quicklists" would have to be changed, and you'd have an actual head
pointer rather than using explicitly numbered per-cpu variables, but that
seems to be a good idea in itself.Another option is to just not use quicklists AT ALL, but just have a
simple list of pages that is purely internal to mmu_gather, and just teach
pgd/pmd/pud_alloc about that list. That actually sounds like the most
straightforward and least confusing approach. The quicklists code is
pretty damn ugly in its mixing of "struct page" and the virtual address.Linus
-
Fine by me to revert it. I must admit I never 100% understood how it was
supposed to work anyways, but eventually gave up on it.-Andi
-
