Hi Dave & other XFS folk,
I'm tracking down a bug which appears to be a bad interaction between
XFS and Xen. It looks like XFS is holding RW mappings on free pages,
which Xen is trying to get an exclusive RO mapping on so it can turn
them into pagetables.
I'm assuming the pages are actually free, and this isn't a use after
free problem. From a quick poke around, the most likely pieces of XFS
code is the stuff in xfs_map.c which creates a virtually contiguous
mapping of pages with vmap, and seems to delay unmapping them.
When pinning a pagetable, Xen tries to eliminate any RW aliases of the
pages its using. This is generally trivial because pages returned by
get_free_page don't have any mappings aside from the normal kernel
mapping. High pages, when using CONFIG_HIGHPTE, may have a residual
kmap mapping, so we clear out the kmap cache if we encounter a highpage
in the pagetable.
I guess we could create a special-case interface to do the same thing
with XFS mappings, but it would be nicer to have something more generic.
Is my analysis correct? Or should XFS not be holding stray mappings?
Or is there already some kind of generic mechanism I can use to get it
to release its mappings?
Thanks,
J
-
This test patch confirms my theory:
diff -r 36a518c1fb4b fs/xfs/linux-2.6/xfs_buf.c
--- a/fs/xfs/linux-2.6/xfs_buf.c Fri Oct 12 10:03:56 2007 -0700
+++ b/fs/xfs/linux-2.6/xfs_buf.c Fri Oct 12 10:07:03 2007 -0700
@@ -186,6 +186,11 @@ free_address(
void *addr)
{
a_list_t *aentry;
+
+#ifdef CONFIG_XEN
+ vunmap(addr);
+ return;
+#endif
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
With this in place, the problem goes away.
J
-
You mean xfs_buf.c. And yes, we delay unmapping pages until we have a batch of them to unmap. vmap and vunmap do not scale, so this is batching helps The xfsbufd cleans out any stale mappings - it's woken by the memory shrinker interface (i.e. calls xfsbufd_wakeup()). Otherwise every 64th buffer being vmap()d will flush out stale mappings. Realistically, if this delayed release of vmaps is a problem for Xen, then I think that some generic VM solution is needed to this problem as vmap() is likely to become more common in future (think large blocks in filesystems). Nick - any comments? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
How much performance does it cost? What kind of workloads would it show
Well, the only real problem is that the pages are returned to the free
pool and reallocated while still being part of a mapping. If the pages
are still owned by the filesystem/pagecache, then there's no problem.
What's the lifetime of things being vmapped/unmapped in xfs? Are they
necessarily being freed when they're unmapped, or could unmapping of
freed memory be more immediate than other memory?
Maybe it just needs a notifier chain or something.
J
-
Yes, as Dave said, vmap (more specifically: vunmap) is very expensive because it generally has to invalidate TLBs on all CPUs. I'm looking at some more general solutions to this (already have some batching / lazy unmapping that replaces the XFS specific one), however they are still likely going to leave vmap mappings around after freeing the page. We _could_ hold on to the pages as well, but that's pretty inefficient. The memory cost of keeping the mappings around tends to be well under 1% the cost of the page itself. OTOH we could also avoid lazy flushes on architectures where it is not costly. Either way, it probably would require an arch hook or even a couple of ifdefs in mm/vmalloc.c for Xen. Although... it would be nice if Xen could take advantage of some of these optimisations as well. What's the actual problem for Xen? Anything that can be changed? -
Hm. Could there be a call to shoot down any lazy mappings of a page, so
the Xen pagetable code could use it on any pagetable page? Ideally one
that could be used on any page, but only causes expensive operations
In general the lazy unmappings won't worry Xen. It's only for the
specific case of allocating memory for pagetables. Xen can do a bit of
extra optimisation for cross-cpu tlb flushes (if the target vcpus are
not currently running, then you don't need to do anything), but they're
Not easily. Xen doesn't use shadow pagetables. Instead, it gives the
guest domains direct access to the real CPU's pagetable, but makes sure
they're always mapped RO so that the hypervisor can control updates to
the pagetables (either by trapping writes or via explicit hypercalls).
This means that when constructing a new pagetable, Xen will verify that
all the mappings of pages making up the new pagetable are RO before
allowing it to be used. If there are stray RW mappings of those pages,
pagetable construction will fail.
Aside from XFS, the only other case I've found where there could be
stray RW mappings is when using high pages which are still in the kmap
cache; I added an explicit call to flush the kmap cache to handle this.
If vmap and kmap can be unified (at least the lazy unmap aspects of
them), then that would be a nice little cleanup.
J
-
Yeah, it would be possible. The easiest way would just be to shoot down all lazy vmaps (because you're doing the global IPIs anyway, which are the expensive thing, at which point you may as well purge the rest of your lazy mappings). OK, I see. Because even though it is technically safe where we are using it (because nothing writes through the mappings after the page is freed), a corrupted guest could use the same window to do bad vmap is slightly harder than kmap in some respects. However it would be really nice to get vmap fast and general enough to completely replace all the kmap crud -- that's one goal, but the first thing I'm doing is to concentrate on just vmap to work out how to make it as fast as possible. For Xen -- shouldn't be a big deal. We can have a single Linux mm API to call, and we can do the right thing WRT vmap/kamp. I should try to merge my current lazy vmap patches which replace the XFS stuff, so we can implement such an API and fix your XFS issue? That's not going to happen for at least a cycle or two though, so in the meantime maybe an ifdef for that XFS vmap batching code would help? -
Yes. If there's some way to tell whether a particular page is in a lazy
mapping then that would help, since we could easily tell whether we need
to do the whole shootdown thing. I would expect the population of
lazily mapped pages in the whole freepage pool to be pretty small, but
if the allocator tends to return the most recently freed pages you might
hit them fairly regularly (shoving them at the other end of the freelist
That's right. The hypervisor doesn't trust the guests, so it prevents
For now I've proposed a patch to simply eagerly vunmap everything when
CONFIG_XEN is set. It certainly works, but I don't have a good feel for
how much of a performance hit that imposes on XFS. A slightly more
subtle change would be to test to see if we're actually running under
Xen before taking the eager path, so at least the performance burden
only affects actual Xen users (and I presume xfs+xen is a fairly rare
combination, or this problem would have turned up earlier, or perhaps
the old xenified kernels have some other workaround for it).
J
-
With defaults - little effect as vmap should never be used. It's only when you start using larger block sizes for metadata that this becomes an issue. The CONFIG_XEN workaround should be fine until we get a proper vmap cache.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Hm, well I saw the problem with a filesystem made with mkfs.xfs with no
options, so there must be at least *some* vmapping going on there.
J
-
Sorry - I should have been more precise - vmap should never be used in performance critical paths on default configs. Log recovery will trigger vmap/vunmap usage, so this is probably what you are seeing. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
The iclogs are also vmapped, but they aren't unmapped until unmount so this optimizations doesn't matter either. -
why is that? ignoring 32-bit archs we have heaps of address space available... couldn't the kernel just burn address space and delay global TLB invalidate by some relatively long time (say 1 second)? -dean -
Yes, that's precisely the problem. xfs does delay the unmap, leaving
stray mappings, which upsets Xen.
J
-
sounds like a bug in xen to me :) -dean -
You could call it a bug I think. I don't know much about Xen though, whether or not it expects to be able to run an arbitrary OS kernel. Presumably, the hypervisor _could_ write protect and trap writes to *all* page table page mappings. -
Xen's paravirtualized mode always requires a guest OS to be modified;
certainly some operating systems would be very hard to make work with
Xen. But you can always fall back to using shadow pagetables or full
Xen manages this stuff with refcounts; it doesn't maintain an rmap for
these pages, so it would have to exhaustively search to do this. But
aside from that, Xen never actively modifies pagetables, so this would
be a new behaviour in the interface.
J
-
I explained at the head of this thread how and why Xen works in this
manner. It's certainly a change from native execution; whether you
consider it to be a bug is a different matter.
But it turns out that leaving stray mappings around on pages which get
later used in special ways is a bad idea, and can manifest in all kinds
of random unpleasant ways. Getting a clear error out of Xen is probably
nicest way to discover and debug this problem.
J
-
Again it not just upsets Xen, keeping mappings to freed pages is wrong generally and violates the x86 (and likely others like PPC) architecture because it can cause illegal caching attribute aliases. The patch that went into the tree was really not correct -- this bogus optimization should have been unconditionally removed or if you really wanted an ifdef made dependent on !CONFIG_XEN && !CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that uses uncached mappings in memory). You just worked around the obvious failure and leave the non obvious rare corruptions in, which isn't a good strategy. -Andi -
Well, at least it becomes a known issue and/or placeholder for when Nick
does his grand unified vmap manager. I guess a clean workaround would
be to add a CONFIG_XFS_LAZY_UNMAP, and do it at the Kconfig level...
I'll cook up a patch.
J
-
It's hidden now so it causes any obvious failures any more. Just subtle ones which is much worse. But why not just disable it? It's not critical functionality, just a optimization that unfortunately turned out to be incorrect. It could be made correct short term by not freeing the pages until This option could only be safely set in architectures that don't care about caching attribute aliases or never remap kernel pages uncached. That's not x86, powerpc, ia64 at a minimum. I think the only good short term fix is to turn your ifdef CONFIG_XEN into a #if 0 -Andi -
I think it only ends up holding 64 pages (or is it 64 mappings?), so its
not a horrible use of memory. Particularly since it responds to memory
#if 1, but yes, I see your point.
J
-
There's no evidence of any problems ever being caused by this code. It is critical functionality for larger machines because vmap()/vunmap() Could vmap()/vunmap() take references to the pages that are mapped? That way delaying the unmap would also delay the freeing of the pages and hence we'd have no problems with the pages being reused before the mapping is torn down. That'd work for Xen even with XFS's lazy unmapping scheme, and would allow Nick's more advanced methods to work as well.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
We had a fairly nasty bug a couple of years ago with such a mapping that took months to track down. Luckily we had help from a hardware debug lab, without that it would have been very near impossible. But just stabilizing the problem was a nightmare. You will only see problems if the memory you're still mapping will be allocated by someone who turns it into an uncached mapping. And even then it doesn't happen always because the CPU does not always do the necessary prefetches. Also the corruption will not be on the XFS side, but on the uncached mapping so typically some data sent to some device gets a few corrupted cache line sized blocks. Depending on the device this might also not be fatal -- e.g. if it is texture data some corrupted pixels occasionally are not that visible. But there can be still cases where it can cause real failures when it hits something critical (the failures were it was tracked down was usually it hitting some command ring and the hardware erroring out) Also every time I broke change_page_attr() flushing (which handles exactly this problem and is tricky so it got broken a few times) I eventually got complaints/reports from some users who ran into such data inconsistencies. Given it was mostly from closed 3d driver users who seem to be most aggressive in using uncached mappings. But with more systems running more aggressive That doesn't mean it is correct. Basically you're saying: I don't care if I corrupt device driver Critical perhaps, but also broken. And if it's that big a problem would it be really that difficult to change only the time critical paths using it to not? I assume it's only the directory code where it is a problem? That would You could always just keep around an array of the pages and then drop the reference count after unmap. Or walk the vmalloc mapping and generate such an array before freeing, then unmap and then drop the reference counts. Or the other alternative would be change the time critical ...
It is a serious offense to leave stray mappings for memory which can get re-mapped to I/O devices... especially with PCI and other device hotplug. I have to back up Andi on this one unconditionally. On architectures where you have multibyte, non-wordsize updates to hardware page tables, you even have races here when setting, updating and clearing PTEs that must be done in a sequence where no aliasing of mappings to partially written PTEs can result in I/O memory getting mapped in a cacheable state. The window here is only one instruction, and yes, it is possible for a window this small to create a problem. A large (or permanently lazy) window is extremely frightening. These things do cause bugs. The bugs take a very long time to show up and are very difficult to track down, since they can basically cause random behavior, such as hanging the machine or losing orbit and crashing into the moon. Zach -
Right, but it also points to the fact that it's not causing problems No, I did not say that. I said that there's no evidence that points to Difficult - yes. All the btree code in XFS would have to be rewritten to remove the assumption that the buffer structures are contiguous in memory. Any bug we introduce in doing this will result in on disk corruption, which is far worse than occasionally crashing a machine with a stray You mean like vmap() could record the pages passed to it in the area->pages array, and we walk and release than in __vunmap() like it already does for vfree()? If we did this, it would probably be best to pass a page release function into the vmap or vunmap call - we'd need page_cache_release() called on the page rather than __free_page().... The solution belongs behind the vmap/vunmap interface, not in XFS.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Lightly tested(*) patch that does this with lazy unmapping
below for comment.
(*) a) kernel boots, b) made an XFS filesystem with 64k directory
blocks, created ~100,000 files in a directory to get a wide btree
(~1700 blocks, still only a single level) and run repeated finds
across it dropping caches in between. Each traversal maps and
unmaps every btree block.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
---
fs/xfs/linux-2.6/xfs_buf.c | 21 +----
include/linux/vmalloc.h | 4 +
mm/vmalloc.c | 160 +++++++++++++++++++++++++++++++++++----------
3 files changed, 133 insertions(+), 52 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-10-18 17:11:06.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-10-23 14:48:32.131088616 +1000
@@ -187,19 +187,6 @@ free_address(
{
a_list_t *aentry;
-#ifdef CONFIG_XEN
- /*
- * Xen needs to be able to make sure it can get an exclusive
- * RO mapping of pages it wants to turn into a pagetable. If
- * a newly allocated page is also still being vmap()ed by xfs,
- * it will cause pagetable construction to fail. This is a
- * quick workaround to always eagerly unmap pages so that Xen
- * is happy.
- */
- vunmap(addr);
- return;
-#endif
-
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
spin_lock(&as_lock);
@@ -209,7 +196,7 @@ free_address(
as_list_len++;
spin_unlock(&as_lock);
} else {
- vunmap(addr);
+ vunmap_pages(addr, put_page);
}
}
@@ -228,7 +215,7 @@ purge_addresses(void)
spin_unlock(&as_lock);
while ((old = aentry) != NULL) {
- vunmap(aentry->vm_addr);
+ vunmap_pages(aentry->vm_addr, put_page);
aentry = aentry->next;
kfree(old);
}
@@ -456,8 +443,8 @@ _xfs_buf_map_pages(
} else if (flags & XBF_MAPPED) {
if (as_list_len > 64)
...Hmm, the __free_page -> page_cache_release() change in vfree() would have been simpler wouldn't it? But if it works it is fine. -Andi -
Yes, it would, but I tried to leave vmalloc doing the same thing as it was before. I think that it would be safe simply to call put_page() directly in the __vunmap() code and drop all the release function passing, but I don't know enough about that code to say for certain. Seems to - it's passed XFSQA with 64k directories and a bunch of dirstress workouts as well. Nick, Jeremy, (others?) any objections to this approach to solve the problem? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Seems sounds in principle. I think Nick's shootdown-stray-mappings mm
API call is a better long-term answer, but this will do for now. I'll
test it out today.
J
-
Allow lazy unmapping of vmap()d regions be taking a reference
on each page in the region being vmap()ed. The page references
get released after the region is vunmap()ed, thereby ensuring
we don't leave stray mappings on freed pages that could lead to
problems pages being reallocated with incorrect mappings
associated with them.
Tested with XFS filesystems with 64k directory blocks with
dirstress and XFSQA.
Version 2:
- drop the release function stuff and just call put_page()
as it falls down to the same code in all calling cases.
Signed-Off-By: Dave Chinner <dgc@sgi.com>
---
fs/xfs/linux-2.6/xfs_buf.c | 21 +-------
include/linux/vmalloc.h | 4 +
mm/vmalloc.c | 118 +++++++++++++++++++++++++++++++++++++--------
3 files changed, 106 insertions(+), 37 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-10-18 17:11:06.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c 2007-10-24 09:16:26.930345566 +1000
@@ -187,19 +187,6 @@ free_address(
{
a_list_t *aentry;
-#ifdef CONFIG_XEN
- /*
- * Xen needs to be able to make sure it can get an exclusive
- * RO mapping of pages it wants to turn into a pagetable. If
- * a newly allocated page is also still being vmap()ed by xfs,
- * it will cause pagetable construction to fail. This is a
- * quick workaround to always eagerly unmap pages so that Xen
- * is happy.
- */
- vunmap(addr);
- return;
-#endif
-
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
spin_lock(&as_lock);
@@ -209,7 +196,7 @@ free_address(
as_list_len++;
spin_unlock(&as_lock);
} else {
- vunmap(addr);
+ vunmap_pages(addr);
}
}
@@ -228,7 +215,7 @@ purge_addresses(void)
spin_unlock(&as_lock);
while ((old = aentry) != NULL) {
- vunmap(aentry->vm_addr);
+ vunmap_pages(aentry->vm_addr);
aentry = aentry->next;
...This doesn't apply cleanly to the current git tree.
J
-
Allow lazy unmapping of vmap()d regions be taking a reference
on each page in the region being vmap()ed. The page references
get released after the region is vunmap()ed, thereby ensuring
we don't leave stray mappings on freed pages that could lead to
problems pages being reallocated with incorrect mappings
associated with them.
Tested with XFS filesystems with 64k directory blocks with
dirstress and XFSQA.
Version 3:
- compile on latest -git
Version 2:
- drop the release function stuff and just call put_page()
as it falls down to the same code in all calling cases.
Signed-off-by: Dave Chinner <dgc@sgi.com>
---
fs/xfs/linux-2.6/xfs_buf.c | 21 +-------
include/linux/vmalloc.h | 4 +
mm/vmalloc.c | 118 +++++++++++++++++++++++++++++++++++++--------
3 files changed, 106 insertions(+), 37 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b9c8589..38f073f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -187,19 +187,6 @@ free_address(
{
a_list_t *aentry;
-#ifdef CONFIG_XEN
- /*
- * Xen needs to be able to make sure it can get an exclusive
- * RO mapping of pages it wants to turn into a pagetable. If
- * a newly allocated page is also still being vmap()ed by xfs,
- * it will cause pagetable construction to fail. This is a
- * quick workaround to always eagerly unmap pages so that Xen
- * is happy.
- */
- vunmap(addr);
- return;
-#endif
-
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
spin_lock(&as_lock);
@@ -209,7 +196,7 @@ #endif
as_list_len++;
spin_unlock(&as_lock);
} else {
- vunmap(addr);
+ vunmap_pages(addr);
}
}
@@ -228,7 +215,7 @@ purge_addresses(void)
spin_unlock(&as_lock);
while ((old = aentry) != NULL) {
- vunmap(aentry->vm_addr);
+ vunmap_pages(aentry->vm_addr);
aentry = aentry->next;
kfree(old);
}
@@ -458,8 +445,8 @@ _xfs_buf_map_pages(
} else if (flags & ...Not quite: CC mm/vmalloc.o /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c: In function 'vm_area_alloc_pagearray': /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: 'GFP_LEVEL_MASK' undeclared (first use in this function) /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: (Each undeclared identifier is reported only once /home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: for each function it appears in.) make[3]: *** [mm/vmalloc.o] Error 1 make[2]: *** [mm] Error 2 make[2]: *** Waiting for unfinished jobs.... GFP_RECLAM_MASK now? J -
Yeah, it is. Not sure what happened there - I did make that change.
new diff below.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b9c8589..38f073f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -187,19 +187,6 @@ free_address(
{
a_list_t *aentry;
-#ifdef CONFIG_XEN
- /*
- * Xen needs to be able to make sure it can get an exclusive
- * RO mapping of pages it wants to turn into a pagetable. If
- * a newly allocated page is also still being vmap()ed by xfs,
- * it will cause pagetable construction to fail. This is a
- * quick workaround to always eagerly unmap pages so that Xen
- * is happy.
- */
- vunmap(addr);
- return;
-#endif
-
aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
if (likely(aentry)) {
spin_lock(&as_lock);
@@ -209,7 +196,7 @@ #endif
as_list_len++;
spin_unlock(&as_lock);
} else {
- vunmap(addr);
+ vunmap_pages(addr);
}
}
@@ -228,7 +215,7 @@ purge_addresses(void)
spin_unlock(&as_lock);
while ((old = aentry) != NULL) {
- vunmap(aentry->vm_addr);
+ vunmap_pages(aentry->vm_addr);
aentry = aentry->next;
kfree(old);
}
@@ -458,8 +445,8 @@ _xfs_buf_map_pages(
} else if (flags & XBF_MAPPED) {
if (as_list_len > 64)
purge_addresses();
- bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
- VM_MAP, PAGE_KERNEL);
+ bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+ VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
if (unlikely(bp->b_addr == NULL))
return -ENOMEM;
bp->b_addr += bp->b_offset;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 89338b4..40c34da 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
unsigned long flags, pgprot_t prot);
extern void vunmap(void *addr);
+extern void *vmap_pages(struct page **pages, ...The page tables contain pointers to the pages anyways. vunmap() has to walk them. It would not be very difficult to store them in an array during Possible. Normally vmalloc pages should not be in the LRU except yours You could also just keep the array from map time around yourself. Then you could do it yourself. -Andi -
Every vunmap() cal causes a global TLB sync, and the region lists are globl with a spin lock protecting them. I thin kNick has shown a 64p altix with ~60 cpus spinning on the vmap locks under a A directory traversal when using large directory block sizes The pages are still attached to the blockdev address space mapping, but there's nothing stopping them from being reclaimed before they are It's all "freed memory". At the time we pull the buffer down, there are no further references to the buffer. the pages are released and the mapping is never used again until it is torn down. it is torn down either on the next xfsbufd run (either memory pressure or every 15s) or every 64th We've already got a memroy shrinker hook that triggers this reclaim. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
You're keeping vmaps around for already freed pages? That will be a big problem for proper PAT support, which needs to track all mappings to memory. It's not just a problem for Xen. In fact I suspect it is already broken with DRM or AGP for example which can use UC and WC mappings -- if you keep the mapping around and DRM or AGP turns the page in another mapping uncacheable you're creating an illegal cache attribute alias. These are known to occasionally create cache corruptions on several x86s; giving ___VERY___ hard to debug bugs once a blue moon. Probably it'll require some generic VM batching mechanism where Xen or PAT code can hook into the list or force unmap the mappings as needed. Definitely needs to be fixed if true. You're lucky that Xen caught it in time. -Andi -
Is this true even if you don't write through those old mappings? Is DRM or AGP then not also broken with lazy highmem flushing, or I've been thinking that a simple debug mode could be a good idea. Have a new field in the struct page to count the number of possible unflushed mappings to the page so that things like PAT could go BUG_ON appropriate conditions. Would that be helpful? -
I think it happened for reads too. It is a little counter intuitive because in theory the CPU doesn't need to write back non dirty lines, but in the one case which took so long to debug exactly this happened somehow. At it is undefined for reads and writes in the architecture so better be safe than sorry. And x86 CPUs are out of order and do speculative executation and that can lead to arbitary memory accesses even if the code never touches an particular address. Newer Intel CPUs have something called self-snoop which was supposed to handle this; but in some situations it doesn't seem to catch it AGP doesn't allocate highmem pages. Not sure about the DRM code. -Andi -
Fair enough, so we have to have this lazy tlb flush hook for Xen/PAT/etc. I don't think it should be much problem to Hmm, OK. It looks like DRM vmallocs memory (which gives highmem). -
> Hmm, OK. It looks like DRM vmallocs memory (which gives highmem). I meant I'm not sure if it uses that memory uncached. I admit not quite understanding that code. There used to be at least one place where it set UC for an user mapping though. -Andi -
Currently the only DRM memory handed to userspace is vmalloc_32 in drm_scatter.c I notice the VIA driver does its own vmalloc for dmablit, so it may have an issue with this if highmem is involved. This will change with the upcoming memory manager so I'll need to investigate it a bit perhaps... Dave. -
The problem exist also on ppc, and afaik, is due to the line being in the cache at all (either dirty (write) or not (read)), thus causing the snoop logic to hit, that is what's causing the problem vs. non cached accesses. Also, on some processors, the simple fact of having the page mapped can cause the CPU to prefetch from it even if it's not actually accessed (speculative prefetch can cross page boundaries if things are mapped). Ben. -
Exactly that happens on x86. Normally prefetches stop on TLB miss, but the CPU can do speculative TLB fetches too. Also even without any prefetching the CPU does speculative execution and that can lead to random addresses being followed. -Andi -
