Re: Interaction between Xen and XFS: stray RW mappings

Previous thread: [PATCH] Discardable strings for init and exit sections by Maciej W. Rozycki on Friday, October 12, 2007 - 9:50 am. (10 messages)

Next thread: [2.6.24 PATCH 02/25] dm io:ctl use constant struct size by Alasdair G Kergon on Friday, October 12, 2007 - 10:05 am. (5 messages)
From: Jeremy Fitzhardinge
Date: Friday, October 12, 2007 - 9:58 am

Hi Dave & other XFS folk,

I'm tracking down a bug which appears to be a bad interaction between
XFS and Xen.  It looks like XFS is holding RW mappings on free pages,
which Xen is trying to get an exclusive RO mapping on so it can turn
them into pagetables.

I'm assuming the pages are actually free, and this isn't a use after
free problem.  From a quick poke around, the most likely pieces of XFS
code is the stuff in xfs_map.c which creates a virtually contiguous
mapping of pages with vmap, and seems to delay unmapping them.

When pinning a pagetable, Xen tries to eliminate any RW aliases of the
pages its using.  This is generally trivial because pages returned by
get_free_page don't have any mappings aside from the normal kernel
mapping.  High pages, when using CONFIG_HIGHPTE, may have a residual
kmap mapping, so we clear out the kmap cache if we encounter a highpage
in the pagetable.

I guess we could create a special-case interface to do the same thing
with XFS mappings, but it would be nicer to have something more generic.

Is my analysis correct?  Or should XFS not be holding stray mappings? 
Or is there already some kind of generic mechanism I can use to get it
to release its mappings?

Thanks,
    J
-

From: Jeremy Fitzhardinge
Date: Friday, October 12, 2007 - 10:08 am

This test patch confirms my theory:

diff -r 36a518c1fb4b fs/xfs/linux-2.6/xfs_buf.c
--- a/fs/xfs/linux-2.6/xfs_buf.c	Fri Oct 12 10:03:56 2007 -0700
+++ b/fs/xfs/linux-2.6/xfs_buf.c	Fri Oct 12 10:07:03 2007 -0700
@@ -186,6 +186,11 @@ free_address(
 	void		*addr)
 {
 	a_list_t	*aentry;
+
+#ifdef CONFIG_XEN
+	vunmap(addr);
+	return;
+#endif
 
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {


With this in place, the problem goes away.

    J
-

From: David Chinner
Date: Sunday, October 14, 2007 - 3:56 pm

You mean xfs_buf.c.

And yes, we delay unmapping pages until we have a batch of them
to unmap. vmap and vunmap do not scale, so this is batching helps


The xfsbufd cleans out any stale mappings - it's woken by the memory
shrinker interface (i.e. calls xfsbufd_wakeup()). Otherwise every
64th buffer being vmap()d will flush out stale mappings.

Realistically, if this delayed release of vmaps is a problem for
Xen, then I think that some generic VM solution is needed to this
problem as vmap() is likely to become more common in future (think
large blocks in filesystems). Nick - any comments?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Jeremy Fitzhardinge
Date: Sunday, October 14, 2007 - 4:12 pm

How much performance does it cost?  What kind of workloads would it show

Well, the only real problem is that the pages are returned to the free
pool and reallocated while still being part of a mapping.  If the pages
are still owned by the filesystem/pagecache, then there's no problem.

What's the lifetime of things being vmapped/unmapped in xfs?  Are they
necessarily being freed when they're unmapped, or could unmapping of
freed memory be more immediate than other memory?

Maybe it just needs a notifier chain or something.

    J
-

From: Nick Piggin
Date: Sunday, October 14, 2007 - 9:15 pm

Yes, as Dave said, vmap (more specifically: vunmap) is very expensive
because it generally has to invalidate TLBs on all CPUs.

I'm looking at some more general solutions to this (already have some
batching / lazy unmapping that replaces the XFS specific one), however
they are still likely going to leave vmap mappings around after freeing
the page.

We _could_ hold on to the pages as well, but that's pretty inefficient.
The memory cost of keeping the mappings around tends to be well under
1% the cost of the page itself. OTOH we could also avoid lazy flushes
on architectures where it is not costly. Either way, it probably would
require an arch hook or even a couple of ifdefs in mm/vmalloc.c for
Xen. Although... it would be nice if Xen could take advantage of some
of these optimisations as well.

What's the actual problem for Xen? Anything that can be changed?
-

From: Jeremy Fitzhardinge
Date: Sunday, October 14, 2007 - 5:57 pm

Hm.  Could there be a call to shoot down any lazy mappings of a page, so
the Xen pagetable code could use it on any pagetable page?  Ideally one
that could be used on any page, but only causes expensive operations

In general the lazy unmappings won't worry Xen.  It's only for the
specific case of allocating memory for pagetables.  Xen can do a bit of
extra optimisation for cross-cpu tlb flushes (if the target vcpus are
not currently running, then you don't need to do anything), but they're

Not easily.  Xen doesn't use shadow pagetables.  Instead, it gives the
guest domains direct access to the real CPU's pagetable, but makes sure
they're always mapped RO so that the hypervisor can control updates to
the pagetables (either by trapping writes or via explicit hypercalls). 
This means that when constructing a new pagetable, Xen will verify that
all the mappings of pages making up the new pagetable are RO before
allowing it to be used.  If there are stray RW mappings of those pages,
pagetable construction will fail.

Aside from XFS, the only other case I've found where there could be
stray RW mappings is when using high pages which are still in the kmap
cache; I added an explicit call to flush the kmap cache to handle this. 
If vmap and kmap can be unified (at least the lazy unmap aspects of
them), then that would be a nice little cleanup.

    J
-

From: Nick Piggin
Date: Monday, October 15, 2007 - 12:26 am

Yeah, it would be possible. The easiest way would just be to shoot down
all lazy vmaps (because you're doing the global IPIs anyway, which are
the expensive thing, at which point you may as well purge the rest of
your lazy mappings).



OK, I see. Because even though it is technically safe where we are
using it (because nothing writes through the mappings after the page
is freed), a corrupted guest could use the same window to do bad

vmap is slightly harder than kmap in some respects. However it would
be really nice to get vmap fast and general enough to completely
replace all the kmap crud -- that's one goal, but the first thing
I'm doing is to concentrate on just vmap to work out how to make it
as fast as possible.

For Xen -- shouldn't be a big deal. We can have a single Linux mm API
to call, and we can do the right thing WRT vmap/kamp. I should try to
merge my current lazy vmap patches which replace the XFS stuff, so we
can implement such an API and fix your XFS issue? That's not going to
happen for at least a cycle or two though, so in the meantime maybe
an ifdef for that XFS vmap batching code would help?
-

From: Jeremy Fitzhardinge
Date: Sunday, October 14, 2007 - 8:42 pm

Yes.  If there's some way to tell whether a particular page is in a lazy
mapping then that would help, since we could easily tell whether we need
to do the whole shootdown thing.  I would expect the population of
lazily mapped pages in the whole freepage pool to be pretty small, but
if the allocator tends to return the most recently freed pages you might
hit them fairly regularly (shoving them at the other end of the freelist

That's right.  The hypervisor doesn't trust the guests, so it prevents


For now I've proposed a patch to simply eagerly vunmap everything when
CONFIG_XEN is set.  It certainly works, but I don't have a good feel for
how much of a performance hit that imposes on XFS.  A slightly more
subtle change would be to test to see if we're actually running under
Xen before taking the eager path, so at least the performance burden
only affects actual Xen users (and I presume xfs+xen is a fairly rare
combination, or this problem would have turned up earlier, or perhaps
the old xenified kernels have some other workaround for it).

    J
-

From: David Chinner
Date: Sunday, October 14, 2007 - 9:11 pm

With defaults - little effect as vmap should never be used. It's
only when you start using larger block sizes for metadata that this
becomes an issue. The CONFIG_XEN workaround should be fine until we
get a proper vmap cache....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Jeremy Fitzhardinge
Date: Sunday, October 14, 2007 - 9:18 pm

Hm, well I saw the problem with a filesystem made with mkfs.xfs with no
options, so there must be at least *some* vmapping going on there.

    J
-

From: David Chinner
Date: Sunday, October 14, 2007 - 9:25 pm

Sorry - I should have been more precise - vmap should never be used in
performance critical paths on default configs.  Log recovery will
trigger vmap/vunmap usage, so this is probably what you are seeing.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Christoph Hellwig
Date: Monday, October 15, 2007 - 1:31 am

The iclogs are also vmapped, but they aren't unmapped until unmount so
this optimizations doesn't matter either.

-

From: dean gaudet
Date: Sunday, October 21, 2007 - 8:18 pm

why is that?  ignoring 32-bit archs we have heaps of address space 
available... couldn't the kernel just burn address space and delay global 
TLB invalidate by some relatively long time (say 1 second)?

-dean
-

From: Jeremy Fitzhardinge
Date: Sunday, October 21, 2007 - 8:34 pm

Yes, that's precisely the problem.  xfs does delay the unmap, leaving
stray mappings, which upsets Xen.

    J
-

From: dean gaudet
Date: Sunday, October 21, 2007 - 9:28 pm

sounds like a bug in xen to me :)

-dean
-

From: Nick Piggin
Date: Sunday, October 21, 2007 - 9:39 pm

You could call it a bug I think. I don't know much about Xen though,
whether or not it expects to be able to run an arbitrary OS kernel.

Presumably, the hypervisor _could_ write protect and trap writes to
*all* page table page mappings.
-

From: Jeremy Fitzhardinge
Date: Monday, October 22, 2007 - 11:37 am

Xen's paravirtualized mode always requires a guest OS to be modified;
certainly some operating systems would be very hard to make work with
Xen.  But you can always fall back to using shadow pagetables or full

Xen manages this stuff with refcounts; it doesn't maintain an rmap for
these pages, so it would have to exhaustively search to do this.  But
aside from that, Xen never actively modifies pagetables, so this would
be a new behaviour in the interface.

    J
-

From: Jeremy Fitzhardinge
Date: Monday, October 22, 2007 - 11:32 am

I explained at the head of this thread how and why Xen works in this
manner.  It's certainly a change from native execution; whether you
consider it to be a bug is a different matter.

But it turns out that leaving stray mappings around on pages which get
later used in special ways is a bad idea, and can manifest in all kinds
of random unpleasant ways.  Getting a clear error out of Xen is probably
nicest way to discover and debug this problem.

    J
-

From: Andi Kleen
Date: Monday, October 22, 2007 - 6:47 am

Again it not just upsets Xen, keeping mappings to freed pages is wrong generally 
and violates the x86 (and likely others like PPC) architecture because it can 
cause illegal caching attribute aliases.

The patch that went into the tree was really not correct -- this
bogus optimization should have been unconditionally removed
or if you really wanted an ifdef made dependent on !CONFIG_XEN &&
!CONFIG_AGP (and likely && !CONFIG_DRM && !CONFIG_anything else that
uses uncached mappings in memory). 

You just worked around the obvious failure and leave the non obvious
rare corruptions in, which isn't a good strategy.

-Andi
-

From: Jeremy Fitzhardinge
Date: Monday, October 22, 2007 - 11:40 am

Well, at least it becomes a known issue and/or placeholder for when Nick
does his grand unified vmap manager.  I guess a clean workaround would
be to add a CONFIG_XFS_LAZY_UNMAP, and do it at the Kconfig level... 
I'll cook up a patch.

    J
-

From: Andi Kleen
Date: Monday, October 22, 2007 - 12:07 pm

It's hidden now so it causes any obvious failures any more. Just
subtle ones which is much worse.

But why not just disable it?  It's not critical functionality,
just a optimization that unfortunately turned out to be incorrect.

It could be made correct short term by not freeing the pages until

This option could only be safely set in architectures that don't
care about caching attribute aliases or never remap kernel pages
uncached. That's not x86, powerpc, ia64 at a minimum.

I think the only good short term fix is to turn your ifdef CONFIG_XEN
into a #if 0

-Andi

-

From: Jeremy Fitzhardinge
Date: Monday, October 22, 2007 - 12:11 pm

I think it only ends up holding 64 pages (or is it 64 mappings?), so its
not a horrible use of memory.  Particularly since it responds to memory

#if 1, but yes, I see your point.

    J
-

From: David Chinner
Date: Monday, October 22, 2007 - 3:32 pm

There's no evidence of any problems ever being caused by this code.

It is critical functionality for larger machines because vmap()/vunmap()

Could vmap()/vunmap() take references to the pages that are
mapped? That way delaying the unmap would also delay the freeing
of the pages and hence we'd have no problems with the pages
being reused before the mapping is torn down. That'd work for
Xen even with XFS's lazy unmapping scheme, and would allow
Nick's more advanced methods to work as well....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Andi Kleen
Date: Monday, October 22, 2007 - 4:35 pm

We had a fairly nasty bug a couple of years ago with such
a mapping that took months to track down. Luckily we had 
help from a hardware debug lab, without that it would have
been very near impossible. But just stabilizing the problem was
a nightmare.

You will only see problems if the memory you're still mapping
will be allocated by someone who turns it into an uncached
mapping. And even then it doesn't happen always because
the CPU does not always do the necessary prefetches.

Also the corruption will not be on the XFS side,
but on the uncached mapping so typically some data sent
to some device gets a few corrupted cache line sized blocks.
Depending on the device this might also not be fatal -- e.g.
if it is texture data some corrupted pixels occasionally are 
not that visible. But there can be still cases where
it can cause real failures when it hits something critical
(the failures were it was tracked down was usually it hitting
some command ring and the hardware erroring out) 

Also every time I broke change_page_attr() flushing (which 
handles exactly this problem and is tricky so it got broken
a few times) I eventually got complaints/reports from some users who ran 
into such data inconsistencies. Given it was mostly from closed
3d driver users who seem to be most aggressive in using 
uncached mappings. But with more systems running more aggressive

That doesn't mean it is correct.

Basically you're saying: I don't care if I corrupt device driver

Critical perhaps, but also broken.

And if it's that big a problem would it be really that difficult
to change only the time critical paths using it to not?  I assume
it's only the directory code where it is a problem?  That would

You could always just keep around an array of the pages and 
then drop the reference count after unmap. Or walk the vmalloc mapping
and generate such an array before freeing, then unmap and then
drop the reference counts.

Or the other alternative would be change the time critical ...
From: Zachary Amsden
Date: Monday, October 22, 2007 - 5:16 pm

It is a serious offense to leave stray mappings for memory which can get
re-mapped to I/O devices... especially with PCI and other device
hotplug.  I have to back up Andi on this one unconditionally.

On architectures where you have multibyte, non-wordsize updates to
hardware page tables, you even have races here when setting, updating
and clearing PTEs that must be done in a sequence where no aliasing of
mappings to partially written PTEs can result in I/O memory getting
mapped in a cacheable state.  The window here is only one instruction,
and yes, it is possible for a window this small to create a problem.  A
large (or permanently lazy) window is extremely frightening.

These things do cause bugs.  The bugs take a very long time to show up
and are very difficult to track down, since they can basically cause
random behavior, such as hanging the machine or losing orbit and
crashing into the moon.

Zach

-

From: David Chinner
Date: Monday, October 22, 2007 - 5:36 pm

Right, but it also points to the fact that it's not causing problems

No, I did not say that. I said that there's no evidence that points to

Difficult - yes. All the btree code in XFS would have to be rewritten
to remove the assumption that the buffer structures are contiguous in
memory. Any bug we introduce in doing this will result in on disk corruption,
which is far worse than occasionally crashing a machine with a stray

You mean like vmap() could record the pages passed to it in the area->pages
array, and we walk and release than in __vunmap() like it already does
for vfree()?

If we did this, it would probably be best to pass a page release function
into the vmap or vunmap call - we'd need page_cache_release() called on
the page rather than __free_page()....

The solution belongs behind the vmap/vunmap interface, not in XFS....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: David Chinner
Date: Tuesday, October 23, 2007 - 12:04 am

Lightly tested(*) patch that does this with lazy unmapping
below for comment.

(*) a) kernel boots, b) made an XFS filesystem with 64k directory
blocks, created ~100,000 files in a directory to get a wide btree
(~1700 blocks, still only a single level) and run repeated finds
across it dropping caches in between.  Each traversal maps and
unmaps every btree block.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_buf.c |   21 +----
 include/linux/vmalloc.h    |    4 +
 mm/vmalloc.c               |  160 +++++++++++++++++++++++++++++++++++----------
 3 files changed, 133 insertions(+), 52 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-10-18 17:11:06.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c	2007-10-23 14:48:32.131088616 +1000
@@ -187,19 +187,6 @@ free_address(
 {
 	a_list_t	*aentry;
 
-#ifdef CONFIG_XEN
-	/*
-	 * Xen needs to be able to make sure it can get an exclusive
-	 * RO mapping of pages it wants to turn into a pagetable.  If
-	 * a newly allocated page is also still being vmap()ed by xfs,
-	 * it will cause pagetable construction to fail.  This is a
-	 * quick workaround to always eagerly unmap pages so that Xen
-	 * is happy.
-	 */
-	vunmap(addr);
-	return;
-#endif
-
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {
 		spin_lock(&as_lock);
@@ -209,7 +196,7 @@ free_address(
 		as_list_len++;
 		spin_unlock(&as_lock);
 	} else {
-		vunmap(addr);
+		vunmap_pages(addr, put_page);
 	}
 }
 
@@ -228,7 +215,7 @@ purge_addresses(void)
 	spin_unlock(&as_lock);
 
 	while ((old = aentry) != NULL) {
-		vunmap(aentry->vm_addr);
+		vunmap_pages(aentry->vm_addr, put_page);
 		aentry = aentry->next;
 		kfree(old);
 	}
@@ -456,8 +443,8 @@ _xfs_buf_map_pages(
 	} else if (flags & XBF_MAPPED) {
 		if (as_list_len > 64)
 ...
From: Andi Kleen
Date: Tuesday, October 23, 2007 - 2:30 am

Hmm, the __free_page -> page_cache_release() change in vfree() would
have been simpler wouldn't it? 

But if it works it is fine.

-Andi
-

From: David Chinner
Date: Tuesday, October 23, 2007 - 5:41 am

Yes, it would, but I tried to leave vmalloc doing the same thing as
it was before. I think that it would be safe simply to call put_page()
directly in the __vunmap() code and drop all the release function
passing, but I don't know enough about that code to say for certain.

Seems to - it's passed XFSQA with 64k directories and a bunch of
dirstress workouts as well.

Nick, Jeremy, (others?) any objections to this approach to solve
the problem?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Jeremy Fitzhardinge
Date: Tuesday, October 23, 2007 - 7:33 am

Seems sounds in principle.   I think Nick's shootdown-stray-mappings mm
API call is a better long-term answer, but this will do for now.  I'll
test it out today.

    J
-

From: David Chinner
Date: Tuesday, October 23, 2007 - 9:36 pm

Allow lazy unmapping of vmap()d regions be taking a reference
on each page in the region being vmap()ed. The page references
get released after the region is vunmap()ed, thereby ensuring
we don't leave stray mappings on freed pages that could lead to
problems pages being reallocated with incorrect mappings
associated with them.

Tested with XFS filesystems with 64k directory blocks with
dirstress and XFSQA.

Version 2:
  - drop the release function stuff and just call put_page()
    as it falls down to the same code in all calling cases.

Signed-Off-By: Dave Chinner <dgc@sgi.com>

---
 fs/xfs/linux-2.6/xfs_buf.c |   21 +-------
 include/linux/vmalloc.h    |    4 +
 mm/vmalloc.c               |  118 +++++++++++++++++++++++++++++++++++++--------
 3 files changed, 106 insertions(+), 37 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-10-18 17:11:06.000000000 +1000
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_buf.c	2007-10-24 09:16:26.930345566 +1000
@@ -187,19 +187,6 @@ free_address(
 {
 	a_list_t	*aentry;
 
-#ifdef CONFIG_XEN
-	/*
-	 * Xen needs to be able to make sure it can get an exclusive
-	 * RO mapping of pages it wants to turn into a pagetable.  If
-	 * a newly allocated page is also still being vmap()ed by xfs,
-	 * it will cause pagetable construction to fail.  This is a
-	 * quick workaround to always eagerly unmap pages so that Xen
-	 * is happy.
-	 */
-	vunmap(addr);
-	return;
-#endif
-
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {
 		spin_lock(&as_lock);
@@ -209,7 +196,7 @@ free_address(
 		as_list_len++;
 		spin_unlock(&as_lock);
 	} else {
-		vunmap(addr);
+		vunmap_pages(addr);
 	}
 }
 
@@ -228,7 +215,7 @@ purge_addresses(void)
 	spin_unlock(&as_lock);
 
 	while ((old = aentry) != NULL) {
-		vunmap(aentry->vm_addr);
+		vunmap_pages(aentry->vm_addr);
 		aentry = aentry->next;
 ...
From: Jeremy Fitzhardinge
Date: Tuesday, October 23, 2007 - 10:08 pm

This doesn't apply cleanly to the current git tree.

    J

-

From: David Chinner
Date: Wednesday, October 24, 2007 - 2:48 pm

Allow lazy unmapping of vmap()d regions be taking a reference
on each page in the region being vmap()ed. The page references
get released after the region is vunmap()ed, thereby ensuring
we don't leave stray mappings on freed pages that could lead to
problems pages being reallocated with incorrect mappings
associated with them.

Tested with XFS filesystems with 64k directory blocks with
dirstress and XFSQA.

Version 3:
  - compile on latest -git
 
Version 2:
  - drop the release function stuff and just call put_page()
    as it falls down to the same code in all calling cases.

Signed-off-by: Dave Chinner <dgc@sgi.com>

---

 fs/xfs/linux-2.6/xfs_buf.c |   21 +-------
 include/linux/vmalloc.h    |    4 +
 mm/vmalloc.c               |  118 +++++++++++++++++++++++++++++++++++++--------
 3 files changed, 106 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b9c8589..38f073f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -187,19 +187,6 @@ free_address(
 {
 	a_list_t	*aentry;
 
-#ifdef CONFIG_XEN
-	/*
-	 * Xen needs to be able to make sure it can get an exclusive
-	 * RO mapping of pages it wants to turn into a pagetable.  If
-	 * a newly allocated page is also still being vmap()ed by xfs,
-	 * it will cause pagetable construction to fail.  This is a
-	 * quick workaround to always eagerly unmap pages so that Xen
-	 * is happy.
-	 */
-	vunmap(addr);
-	return;
-#endif
-
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {
 		spin_lock(&as_lock);
@@ -209,7 +196,7 @@ #endif
 		as_list_len++;
 		spin_unlock(&as_lock);
 	} else {
-		vunmap(addr);
+		vunmap_pages(addr);
 	}
 }
 
@@ -228,7 +215,7 @@ purge_addresses(void)
 	spin_unlock(&as_lock);
 
 	while ((old = aentry) != NULL) {
-		vunmap(aentry->vm_addr);
+		vunmap_pages(aentry->vm_addr);
 		aentry = aentry->next;
 		kfree(old);
 	}
@@ -458,8 +445,8 @@ _xfs_buf_map_pages(
 	} else if (flags & ...
From: Jeremy Fitzhardinge
Date: Wednesday, October 24, 2007 - 3:46 pm

Not quite:

  CC      mm/vmalloc.o
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c: In function 'vm_area_alloc_pagearray':
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: 'GFP_LEVEL_MASK' undeclared (first use in this function)
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: (Each undeclared identifier is reported only once
/home/jeremy/hg/xen/paravirt/linux/mm/vmalloc.c:338: error: for each function it appears in.)
make[3]: *** [mm/vmalloc.o] Error 1
make[2]: *** [mm] Error 2
make[2]: *** Waiting for unfinished jobs....


GFP_RECLAM_MASK now?

J
-

From: David Chinner
Date: Wednesday, October 24, 2007 - 4:21 pm

Yeah, it is. Not sure what happened there - I did make that change.
new diff below.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b9c8589..38f073f 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -187,19 +187,6 @@ free_address(
 {
 	a_list_t	*aentry;
 
-#ifdef CONFIG_XEN
-	/*
-	 * Xen needs to be able to make sure it can get an exclusive
-	 * RO mapping of pages it wants to turn into a pagetable.  If
-	 * a newly allocated page is also still being vmap()ed by xfs,
-	 * it will cause pagetable construction to fail.  This is a
-	 * quick workaround to always eagerly unmap pages so that Xen
-	 * is happy.
-	 */
-	vunmap(addr);
-	return;
-#endif
-
 	aentry = kmalloc(sizeof(a_list_t), GFP_NOWAIT);
 	if (likely(aentry)) {
 		spin_lock(&as_lock);
@@ -209,7 +196,7 @@ #endif
 		as_list_len++;
 		spin_unlock(&as_lock);
 	} else {
-		vunmap(addr);
+		vunmap_pages(addr);
 	}
 }
 
@@ -228,7 +215,7 @@ purge_addresses(void)
 	spin_unlock(&as_lock);
 
 	while ((old = aentry) != NULL) {
-		vunmap(aentry->vm_addr);
+		vunmap_pages(aentry->vm_addr);
 		aentry = aentry->next;
 		kfree(old);
 	}
@@ -458,8 +445,8 @@ _xfs_buf_map_pages(
 	} else if (flags & XBF_MAPPED) {
 		if (as_list_len > 64)
 			purge_addresses();
-		bp->b_addr = vmap(bp->b_pages, bp->b_page_count,
-					VM_MAP, PAGE_KERNEL);
+		bp->b_addr = vmap_pages(bp->b_pages, bp->b_page_count,
+					VM_MAP, PAGE_KERNEL, xb_to_gfp(flags));
 		if (unlikely(bp->b_addr == NULL))
 			return -ENOMEM;
 		bp->b_addr += bp->b_offset;
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 89338b4..40c34da 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -51,6 +51,10 @@ extern void *vmap(struct page **pages, u
 			unsigned long flags, pgprot_t prot);
 extern void vunmap(void *addr);
 
+extern void *vmap_pages(struct page **pages, ...
From: Andi Kleen
Date: Tuesday, October 23, 2007 - 2:28 am

The page tables contain pointers to the pages anyways. vunmap() has to walk 
them.  It would not be very difficult to store them in an array during

Possible. Normally vmalloc pages should not be in the LRU except yours

You could also just keep the array from map time around yourself. 
Then you could do it yourself.

-Andi
-

From: David Chinner
Date: Sunday, October 14, 2007 - 4:33 pm

Every vunmap() cal causes a global TLB sync, and the region lists
are globl with a spin lock protecting them. I thin kNick has shown
a 64p altix with ~60 cpus spinning on the vmap locks under a

A directory traversal when using large directory block sizes

The pages are still attached to the blockdev address space mapping,
but there's nothing stopping them from being reclaimed before they are

It's all "freed memory". At the time we pull the buffer down, there are
no further references to the buffer. the pages are released and the mapping
is never used again until it is torn down. it is torn down either on the
next xfsbufd run (either memory pressure or every 15s) or every 64th

We've already got a memroy shrinker hook that triggers this reclaim.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Andi Kleen
Date: Monday, October 15, 2007 - 2:36 am

You're keeping vmaps around for already freed pages?

That will be a big problem for proper PAT support, which needs
to track all mappings to memory. It's not just a problem for Xen.

In fact I suspect it is already broken with DRM or AGP for example which
can use UC and WC mappings -- if you keep the mapping around and
DRM or AGP turns the page in another mapping uncacheable you're
creating an illegal cache attribute alias. These are known to occasionally
create cache corruptions on several x86s; giving ___VERY___ hard to debug
bugs once a blue moon.

Probably it'll require some generic VM batching mechanism where
Xen or PAT code can hook into the list or force unmap the mappings
as needed.

Definitely needs to be fixed if true. You're lucky that Xen caught it
in time.

-Andi
-

From: Nick Piggin
Date: Monday, October 15, 2007 - 7:56 am

Is this true even if you don't write through those old mappings?
Is DRM or AGP then not also broken with lazy highmem flushing, or


I've been thinking that a simple debug mode could be a good idea.
Have a new field in the struct page to count the number of possible
unflushed mappings to the page so that things like PAT could go
BUG_ON appropriate conditions. Would that be helpful?
-

From: Andi Kleen
Date: Monday, October 15, 2007 - 4:07 am

I think it happened for reads too.  It is a little counter intuitive
because in theory the CPU doesn't need to write back non dirty lines,
but in the one case which took so long to debug exactly this happened
somehow.

At it is undefined for reads and writes in the architecture so 
better be safe than sorry.

And x86 CPUs are out of order and do speculative executation
and that can lead to arbitary memory accesses even if the code
never touches an particular address.

Newer Intel CPUs have something called self-snoop which was supposed
to handle this; but in some situations it doesn't seem to catch it

AGP doesn't allocate highmem pages.  Not sure about the DRM code.

-Andi
-

From: Nick Piggin
Date: Monday, October 15, 2007 - 4:28 am

Fair enough, so we have to have this lazy tlb flush hook for
Xen/PAT/etc. I don't think it should be much problem to

Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).
-

From: Andi Kleen
Date: Monday, October 15, 2007 - 5:54 am

> Hmm, OK. It looks like DRM vmallocs memory (which gives highmem).

I meant I'm not sure if it uses that memory uncached. I admit
not quite understanding that code. There used to be at least
one place where it set UC for an user mapping though.

-Andi
-

From: Dave Airlie
Date: Sunday, October 21, 2007 - 5:17 am

Currently the only DRM memory handed to userspace is vmalloc_32 in drm_scatter.c

I notice the VIA driver does its own vmalloc for dmablit, so it may
have an issue with this if highmem is involved.

This will change with the upcoming memory manager so I'll need to
investigate it a bit perhaps...

Dave.
-

From: Benjamin Herrenschmidt
Date: Sunday, October 21, 2007 - 3:16 pm

The problem exist also on ppc, and afaik, is due to the line being in
the cache at all (either dirty (write) or not (read)), thus causing the
snoop logic to hit, that is what's causing the problem vs. non cached
accesses.

Also, on some processors, the simple fact of having the page mapped can
cause the CPU to prefetch from it even if it's not actually accessed
(speculative prefetch can cross page boundaries if things are mapped).

Ben.


-

From: Andi Kleen
Date: Monday, October 22, 2007 - 2:49 am

Exactly that happens on x86. Normally prefetches stop on TLB miss,
but the CPU can do speculative TLB fetches too.

Also even without any prefetching the CPU does speculative execution
and that can lead to random addresses being followed.

-Andi
-

Previous thread: [PATCH] Discardable strings for init and exit sections by Maciej W. Rozycki on Friday, October 12, 2007 - 9:50 am. (10 messages)

Next thread: [2.6.24 PATCH 02/25] dm io:ctl use constant struct size by Alasdair G Kergon on Friday, October 12, 2007 - 10:05 am. (5 messages)