On Tue, 23 Nov 2010 15:55:31 +0100
Robert Święcki <robert@swiecki.net> wrote:
At a guess I'd say that another thread came in and established a
mapping against a page in the to-be-truncated range while
vmtruncate_range() was working on it. In fact I'd be suspecting that
the mapping was established after truncate_inode_page() ran its
page_mapped() test.
Let's take a look at vmtruncate_range():
int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
{
	struct address_space *mapping = inode->i_mapping;

	/*
	 * If the underlying filesystem is not going to provide
	 * a way to truncate a range of blocks (punch a hole) -
	 * we should return failure right now.
	 */
	if (!inode->i_op->truncate_range)
		return -ENOSYS;

	mutex_lock(&inode->i_mutex);
	down_write(&inode->i_alloc_sem);
	unmap_mapping_range(mapping, offset, (end - offset), 1);
	truncate_inode_pages_range(mapping, offset, end);
	unmap_mapping_range(mapping, offset, (end - offset), 1);
	inode->i_op->truncate_range(inode, offset, end);
	up_write(&inode->i_alloc_sem);
	mutex_unlock(&inode->i_mutex);

	return 0;
}
Now, why does it call unmap_mapping_range() twice?
Nick's original 2007 patch d00806b183152af6d2 ("mm: fix fault vs
invalidate race for linear mappings") added the second
unmap_mapping_range() call, along with this nice comment, which
explains it all:
+	/*
+	 * unmap_mapping_range is called twice, first simply for efficiency
+	 * so that truncate_inode_pages does fewer single-page unmaps. However
+	 * after this first call, and before truncate_inode_pages finishes,
+	 * it is possible for private pages to be COWed, which remain after
+	 * truncate_inode_pages finishes, hence the second unmap_mapping_range
+	 * call must be made for correctness.
+	 */
Later, some twirp deleted the damn comment. Why'd we do that? It
still seems to be valid.
If this _is_ still valid, and the first call to unmap_mapping_range() is
really just a best-effort performance thing which won't reliably clear
all the mappings then perhaps the BUG_ON(page_mapped(page)) assertion
in __remove_from_page_cache() is simply bogus.
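
To make the suspected window concrete, here is a tiny user-space toy model of the ordering. It is only a sketch of the guess above, not kernel code: struct toy_page, toy_fault() and the other toy_* names are made up for illustration, the "racing fault" is injected by hand at the point where I suspect it lands, and the model ignores page locking entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a page-cache page; NOT the kernel's struct page. */
struct toy_page {
	int mapcount;	/* number of simulated user mappings */
	bool in_cache;	/* still present in the simulated page cache */
};

static bool toy_page_mapped(const struct toy_page *p)
{
	return p->mapcount > 0;
}

/* Simulated unmap: drop every mapping of the page. */
static void toy_unmap(struct toy_page *p)
{
	p->mapcount = 0;
}

/* Simulated fault from another thread: maps the page if it is cached. */
static void toy_fault(struct toy_page *p)
{
	if (p->in_cache)
		p->mapcount++;
}

/*
 * Simulated per-page truncate step. Returns true if the page was still
 * mapped at removal time, i.e. where a BUG_ON(page_mapped(page)) style
 * assertion would have fired.
 */
static bool toy_truncate_page(struct toy_page *p, bool racing_fault)
{
	bool would_bug;

	assert(p->in_cache);

	if (toy_page_mapped(p))		/* the page_mapped() test */
		toy_unmap(p);

	if (racing_fault)		/* hypothetical window after the test */
		toy_fault(p);

	would_bug = toy_page_mapped(p);	/* state seen at removal time */
	p->in_cache = false;
	return would_bug;
}

/* Same ordering as vmtruncate_range(): unmap, truncate, unmap again. */
bool toy_truncate_range(struct toy_page *p, bool racing_fault)
{
	bool would_bug;

	toy_unmap(p);				/* first pass: best effort */
	would_bug = toy_truncate_page(p, racing_fault);
	toy_unmap(p);				/* second pass: cleans up */
	return would_bug;
}
```

With racing_fault set, toy_truncate_range() reports that the assertion would have tripped, yet the page ends up unmapped anyway because the second pass cleans up. In this model, at least, the second call restores correctness and the assertion fires on a state that was about to be fixed.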
We don't appear to have mmap_sem coverage around here, perhaps for
lock-ordering reasons. I suspect we'll be struggling to plug all holes
here without that coverage.
Fortunately the comment over madvise_remove() says it's tmpfs-only, so
we can blame Hugh :)
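
For anyone wanting to poke at this path from userspace: in kernels of this vintage madvise(MADV_REMOVE) goes through madvise_remove() and from there into vmtruncate_range(). Below is a minimal sketch, assuming a Linux system where MAP_SHARED | MAP_ANONYMOUS memory is shmem/tmpfs-backed. punch_hole_demo() is a made-up name, and this is a single-threaded illustration of the call, not a reproducer for the race.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Punch a hole in the middle of a shared anonymous mapping.
 * Returns 1 if the hole reads back as zeros and the pages on either
 * side are untouched, 0 if not, and -1 if sysconf(), mmap() or
 * madvise() failed (e.g. MADV_REMOVE is not supported here).
 */
int punch_hole_demo(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t page, len, i;
	unsigned char *buf;
	int result = -1;

	if (pagesz <= 0)
		return -1;
	page = (size_t)pagesz;
	len = 4 * page;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xAB, len);		/* dirty all four pages */

	/* Remove the two middle pages and their backing store. */
	if (madvise(buf + page, 2 * page, MADV_REMOVE) == 0) {
		result = 1;
		for (i = page; i < 3 * page; i++)
			if (buf[i] != 0)
				result = 0;
		if (buf[0] != 0xAB || buf[3 * page] != 0xAB)
			result = 0;
	}

	munmap(buf, len);
	return result;
}
```

To get anywhere near the reported BUG, a second thread would have to keep faulting on the middle pages while the madvise() call is in flight. That part is deliberately left out here.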
hm, I found the lost comment. It somehow wandered over into
truncate_pagecache(), but is still relevant at the vmtruncate_range()
site.