On 08.05.2008 [18:51:11 +0200], Hans Rosenfeld wrote:
Right, I agree *if* we allow huge pages in the walker. But AIUI, things
are broken now with hugepages in the process' address space. This is a
bug upstream and leads to hugepages leaking out of the kernel when
/proc/pid/pagemap is read. Why not, instead (as a short-term fix), skip
hugepage mappings altogether in the page-walker code?
Hrm, upon further investigation, this seems to be a pretty clear
limitation of walk_page_range(). One that is avoided in the other two
callers, i.e.
static int show_smap(struct seq_file *m, void *v)
{
...
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&smaps_walk, &mss);
...
}
static ssize_t clear_refs_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
...
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
...
}
No such protection exists for
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos);
So, is there any way to either add a is_vm_hugetlb_page(vma) check into
pagemap_read()? Or can we modify walk_page_range to take the a vma and
skip the walking if is_vm_hugetlb_page(vma) is set [to avoid
complications down the road until hugepage walking is fixed]. I guess
the latter isn't possible for pagemap_read(), since we are just looking
at arbitrary addresses in the process space?
Dunno, seems quite clear that the bug is in pagemap_read(), not any
hugepage code, and that the simplest fix is to make pagemap_read() do
what the other walker-callers do, and skip hugepage regions.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
--