If the mmap_sem is not held while we walk_page_range(), then it is possible for find_vma() to race with a remove_vma_list() caused by do_munmap() (or others). Unable to handle kernel paging request at virtual address 6b6b6b5b Internal error: Oops: 5 [#1] PREEMPT CPU: 0 Not tainted (2.6.32.9-27154-ge3e6e27 #1) PC is at find_vma+0x40/0x7c LR is at walk_page_range+0x70/0x230 pc : [<c00aa3ac>] lr : [<c00b298c>] psr: 20000013 sp : c6aa9eb8 ip : 6b6b6b53 fp : c6a58f60 r10: c7e1d1b8 r9 : 0001bca0 r8 : 47000000 r7 : c6aa9f80 r6 : c6aa8000 r5 : 46fbd000 r4 : 6b6b6b6b r3 : c7ca4820 r2 : 6b6b6b6b r1 : 46fbd000 r0 : c70e3e40 Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user Control: 10c5787d Table: 26574019 DAC: 00000015 [<c00aa3ac>] (find_vma+0x40/0x7c) from [<c00b298c>] (walk_page_range+0x70/0x230) [<c00b298c>] (walk_page_range+0x70/0x230) from [<c00f5d3c>] (pagemap_read+0x1a4/0x278) [<c00f5d3c>] (pagemap_read+0x1a4/0x278) from [<c00bac40>] (vfs_read+0xa8/0x150) [<c00bac40>] (vfs_read+0xa8/0x150) from [<c00bad94>] (sys_read+0x3c/0x68) [<c00bad94>] (sys_read+0x3c/0x68) from [<c0026f00>] (ret_fast_syscall+0x0/0x2c) Code: 98bd8010 e5932004 e3a00000 ea000008 (e5124010) Signed-off-by: San Mehat <san@google.com> Cc: Brian Swetland <swetland@google.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> --- fs/proc/task_mmu.c | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 2a1bef9..3f300c1 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -726,8 +726,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf, down_read(&current->mm->mmap_sem); ret = get_user_pages(current, current->mm, uaddr, pagecount, 1, 0, pages, NULL); - up_read(&current->mm->mmap_sem); - if (ret < 0) goto out_free; @@ -776,6 +774,7 @@ ...
I think you've found a bug, but I also look at that code and say "that's just totally insane". Why does it do that initial "get_user_pages()" at all? It never _uses_ that 'pages' array except to mark the pages dirty, but that's insane, since as far as I can see the way it actually dirties the pages in question is by doing a regular "put_user(pfn, pm->out);". And that will dirty the pages in hardware (or put_user). Also, I get the feeling that the _reason_ it is not doing that down_read() is that it would dead-lock the whole system, exactly on that "put_user()", if somebody else did a down_write() in another thread. In that case you have: thread#1 thread#2 -------- -------- down_read() ... down_write() - blocks ... put_user(); .. page fault .. down_read(); **DEADLOCK ** because our down_read() tries to be fair to the down_write(). So I think your patch would just create _different_ trouble. I get the _feeling_ that the whole point of that 'pages' array was to not do that put_user() at all, but write to the physical pages through that array. But the code looks totally buggy. I would seriously suggest that we consider removing the 'pagemap' interface. The way that code looks, it's just broken. Matt - give me a reason (which includes either a patch to fix this sh*t up or telling me why I'm wrong, but _also_ includes a real independent reason to keep that thing around regardless) to not remove it all. The whole notion seems to be utterly misdesigned. Linus --
Linus, I must say your charm has really worn thin. I've just stuck a post-it on my monitor saying "don't be Linus" to remind me not to be rude to my contributors. If I recall correctly, the get_user_pages is there to force all the output pages to be guaranteed present before we later fill them so that the output needn't be double-buffered and the locking and recursion on the page tables is significantly simpler and faster. put_user is then actually easier than "writing to the physical pages through the array". You're right that the SetPageDirty() at cleanup is redundant (but harmless), and I probably copied that pattern from another user of get_user_pages. Earlier versions of the pagewalk code studiously avoided calling find_vma and only looked at the page tables (with either caller doing locking or accepting the non-atomicity) to avoid the sort of issue that San has run into, but it looks like I forgot about that and let it sneak back in when other folks added more hugepage support. The deeper problem is that hugepages have various nasty layering violations associated with them like being tied to vmas so until some hugepage genius shows up and tells us how to do this cleanly, we'll probably have to deal with the associated mmap_sem. San, your patch is touching the wrong mm_sem, I think. The interesting races are going to happen on the target mm, not current's (generally not the same). Holding the mm_sem across the entire walk is also prohibitive, it probably needs to be localized to individual find_vma calls. -- http://selenic.com : development and support for Mercurial and Linux --
You didn't actually answer the problem, though. I'm rude, because I think the code is buggy. I pointed out how, and why I Umm. Why would get_user_pages() guarantee that the pages don't get swapped out in the meantime, and you end up with a page fault _anyway_? Yes, it makes those page faults much rarer, but that just makes things much worse. Now you have a nasty subtle fundamental bug that is hard to trigger too. NOTE! The fact that we mark the page dirty (and increment the page count) in get_user_pages() (not in the later loop, which does indeed look pointless) should mean that the page doesn't go away physically, and if it gets mapped out the same page will be mapped back in on access. That's how things like direct-IO work (modulo a concurrent fork(), when they don't work, but that's a separate issue). But the problem with the deadlock is not that "same physical page" issue, it's simply the problem of the page fault code itself wanting to do a down_read() on the mmap_sem. So the fact that the physical page is reliable in the array doesn't actually solve the bug I was pointing out. See? So the nr = get_user_pages(.. 1, 0, ..) ... for_each_page() mark_it_dirty(); pattern is a valid pattern, but it is _only_ valid if you then do the write to the physical page you looked up. If you do the accesses through Fine. So then you'd do get_user_pages() and do _nothing_ with the array? Please explain what the point of that is, again? See above: there are valid reasons for things like direct-IO to do that exact "get_user_pages()" pattern - they mark the pages dirty both early _and_ late, and both page dirtying is required: - the first one (done by get_user_pages() itself) is to make sure that an anonymous page doesn't get switched around to another anonymous page that just happens to have the same contents (ie if the page gets swapped out, the swapout code will turn it into a swap-cache page). Once it's a swap-out ...
Right. To increment page->count only affect vmscan to prevent free page. Not prevent pageout I/O nor page unmapping from a process. Thanks. --
And what does that achieve? I've got plenty of other work I could be Yes, I was muddling the distinction between pinned in page cache and pinned in the mm, and you've just now re-clarified it for me. So I'll That'd actually take us back to where it was when it hit mainline, which would make a lot of people unhappy. I wouldn't be one of them as there thankfully aren't any huge pages in my world. But I'm convinced put_user() must go. In which case, get_user_pages() stays, and I've got to switch things to direct physical page access into that array. Even if I fix that, I believe San's original bug can still be triggered though, as all the new callers to find_vma are run outside of the target's mm_sem. Fixing that should be reasonably straight-forward. -- http://selenic.com : development and support for Mercurial and Linux --
I would suggest you go back and read my original email once more, now that you realize that you had simply not understood the difference between physical page pinning and virtual page pinning. Seriously. Now that you understand why I called the code buggy, maybe you realize that calling the code "insane and misdesigned" is actually not overly rude: it's just an accurate representation of the state of the code. And if you read the mail once more, you'll also notice that every single derogatory remark was about the _code_, not you. Oh, and I did ask you for an explanation for why we shouldn't just remove it. There can't be all that many users. Because quite frankly, if you apparently want to keep the vma around, the code is going to get way more complex and ugly. You may be able to avoid some of the _worst_ crap if you require that user pointers have to always be u64-aligned. Yes, that's a very ugly and non-intuitive requirement for a read() interface, but probably better than the alternative. Or maybe just do the double buffering, and limiting pagemap reads to fairly small chunks at a time. Linus --
no. direct physical page access for /proc is completely wrong idea, i think. please imazine the caller process is multi threaded and it use fork case, example scenario) 1. the parent process has thread-A and thread-B. 2. thread-A call read_pagemap 3. read_pagemap grab the page-C 3. at the same time, thread-B call fork(), now page-C pointed from two process. 4. thread-B touch page-C, cow occur, then the parent process has cowed page (page-C') and the child process has original page-C. 5. thread-A write page-C by physical page access, then the child page is modified, instead parent one. I just recommend simply do double buffering. --
On Thu, 1 Apr 2010 14:54:43 +0900 (JST)
Like this ?
CC'ed Horiguchi, he touched hugepage part of this code recently.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
In initial design, walk_page_range() was designed just for walking page table and
it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range()
and we need mmap_sem around it.
This patch adds mmap_sem around walk_page_range().
Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get
rid of it to do sane fix.
Reported-by: San Mehat <san@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/proc/task_mmu.c | 89 ++++++++++++++++++++++-------------------------------
1 file changed, 38 insertions(+), 51 deletions(-)
Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,8 +406,11 @@ static int show_smap(struct seq_file *m,
memset(&mss, 0, sizeof mss);
mss.vma = vma;
- if (vma->vm_mm && !is_vm_hugetlb_page(vma))
+ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+ down_read(&vma->vm_mm->mmap_sem);
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
+ up_read(&vma->vm_mm->mmap_sem);
+ }
show_map_vma(m, vma);
@@ -552,7 +555,8 @@ const struct file_operations proc_clear_
};
struct pagemapread {
- u64 __user *out, *end;
+ int pos, len;
+ u64 *buffer;
};
#define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +579,8 @@ struct pagemapread {
static int add_to_pagemap(unsigned long addr, u64 pfn,
struct pagemapread *pm)
{
- if (put_user(pfn, pm->out))
- return -EFAULT;
- pm->out++;
- if (pm->out >= pm->end)
+ pm->buffer[pm->pos++] = pfn;
+ if (pm->pos >= pm->len)
return PM_END_OF_BUFFER;
return 0;
}
@@ -720,21 +722,20 @@ static int pagemap_hugetlb_range(pte_t *
* determine which areas of memory are actually mapped and ...In /proc interface, GFP_TEMPORARY is recommened rather than GFP_KERNEL. --
On Thu, 1 Apr 2010 15:05:38 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
Right.
Ok.
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
In initial design, walk_page_range() was designed just for walking page table and
it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range()
and we need mmap_sem around it.
This patch adds mmap_sem around walk_page_range().
Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get
rid of it to do sane fix.
Changelog:
- removed unnecessary change in smaps.
- use GFP_TEMPORARY instead of GFP_KERNEL
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/proc/task_mmu.c | 86 ++++++++++++++++++++++-------------------------------
1 file changed, 36 insertions(+), 50 deletions(-)
Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,
memset(&mss, 0, sizeof mss);
mss.vma = vma;
+ /* mmap_sem is held in m_start */
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
};
struct pagemapread {
- u64 __user *out, *end;
+ int pos, len;
+ u64 *buffer;
};
#define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
static int add_to_pagemap(unsigned long addr, u64 pfn,
struct pagemapread *pm)
{
- if (put_user(pfn, pm->out))
- return -EFAULT;
- pm->out++;
- if (pm->out >= pm->end)
+ pm->buffer[pm->pos++] = pfn;
+ if (pm->pos >= pm->len)
return PM_END_OF_BUFFER;
return 0;
}
@@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t *
* determine which areas of memory are actually mapped and llseek to
* skip over unmapped regions.
*/
+#define ...On Thu, 1 Apr 2010 15:09:56 +0900
Sigh...this is bad..
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
In initial design, walk_page_range() was designed just for walking page table and
it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range()
and we need mmap_sem around it.
This patch adds mmap_sem around walk_page_range().
Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get
rid of it to do sane fix.
Changelog:
- fixed start_vaddr calculation
- removed unnecessary cast.
- removed unnecessary change in smaps.
- use GFP_TEMPORARY instead of GFP_KERNEL
- use min().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/proc/task_mmu.c | 84 +++++++++++++++++++++--------------------------------
1 file changed, 34 insertions(+), 50 deletions(-)
Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,
memset(&mss, 0, sizeof mss);
mss.vma = vma;
+ /* mmap_sem is held in m_start */
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
};
struct pagemapread {
- u64 __user *out, *end;
+ int pos, len;
+ u64 *buffer;
};
#define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
static int add_to_pagemap(unsigned long addr, u64 pfn,
struct pagemapread *pm)
{
- if (put_user(pfn, pm->out))
- return -EFAULT;
- pm->out++;
- if (pm->out >= pm->end)
+ pm->buffer[pm->pos++] = pfn;
+ if (pm->pos >= pm->len)
return PM_END_OF_BUFFER;
return 0;
}
@@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t *
* determine which areas of memory are actually mapped and llseek to
* skip over unmapped regions.
...This looks pretty reasonable. However, it also looks very similar to my first version of pagemap (which started with double-buffering). It's going to need re-testing to make sure it hasn't reintroduced any wrapping, alignment, or off-by-one bugs that have already been ironed out once or twice. -- http://selenic.com : development and support for Mercurial and Linux --
OK, I'm looking for your test result. :) Thanks matt for your contribution! --
Looks mostly correct to me (but just looking at the source, no testing, obviously). And I like how the double buffering removes more lines of code than it adds. I think "start_vaddr + PAGEMAP_WALK_SIZE" might overflow, and then 'end' ends up being odd. You'll never notice on architectures where the user space doesn't go all the way up to the end (walk_page_range will return 0 etc), but it will do the wrong thing if 'start' is close to the end, end is _at_ the end, and you'll not be able to read that range (because of the overflow). So I do think you should do something like end = start_vaddr + PAGEMAP_WALK_SIZE; /* overflow? or final chunk? */ if (end < start_vaddr || end > end_vaddr) end = end_vaddr; instead of using 'min()'. (This only matters if TASK_SIZE_OF() can be ~0ul, but I think that can happen on sparc, for example) Linus --
On Thu, 1 Apr 2010 08:10:40 -0700 (PDT)
Ok, here. now
end = start_vaddr _ PAGEMAP_WALK_SIZE;
if (end < start_vaddr || end > end_vaddr)
end = end_vaddr;
....walk....
start_vaddr =end;
Only tested on x86-64.
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
In initial design, walk_page_range() was designed just for walking page table and
it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range()
and we need mmap_sem around it.
This patch adds mmap_sem around walk_page_range().
Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get
rid of it to do sane fix.
Changelog: 2010/Apr/2
- fixed start_vaddr and end overflow
Changelog: 2010/Apr/1
- fixed start_vaddr calculation
- removed unnecessary cast.
- removed unnecessary change in smaps.
- use GFP_TEMPORARY instead of GFP_KERNEL
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/proc/task_mmu.c | 87 ++++++++++++++++++++++-------------------------------
1 file changed, 37 insertions(+), 50 deletions(-)
Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,
memset(&mss, 0, sizeof mss);
mss.vma = vma;
+ /* mmap_sem is held in m_start */
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
};
struct pagemapread {
- u64 __user *out, *end;
+ int pos, len;
+ u64 *buffer;
};
#define PM_ENTRY_BYTES sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
static int add_to_pagemap(unsigned long addr, u64 pfn,
struct pagemapread *pm)
{
- if (put_user(pfn, pm->out))
- return -EFAULT;
- pm->out++;
- if (pm->out >= pm->end)
+ pm->buffer[pm->pos++] = pfn;
+ if ...Looks like this gets the wrong return code? -- Mathematics is the supreme nostalgia of our time. --
On Fri, 2 Apr 2010 09:30:58 -0500 I'm sorry. And thank you for pointing out. I confirmed merged one has fixed code. Regards, -Kame --
