Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk

Previous thread: [PATCH v2 0/11] Uprobes patches. by Srikar Dronamraju on Wednesday, March 31, 2010 - 8:51 am. (46 messages)

Next thread: Re: [PATCH] drivers:staging: sources for ST core by Pavan Savoy on Wednesday, March 31, 2010 - 11:02 am. (2 messages)
From: San Mehat
Date: Wednesday, March 31, 2010 - 10:23 am

If the mmap_sem is not held while we walk_page_range(), then
it is possible for find_vma() to race with a remove_vma_list()
caused by do_munmap() (or others).

Unable to handle kernel paging request at virtual address 6b6b6b5b
Internal error: Oops: 5 [#1] PREEMPT
CPU: 0    Not tainted  (2.6.32.9-27154-ge3e6e27 #1)
PC is at find_vma+0x40/0x7c
LR is at walk_page_range+0x70/0x230
pc : [<c00aa3ac>]    lr : [<c00b298c>]    psr: 20000013
sp : c6aa9eb8  ip : 6b6b6b53  fp : c6a58f60
r10: c7e1d1b8  r9 : 0001bca0  r8 : 47000000
r7 : c6aa9f80  r6 : c6aa8000  r5 : 46fbd000  r4 : 6b6b6b6b
r3 : c7ca4820  r2 : 6b6b6b6b  r1 : 46fbd000  r0 : c70e3e40
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 10c5787d  Table: 26574019  DAC: 00000015

[<c00aa3ac>] (find_vma+0x40/0x7c) from [<c00b298c>] (walk_page_range+0x70/0x230)
[<c00b298c>] (walk_page_range+0x70/0x230) from [<c00f5d3c>] (pagemap_read+0x1a4/0x278)
[<c00f5d3c>] (pagemap_read+0x1a4/0x278) from [<c00bac40>] (vfs_read+0xa8/0x150)
[<c00bac40>] (vfs_read+0xa8/0x150) from [<c00bad94>] (sys_read+0x3c/0x68)
[<c00bad94>] (sys_read+0x3c/0x68) from [<c0026f00>] (ret_fast_syscall+0x0/0x2c)
Code: 98bd8010 e5932004 e3a00000 ea000008 (e5124010)

Signed-off-by: San Mehat <san@google.com>
Cc: Brian Swetland <swetland@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/proc/task_mmu.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2a1bef9..3f300c1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -726,8 +726,6 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 	down_read(&current->mm->mmap_sem);
 	ret = get_user_pages(current, current->mm, uaddr, pagecount,
 			     1, 0, pages, NULL);
-	up_read(&current->mm->mmap_sem);
-
 	if (ret < 0)
 		goto out_free;
 
@@ -776,6 +774,7 @@ ...
From: Linus Torvalds
Date: Wednesday, March 31, 2010 - 10:54 am

I think you've found a bug, but I also look at that code and say "that's 
just totally insane".

Why does it do that initial "get_user_pages()" at all? It never _uses_ 
that 'pages' array except to mark the pages dirty, but that's insane, 
since as far as I can see the way it actually dirties the pages in 
question is by doing a regular "put_user(pfn, pm->out);". And that will 
dirty the pages in hardware (or put_user).

Also, I get the feeling that the _reason_ it is not doing that down_read() 
is that it would dead-lock the whole system, exactly on that "put_user()", 
if somebody else did a down_write() in another thread. In that case you 
have:

	thread#1		thread#2
	--------		--------

	down_read()
	...
				down_write() - blocks
	...
	put_user();
	 .. page fault ..
	  down_read();	 **DEADLOCK **


because our down_read() tries to be fair to the down_write().

So I think your patch would just create _different_ trouble.

I get the _feeling_ that the whole point of that 'pages' array was to not 
do that put_user() at all, but write to the physical pages through that 
array. But the code looks totally buggy.

I would seriously suggest that we consider removing the 'pagemap' 
interface. The way that code looks, it's just broken.

Matt - give me a reason (which includes either a patch to fix this sh*t up 
or telling me why I'm wrong, but _also_ includes a real independent reason 
to keep that thing around regardless) to not remove it all.

The whole notion seems to be utterly misdesigned.

			Linus
--

From: Matt Mackall
Date: Wednesday, March 31, 2010 - 2:40 pm

Linus, I must say your charm has really worn thin. I've just stuck a
post-it on my monitor saying "don't be Linus" to remind me not to be
rude to my contributors. 

If I recall correctly, the get_user_pages is there to force all the
output pages to be guaranteed present before we later fill them so that
the output needn't be double-buffered and the locking and recursion on
the page tables is significantly simpler and faster. put_user is then
actually easier than "writing to the physical pages through the array".

You're right that the SetPageDirty() at cleanup is redundant (but
harmless), and I probably copied that pattern from another user of
get_user_pages.

Earlier versions of the pagewalk code studiously avoided calling
find_vma and only looked at the page tables (with either caller doing
locking or accepting the non-atomicity) to avoid the sort of issue that
San has run into, but it looks like I forgot about that and let it sneak
back in when other folks added more hugepage support.

The deeper problem is that hugepages have various nasty layering
violations associated with them like being tied to vmas so until some
hugepage genius shows up and tells us how to do this cleanly, we'll
probably have to deal with the associated mmap_sem.

San, your patch is touching the wrong mm_sem, I think. The interesting
races are going to happen on the target mm, not current's (generally not
the same). Holding the mm_sem across the entire walk is also
prohibitive, it probably needs to be localized to individual find_vma
calls.

-- 
http://selenic.com : development and support for Mercurial and Linux


--

From: Linus Torvalds
Date: Wednesday, March 31, 2010 - 6:33 pm

You didn't actually answer the problem, though.

I'm rude, because I think the code is buggy. I pointed out how, and why I 

Umm. Why would get_user_pages() guarantee that the pages don't get swapped 
out in the meantime, and you end up with a page fault _anyway_?

Yes, it makes those page faults much rarer, but that just makes things 
much worse. Now you have a nasty subtle fundamental bug that is hard to 
trigger too.

NOTE! The fact that we mark the page dirty (and increment the page count) 
in get_user_pages() (not in the later loop, which does indeed look 
pointless) should mean that the page doesn't go away physically, and if it 
gets mapped out the same page will be mapped back in on access. That's how 
things like direct-IO work (modulo a concurrent fork(), when they don't 
work, but that's a separate issue).

But the problem with the deadlock is not that "same physical page" issue, 
it's simply the problem of the page fault code itself wanting to do a 
down_read() on the mmap_sem. So the fact that the physical page is 
reliable in the array doesn't actually solve the bug I was pointing out.

See? 

So the

	nr = get_user_pages(.. 1, 0, ..)

	...

	for_each_page()
		mark_it_dirty();

pattern is a valid pattern, but it is _only_ valid if you then do the 
write to the physical page you looked up. If you do the accesses through 

Fine. So then you'd do get_user_pages() and do _nothing_ with the array? 
Please explain what the point of that is, again?

See above: there are valid reasons for things like direct-IO to do that 
exact "get_user_pages()" pattern - they mark the pages dirty both early 
_and_ late, and both page dirtying is required:

 - the first one (done by get_user_pages() itself) is to make sure that an
   anonymous page doesn't get switched around to another anonymous page 
   that just happens to have the same contents (ie if the page gets 
   swapped out, the swapout code will turn it into a swap-cache page).

   Once it's a swap-out ...
From: KOSAKI Motohiro
Date: Wednesday, March 31, 2010 - 7:10 pm

Right. To increment page->count only affect vmscan to prevent free page.
Not prevent pageout I/O nor page unmapping from a process.

Thanks.


--

From: Matt Mackall
Date: Wednesday, March 31, 2010 - 8:20 pm

And what does that achieve? I've got plenty of other work I could be

Yes, I was muddling the distinction between pinned in page cache and
pinned in the mm, and you've just now re-clarified it for me. So I'll

That'd actually take us back to where it was when it hit mainline, which
would make a lot of people unhappy. I wouldn't be one of them as there
thankfully aren't any huge pages in my world. But I'm convinced
put_user() must go. In which case, get_user_pages() stays, and I've got
to switch things to direct physical page access into that array. 

Even if I fix that, I believe San's original bug can still be triggered
though, as all the new callers to find_vma are run outside of the
target's mm_sem. Fixing that should be reasonably straight-forward.

-- 
http://selenic.com : development and support for Mercurial and Linux


--

From: Linus Torvalds
Date: Wednesday, March 31, 2010 - 9:27 pm

I would suggest you go back and read my original email once more, now that 
you realize that you had simply not understood the difference between 
physical page pinning and virtual page pinning.

Seriously.

Now that you understand why I called the code buggy, maybe you realize 
that calling the code "insane and misdesigned" is actually not overly 
rude: it's just an accurate representation of the state of the code.

And if you read the mail once more, you'll also notice that every single 
derogatory remark was about the _code_, not you.

Oh, and I did ask you for an explanation for why we shouldn't just remove 
it. There can't be all that many users. 

Because quite frankly, if you apparently want to keep the vma around, the 
code is going to get way more complex and ugly. You may be able to avoid 
some of the _worst_ crap if you require that user pointers have to always 
be u64-aligned. Yes, that's a very ugly and non-intuitive requirement for 
a read() interface, but probably better than the alternative.

Or maybe just do the double buffering, and limiting pagemap reads to 
fairly small chunks at a time.

		Linus
--

From: KOSAKI Motohiro
Date: Wednesday, March 31, 2010 - 10:54 pm

no. direct physical page access for /proc is completely wrong idea, i think.
please imazine the caller process is multi threaded and it use fork case, 

example scenario)
1. the parent process has thread-A and thread-B.
2. thread-A call read_pagemap
3. read_pagemap grab the page-C
3. at the same time, thread-B call fork(), now page-C pointed from two process.
4. thread-B touch page-C, cow occur, then the parent process has cowed page (page-C')
   and the child process has original page-C.
5. thread-A write page-C by physical page access, then the child page is
   modified, instead parent one.

I just recommend simply do double buffering.




--

From: KAMEZAWA Hiroyuki
Date: Wednesday, March 31, 2010 - 10:55 pm

On Thu,  1 Apr 2010 14:54:43 +0900 (JST)
Like this ?
CC'ed Horiguchi, he touched hugepage part of this code recently.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In initial design, walk_page_range() was designed just for walking page table and
it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range()
and we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get
rid of it to do sane fix.

Reported-by: San Mehat <san@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c |   89 ++++++++++++++++++++++-------------------------------
 1 file changed, 38 insertions(+), 51 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,8 +406,11 @@ static int show_smap(struct seq_file *m,
 
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
-	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
+	if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+		down_read(&vma->vm_mm->mmap_sem);
 		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
+		up_read(&vma->vm_mm->mmap_sem);
+	}
 
 	show_map_vma(m, vma);
 
@@ -552,7 +555,8 @@ const struct file_operations proc_clear_
 };
 
 struct pagemapread {
-	u64 __user *out, *end;
+	int pos, len;
+	u64 *buffer;
 };
 
 #define PM_ENTRY_BYTES      sizeof(u64)
@@ -575,10 +579,8 @@ struct pagemapread {
 static int add_to_pagemap(unsigned long addr, u64 pfn,
 			  struct pagemapread *pm)
 {
-	if (put_user(pfn, pm->out))
-		return -EFAULT;
-	pm->out++;
-	if (pm->out >= pm->end)
+	pm->buffer[pm->pos++] = pfn;
+	if (pm->pos >= pm->len)
 		return PM_END_OF_BUFFER;
 	return 0;
 }
@@ -720,21 +722,20 @@ static int pagemap_hugetlb_range(pte_t *
  * determine which areas of memory are actually mapped and ...
From: KOSAKI Motohiro
Date: Wednesday, March 31, 2010 - 11:05 pm

In /proc interface, GFP_TEMPORARY is recommened rather than GFP_KERNEL.



--

From: KAMEZAWA Hiroyuki
Date: Wednesday, March 31, 2010 - 11:09 pm

On Thu,  1 Apr 2010 15:05:38 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
Right.
Ok.

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In initial design, walk_page_range() was designed just for walking page table and
it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range()
and we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get
rid of it to do sane fix.

Changelog:
 - removed unnecessary change in smaps.
 - use GFP_TEMPORARY instead of GFP_KERNEL

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c |   86 ++++++++++++++++++++++-------------------------------
 1 file changed, 36 insertions(+), 50 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,
 
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
+	/* mmap_sem is held in m_start */
 	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
 		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
 
@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
 };
 
 struct pagemapread {
-	u64 __user *out, *end;
+	int pos, len;
+	u64 *buffer;
 };
 
 #define PM_ENTRY_BYTES      sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
 static int add_to_pagemap(unsigned long addr, u64 pfn,
 			  struct pagemapread *pm)
 {
-	if (put_user(pfn, pm->out))
-		return -EFAULT;
-	pm->out++;
-	if (pm->out >= pm->end)
+	pm->buffer[pm->pos++] = pfn;
+	if (pm->pos >= pm->len)
 		return PM_END_OF_BUFFER;
 	return 0;
 }
@@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t *
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  */
+#define ...
From: KAMEZAWA Hiroyuki
Date: Wednesday, March 31, 2010 - 11:34 pm

On Thu, 1 Apr 2010 15:09:56 +0900

Sigh...this is bad..

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In initial design, walk_page_range() was designed just for walking page table and
it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range()
and we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get
rid of it to do sane fix.

Changelog:
 - fixed start_vaddr calculation
 - removed unnecessary cast.
 - removed unnecessary change in smaps.
 - use GFP_TEMPORARY instead of GFP_KERNEL
 - use min().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c |   84 +++++++++++++++++++++--------------------------------
 1 file changed, 34 insertions(+), 50 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,
 
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
+	/* mmap_sem is held in m_start */
 	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
 		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
 
@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
 };
 
 struct pagemapread {
-	u64 __user *out, *end;
+	int pos, len;
+	u64 *buffer;
 };
 
 #define PM_ENTRY_BYTES      sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
 static int add_to_pagemap(unsigned long addr, u64 pfn,
 			  struct pagemapread *pm)
 {
-	if (put_user(pfn, pm->out))
-		return -EFAULT;
-	pm->out++;
-	if (pm->out >= pm->end)
+	pm->buffer[pm->pos++] = pfn;
+	if (pm->pos >= pm->len)
 		return PM_END_OF_BUFFER;
 	return 0;
 }
@@ -720,21 +720,20 @@ static int pagemap_hugetlb_range(pte_t *
  * determine which areas of memory are actually mapped and llseek to
  * skip over unmapped regions.
  ...
From: Matt Mackall
Date: Thursday, April 1, 2010 - 12:09 am

This looks pretty reasonable. However, it also looks very similar to my
first version of pagemap (which started with double-buffering). It's
going to need re-testing to make sure it hasn't reintroduced any
wrapping, alignment, or off-by-one bugs that have already been ironed
out once or twice.

-- 
http://selenic.com : development and support for Mercurial and Linux


--

From: KOSAKI Motohiro
Date: Thursday, April 1, 2010 - 12:21 am

OK, I'm looking for your test result. :)

Thanks matt for your contribution!


--

From: Linus Torvalds
Date: Thursday, April 1, 2010 - 8:10 am

Looks mostly correct to me (but just looking at the source, no testing, 
obviously). And I like how the double buffering removes more lines of code 
than it adds.


I think "start_vaddr + PAGEMAP_WALK_SIZE" might overflow, and then 'end' 
ends up being odd. You'll never notice on architectures where the user 
space doesn't go all the way up to the end (walk_page_range will return 0 
etc), but it will do the wrong thing if 'start' is close to the end, end 
is _at_ the end, and you'll not be able to read that range (because of the 
overflow).

So I do think you should do something like

	end = start_vaddr + PAGEMAP_WALK_SIZE;
	/* overflow?  or final chunk? */
	if (end < start_vaddr || end > end_vaddr)
		end = end_vaddr;

instead of using 'min()'.

(This only matters if TASK_SIZE_OF() can be ~0ul, but I think that can 
happen on sparc, for example)

			Linus
--

From: KAMEZAWA Hiroyuki
Date: Thursday, April 1, 2010 - 5:11 pm

On Thu, 1 Apr 2010 08:10:40 -0700 (PDT)
Ok, here. now
	end = start_vaddr _ PAGEMAP_WALK_SIZE;
	if (end < start_vaddr || end > end_vaddr)
		end = end_vaddr;
	....walk....
	start_vaddr =end;

Only tested on x86-64. 
Thanks,
-Kame

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In initial design, walk_page_range() was designed just for walking page table and
it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range()
and we need mmap_sem around it.

This patch adds mmap_sem around walk_page_range().

Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get
rid of it to do sane fix.

Changelog: 2010/Apr/2
 - fixed start_vaddr and end overflow
Changelog: 2010/Apr/1
 - fixed start_vaddr calculation
 - removed unnecessary cast.
 - removed unnecessary change in smaps.
 - use GFP_TEMPORARY instead of GFP_KERNEL

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/proc/task_mmu.c |   87 ++++++++++++++++++++++-------------------------------
 1 file changed, 37 insertions(+), 50 deletions(-)

Index: linux-2.6.34-rc3/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.34-rc3.orig/fs/proc/task_mmu.c
+++ linux-2.6.34-rc3/fs/proc/task_mmu.c
@@ -406,6 +406,7 @@ static int show_smap(struct seq_file *m,
 
 	memset(&mss, 0, sizeof mss);
 	mss.vma = vma;
+	/* mmap_sem is held in m_start */
 	if (vma->vm_mm && !is_vm_hugetlb_page(vma))
 		walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
 
@@ -552,7 +553,8 @@ const struct file_operations proc_clear_
 };
 
 struct pagemapread {
-	u64 __user *out, *end;
+	int pos, len;
+	u64 *buffer;
 };
 
 #define PM_ENTRY_BYTES      sizeof(u64)
@@ -575,10 +577,8 @@ struct pagemapread {
 static int add_to_pagemap(unsigned long addr, u64 pfn,
 			  struct pagemapread *pm)
 {
-	if (put_user(pfn, pm->out))
-		return -EFAULT;
-	pm->out++;
-	if (pm->out >= pm->end)
+	pm->buffer[pm->pos++] = pfn;
+	if ...
From: Matt Mackall
Date: Friday, April 2, 2010 - 7:30 am

Looks like this gets the wrong return code?

-- 
Mathematics is the supreme nostalgia of our time.
--

From: KAMEZAWA Hiroyuki
Date: Monday, April 5, 2010 - 11:48 pm

On Fri, 2 Apr 2010 09:30:58 -0500
I'm sorry. And thank you for pointing out.
I confirmed merged one has fixed code.

Regards,
-Kame

--

Previous thread: [PATCH v2 0/11] Uprobes patches. by Srikar Dronamraju on Wednesday, March 31, 2010 - 8:51 am. (46 messages)

Next thread: Re: [PATCH] drivers:staging: sources for ST core by Pavan Savoy on Wednesday, March 31, 2010 - 11:02 am. (2 messages)