Currently mlock() holds mmap_sem in exclusive mode while the pages get faulted in. In the case of a large mlock, this can potentially take a very long time, during which various commands such as 'ps auxw' will block. This makes sysadmins unhappy: real 14m36.232s user 0m0.003s sys 0m0.015s (output from 'time ps auxw' while a 20GB file was being mlocked without being previously preloaded into page cache) I propose that mlock() could release mmap_sem after the VM_LOCKED bits have been set in all appropriate VMAs. Then a second pass could be done to actually mlock the pages, in small batches, releasing mmap_sem when we block on disk access or when we detect some contention. Patches are against v2.6.37-rc4 plus my patches to avoid mlock dirtying (presently queued in -mm). Michel Lespinasse (6): mlock: only hold mmap_sem in shared mode when faulting in pages mm: add FOLL_MLOCK follow_page flag. mm: move VM_LOCKED check to __mlock_vma_pages_range() rwsem: implement rwsem_is_contended() mlock: do not hold mmap_sem for extended periods of time x86 rwsem: more precise rwsem_is_contended() implementation arch/alpha/include/asm/rwsem.h | 5 + arch/ia64/include/asm/rwsem.h | 5 + arch/powerpc/include/asm/rwsem.h | 5 + arch/s390/include/asm/rwsem.h | 5 + arch/sh/include/asm/rwsem.h | 5 + arch/sparc/include/asm/rwsem.h | 5 + arch/x86/include/asm/rwsem.h | 35 ++++++--- arch/x86/lib/rwsem_64.S | 4 +- arch/x86/lib/semaphore_32.S | 4 +- arch/xtensa/include/asm/rwsem.h | 5 + include/linux/mm.h | 1 + include/linux/rwsem-spinlock.h | 1 + lib/rwsem-spinlock.c | 12 +++ mm/internal.h | 3 +- mm/memory.c | 54 ++++++++++++-- mm/mlock.c | 150 ++++++++++++++++++-------------------- mm/nommu.c | 6 +- 17 files changed, 201 insertions(+), 104 deletions(-) -- ...
__get_user_pages gets a new 'nonblocking' parameter to signal that the
caller is prepared to re-acquire mmap_sem and retry the operation if needed.
This is used to split off long operations if they are going to block on
a disk transfer, or when we detect contention on the mmap_sem.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
mm/internal.h | 3 ++-
mm/memory.c | 27 ++++++++++++++++++++++-----
mm/mlock.c | 34 ++++++++++++++++++++--------------
mm/nommu.c | 6 ++++--
4 files changed, 48 insertions(+), 22 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index dedb0af..bd4f581 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -243,7 +243,8 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn,
int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, unsigned int foll_flags,
- struct page **pages, struct vm_area_struct **vmas);
+ struct page **pages, struct vm_area_struct **vmas,
+ int *nonblocking);
#define ZONE_RECLAIM_NOSCAN -2
#define ZONE_RECLAIM_FULL -1
diff --git a/mm/memory.c b/mm/memory.c
index f3a9242..85e56dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1366,7 +1366,8 @@ no_page_table:
int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int nr_pages, unsigned int gup_flags,
- struct page **pages, struct vm_area_struct **vmas)
+ struct page **pages, struct vm_area_struct **vmas,
+ int *nonblocking)
{
int i;
unsigned long vm_flags;
@@ -1465,11 +1466,15 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
+ int fault_flags = 0;
int ret;
+ if (foll_flags & FOLL_WRITE)
+ fault_flags |= FAULT_FLAG_WRITE;
+ if (nonblocking)
+ fault_flags |= FAULT_FLAG_ALLOW_RETRY;
ret = handle_mm_fault(mm, vma, start,
- (foll_flags & ...On Thu, 2 Dec 2010 16:16:51 -0800
Doesn't apply to linux-next because the KVM guys snuck in a new
FAULT_FLAG_MINOR (who knew?). With a bonus, undocumented,
exported-to-modules get_user_pages_noio().
I liked your code better so I munged __get_user_pages() together thusly:
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
unsigned int fault_flags = 0;
if (foll_flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
if (nonblocking)
fault_flags |= FAULT_FLAG_ALLOW_RETRY;
if (foll_flags & FOLL_MINOR)
fault_flags |= FAULT_FLAG_MINOR;
ret = handle_mm_fault(mm, vma, start,
fault_flags);
please review the end result..
--
We would like rwsem_is_contended() to return true only once a contending writer has had a chance to insert itself onto the rwsem wait queue. To that end, we need to differenciate between active and queued writers. A new property is introduced: RWSEM_ACTIVE_WRITE_BIAS is set to be 'more negative' than RWSEM_WAITING_BIAS. RWSEM_WAITING_MASK designates a bit in the rwsem count that will be set only when RWSEM_WAITING_BIAS is in effect. The basic properties that have been true so far also still hold: - RWSEM_ACTIVE_READ_BIAS & RWSEM_ACTIVE_MASK == 1 - RWSEM_ACTIVE_WRITE_BIAS & RWSEM_ACTIVE_MASK == 1 - RWSEM_WAITING_BIAS & RWSEM_ACTIVE_MASK == 0 - RWSEM_ACTIVE_WRITE_BIAS < 0 and RWSEM_WAITING_BIAS < 0 In addition, the rwsem count will be < RWSEM_WAITING_BIAS only if there are any active writers (though we don't make use of this property so far). Signed-off-by: Michel Lespinasse <walken@google.com> --- arch/x86/include/asm/rwsem.h | 32 +++++++++++++++++++------------- arch/x86/lib/rwsem_64.S | 4 ++-- arch/x86/lib/semaphore_32.S | 4 ++-- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h index a35521e..1ce7759 100644 --- a/arch/x86/include/asm/rwsem.h +++ b/arch/x86/include/asm/rwsem.h @@ -16,11 +16,10 @@ * if there are writers (and maybe) readers waiting (in which case it goes to * sleep). * - * The value of WAITING_BIAS supports up to 32766 waiting processes. This can - * be extended to 65534 by manually checking the whole MSW rather than relying - * on the S flag. + * The WRITE_BIAS value supports up to 32767 processes simultaneously + * trying to acquire a write lock. * - * The value of ACTIVE_BIAS supports up to 65535 active processes. + * The value of ACTIVE_MASK supports up to 32767 active processes. * * This should be totally fair - if anything is waiting, a process that wants a * lock will go to the back of the queue. When the currently active lock ...
So you're only considering writer-writer contention? Not writer-reader and reader-writer contention? I'd argue rwsem_is_contended() should return true if there is _any_ blocked task, be it a read or a writer. If you want something else call it something else, like rwsem_is_write_contended() (there's a pending writer), or rwsem_is_read_contended() (there's a pending reader). --
I said 'a contending writer' because in my use case, the rwsem was being held for read, and other readers would get admitted in (so they don't count as contention). The code as written will work in the more general case - it tells if there is another thread queued waiting for the rwsem to be released. It can't tell you however if that thread is a reader or a writer (but if mmap_sem is already held for read, then the contending thread must be a writer). -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
Trivial implementations for rwsem_is_contended()
Signed-off-by: Michel Lespinasse <walken@google.com>
---
arch/alpha/include/asm/rwsem.h | 5 +++++
arch/ia64/include/asm/rwsem.h | 5 +++++
arch/powerpc/include/asm/rwsem.h | 5 +++++
arch/s390/include/asm/rwsem.h | 5 +++++
arch/sh/include/asm/rwsem.h | 5 +++++
arch/sparc/include/asm/rwsem.h | 5 +++++
arch/x86/include/asm/rwsem.h | 5 +++++
arch/xtensa/include/asm/rwsem.h | 5 +++++
include/linux/rwsem-spinlock.h | 1 +
lib/rwsem-spinlock.c | 12 ++++++++++++
10 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
index 1570c0b..6183eec 100644
--- a/arch/alpha/include/asm/rwsem.h
+++ b/arch/alpha/include/asm/rwsem.h
@@ -255,5 +255,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
return (sem->count != 0);
}
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ return (sem->count < 0);
+}
+
#endif /* __KERNEL__ */
#endif /* _ALPHA_RWSEM_H */
diff --git a/arch/ia64/include/asm/rwsem.h b/arch/ia64/include/asm/rwsem.h
index 215d545..e965b7a 100644
--- a/arch/ia64/include/asm/rwsem.h
+++ b/arch/ia64/include/asm/rwsem.h
@@ -179,4 +179,9 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
return (sem->count != 0);
}
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ return (sem->count < 0);
+}
+
#endif /* _ASM_IA64_RWSEM_H */
diff --git a/arch/powerpc/include/asm/rwsem.h b/arch/powerpc/include/asm/rwsem.h
index 8447d89..69f5d13 100644
--- a/arch/powerpc/include/asm/rwsem.h
+++ b/arch/powerpc/include/asm/rwsem.h
@@ -179,5 +179,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
return sem->count != 0;
}
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+ return sem->count < 0;
+}
+
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_RWSEM_H ...Use a single code path for faulting in pages during mlock.
The reason to have it in this patch series is that I did not want to
update both code paths in a later change that releases mmap_sem when
blocking on disk.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
mm/mlock.c | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index 0531173..241a5d2 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -169,7 +169,7 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
VM_BUG_ON(end > vma->vm_end);
VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
- gup_flags = FOLL_TOUCH | FOLL_MLOCK;
+ gup_flags = FOLL_TOUCH;
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
@@ -178,6 +178,9 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
gup_flags |= FOLL_WRITE;
+ if (vma->vm_flags & VM_LOCKED)
+ gup_flags |= FOLL_MLOCK;
+
/* We don't try to access the guard page of a stack vma */
if (stack_guard_page(vma, start)) {
addr += PAGE_SIZE;
@@ -456,14 +459,11 @@ static int do_mlock_pages(unsigned long start, size_t len)
/*
* Now fault in a range of pages within the first VMA.
*/
- if (vma->vm_flags & VM_LOCKED) {
- ret = __mlock_vma_pages_range(vma, nstart, nend);
- if (ret) {
- ret = __mlock_posix_error_return(ret);
- break;
- }
- } else
- make_pages_present(nstart, nend);
+ ret = __mlock_vma_pages_range(vma, nstart, nend);
+ if (ret) {
+ ret = __mlock_posix_error_return(ret);
+ break;
+ }
}
up_read(&mm->mmap_sem);
return ret; /* 0 or negative error code */
--
1.7.3.1
--
Before this change, mlock() holds mmap_sem in exclusive mode while the
pages get faulted in. In the case of a large mlock, this can potentially
take a very long time. Various things will block while mmap_sem is held,
including 'ps auxw'. This can make sysadmins angry.
I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs. Then a second pass could be done
to actually mlock the pages with mmap_sem held for reads only. We need
to recheck the vma flags after we re-acquire mmap_sem, but this is easy.
In the case where a vma has been munlocked before mlock completes,
pages that were already marked as PageMlocked() are handled by the
munlock() call, and mlock() is careful to not mark new page batches
as PageMlocked() after the munlock() call has cleared the VM_LOCKED
vma flags. So, the end result will be identical to what'd happen if
munlock() had executed after the mlock() call.
In a later change, I will allow the second pass to release mmap_sem when
blocking on disk accesses or when it is otherwise contended, so that
it won't be held for long periods of time even in shared mode.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
mm/mlock.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 58 insertions(+), 17 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index 4f31864..8d6b702 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -377,18 +377,10 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
int ret = 0;
int lock = newflags & VM_LOCKED;
- if (newflags == vma->vm_flags ||
- (vma->vm_flags & (VM_IO | VM_PFNMAP)))
+ if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
+ is_vm_hugetlb_page(vma) || vma == get_gate_vma(current))
goto out; /* don't set VM_LOCKED, don't count */
- if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
- is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current)) {
- if ...The kernel holds down_write(mmap_sem) for 14m36s? geeze you guys are picky - that's less than a quarter hour! On Thu, 2 Dec 2010 16:16:47 -0800 Am I correct in believing that we'll still hold down_read(mmap_sem) for a quarter hour? If so, that's still pretty obnoxious behaviour - presumably there are workloads which will hurt like hell from that. We don't need to hold mmap_sem at all while faulting in those pages, do we? We could just do for (addr = start, addr < end; addr += PAGE_SIZE) get_user(x, addr); and voila. If the pages are in cache and the ptes are set up then that will be *vastly* faster than the proposed code. If the get_user() takes a minor fault then it'll be slower. If it's a major fault then the difference probably doesn't matter much. But whatever. Is this patchset a half-fix, and should we rather be looking for a full-fix? --
Yes... Yes, patch 1/6 changes the long hold time to be in read mode instead of write mode, which is only a band-aid. But, this prepares for patch 5/6, which releases mmap_sem whenever there is contention on it or get_user wouldn't suffice if the page is already mapped in, as we need to mark it as PageMlocked. Also, we need to skip IO and PFNMAP regions. I don't think you can make things much simpler than what I I think the series fully fixes the mlock() and mlockall() cases, which has been the more pressing use case for us. Even then, there are still cases where we could still observe long mmap_sem hold times - fundamentally, every place that calls get_user_pages (or do_mmap, in the mlockall MCL_FUTURE case) with a large page range may create such problems. From the looks of it, most of these places wouldn't actually care if the mmap_sem got dropped in the middle of the operation, but a general fix will have to involve looking at all the call sites to be sure. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
I have to say that I'm not a huge fan of that horribly kludgy contention check case. The "move page-in to read-locked sequence" and the changes to get_user_pages look fine, but the contention thing is just disgusting. I'd really like to see some other approach if at all possible. --
On Thu, Dec 9, 2010 at 10:11 PM, Linus Torvalds Are you OK with the part of patch 5/6 that drops mmap_sem when blocking on disk ? This by itself brings mmap_sem hold time down to a few seconds. Plus, I could add something to limit the interval passed to __mlock_vma_pages_range to a thousand pages or so, so that the hold time would then be bounded by that constant. I think rwsem_is_contended() actually sounds better than fiddling with constants, but OTOH maybe the mlock use case is not significant enough to justify introducing that new API. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
Right, so I don't see the problem with _is_contended() either. In fact,
I introduce mutex_is_contended() in the mmu_preempt series to convert
existing (spin) lock break tests.
If you want to do lock-breaks like cond_resched_lock() all you really
have is *_is_contended(), sleeping locks will schedule unconditional.
int cond_break_mutex(struct mutex *mutex)
{
int ret = 0;
if (mutex_is_contended(mutex)) {
mutex_unlock(mutex);
ret = 1;
mutex_lock(mutex);
}
return 1;
}
Or more exotic lock breaks, like the mmu-gather stuff, which falls out
of its nested page-table loops and restarts the whole affair.
--
Andrew, should I amend my patches to remove the rwsem_is_contended() code ?
This would involve:
- remove rwsem-implement-rwsem_is_contended.patch and
x86-rwsem-more-precise-rwsem_is_contended-implementation.patch
- in mlock-do-not-hold-mmap_sem-for-extended-periods-of-time.patch,
drop the one hunk making use of rwsem_is_contended (rest of the patch
would still work without it)
- optionally, follow up patch to limit batch size to a constant
in do_mlock_pages():
diff --git a/mm/mlock.c b/mm/mlock.c
index 569ae6a..a505a7e 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -457,15 +457,23 @@ static int do_mlock_pages(unsigned long start, size_t len)
continue;
if (nstart < vma->vm_start)
nstart = vma->vm_start;
+ /*
+ * Constrain batch size to limit mmap_sem hold time.
+ */
+ if (nend > nstart + 1024 * PAGE_SIZE)
+ nend = nstart + 1024 * PAGE_SIZE;
/*
* Now fault in a range of pages. __mlock_vma_pages_range()
* double checks the vma flags, so that it won't mlock pages
* if the vma was already munlocked.
*/
ret = __mlock_vma_pages_range(vma, nstart, nend, &locked);
if (ret < 0) {
ret = __mlock_posix_error_return(ret);
break;
+ } else if (locked) {
+ locked = 0;
+ up_read(&mm->mmap_sem);
}
nend = nstart + ret * PAGE_SIZE;
ret = 0;
I don't really prefer using a constant, but I'm not sure how else to make
Linus happy :)
--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--
On Mon, 13 Dec 2010 16:51:40 -0800 rwsem_is_contended() didn't seem so bad to me. Reading 1024 pages can still take a long time. I can't immediately think of a better approach though. --
On Mon, Dec 13, 2010 at 5:05 PM, Andrew Morton Note that we're only concerned page cache hist here, as __get_user_pages with non-NULL nonblocking argument will release mmap_sem when blocking on disk. So time per page is somewhat constant - we don't need to worry about disk seeks at least. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
On Mon, Dec 13, 2010 at 5:05 PM, Andrew Morton
I don't see the need for _any_ of this.
Guys, we used to hold the damn thing for writing the *WHOLE*DAMN*TIME*.
Without _any_ at all of the crappy "rwsem_contended()" or the stupid
constants, we hold it only for reading, _and_ we drop it for any
actual IO. So the semaphore is held only for actual CPU intensive
cases. We're talking a reduction from minutes to milliseconds.
So stop this insanity. Do neither the rwsem contention checking _nor_
the "do things in batches".
Really.
The thing is, afte six months of doing the simple and straightforward
and _obvious_ parts, if people still think it's a real problem, at
that point I'm going to be interested in hearing about trying to be
clever. But when the semaphore hold times have gone down by four
orders of magnitude, I simply think it's fundamentally wrong to dick
around with some stupid detail. Certainly not in the same patch
series.
"Keep It Simple, Stupid".
So don't even _try_ to send me a series that does all of this. I'm not
going to take it. Do a series that fixes the _problem_. No more.
And btw, read the paper "Worse is better".
Linus
--
On Tue, Dec 14, 2010 at 7:43 AM, Linus Torvalds It's actually still several seconds for a large enough mlock from page cache. But yes, I agree it'll do fine for now :) -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
Hi David, I forgot to add you on the original submission, but I think you'd be best qualified to look at the two patches implementing rwsem_is_contended()... -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
