Hi,
I am looking at ways to solve the following problem:
Some cluster monitoring software we use at Google periodically accesses
files such as /proc/<pid>/exe or /proc/<pid>/maps, which requires
acquiring that pid's mmap_sem for read. Sometimes when the machines get
loaded enough, this acquisition can take a long time - typically this
happens when thread A acquires mmap_sem for reads in do_page_fault and
gets blocked (either trying to allocate a page or trying to read from
disk); then thread B tries to acquire mmap_sem for write and gets queued
behind A; then the monitoring software tries to read /proc/<pid>/maps
and gets queued behind B due to rwlock fair behavior.
We have been using patches that address these issues by allowing the
/proc/<pid>/exe and /proc/<pid>/maps code paths to acquire the mmap_sem
for reading in an unfair way, thus allowing the monitoring software to
bypass thread B and acquire mmap_sem concurrently with thread A.
This was easy to implement with the generic rwsem, and looks like it's
doable with the x86 rwsem implementation as well in a way that would only
involve changes to the rwsem spinlock-protected slow paths in lib/rwsem.c .
We are still working on that code but I thought we should ask first how
the developer community feels about the general idea.
For reference, here is one patch we have (against 2.6.33) using
down_read_unfair() in such a way (but with no x86 specific rwsem
implementation yet)
Author: Ying Han <yinghan@google.com>
Introduce down_read_unfair()
In down_read_unfair(), reader is not waiting non-exclusive lock
even a writer on the queue. Apply it to maps & exes in procfs where
monitoring program reads frequently.
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 58324c2..d51bc55 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1367,7 +1367,7 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
/* We need mmap_sem to protect against races with removal of
* VM_EXECUTABLE vmas ...If the system is as heavily loaded as you say, how do you prevent writer starvation? Or do things just grind along until sufficient threads are queued waiting for a write lock? David --
Reader/Writer fairness is not disabled in the general case - it only is for a few specific readers such as /proc/<pid>/maps. In particular, the do_page_fault path, which holds a read lock on mmap_sem for potentially long (~disk latency) periods of times, still uses a fair down_read() call. In comparison, the /proc/<pid>/maps path which we made unfair does not normally hold the mmap_sem for very long (it does not end up hitting disk); so it's been working out well for us in practice. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
FWIW, these sorts of block-ups are usually really pronounce on machines with harddrives that take _forever_ to respond to SMART commands (which are done via PIO, and which can serialize many drives when they are hidden behind a port multiplier). We've seen cases where hard faults can take unusually long on an otherwise non-busy machines (~10 seconds?). The other case we have problems with mmap_sem from a cluster monitoring perspective occurs when we get blocked up behind a task that is having problems dying from oom. We have a variety of hacks used internally to cover these cases, though I think we (David and I?) figured that it'd make more sense to fix the dependencies on down_read(&current->mmap_sem) in the do_exit() path. For instance, it really makes no sense to coredump when we are being oom killed (and thus we should be able to skip the mmap_sem dependency there..). Mike Waychison --
It's not as easy as it seems. Once an XADD-based rwsem is contended, you cannot necessarily tell without looking at the queue whether the rwsem is currently write-locked or read-locked. David --
I only said it was doable :) Not done with the implementation yet, but I can describe the general idea if that helps. The high part of the rwsem is decremented by two for each thread holding or trying to acquire a write lock; additionally the high part of the rwsem is decremented by one for the first thread getting queued. Since queuing is done under a spinlock, it is easy to decrement only for the first blocked thread there. In down_read_unfair(), the rwsem value is compared with RWSEM_WAITING_BIAS (== -1 << 16 or 32); if it's smaller then the rwsem might be write owned and we have to block; otherwise it only has waiters which we can decide to ignore. This is the idea in a nutshell. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
That would mean you're reducing the capacity of the upper counter by one since the high part must remain negative if we're to be able to check it for non-zeroness by checking the sign flag. That means a maximum of 2^14-1 writers queued on a 32-bit box (16384), but we can have more threads than that (up to ~32767). Currently, we can have a maximum of 32767 writers+readers queued as we only decrement the upper counter by 1 each time. On a 64-bit box, the limitations go away for all practical purposes. David --
