rwsem: down_read_unfair() proposal

Previous thread: [PATCH] sysfs-namespaces: add a high-level Documentation file by Serge E. Hallyn on Tuesday, May 4, 2010 - 7:45 pm. (1 message)

Next thread: [PATCH] perf/trace: use read() instead of lseek() in trace_event_read.c:skip() by Tom Zanussi on Tuesday, May 4, 2010 - 9:02 pm. (7 messages)
From: Michel Lespinasse
Date: Tuesday, May 4, 2010 - 8:20 pm

Hi,

I am looking at ways to solve the following problem:

Some cluster monitoring software we use at Google periodically accesses
files such as /proc/<pid>/exe or /proc/<pid>/maps, which requires
acquiring that pid's mmap_sem for read. Sometimes when the machines get
loaded enough, this acquisition can take a long time - typically this
happens when thread A acquires mmap_sem for reads in do_page_fault and
gets blocked (either trying to allocate a page or trying to read from
disk); then thread B tries to acquire mmap_sem for write and gets queued
behind A; then the monitoring software tries to read /proc/<pid>/maps
and gets queued behind B due to rwlock fair behavior.

We have been using patches that address these issues by allowing the
/proc/<pid>/exe and /proc/<pid>/maps code paths to acquire the mmap_sem
for reading in an unfair way, thus allowing the monitoring software to
bypass thread B and acquire mmap_sem concurrently with thread A.

This was easy to implement with the generic rwsem, and looks like it's
doable with the x86 rwsem implementation as well in a way that would only
involve changes to the rwsem spinlock-protected slow paths in lib/rwsem.c .
We are still working on that code but I thought we should ask first how
the developer community feels about the general idea.

For reference, here is one patch we have (against 2.6.33) using
down_read_unfair() in such a way (but with no x86 specific rwsem
implementation yet)


Author: Ying Han <yinghan@google.com>

    Introduce down_read_unfair()
    
    In down_read_unfair(), reader is not waiting non-exclusive lock
    even a writer on the queue. Apply it to maps & exes in procfs where
    monitoring program reads frequently.
    
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 58324c2..d51bc55 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1367,7 +1367,7 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
 
 	/* We need mmap_sem to protect against races with removal of
 	 * VM_EXECUTABLE vmas ...
From: David Howells
Date: Wednesday, May 5, 2010 - 3:03 am

If the system is as heavily loaded as you say, how do you prevent writer
starvation?  Or do things just grind along until sufficient threads are queued
waiting for a write lock?

David
--

From: Michel Lespinasse
Date: Wednesday, May 5, 2010 - 3:36 am

Reader/Writer fairness is not disabled in the general case - it only is
for a few specific readers such as /proc/<pid>/maps. In particular, the
do_page_fault path, which holds a read lock on mmap_sem for potentially long
(~disk latency) periods of times, still uses a fair down_read() call.
In comparison, the /proc/<pid>/maps path which we made unfair does not
normally hold the mmap_sem for very long (it does not end up hitting disk);
so it's been working out well for us in practice.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Mike Waychison
Date: Thursday, May 6, 2010 - 4:26 pm

FWIW, these sorts of block-ups are usually really pronounce on machines 
with harddrives that take _forever_ to respond to SMART commands (which 
are done via PIO, and which can serialize many drives when they are 
hidden behind a port multiplier).  We've seen cases where hard faults 
can take unusually long on an otherwise non-busy machines (~10 seconds?).

The other case we have problems with mmap_sem from a cluster monitoring 
perspective occurs when we get blocked up behind a task that is having 
problems dying from oom.  We have a variety of hacks used internally to 
cover these cases, though I think we (David and I?) figured that it'd 
make more sense to fix the dependencies on down_read(&current->mmap_sem) 
in the do_exit() path.  For instance, it really makes no sense to 
coredump when we are being oom killed (and thus we should be able to 
skip the mmap_sem dependency there..).

Mike Waychison
--

From: David Howells
Date: Wednesday, May 5, 2010 - 3:06 am

It's not as easy as it seems.  Once an XADD-based rwsem is contended, you
cannot necessarily tell without looking at the queue whether the rwsem is
currently write-locked or read-locked.

David
--

From: Michel Lespinasse
Date: Wednesday, May 5, 2010 - 3:48 am

I only said it was doable :) Not done with the implementation yet, but I can
describe the general idea if that helps. The high part of the rwsem is
decremented by two for each thread holding or trying to acquire a write lock;
additionally the high part of the rwsem is decremented by one for the first
thread getting queued. Since queuing is done under a spinlock, it is easy
to decrement only for the first blocked thread there. In down_read_unfair(),
the rwsem value is compared with RWSEM_WAITING_BIAS (== -1 << 16 or 32);
if it's smaller then the rwsem might be write owned and we have to block;
otherwise it only has waiters which we can decide to ignore. This is the
idea in a nutshell.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: David Howells
Date: Wednesday, May 5, 2010 - 4:09 am

That would mean you're reducing the capacity of the upper counter by one since
the high part must remain negative if we're to be able to check it for
non-zeroness by checking the sign flag.  That means a maximum of 2^14-1 writers
queued on a 32-bit box (16384), but we can have more threads than that (up to
~32767).

Currently, we can have a maximum of 32767 writers+readers queued as we only
decrement the upper counter by 1 each time.

On a 64-bit box, the limitations go away for all practical purposes.

David
--

Previous thread: [PATCH] sysfs-namespaces: add a high-level Documentation file by Serge E. Hallyn on Tuesday, May 4, 2010 - 7:45 pm. (1 message)

Next thread: [PATCH] perf/trace: use read() instead of lseek() in trace_event_read.c:skip() by Tom Zanussi on Tuesday, May 4, 2010 - 9:02 pm. (7 messages)