Re: [RFC 0/3] Recursive reclaim (on __PF_MEMALLOC)

To: Mike Snitzer <snitzer@...>
Cc: <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 1:37 am

(Reposted for completeness.  Previously rejected by vger because it was
accidentally sent as HTML mail.  CCs except for Mike and vger deleted.)

On Monday 17 September 2007 20:27, Mike Snitzer wrote:

The dread blk_congestion_wait is biting you hard.  We're very familiar
with the feeling.  Congestion_wait is basically the traffic cop that
implements the dirty page limit.  I believe it was conceived as a
method of fixing writeout deadlocks, but in our experience it does not
help.  In fact, it introduces a new kind of deadlock
(blk_congestion_wait) that is much easier to trigger.  One of the
things we do to get ddsnap running reliably is disable congestion_wait
via the PF_LESS_THROTTLE hack that was introduced to stop local NFS
clients from deadlocking.  NBD will need a similar treatment.

Actually, I hope to show quite soon that dirty page limiting is not
needed at all in order to prevent writeout deadlock.  In which case we
can just get rid of the dirty limits and go back to being able to use
all of non-reserve memory as a write cache, the way things used to be
in the days of yore.

It has been pointed out to me that congestion_wait not only enforces
the dirty limit, it controls the balancing of memory resources between
slow and fast block devices.  The Peterz/Phillips approach to deadlock
prevention does not provide any such balancing and so it seems to me
that congestion_wait is ideally situated in the kernel to provide that
missing functionality.  As I see it, blk_congestion_wait can easily be
modified to balance the _rate_ at which cache memory is dirtied for
various block devices of different speeds.  This should turn out to
be less finicky than balancing the absolute ratios; after all, you can
make a lot of mistakes in rate limiting and still not deadlock so long
as dirty rate doesn't drop to zero and stay there for any block
device.  Gotta be easy, hmm?

Please note: this plan is firmly in the category of speculation until
we have actually tried it and have patches to show, but I thought that
now is about the right time to say something about where we think
this storage robustness work is headed.


Yes, and also inspect the code to ensure it doesn't violate mlockall
by execing programs (no shell scripts!), dynamically loading
libraries, etc.
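
For illustration, a minimal sketch of the locking step in a user space
daemon, assuming a Linux system with mlockall(2).  The function name
lock_daemon_memory is hypothetical and is not taken from ddsnap.  Note
that, per the mlockall(2) man page, memory locks are not inherited
across fork and are dropped on execve, which is why execing helper
programs defeats the lock.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: pin every page the process has mapped now
 * (MCL_CURRENT) and every page it maps later (MCL_FUTURE), so the
 * daemon never has to fault pages in while servicing block IO.
 * Returns 0 on success, -1 on failure.
 */
int lock_daemon_memory(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		/* Typically EPERM or ENOMEM when RLIMIT_MEMLOCK is too low. */
		fprintf(stderr, "mlockall failed: %s\n", strerror(errno));
		return -1;
	}
	return 0;
}
```

Such a call would normally be made once, early in startup, before the
daemon begins servicing requests.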


Avoiding glib is a good start.  Look at your library dependencies and
prune them mercilessly.  Just don't use any libraries that you can
code up yourself in a few hundred bytes of program text for the
functionality you need.


See PF_LESS_THROTTLE.   Also notice that this mechanism is somewhat
less than general.  In mainline it only has one user, NFS, and it only
can have one user before you have to fiddle that code to create things
like PF_EVEN_LESS_THROTTLE.

As far as I can see, not having any dirty page limit for normal
allocations is the way to go; it avoids this mess nicely.  Now we just
need to prove that this works ;-)


No, it's a patch I wrote based on Evgeniy's original, that appeared
quietly later in the thread.  At the time we hadn't tested it and now
we have.  It works fine, it's short, general, efficient and easy to
understand.  So it will get a post of its own pretty soon.


Yes.  Ddsnap includes a bit of code almost identical to that, which we
wrote independently.  Seems wild and crazy at first blush, doesn't it?
But this approach has proved robust in practice, and is, to my mind,
obviously correct.


You do need the block IO throttling, and you need to bypass the dirty
page limiting.

Without throttling, your block driver will quickly consume any amount
of reserve memory you have, and you are dead.  Without an exemption
from dirty page limiting, the number of pages your user space daemon
can allocate without deadlocking is zero, which makes life very
difficult.

I will post our in-production version of the throttling patch in a day
or two.


Yes.


Yes, at least for device mapper devices.  In our production device
mapper throttling patch, which I will post pretty soon, we provide an
arbitrary limit by default, and the device mapper device may change
it in its constructor method.  Something similar should work for NBD.

As far as sub-optimal throughput goes, we run with a limit of 1,000
bvecs in flight (about 4 MB) and that does not seem to restrict
throughput measurably.
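
As a rough illustration of the accounting behind such a limit, here is
a user space sketch.  It is not the device mapper patch: the struct and
function names are hypothetical, and the "about 4 MB" figure is read
here as 1,000 bvecs of one 4 KB page each, which is an assumption.  A
real driver would put the submitter to sleep when the charge fails and
wake it from the IO completion path; this sketch only shows the
counting.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
struct io_throttle {
	unsigned int limit;    /* maximum bvecs allowed in flight */
	unsigned int inflight; /* bvecs currently charged */
};

/* Set the limit, e.g. from a device constructor. */
void throttle_init(struct io_throttle *t, unsigned int limit)
{
	t->limit = limit;
	t->inflight = 0;
}

/*
 * Try to charge nr_vecs against the limit before submitting IO.
 * Returns false if the submitter has to wait for completions.
 * A request larger than the whole limit is never admitted here;
 * a real implementation would need a policy for that case.
 */
bool throttle_try_charge(struct io_throttle *t, unsigned int nr_vecs)
{
	/* inflight never exceeds limit, so the subtraction cannot wrap. */
	if (nr_vecs > t->limit - t->inflight)
		return false;
	t->inflight += nr_vecs;
	return true;
}

/* Called on IO completion with the same count that was charged. */
void throttle_uncharge(struct io_throttle *t, unsigned int nr_vecs)
{
	t->inflight -= nr_vecs;
}
```

With a limit of 1,000, a 600-bvec request is admitted, a following
500-bvec request is refused until enough earlier IO completes, and the
amount of memory tied up in flight stays bounded by the limit.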

Though you also need this throttling, it is apparent from the traceback
you linked above that you ran aground on blk_congestion_wait.  Try
setting your user space daemon into PF_LESS_THROTTLE mode and see what
happens.


A vm dagwood sandwich, I hope it tastes good :-)

Well, pretty soon we will join you in the NBD rehabilitation effort
because we require it for the next round of storage work, which
centers around the ddraid distributed block device.  This requires an
NBD that functions reliably, even when accessing an exported block
device locally.


I thought Peter was swapping over NBD?  Anyway, we have not moved into
the NBD problem yet because we are still busy chasing
non-deadlock-related ddsnap bugs.  Which require increasingly creative
efforts to trigger by the way, but we haven't quite run out of new
bugs, so we don't get to play with distributed storage just yet.


Seeing as we have a virtually identical target configuration in mind,
you can expect quite a lot of help from our direction in the near
future, and in the meantime we can provide encouragement, information
and perhaps a few useful lines of code.

Regards,

Daniel