Re: [PATCH 00/28] Swap over NFS -v16

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Monday, March 3, 2008 - 7:41 pm

Hi Peter,

 Thanks for trying to spell it out for me. :-)

On Monday March 3, a.p.zijlstra@chello.nl wrote:

Yep.


Yep.


Yes.  Good point.


Definitely.


Agreed.


Maybe...
 1/ Memory is used 
     a/ in caches, such as the fragment cache and the route cache
     b/ in transient allocations on their way from one place to
        another. e.g. network card to fragment cache, frag cache to
        socket. 
    The caches can (do?) impose a natural limit on the amount of
    memory they use.  The transient allocations should be satisfied
    from the normal low watermark pool.  When we are in low memory
    conditions we can expect packet loss so we expect network streams
    to slow down, so we expect there to be fewer bits in transit.
    Also in low memory conditions the caches would be extra-cautious
    not to use too much memory.
    So it isn't completely clear (to me) that extra accounting is needed.

 2/ If we were to do accounting to "ensure progress for those packets
    we already have allocated", then I would expect a reservation
    (charge) of max_packet_size when a fragment arrives on the network
    card - or at least when a new fragment is determined to not match
    any packet already in the fragment cache.  But I didn't see that
    in your code.  I saw incremental charges as each page arrived.
    And that implementation does seem to fit the model.
  

Understood.


I don't understand why we want to "overflow this fragment cache".
I picture the cache having a target size.  When under this size,
fragments might be allowed to live longer.  When at or over the target
size, old fragments are pruned earlier.  When in a low memory
situation it might be even more keen to prune old fragments, to keep
beneath the target size.
When you say "overflow this fragment cache", I picture deliberately
allowing the cache to get bigger than the target size.  I don't
understand why you would want to do that.
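
To make the picture in my head concrete, here is a minimal sketch of
the pruning policy I am describing.  Everything in it is hypothetical:
struct frag_cache, frag_max_age() and the divide-by-two and
divide-by-four factors are made up for illustration and are not taken
from the kernel or from the patch set.

```c
#include <stdbool.h>

/* Hypothetical view of a fragment cache: current size and target size. */
struct frag_cache {
    long size;    /* bytes currently held in the cache */
    long target;  /* size the cache tries to stay beneath */
};

/*
 * Return how long a fragment may live before it is pruned.
 * Under the target: fragments get the full base_age.
 * At or over the target: old fragments are pruned earlier.
 * Low memory: pruning is keener still, whatever the current size.
 * The factors 2 and 4 are arbitrary illustration values.
 */
static long frag_max_age(const struct frag_cache *c, bool low_memory,
                         long base_age)
{
    if (low_memory)
        return base_age / 4;
    if (c->size >= c->target)
        return base_age / 2;
    return base_age;
}
```

Nothing in that policy ever lets the cache grow past its target on
purpose, which is why "overflow" puzzles me.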


That would be important, yes.


Those skbs we allocated - they are either sitting in the fragment
cache, or have been attached to a SK_MEMALLOC socket, or have been
freed - correct?  If so, then there is already a limit to how much
memory they can consume.


Good.  So as long as the normal emergency reserves cover the size of
the route cache plus the size of the fragment cache plus a little bit
of slack, we should be safe - yes?


Lots of it does, yes.


Catch-22 ?? :-)


Yes, rate-limiting those write-outs should keep that moving.

                                          ^not ??

Sounds fair.


Providing it frees any headers it attached to each page (or had
allocated them from a private pool), it should have no memory in use.
I'd have to check through the RPC code (I get lost in there too) to
see how much memory is tied up by each outstanding page write.


Yes.

So I can see two possible models here.

The first is the "bounded cache" or "locally bounded" model.
At every step in the path from writepage to clear_page_writeback,
the amount of extra memory used is bounded by some local rules.
NFS and RPC use congestion logic to limit the number of outstanding
writes.  For incoming packets, the fragment cache and route cache
impose their own limits.
We simply need the VM to reserve a total amount of memory to meet
the sum of those local limits.

Your code embodies this model with the tree of reservations.  The root
of the tree stores the sum of all the reservations below, and this
number is given to the VM.
The value of the tree is that different components can register their
needs independently, and the whole tree (or subtrees) can be attached
or not depending on global conditions, such as whether there are any
SK_MEMALLOC sockets or not.
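
As I understand the tree, it behaves roughly like the sketch below.
This is my own simplification, not your code: struct reserve and the
reserve_*() helpers are hypothetical names, and I have left out
locking and the hand-off of the root total to the VM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reservation node; not the structure from the patch set. */
struct reserve {
    struct reserve *parent; /* NULL for the root or a detached subtree */
    long own;               /* amount this component needs for itself */
    long total;             /* own plus the totals of attached children */
};

static void reserve_init(struct reserve *r, long own)
{
    r->parent = NULL;
    r->own = own;
    r->total = own;
}

/* Add delta to r and to every ancestor of r, up to the root. */
static void reserve_propagate(struct reserve *r, long delta)
{
    for (; r != NULL; r = r->parent)
        r->total += delta;
}

/* Attach a detached subtree; its whole total becomes visible above. */
static void reserve_attach(struct reserve *child, struct reserve *parent)
{
    assert(child->parent == NULL);
    child->parent = parent;
    reserve_propagate(parent, child->total);
}

/* Detach a subtree; its total is withdrawn from all former ancestors. */
static void reserve_detach(struct reserve *child)
{
    assert(child->parent != NULL);
    reserve_propagate(child->parent, -child->total);
    child->parent = NULL;
}

/* A component changes its own need; the change flows up to the root. */
static void reserve_set(struct reserve *r, long own)
{
    long delta = own - r->own;

    r->own = own;
    reserve_propagate(r, delta);
}
```

A network subtree could be built up while detached and then attached
to the root when the first SK_MEMALLOC socket appears, so the root
total only includes it while it is actually needed.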

However I don't see how the charging that you implemented fits into
this model.
You don't do any significant charging for the route cache.  But you do
for skbs.  Why?  Don't the majority of those skbs live in the fragment
cache?  Doesn't it account their size? (Maybe it doesn't.... maybe it
should?).

I also don't see the value of tracking pages to see if they are
'reserve' pages or not.  The decision to drop an skb that is not for
an SK_MEMALLOC socket should be based on whether we are currently
short on memory.  Not whether we were short on memory when the skb was
allocated.

The second model that could fit is "total accounting". 
In this model we reserve memory at each stage including the transient
stages (packet that has arrived but isn't in fragment cache yet).
As memory moves around, we move the charging from one reserve to
another.  If the target reserve doesn't have any space, we drop the
message.
On the transmit side, that means putting the page back on a queue for
sending later.  On the receive side that means discarding the packet
and waiting for a resend.
This model makes it easy for the various limits to be very different
while under memory pressure than otherwise.  It also means they are
imposed differently which isn't so good.
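
For the receive side, moving a charge between reserves under this
model might look something like the sketch below.  Again this is only
an illustration under my own assumptions: struct pool and the pool_*()
helpers are hypothetical names, there is no locking, and the limits
would be whatever each stage is allowed under the current memory
conditions.

```c
#include <stdbool.h>

/* Hypothetical per-stage reserve with a hard limit. */
struct pool {
    long limit; /* most this stage may hold */
    long used;  /* amount currently charged to this stage */
};

/* Charge bytes to p; fail without side effects if it would not fit. */
static bool pool_charge(struct pool *p, long bytes)
{
    if (bytes < 0 || p->used + bytes > p->limit)
        return false;
    p->used += bytes;
    return true;
}

/* Release a charge previously made against p. */
static void pool_uncharge(struct pool *p, long bytes)
{
    p->used -= bytes;
}

/*
 * Receive-side move: the packet leaves src either way.  If dst has
 * room the charge follows the packet; if not, the caller must discard
 * the packet and wait for a resend, so the src charge is released and
 * nothing is charged to dst.
 */
static bool pool_move(struct pool *src, struct pool *dst, long bytes)
{
    bool ok = pool_charge(dst, bytes);

    pool_uncharge(src, bytes);
    return ok;
}
```

The transmit side would differ in that a failed move leaves the page
charged where it was and puts it back on a queue for sending later.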

So:
 - Why do you impose skb allocation limits beyond what is imposed
   by the fragment cache?
 - Why do you need to track whether each allocation is a reserve or
   not?

Thanks,
NeilBrown


Messages in current thread:
Re: [PATCH 00/28] Swap over NFS -v16, Neil Brown, (Mon Mar 3, 7:41 pm)
Re: [PATCH 00/28] Swap over NFS -v16, Peter Zijlstra, (Tue Mar 4, 6:28 am)