Re: [PATCH 00/28] Swap over NFS -v16

To: Neil Brown <neilb@...>
Cc: Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Sunday, March 2, 2008 - 7:33 pm

On Mon, 2008-03-03 at 09:18 +1100, Neil Brown wrote:

The TX path needs to be able to make progress in that it must be able to
send out at least one full request (page). The thing the TX path must
not do is tie up so much memory sending out pages that we can't receive
any incoming packets.

So, having a throttle on the amount of writes in progress, and
sufficient memory to back those, seems like a solid way to go here.

NFS has such a limit in its congestion logic. But I'm quite sure I'm
failing to allocate enough memory to back it, as I got confused by the
whole RPC code.
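
As a rough illustration of the throttle described above, a user-space
sketch could look like the following. It is not the NFS congestion code;
the names (write_throttle, per_write, reserve_pages) are hypothetical,
and the worst-case memory per write is simply assumed to be known.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration; none of these names come from the patches. */
struct write_throttle {
	size_t in_flight;  /* page writes currently outstanding */
	size_t limit;      /* maximum outstanding writes, at least 1 */
};

/*
 * Accept a limit only if the reserve can back every outstanding write.
 * per_write is the assumed worst-case number of pages one write pins.
 */
bool throttle_init(struct write_throttle *t, size_t limit,
		   size_t per_write, size_t reserve_pages)
{
	if (limit < 1 || per_write < 1 || limit > reserve_pages / per_write)
		return false;
	t->in_flight = 0;
	t->limit = limit;
	return true;
}

/* Returns false when the caller must wait for a completion first. */
bool throttle_start_write(struct write_throttle *t)
{
	if (t->in_flight >= t->limit)
		return false;
	t->in_flight++;
	return true;
}

/* Called when the other end has acknowledged a write. */
void throttle_end_write(struct write_throttle *t)
{
	if (t->in_flight > 0)
		t->in_flight--;
}
```

The limit of at least one is what guarantees forward progress; the check
in throttle_init() is the "sufficient memory to back those" part.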


That is basically what the slub logic I added does. Except that global
flags in the vm make people very nervous, so it's a little more complex.


Which is what I do in the skb_alloc() path.


Which is somewhat more complex than you make it sound, but that is
exactly what I do.


You need to be able to overflow the IP fragment assembly cache, or we
could get stuck with all memory in fragments.

Same for other memory usage before we hit the socket de-multiplex, like
the route-cache.

I just refined those points here; you need to drop more than
non-writeout packets, you need to drop all packets not meant for
SK_MEMALLOC.

You also need to allow some writeout packets, because if you hit 'oom'
and need to write-out some pages to free up memory,...

I did the reservation because I wanted some guarantee we'd be able to
overflow the caches mentioned. The alternative is working with the
variable ratio that the current reserve has.

The accounting makes the whole system more robust. I wanted to make the
state stable enough to survive a connection drop, or server reset for a
long while, and it does. During a swapping workload and heavy network
load, I can pull the network cable, or shut down the NFS server and
leave it down for over 30 minutes. When I bring it back up again, stuff
resumes.


I'm failing horribly. Let me try again:

Create a stable state where you can receive an unlimited amount of
network packets awaiting the one packet you need to move forward.

To do so we need to distinguish needed from unneeded packets; we do this
by means of SK_MEMALLOC. So we need to be able to receive packets up to
that point.

The unlimited amount of packets means unlimited time; which means that
our state must not consume memory, merely use memory. That is, the
amount of memory used must not grow unbounded over time.

So we must guarantee that all memory allocated will be promptly freed
again, and never allocate more than available.

Because this state is not the normal state, we need a trigger to enter
this state (and consequently a trigger to leave this state). We do that
by detecting a low memory situation just like you propose. We enter this
state once normal memory allocations fail and leave this state once they
start succeeding again.

We need the accounting to ensure we never allocate more than is
available, but more importantly because we need to ensure progress for
those packets we already have allocated.
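
The enter/leave trigger and the accounting can be sketched together in
user space as below. This is only a model under stated assumptions: the
names are hypothetical, and whether the normal allocator succeeded is
passed in as a flag instead of being a real allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration; not the reserve code from the patches. */
enum alloc_src { ALLOC_NORMAL, ALLOC_RESERVE, ALLOC_FAIL };

struct reserve {
	size_t limit;    /* bytes set aside for the emergency state */
	size_t used;     /* bytes currently charged to the reserve */
	bool emergency;  /* true while normal allocations are failing */
};

/*
 * normal_ok models the outcome of a normal allocation attempt.
 * A failure enters the emergency state, a success leaves it.
 */
enum alloc_src reserve_charge(struct reserve *r, size_t bytes, bool normal_ok)
{
	if (normal_ok) {
		r->emergency = false;
		return ALLOC_NORMAL;
	}
	r->emergency = true;
	if (bytes > r->limit - r->used)
		return ALLOC_FAIL;  /* never hand out more than is available */
	r->used += bytes;
	return ALLOC_RESERVE;
}

/* Only memory that came from the reserve is uncharged. */
void reserve_uncharge(struct reserve *r, size_t bytes)
{
	assert(bytes <= r->used);
	r->used -= bytes;
}
```

Because every charge is paired with an uncharge when the packet is
freed, used stays bounded by limit no matter how long the state lasts.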

A packet is received; it can be a fragment, in which case it will be
placed in the fragment cache for packet re-assembly.

We need to ensure we can overflow this fragment cache in order that
something will come out at the other end. If under a fragment attack,
the fragment cache limit will prune the oldest fragments, freeing up
memory to receive new ones.
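
A minimal sketch of that pruning behaviour, assuming a simple FIFO of
fragment sizes with a byte limit. The real fragment cache is keyed per
datagram and uses timers as well; the names and the FRAG_SLOTS bound
here are hypothetical.

```c
#include <stddef.h>

#define FRAG_SLOTS 64  /* arbitrary bound for this sketch */

/* Hypothetical illustration of a bounded cache, oldest entry at head. */
struct frag_cache {
	size_t size[FRAG_SLOTS];  /* ring buffer of fragment sizes */
	size_t head;              /* index of the oldest fragment */
	size_t count;             /* fragments currently held */
	size_t mem;               /* bytes currently held */
	size_t limit;             /* bytes allowed in the cache */
};

/*
 * Add a fragment, pruning the oldest ones until it fits.
 * Returns the number of fragments pruned, or -1 if the fragment
 * is larger than the whole cache and is dropped instead.
 */
int frag_cache_add(struct frag_cache *c, size_t bytes)
{
	int pruned = 0;
	size_t tail;

	if (bytes > c->limit)
		return -1;

	while (c->count > 0 &&
	       (c->count == FRAG_SLOTS || bytes > c->limit - c->mem)) {
		c->mem -= c->size[c->head];
		c->head = (c->head + 1) % FRAG_SLOTS;
		c->count--;
		pruned++;
	}

	tail = (c->head + c->count) % FRAG_SLOTS;
	c->size[tail] = bytes;
	c->count++;
	c->mem += bytes;
	return pruned;
}
```

The point is that adding never fails for lack of room: memory use stays
at or below limit, and new fragments keep getting accepted.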

Eventually we'd be able to receive either a whole packet, or enough
fragments to assemble one.

Next comes routing the packet; we need to know where to process the
packet; local or non-local. This potentially involves filling the
route-cache.

If at this point there is no memory available because we forgot to limit
the amount of memory available for skb allocation, we are again stuck.

The route-cache, like the fragment assembly, is already accounted and
will prune old (unused) entries once the total memory usage exceeds a
pre-determined amount of memory.

Eventually we'll end up at socket demux, matching packets to sockets
which allows us to either toss the packet or consume it. Dropping
packets is allowed because network is assumed lossy, and we have not yet
acknowledged the receive.
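
The demux decision can be sketched as follows. This is a simplified
model, not the patch itself: sock_stub and from_reserve are hypothetical
stand-ins for the socket's SK_MEMALLOC flag and for "this packet was
allocated while in the low memory state".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration; the real check lives in the network stack. */
enum verdict { DELIVER, DROP };

struct sock_stub {
	bool memalloc;  /* stands in for the SK_MEMALLOC socket flag */
};

/*
 * sk is the socket the packet matched, or NULL if none matched.
 * from_reserve is true when the packet's memory came from the reserve.
 */
enum verdict demux(const struct sock_stub *sk, bool from_reserve)
{
	if (sk == NULL)
		return DROP;  /* nobody is waiting for this packet */
	if (from_reserve && !sk->memalloc)
		return DROP;  /* not needed to make progress; free it now */
	return DELIVER;
}
```

Dropping here returns the memory to the reserve right away, which is
what keeps the state stable while waiting for the one needed packet.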

Does this make sense?


Then we have TX, which like I said above needs to operate under certain
limits as well. We need to be able to send out packets when under
pressure in order to relieve said pressure.

We need to ensure doing so will not exhaust our reserves.

Writing out a page typically takes a little memory: you fudge some
packets with protocol info, MTU size etc., send them out, wait for an
acknowledgement from the other end, then drop the stuff and go on
writing other pages.

So sending out pages does not consume memory if we're able to receive
ACKs. Being able to receive packets is what all the previous was
about.

Now of course there is some RPC concurrency, TCP windows and other
funnies going on, but I assumed - and I don't think that's a wrong
assumption - that sending out pages will not consume endless amounts of
memory.

Nor will it keep on sending pages; once there is a certain number of
packets outstanding (NFS congestion logic), it will wait, at which point
it should have no memory in use at all.

Anyway I did get lost in the RPC code, and I know I didn't fully account
everything, but under some (hopefully realistic) assumptions I think the
model is sound.

Does this make sense?

Messages in current thread:
Re: [PATCH 00/28] Swap over NFS -v16, Neil Brown, (Sun Mar 2, 6:18 pm)
Re: [PATCH 00/28] Swap over NFS -v16, Peter Zijlstra, (Sun Mar 2, 7:33 pm)