On Mon, 2008-03-03 at 09:18 +1100, Neil Brown wrote:

The TX path needs to be able to make progress, in that it must be able to send out at least one full request (page).

The thing the TX path must not do is tie up so much memory sending out pages that we can't receive any incoming packets.

So having a throttle on the amount of writes in progress, and sufficient memory to back those, seems like a solid way here.

NFS has such a limit in its congestion logic. But I'm quite sure I'm failing to allocate enough memory to back it, as I got confused by the whole RPC code.

That is basically what the slub logic I added does. Except that global flags in the vm make people very nervous, so it's a little more complex.

Which is what I do in the skb_alloc() path.

Which is somewhat more complex than you make it sound, but that is exactly what I do. You need to be able to overflow the ip fragment assembly cache, or we could get stuck with all memory in fragments. Same for other memory usage before we hit the socket de-multiplex, like the route-cache.

I just refined those points here; you need to drop more than non-writeout packets, you need to drop all packets not meant for SK_MEMALLOC. You also need to allow some writeout packets, because if you hit 'oom' and need to write-out some pages to free up memory,...

I did the reservation because I wanted some guarantee we'd be able to overflow the caches mentioned. The alternative is working with the variable ratio that the current reserve has.

The accounting makes the whole system more robust. I wanted to make the state stable enough to survive a connection drop, or server reset, for a long while, and it does. During a swapping workload and heavy network load, I can pull the network cable, or shut down the NFS server and leave it down for over 30 minutes. When I bring it back up again, stuff resumes.

I'm failing horribly.

Let me try again:

Create a stable state where you can receive an unlimited amount of network packets awaiting the one packet you need to move forward.

To do so we need to distinguish needed from unneeded packets; we do this by means of SK_MEMALLOC. So we need to be able to receive packets up to that point.

The unlimited amount of packets means unlimited time; which means that our state must not consume memory, merely use memory. That is, the amount of memory used must not grow unbounded over time.

So we must guarantee that all memory allocated will be promptly freed again, and never allocate more than is available.

Because this state is not the normal state, we need a trigger to enter this state (and consequently a trigger to leave this state). We do that by detecting a low memory situation just like you propose. We enter this state once normal memory allocations fail and leave this state once they start succeeding again.

We need the accounting to ensure we never allocate more than is available, but more importantly because we need to ensure progress for those packets we already have allocated.

A packet is received; it can be a fragment, in which case it will be placed in the fragment cache for packet re-assembly.

We need to ensure we can overflow this fragment cache so that something will come out at the other end. If under a fragment attack, the fragment cache limit will prune the oldest fragments, freeing up memory to receive new ones.

Eventually we'd be able to receive either a whole packet, or enough fragments to assemble one.

Next comes routing the packet; we need to know where to process the packet; local or non-local. This potentially involves filling the route-cache.

If at this point there is no memory available because we forgot to limit the amount of memory available for skb allocation we again are stuck.

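To make the "use memory, don't consume it" property of the fragment cache concrete, here is a minimal user-space sketch in Python. It is a toy model, not kernel code: the class name FragmentCache, the byte budget and the (packet_id, index, total) fragment layout are all assumptions made up for illustration. It only shows that pruning the oldest entries keeps usage bounded under a fragment flood while a complete packet can still come out the other end.

```python
from collections import OrderedDict


class FragmentCache:
    """Toy model of a memory-bounded reassembly cache (hypothetical, not kernel code)."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes  # fixed budget; an assumed stand-in for the cache limit
        self.used_bytes = 0             # accounting: bytes currently held in the cache
        self.queues = OrderedDict()     # packet id -> {index: payload}, oldest packet first

    def _evict_oldest(self):
        # Prune the oldest partially assembled packet and give its memory back.
        _, frags = self.queues.popitem(last=False)
        self.used_bytes -= sum(len(p) for p in frags.values())

    def add(self, packet_id, index, total, payload):
        """Store one fragment; return the reassembled packet, or None if incomplete/dropped."""
        size = len(payload)
        if size > self.limit_bytes or not 0 <= index < total:
            return None  # can never fit, or malformed: drop
        if index in self.queues.get(packet_id, {}):
            return None  # duplicate fragment: drop
        # Overflowing the cache is allowed: old entries are pruned to make room.
        while self.used_bytes + size > self.limit_bytes:
            self._evict_oldest()
        frags = self.queues.setdefault(packet_id, {})
        frags[index] = payload
        self.used_bytes += size
        if len(frags) < total:
            return None
        # Complete: hand the packet on and release everything it held.
        del self.queues[packet_id]
        self.used_bytes -= sum(len(p) for p in frags.values())
        return b"".join(frags[i] for i in range(total))
```

In this model a flood of never-completed fragments only churns the cache; used_bytes never exceeds limit_bytes, so a later pair of fragments for a wanted packet can still be stored and reassembled.
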
The route-cache, like the fragment assembly, is already accounted and will prune old (unused) entries once the total memory usage exceeds a pre-determined amount of memory.

Eventually we'll end up at socket demux, matching packets to sockets, which allows us to either toss the packet or consume it. Dropping packets is allowed because the network is assumed lossy, and we have not yet acknowledged the receive.

Does this make sense?

Then we have TX, which like I said above needs to operate under certain limits as well. We need to be able to send out packets when under pressure in order to relieve said pressure.

We need to ensure doing so will not exhaust our reserves.

Writing out a page typically takes a little memory: you fudge some packets with protocol info, MTU size etc., send them out, wait for an acknowledgement from the other end, then drop the stuff and go on writing other pages.

So sending out pages does not consume memory if we're able to receive ACKs. Being able to receive packets is what all the previous was about.

Now of course there is some RPC concurrency, TCP windows and other funnies going on, but I assumed - and I don't think that's a wrong assumption - that sending out pages will not consume endless amounts of memory. Nor will it keep on sending pages; once there is a certain amount of packets outstanding (nfs congestion logic), it will wait, at which point it should have no memory in use at all.

Anyway, I did get lost in the RPC code, and I know I didn't fully account everything, but under some (hopefully realistic) assumptions I think the model is sound.

Does this make sense?

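The TX side of the argument can be sketched the same way. This is a rough Python model under stated assumptions, not the NFS or RPC code: Reserve, Writeout, max_in_flight and per_page_cost are hypothetical names, and a single fixed per-page cost stands in for the protocol overhead of a write. It only illustrates that a throttle on writes in flight, plus charging and uncharging against a fixed reserve, bounds memory use and returns it all once the ACKs arrive.

```python
class Reserve:
    """Fixed-size emergency reserve with simple accounting (hypothetical model)."""

    def __init__(self, size):
        self.size = size
        self.used = 0

    def charge(self, n):
        # Refuse rather than allocate more than is available.
        if self.used + n > self.size:
            return False
        self.used += n
        return True

    def uncharge(self, n):
        assert n <= self.used
        self.used -= n


class Writeout:
    """Page writeout throttled by an assumed congestion limit on writes in flight."""

    def __init__(self, reserve, max_in_flight, per_page_cost):
        self.reserve = reserve
        self.max_in_flight = max_in_flight  # stand-in for the congestion threshold
        self.per_page_cost = per_page_cost  # assumed memory held per outstanding page
        self.in_flight = 0

    def try_send(self):
        """Start writing one page; return False if we must wait for ACKs."""
        if self.in_flight >= self.max_in_flight:
            return False  # congested: wait instead of tying up more memory
        if not self.reserve.charge(self.per_page_cost):
            return False  # reserve exhausted: wait
        self.in_flight += 1
        return True

    def ack(self):
        """An ACK arrived: the memory backing that write is freed again."""
        if self.in_flight == 0:
            return
        self.in_flight -= 1
        self.reserve.uncharge(self.per_page_cost)
```

With max_in_flight * per_page_cost no larger than the reserve, the writer can always start at least one page when idle, never holds more than that product, and drops back to zero usage once every outstanding write is acknowledged.
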
