On Wed, 2007-08-15 at 14:22 +0200, Nick Piggin wrote:

Mainly because we seem to go in circles :-(

Honestly, I don't. They very much do not solve the problem, they just
displace it. Christoph's suggestion to set min_free_kbytes to 20% is
ridiculous - nor does it solve all deadlocks :-(

Please do ponder the problem and its proposed solutions, because I'm
going crazy here.

The problem with networked swap is:

TX

 - we need some memory to initiate writeout
 - writeout needs to be throttled in order to make this bounded
   (currently sort-of done by throttle_vm_writeout() - but Evgeniy and
   Daniel Phillips are working on a more generic approach)

RX

 - we basically need infinite memory to receive the network reply that
   completes writeout. Consider the following scenario:

   3 machines, A, B, C;

   A: * networked swap
      * networked service
   B: * client for networked service
   C: * server for networked swap

   C becomes unreachable/slow for a while
   B sends massive amounts of traffic towards A
   A consumes all memory with non-critical traffic from B and wedges

 - so we need a threshold of some sort to start tossing non-critical
   network packets away (because the consumer of these packets may be
   the one swapping and is therefore frozen)

 - we also need to ensure memory doesn't fragment too badly during the
   receiving -> tossing phase. Otherwise we might again wedge due to OOM

 - and then there is a TCP-specific deadlock:

   TCP has a global limit on the amount of skb memory that can be in
   socket receive queues. Once we hit this limit with non-critical data
   (because the consumers are waiting on swap), all further packets will
   be tossed and we'll never receive C's completion

<>

Now my solution was to have a reserve just big enough to fit:

 - TX
 - RX (large enough to overflow the IP fragment reassembly)

That way, whenever we receive a packet and find we need the reserve to
back this packet, we must only use it for critical services.
(This provides the threshold mentioned earlier.) We then process the
packet up to socket demux (where the skb gets associated with a sk, so we
can determine whether it is critical or not) and toss all packets that
are non-critical. This frees up the memory to receive the next packet,
and this can continue ad infinitum - until we finally do get C's
completion and get out of the tight spot.

<>

What Christoph is proposing is doing recursive reclaim and not initiating
writeout. This will only work _IFF_ there are clean pages about, which in
the general case need not be true (memory might be packed with anonymous
pages - consider an MPI cluster doing computation stuff). So this gets us
a workload-dependent solution - which IMHO is bad!

Also, his suggestion to crank up min_free_kbytes to 20% of machine memory
is not workable (again, imagine this MPI cluster losing 20% of its
collective memory; very much out of the question). Nor does that solve
the TCP deadlock; you need some additional condition to break that.

Do get well.
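
For illustration only, the reserve-and-toss rule described above can be
modelled in a few lines of userspace Python. This is a toy sketch, not
kernel code: the Receiver class, its fields, the byte counts and the port
numbers are all made up for the example, and it assumes a critical packet
is consumed (and its memory returned) immediately. It does not model
fragmentation or the TCP global limit.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    dst_port: int               # stands in for the socket the skb demuxes to
    size: int                   # bytes of backing memory
    from_reserve: bool = False  # True if ordinary memory was exhausted


@dataclass
class Receiver:
    normal_free: int            # ordinary memory still available (bytes)
    reserve_free: int           # emergency reserve (bytes)
    critical_ports: frozenset   # hypothetical: sockets carrying swap traffic
    queued: list = field(default_factory=list)     # waits on frozen consumers
    completed: list = field(default_factory=list)  # critical packets handled
    dropped: int = 0

    def receive(self, dst_port, size):
        # Back the packet with ordinary memory first, then with the reserve.
        if self.normal_free >= size:
            self.normal_free -= size
            pkt = Packet(dst_port, size)
        elif self.reserve_free >= size:
            self.reserve_free -= size
            pkt = Packet(dst_port, size, from_reserve=True)
        else:
            self.dropped += 1
            return "dropped: no memory"

        # "Socket demux": only at this point do we know who the packet is for.
        if pkt.dst_port in self.critical_ports:
            # Assumption: critical data is processed at once, freeing memory.
            self.completed.append(pkt)
            self._free(pkt)
            return "delivered"

        if pkt.from_reserve:
            # Non-critical packet backed by the reserve: toss it so the
            # reserve is available again for the next incoming packet.
            self._free(pkt)
            self.dropped += 1
            return "dropped: non-critical on reserve"

        # Non-critical packet in ordinary memory: it sits in a receive queue
        # because its consumer may be frozen waiting on swap.
        self.queued.append(pkt)
        return "queued"

    def _free(self, pkt):
        if pkt.from_reserve:
            self.reserve_free += pkt.size
        else:
            self.normal_free += pkt.size


if __name__ == "__main__":
    rx = Receiver(normal_free=3000, reserve_free=1500,
                  critical_ports=frozenset({10809}))
    # B floods A with non-critical traffic while C is slow.
    for _ in range(5):
        print(rx.receive(80, 1500))
    # C's completion finally arrives and still finds memory to land in.
    print(rx.receive(10809, 1500))
```

In this model the flood fills ordinary memory and is then tossed packet
by packet at demux, so the reserve is always free again when the next
packet comes in, which is what lets C's completion through.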
