"The problem with swap over network is the generic swap problem: needing memory to free memory. Normally this is solved using mempools, as can be seen in the BIO layer," explained Peter Zijlstra. "Swap over network has the problem that the network subsystem does not use fixed sized allocations, but heavily relies on kmalloc(). This makes mempools unusable."
The first twelve patches set up a generic framework for reserving memory, and patches 13-15 add generic network infrastructure that the later patches build on. Patches 16-23 put the framework to use on the network stack. Peter noted, "a network write back completion [involves] receiving packets, which when there is no memory, is rather hard. And even when there is memory there is no guarantee that the required packet comes in in the window that that memory buys us." He went on to explain, "the solution to this problem is found in the fact that network is to be assumed lossy. Even now, when there is no memory to receive packets the network card will have to discard packets. What we do is move this into the network stack." Patches 24-26 set up an infrastructure for swapping to a filesystem instead of a block device, which the final patches then use: "finally, convert NFS to make use of the new network and vm infrastructure to provide swap over NFS." When the usefulness of these patches was questioned, Peter noted, "There is a large corporate demand for this, which is why I'm doing this. The typical usage scenarios are: 1) cluster/blades, where having local disks is a cost issue (maintenance of failures, heat, etc) 2) virtualisation, where dumping the storage on a networked storage unit makes for trivial migration and what not.."
From: Peter Zijlstra <a.p.zijlstra@...> Subject: [PATCH 00/33] Swap over NFS -v14 Date: Oct 30, 12:04 pm 2007
Hi,

Another posting of the full swap over NFS series.

[ I tried just posting the first part last time around, but
that just gets more confusion by lack of a general picture ]

[ patches against 2.6.23-mm1, also to be found online at:
http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.23-mm1/ ]

The patch-set can be split in roughly 5 parts, for each of which I shall give
a description.

Part 1, patches 1-12
The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.

Swap over network has the problem that the network subsystem does not use fixed
sized allocations, but heavily relies on kmalloc(). This makes mempools
unusable.

This first part provides a generic reserve framework.
Care is taken to only affect the slow paths - when we're low on memory.
Caveats: it is currently SLUB only.
1 - mm: gfp_to_alloc_flags()
2 - mm: tag reseve pages
3 - mm: slub: add knowledge of reserve pages
4 - mm: allow mempool to fall back to memalloc reserves
5 - mm: kmem_estimate_pages()
6 - mm: allow PF_MEMALLOC from softirq context
7 - mm: serialize access to min_free_kbytes
8 - mm: emergency pool
9 - mm: system wide ALLOC_NO_WATERMARK
10 - mm: __GFP_MEMALLOC
11 - mm: memory reserve management
12 - selinux: tag avc cache alloc as non-critical

Part 2, patches 13-15
Provide some generic network infrastructure needed later on.
13 - net: wrap sk->sk_backlog_rcv()
14 - net: packet split receive api
15 - net: sk_allocation() - concentrate socket related allocations

Part 3, patches 16-23
Now that we have a generic memory reserve system, use it on the network stack.
The thing that makes this interesting is that, contrary to BIO, both the
transmit and receive path require memory allocations.

That is, in the BIO layer write back completion is usually just an ISR flipping
a bit and waking stuff up. A network write back completion involved receiving
packets, which when there is no memory, is rather hard. And even when there is
memory there is no guarantee that the required packet comes in in the window
that that memory buys us.

The solution to this problem is found in the fact that network is to be assumed
lossy. Even now, when there is no memory to receive packets the network card
will have to discard packets. What we do is move this into the network stack.

So we reserve a little pool to act as a receive buffer, this allows us to
inspect packets before tossing them. This way, we can filter out those packets
that ensure progress (writeback completion) and disregard the others (as would
have happened anyway). [ NOTE: this is a stable mode of operation with limited
memory usage, exactly the kind of thing we need ]

Again, care is taken to keep much of the overhead of this to only affect the
slow path. Only packets allocated from the reserves will suffer the extra
atomic overhead needed for accounting.

16 - netvm: network reserve infrastructure
17 - sysctl: propagate conv errors
18 - netvm: INET reserves.
19 - netvm: hook skb allocation to reserves
20 - netvm: filter emergency skbs.
21 - netvm: prevent a TCP specific deadlock
22 - netfilter: NF_QUEUE vs emergency skbs
23 - netvm: skb processing

Part 4, patches 24-26
Generic vm infrastructure to handle swapping to a filesystem instead of a block
device. The approach here has been questioned, people would like to see a less
invasive approach.

One suggestion is to create and use a_ops->swap_{in,out}().
24 - mm: prepare swap entry methods for use in page methods
25 - mm: add support for non block device backed swap files
26 - mm: methods for teaching filesystems about PG_swapcache pages

Part 5, patches 27-33
Finally, convert NFS to make use of the new network and vm infrastructure to
provide swap over NFS.

27 - nfs: remove mempools
28 - nfs: teach the NFS client how to treat PG_swapcache pages
29 - nfs: disable data cache revalidation for swapfiles
30 - nfs: swap vs nfs_writepage
31 - nfs: enable swap on NFS
32 - nfs: fix various memory recursions possible with swap over NFS.
33 - nfs: do not warn on radix tree node allocation failures
-
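Part 1's argument, that mempools cannot serve kmalloc()-style traffic, can be illustrated with a toy userspace model. The following is a minimal sketch in Python, not kernel code: `FixedPool` stands in for a mempool that preallocates objects of one size, and `ByteReserve` stands in for a reserve that is accounted in bytes. Both class names, their methods, and the byte-accounting design are assumptions made for illustration and do not correspond to functions in the patch set.

```python
class FixedPool:
    """Toy stand-in for a mempool: preallocated elements of one size."""

    def __init__(self, elem_size, count):
        self.elem_size = elem_size
        self.free = count

    def alloc(self, size):
        # A pool of fixed-size elements can only guarantee that one size.
        if size != self.elem_size or self.free == 0:
            return False
        self.free -= 1
        return True


class ByteReserve:
    """Toy stand-in for a reserve accounted in bytes (assumed design)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def alloc(self, size):
        # Any request size is acceptable while the total stays bounded.
        if self.used + size > self.limit:
            return False
        self.used += size
        return True


def served(allocator, sizes):
    """Count how many of the requested sizes the allocator could satisfy."""
    return sum(1 for size in sizes if allocator.alloc(size))


# kmalloc()-style traffic: request sizes vary from one allocation to the next.
requests = [256, 1538, 64, 1538, 512]
pool_hits = served(FixedPool(elem_size=1538, count=4), requests)
reserve_hits = served(ByteReserve(limit=4 * 1538), requests)
print(pool_hits, reserve_hits)
```

With the same memory budget, the fixed-size pool in this model satisfies only the requests that happen to match its element size, while the byte-accounted reserve satisfies all of them and still enforces an upper bound.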
From: Nick Piggin <nickpiggin@...> Subject: Re: [PATCH 00/33] Swap over NFS -v14 Date: Oct 30, 11:26 pm 2007
On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> Hi,
>
> Another posting of the full swap over NFS series.

Hi,

Is it really worth all the added complexity of making swap
over NFS files work, given that you could use a network block
device instead?

Also, have you ensured that page_file_index, page_file_mapping
and page_offset are only ever used on anonymous pages when the
page is locked? (otherwise PageSwapCache could change)
-
From: Peter Zijlstra <a.p.zijlstra@...> Subject: Re: [PATCH 00/33] Swap over NFS -v14 Date: Oct 31, 7:27 am 2007
On Wed, 2007-10-31 at 14:26 +1100, Nick Piggin wrote:
> On Wednesday 31 October 2007 03:04, Peter Zijlstra wrote:
> > Hi,
> >
> > Another posting of the full swap over NFS series.
>
> Hi,
>
> Is it really worth all the added complexity of making swap
> over NFS files work, given that you could use a network block
> device instead?

As it stands, we don't have a usable network block device IMHO.
NFS is by far the most used and usable network storage solution out
there, anybody with half a brain knows how to set it up and use it.

> Also, have you ensured that page_file_index, page_file_mapping
> and page_offset are only ever used on anonymous pages when the
> page is locked? (otherwise PageSwapCache could change)

Good point, I hope so, both ->readpage() and ->writepage() take a locked
page, I'd have to look if it remains locked throughout the NFS call
chain.

Then again, it might become obsolete with the extended swap a_ops.
From: Jeff Garzik <jeff@...> Subject: Re: [PATCH 00/33] Swap over NFS -v14 Date: Oct 31, 8:16 am 2007
Thoughts:

1) I absolutely agree that NFS is far more prominent and useful than any
network block device, at the present time.

2) Nonetheless, swap over NFS is a pretty rare case. I view this work
as interesting, but I really don't see a huge need, for swapping over
NBD or swapping over NFS. I tend to think swapping to a remote resource
starts to approach "migration" rather than merely swapping. Yes, we can
do it... but given the lack of burning need one must examine the price.

3) You note
> Swap over network has the problem that the network subsystem does not use fixed
> sized allocations, but heavily relies on kmalloc(). This makes mempools
> unusable.

True, but IMO there are mitigating factors that should be researched and
taken into account:

a) To give you some net driver background/history, most mainstream net
drivers were coded to allocate RX skbs of size 1538, under the theory
that they would all be allocating out of the same underlying slab cache.
It would not be difficult to update a great many of the [non-jumbo]
cases to create a fixed size allocation pattern.

b) Spare-time experiments and anecdotal evidence points to RX and TX skb
recycling as a potentially valuable area of research. If you are able
to do something like that, then memory suddenly becomes a lot more
bounded and predictable.

So my gut feeling is that taking a hard look at how net drivers function
in the field should give you a lot of good ideas that approach the
shared goal of making network memory allocations more predictable and
bounded.

Jeff
-
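Jeff's two suggestions, a fixed RX allocation size and buffer recycling, can be combined in a small model to show why they would bound memory. This is a hedged sketch in Python rather than driver code: `RecyclingRxPool`, `receive`, and the `in_flight_limit` parameter are hypothetical names, and only the 1538-byte buffer size comes from the message above.

```python
from collections import deque

RX_BUF_SIZE = 1538  # RX skb size cited for non-jumbo drivers


class RecyclingRxPool:
    """Fixed set of equally sized receive buffers that are reused, never grown."""

    def __init__(self, nbufs):
        self.free = deque(bytearray(RX_BUF_SIZE) for _ in range(nbufs))

    def get(self):
        # Hand out a free buffer, or None when every buffer is in use.
        return self.free.popleft() if self.free else None

    def put(self, buf):
        # Recycle a buffer once the consumer is done with it.
        self.free.append(buf)


def receive(pool, packets, in_flight_limit):
    """Return (delivered, dropped) for a consumer that holds at most
    in_flight_limit buffers before handing the oldest one back."""
    held = deque()
    delivered = dropped = 0
    for payload in packets:
        buf = pool.get()
        if buf is None:
            dropped += 1  # pool exhausted: drop instead of allocating more
            continue
        buf[:len(payload)] = payload
        held.append(buf)
        delivered += 1
        if len(held) > in_flight_limit:
            pool.put(held.popleft())
    return delivered, dropped


packets = [b"x" * 100] * 1000
prompt_consumer = receive(RecyclingRxPool(8), packets, in_flight_limit=4)
stalled_consumer = receive(RecyclingRxPool(8), packets, in_flight_limit=8)
print(prompt_consumer, stalled_consumer)
```

In both runs the pool never holds more than eight buffers. A consumer that returns buffers promptly sees no loss, while one that sits on all of them causes drops rather than further allocation, which is the bounded behavior the suggestion is after.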
From: Peter Zijlstra <a.p.zijlstra@...> Subject: Re: [PATCH 00/33] Swap over NFS -v14 Date: Oct 31, 8:56 am 2007
On Wed, 2007-10-31 at 08:16 -0400, Jeff Garzik wrote:
> Thoughts:
>
> 1) I absolutely agree that NFS is far more prominent and useful than any
> network block device, at the present time.
>
>
> 2) Nonetheless, swap over NFS is a pretty rare case. I view this work
> as interesting, but I really don't see a huge need, for swapping over
> NBD or swapping over NFS. I tend to think swapping to a remote resource
> starts to approach "migration" rather than merely swapping. Yes, we can
> do it... but given the lack of burning need one must examine the price.

There is a large corporate demand for this, which is why I'm doing this.
The typical usage scenarios are:
- cluster/blades, where having local disks is a cost issue (maintenance
of failures, heat, etc)
- virtualisation, where dumping the storage on a networked storage unit
makes for trivial migration and what not..

But please, people who want this (I'm sure some of you are reading) do
speak up. I'm just the motivated corporate drone implementing the
feature :-)

> 3) You note
> > Swap over network has the problem that the network subsystem does not use fixed
> > sized allocations, but heavily relies on kmalloc(). This makes mempools
> > unusable.
>
> True, but IMO there are mitigating factors that should be researched and
> taken into account:
>
> a) To give you some net driver background/history, most mainstream net
> drivers were coded to allocate RX skbs of size 1538, under the theory
> that they would all be allocating out of the same underlying slab cache.
> It would not be difficult to update a great many of the [non-jumbo]
> cases to create a fixed size allocation pattern.

One issue that comes to mind is how to ensure we'd still overflow the
IP-reassembly buffers. Currently those are managed on the number of
bytes present, not the number of fragments.

One of the goals of my approach was to not rewrite the network subsystem
to accommodate this feature (and I hope I succeeded).

> b) Spare-time experiments and anecdotal evidence points to RX and TX skb
> recycling as a potentially valuable area of research. If you are able
> to do something like that, then memory suddenly becomes a lot more
> bounded and predictable.
>
>
> So my gut feeling is that taking a hard look at how net drivers function
> in the field should give you a lot of good ideas that approach the
> shared goal of making network memory allocations more predictable and
> bounded.

Note that being bounded only comes from dropping most packets before
tying them to a socket. That is the crucial part of the RX path, to
receive all packets from the NIC (regardless their size) but to not pass
them on to the network stack - unless they belong to a 'special' socket
that promises undelayed processing.

Thanks for these ideas, I'll look into them.
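The receive-path behavior Peter describes here (accept every packet from the NIC, but under memory pressure only pass on those destined for a 'special' socket) can be sketched as a toy model. This is an illustrative Python simulation under stated assumptions, not the patch set's implementation: the `Sock` class, its `memalloc` flag, and the `receive` function with its `normal_free` and `reserve_limit` parameters are all hypothetical.

```python
class Sock:
    """Toy socket; `memalloc` is a hypothetical flag marking a 'special' socket."""

    def __init__(self, memalloc=False):
        self.memalloc = memalloc
        self.delivered = 0


def receive(sockets, packets, normal_free, reserve_limit):
    """Deliver (dest, size) packets; return (dropped, peak reserve bytes used)."""
    reserve_used = peak = dropped = 0
    for dest, size in packets:
        sock = sockets[dest]
        if size <= normal_free:
            # Normal path: memory is available, so every socket gets its packet.
            normal_free -= size  # stays queued until the application reads it
            sock.delivered += 1
            continue
        if reserve_used + size > reserve_limit:
            dropped += 1  # reserve exhausted too: drop, as the NIC would have
            continue
        # Emergency path: receive into the reserve so the packet can be inspected.
        reserve_used += size
        peak = max(peak, reserve_used)
        if sock.memalloc:
            sock.delivered += 1  # processed at once, e.g. a writeback completion
        else:
            dropped += 1  # would have been lost anyway under memory pressure
        reserve_used -= size  # either way the reserve buffer is freed at once
    return dropped, peak


sockets = {"web": Sock(), "swap": Sock(memalloc=True)}
traffic = [("web", 1500), ("web", 1500), ("web", 1500),
           ("swap", 1500), ("web", 1500), ("swap", 1500)]
dropped, peak = receive(sockets, traffic, normal_free=3000, reserve_limit=4096)
print(dropped, peak, sockets["web"].delivered, sockets["swap"].delivered)
```

Once normal memory runs out in this model, ordinary traffic is dropped after inspection while the swap socket keeps making progress, and reserve usage never exceeds one packet at a time because nothing allocated from the reserve is left queued.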
From: David Miller <davem@...> Subject: Re: [PATCH 00/33] Swap over NFS -v14 Date: Oct 31, 12:37 am 2007
From: Nick Piggin
Date: Wed, 31 Oct 2007 14:26:32 +1100

> Is it really worth all the added complexity of making swap
> over NFS files work, given that you could use a network block
> device instead?

Don't be misled. Swapping over NFS is just a scarecrow for the
seemingly real impetus behind these changes which is network storage
stuff like iSCSI.
-
From: Peter Zijlstra <a.p.zijlstra@...> Subject: Re: [PATCH 00/33] Swap over NFS -v14 Date: Oct 31, 5:53 am 2007
On Tue, 2007-10-30 at 21:37 -0700, David Miller wrote:
> From: Nick Piggin
> Date: Wed, 31 Oct 2007 14:26:32 +1100
>
> > Is it really worth all the added complexity of making swap
> > over NFS files work, given that you could use a network block
> > device instead?
>
> Don't be misled. Swapping over NFS is just a scarecrow for the
> seemingly real impetus behind these changes which is network storage
> stuff like iSCSI.

Not quite, yes, iSCSI is also on the 'want' list of quite a few people,
but swap over NFS on its own is also a feature of great demand.
From: Christoph Hellwig <hch@...> Subject: Re: [PATCH 00/33] Swap over NFS -v14 Date: Oct 31, 4:50 am 2007
On Tue, Oct 30, 2007 at 09:37:53PM -0700, David Miller wrote:
> Don't be misled. Swapping over NFS is just a scarecrow for the
> seemingly real impetus behind these changes which is network storage
> stuff like iSCSI.

So can we please do swap over network storage only first? All these
VM bits look conceptually sane to me, while the changes to the swap
code to support nfs are real crackpipe material. Then again doing
that part properly by adding address_space methods for swap I/O without
the abuse might be a really good idea, especially as the way we
do swapfiles on block-based filesystems is an horrible hack already.

So please get the VM bits for swap over network blockdevices in first,
and then we can look into a complete revamp of the swapfile support
that cleans up the current mess and adds support for nfs instead of
making the mess even worse.
-
From: Peter Zijlstra <a.p.zijlstra@...> Subject: Re: [PATCH 00/33] Swap over NFS -v14 Date: Oct 31, 6:56 am 2007
On Wed, 2007-10-31 at 08:50 +0000, Christoph Hellwig wrote:
> On Tue, Oct 30, 2007 at 09:37:53PM -0700, David Miller wrote:
> > Don't be misled. Swapping over NFS is just a scarecrow for the
> > seemingly real impetus behind these changes which is network storage
> > stuff like iSCSI.
>
> So can we please do swap over network storage only first? All these
> VM bits look conceptually sane to me, while the changes to the swap
> code to support nfs are real crackpipe material.

Yeah, I know how you stand on that. I just wanted to post all this
before going off into the woods reworking it all.

> Then again doing
> that part properly by adding address_space methods for swap I/O without
> the abuse might be a really good idea, especially as the way we
> do swapfiles on block-based filesystems is an horrible hack already.

Is planned. What do you think of the proposed a_ops extension to
accomplish this? That is,

->swapfile() - is this address space willing to back swap
->swapout() - write out a page
->swapin() - read in a page

> So please get the VM bits for swap over network blockdevices in first,

Trouble with that part is that we don't have any sane network block
devices atm, NBD is utter crap, and iSCSI is too complex to be called
sane.

Maybe Evgeniy's Distributed storage thingy would work, will have a look
at that.

> and then we can look into a complete revamp of the swapfile support
> that cleans up the current mess and adds support for nfs instead of
> making the mess even worse.

Sure, concrete suggestions are always welcome. Just being told something
is utter crap only goes so far.
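The shape of the proposed a_ops extension can be sketched as an interface with three operations. The following Python model is only an illustration of the idea: the method names `swapfile`, `swapout`, and `swapin` come from the message above, but their signatures, the `AddressSpaceOps` and `ToyNetworkFileOps` classes, and the `swapon` helper are assumptions, since the thread gives only a one-line description of each method.

```python
class AddressSpaceOps:
    """Default operations: this address space does not back swap."""

    def swapfile(self):
        # "is this address space willing to back swap"
        return False

    def swapout(self, index, page):
        # "write out a page"
        raise NotImplementedError

    def swapin(self, index):
        # "read in a page"
        raise NotImplementedError


class ToyNetworkFileOps(AddressSpaceOps):
    """Hypothetical filesystem that opts in; a dict stands in for the server."""

    def __init__(self):
        self.remote = {}

    def swapfile(self):
        return True

    def swapout(self, index, page):
        self.remote[index] = bytes(page)

    def swapin(self, index):
        return self.remote[index]


def swapon(ops):
    """Enable swap only on address spaces that agree to back it."""
    if not ops.swapfile():
        raise ValueError("address space is not willing to back swap")
    return ops


swap = swapon(ToyNetworkFileOps())
swap.swapout(7, b"anonymous page contents")
restored = swap.swapin(7)
print(restored)
```

The point of such a design would be that the swap code calls through the filesystem's own methods instead of assuming a block device underneath, so a filesystem that cannot support swap simply declines at `swapon` time.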

Pointless feature
NFS mounted swap? I'm sure the scheduler people are thrilled about that. Have you ever heard of a more pointless feature? What next, swap via email?
Linus likes it
Evidently Linus Torvalds is interested in this feature:
http://kerneltrap.org/Linux/Memory_Management_Improvements
Can you say "diskless workstation?"
This has been on one person's wish list or another for about as long as I've been using Linux, if not longer. Diskless workstations still need to swap, and without some spinning rust under the hood, the network's the only other choice.
Without swap, when you run into memory pressure, you can only push out file-backed pages. That's likely a minority of the otherwise swappable pages in memory. For an added bonus, you still incur a page-in penalty to bring those pages back in over NFS. Chances are, your overall network traffic in that situation could be higher than if you permitted anonymous pages to swap to a swapfile over NFS, because let's face it, many of those anonymous pages are quiescent and will stay in swap for quite a long time—maybe even until the next reboot. Limiting yourself to file-backed pages means you're more likely to thrash.
Can you say "unusable"
I pity the chump who gets one of your diskless specials plopped onto his desk. Since fast SATA swap costs about $0.25/GB, don't ya think you should give your users a break and not make them swap across the country to your company SAN? Diskless workstations made little sense 15 years ago when storage was more expensive. They make no sense now. Do you know what you are doing? I am amazed that there are idiots out there still flogging the idea! It is worrisome that kernel size and complexity continue to grow with ill-advised features like this.
Can you say "unusable"
Diskless on the desktop IMHO is a bad principle, it doesn't solve anything related to cost (performance loss and administrative overhead outweigh any gains you get). But there are other uses, such as diskless compute nodes in clusters. On specifically tasked nodes like those, there are a lot of pages that can be swapped out that will more than likely never get used again. Another use is for virtualized clusters (Xen, ESX, whatever). If you swap over a local high speed channel, then you can migrate that VM (swap and all) to another node. Of course this can be done now (ESX has a migration feature), but it requires a high priced SAN and Fibre Channel setup to get it working.
As well, just because you don't plan to use the feature, doesn't mean it can't be worked on. There probably are legitimate uses. The part I see that would be a neat project is getting rid of the "over NFS part". Perhaps having a swap server of some sort. Then again, I don't have any use for any of it, but I can _imagine_ uses outside of what I deploy.
ever heard of diskless servers, clusters?
Ever heard of diskless servers, clusters?
Ever managed infrastructure bigger than your desktop computer?
swap files
I thought Linux already did that, since using normal files for swap has been possible since time immemorial. Or Am I Missing Something (TM)?
swap files over nfs
Your computer tends to lock up randomly doing swap over nfs with swap files, as I've found out to my cost. I believe (reading the thread) the problem is because the kernel can run out of memory while doing the swap over the network (as nfs and network drivers need memory), leading to a crash.
Swap over nfs is a very useful feature if it works properly (we have a diskless cluster). I'm awaiting it with excitement :-)