Re: [PATCH 00/33] Swap over NFS -v14

Previous thread: [patch] __do_IRQ does not check IRQ_DISABLED when IRQ_PER_CPU is set by Russ Anderson on Tuesday, October 30, 2007 - 12:26 pm. (4 messages)

Next thread: [PATCH 29/33] nfs: disable data cache revalidation for swapfiles by Peter Zijlstra on Tuesday, October 30, 2007 - 12:04 pm. (1 message)
To: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>
Date: Tuesday, October 30, 2007 - 12:04 pm

Hi,

Another posting of the full swap over NFS series. 

[ I tried just posting the first part last time around, but
  that just caused more confusion for lack of a general picture ]

[ patches against 2.6.23-mm1, also to be found online at:
  http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v2.6.23-mm1/ ]

The patch-set can be split into roughly 5 parts, for each of which I shall give
a description.


  Part 1, patches 1-12

The problem with swap over network is the generic swap problem: needing memory
to free memory. Normally this is solved using mempools, as can be seen in the
BIO layer.

Swap over network has the problem that the network subsystem does not use fixed
sized allocations, but heavily relies on kmalloc(). This makes mempools
unusable.

This first part provides a generic reserve framework.

Care is taken to only affect the slow paths - when we're low on memory.

Caveats: it is currently SLUB only.

 1 - mm: gfp_to_alloc_flags()
 2 - mm: tag reserve pages
 3 - mm: slub: add knowledge of reserve pages
 4 - mm: allow mempool to fall back to memalloc reserves
 5 - mm: kmem_estimate_pages()
 6 - mm: allow PF_MEMALLOC from softirq context
 7 - mm: serialize access to min_free_kbytes
 8 - mm: emergency pool
 9 - mm: system wide ALLOC_NO_WATERMARK
10 - mm: __GFP_MEMALLOC
11 - mm: memory reserve management
12 - selinux: tag avc cache alloc as non-critical
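
The contrast drawn in Part 1 (fixed-size mempools versus a reserve that can back variable-size kmalloc() requests) can be illustrated with a small user-space sketch. This is not the code from the patches: `struct mem_reserve`, `reserve_alloc()` and `reserve_free()` are hypothetical names, and `malloc()` stands in for the kernel allocator. It only shows the idea of charging arbitrary sizes against a byte budget, and of touching that budget on the slow path only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical byte-budget reserve (illustration only, not the patch code). */
struct mem_reserve {
	size_t limit;	/* bytes set aside for emergencies */
	size_t usage;	/* bytes currently charged; always <= limit */
};

/*
 * Allocate 'size' bytes. On the fast path (no memory pressure) the
 * reserve is not touched at all. Under pressure the request is charged
 * against the reserve and refused if it would exceed the budget.
 */
static void *reserve_alloc(struct mem_reserve *res, size_t size,
			   bool low_memory, bool *from_reserve)
{
	void *obj;

	*from_reserve = false;
	if (!low_memory)
		return malloc(size);

	if (size > res->limit - res->usage)
		return NULL;		/* budget exhausted */

	obj = malloc(size);
	if (obj) {
		res->usage += size;
		*from_reserve = true;
	}
	return obj;
}

/* Free an object and return its charge to the reserve if it had one. */
static void reserve_free(struct mem_reserve *res, void *obj, size_t size,
			 bool from_reserve)
{
	if (from_reserve) {
		assert(size <= res->usage);
		res->usage -= size;
	}
	free(obj);
}
```

Unlike a mempool, nothing here depends on every object having the same size; the budget is expressed in bytes, which is the property the network stack needs.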


  Part 2, patches 13-15

Provide some generic network infrastructure needed later on.

13 - net: wrap sk->sk_backlog_rcv()
14 - net: packet split receive api
15 - net: sk_allocation() - concentrate socket related allocations


  Part 3, patches 16-23

Now that we have a generic memory reserve system, use it on the network stack.
The thing that makes this interesting is that, contrary to BIO, both the
transmit and receive path require memory allocations. 

That is, in the BIO layer write back completion is usually just an ISR flipping
a bit and waking stuff up. A network write back com...
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Tuesday, October 30, 2007 - 11:26 pm

Hi,

Is it really worth all the added complexity of making swap
over NFS files work, given that you could use a network block
device instead?

Also, have you ensured that page_file_index, page_file_mapping
and page_offset are only ever used on anonymous pages when the
page is locked? (otherwise PageSwapCache could change)
-
To: Nick Piggin <nickpiggin@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 7:27 am

As it stands, we don't have a usable network block device IMHO.
NFS is by far the most used and usable network storage solution out

Good point. I hope so: both ->readpage() and ->writepage() take a locked
page, but I'd have to check whether it remains locked throughout the NFS
call chain.

Then again, it might become obsolete with the extended swap a_ops.
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 8:16 am

Thoughts:

1) I absolutely agree that NFS is far more prominent and useful than any 
network block device, at the present time.


2) Nonetheless, swap over NFS is a pretty rare case.  I view this work 
as interesting, but I really don't see a huge need, for swapping over 
NBD or swapping over NFS.  I tend to think swapping to a remote resource 
starts to approach "migration" rather than merely swapping.  Yes, we can 
do it...  but given the lack of burning need one must examine the price.



True, but IMO there are mitigating factors that should be researched and 
taken into account:

a) To give you some net driver background/history, most mainstream net 
drivers were coded to allocate RX skbs of size 1538, under the theory 
that they would all be allocating out of the same underlying slab cache. 
  It would not be difficult to update a great many of the [non-jumbo] 
cases to create a fixed size allocation pattern.

b) Spare-time experiments and anecdotal evidence point to RX and TX skb 
recycling as a potentially valuable area of research.  If you are able 
to do something like that, then memory suddenly becomes a lot more 
bounded and predictable.
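
To make points (a) and (b) concrete, here is a minimal user-space sketch of fixed-size buffer recycling. It is not driver code: `struct rx_pool`, `rx_buf_get()` and `rx_buf_put()` are hypothetical names, the cap of 4 recycled buffers is arbitrary, and `malloc()` stands in for the skb allocator. The 1538-byte size is the one quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RX_BUF_SIZE 1538	/* RX buffer size quoted in the message */
#define RX_POOL_MAX 4		/* hypothetical cap on recycled buffers */

/* Small stack of buffers kept back for reuse instead of being freed. */
struct rx_pool {
	void *free_list[RX_POOL_MAX];
	size_t nfree;
};

/* Take a recycled buffer if one is available, else allocate a new one. */
static void *rx_buf_get(struct rx_pool *pool)
{
	if (pool->nfree > 0)
		return pool->free_list[--pool->nfree];
	return malloc(RX_BUF_SIZE);
}

/* Return a buffer: keep it for reuse while the pool has room. */
static void rx_buf_put(struct rx_pool *pool, void *buf)
{
	assert(buf != NULL);
	if (pool->nfree < RX_POOL_MAX)
		pool->free_list[pool->nfree++] = buf;
	else
		free(buf);
}
```

Because every buffer has the same size, any recycled buffer can satisfy any later request, so steady-state traffic stops hitting the allocator and the memory in flight is bounded by the pool size plus whatever is outstanding.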


So my gut feeling is that taking a hard look at how net drivers function 
in the field should give you a lot of good ideas that approach the 
shared goal of making network memory allocations more predictable and 
bounded.

	Jeff


-
To: Jeff Garzik <jeff@...>
Cc: Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 8:56 am

There is a large corporate demand for this, which is why I'm doing this.

The typical usage scenarios are:
 - cluster/blades, where having local disks is a cost issue (maintenance
   of failures, heat, etc)
 - virtualisation, where dumping the storage on a networked storage unit
   makes for trivial migration and what not..

But please, people who want this (I'm sure some of you are reading) do
speak up. I'm just the motivated corporate drone implementing the

One issue that comes to mind is how to ensure we'd still overflow the
IP-reassembly buffers. Currently those are managed on the number of
bytes present, not the number of fragments.

One of the goals of my approach was to not rewrite the network subsystem

Note that being bounded only comes from dropping most packets before
tying them to a socket. That is the crucial part of the RX path: to
receive all packets from the NIC (regardless of their size) but to not
pass them on to the network stack - unless they belong to a 'special'
socket that promises undelayed processing.
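
The RX rule described here can be sketched as a small decision function. This is an illustration under assumptions, not the code from the series: `struct sock_sketch`, `struct pkt_sketch`, the `critical` flag and `rx_filter()` are hypothetical names, and the socket lookup that would normally find `sk` is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical socket: 'critical' marks one that promises undelayed processing. */
struct sock_sketch {
	bool critical;
};

/* Hypothetical packet: records whether its memory came from the reserve. */
struct pkt_sketch {
	const struct sock_sketch *sk;	/* owning socket, NULL if none found */
	bool from_reserve;
};

enum rx_verdict { RX_DELIVER, RX_DROP };

/*
 * Every packet is received from the NIC, but one that was allocated
 * from the emergency reserve is only passed up the stack if it belongs
 * to a critical socket. Everything else is dropped right away so the
 * reserve memory is released quickly.
 */
static enum rx_verdict rx_filter(const struct pkt_sketch *pkt)
{
	if (!pkt->from_reserve)
		return RX_DELIVER;	/* normal memory: no restriction */
	if (pkt->sk != NULL && pkt->sk->critical)
		return RX_DELIVER;	/* e.g. the socket carrying swap traffic */
	return RX_DROP;
}
```

The point of the early drop is that reserve memory is never parked in the queue of a socket that might not be serviced for a long time.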

Thanks for these ideas, I'll look into them.
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Jeff Garzik <jeff@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Sunday, November 18, 2007 - 2:09 pm

<apologies for being insanely late into this thread>


HPC clusters are increasingly diskless, especially at the high end.
for all the reasons you mention, but also because networks are faster

swap to iSCSI has worked well in the past with your anti-deadlock
patches, and I'd definitely like to see that continue and to be merged
into mainline!! swap-to-network is a highly desirable feature for
modern clusters.

performance and scalability of NFS is poor, so it's not a good option.

actually swap to a file on Lustre(*) would be best, but iSER and iSCSI
would be my next choices. iSER is better than iSCSI as it's ~5x faster
in practice, and InfiniBand seems to be here to stay.

hmmm - any idea what the issues are with RDMA in low memory situations?
presumably if DMA regions are mapped early then there's not actually
much of a problem? I might try it with tgtd's iSER...

cheers,
robin

(*) obviously not your responsibility. although Lustre (Sun/CFS) could


-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Jeff Garzik <jeff@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Friday, November 2, 2007 - 4:54 am

I have a Wyse thin client here, Geode (or something) CPU, 128MB flash,
256MB RAM (IIRC). You want to swap on this one, and no, you don't want
to swap to flash.
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Jeff Garzik <jeff@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 9:44 am

FWIW, I could have used a "swap to network technology X" like system at
my last job.  We were building a large networking switch with blades,
and the IO cards didn't have anywhere near the resources that the
control modules had (no persistent storage, small ram, etc).  We were
already doing userspace coredumps over NFS to the control cards.  It
would have been nice to swap as well.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Jeff Garzik <jeff@...>, Nick Piggin <nickpiggin@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 9:18 am

Keep it up, Dave already mentioned iSCSI, there is AoE, there are RT
sockets, you name it, the networking bits we've talked about several
times, they look OK, so I'm sorry for not going over all of them in
detail, but you have my support nevertheless.

- Arnaldo
-
To: <nickpiggin@...>
Cc: <a.p.zijlstra@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 12:37 am

From: Nick Piggin <nickpiggin@yahoo.com.au>

Don't be misled.  Swapping over NFS is just a scarecrow for the
seemingly real impetus behind these changes which is network storage
stuff like iSCSI.
-
To: David Miller <davem@...>
Cc: <nickpiggin@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 5:53 am

Not quite, yes, iSCSI is also on the 'want' list of quite a few people,
but swap over NFS on its own is also a feature of great demand.
To: David Miller <davem@...>
Cc: <nickpiggin@...>, <a.p.zijlstra@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 4:50 am

So can we please do swap over network storage only first?  All these
VM bits look conceptually sane to me, while the changes to the swap
code to support nfs are real crackpipe material.   Then again doing
that part properly by adding address_space methods for swap I/O without
the abuse might be a really good idea, especially as the way we
do swapfiles on block-based filesystems is a horrible hack already.

So please get the VM bits for swap over network blockdevices in first,
and then we can look into a complete revamp of the swapfile support
that cleans up the current mess and adds support for nfs instead of
making the mess even worse.

-
To: Christoph Hellwig <hch@...>
Cc: David Miller <davem@...>, <nickpiggin@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 6:56 am

Yeah, I know how you stand on that. I just wanted to post all this

Is planned. What do you think of the proposed a_ops extension to
accomplish this? That is,

->swapfile() - is this address space willing to back swap
->swapout() - write out a page
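
As a rough illustration of what such an extension could look like, here is a user-space sketch. Only the method names ->swapfile() and ->swapout() come from the proposal above; the signatures, the `*_sketch` types, the `demo_*` backend and the two helper functions are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct aspace_sketch;

/* Hypothetical page: just enough state to observe a write-out. */
struct page_sketch {
	unsigned long index;
	bool written;
};

/* Hypothetical ops table carrying the two proposed methods. */
struct aspace_ops_sketch {
	/* Is this address space willing to back swap? 0 on success. */
	int (*swapfile)(struct aspace_sketch *mapping, int enable);
	/* Write out one page. 0 on success. */
	int (*swapout)(struct aspace_sketch *mapping, struct page_sketch *page);
};

struct aspace_sketch {
	const struct aspace_ops_sketch *ops;
	bool swap_enabled;
};

/* Demo backend standing in for a filesystem such as NFS. */
static int demo_swapfile(struct aspace_sketch *mapping, int enable)
{
	mapping->swap_enabled = (enable != 0);
	return 0;
}

static int demo_swapout(struct aspace_sketch *mapping, struct page_sketch *page)
{
	(void)mapping;
	page->written = true;
	return 0;
}

static const struct aspace_ops_sketch demo_ops = {
	.swapfile = demo_swapfile,
	.swapout  = demo_swapout,
};

/* swapon-like step: refuse mappings that do not implement ->swapfile(). */
static int swap_activate(struct aspace_sketch *mapping)
{
	if (!mapping->ops || !mapping->ops->swapfile)
		return -1;
	return mapping->ops->swapfile(mapping, 1);
}

/* Swap-out step: go through the filesystem instead of the block layer. */
static int swap_write_page(struct aspace_sketch *mapping,
			   struct page_sketch *page)
{
	if (!mapping->swap_enabled || !mapping->ops || !mapping->ops->swapout)
		return -1;
	return mapping->ops->swapout(mapping, page);
}
```

The design point is that the swap code asks the address space for permission once and then hands pages to it, rather than mapping the file to blocks up front and bypassing the filesystem.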

Trouble with that part is that we don't have any sane network block
devices atm, NBD is utter crap, and iSCSI is too complex to be called
sane.

Maybe Evgeniy's Distributed storage thingy would work, will have a look

Sure, concrete suggestions are always welcome. Just being told something
is utter crap only goes so far.
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Christoph Hellwig <hch@...>, David Miller <davem@...>, <nickpiggin@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Evgeniy Polyakov <johnpol@...>
Date: Wednesday, October 31, 2007 - 10:54 am

Andrew recently asked Evgeniy if his DST was ready for merging; to
which Evgeniy basically said yes:
http://lkml.org/lkml/2007/10/27/54

It would be great if DST could be merged; thereby addressing the fact
that NBD is lacking for net-vm.  If DST were scrutinized in the
context of net-vm it should help it get the review that is needed for
merging.

Mike
-
To: Mike Snitzer <snitzer@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, David Miller <davem@...>, <nickpiggin@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 12:31 pm

Hi.


By popular request I'm working on adding strong checksumming of the data
transferred, so I can not say that Andrew will want to merge this during
development phase. I expect to complete it quite soon (it is in testing
stage right now) though with new release scheduled this week. It will
also include some small features for userspace (happiness).

Memory management is not changed.

-- 
	Evgeniy Polyakov
-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Christoph Hellwig <hch@...>, David Miller <davem@...>, <nickpiggin@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 7:18 am

Hey, NBD was designed to be _simple_. And I think it works okay in
that area.. so can you elaborate on "utter crap"? [Ok, performance is
not great.]

Plus, I'd suggest you look at ata-over-ethernet. It is in tree
today, quite simple, but should have better performance than nbd.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To: Pavel Machek <pavel@...>
Cc: Christoph Hellwig <hch@...>, David Miller <davem@...>, <nickpiggin@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Jens Axboe <jens.axboe@...>
Date: Wednesday, October 31, 2007 - 7:24 am

Yeah, sorry, perhaps I was overly strong.

It doesn't work for me, because:

  - it does connection management in user-space, which makes it
    impossible to reconnect. I'd want a full kernel based client.

  - it had some plugging issues, and after talking to Jens about it
    he suggested a rewrite using ->make_request() ala AoE. [ sorry if
    I'm short on details here, it was a long time ago, and I

Ah, right, I keep forgetting about that one. The only drawback to that
one is that it's raw Ethernet, and not some IP protocol.
To: David Miller <davem@...>
Cc: <a.p.zijlstra@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 12:04 am

Oh, I'm OK with the network reserves stuff (not the actual patch,
which I'm not really qualified to review, but at least the idea
of it...).

And also I'm not as such against the idea of swap over network.

However, specifically the change to make swapfiles work through
the filesystem layer (ATM it goes straight to the block layer,
modulo some initialisation stuff which uses block filesystem-
specific calls).

I mean, I assume that anybody trying to swap over network *today*
has to be using a network block device anyway, so the idea of
just being able to transparently improve that case seems better
than adding new complexities for seemingly not much gain.
-
To: Nick Piggin <nickpiggin@...>
Cc: David Miller <davem@...>, <a.p.zijlstra@...>, <torvalds@...>, <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>
Date: Wednesday, October 31, 2007 - 10:03 am

I have some embedded diskless devices that have 16 MB of RAM and >500MB of
swap. Its root fs and swap device are both done over NBD because NFS is too
expensive in 16MB of RAM. Any memory contention (i.e. needing memory to swap
memory over the network), however infrequent, causes the system to freeze when
about 50 MB of VM is used up. I would love to see some work done in this area.

  -Byron

--
Byron Stanoszek                         Ph: (330) 644-3059
Systems Programmer                      Fax: (330) 644-8110
Commercial Timesharing Inc.             Email: byron@comtime.com
-