"My experiments show that when there is not much free physical memory, swapoff moves pages out of swap at a rate of approximately 5mb/sec," Daniel Drake noted in a recent discussion about swapoff performance. He added, "I've read into the swap code and I have some understanding that this is an expensive operation (and has to be)." Hugh Dickins acknowledged, "Yes, it can be shamefully slow. But we've done nothing about it for years, simply because very few actually suffer from its worst cases. You're the first I've heard complain about it in a long time: perhaps you'll be joined by a chorus, and we can have fun looking at it again."
As a potential optimization, Daniel proposed to "iterate through all process page tables, paging all swapped pages back into physical memory and updating PTEs". Hugh replied, "feasible yes, and very much less CPU-intensive than the present method. But... it would be reading in pages from swap in pretty much a random order, whereas the present method is reading them in sequentially, to minimize disk seek time. So I doubt your way would actually work out faster". He then added:
"The speedups I've imagined making, were a need demonstrated, have been more on the lines of batching (dealing with a range of pages in one go) and hashing (using the swapmap's ushort, so often 1 or 2 or 3, to hold an indicator of where to look for its references)."

From: Daniel Drake [email blocked]
Subject: speeding up swapoff
Date: Wed, 29 Aug 2007 09:29:32 -0400

Hi,

I've spent some time trying to understand why swapoff is such a slow operation.

My experiments show that when there is not much free physical memory, swapoff moves pages out of swap at a rate of approximately 5mb/sec. When there is a lot of free physical memory, it is faster but still a slow CPU-intensive operation, purging swap at about 20mb/sec.

I've read into the swap code and I have some understanding that this is an expensive operation (and has to be). This page was very helpful and also agrees:
http://kernel.org/doc/gorman/html/understand/understand014.html

After reading that, I have an idea for a possible optimization. If we were to create a system call to disable ALL swap partitions (or modify the existing one to accept NULL for that purpose), could this process be significantly less complex?

I'm thinking we could do something like this:
 1. Prevent any more pages from being swapped out from this point
 2. Iterate through all process page tables, paging all swapped pages back into physical memory and updating PTEs
 3. Clear all swap tables and caches

Due to only iterating through process page tables once, does this sound like it would increase performance non-trivially? Is it feasible?

I'm happy to spend a few more hours looking into implementing this but would greatly appreciate any advice from those in-the-know on if my ideas are broken to start with...

Thanks!
--
Daniel Drake
Brontes Technologies, A 3M Company
http://www.brontes3d.com/opensource

From: Arjan van de Ven [email blocked]
Subject: Re: speeding up swapoff
Date: Wed, 29 Aug 2007 07:30:40 -0700

On Wed, 29 Aug 2007 09:29:32 -0400
Daniel Drake [email blocked] wrote:

Hi,

> I've spent some time trying to understand why swapoff is such a slow
> operation.
>
> My experiments show that when there is not much free physical memory,
> swapoff moves pages out of swap at a rate of approximately 5mb/sec.

sounds like about disk speed (at random-seek IO pattern)

> I'm happy to spend a few more hours looking into implementing this but
> would greatly appreciate any advice from those in-the-know on if my
> ideas are broken to start with...

before you go there... is this a "real life" problem? Or just a mostly-artificial corner case? (the answer to that obviously is relevant for the 'should we really care' question)

Another question, if this is during system shutdown, maybe that's a valid case for flushing most of the pagecache first (from userspace) since most of what's there won't be used again anyway. If that's enough to make this go faster...

A third question, have you investigated what happens if a process gets killed that has pages in swap; as long as we don't page those in but just forget about them, that would solve the shutdown problem nicely (since we kill stuff first anyway there)

From: Daniel Drake [email blocked]
Subject: Re: speeding up swapoff
Date: Wed, 29 Aug 2007 10:44:43 -0400

On Wed, 2007-08-29 at 07:30 -0700, Arjan van de Ven wrote:
> > My experiments show that when there is not much free physical memory,
> > swapoff moves pages out of swap at a rate of approximately 5mb/sec.
>
> sounds like about disk speed (at random-seek IO pattern)

We are only using 'standard' seagate SATA disks, but I would have thought much more performance (40+ mb/sec) would be reachable.

> before you go there... is this a "real life" problem? Or just a
> mostly-artificial corner case? (the answer to that obviously is
> relevant for the 'should we really care' question)

It's more-or-less a real life problem. We have an interactive application which, when triggered by the user, performs rendering tasks which must operate in real-time. In attempt to secure performance, we want to ensure everything is memory resident and that nothing might be swapped out during the process. So, we run swapoff at that time.

If there is a decent number of pages swapped out, the user sits for a while at a 'please wait' screen, which is not desirable. To throw some numbers out there, likely more than a minute for 400mb of swapped pages.

Sure, we could run the whole interactive application with swap disabled, which is pretty much what we do. However we have other non-real-time processing tasks which are very memory hungry and do require swap. So, there are 'corner cases' where the user can reach the real-time part of the interactive application when there is a lot of memory swapped out.

> Another question, if this is during system shutdown, maybe that's a
> valid case for flushing most of the pagecache first (from userspace)
> since most of what's there won't be used again anyway. If that's enough
> to make this go faster...

Shutdown isn't a concern here.

> A third question, have you investigated what happens if a process gets
> killed that has pages in swap; as long as we don't page those in but
> just forget about them, that would solve the shutdown problem nicely
> (since we kill stuff first anyway there)

According to top, those pages in swap disappear when the process is killed. So, I don't think there are any swap-related performance issues on the shutdown path.

Thanks.
--
Daniel Drake
Brontes Technologies, A 3M Company
http://www.brontes3d.com/opensource

From: Bill Davidsen [email blocked]
Subject: Re: speeding up swapoff
Date: Thu, 30 Aug 2007 11:57:02 -0400

Daniel Drake wrote:
> On Wed, 2007-08-29 at 07:30 -0700, Arjan van de Ven wrote:
>>> My experiments show that when there is not much free physical memory,
>>> swapoff moves pages out of swap at a rate of approximately 5mb/sec.
>> sounds like about disk speed (at random-seek IO pattern)
>
> We are only using 'standard' seagate SATA disks, but I would have
> thought much more performance (40+ mb/sec) would be reachable.
>
>> before you go there... is this a "real life" problem? Or just a
>> mostly-artificial corner case? (the answer to that obviously is
>> relevant for the 'should we really care' question)
>
> It's more-or-less a real life problem. We have an interactive
> application which, when triggered by the user, performs rendering tasks
> which must operate in real-time. In attempt to secure performance, we
> want to ensure everything is memory resident and that nothing might be
> swapped out during the process. So, we run swapoff at that time.

So the real issue isn't that your process doesn't run fast enough without doing swapoff, but that swapoff itself takes too long.

>
> If there is a decent number of pages swapped out, the user sits for a
> while at a 'please wait' screen, which is not desirable. To throw some
> numbers out there, likely more than a minute for 400mb of swapped pages.
>
> Sure, we could run the whole interactive application with swap disabled,
> which is pretty much what we do. However we have other non-real-time
> processing tasks which are very memory hungry and do require swap. So,
> there are 'corner cases' where the user can reach the real-time part of
> the interactive application when there is a lot of memory swapped out.

How much is "a lot?" You said 400MB, you can add a few GB of RAM and eliminate the problem at that size.

Run the application in a virtual machine which has enough dedicated memory? I think xen will do that.

Run "swap" on a ramdisk?

I don't think swapoff was designed as a fast operation, although your performance is pretty leisurely. ;-)

I assume you looked at mlock() and it doesn't fit your usage, or you don't control the application behavior, or its limitations make it unsuitable in some other way.

--
Bill Davidsen [email blocked]
"We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot
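
The figures Daniel reports in this thread can be sanity-checked with simple arithmetic. The sketch below assumes a constant drain rate, which real swapoff will not have; the function name is made up for illustration.

```python
def swapoff_seconds(swapped_mb, rate_mb_per_s):
    """Estimated time to empty swap, assuming a constant drain rate."""
    return swapped_mb / rate_mb_per_s


# Figures quoted in the thread: 400 MB swapped out, drained at
# about 5 MB/s (little free RAM) or about 20 MB/s (plenty of free RAM).
low_memory_wait = swapoff_seconds(400, 5)
ample_memory_wait = swapoff_seconds(400, 20)

print("little free RAM: %.0f s" % low_memory_wait)
print("plenty of free RAM: %.0f s" % ample_memory_wait)
```

At the slower rate the estimate comes out above a minute, which agrees with the "more than a minute for 400mb" that Daniel mentions.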

From: Andi Kleen [email blocked]
Subject: Re: speeding up swapoff
Date: 02 Sep 2007 00:20:22 +0200

Daniel Drake [email blocked] writes:
>
> It's more-or-less a real life problem. We have an interactive
> application which, when triggered by the user, performs rendering tasks
> which must operate in real-time. In attempt to secure performance, we
> want to ensure everything is memory resident and that nothing might be
> swapped out during the process. So, we run swapoff at that time.

If the system gets under serious memory pressure it'll happily discard your text pages too (and later reload them from disk). The same for any file data you might need to access.

swapoff will only affect anonymous memory, but not all the other memory you'll need as well.

There's no way around mlock/mlockall() to really prevent this. Still even with that you could still lose dentries/inodes etc which can also cause stalls. The only way to keep them locked is to keep the files always open.

-Andi
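
For readers wondering what the mlockall() route might look like from Python, here is a minimal sketch using ctypes. It is not taken from the thread. The flag values are an assumption (they match `<sys/mman.h>` on x86 Linux and differ on some other architectures), the wrapper names are hypothetical, and ctypes only joined the standard library in Python 2.5, so a Python 2.4 setup would have needed it as an add-on.

```python
import ctypes
import ctypes.util
import os

# Assumption: flag values as defined in <sys/mman.h> on x86 Linux.
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later


def _libc():
    # use_errno=True lets ctypes.get_errno() report the failure reason.
    return ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def lock_all_memory(flags=MCL_CURRENT | MCL_FUTURE):
    """Call mlockall(2); raise OSError if the kernel refuses."""
    if _libc().mlockall(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def unlock_all_memory():
    """Call munlockall(2) so the pages become swappable again."""
    if _libc().munlockall() != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A process would typically call `lock_all_memory()` before entering its real-time phase and `unlock_all_memory()` afterwards. The call can fail, for example when the locked-memory resource limit is too low or the process lacks the needed privilege.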

From: Hugh Dickins [email blocked]
Subject: Re: speeding up swapoff
Date: Wed, 29 Aug 2007 16:58:01 +0100 (BST)

On Wed, 29 Aug 2007, Arjan van de Ven wrote:
> On Wed, 29 Aug 2007 09:29:32 -0400
> Daniel Drake [email blocked] wrote:
>
> > I've spent some time trying to understand why swapoff is such a slow
> > operation.
> >
> > My experiments show that when there is not much free physical memory,
> > swapoff moves pages out of swap at a rate of approximately 5mb/sec.
>
> sounds like about disk speed (at random-seek IO pattern)

The present method should be reading sequentially (with gaps), rather than randomly. Perhaps we need to check what's happening in practice. (I've often dithered over whether we should be doing swap readahead there or not: at present it does not, preferring to assume buffering at the hardware level, and last time I checked that worked out a little better.)

> Another question, if this is during system shutdown, maybe that's a
> valid case for flushing most of the pagecache first (from userspace)
> since most of what's there won't be used again anyway. If that's enough
> to make this go faster...

(I didn't understand your point there, but Daniel has replied that it's not at shutdown anyway.)

> A third question, have you investigated what happens if a process gets
> killed that has pages in swap; as long as we don't page those in but
> just forget about them, that would solve the shutdown problem nicely
> (since we kill stuff first anyway there)

We definitely don't page those in, it would be a disaster for process exit if we did: they just get discarded. As you say, shutdown is rarely a big issue, because almost all the processes which had stuff in swap have already been killed.

tmpfs use of swap can be an issue there, but if the distro is wise, it'll do things in such an order that tmpfs'es are unmounted before swapoff (but may need two passes: the opposite case is a regular swapfile, where we need to swapoff before that partition can be unmounted).

Hugh

From: Hugh Dickins [email blocked]
Subject: Re: speeding up swapoff
Date: Wed, 29 Aug 2007 16:36:37 +0100 (BST)

On Wed, 29 Aug 2007, Daniel Drake wrote:
>
> I've spent some time trying to understand why swapoff is such a slow
> operation.
>
> My experiments show that when there is not much free physical memory,
> swapoff moves pages out of swap at a rate of approximately 5mb/sec. When
> there is a lot of free physical memory, it is faster but still a slow
> CPU-intensive operation, purging swap at about 20mb/sec.

Yes, it can be shamefully slow. But we've done nothing about it for years, simply because very few actually suffer from its worst cases. You're the first I've heard complain about it in a long time: perhaps you'll be joined by a chorus, and we can have fun looking at it again.

>
> I've read into the swap code and I have some understanding that this is
> an expensive operation (and has to be). This page was very helpful and
> also agrees:
> http://kernel.org/doc/gorman/html/understand/understand014.html
>
> After reading that, I have an idea for a possible optimization. If we
> were to create a system call to disable ALL swap partitions (or modify
> the existing one to accept NULL for that purpose), could this process be
> significantly less complex?

I'd be quite strongly against an additional system call: if we're going to speed it up, let's speed up the common case, not your special additional call.

But I don't think you need that anyway: the slowness doesn't come from the limited number of swap areas, but from the much greater numbers of processes and their pages. Looping over the number of swap areas (so often 1) isn't a problem.

>
> I'm thinking we could do something like this:
> 1. Prevent any more pages from being swapped out from this point
> 2. Iterate through all process page tables, paging all swapped
> pages back into physical memory and updating PTEs
> 3. Clear all swap tables and caches
>
> Due to only iterating through process page tables once, does this sound
> like it would increase performance non-trivially? Is it feasible?

I'll ignore your steps 1 and 3, I don't see the advantage. (We do already prevent pages from being swapped out to the area we're swapping off, and in general we need to allow for swapping out to another area while swapping off.)

Step 2 is the core of your idea. Feasible yes, and very much less CPU-intensive than the present method. But... it would be reading in pages from swap in pretty much a random order, whereas the present method is reading them in sequentially, to minimize disk seek time. So I doubt your way would actually work out faster, except in those (exceptional, I'm afraid) cases where almost all the swap pages are already in core swapcache when swapoff begins.

>
> I'm happy to spend a few more hours looking into implementing this but
> would greatly appreciate any advice from those in-the-know on if my
> ideas are broken to start with...

Well, do give it a try if you're interested: I've never actually timed doing it that way, and might be surprised. I doubt you could actually remove the present code, but it could become a fallback to clear up the loose ends after some faster first pass.

Don't forget you'll also need to deal with tmpfs files (mm/shmem.c): Christoph Rohland long ago had a patch to work on those in the way you propose, but we never integrated it because of the random seek issue.

The speedups I've imagined making, were a need demonstrated, have been more on the lines of batching (dealing with a range of pages in one go) and hashing (using the swapmap's ushort, so often 1 or 2 or 3, to hold an indicator of where to look for its references).

Hugh
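
Hugh's seek-order argument can be illustrated with a toy model. This is only a sketch and not kernel code: the "page tables" are plain Python lists of swap offsets, every name is invented for the example, and seek cost is approximated as the distance between consecutively read offsets, ignoring caching, readahead and shared pages.

```python
import random


def seek_distance(offsets):
    """Total distance travelled between consecutively read swap offsets."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


def page_table_order(page_tables):
    """Read order if every process's page table is walked once in turn."""
    return [off for table in page_tables.values() for off in table]


def swap_area_order(page_tables):
    """Read order if the swap area is scanned from start to end."""
    return sorted(page_table_order(page_tables))


def make_toy_page_tables(n_procs=4, pages_per_proc=50, seed=0):
    """Scatter swap slots 0..N-1 across processes in shuffled order."""
    rng = random.Random(seed)
    slots = list(range(n_procs * pages_per_proc))
    rng.shuffle(slots)
    return {
        pid: slots[pid * pages_per_proc:(pid + 1) * pages_per_proc]
        for pid in range(n_procs)
    }


tables = make_toy_page_tables()
print("walk page tables:", seek_distance(page_table_order(tables)))
print("scan swap area:  ", seek_distance(swap_area_order(tables)))
```

In this model the sequential scan covers the 200 slots with the smallest possible head movement, while the page-table walk jumps back and forth across the area. How closely that reflects real disks and real swap layouts is exactly what Hugh says he has never timed.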

From: [email blocked] (Eric W. Biederman)
Subject: Re: speeding up swapoff
Date: Thu, 30 Aug 2007 02:27:29 -0600

Hugh Dickins [email blocked] writes:

> The speedups I've imagined making, were a need demonstrated, have
> been more on the lines of batching (dealing with a range of pages
> in one go) and hashing (using the swapmap's ushort, so often 1 or
> 2 or 3, to hold an indicator of where to look for its references).

There is one other possibility. Typically the swap code is using compatibility disk I/O functions instead of the best the kernel can offer. I haven't looked recently but it might be worth just making certain that there isn't some low-level optimization or cleanup possible on that path. Although I may just be thinking of swapfiles.

I know there were tremendous gains ago when I removed the functions that wrote pages synchronously to swapfiles.

Eric

From: Hugh Dickins [email blocked]
Subject: Re: speeding up swapoff
Date: Thu, 30 Aug 2007 11:36:36 +0100 (BST)

On Thu, 30 Aug 2007, Eric W. Biederman wrote:
>
> There is one other possibility. Typically the swap code is using
> compatibility disk I/O functions instead of the best the kernel
> can offer. I haven't looked recently but it might be worth just
> making certain that there isn't some low-level optimization or
> cleanup possible on that path. Although I may just be thinking
> of swapfiles.

Andrew rewrote swapfile support in 2.5, making it use FIBMAP at swapon time: so that in 2.6 swapfiles are as deadlock-free and as efficient (unless the swapfile happens to be badly fragmented) as raw disk partitions.

There's certainly scope for a study of I/O patterns in swapping, it's hard to imagine that improvements couldn't be made (but also easy to imagine endless disputes over different kinds of workload). But most people would appreciate an improvement in active swapping, and not care very much about the swapoff.

Regarding Daniel's use of swapoff: it's a very heavy sledgehammer for cracking that nut, I strongly agree with those who have pointed him to mlock and mlockall instead.

Hugh

From: Daniel Drake [email blocked]
Subject: Re: speeding up swapoff
Date: Thu, 30 Aug 2007 11:05:16 -0400

On Thu, 2007-08-30 at 11:36 +0100, Hugh Dickins wrote:
> Regarding Daniel's use of swapoff: it's a very heavy sledgehammer
> for cracking that nut, I strongly agree with those who have pointed
> him to mlock and mlockall instead.

There are some issues with us using mlockall. Admittedly, most/all of them are not the kernel's problem (but a fast swapoff would be a good workaround):

We're using python 2.4, so mlock() itself isn't really an option (we don't realistically have access to the address regions hidden behind the language). mlockall() is a possibility, but the fact that all allocations above a particular limit will fail would potentially cause us problems given that it's hard to control python's memory usage for a long-running application. Additionally, choosing that limit is hard given that we have this real-time and non-real-time processing balance, plus an interactive python-based application that runs all the time (which is the thing we would be locking).

python 2.4 never returns memory to the OS, so at whatever point the memory usage of the application peaks, all that memory remains locked permanently.

In addition we have the non-real-time processing task which does benefit from having more memory available, so in that case, we would want it to swap out parts of the application. I guess we could ask the application to do munlockall() here, but things start getting scary and overcomplicated at this point...

So, our arguments against mlockall() are not strong, but you can see why fast swapoff would be mighty convenient.

Thanks for all the info so far. It does sound like my earlier idea wouldn't be any faster in the general case due to excess disk seeking. Oh well...

--
Daniel Drake
Brontes Technologies, A 3M Company
http://www.brontes3d.com/opensource

PTRACE_ATTACH and /proc/<pid>/maps
(Crazy ideas ahead... probably not fit for lkml consumption. :-) )
Rather than swap in everything, why not write a helper app that goes and swaps in the entire Python app? Theoretically, it could do this with ptrace(PTRACE_ATTACH, ...), ptrace(PTRACE_PEEKDATA, ...) and information gleaned from /proc/<pid>/maps.
Granted, this does have the negative side of accessing the virtual address sequentially, without regard for the order in which things were swapped. Also, the process needs to be stopped, so the page-in is definitely synchronous. On the plus side, demand-paged text pages (e.g. library and executable pages) also get brought in for the task. The swapoff approach only brings in the non-file-backed pages, and could actually push these read-only file backed pages out of memory.
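
As a rough sketch of the first half of that helper, the snippet below parses lines in the /proc/<pid>/maps format and lists the page addresses a helper would have to touch. The function and type names are hypothetical, the 4096-byte page size is an assumption, and the ptrace(PTRACE_ATTACH, ...) and ptrace(PTRACE_PEEKDATA, ...) steps that would actually fault the pages in are not shown.

```python
from collections import namedtuple

Region = namedtuple("Region", "start end perms path")


def parse_maps_line(line):
    """Parse one line of /proc/<pid>/maps into a Region."""
    # Fields: address range, perms, offset, device, inode, optional path.
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Region(start, end, fields[1], path)


def readable_regions(pid):
    """Return the readable mappings of a process (Linux only)."""
    with open("/proc/%d/maps" % pid) as maps:
        return [r for r in map(parse_maps_line, maps)
                if r.perms.startswith("r")]


def page_addresses(region, page_size=4096):
    """Addresses to touch, one per page; page_size is an assumption."""
    return range(region.start, region.end, page_size)


sample = "08048000-0804c000 r-xp 00000000 08:01 12345      /bin/cat\n"
region = parse_maps_line(sample)
print(region.path, len(page_addresses(region)), "pages")
```

Touching one word per page is enough to trigger a page-in, which is why the sketch steps through each region a page at a time rather than byte by byte.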
Here's a second idea: What about some variant of the swap-in patches Con's carried around for a while? I imagine those could be tweaked to trigger a swap-in on demand, roughly in reverse order of the way things were swapped out.
In either case, you get the additional benefit of not actually losing your swap space, so that clean anonymous pages don't need to be re-written to disk later when the non-real-time task induces memory pressure.
--
Program Intellivision and play Space Patrol!