Hi,

I've spent some time trying to understand why swapoff is such a slow
operation.

My experiments show that when there is not much free physical memory,
swapoff moves pages out of swap at a rate of approximately 5 MB/sec. When
there is a lot of free physical memory, it is faster but still a slow,
CPU-intensive operation, purging swap at about 20 MB/sec.

I've read into the swap code and I have some understanding that this is
an expensive operation (and has to be). This page was very helpful and
also agrees:
http://kernel.org/doc/gorman/html/understand/understand014.html

After reading that, I have an idea for a possible optimization. If we
were to create a system call to disable ALL swap partitions (or modify
the existing one to accept NULL for that purpose), could this process be
significantly less complex?

I'm thinking we could do something like this:
1. Prevent any more pages from being swapped out from this point
2. Iterate through all process page tables, paging all swapped
   pages back into physical memory and updating PTEs
3. Clear all swap tables and caches

Due to only iterating through process page tables once, does this sound
like it would increase performance non-trivially? Is it feasible?

I'm happy to spend a few more hours looking into implementing this but
would greatly appreciate any advice from those in-the-know on whether my
ideas are broken to start with...

Thanks!
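To make the "iterate once" argument in the three steps above concrete, here is a toy Python model (not kernel code) that only counts how many page table entries get examined. It assumes a simplified picture of the present behaviour, in which swapoff handles one swap slot at a time and searches the process page tables for its user, and it assumes every slot is referenced by exactly one PTE. All function names are hypothetical.

```python
import random


def build_page_tables(n_procs, ptes_per_proc, n_swapped, seed=0):
    """Return one list per process; each PTE is None (resident) or a swap slot."""
    rng = random.Random(seed)
    tables = [[None] * ptes_per_proc for _ in range(n_procs)]
    positions = [(p, i) for p in range(n_procs) for i in range(ptes_per_proc)]
    # Assumption: each swap slot is used by exactly one PTE (no shared pages).
    for slot, (p, i) in enumerate(rng.sample(positions, n_swapped)):
        tables[p][i] = slot
    return tables


def per_entry_scan(tables, n_swapped):
    """For each swap slot in turn, search the page tables for the PTE using it."""
    visits = 0
    for slot in range(n_swapped):
        found = False
        for table in tables:
            for i, pte in enumerate(table):
                visits += 1
                if pte == slot:
                    table[i] = None  # "page it in" and update the PTE
                    found = True
                    break
            if found:
                break
    return visits


def single_pass(tables):
    """Walk every page table once, bringing in each swapped page as it is met."""
    visits = 0
    for table in tables:
        for i, pte in enumerate(table):
            visits += 1
            if pte is not None:
                table[i] = None
    return visits


if __name__ == "__main__":
    old = per_entry_scan(build_page_tables(4, 256, 100), 100)
    new = single_pass(build_page_tables(4, 256, 100))
    print("PTE visits, per-entry scan:", old)
    print("PTE visits, single pass:   ", new)
```

The model says nothing about disk I/O order; it only counts page table work.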
--
Daniel Drake
Brontes Technologies, A 3M Company
http://www.brontes3d.com/opensource
-
Daniel:

In a response, Juergen Beisert asked if you'd tried mlock() [mlockall()
would probably be a better choice] to lock your application into memory.
That would require modifying the application. Don't know if you want to
do that.

Back in Feb'07, I posted an RFC regarding [optionally] inheriting
mlockall() semantics across fork and exec. The original posting is
here:

http://marc.info/?l=linux-mm&m=117217855508612&w=4

The patch is quite stale now [against 20-rc<something>], but shouldn't
be too much work to rebase to something more recent. The patch
description points to an ad hoc mlock "prefix command" that would allow
you to:

mlock <some application>

and run the application as if it had called
"mlockall(MCL_CURRENT|MCL_FUTURE)", without having to modify the
application--if that's something you can't or don't want to do.

Maybe this would help?

Lee
-
Yes, it can be shamefully slow. But we've done nothing about it for
years, simply because very few actually suffer from its worst cases.
You're the first I've heard complain about it in a long time: perhaps

I'd be quite strongly against an additional system call: if we're
going to speed it up, let's speed up the common case, not your special
additional call. But I don't think you need that anyway: the slowness
doesn't come from the limited number of swap areas, but from the much
greater numbers of processes and their pages. Looping over the number

I'll ignore your steps 1 and 3, I don't see the advantage. (We
do already prevent pages from being swapped out to the area we're
swapping off, and in general we need to allow for swapping out to
another area while swapping off.) Step 2 is the core of your idea.

Feasible yes, and very much less CPU-intensive than the present method.
But... it would be reading in pages from swap in pretty much a random
order, whereas the present method is reading them in sequentially, to
minimize disk seek time. So I doubt your way would actually work out
faster, except in those (exceptional, I'm afraid) cases where almost

Well, do give it a try if you're interested: I've never actually
timed doing it that way, and might be surprised. I doubt you could
actually remove the present code, but it could become a fallback to
clear up the loose ends after some faster first pass.

Don't forget you'll also need to deal with tmpfs files (mm/shmem.c):
Christoph Rohland long ago had a patch to work on those in the way you
propose, but we never integrated it because of the random seek issue.

The speedups I've imagined making, were a need demonstrated, have
been more on the lines of batching (dealing with a range of pages
in one go) and hashing (using the swapmap's ushort, so often 1 or
2 or 3, to hold an indicator of where to look for its references).

Hugh
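The random-versus-sequential concern in the message above can be illustrated with a small Python sketch. It uses the sum of slot-to-slot jumps as a crude stand-in for disk head movement, and it assumes the worst case in which page-table order is completely uncorrelated with swap slot order; a real system will usually show some clustering. The function names are hypothetical and nothing here measures a real disk.

```python
import random


def total_seek(order):
    """Sum of absolute jumps between consecutively read swap slots."""
    return sum(abs(b - a) for a, b in zip(order, order[1:]))


def swap_area_order(n_slots):
    """Read slots in ascending order, as a sequential sweep of the swap area."""
    return list(range(n_slots))


def page_table_order(n_slots, seed=0):
    """Read slots in the order a page-table walk meets them.

    Assumption: that order is a random permutation of the slots.
    """
    rng = random.Random(seed)
    order = list(range(n_slots))
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    n = 1000
    print("sequential sweep:", total_seek(swap_area_order(n)))
    print("page-table order:", total_seek(page_table_order(n)))
```

Under these assumptions the page-table order travels much further across the swap area for the same number of pages, which is the cost being weighed against the CPU savings.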
-
There is one other possibility. Typically the swap code is using
compatibility disk I/O functions instead of the best the kernel
can offer. I haven't looked recently but it might be worth just
making certain that there isn't some low-level optimization or
cleanup possible on that path. Although I may just be thinking
of swapfiles.

I know there were tremendous gains some time ago when I removed the
functions that wrote pages synchronously to swapfiles.

Eric
-
Andrew rewrote swapfile support in 2.5, making it use FIBMAP at
swapon time: so that in 2.6 swapfiles are as deadlock-free and
as efficient (unless the swapfile happens to be badly fragmented)
as raw disk partitions.

There's certainly scope for a study of I/O patterns in swapping,
it's hard to imagine that improvements couldn't be made (but also
easy to imagine endless disputes over different kinds of workload).
But most people would appreciate an improvement in active swapping,
and not care very much about the swapoff.

Regarding Daniel's use of swapoff: it's a very heavy sledgehammer
for cracking that nut, I strongly agree with those who have pointed
him to mlock and mlockall instead.

Hugh
-
There are some issues with us using mlockall. Admittedly, most/all of
them are not the kernel's problem (but a fast swapoff would be a good
workaround):

We're using Python 2.4, so mlock() itself isn't really an option (we
don't realistically have access to the address regions hidden behind the
language). mlockall() is a possibility, but the fact that all
allocations above a particular limit will fail would potentially cause
us problems given that it's hard to control Python's memory usage for a
long-running application.

Additionally, choosing that limit is hard given that we have this
real-time and non-real-time processing balance, plus an interactive
Python-based application that runs all the time (which is the thing we
would be locking). Python 2.4 never returns memory to the OS, so at
whatever point the memory usage of the application peaks, all that
memory remains locked permanently.

In addition we have the non-real-time processing task which does benefit
from having more memory available, so in that case, we would want it to
swap out parts of the application. I guess we could ask the application
to do munlockall() here, but things start getting scary and
overcomplicated at this point...

So, our arguments against mlockall() are not strong, but you can see why
fast swapoff would be mighty convenient.

Thanks for all the info so far. It does sound like my earlier idea
wouldn't be any faster in the general case due to excess disk seeking.
Oh well...

--
Daniel Drake
Brontes Technologies, A 3M Company
http://www.brontes3d.com/opensource
-
On Wed, 29 Aug 2007 09:29:32 -0400
Daniel Drake <ddrake@brontes3d.com> wrote:

Before you go there... is this a "real life" problem? Or just a
mostly-artificial corner case? (the answer to that obviously is
relevant for the 'should we really care' question)

Another question, if this is during system shutdown, maybe that's a
valid case for flushing most of the pagecache first (from userspace)
since most of what's there won't be used again anyway. If that's enough
to make this go faster...

A third question, have you investigated what happens if a process gets
killed that has pages in swap; as long as we don't page those in but
just forget about them, that would solve the shutdown problem nicely
(since we kill stuff first anyway there)
-
The present method should be reading sequentially (with gaps),
rather than randomly. Perhaps we need to check what's happening
in practice.

(I've often dithered over whether we should be doing swap readahead
there or not: at present it does not, preferring to assume buffering
at the hardware level, and last time I checked that worked out a

(I didn't understand your point there, but Daniel has replied that

We definitely don't page those in, it would be a disaster for process
exit if we did: they just get discarded.

As you say, shutdown is rarely a big issue, because almost all the
processes which had stuff in swap have already been killed. tmpfs
use of swap can be an issue there, but if the distro is wise, it'll
do things in such an order that tmpfs'es are unmounted before swapoff
(but may need two passes: the opposite case is a regular swapfile,
where we need to swapoff before that partition can be unmounted).

Hugh
-
We are only using 'standard' Seagate SATA disks, but I would have

It's more-or-less a real life problem. We have an interactive
application which, when triggered by the user, performs rendering tasks
which must operate in real-time. In an attempt to secure performance, we
want to ensure everything is memory resident and that nothing might be
swapped out during the process. So, we run swapoff at that time.

If there is a decent number of pages swapped out, the user sits for a
while at a 'please wait' screen, which is not desirable. To throw some
numbers out there, likely more than a minute for 400 MB of swapped pages.

Sure, we could run the whole interactive application with swap disabled,
which is pretty much what we do. However we have other non-real-time
processing tasks which are very memory hungry and do require swap. So,
there are 'corner cases' where the user can reach the real-time part of

According to top, those pages in swap disappear when the process is
killed. So, I don't think there are any swap-related performance issues
on the shutdown path.

Thanks.
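As a rough check on the figures quoted in this thread, the Python arithmetic below relates the "more than a minute for 400 MB" observation to the 5 MB/sec and 20 MB/sec rates from the first message. The helper names are hypothetical, and the calculation treats the rate as constant, which a real swapoff will not be.

```python
def swapoff_rate_mb_per_s(swapped_mb, seconds):
    """Average rate implied by draining swapped_mb in the given time."""
    return swapped_mb / float(seconds)


def swapoff_seconds(swapped_mb, rate_mb_per_s):
    """Time to drain swapped_mb at a constant rate."""
    return swapped_mb / float(rate_mb_per_s)


if __name__ == "__main__":
    # 400 MB in exactly one minute is an upper bound on the observed rate,
    # since the wait was described as "more than a minute".
    print("rate at 60 s:      %.1f MB/sec" % swapoff_rate_mb_per_s(400, 60))
    print("time at 5 MB/sec:  %.0f s" % swapoff_seconds(400, 5))
    print("time at 20 MB/sec: %.0f s" % swapoff_seconds(400, 20))
```

So the 400 MB case sits near the slow end of the measured range: at 5 MB/sec it takes 80 seconds, while at 20 MB/sec it would take 20 seconds.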
--
Daniel Drake
Brontes Technologies, A 3M Company
http://www.brontes3d.com/opensource
-
If the system gets under serious memory pressure it'll happily discard
your text pages too (and later reload them from disk). The same goes
for any file data you might need to access.

swapoff will only affect anonymous memory, but not all the other
memory you'll need as well.

There's no way around mlock/mlockall() to really prevent this.
Even with that you could still lose dentries/inodes etc. which
can also cause stalls. The only way to keep them locked
is to keep the files always open.

-Andi
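For reference, this is roughly what calling mlockall() from Python looks like today. It is a minimal sketch that goes through ctypes, which joined the standard library in Python 2.5, so an application on 2.4 as described in this thread would need some other route (for example a small C extension). The MCL_* values are an assumption: they match <sys/mman.h> on Linux x86 and x86-64, and some other architectures use different numbers. The wrapper names are hypothetical.

```python
import ctypes

# Assumption: Linux x86/x86-64 values from <sys/mman.h>; check your platform.
MCL_CURRENT = 1
MCL_FUTURE = 2

# Load the C library symbols already linked into the running process.
_libc = ctypes.CDLL(None, use_errno=True)


def lock_all_memory():
    """Call mlockall(MCL_CURRENT | MCL_FUTURE); return 0 or the errno."""
    if _libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0:
        return 0
    return ctypes.get_errno()


def unlock_all_memory():
    """Call munlockall(); return 0 or the errno."""
    if _libc.munlockall() == 0:
        return 0
    return ctypes.get_errno()


if __name__ == "__main__":
    rc = lock_all_memory()
    if rc == 0:
        print("all current and future mappings are locked")
        unlock_all_memory()
    else:
        # Typically a permission or RLIMIT_MEMLOCK problem.
        print("mlockall failed, errno", rc)
```

Whether the call succeeds depends on privileges and on the locked-memory resource limit, so the sketch reports the errno rather than raising.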
-
So the real issue isn't that your process doesn't run fast enough

How much is "a lot?" You said 400MB, you can add a few GB of RAM and
eliminate the problem at that size. Run the application in a virtual
machine which has enough dedicated memory? I think xen will do that. Run
"swap" on a ramdisk? I don't think swapoff was designed as a fast
operation, although your performance is pretty leisurely. ;-)

I assume you looked at mlock() and it doesn't fit your usage, or you
don't control the application behavior, or its limitations make it

--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
-
Did you play with mlock()?
Juergen
-
Is there a good reason to swapoff during shutdown?
Regards
Oliver
-
Three reasons, I think, only one of them compelling:
1. Tidiness.
2. So swapoff gets testing and I get to hear of any bugs in it.
3. If a regular swapfile is used instead of a disk partition, you
   need to swapoff before its filesystem can be unmounted cleanly.

Hugh
-
Yes. I hadn't thought of that. I am using a dedicated disk.
Regards
Oliver
-
