Linux: Updated Autoregulated VM Swappiness Patch

Submitted by Jeremy
on May 30, 2004 - 10:17am

Following the recent attention focused on swap [story], Con Kolivas [interview] has released a new version of his autoregulated VM swappiness patch [story]. The patch has been updated since its first announcement, fixing a few bugs and adding new functionality. Con explains:

"[The] amount of swap space consumed is also taken into account, and the size of it compared to the physical ram is taken into consideration when making its effect on the value of swappiness. With this patch, this should make any machine that has swapspace as resistant to OOM as possible. This version by default autoregulates the swappiness, but also allows you to choose a manual setting if you so desire by echo 0 > /proc/sys/vm/autoswappiness and then setting the swappiness the manual way as previously. This makes comparison with autoregulation easy."


From: Con Kolivas [email blocked]
To: Linux Kernel Mailinglist [email blocked]
Subject: [PATCH] Autoregulated VM swappiness
Date: 	Sun, 30 May 2004 23:30:41 +1000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

With all the recent attention paid to swap I thought I'd resync this patch 
with 2.6.7-rc2 in it's more recent incarnation.

What this patch does is this:
The swappiness is made proportional to the amount of ram consumed by 
application pages, and inversely proportional to the amount of the last 
(sizeof physical ram) of swap ram used up. This has the effect of hardly ever 
using swap if you have large amounts of cached data by say copying an iso 
image or manipulating video files. Conversely if you have lots of 
applications running at once it will allow the less used ones to swap out by 
increasing the swappiness, giving preference to the "foreground" one. 

Changes from the first version announced on lkml:
Amount of swap space consumed is also taken into account, and the size of it 
compared to the physical ram is taken into consideration when making it's 
effect on the value of swappiness. With this patch, this should make any 
machine that has swapspace as resistant to OOM as possible.
This version by default autoregulates the swappiness, but also allows you to 
choose a manual setting if you so desire by

echo 0 > /proc/sys/vm/autoswappiness

and then setting the swappiness the manual way as previously. This makes 
comparison with autoregulation easy.
A few bugfixes.

Con


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAueIEZUg7+tp6mRURAnx5AJ9bkBsu+nAxT3fXJe2qLQ3PSYVL7wCbBrrC
pZOZHLOBhVsVdrd1a1ozKS4=
=XeIZ
-----END PGP SIGNATURE-----
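
As a rough illustration of the behaviour Con describes, here is a toy model in Python. It is not the patch's code: the function name, the page-count parameters, the linear scaling and the reading of "the last (sizeof physical ram) of swap" as a RAM-sized window at the end of swap are all assumptions made for the sketch.

```python
def auto_swappiness(app_pages, ram_pages, swap_free_pages, swap_total_pages):
    """Toy model of an autoregulated swappiness value in the range 0-100.

    Hypothetical helper for illustration only; the real patch may compute
    the value quite differently.
    """
    if ram_pages <= 0:
        raise ValueError("ram_pages must be positive")
    # Proportional part: percentage of RAM consumed by application pages.
    app_percent = 100 * min(app_pages, ram_pages) // ram_pages
    if swap_total_pages <= 0:
        return 0  # assumption: without swap there is nothing to regulate
    # Assumed meaning of "the last (sizeof physical ram) of swap":
    # a window at the end of swap that is at most as large as RAM.
    window = min(ram_pages, swap_total_pages)
    used_in_window = max(0, window - swap_free_pages)
    # Assumption: scale swappiness down linearly as that window fills up.
    return app_percent * (window - used_in_window) // window
```

Under these assumptions a machine whose RAM is mostly cache gets a low value, a machine full of applications gets a high one, and the value falls back towards zero as the last of the swap space is consumed.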


Hi, Con! Nice to see you back!

on
May 30, 2004 - 10:43am

There were a lot of scheduler changes this time...

What can we expect from you regarding improved startup time for apps?

Can it be done on the current scheduler, or do you plan to release another patch of the staircase scheduler?

Anyway, what do you think of http://ebs.aurema.com/ ?

Thanks!!!!!!

Thanks!

on
May 30, 2004 - 10:54am

Nice to be welcomed back :-)

I've already done a resync with a few bugfixes and updates to the staircase for 2.6.7-rc1 and it's on my website. In terms of interactivity and startup this one is about as good as it's going to get, but I need to address some corner cases that can occur. This may take me some time to say the least.

I think the ebs is an awesome idea but it needs a heck of a lot of testing - I can't say I've tried it myself yet.

What I do know is we'll have quite a few schedulers for 2.7 as I see pluggable schedulers not far away.

Cheers.

Thanks for the response!

on
May 30, 2004 - 11:31am

I will try your recent changes now!

another CK kernel?

Anonymous
on
May 30, 2004 - 1:19pm

Con, good to see you back!

Will there be another -ck kernel for 2.6.7 as you had for 2.6.4?

Thanks.

Not long

on
May 30, 2004 - 4:46pm

Probably within a week. I was hoping to have a complete version of the staircase ready by now and start working on -ck but I don't think that's going to be the case. It will be a better version than the one in 2.6.4-ck but it will not be the final version.

Fortunately the stuff I've been putting into -ck has actually been sneaking into 2.6 mainline meaning I'm not the only person who thinks these things are important ;-) Some things I have will never be merged, though, so it means there will always be something for me to make a -ck with.

User memory vs. file cache

Anonymous
on
May 30, 2004 - 10:48am

Does swappiness or anything else allow you to control how the physical memory is split between user memory and file cache?

No

on
May 30, 2004 - 10:56am

There is no simple control that does such a thing as far as I'm aware. However a low swappiness value will, as the memory fills up, bias towards "user memory" as you have called it. There is no such thing as "user memory", but I understand what you're asking for.

User memory vs. file cache

on
May 30, 2004 - 10:58am

How do you define user memory? A file can be memory mapped, which means the file cache is directly readable (or even writable) through user address space. Though this is normally considered to be cache, it is somehow more important than the rest of the cache. Removing too many memory mapped pages from the page cache would hurt performance as much as swapping too much. And in fact small memory and no swap would mean too little physical memory for memory mapped files.

But not all memory mapped files are equally important. I once wrote a program that would memory map a 700MB iso file. Not because I really needed it to be memory mapped, I could as well have used read and write system calls. I just used mmap to reduce the system call overhead. Had the kernel tried to keep this file in RAM it would have failed.

Swappiness

Anonymous
on
June 1, 2004 - 3:50am

Yes, I believe swappiness (/proc/sys/vm/swappiness) is what you're looking for. Or the closest you'll get to it.
You can't tell the kernel exactly how much RAM to use for each, but you can e.g. set swappiness to 0 so "user memory" is swapped out to make room for file cache only as a last resort.
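
For anyone who wants to script this, a minimal sketch in Python follows. The helper names are made up for the example; the path is the standard sysctl location, the 0 to 100 range is the one 2.6 kernels accept, and writing to the real file requires root.

```python
from pathlib import Path

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"  # standard sysctl location


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def write_swappiness(value, path=SWAPPINESS_PATH):
    """Set swappiness; the real /proc file is writable only by root."""
    if not 0 <= value <= 100:
        raise ValueError("swappiness must be between 0 and 100")
    Path(path).write_text(f"{value}\n")
```

The path parameter is there so the helpers can be tried against an ordinary file before touching /proc.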

At Con Kolivas

Anonymous
on
May 30, 2004 - 7:45pm

Hi, glad to see you active again. Could you do a re-diff of the staircase scheduler and the autoregulated VM swappiness patch against the -mm kernel?

Thanks in advance.

Staircase for -mm

on
May 31, 2004 - 2:12pm

While I understand it may be more effort than it's worth, I would like to second the notion of having diffs of CK patches against -mm! :D

Unofficial diff of staircase against 2.6.7-rc1-mm1 (patch applied with 4 failed hunks out of 37, fixes were trivial) -- Staircase 5.5 for 2.6.7-rc1-mm1

Time constraints

on
June 1, 2004 - 1:07am

While I greatly appreciate your enthusiasm for my work I have limited kernel development time available and prefer to diff only to one tree to allow me to spend more time on the raw development. Also -ck is traditionally a stable series and I'd like to have it keep that reputation. I have no problem with unofficial versions being ported to other trees though ;)

Uh-oh

Anonymous
on
May 31, 2004 - 5:41pm

Probably someone will declare a "fatwa" against me for posting this here, but that reminds me of a dialog:

Virtual memory size:
[_] Fixed
[_] Let Windows calculate the size for me.

Spooky, huh?

Perhaps, but they're different things

Anonymous
on
May 31, 2004 - 6:38pm

It might be vaguely reminiscent, but that dialog is something else entirely. In that particular case, you're picking the size of your swap file, or letting Windows pick its size for you.

That's rather different than this patch.

Ok.

Anonymous
on
May 31, 2004 - 7:44pm

You're right. I posted it in a hurry (and somewhat tired). My apologies.

For what it's worth...

on
June 1, 2004 - 5:58pm


For what it's worth, it might be worthwhile to have some combination of the autoswappiness patch and automatic swap sizing in a desktop distro.


Granted, when you have a swap partition (as opposed to a swap file), the swap size is picked up front as part of the install. I seem to recall most installers at least suggest an initial partition layout.


None seem to suggest anything like my favorite algorithm, which is to:


  • Put one swap partition on each hard drive
  • Put the swap partitions as early as possible on the drive
  • Mark swap partitions on all fast primary drives (masters on IDE chains) at equal priority so swap allocations "round-robin"
  • Mark all other swap partitions low-priority so they only get allocated if the others fill up. Give the lowest priority to drives on chains that will see a lot of disk usage (e.g. mastering DVDs).


That effectively spreads the swap loading out, and it seems to speed up swap-in in my limited testing. Granted, the last time I tested it heavily with a swappy load was in the 2.2 days, but old habits die hard. I'm still not running 2.6 yet (though I will w/ my next computer).



One variation on the algorithm is to put the partition for /tmp ahead of the swap partition.
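
The priority scheme in the list above maps onto the pri= option for swap entries in /etc/fstab, where higher numbers are used first and equal numbers are used round-robin. A small Python sketch that generates such entries is shown here; the function name, the device names in the usage note and the particular priority values (10, 5, 1) are assumptions for illustration.

```python
def swap_fstab_lines(fast, slow, busy=()):
    """Return /etc/fstab swap entries for three classes of swap partition.

    fast: partitions on fast primary drives (all share one priority)
    slow: other partitions, used only when the fast ones fill up
    busy: partitions on heavily used chains, used last of all
    """
    def entry(device, priority):
        return f"{device} none swap sw,pri={priority} 0 0"

    lines = [entry(dev, 10) for dev in fast]   # equal priority: round-robin
    lines += [entry(dev, 5) for dev in slow]   # lower priority: overflow only
    lines += [entry(dev, 1) for dev in busy]   # lowest priority: last resort
    return lines
```

For example, swap_fstab_lines(["/dev/hda2", "/dev/hdc2"], ["/dev/hdb2"]) yields two entries with pri=10 followed by one with pri=5.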

i don't get it

on
June 1, 2004 - 4:06pm

This seems like the perfect solution to most of the issues raised recently w.r.t. swap usage.
Also, Con seems to be one of the few people who actually understands what the complaints are. ;-) I dunno, but the linux kernel mailing list seems opaque to things related to pure desktop usage.

So, what's the reality? Why is the autoregulated swappiness patch not getting more attention from the core kernel developers? To me, the patch seems like it's filling an obvious hole in the kernel's behaviour.

Feedback

Anonymous
on
June 1, 2004 - 9:13pm

I guess that's because there have been precisely zero reports back to the mailing list saying how it performs. If it's a desktop user issue then the developers won't know if it helps if the desktop users don't report back.

Swap

Anonymous
on
June 1, 2004 - 11:37pm

What's all the talk about swap recently, is there a bug?

Story headline

Anonymous
on
June 2, 2004 - 12:39am

Read the first line of this story and check out the link.

VM and Swap

on
June 2, 2004 - 1:06am

You're close with the comment on "pure desktop usage" -- the kernel devs have many other priorities to deal with when tuning VM and swap usage, pure desktops being only one tiny part of the whole. On top of that, there've been some BIG changes to the VM code thanks to both Andrea Arcangeli and Hugh Dickins with more changes coming down the pipeline so it may be a while before things settle down ...

Dynamic Caching?

Anonymous
on
June 2, 2004 - 2:09pm

Forgive me for asking (or assuming).. I've only been using Linux about a year or so. Is disk caching/buffering dynamic? It has appeared to go up/down as memory conditions change.. if this is allowed to push program memory out to swap, would that not be a problem with the dynamic cache size directly? This would seem a better place to patch than adjusting swappiness on the fly -- which does seem indirect. I am curious about how the disk caching does work if anyone has a link, thanks.

Linux seems to gobble up the

on
June 2, 2004 - 5:08pm

Linux seems to gobble up the entire available memory and allocate it as a disk cache. Linux systems with less than 95% memory usage are unusual, precisely because of this aggressive cache allocation. But that is ok.
What some people (me included) complain about is that the cache is so aggressive that it pushes out processes. This is very much the case with the 2.4 kernel, it is less an issue with 2.6 but it still happens a bit too much.
Con Kolivas' patches try to address this problem and make the kernel "smarter", so that "it knows" when to be aggressive and when not to.
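
To see how much of the "used" memory is really cache, the figures in /proc/meminfo can be split up. The Python sketch below does this in a simplified way; the function names are made up, and treating Buffers + Cached as "the cache" is an approximation (the remainder also includes kernel memory, not only applications).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_breakdown(info):
    """Split used RAM into cache and everything else (values in kB)."""
    cache = info["Buffers"] + info["Cached"]
    used = info["MemTotal"] - info["MemFree"]
    return {
        "cache_kb": cache,
        "other_kb": used - cache,  # applications plus kernel memory
        "used_percent": 100 * used // info["MemTotal"],
    }
```

On a live system the input would come from open("/proc/meminfo").read().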

It is dynamic, and it's more slippery than you might think

on
June 2, 2004 - 5:16pm

The disk cache is dynamic. Actually, the notion of "disk cache" is pretty slippery, since it's not quite what you may think it is. This isn't like MS-DOS's "SmartDrive" or Norton SpeedDisk.


The description below is mostly from memory, with some slight refreshes from Mel Gorman's VM Documentation for Linux 2.4. It also is oversimplified and probably inaccurate in places. Mainly it is intended to make a point.


Once upon a time, there were two caching structures--the buffer cache and the page cache. The buffer cache strictly held copies of blocks from the disk. The page cache held pages that could be mapped into process address space. The buffer cache is the closest thing Linux had to something like the disk caches of old. IIRC, starting with 2.4, the buffer and page cache are unified. Thus, what you have for a disk cache is really an all-purpose beast.


Pages are the work-horse of the VM. They can be file-backed or anonymous. They can hold data or they can hold code. They can be "clean" or "dirty." They can be "read only" or "copy-on-write." All of this is quite important.


Nearly every bit of code or data your applications want to touch is held in a page. Thus, keeping an application's working set in memory is really a process of keeping the pages it needs in memory. When you don't have enough memory for all the pages you might want to access, you have to make room somehow.


There are two main strategies for making room: Throw some pages away, and push some pages to swap. Both strategies will require you to re-read the pages in the future. How pages are handled depends on how they're categorized. One way to break it down is as follows:


  1. File-backed pages that are never written to can be thrown away and re-read. This includes pages that hold data files, and pages that hold executable code. These pages will be scattered all over the disk, and so the cost of bringing them back into memory may be large.


  2. File-backed pages that are read-write and have been written to (aka. are "dirty") can be written back to their respective files and then discarded. The cost of discarding these is higher, since they must be written back. The cost of bringing these pages back into memory is the same as for the other file backed pages. Periodically writing back dirty pages (e.g. what bdflush does) keeps this set small.


  3. Anonymous pages (such as those that back 'malloc()') and file-backed copy-on-write pages can't be written to any file. File-backed copy-on-write pages are typically program pages and library pages that have been written to by the dynamic linker. These clearly cannot be written back to their source files. Anonymous pages don't have an associated file. To move these pages out of the way, the only recourse is to move these to swap. The cost of writing the pages to swap can be low if the swap is well organized. The cost of bringing the pages back in is similar to bringing back file-backed pages, with a small chance of it being cheaper if the swap is laid out well.
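
The three categories above can be summed up as a small decision function. This is a Python sketch of the classification only, with made-up names; it is not how the kernel represents pages or makes the decision.

```python
def reclaim_action(file_backed, dirty, cow_copied=False):
    """Return how a page could be moved out of memory.

    file_backed: the page maps a region of a file
    dirty:       the page has been modified since it was read
    cow_copied:  the page is a private copy made by a copy-on-write fault
    """
    if not file_backed or cow_copied:
        return "swap"       # category 3: no file to write it back to
    if dirty:
        return "writeback"  # category 2: write to the file, then discard
    return "discard"        # category 1: clean, can simply be re-read
```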




Side note: Swap is interesting, because it offers an on-disk home for all the pages in that third category. When pages are written out to swap, they now have an on-disk home. When they're subsequently brought back in, the copy in swap doesn't change. As long as the page isn't written to later, it can be discarded without re-writing it to swap. In Linux, the "swap cache" handles this optimization.


But anyway, back to my point. If you eliminate swap, you only allow file-backed pages to be moved out of memory. The balance between swap and non-swap comes down to choosing between discarding never-written file and program pages and evicting pages that can only live in swap. It's not as simple as "this is my disk cache, and it can only get this big."



Balancing the composition of the page cache is tricky business. Some portion of it will be file-backed pages, and some portion won't. Some portion will be swapped, some portion will be dropped and re-read. The "swappiness" control changes the bias, but there's no good answer. Ultimately, your goal is to reduce your I/O to maximize CPU utilization. If your working set is too large, you'll get excess I/O, period. VM tuning is all about making the transition region between "too large" and "not too large" as pleasant as possible.



Why are desktop loads so special, and the cause of so much controversy? Because the user's definition of working set ("the windows I can see and want to point my mouse at") has little to do with the actual working set ("the tasks that aren't sleeping"). If some job comes up in the background (such as everyone's favorite whipping boy, updatedb), the OS legitimately considers it the current working set. The user who sits down at the machine later and tries to interact with the now-swapped Mozilla window wonders why his "working set" got pushed out.

Caching

on
June 3, 2004 - 2:24pm

Norton SpeedDisk.

Wasn't SpeedDisk their defragmentation tool? I guess they also offered caching, but I don't remember the name of the program.

They can be "read only" or "copy-on-write."

You surely also have a lot of ordinary read-write pages.

Oops.

on
June 3, 2004 - 5:43pm

Wasn't SpeedDisk their defragmentation tool? I guess they also offered caching, but I don't remember the name of the program.


Durrrh... yeah. Norton NCache. At least, I remember it being "ncache.exe"--it may've had a fancier name. I ran it long enough back in my DOS days, I should remember the name. But, then, most of us try to forget pain...



    They can be "read only" or "copy-on-write."


You surely also have a lot of ordinary read-write pages.



Yes, I missed that, but for some reason when I went to edit my post, KernelTrap wouldn't let me. I got an "edit post" page that was otherwise blank. I hope at least the rest of my post made it clear you could have file-backed pages that are writable.

Thank you for the detailed in

on
June 3, 2004 - 2:34pm

Thank you for the detailed information!

Technical details aside, from a dumb user's p.o.v., the problem is quite simple: if i use a machine with lots of RAM and no swap, it "feels" great. Quite a few people who are into video editing and music/sound simply disable swap, because when the machine starts using the swap, the responsiveness goes down the drain, video frames get dropped, MIDI events get missed or delayed, etc.
For time-sensitive applications, the 2.4 kernel with swap is a nightmare. Things improved somewhat with 2.6, but no cigar.

Well, i have to admit i have

on
June 2, 2004 - 5:04pm

Well, i have to admit i have very strong biases, due to the work i do with Linux. I'm into video processing, and also into music and sound (MIDI, DAW, sequencing, effects). Neither of these fields has much tolerance for unresponsive machines. Heavy swap use is certainly the hallmark of a not too responsive system.
What's extremely annoying, from my p.o.v., with the current way the linux-2.4 kernel handles memory is that (again, from the p.o.v. of the kind of work i do) it seems to start swapping out of the blue! I'm like - all the processes use maybe 60% of the memory, the rest is disk cache, what's up with this sudden urge to scratch deep marks on the hard-drive in the swap partition area?
Linux-2.6 is slightly better in this regard, but it still looks like there's room for improvement.

Among the people who use Linux the same way as i do, there's almost complete agreement that the CK patches will automatically work better than the plain kernel.

Well, note some of the caveats

on
June 3, 2004 - 5:51pm

Basically, for those applications, it sounds like Linux too aggressively prefers to move pages to swap as opposed to dropping nice, clean file-backed pages. For latency sensitive applications, you really need to make sure you've set the tasks to a real-time priority, and make sure you mlock() all the buffers in RAM so they don't get paged. For streaming apps, I'm not sure what exactly you do.
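
As a minimal sketch of the mlock() part, the Python below calls the C library's mlock() through ctypes on a single buffer. The helper name and the 4 KiB buffer are hypothetical, and the call can legitimately fail (for example when the locked-memory limit is too low), so the result is reported rather than assumed.

```python
import ctypes
import ctypes.util


def lock_buffer(buf):
    """Try to pin a ctypes buffer in RAM with mlock(); return True on success."""
    # find_library may return None, in which case CDLL loads the symbols
    # already linked into the running process (which include libc on Linux).
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    return libc.mlock(ctypes.addressof(buf), ctypes.sizeof(buf)) == 0


audio_buffer = ctypes.create_string_buffer(4096)  # hypothetical 4 KiB buffer
locked = lock_buffer(audio_buffer)
```

A real application would lock all of its time-critical buffers and would normally do so from C, but the call and its arguments are the same.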


Notice one of the caveats I gave about swap in my post: Swap has potential to perform as well or better than just discarding clean pages if it is laid out well. I suspect on most systems, swap isn't laid out very well at all after a while--you get a seek storm with every burst of swap activity.



For large video applications and what-not, I can see this being particularly frustrating, since most of what's in that disk buffer won't be referenced again any time soon. You're either streaming, or you're doing non-linear editing, but in any case, the context is so huge that it is a loss for the kernel to try to keep some of it in RAM--its LRU just won't do the trick.


I seem to recall an O_STREAM attribute being bandied about at some point, but I don't know if it ever went in. Its purpose would be to identify a file as a "huge freakin' stream you shouldn't bother caching too aggressively."
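
Whatever became of O_STREAM, an interface that does exist for giving the kernel a similar hint is posix_fadvise(). The Python sketch below uses it as a stand-in, not as the flag recalled above: it reads a file sequentially and tells the kernel that data already read need not stay cached. The function names are made up, and since the advice is only a hint, failures are ignored.

```python
import os


def _advise(fd, offset, length, advice_name):
    """Pass a caching hint to the kernel, ignoring systems that reject it."""
    advice = getattr(os, advice_name, None)
    if advice is None or not hasattr(os, "posix_fadvise"):
        return
    try:
        os.posix_fadvise(fd, offset, length, advice)
    except OSError:
        pass  # advice is optional, so a refusal is not an error here


def read_stream(path, chunk_size=1 << 20):
    """Read a file front to back, dropping each chunk from the cache after use."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        _advise(fd, 0, 0, "POSIX_FADV_SEQUENTIAL")
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            # ... process the chunk here ...
            _advise(fd, total, len(chunk), "POSIX_FADV_DONTNEED")
            total += len(chunk)
    finally:
        os.close(fd)
    return total
```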

My swap is not being used.

Anonymous
on
June 2, 2004 - 3:01pm

I have 503 MB of physical memory. After I start my everyday apps, 268 MB of the 503 MB of physical memory is in use and 0 of the 486 MB of swap is used.

Why?

Enough RAM

on
June 3, 2004 - 2:15pm

It doesn't use any swap at that point, because so far you have enough RAM. As long as you have not been able to use all of your 503MB, there is not much point in writing pages to swap. OK, actually there would be some reason to start writing pages to swap, if the disk was otherwise idle. Because by writing little used pages now, the time to write them could be saved later when you actually need to free some memory. But AFAIK that feature has not been implemented in Linux.

Tied to scheduler 'interactivity' ?

Anonymous
on
June 2, 2004 - 5:54pm

The kernel should not swap out (or should at least be less eager to swap out) the pages belonging to processes that the scheduler has deemed to be 'interactive'. Maybe the swapping algorithm could be tweaked to make use of the scheduler info?

Anybody think this is a good idea?

Jim C

except !

Anonymous
on
June 2, 2004 - 6:28pm

... except for those processes that the I/O scheduler detects to have lots of I/O. (was thinking about process scheduler in parent post).
