Linux: Using Multiple Swap Partitions In 2.4

Submitted by kmerley on June 10, 2005 - 7:02pm.
Linux

One day I was searching the web and found either a SuSE or Red Hat site that was saying that you could set up your swap in a RAID fashion. They were talking about a large server with a lot of disk drives, and you could put a swap partition on many of them, and set all these swap partitions to the same priority. This way they would work more like they were in a RAID setup, and the speed of swap writing and reading from the disks would be improved.

Also, near the end of the replies to the 'How to use RAM as Swap' article, there was mention that someone should be reworking the swap algorithm so it didn't use such a simple and slow search method for finding and using swap slots. I believe it was a mention of Andrew Morton himself saying something like that in an lkml email. I don't know if that has been done, but the 2.6 kernel swap is a lot different than the 2.4, so maybe it was at least attempted.

In any case, in 2.4, if I break up my 512 MB swap partition into 4 different 128 MB partitions, and then give them all the same priority, it changes my swap-in wait period quite a bit. This is with all the swap partitions on the same drive, so the limiting factors are not IDE transfer speeds here. With the smaller partitions my max wait goes down from 60 seconds to 30 seconds (on the old laptop), and the average wait of 30 seconds is down to about 10. That makes it a lot more usable. It is much nicer. And on the old laptop there is no possibility of adding RAM, which would help more if it could be done.

Another interesting thing I encountered is that recently, when I installed MEPIS, on a drive that already had the smaller swap partitions, it automatically picked up the swap partitions and made them all the same priority. That was great. I usually have to manually edit fstab and maybe use mkswap on some of the partitions to get them to mount and to mount with the same priority.

Apparently this is already being used. I guess I don't know why MEPIS (3.3.1 I believe) does this automatically but I suspect it is for this RAID-like speed increase.

So, this does indicate that the swap algorithms must be somewhat inefficient. If hdparm gives a result of 10 MB/sec (slow by today's standards), then it should only take a few seconds to swap your application back in, but instead it is taking 30 seconds (with just one swap partition), so the effective read-in speed is only a small percentage of what the drive can deliver. Since the processor and memory are so much faster than drive data transfer, that points to some kind of tie-up in the swap algorithms. The fact that using more swap partitions on the same drive speeds things up also suggests that the swap algorithm is inefficient, but can be worked around to some extent.
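As a rough back-of-the-envelope check of that reasoning, here is a minimal shell sketch. The 30 MB application size is only an assumption for illustration (I did not measure it); the 10 MB/sec rate and the 30 second wait are the figures quoted above. The function names are made up for this example.

```shell
#!/bin/sh
# Ideal time (whole seconds, rounded up) to read size_mb at rate_mb_s.
ideal_seconds() {
    size_mb=$1
    rate_mb_s=$2
    echo $(( (size_mb + rate_mb_s - 1) / rate_mb_s ))
}

# Percentage of the raw drive rate actually achieved during swap-in.
efficiency_percent() {
    size_mb=$1
    rate_mb_s=$2
    observed_s=$3
    echo $(( 100 * size_mb / (rate_mb_s * observed_s) ))
}

ideal_seconds 30 10          # assumed 30 MB to read back at 10 MB/sec
efficiency_percent 30 10 30  # same data, but 30 seconds observed
```

Under those assumptions the drive alone would need about 3 seconds, so a 30 second wait means roughly a tenth of the raw transfer rate is being used.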

This helps up to a point. When I tried 16 quite small swap partitions the speed was a little worse than with just one large partition, so in that case it must be getting tied up in handling so many partitions. So in my case, breaking up a 512 MB swap partition, four partitions of 128 MB was about optimum. Your mileage may vary.

To try this yourself, if you have the swap partition as the last partition on the drive, you can (when not booted into linux on the drive you will be working on) delete the swap partition and make 4 swap partitions of one quarter the original size. If the swap partition is not the last partition, there will be some problems with making more partitions before the main Linux partitions, such as the boot loader will point to the wrong partition and you will have to repair that.

One way to do this is to boot from a rescue CD or Knoppix and do the partitioning from there.

But for a new installation, just put the swap partition(s) as the last partition(s) on the drive. Then you can slice it up later and not disturb the booting capability.

If you do have the situation where the swap partition is the last partition or you have space at the end of the drive or can make space by changing the next to last partition size, then you can put more swap partitions on the last of the drive and try this.

Your hdxx values will probably be different than those in the example below. To find your original swap partition value run:

fdisk -l

And look for the swap entries. (WARNING: If they aren't at the end of the disk, don't try this unless you know how to recover from changing the hdxx value of your booting partition from adding partitions before it.)
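To pick the swap entries out of a long fdisk listing, a small filter like the one below may help. It assumes the type column reads "Linux swap", which is how fdisk normally labels partition type 82. The function name is made up for this example.

```shell
#!/bin/sh
# Print the device name of every line that fdisk labels as Linux swap.
# Reads an "fdisk -l" listing on standard input.
swap_devices() {
    awk '/Linux swap/ { print $1 }'
}

# Typical use (needs root to read the partition table):
#   fdisk -l | swap_devices
```

It only reads the listing and changes nothing on the disk.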

Use your favorite partitioning software and delete your current x MB swap partition. Then make 4 partitions of equal size (that add up to the original partition size) as swap partitions. Run mkswap on them if you need to. Then modify /etc/fstab so it has 4 swap lines, and put "pri=8" in the options field (the fourth column) of each line, as in the example further down.

(make a copy of fstab before you try this, for example:
cp /etc/fstab /etc/fstab.bak ).

(If you run into trouble and just can't get this to work, delete the small partitions, remake your original swap partition, and then restore your original fstab file from the fstab.bak. Your original bootloader file should work again too, unless you modified that and didn't make a backup copy of it.)

The swap line in the unmodified fstab file for just one swap partition may read:

/dev/hda6     swap      swap    defaults   0  0

And you will then modify it and add entries so it reads, for example:

/dev/hda6     swap      swap    sw,pri=8   0  0
/dev/hda7     swap      swap    sw,pri=8   0  0
/dev/hda8     swap      swap    sw,pri=8   0  0
/dev/hda9     swap      swap    sw,pri=8   0  0
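If you would rather generate those lines than type them, here is a small sketch. The function name is made up for this example; the device names and the priority of 8 are the ones used above. It only prints the lines, so nothing changes until you paste them into /etc/fstab yourself.

```shell
#!/bin/sh
# Print one fstab swap entry per device, all with the same priority.
# Usage: swap_fstab_lines PRIORITY DEVICE...
swap_fstab_lines() {
    pri=$1
    shift
    for dev in "$@"; do
        # No space is allowed inside the comma-separated options field.
        printf '%s\tswap\tswap\tsw,pri=%s\t0\t0\n' "$dev" "$pri"
    done
}

swap_fstab_lines 8 /dev/hda6 /dev/hda7 /dev/hda8 /dev/hda9
```

To try a new partition without rebooting, running mkswap /dev/hda7 followed by swapon -p 8 /dev/hda7 as root should activate it at the same priority.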

Then, after saving this file, you can reboot and see if everything worked, or whether you didn't prepare all the partitions correctly and need to run mkswap, or made typos, or whatever. Suffice it to say, if you don't feel comfortable working at this level, maybe you should have a more experienced Linux person help.

You can check it with:

free -mt

That will tell you how much space there is in swap, and:

swapon -s

will tell you what swap partitions are active and what their priorities are, plus more information.
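A quick way to confirm that all the active swap areas ended up with the same priority is to look at the last column of that output. The sketch below assumes the usual five-column layout (Filename, Type, Size, Used, Priority) that swapon -s and /proc/swaps print; the function name is made up for this example.

```shell
#!/bin/sh
# Succeeds only if there is at least one swap area and all of them
# share one priority. Reads /proc/swaps-style text on standard input.
same_priority() {
    awk 'NR > 1 { seen[$5] = 1 }
         END {
             n = 0
             for (p in seen) n++
             exit (n == 1 ? 0 : 1)
         }'
}

# Typical use:
#   swapon -s | same_priority && echo "all swap areas share one priority"
```

If the check fails, the usual culprit is a line in fstab that is missing the pri= option, in which case the kernel assigns that partition a different (lower) priority.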

Anyway, with the original swap partition sliced into 4 smaller ones, and if they are all mounted and running, there should be a definite decrease in time spent swapping. This is something that might really help especially if you can't increase your amount of memory and you must use swap. Of course it won't speed swap operations up anywhere near the speedup from adding some RAM and replacing your swap partition with swap in a ramdisk or set of ramdisks.

Andrew Cox?

June 11, 2005 - 5:01am
GetALife (not verified)

Alan surely?

Please not more mention of this stupid RAM as SWAP thing. So what now, create multiple RAM disks and set them all up as SWAP partitions with equal priorities?

Exactly

June 11, 2005 - 1:15pm
Janne (not verified)

Using RAM as swap is a pretty stupid idea IMO. I mean, what is swap? Basically, it's HD-space that is being used as RAM. So now we will have RAM emulating swap emulating RAM. What's the point?

I have heard some people getting lots of RAM and using some of it as swap. What I would rather do is to get lots of RAM, and get rid of swap altogether.

RAM as SWAP

June 11, 2005 - 3:52pm

In the past, there have been pentium-mainboards where only some amount (not all) of RAM could be cached by the CPU. There it really WAS an improvement to e.g. use the first 64 MB of main memory as normal RAM and the other (non-cacheable) 64 MB as a ramdisk... because as soon as the uncached memory was accessed everything slowed down to a crawl.

Actually it should be Andrew Morton

June 11, 2005 - 1:18pm

Well, there would be a gain in swapram by setting the small ramdisks all to the same priority, but it might not be noticed because it is running so fast anyway.

But NO, in this article I was only talking about multiple swap partitions replacing the larger single swap partition on a hard drive. And it is amazing to me that it helps even when these partitions are on the same drive.

Andrew Morton, not Andrew Cox

June 11, 2005 - 2:51pm

I must apologize for getting Andrew Morton mixed up with Alan Cox. The email discussing the swap block allocator is from Andrew Morton.

Performance

June 11, 2005 - 6:10am
Anonymous Person (not verified)

Splitting swap does not help, the issue is related to hard disk seek time.

Splitting swap should not hel

June 11, 2005 - 11:40am
Donkey (not verified)

Splitting swap should not help, but that doesn't necessarily preclude it from doing so. This particular post doesn't make a very strong argument that anything has actually improved. Improved await in particular doesn't necessarily translate into faster swap. It might be 50 percent lower, but added together it's twice as much. When doing actual science here, it's important to not only tell your audience the method and results, but also make sure that you're actually measuring something related to the hypothesis.

Performance

June 11, 2005 - 12:55pm
Anonymous Person (not verified)

The fact is swap performs random reads and writes on the hard disk, whereby seek time makes a difference, of course along with the data transfer rate of the hard disk.

The other thing that could improve performance is DMA.

Performance

June 11, 2005 - 1:14pm
Anonymous Person (not verified)


>This particular post doesn't make a very strong argument that anything has actually improved.

The poster claims:


>Anyway, with the original swap partition sliced into 4 smaller ones, and if they are all mounted and running, there should be a definite decrease in time spent swapping.

How much more clear does it need to be made?


>When doing actual science here, it's important to not only tell your audience the method and results, but also make sure that you're actually measuring something related to the hypothesis.

This is no hypothesis.

No Hypothesis, Just Something that Works

June 11, 2005 - 1:31pm

Not a hypothesis, just a hack that works. I guess if it were pure swap striping, where the partitions are on multiple disks, that would be acceptable, but if the partitions are all on the same disk, why does swap speed increase?

No Hypothesis, Just Something that Works

June 11, 2005 - 2:19pm
Anonymous Person (not verified)


>why does swap speed increase?

The swap partitions could be positioned on different platters on the hard disk, I think this is what you are seeing.

Unlikely

June 12, 2005 - 2:02pm

If he is splitting one 512 MB partition into four, the new partitions occupy the same place the old big one was using, so it isn't a matter of position on the disk.

Yes, that is what I did

June 12, 2005 - 11:56pm

That is what I did on my 128 MB laptop. It is a no cost change. I had the swap partition as the last partition, so I could just use PartitionMagic to delete it and make 4 new swap partitions. Then I had to activate them and put them in the fstab file so they would be available after a reboot.

Now on another computer I put in a large swap partition and 4 smaller ones which added up to the same size as the larger one. That way I could have the computer running, set it up to a condition where it has at least as much in swap as the size of main memory (total swap was 2x main memory), and then swapon the 4 smaller ones and swapoff the large one, then check swap in times, then swapon the large one, and swapoff the 4 smaller ones, and check the swap times. (I let the computer sit overnight after the swapon-swapoff change, so it could 'stabilize' on the new condition and so that it would be sure to swap out the pictures. This cycle was repeated a few times.) Checking swap times means going to the pictures and clicking on their app and timing how long it took to bring them up to a usable state. Instances were of GIMP and eye of gnome, with 1.5 MB jpgs and 27 MB tiffs (from the Hubble telescope). With the large swap partition the times for this were about 15 seconds on that computer, and with the small partitions it was about 7 seconds. This is with swap striping on the same drive on adjacent partitions. 2.4 kernel. I don't think it works this way on the 2.6.
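For anyone who wants to repeat the swapon/swapoff switch, here is a minimal sketch of the sequence. The device names are hypothetical (hdb1 standing in for the large partition, hdb5 to hdb8 for the four small ones), and the function only prints the commands so you can review them before running them as root.

```shell
#!/bin/sh
# Print the commands that move swapping from one set of areas to another.
# Usage: switch_swap PRIORITY "ENABLE_DEVS" "DISABLE_DEVS"
# The new areas are enabled first so the system never runs short of swap
# while the old ones are being drained.
switch_swap() {
    pri=$1
    for dev in $2; do
        echo "swapon -p $pri $dev"
    done
    for dev in $3; do
        echo "swapoff $dev"
    done
}

# Hypothetical layout: hdb1 is the large partition, hdb5-hdb8 the small ones.
switch_swap 8 "/dev/hdb5 /dev/hdb6 /dev/hdb7 /dev/hdb8" "/dev/hdb1"
```

Swapping the two device lists gives the reverse direction. Note that swapoff has to pull everything in that area back in first, so it can take a while on a loaded machine.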

Hard drives already write/rea

June 12, 2005 - 10:58pm
Anonymous123123 (not verified)

Hard drives already write/read across the platters. All of the heads move together.

Could you please provide a test case for this?

June 12, 2005 - 11:44pm

>Anyway, with the original swap partition sliced into 4 smaller ones, and if they are all mounted and running, there should be a definite decrease in time spent swapping. This is something that might really help especially if you can't increase your amount of memory and you must use swap. Of course it won't speed swap operations up anywhere near the speedup from adding some RAM and replacing your swap partition with swap in a ramdisk or set of ramdisks.

Before I twist my partition table into a pretzel doing this, could you please provide some hard data? Performance metrics?

I'd probably write a script that:
1. Backgrounds a C program that malloc's 1 GB or so, writes random garbage to the whole thing, then sleeps for 30 seconds, and then sequentially accesses the entire malloc'd area.
2. 15 seconds later, I'd run eatmem to flush the entire malloc'd area to swap. Kill it 10 seconds later.
3. Report the time that it took for the C program to execute.
4. Lather, Rinse, Repeat - changing one parameter at a time. Heck, you could even have it parse a list of settings for the VM /proc interface, and reconfigure the swap settings between runs...
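A rough sketch of just the timing part of such a script, in plain shell. The memory-hungry C program and eatmem are not shown; the sleep command below is only a placeholder for whatever workload is being timed, and the function name is made up for this example. Resolution is whole seconds, which should be enough if the differences are as large as claimed.

```shell
#!/bin/sh
# Run a command several times and print the wall-clock seconds of each run.
# Usage: time_runs COUNT COMMAND [ARGS...]
time_runs() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s)
        "$@" > /dev/null 2>&1
        end=$(date +%s)
        echo "run $i: $(( end - start )) s"
        i=$(( i + 1 ))
    done
}

# Placeholder workload; substitute the real test program here.
time_runs 3 sleep 1
```

Running the same loop once per swap layout, with nothing else changed between runs, would give numbers that can be compared directly.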

If you can't meet that standard of proof, I'll keep to the recipes generated by people that have been administrating Unix machines since before I was born...

The Elusive Swap Speed Test

June 13, 2005 - 5:02am

Since sometimes the replies get separated from the comments to which they apply:

This is in reply to James4765, and his Comment:

"Before I twist my partition table into a pretzel doing this, could you please provide some hard data? Performance metrics?

I'd probably write a script that:
1. Backgrounds a C program that malloc's 1 GB or so, writes random garbage to the whole thing, then sleeps for 30 seconds, and then sequentially accesses the entire malloc'd area.
2. 15 seconds later, I'd run eatmem to flush the entire malloc'd area to swap. Kill it 10 seconds later.
3. Report the time that it took for the C program to execute.
4. Lather, Rinse, Repeat - changing one parameter at a time. Heck, you could even have it parse a list of settings for the VM /proc interface, and reconfigure the swap settings between runs...

If you can't meet that standard of proof, I'll keep to the recipes generated by people that have been administrating Unix machines since before I was born..."

--------------------------------------------------------------

There is a good chance that this suggested test will not be acceptable, especially if it shows an improvement by using striped swap partitions that are on the same drive. I don't want to end up playing "Bring Me A Rock." This is a game Andy Grove talked about, one he didn't want to be playing with customers. It goes like this.

"Bring me a rock."

So a person goes and gets a rock and brings it to the requestor.

"No, not that rock. Bring me another rock."

And so on.

Now I do think that a good standardized swap metric is what we need. I have thought about it, but haven't come up with it. The system is so dynamic, modifying itself all the time, compensating for changes we make in it. It tries to be self-correcting/adjusting.

So there is large probability that this particular test won't be acceptable.

You can just ignore this and keep to your past recipes. Fine. A lot less upsetting.

If you don't want to pretzelize your partition table, perhaps using an additional drive with a large swap partition and 4 smaller ones would be acceptable for testing? Perhaps you have no time for testing? Whatever, if you did have another drive (and this would require opening the computer, which is another thing you could choose not to do until there is enough evidence meeting your standards), and you set it up that way, and put it in the computer and then could test it to your own exacting standards, you wouldn't have to pretzelize your partition table on the main drive. You can test the large swap partition against the smaller ones, on the same drive so you are only changing one thing at a time.

But I won't be surprised to be told by others, as before, "if you write it, you have to provide all the evidence in a form acceptable to us (and any form that indicates your method is good will not be acceptable to us)."

I won't be surprised if someone tries it and doesn't do it right or in some other way kind of cooks the results to support their original contention. That is very common. There was just an article about a lot of scientists admitting they do this. If some evidence doesn't support your contention, throw it out, don't mention it, make remarks about the sanity or junk science of someone presenting that contrary evidence.

I think perhaps why this works, on my computers anyway, is that the smaller partitions can be searched in a smaller time. It partially compensates for the inefficiencies of the swap algorithms. That is probably part of it anyway.

As far as I know there is no standard swap speed test. If there is, let's do it.

My normal method is to get the amount of data in swap to be more than the size of RAM. If we have 256 MB of RAM, things start slowing down from swapping when there is more than 256 MB of swap used (maybe before that). To really slow things down, if we have 256 MB of RAM, use a 1 GB swap partition, and then we keep bringing up instances of whatever until we have more than 512 MB of swap used. Just about everything that is done after that causes swapping. Perhaps a "swap saturation" condition can be defined. Perhaps if we use 4x RAM as swap, and fill to at least 2x RAM, there may not be more slow down after that, because everything requires swapping after that, except locked items. Many have noticed this effect, and that is why there are the comments about reducing the size of the swap partition. Like, "you shouldn't use such a large swap partition".

So, perhaps a good swap test could be done with oversize swap partitions filled with more than twice the RAM in swap. It would be a slow test, but it should be easy to see speed increases, like from a minute to only 45 seconds.
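To know when that "more than twice the RAM in swap" condition has been reached, something like the following could be used. It is only a sketch: it assumes the MemTotal, SwapTotal and SwapFree lines found in /proc/meminfo, and the function name is made up for this example.

```shell
#!/bin/sh
# Print the amount of swap in use as a percentage of physical RAM.
# Reads /proc/meminfo-style text on standard input.
swap_used_percent_of_ram() {
    awk '$1 == "MemTotal:"  { mem = $2 }
         $1 == "SwapTotal:" { total = $2 }
         $1 == "SwapFree:"  { free = $2 }
         END { if (mem > 0) printf "%d\n", 100 * (total - free) / mem }'
}

# Typical use:
#   swap_used_percent_of_ram < /proc/meminfo
```

A result of 200 or more would correspond to the proposed saturation point, for example 512 MB of swap in use on a machine with 256 MB of RAM.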

You know, as many have made sure to tell me, the reason swap came to be is to compensate for not having enough RAM, but being able to spare space on the hard drive. So, if 1GB is needed, but the RAM is only 256 MB, it just won't work with a 512 MB swap. It would barely work with 256 MB RAM and 768 MB of swap, but it would sure be slow, but it would (should) work. The more there is in swap versus the size of RAM, the slower things can get. If we have 256 MB of RAM and 1GB of swap, and there is over 512 MB in swap, there is a good chance that a request for a dormant service will involve swapping out something that is in memory so that there will be space for whatever needs to be swapped in. And if the kernel does this in little 8 page chunks, that will take a lot of time.

All right.

June 14, 2005 - 12:02am

>Now I do think that a good standardized swap metric is what we need. I have thought about it, but haven't come up with it. The system is so dynamic, modifying itself all the time, compensating for changes we make in it. It tries to be self-correcting/adjusting.

But, they use performance metrics all the time for things as dynamic as the process scheduler and latency. No benchmark is perfect, but any benchmark is better than "seat-of-the-pants" feel. Once you have numbers, and can then figure out why you got those numbers, then you can work on fixing any real issues demonstrated by that test.

>But I won't be surprised to be told by others, as before, "if you write it, you have to provide all the evidence in a form acceptable to us (and any form that indicates your method is good will not be acceptable to us)."

>I won't be surprised if someone tries it and doesn't do it right or in some other way kind of cooks the results to support their original contention. That is very common. There was just an article about a lot of scientists admitting they do this. If some evidence doesn't support your contention, throw it out, don't mention it, make remarks about the sanity or junk science of someone presenting that contrary evidence.

But at least they published something. Give us numbers. Then we can debate the validity of your methodology - but with no study, and nothing published, we have nothing to go off of.

Anyone who goes against the accepted wisdom in scientific endeavour must face this gauntlet. If you truly believe in this, then prove it in a way that at least some people here will accept.

If it truly does work, then others will adopt it - but I'm not going to waste my time testing something that seems completely daft to me.

>As far as I know there is no standard swap speed test. If there is, let's do it.

No, you can do it. Or, ask politely at http://www.kernelnewbies.org/ - someone there might know of a tool you can use.

Once again, I don't care enough to put effort into determining if your unsubstantiated claims are valid - but I will look at the data you generate. That is the sign of a computer scientist. Wild, unreproducible claims are the sign of a perpetual-motion fanatic.

>The more there is in swap versus the size of RAM, the slower things can get. If we have 256 MB of RAM and 1GB of swap, and there is over 512 MB in swap, there is a good chance that a request for a dormant service will involve swapping out something that is in memory so that there will be space for whatever needs to be swapped in. And if the kernel does this in little 8 page chunks, that will take a lot of time.

Um, most swap-in operations are small. The kernel will try to do as little as possible to satisfy the request - and if that means reading in a single swap page at a time, then that's what it will do. There is some read-ahead behavior in the 2.6 disk scheduler AFAICR - someone care to elaborate on that?

There is no way to make thrashing stop - short of killing the VM overcommit, hacking the memory manager, and breaking a whole bunch of other things that make a modern Unix system work reliably.

My final $0.02: Substantiate. Substantiate. Substantiate. Without a methodology, and documentation, I will stick you with the cold fusion - perpetual motion - Popular Mechanics set of "inventors" who make bad patent attorneys very wealthy.

Cold fusion

June 14, 2005 - 7:27am
Wol (not verified)

Don't be nasty to the scientists studying cold fusion. There IS something there, the question being "what?".

The trouble is, Pons and Fleischmann completely buggered up the PR side of things and annoyed most of their fellow scientists (INCLUDING others studying cold fusion). Then, because they jumped the gun on publishing and published a screwed-up paper, the scientific establishment concluded that P&F were charlatans (wrong) and that their experiments were fakes (also wrong). Unfortunately, given the circumstances, such conclusions were only justifiable and to be expected.

FACT: P&F's experiments did do what they said they did.
FACT: Nobody else could copy them (because P&F's paper was crap, probably because P&F themselves didn't understand what on earth was going on).
FACT: P&F have completely ruined the field for future researchers, despite there being some very interesting phenomena going on here. And yes, there is still some research going on, but very much under wraps because of all the bad publicity brought on through P&F's stupidity.

Cheers,
Wol

Get the facts

October 2, 2007 - 9:19pm
Anonymous (not verified)

>FACT: P&F's experiments did do what they said they did.

P&F said lots of things about their experiments, some true, some false. They said that they got more energy out than they put in and that the platinum electrodes sometimes melted. They said they detected neutrons with a surprising energy spectrum, then with the expected spectrum, and finally that there were no neutrons.

They measured energy with calorimeters. These are notoriously difficult to get good results from, especially when used for hours. If P&F had read some old biology papers they would have found techniques to improve the accuracy of calorimetry. Slightly more modern papers would have shown that biologists moved away from calorimeters because they are indirect, toward methods that are simpler to use and give more accurate results. I am sure that some of P&F's results only showed more energy out than went in because the sources of error were larger than P&F understood.

No-one ever saw red hot platinum electrodes. What they saw was that some electrodes had massively changed shape. In P&F's experiments, large amounts of hydrogen dissolve in the platinum. This will reduce the melting point of platinum.

When two deuterium atoms fuse the result is a helium-4 atom in an excited energy state. The helium-4 ejects a neutron to get to helium-3 in its ground state. The energy of the neutron is predictable. The bad news is that neutron detectors detect all sorts of things. Removing all the bogus signals is a challenge for particle physicists, so it is not surprising that a pair of electrochemists got this badly wrong on their first attempt. It is astounding that they got it right within hours of publishing a neutron spectrum with the wrong peak energy. Eventually they said they detected no neutrons, just like everyone else.

>FACT: Nobody else could copy them (because P&F's paper was crap, probably because P&F themselves didn't understand what on earth was going on).

Other people did copy them, and got the same results. Plenty of students did not correctly account for the errors in their calorimeters. Michael Faraday noted melted platinum electrodes in the 19th century. As far as I know, no-one else reported neutrons, but I am sure they got some false positives while setting up the equipment.

>FACT: P&F have completely ruined the field for future researchers, despite there being some very interesting phenomena going on here. And yes, there is still some research going on, but very much under wraps because of all the bad publicity brought on through P&F's stupidity.

Near enough true. Grant requests for fusion experiments may well get examined with more suspicion than they would had P&F not quietly moved the axes on their neutron spectrum to make it look more convincing. There is some interesting electrochemistry here, but not nuclear physics. There is a massive lesson to be learned about listening to people familiar with the equipment you intend to use in your experiment. I have no idea how much cold fusion research goes on in secret, but I have read some interesting recent papers on cold fusion.

Think about that

June 11, 2005 - 1:23pm

"Splitting swap does not help, the issue is related to hard disk seek time."

Perhaps it is related to hard disk seek times. That would mean that using multiple swap partitions reduces seek time.

You can't just say "splitting swap does not help" if it does.

Argh not like this

June 12, 2005 - 5:54am

>You can't just say "splitting swap does not help" if it does.

It seems to me you have absolutely nada evidence for your claims
and your claims on a system that avoids swap on small
partitions.

In the future, please don't post this shit on the first page,
but instead discuss it on the forums or wherever. FUD like this
will lead to Gentoo syndromes :P

Partition seek times, probably bullshit. I'm not sure how the block
allocator works there, but I'm quite sure it seeks about as much
as it would were there only one continuous partition instead.

Because on your ricer scheme, everything's clumped in one partition
and by default it's clumped in the beginning of the partition,
which may fragment later on, but the seeks shouldn't vary much.

-- 
Markus Törnqvist

Multiple Swap Partitions

June 12, 2005 - 7:16am
Anonymous Person (not verified)


>That would mean that using multiple swap partitions reduces seek time.

No! There are more factors involved.

Very unusual that this sort o

June 11, 2005 - 6:41am
Antti S. Lankila (not verified)

Very unusual that this sort of thing would help at all.

The "swap striping" was invented to swap to multiple physical drives. Maybe the kernel's use of swap space somehow gets better with this change, but chances are changing some other setting would help, too (something like swap prefetch, if there's such a thing).

I would like to adjust this setting

June 11, 2005 - 1:42pm

Since in this forum the replies get separated from the comments to which they refer, here is the comment to which I am referring:

"Very unusual that this sort of thing would help at all.

The "swap striping" was invented to swap to multiple physical drives. Maybe the kernel's use of swap space somehow gets better with this change, but chances are changing some other setting would help, too (something like swap prefetch, if there's such a thing)."

Yes, I would like to know the settings to change. I have tried a lot of the VM settings in /proc/sys/vm. I didn't hit a good one yet, but have hit ones that made things worse. This is 2.4, and I haven't tried a lot in 2.6 yet, but now that it is getting a little more stable I might. The MEPIS is a 2.6 kernel. But I haven't tested to see what difference this makes in 2.6.

At any rate, 2.4 kernel is ge

June 14, 2005 - 6:30am
Antti S. Lankila (not verified)

At any rate, the 2.4 kernel is getting obsolete and any work of this kind will be of very limited value. It likely won't be incorporated into mainstream, and most users are flocking to the new kernel, as it has proven reliable, performant, and largely swapping-free.

I invite you to redo your tests against the 2.6 kernel at the earliest possible moment, but you will need to use C programs to abuse the VM in order to get any significant swapping.

Yet another kmerley chronic lack of understanding

June 11, 2005 - 8:07am
Anon (not verified)

From the person who brought us that oh so special trip through idiocy, RAM as swap, comes another round of missing the point. It only makes sense if you stripe swap across multiple disks.

RAID0 exists to boost performance by striping data across multiple discs because an individual disk is limited in speed. RAID0 on multiple partitions of the same disk makes no sense because you are still bound by that single disk's limitations.

The same principle applies to swaps at the same priority level. It only makes sense across multiple disks as it's trying to solve the same problem. In fact putting them all on the same disk as per your example will *SLOW THINGS DOWN* as the disk ends up seeking constantly across 4 discrete swap areas.

During the last round of bullshit (RAM as swap), you derided most dissenting voices as knowing nothing about kernel internals or the VM. Instead of complaining about the inefficient swap algorithm, why don't you use your doubtlessly superior skills and code up your own replacement. I'm sure the core hackers will shower your patch with the praise and admiration it so richly deserves. At least then you could look down your nose at us, the ignorant morlocks, *legitimately*.

How can kmerley, a clearly deficient human being, be allowed to post to the front page? Of course, according to Kim, my opinion is worthless because I posted anonymously.

Your Opinion Promotes Misunderstanding

June 11, 2005 - 2:05pm

In reply to:

"Yet another kmerely chronic lack of understanding"

Specifically:

"The same principle applies to swaps at the same priority level. It only makes sense across multiple disks as it's trying to solve the same problem. In fact putting them all on the same disk as per your example will *SLOW THINGS DOWN* as the disk ends up seeking constantly across 4 discrete swap areas."

OK, this is what should happen, but it ISN'T what happens. For some reason it helps to put the swap striping all on one drive. I don't know why. I can guess that it is at least partially related to what Andrew Morton was talking about in that email which is mentioned here:

http://lkml.org/lkml/2004/9/9/254

And a quote:

"Someone needs to get down and redesign the swap block allocator. I bet latency improvements would fall out of that automatically.

The main problem is that swap blocks are now physically clustered according to the page lru ordering, which doesn't have much relationship to process-virtual-address-ordering.

The swap allocator made sense when we were doing a virtual scan. It
doesn't make much sense now."

Somehow I don't think that is the whole story though.

And another interesting thing, his email is from Sept 9th 2004, and my "How to use RAM as Swap" article was from August 17th 2004, stretching well into September.

Deceiving appearances

June 12, 2005 - 5:39am

>For some reason it helps to put the swap striping all on one drive. I don't know why.

Yes. But that's not it really. The 2.4 swapper looks at your
partitions and goes "holy fuck, no" and tries to swap less
because your partitions are too small.

Try dropping some of them away all the way and you'll see the
same behaviour.

It's faster because the vm starts to abhor swapping when there
isn't ample space.

And it's not about seek times either.

You can start proving me wrong by posting some very solid data
against this.

-- 
Markus Törnqvist

Interesting.

June 12, 2005 - 10:00am

That'd be interesting if that's really what's affecting the performance here, and it's actually somewhat rational, too.

The question is, though, doesn't Linux look at the total size of swap? If I have 1 1GB partition or 4 256MB partitions, shouldn't Linux's heuristics for dialing back swappiness work the same?

That also argues that there'd be a performance penalty for adding more swap on a given system, beyond the 2-bytes/page cost of tracking all the pages in swap.

There's another possible ironic explanation of what's going on: What if this did make swap slower (so that swapping in or out any particular page took longer), but because it forced a task to stay in state 'D' longer, other things were able to progress further before they got paged out? In other words, the system got faster overall because it got "more unfair" and serialized the processes more? If that's the case, the speedup should only happen if you have multiple tasks active at once. If you have one monolithic pig thrashing swap and everything else is "sleeping," the striped swap should be slower.

I agree that "swap to RAM" and "striping swap on a single disk" are ridiculous notions, but if there are real measured speedups, it's worth trying to understand why. So far I have seen no data related to this post or an explanation of the workload that exhibits the speedup under these schemes.

Oh please! "It seems" "som

June 13, 2005 - 6:01pm
Non Ame (not verified)

Oh please!

"It seems", "somehow it helps" and "I can't explain but" are no good for making a point.
Find some way of measuring the performance and do tests. This shouldn't be too hard.
Nothing is more convincing than hard numbers. Especially if the person in question did some totally bogus articles (RAM as swap, anyone?) in the past.
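One possible way to get such numbers, offered only as a sketch: allocate a buffer, touch one byte in every page, and time repeated passes. The buffer size and trial count below are placeholders, the 4096-byte page size is the x86 figure used elsewhere in this thread, and nothing will swap unless the buffer exceeds free RAM.

```python
import time

PAGE_SIZE = 4096  # assumption: x86 page size


def touch_pages(buf, page_size=PAGE_SIZE):
    """Write one byte per page so every page must be resident; return elapsed seconds."""
    start = time.perf_counter()
    for offset in range(0, len(buf), page_size):
        buf[offset] = 1
    return time.perf_counter() - start


def run_trials(size_bytes, trials=3):
    """Allocate once, then time repeated passes over the buffer."""
    buf = bytearray(size_bytes)
    return [touch_pages(buf) for _ in range(trials)]


if __name__ == "__main__":
    # Hypothetical size; it has to exceed free RAM before any swapping happens.
    for seconds in run_trials(64 * 1024 * 1024):
        print(f"{seconds:.3f} s")
```

Running the same script against one large swap partition and then against several small ones, with everything else held equal, would give the comparison being asked for.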

If you adhere to this, you can't use Linux

June 15, 2005 - 1:40pm

This idea that you can't use something until you know how it works is what is truly bogus. If you apply this criterion to what you do during a day, you will do just about nothing. Do you know everything about how your body works, your car, your television, medicines, stock market, tax returns?

Ram as Swap does speed up swap at least 100 times. If you are having swap slowdowns, with swap in a ramdisk, they will be reduced to portions of a second. Of course you need enough RAM to do this.

You better stop using Linux because its page reclamation algorithms are determined by trying things and using what works best, for each release. Not very explainable, but usable. We might not like this, but it is the only practical thing to do. It isn't _clean_ but it has to be done. See how it is described in "Understanding the Linux Kernel". I believe I quoted parts of that subchapter in "How to use RAM as Swap".

"This idea that you can't use

June 16, 2005 - 3:34am
Non Ame (not verified)

"This idea that you can't use something until you know how it works is what is truly bogus"

I did not say that.

For clarification:

There is a distinct difference between fiddling around on your personal system until it works better for some unknown reason and systematically going to the root of a problem and correcting it.
If you don't have the knowledge or resources to do this yourself you should post your findings to the appropriate mailing list.

Maybe that would help

June 16, 2005 - 7:15pm

In reply to "Non Ame"'s comment:

"There is a distinct difference between fiddling around on your personal system until it works better for some unknown reason and systematically going to the root of a problem and correcting it.
If you don't have the knowledge or resources to do this yourself you should post your findings to the appropriate mailing list."

Probably a reason that this and the other swap articles bring so many comments is that swap has been a problem for quite a while. As memory and drive space get larger and larger, while the hard drive platter rotation speed stays about the same (it isn't going up by factors of 10 anyway without flying apart), swap got to be a more noticeable problem. These suggestions are going to be helpful to those that use them.

Anyway, I thought this _was_ a discussion group. There is sure a lot of discussion and major players read it.

To go to the source of the problem and correct it just might take a major kernel change. All this data is handled in swap pages, and on x86 this means 4096 bytes at a time. That wasn't such an issue when we only had 16MB of memory and 500 MB drives. A good swap partition then would have been 32 MB. So if we now have 512 MB of memory and 1 GB swap partitions, handling the swap pages using the same parameters will make it slow. That is the small pipe swap has to pass through I believe. What should be done? Make swap pages 40960 bytes? I bet that would speed things up a little. I bet it would be a major change. Remember how hard drives went through all those limits in capacity until they changed from CHS to LBA. There is still mystery about that. But they had to do it.

_I_ can't make this change, that is for sure. That would be a large coordinated effort. And why stop at 40960? Why not make it so it can be easily even larger, because drive sizes and other improvements will require it in the future. Wait till we all have dual core processors, TB hard drives, and 32GB main memory, and Oracle takes at least 32GB of RAM and 64GB swap. Then doing data transfers at 4096 bytes/page will be painful.

Maybe it would help a lot to increase the readahead on our more modern systems. I don't think I have tried that yet.

But if it was an easy matter, or one you could solve by writing to discussion groups, it would have been done already. A lot of long-time kernel developers working together hadn't solved it by the 2.4 kernel.

By the way, the 2.4 kernel will be in use a long long time. It is still perceived as more stable than 2.6. It will still be used on older hardware. So far I have seen situations where upgrades to 2.6 cause crashes or data loss. That was 2.6.9. If we kept just running the old 2.4, no problems. So I believe there is interest in 2.4, and I am not upgrading if there is no good reason. Sure I will try the newer kernels where it doesn't matter if the computer freezes, but I am a little hesitant to take the chance again with someone else's server computer. It is unfortunate, but that is my personal experience so far. Look at all the topics on Kerneltrap about having trouble upgrading to 2.6. They often say "I decided to upgrade to the 2.6 kernel, so I did, and now this and that which worked before don't work now." Sure, a lot of those problems have been solved, but not all, and new ones are introduced now and then too.

informed comment

June 16, 2005 - 8:55pm

You managed to describe accurately the function of /proc/sys/vm/page-cluster, congratulations.
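For anyone who wants to see what that knob means in bytes, here is a small sketch. As far as I know the kernel's sysctl documentation describes page-cluster as a base-2 logarithm, so the kernel reads 2^page-cluster pages in one go on swap-in. The 4096-byte page size is the x86 value assumed throughout this thread, and the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: x86 page size


def readahead_bytes(page_cluster, page_size=PAGE_SIZE):
    """page-cluster is a base-2 logarithm: the kernel reads 2**page_cluster pages."""
    return (2 ** page_cluster) * page_size


def current_page_cluster(path="/proc/sys/vm/page-cluster"):
    """Read the live setting; returns None when the file is not available."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    value = current_page_cluster()
    if value is not None:
        print(value, readahead_bytes(value))
```

With a setting of 3 that works out to 8 pages, or 32768 bytes, per swap-in read.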

What about using video RAM as

June 11, 2005 - 1:34pm

What about using video RAM as swap?

What about using video RAM as

June 11, 2005 - 1:47pm
Ano (not verified)

LMAO

I have thought about this bef

June 11, 2005 - 2:30pm
Anonymous Coward (not verified)

I have thought about this before and it might not be a bad idea. The new cards with 256-512MB RAM are mostly useless unless you're doing 3D. Of course, this isn't really important; if you have one of these, you probably have 1-2GB+ of main memory. I would rather have the kernel developers working on more important things.

Developers wouldn't work on this

June 11, 2005 - 3:14pm

Hello there Mr. (I assume anyway) Anonymous Coward. Glad to hear from you again.

Probably the developers wouldn't work on this idea of using video memory.

At least the interface to the memory is fast. It might be possible to just use the video memory as regular RAM, if the video card is cooperative.

So, if you could just add the video memory, like maybe most of a 1 GB video memory, to the main memory address space, then you could put in the ramdisk, and it would go somewhere in the memory and the video memory would be available as RAM, whether used for ramdisk or caches and buffers, or whatever.

It wouldn't help me much with some of my computers because they have integrated graphics, using main memory for video memory anyway. There is an instance of video and main RAM being shared. The BIOS supports that.

how about this?

June 11, 2005 - 5:14pm

http://hedera.linuxnews.pl/_news/2002/09/03/_long/1445.html

By the way, slram/mtdblock is also the way to use normal RAM (e.g. RAM that is uncached on old cheap boards) as swap. One problem remains: when this 'fast swap' partition is full, its least recently used entries are not moved to slower swap space (such as a disk), so once it is filled up by some memory-hogging daemons, all the other programs swap to disk again. You can work around this at regular intervals: swapoff the device, force everything into swap by making a large enough allocation and touching the pages, then swapon the device again. That gets the old contents moved to disk. Pages that really are needed are swapped in when accessed, and go into the now-empty memory swap.

It IS (or has been) possible

June 11, 2005 - 7:06pm
Anon (not verified)

It IS (or has been) possible to use your videoram. Do some googling.

http://hedera.linuxnews.pl/_news/2002/09/03/_long/1445.html

YMMV

I do like the last of that Article

June 11, 2005 - 9:53pm

I did get a kick out of that last part:

--------------------------------------------------------
Well, many things. When I was thinking about it, I have found two ways to use it. One of them is making any filesystem on that:

meehow:~# mkfs.ext2 /dev/mtdblock0

and mounting it somewhere, the other is more sophisticated:

meehow:~# mkswap /dev/mtdblock0
Setting up swapspace version 1, size = 12582912 bytes

meehow:~# swapon /dev/mtdblock0

-----------------------------------------------------------

I can just see the future video card ad copy:

"In lower resolution modes, spare video memory can be used as swap in Linux systems."

Could Add Another Disk

June 11, 2005 - 3:23pm

To avoid the possibility of having problems with repartitioning and bootloader misdirections, you could add another drive with the multiple swap partitions on that. An older drive might work quite well for this, and a new drive will work well (better) too. You can put the swap partitions on the added drive, and use the rest of the space for another data partition. If you were going to add a drive anyway, this would be a great time to set up multiple swap partitions.
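For the fstab side of this, a minimal sketch follows. The `pri=` mount option for swap entries is documented in swapon(8); the device names and the priority value are placeholders for a hypothetical second drive, and the helper itself is only an illustration, not something any distribution ships.

```python
def swap_fstab_lines(devices, priority=1):
    """Build one fstab line per swap partition, all with the same pri= value."""
    return [f"{dev}\tnone\tswap\tsw,pri={priority}\t0\t0" for dev in devices]


if __name__ == "__main__":
    # Hypothetical partitions on a hypothetical second drive.
    for line in swap_fstab_lines(["/dev/hdb1", "/dev/hdb2", "/dev/hdb3", "/dev/hdb4"]):
        print(line)
```

Each partition still needs mkswap run on it once before the entries take effect.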

Of course, you wouldn't want to put a partition that was going to have a lot of traffic to it on the same drive as the swap partitions because that would slow things down due to the heavy use of the IDE bus.

That's interesting, in the analogy of data being transported here and there, it is shipped by bus.

And the next question is: wha

June 11, 2005 - 7:37pm

And the next question is: what about using audiocard RAM as swap?

P.S. I tried google before

should work

June 11, 2005 - 10:09pm

This should be easy, if you wrap the access functions (somewhere in the card driver) with an mtd driver (see my post above)...

But the real chance lies in parallelism: even with a cheap stereo card you can read from or write to a datassette tape in parallel (or a speaker/delay tube/microphone combination: dynamic air storage), and with modern 7.1 cards in every PCI slot or even USB (up to 127 devices) you can read/write tens or even hundreds of swap pages simultaneously! Of course, as in the multiple disk scenario, you need enough concurrent processes/threads to fully utilize these possibilities, but I am pretty sure some 'works for me' case can be constructed out of this.

Btw, the flash RAMs for BIOS and other firmware are another untapped resource, just save them to (RAM-)disk and restore their contents before rebooting.

Dang, even I can't summon that much sarcasm...

June 12, 2005 - 11:54pm

Using firmware chips as swap space. The mind boggles.

Not a bad deal for the hardware manufacturers, as people brick motherboards, network adapters, and SCSI controllers by the truckload...

I could only stand in awe of the person who actually implemented it. And pray I never have to work with (or after) them...

If you really want to squeeze

June 13, 2005 - 8:58am
Zombywuf (not verified)

If you really want to squeeze in some extra storage space, why not use the air molecules in your room for a short-lived temporary store? Should be fairly easy with a 4.1 surround sound card with satellites dotted around the room, with microphones set up to pick up the echoes and retransmit them if the memory isn't needed at that point in time. Use separate phases/frequencies for each page, job's a good 'un.

Not all of the application is paged out to the swap partition

June 12, 2005 - 2:32am
Anonymosu (not verified)

Remember that a good part of the process' memory space consists of the executable file and various libraries. These are never paged out into the swap partition, because they are backed by the corresponding file on the filesystem. When they are needed again, the pages are simply read in from the original file. Which of course causes random IO.
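A rough way to see this split for a given process is to look at /proc/<pid>/maps, where file-backed mappings carry a path and anonymous ones do not. The parser below is only a sketch with a made-up function name. It treats any mapping whose path starts with "/" as file-backed, which glosses over the fact that private pages a process has modified are no longer backed by the file.

```python
def classify_mappings(maps_text):
    """Split /proc/<pid>/maps lines into file-backed and anonymous mappings."""
    file_backed, anonymous = [], []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        path = fields[5].strip() if len(fields) == 6 else ""
        # A real file has a path starting with "/"; [heap], [stack] and
        # unnamed regions are anonymous memory.
        entry = (fields[0], fields[1], path)
        if path.startswith("/"):
            file_backed.append(entry)
        else:
            anonymous.append(entry)
    return file_backed, anonymous


if __name__ == "__main__":
    try:
        with open("/proc/self/maps") as f:
            backed, anon = classify_mappings(f.read())
        print(len(backed), "file-backed,", len(anon), "anonymous")
    except OSError:
        print("/proc/self/maps not available")
```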

Not that this actually explains the weird behaviour you are seeing, but worth keeping in mind, nevertheless.

kswapd swap_cluster

June 13, 2005 - 4:09am

The kswapd parameters are usually set to:

512 32 8

Which we are told means (http://www.tldp.org/LDP/intro-linux/html/x9025.html):

------------------------------------------------------------------

kswapd
------

Kswapd is the kernel swap out daemon. That is, kswapd is that piece of the kernel that frees memory when it gets fragmented or full. Since every system is different, you'll probably want some control over this piece of the system.

The file contains three numbers:

tries_base
----------

The maximum number of pages kswapd tries to free in one round is calculated from this number. Usually this number will be divided by 4 or 8 (see mm/vmscan.c), so it isn't as big as it looks.

When you need to increase the bandwidth to/from swap, you'll want to increase this number.

tries_min
---------

This is the minimum number of times kswapd tries to free a page each time it is called. Basically it's just there to make sure that kswapd frees some pages even when it's being called with minimum priority.

swap_cluster
------------

This is probably the greatest influence on system performance.

swap_cluster is the number of pages kswapd writes in one turn. You'll want this value to be large so that kswapd does its I/O in large chunks and the disk doesn't have to seek as often, but you don't want it to be too large since that would flood the request queue.

-----------------------------------------------------------------
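To keep the three numbers straight, here is a small sketch that attaches the names from the quoted text to the "512 32 8" values. The field names come from the quote above; the parsing helper is just an illustration and does not read any file.

```python
from collections import namedtuple

KswapdParams = namedtuple("KswapdParams", "tries_base tries_min swap_cluster")


def parse_kswapd(text):
    """Parse the three whitespace-separated integers into named fields."""
    values = [int(v) for v in text.split()]
    if len(values) != 3:
        raise ValueError("expected three integers")
    return KswapdParams(*values)


if __name__ == "__main__":
    params = parse_kswapd("512 32 8")
    print(params)
```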

Now, look at that last one. It has the most influence in system performance (hopefully from increasing swap speed). We want swap_cluster to be large so that kswapd does its I/O in large chunks. BBUUUUUUUTTTTTT, you don't want it to be too large since that would flood the request queue.

So it gets set to 8. For x86 that should be 32kbytes at a time. That doesn't sound very large to me. Well, that would make swap take a long time. That breaks a MB into 32 operations. Is this right, or am I missing something? This is because we don't want to flood the request queue. 32 or more seeks. Each one is time expensive. Can't the request queue be made larger? Can't it handle larger chunks?

The larger files get, the more this would be a problem. 10 MB would take 320 operations, and the big problem is that the hard drives only spin so fast, and after the buffer in the hard drive is exhausted, we are not at the high burst data rates, but at the platter-spin-rate-limited value again. What is the chance that what you want from swap is going to be in the hard drive's buffer anyway?
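The arithmetic above can be checked with a few lines. This sketch assumes, as the post does, 4096-byte pages and that each swap_cluster-sized write counts as one operation; whether every operation really costs a separate seek is a guess, not something the code can show.

```python
PAGE_SIZE = 4096   # bytes per page on x86, as stated in the post
SWAP_CLUSTER = 8   # pages per kswapd write, from the "512 32 8" setting
MB = 1024 * 1024


def cluster_bytes(pages=SWAP_CLUSTER, page_size=PAGE_SIZE):
    """Bytes written per kswapd turn."""
    return pages * page_size


def operations_needed(total_bytes, pages=SWAP_CLUSTER, page_size=PAGE_SIZE):
    """Number of cluster-sized writes needed, rounded up."""
    chunk = cluster_bytes(pages, page_size)
    return -(-total_bytes // chunk)


if __name__ == "__main__":
    print(cluster_bytes())                       # bytes per operation
    print(operations_needed(1 * MB))             # operations for 1 MB
    print(operations_needed(10 * MB))            # operations for 10 MB
    print(operations_needed(1 * MB, pages=64))   # same 1 MB with swap_cluster = 64
```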

Why does the swap_cluster size possibly flood the request queue anyway? Does the request queue have to treat each page as a separate request? Is that the limiting factor? If we choose 8 for this number then there are 8 requests, each one to swap out one 4096 byte page? If that is how it is, I can see why it is slow. It would be slow enough just from writing to the hard drive, but if this happens that will slow it down much more.

If we could make this number 64 safely that might speed things up. Or is swap_cluster not exactly what is stated?

breaking up a of a 512 MB swa

June 13, 2005 - 7:24am

breaking up a 512 MB swap partition into four partitions of 128 MB

Herein lies the key to his performance gain, and it has nothing to do with the number of swap partitions. The swap algorithm will try harder to not swap if you get over half of the swap space used. If you decrease the size of your swapspace to something that will often be more than half full, your vm will be less swappy from that point on. Of course that means you're more likely to go out-of-memory sooner if you really stress it. What he should do is use just one partition of 128MB, since he found that optimal, and that will be much faster than 4 disk-thrashing striped partitions on the same disk.

Maybe, but

June 13, 2005 - 12:44pm
Anonymous S. Whole (not verified)

"The swap algorithm will try harder to not swap if you get over half of the swap space used."

That is assuming that the change in speed only occurs when the multiple swap partitions are half or more full, which may or may not be the case.
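One way to check that assumption would be to record how full each swap area is while the speedup is observed. The sketch below parses the /proc/swaps table (Filename, Type, Size, Used, Priority) into a priority and a used fraction per area; the function name is made up, and it says nothing about where the kernel's threshold actually sits.

```python
def swap_usage(swaps_text):
    """Return {filename: (priority, used_fraction)} from /proc/swaps content."""
    usage = {}
    for line in swaps_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        name = fields[0]
        size, used, prio = int(fields[2]), int(fields[3]), int(fields[4])
        usage[name] = (prio, used / size if size else 0.0)
    return usage


if __name__ == "__main__":
    try:
        with open("/proc/swaps") as f:
            for name, (prio, frac) in swap_usage(f.read()).items():
                print(f"{name} priority={prio} used={frac:.0%}")
    except OSError:
        print("/proc/swaps not available")
```

Logging this alongside the wait times would show whether the faster runs line up with the areas being more than half full.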
