Linux: Unswappable Kernel Memory

Submitted by Jeremy
on May 14, 2007 - 4:03pm

The question was asked on the lkml whether memory allocated by kmalloc and vmalloc is swappable. Rik van Riel offered a clear explanation of why it is not: "unswappable kernel memory is simpler and faster," adding, "there really is no good reason for swapping kernel memory nowadays." He went on to explain:

"Over the last 15 years, the memory requirements of the Linux kernel have grown maybe a factor 10, while the memory of computers has grown by a factor of 1000.

"The data structures that grow with memory (mostly the mem_map[] array of page structs) has actually gotten smaller since the 2.4 kernel and now takes under 1% of memory even on x86-64."


From: Rodrigo Amestica [email blocked]
To:  linux-kernel
Subject: is linux still a none swappable kernel?
Date:	Sun, 13 May 2007 13:12:12 -0400

In some older posts I have read that memory allocations via kmalloc
and vmalloc are not swappable, that is, these memory chunks are not
paged out to swap area. Is this still the case with linux kernel 2.6?

thanks,


From: Rik van Riel [email blocked]
Subject: Re: is linux still a none swappable kernel?
Date:	Sun, 13 May 2007 15:00:25 -0400

Rodrigo Amestica wrote:
> In some older posts I have read that memory allocations via kmalloc
> and vmalloc are not swappable, that is, these memory chunks are not
> paged out to swap area. Is this still the case with linux kernel 2.6?

Yes.  Unswappable kernel memory is simpler and faster.

Over the last 15 years, the memory requirements of the Linux
kernel have grown maybe a factor 10, while the memory of
computers has grown by a factor of 1000.

The data structures that grow with memory (mostly the mem_map[]
array of page structs) has actually gotten smaller since the
2.4 kernel and now takes under 1% of memory even on x86-64.

There really is no good reason for swapping kernel memory
nowadays.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Except maybe one thing -

trasz (not verified)
on
May 14, 2007 - 5:04pm

Except maybe one thing: process stacks. Here you have a choice: either you try very hard to make them small, as Linux does, or you allow them to be swapped, as e.g. FreeBSD does.

Hmmm...

on
May 14, 2007 - 6:14pm

Are you referring to the kernel side of the process' stack? The user-space side is swappable under Linux, I'm almost certain.

Of course it is. The

Anonymous (not verified)
on
May 15, 2007 - 2:57am

Of course it is.
The non-swappable stack is 8k or 4k per process. That makes 8MB resp. 4MB for a system with 1000 processes, which is acceptable IMHO.
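As a rough check of these figures, here is a minimal sketch in Python. It assumes exactly one fixed-size kernel stack per task and nothing else; the function name `kernel_stack_total` is made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

def kernel_stack_total(tasks, stack_bytes):
    """Total unswappable memory if every task pins one fixed-size kernel stack."""
    return tasks * stack_bytes

# Try both stack sizes mentioned above, for 1000 and for 1024 tasks.
for stack_kib in (4, 8):
    for tasks in (1000, 1024):
        total = kernel_stack_total(tasks, stack_kib * KIB)
        print(f"{tasks} tasks x {stack_kib} KiB = {total / MIB:.2f} MiB")
```

With 1000 tasks the totals come out slightly under 4 and 8 MiB; with 1024 tasks they are exactly 4 and 8 MiB.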

1024

Anonymous (not verified)
on
May 15, 2007 - 8:24am

That would be 1024 processes. Even better!

Not processes - threads.

trasz (not verified)
on
May 16, 2007 - 8:37am

Not processes - threads. Assuming you have 1:1 threading model, you need one kernel stack per thread, not per process.

In linux processes and

Anonymous (not verified)
on
May 17, 2007 - 4:22pm

In linux processes and threads are the same thing.

Really?

Anonymous (not verified)
on
May 15, 2007 - 3:07pm

Point me to where FreeBSD swaps kernel side process stacks. I don't believe it.

You're lying. FreeBSD

Anonymous (not verified)
on
May 16, 2007 - 1:25am

You're lying.

FreeBSD doesn't have swappable kernel stacks either.

http://www.awprofessional.com/articles/article.asp?p=366888&seqNum=2&rl=1:
"Every thread that might potentially run must have its stack resident in memory because one task of its stack is to handle page faults. If it were not resident, it would page fault when the thread tried to run, and there would be no kernel stack available to service the page fault. Since a system may have many thousands of threads, the kernel stacks must be kept small to avoid wasting too much physical memory. In FreeBSD 5.2 on the PC, the kernel stack is limited to two pages of memory. Implementors must be careful when writing code that executes in the kernel to avoid using large local variables and deeply nested subroutine calls, to avoid overflowing the run-time stack."

The fragment above does not

trasz (not verified)
on
May 16, 2007 - 8:36am

The fragment above does not state that stacks are not swappable - it only says that a thread cannot run without its kernel stack in physical memory.

Read this thread: http://www.daemonnews.org/mailinglists/FreeBSD/cvs-src/msg45498.html. It describes a particular problem caused by interaction of bad msleep(9) use and stack swapping.

I'm not a kernel hacker, but

Anonymous (not verified)
on
May 16, 2007 - 11:29am

I'm not a kernel hacker, but swapping out stacks of not sleeping processes seems like a really stupid idea, IMHO.

Why?

Because the stack of a sleeper is practically empty. What do you have there? The contents of maybe 32 4-byte registers, or 128 bytes. Why would you swap out a whole page or two in order to save 128 bytes?

A better approach would be to save the 128 bytes somewhere else in RAM. On a 10000-processes machine, this would amount to around 1MB of RAM. And the latency for starting processes whose stack is gone would be much better. For 100000 processes we are talking about 10MB of RAM. If you cannot afford that, why are you running 100000 processes at the same time?

So what am I missing?
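The arithmetic in this comment can be sketched in Python as follows. The sketch takes the comment's own guess at face value (32 registers of 4 bytes each is all a sleeper needs to keep) and compares it with keeping a full 8k stack per process; the names are hypothetical.

```python
REGISTER_BYTES = 32 * 4   # the comment's guess: 32 registers of 4 bytes each
STACK_BYTES = 8 * 1024    # one full 8k kernel stack

def saved_context_total(processes, context_bytes=REGISTER_BYTES):
    """Memory needed to keep context_bytes resident for every process."""
    return processes * context_bytes

for n in (10_000, 100_000):
    ctx = saved_context_total(n)
    full = saved_context_total(n, STACK_BYTES)
    print(f"{n} processes: {ctx / 1e6:.2f} MB of saved registers "
          f"vs {full / 1e6:.1f} MB of full stacks")
```

The result only holds if a sleeping process really has nothing but register contents on its kernel stack, which is the assumption being asked about.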

Argh... "of sleeping

Anonymous (not verified)
on
May 16, 2007 - 11:32am

Argh... "of sleeping processes", of course.

Kernel stacks are fixed in

on
May 16, 2007 - 3:12pm

Kernel stacks are fixed in size, so the stack of a 'sleeper' takes exactly the same amount of space as usual.

There is an even better solution, called "mach continuations" - this way, a sleeping thread does not even need space for all the CPU context; it only takes one pointer to a function and maybe some parameter that gets passed to it. MacOS X is said to use mach continuations.
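To illustrate the idea of a continuation only, here is a toy model in Python. It is not the Mach kernel interface; `Thread`, `block`, `wake` and `finish_read` are hypothetical names invented for this sketch.

```python
class Thread:
    """A toy thread that keeps no stack while blocked, only a continuation."""
    def __init__(self, name):
        self.name = name
        self.continuation = None  # (function, argument) to run on wakeup

def block(thread, func, arg):
    """Record where to resume instead of preserving a whole stack."""
    thread.continuation = (func, arg)

def wake(thread):
    """Resume by calling the saved function, starting from a fresh stack."""
    func, arg = thread.continuation
    thread.continuation = None
    return func(arg)

def finish_read(nbytes):
    # Hypothetical work to be done once the awaited I/O has completed.
    return f"read completed: {nbytes} bytes"

t = Thread("reader")
block(t, finish_read, 512)  # the blocked thread holds one function and one argument
print(wake(t))              # prints "read completed: 512 bytes"
```

The price of this scheme is that code which blocks has to be written so that everything it needs afterwards fits in the saved function and its argument, rather than in local variables on the stack.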

Yes, that was my point: The

Anonymous (not verified)
on
May 16, 2007 - 3:51pm

Yes, that was my point: The size of the stack is 8k (or 4k), no matter what, but only some 128 bytes are used when the process is sleeping. It seems silly to swap in/out a whole 4k page, which isn't used at all. Instead copy the 128 bytes somewhere else and just free the stack. If the process is woken up the stack can easily be reconstructed without hitting the disk. Doesn't that make sense?

Huh?

on
May 17, 2007 - 2:20pm

Couldn't you be fairly deep in a call chain if whatever system call you made decides to sleep?

Grandparent is confused. He

RareCactus (not verified)
on
May 24, 2007 - 2:04am

Grandparent is confused. He thinks that processes only sleep when they invoke the sleep system call. But in fact, processes can sleep when waiting for I/O, on a page fault, and in many other cases.

So yes, you're right.

One more thing that I forgot

trasz (not verified)
on
May 16, 2007 - 9:03am

One more thing that I forgot to add - it's not just FreeBSD, actually I think that _most_ operating systems do swap some kernel data. Take a look at Solaris, for example, http://research.sun.com/techrep/1998/smli_tr-98-55.pdf:

"The process' basic state, including kernel thread stacks, is removed from memory by the swapping process. Only the swap scheduler is allowed to push kernel stack pages to disk."

Another example: an old, a.d. 1993, SunOS 4.1.4, BSD based. Why are there both 'pagedaemon' and 'swapper' kernel processes? (Ok, I'm not exactly sure about this one. I _seem_ to remember that all Unix systems swapped stacks, but I don't have any Valhalla or McKusick handy right now.)

Some swap not only kernel stacks, but even kernel code - Windows NT, for example.

If you consider that all the

on
May 17, 2007 - 5:57am

If you consider that all the systems you are talking about are technically inferior to Linux, you may get the answer to why Linux does not swap the kernel-side process stack.

Actually, at least Solaris

on
May 17, 2007 - 8:58am

Actually, at least Solaris is much more advanced. But let's not continue this topic, ok? ;-)

I won't start here a

on
May 17, 2007 - 1:02pm

I won't start a discussion here about why I disagree with you about this. I'll just say that Solaris is not "much" more advanced than Linux, and that's for sure. I don't even think that it's "a bit" more advanced! Anyway, going back to the things that matter, I think that swapping the kernel-side process stack is more of a problem than a feature.

Swapping vs. paging

on
May 17, 2007 - 2:28pm

I think part of the confusion here is that some OSes (but not Linux) distinguish between swapping (wherein an entire task or data structure is moved to/from disk with explicit intention) and paging (which occurs at page granularity and happens implicitly via page faults).

If you're swapping at task granularity, then it makes sense that you'd be able to move all of its context off to disk. Swap-in and swap-out are explicit acts. If you're merely demand paging via page faults, things become trickier. Faults cause implicit activity and are effectively "random," insofar as when they happen is not specifically written in the code.

Under heavy load, swapping can become attractive if the machine is vastly oversubscribed. It enforces a more serial ordering on the tasks, allowing them to make progress more quickly rather than thrashing each other. Of course, latency can be pretty bad, especially if you're the first swapped and last run.

Think about the tradeoffs, too

on
May 19, 2007 - 5:53am

When considering this topic, think very hard about the expected tradeoffs. Swapping anything out (kernel code, stacks, userspace code) brings with it two problems:

  1. The kernel has to track whether a particular area is in physical memory or not, and bring it into physical memory if needed. This can't easily be done transparently for kernel memory; it has to be an active decision.
  2. Swapping is inherently expensive; it's only a net win if the swapped out data stays swapped out for a long enough period of time.

Given the added complexity of swapping, and the relative speeds of disk and memory that the Linux kernel is designed for, Linux has aimed to not swap kernel data, but to shrink it instead. Everyone gains if the amount of memory used by a kernel shrinks; only users with tasks that idle for long periods of time gain if the kernel is swappable.

Note also that the relative size of kernel data has shrunk massively over time; if kernel data is 1MB plus 1kB per MB of RAM, a 4MB machine loses over 1/4 of its RAM to kernel data, while a 1GB machine loses 1/512 of its RAM to kernel data. Combine this shift with the fact that RAM speeds are relatively higher than disk speeds (RAM is now roughly 100 times faster than disk at its best, whereas it used to be typically ten times faster than disk at its best), and it becomes even harder to justify the complexity of swapping out kernel data.
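The two fractions above can be checked with a short Python sketch. It uses the comment's illustrative model (1 MB of fixed kernel data plus 1 kB per MB of RAM), which is a hypothetical figure rather than a measurement; the function name is made up.

```python
from fractions import Fraction

def kernel_data_fraction(ram_mb):
    """Fraction of RAM used if kernel data is 1 MB plus 1 kB per MB of RAM."""
    kernel_kb = 1024 + ram_mb  # 1 MB fixed cost + 1 kB for each MB of RAM
    return Fraction(kernel_kb, ram_mb * 1024)

for ram_mb in (4, 1024):
    frac = kernel_data_fraction(ram_mb)
    print(f"{ram_mb} MB of RAM: {frac} of memory ({float(frac):.2%})")
```

For 4 MB this gives 257/1024, just over a quarter, and for 1 GB it gives exactly 1/512.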

__exit

Anonymous (not verified)
on
August 31, 2007 - 2:57pm

I think it would be nice to patch kernel/module.c so that it does not store __exit and similar sections in RAM, at least not permanently, but loads them when needed.
