The question was asked on the lkml whether memory allocated by kmalloc and vmalloc is swappable. Rik van Riel offered a clear explanation as to why it is not: "unswappable kernel memory is simpler and faster," adding, "there really is no good reason for swapping kernel memory nowadays." He went on to explain:
"Over the last 15 years, the memory requirements of the Linux kernel have grown maybe a factor 10, while the memory of computers has grown by a factor of 1000.
"The data structures that grow with memory (mostly the mem_map[] array of page structs) has actually gotten smaller since the 2.4 kernel and now takes under 1% of memory even on x86-64."
From: Rodrigo Amestica [email blocked]
To: linux-kernel
Subject: is linux still a none swappable kernel?
Date: Sun, 13 May 2007 13:12:12 -0400

In some older posts I have read that memory allocations via kmalloc and vmalloc are not swappable, that is, these memory chunks are not paged out to swap area. Is this still the case with linux kernel 2.6?

thanks,

From: Rik van Riel [email blocked]
Subject: Re: is linux still a none swappable kernel?
Date: Sun, 13 May 2007 15:00:25 -0400

Rodrigo Amestica wrote:
> In some older posts I have read that memory allocations via kmalloc
> and vmalloc are not swappable, that is, these memory chunks are not
> paged out to swap area. Is this still the case with linux kernel 2.6?

Yes. Unswappable kernel memory is simpler and faster.

Over the last 15 years, the memory requirements of the Linux kernel have grown maybe a factor 10, while the memory of computers has grown by a factor of 1000.

The data structures that grow with memory (mostly the mem_map[] array of page structs) has actually gotten smaller since the 2.4 kernel and now takes under 1% of memory even on x86-64.

There really is no good reason for swapping kernel memory nowadays.

--
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
Except maybe one thing - process stacks. Here you have a choice: either you try very hard to make them small, as Linux does, or you allow them to be swapped, as e.g. FreeBSD does.
Hmmm...
Are you referring to the kernel side of the process' stack? The user-space side is swappable under Linux, I'm almost certain.
Of course it is.
The non-swappable stack is 8k or 4k per process. That makes 8MB or 4MB, respectively, for a system with 1000 processes, which is acceptable IMHO.
1024
That would be 1024 processes. Even better!
Not processes - threads. Assuming you have a 1:1 threading model, you need one kernel stack per thread, not per process.
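To put numbers on the figures quoted above, here is a rough sketch of the arithmetic, assuming one fixed-size kernel stack per thread (the 1:1 model) and the 4k/8k stack sizes mentioned in this thread. The function name `kernel_stack_bytes` is made up for illustration and is not a kernel interface.

```python
def kernel_stack_bytes(threads, stack_kib):
    """Total non-swappable kernel stack memory, assuming one stack per thread."""
    return threads * stack_kib * 1024


# Thread counts are illustrative; stack sizes are the 4k and 8k quoted above.
for stack_kib in (4, 8):
    for threads in (1000, 1024, 10000):
        mib = kernel_stack_bytes(threads, stack_kib) / (1024 * 1024)
        print(f"{threads:>6} threads x {stack_kib}k stack = {mib:.1f} MiB")
```

With 1024 threads the totals come out to exactly 4 MiB and 8 MiB; with 1000 threads they are slightly less, so the "8MB resp. 4MB" estimate is a rounded figure.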
In Linux, processes and threads are the same thing.
Really?
Point me to where FreeBSD swaps kernel side process stacks. I don't believe it.
You're lying.
FreeBSD doesn't have swappable kernel stacks either.
http://www.awprofessional.com/articles/article.asp?p=366888&seqNum=2&rl=...
"Every thread that might potentially run must have its stack resident in memory because one task of its stack is to handle page faults. If it were not resident, it would page fault when the thread tried to run, and there would be no kernel stack available to service the page fault. Since a system may have many thousands of threads, the kernel stacks must be kept small to avoid wasting too much physical memory. In FreeBSD 5.2 on the PC, the kernel stack is limited to two pages of memory. Implementors must be careful when writing code that executes in the kernel to avoid using large local variables and deeply nested subroutine calls, to avoid overflowing the run-time stack."
The fragment above does not state that stacks are not swappable - it only says that a thread cannot run without its kernel stack in physical memory.
Read this thread: http://www.daemonnews.org/mailinglists/FreeBSD/cvs-src/msg45498.html. It describes a particular problem caused by interaction of bad msleep(9) use and stack swapping.
I'm not a kernel hacker, but swapping out stacks of not sleeping processes seems like a really stupid idea, IMHO.
Why?
Because the stack of a sleeper is practically empty. What do you have there? The contents of maybe 32 4-byte registers, or 128 bytes. Why would you swap out a whole page or two in order to save 128 bytes?
A better approach would be to save the 128 bytes somewhere else in RAM. On a 10000-processes machine, this would amount to around 1MB of RAM. And the latency for starting processes whose stack is gone would be much better. For 100000 processes we are talking about 10MB of RAM. If you cannot afford that, why are you running 100000 processes at the same time?
So what am I missing?
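For what it's worth, the numbers in this proposal can be checked with a short sketch. It assumes, as the comment above does, that a sleeping process needs only 32 registers of 4 bytes each, and it compares that against keeping a full 8k stack resident. The function names are invented for this example.

```python
def saved_context_bytes(processes, context_bytes=32 * 4):
    """RAM needed if only the saved registers are kept (an assumption of the proposal)."""
    return processes * context_bytes


def full_stack_bytes(processes, stack_kib=8):
    """RAM needed if every process keeps its whole fixed-size kernel stack."""
    return processes * stack_kib * 1024


for processes in (10_000, 100_000):
    small = saved_context_bytes(processes) / (1024 * 1024)
    full = full_stack_bytes(processes) / (1024 * 1024)
    print(f"{processes:>7} processes: {small:.1f} MiB saved context vs {full:.1f} MiB full stacks")
```

The "around 1MB" and "10MB" figures are roundings of 1.28 million and 12.8 million bytes. The comparison only holds if the 128-byte assumption does, which is exactly what the replies question.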
Argh... "of sleeping processes", of course.
Kernel stacks are fixed in size, so the stack of a 'sleeper' takes exactly the same amount of space as usual.
There is an even better solution, called "mach continuations" - this way, a sleeping thread does not even need space for all the CPU context; it only takes one pointer to a function and maybe some parameter that gets passed to it. MacOS X is said to use mach continuations.
Yes, that was my point: The size of the stack is 8k (or 4k), no matter what, but only some 128 bytes are used when the process is sleeping. It seems silly to swap in/out a whole 4k page which isn't used at all. Instead, copy the 128 bytes somewhere else and just free the stack. If the process is woken up, the stack can easily be reconstructed without hitting the disk. Doesn't that make sense?
Huh?
Couldn't you be fairly deep in a call chain if whatever system call you made decides to sleep?
Grandparent is confused. He thinks that processes only sleep when they invoke the sleep system call. But in fact, processes can sleep when waiting for I/O, on a page fault, and in many other cases.
So yes, you're right.
One more thing that I forgot to add - it's not just FreeBSD; actually, I think that _most_ operating systems do swap some kernel data. Take a look at Solaris, for example, http://research.sun.com/techrep/1998/smli_tr-98-55.pdf:
"The process' basic state, including kernel thread stacks, is removed from memory by the swapping process. Only the swap scheduler is allowed to push kernel stack pages to disk."
Another example: an old, a.d. 1993, SunOS 4.1.4, BSD based. Why are there both 'pagedaemon' and 'swapper' kernel processes? (Ok, I'm not exactly sure about this one. I _seem_ to remember that all Unix systems swapped stacks, but I don't have any Vahalia or McKusick handy right now.)
Some swap not only kernel stacks, but even kernel code - Windows NT, for example.
If you consider that all the systems you are talking about are technically inferior to Linux, you may get the answer to why Linux does not swap the kernel-side process stack.
Actually, at least Solaris is much more advanced. But let's not continue this topic, ok? ;-)
I won't start a discussion here about why I disagree with you about this. I'll just say that Solaris is not "much" more advanced than Linux, and that's for sure. I don't even think that it's "a bit" more advanced! Anyway, going back to the things that matter, I think that swapping the kernel-side process stack is more a problem than a feature.
Swapping vs. paging
I think part of the confusion here is that some OSes (but not Linux) distinguish between swapping (wherein an entire task or data structure is moved to/from disk with explicit intention) and paging (which occurs at page granularity and happens implicitly via page faults).
If you're swapping at task granularity, then it makes sense that you'd be able to move all of its context off to disk. Swap-in and swap-out are explicit acts. If you're merely demand paging via page faults, things become trickier. Faults cause implicit activity and are effectively "random," insofar as when they happen is not specifically written in the code.
Under heavy load, swapping can become attractive if the machine is vastly oversubscribed. It enforces a more serial ordering on the tasks, allowing them to make progress more quickly rather than thrashing each other. Of course, latency can be pretty bad, especially if you're the first swapped and last run.
Think about the tradeoffs, too
When considering this topic, think very hard about the expected tradeoffs. Swapping anything out (kernel code, stacks, userspace code) brings with it two problems:
Given the added complexity of swapping, and the relative speeds of disk and memory that the Linux kernel is designed for, Linux has aimed to not swap kernel data, but to shrink it instead. Everyone gains if the amount of memory used by a kernel shrinks; only users with tasks that idle for long periods of time gain if the kernel is swappable.
Note also that the relative size of kernel data has shrunk massively over time; if kernel data is 1MB plus 1kB per MB of RAM, a 4MB machine loses over 1/4 of its RAM to kernel data, while a 1GB machine loses 1/512 of its RAM to kernel data. Combine this shift with the fact that RAM speeds are relatively higher than disk speeds (RAM is now roughly 100 times faster than disk at its best, whereas it used to be typically ten times faster than disk at its best), and it becomes even harder to justify the complexity of swapping out kernel data.
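The fractions above can be reproduced with a small worked sketch. It takes the comment's hypothetical model at face value (1MB of fixed kernel data plus 1kB per MB of RAM); the model and the function name `kernel_data_fraction` are illustrative only, not measured kernel figures.

```python
from fractions import Fraction


def kernel_data_fraction(ram_mb, fixed_kb=1024, per_mb_kb=1):
    """Fraction of RAM used by kernel data under the hypothetical 1MB + 1kB/MB model."""
    kernel_kb = fixed_kb + per_mb_kb * ram_mb
    return Fraction(kernel_kb, ram_mb * 1024)


# RAM sizes in MB; 4 and 1024 are the two machines discussed above.
for ram_mb in (4, 64, 1024):
    share = kernel_data_fraction(ram_mb)
    print(f"{ram_mb:>5} MB RAM: kernel data is {share} of RAM ({float(share):.2%})")
```

Under this model the 4MB machine gives up 257/1024 of its RAM (just over 1/4) and the 1GB machine exactly 1/512, matching the figures in the comment.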
__exit
I think it would be nice to patch kernel/module.c so that it does not store __exit and similar sections in RAM, at least not permanently, but loads them when needed.