Linux: Page Replacement Design

Submitted by Jeremy
on January 23, 2007 - 12:46pm

A university student studying operating systems asked about why the Linux kernel uses two chained lists in its LRU (least recently used) page replacement algorithm. Andrea Arcangeli [interview], whose virtual memory subsystem was merged into the 2.4.10 kernel, explained, "back then I designed it with two lru lists because by splitting the active from the inactive cache allows to detect the cache pollution before it starts discarding the working set." He went on to add, "a page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM while freeing cache tries to keep a balance between the size of the two lists to avoid being too unfair, obviously at some point the active list have to be de-activated too."

Rik van Riel [interview], author of the reverse mapping virtual memory code that was merged into the 2.5 kernel [story] noted, "since memory size has increased a lot more than disk speed over the last decade (and this is likely to continue for the next decades), the quality of page replacement algorithms is likely to become more and more important over time." In response to a proposal to split the LRU into two parts, one for the page cache and the other for mapped pages, Nick Piggin [interview] replied, "I actually had patches to do 'split active lists' a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is." Rik offered some ideas on to how to further tune it, "for each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced. That way we can determine which list has the largest fraction of 'idle' pages sitting around and consequently which list should be scanned more aggressively."


From: Niki Hammler [email blocked]
To:  linux-kernel
Subject: Why active list and inactive list?
Date:	Tue, 23 Jan 2007 01:10:46 +0100

Dear Linux Developers/Enthusiasts,

For a course at my university I'm implementing parts of an operating
system where I get most ideas from the Linux Kernel (which I like very
much). One book I gain information from is [1].

Linux uses for its Page Replacing Algorithm (based on LRU) *two* chained
lists - one active list and one inactive list.
I implemented my PRA this way too.

Now the big question is: WHY do I need *two* lists? Isn't it just
overhead/more work? Are there any reasons for that?

In my opinion, it would be better to have just one just; pop frames to
be swapped out from the end of the list and push new frames in front of
it. Then just evaluate the frames and shift them around in the list.

Is there any explanation why Linux uses two lists?

Thanks you very much in advance!

Best regards,
Niki



[1] Linux Kernelarchitektur. Konzepte, Strukturen und Algorithmen von
Kernel 2.6. Wolfang Mauerer, Hanser, ISBN 3-446-22566-8


From: Andrea Arcangeli [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 01:39:39 +0100 On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote: > Dear Linux Developers/Enthusiasts, > > For a course at my university I'm implementing parts of an operating > system where I get most ideas from the Linux Kernel (which I like very > much). One book I gain information from is [1]. > > Linux uses for its Page Replacing Algorithm (based on LRU) *two* chained > lists - one active list and one active list. > I implemented my PRA this way too. > > No the big question is: WHY do I need *two* lists? Isn't it just > overhead/more work? Are there any reasons for that? > > In my opinion, it would be better to have just one just; pop frames to > be swapped out from the end of the list and push new frames in front of > it. Then just evaluate the frames and shift them around in the list. > > Is there any explanation why Linux uses two lists? Back then I designed it with two lru lists because by splitting the active from the inactive cache allows to detect the cache pollution before it starts discarding the working set. The idea is that the pollution will enter and exit the inactive list without ever being elected to the active list because by definition it will never generate a cache hit. The working set will instead trigger cache hits during page faults or repeated reads, and it will be preserved better by electing it to enter the active list. A page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM while freeing cache tries to keep a balance between the size of the two lists to avoid being too unfair, obviously at some point the active list have to be de-activated too. If your server "fits in ram" you'll find lots of cache to be active and so the I/O activity not part of the working set will be collected without affecting the working set much.
From: Balbir Singh [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 07:01:33 +0530 Andrea Arcangeli wrote: > On Tue, Jan 23, 2007 at 01:10:46AM +0100, Niki Hammler wrote: >> >> Is there any explanation why Linux uses two lists? > > Back then I designed it with two lru lists because by splitting the > active from the inactive cache allows to detect the cache pollution > before it starts discarding the working set. The idea is that the > pollution will enter and exit the inactive list without ever being > elected to the active list because by definition it will never > generate a cache hit. The working set will instead trigger cache hits > during page faults or repeated reads, and it will be preserved better > by electing it to enter the active list. > > A page in the inactive list will be collected much more quickly than a > page in the active list, so the pollution will be collected more > quickly than the working set. Then the VM while freeing cache tries to > keep a balance between the size of the two lists to avoid being too > unfair, obviously at some point the active list have to be > de-activated too. If your server "fits in ram" you'll find lots of > cache to be active and so the I/O activity not part of the working set > will be collected without affecting the working set much. This makes me wonder if it makes sense to split up the LRU into page cache LRU and mapped pages LRU. I see two benefits 1. Currently based on swappiness, we might walk an entire list searching for page cache pages or mapped pages. With these lists separated, it should get easier and faster to implement this scheme 2. There is another parallel thread on implementing page cache limits. If the lists split out, we need not scan the entire list to find page cache pages to evict them. Of course I might be missing something (some piece of history) -- Balbir Singh Linux Technology Center IBM, ISTL
From: Andrea Arcangeli [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 03:13:27 +0100 On Tue, Jan 23, 2007 at 07:01:33AM +0530, Balbir Singh wrote: > This makes me wonder if it makes sense to split up the LRU into page > cache LRU and mapped pages LRU. I see two benefits > 1. Currently based on swappiness, we might walk an entire list > searching for page cache pages or mapped pages. With these > lists separated, it should get easier and faster to implement > this scheme When I tried that long time ago, I recall I had troubles, but there wasn't the reclaim_mapped based on static values so it was even harder. However it would be still a problem today to decide when to switch from the unmapped to the mapped lru. When reclaim_mapped is set, you'll still have to shrink some unmapped page, and by splitting you literally lose age information to save some cpu. Eventually you risk spending time in trying to free unfreeable pinned pages that sits in the unmapped list before finally jumping to the mapped list. So you've to add yet another list to get rid of the pinned stuff in the unmapped list and I stopped when I had to refile pages from the "pinned" list to the unmapped list in irq I/O completion context, now it's all spin_lock_irq so it would be more natural at least... > 2. There is another parallel thread on implementing page cache > limits. If the lists split out, we need not scan the entire > list to find page cache pages to evict them. BTW I'm unsure about the cache limit thread, the overhead of the vm collection shouldn't be an issue, and those tends to hide vm inefficiencies. For example Neil has a patch to reduce the writeback cache to 10M-50M (much lower than the current 1% minimum) to hide huge unfariness in the writeback cache. I think they should mount the fs with -o sync instead of using that patch until the unfariness is fixed or tunable. The patch itself is fine though but for that problem it only looks a workaround. So I at least try to be always quite skeptical when I hear about cache "fixed size limiting" patches that improve responsiveness or performance ;) > Of course I might be missing something (some piece of history) Partly ;) Code was very different back then, today it would be easier thanks to reclaim_mapped but the partial loss of age information and potential loss of cpu in a pinned&unmapped walk would probably remain.
From: Rik van Riel [email blocked] Subject: Re: Why active list and inactive list? Date: Mon, 22 Jan 2007 20:42:59 -0500 Balbir Singh wrote: > This makes me wonder if it makes sense to split up the LRU into page > cache LRU and mapped pages LRU. I see two benefits Unlikely. I have seen several workloads fall over because they did not throw out mapped pages soon enough. If the kernel does not keep the most frequently accessed pages resident, hit rates will suffer. Sometimes (well, usually) those are the mapped pages, but this is not true in all workloads. Some workloads are very page cache heavy and it pays to keep the more frequently accessed page cache pages resident while discarding the less frequently accessed ones. Since memory size has increased a lot more than disk speed over the last decade (and this is likely to continue for the next decades), the quality of page replacement algorithms is likely to become more and more important over time. > 1. Currently based on swappiness, we might walk an entire list > searching for page cache pages or mapped pages. With these > lists separated, it should get easier and faster to implement > this scheme How do you classify a mapped page cache page? Another issue is that you'll want to make sure that the page cache pages that are referenced more frequently than the least referenced mapped (I assume you mean anonymous?) pages in memory, while swapping out those least used anonymous pages. One way to do this could be to compare the scan rates, list sizes and referenced percentage of both lists, to find out which of the two caches is hotter. > 2. There is another parallel thread on implementing page cache > limits. If the lists split out, we need not scan the entire > list to find page cache pages to evict them. If the lists split out, there is no reason to limit the page cache size because you can easily reclaim them. Right? > Of course I might be missing something (some piece of history) http://linux-mm.org/AdvancedPageReplacement -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
From: Nick Piggin [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 15:17:32 +1100 Balbir Singh wrote: > This makes me wonder if it makes sense to split up the LRU into page > cache LRU and mapped pages LRU. I see two benefits > > 1. Currently based on swappiness, we might walk an entire list > searching for page cache pages or mapped pages. With these > lists separated, it should get easier and faster to implement > this scheme > 2. There is another parallel thread on implementing page cache > limits. If the lists split out, we need not scan the entire > list to find page cache pages to evict them. > > Of course I might be missing something (some piece of history) I actually had patches to do "split active lists" a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is. They actually got a noticable speedup of the swapping kbuild workload, but at this stage there were some more basic improvements needed, so the difference could be smaller today. The other nice thing about it was that it didn't have a hard cutoff that the current reclaim_mapped toggle does -- you could opt to scan the mapped list at a lower ratio than the unmapped one. Of course, it also has some downsides too, and would require retuning... -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
From: Rik van Riel [email blocked] Subject: Re: Why active list and inactive list? Date: Mon, 22 Jan 2007 23:34:27 -0500 Nick Piggin wrote: > The other nice thing about it was that it didn't have a hard > cutoff that the current reclaim_mapped toggle does -- you could > opt to scan the mapped list at a lower ratio than the unmapped > one. Of course, it also has some downsides too, and would > require retuning... Here's a simple idea for tuning. For each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced That way we can determine which list has the largest fraction of "idle" pages sitting around and consequently which list should be scanned more aggressively. For each list we can calculate how frequently the pages in the list are being used: pressure = referenced percentage * scan rate / list size The VM can equalize the pressure by scanning the list with lower usage less than the other list. This way the VM can give the right amount of memory to each type. Of course, each list needs to be divided into inactive and active like the current VM, in order to make sure that the pages which are used once cannot push the real working set of that list out of memory. There is a more subtle problem when the list's working set is larger than the amount of memory the list has. In that situation the VM will be faulting pages back in just after they got evicted. Something like my /proc/refaults code can detect that and adjust the size of the undersized list accordingly. Of course, once we properly distinguish between the more frequently and less frequently accessed pages within each of the page sets (mapped/anonymous vs. unmapped) and have the pressure between the lists equalized, why do we need to keep them separate again? -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
From: Balbir Singh [email blocked] Subject: Re: Why active list and inactive list? Date: Tue, 23 Jan 2007 11:21:38 +0530 Rik van Riel wrote: > Nick Piggin wrote: > >> The other nice thing about it was that it didn't have a hard >> cutoff that the current reclaim_mapped toggle does -- you could >> opt to scan the mapped list at a lower ratio than the unmapped >> one. Of course, it also has some downsides too, and would >> require retuning... > > Here's a simple idea for tuning. > > For each list we keep track of: > 1) the size of the list > 2) the rate at which we scan the list > 3) the fraction of (non new) pages that get > referenced > > That way we can determine which list has the largest > fraction of "idle" pages sitting around and consequently > which list should be scanned more aggressively. > > For each list we can calculate how frequently the pages > in the list are being used: > > pressure = referenced percentage * scan rate / list size > > The VM can equalize the pressure by scanning the list with > lower usage less than the other list. This way the VM can > give the right amount of memory to each type. > This sounds like a good thing to start with. I think we can then use swappiness to decide what to evict. > Of course, each list needs to be divided into inactive and > active like the current VM, in order to make sure that the > pages which are used once cannot push the real working set > of that list out of memory. > Yes, that makes sense. > There is a more subtle problem when the list's working set > is larger than the amount of memory the list has. In that > situation the VM will be faulting pages back in just after > they got evicted. Something like my /proc/refaults code > can detect that and adjust the size of the undersized list > accordingly. > > Of course, once we properly distinguish between the more > frequently and less frequently accessed pages within each > of the page sets (mapped/anonymous vs. unmapped) and have > the pressure between the lists equalized, why do we need > to keep them separate again? > :-) -- Balbir Singh Linux Technology Center IBM, ISTL

Related Links:

How would it go to have

Anonymous (not verified)
on
January 24, 2007 - 3:31am

How would it go to have priorities for paging, the same way there are priorities for CPU usage (e.g. nice,...)?

This could be extended to other resources (e.g. interrupts,...).

I think it's the applications' developers the best to know what the application (e.g. to be in physical memory, more CPU time, to be very responsive to interrupts).

There is still need of memory management, but the priorities can help in its decisions.

Maybe I'm talking dumb, I'm really naive in terms of how the kernel works.

Not a good idea

ChrisLZ (not verified)
on
January 25, 2007 - 6:17am

First off, there are priorities for interrupts - that's a fundamental feature of microprocessors. As to priorities that are tunable by the application programmer, these are generally a bad idea. In systems that do not implement pre-emptive multitasking an application can (and often did in the DOS world) cause all other applications and even the OS itself to grind to a halt.

Ok, thanks for the reply. I

Anonymous (not verified)
on
January 25, 2007 - 7:48am

Ok, thanks for the reply.
I just thought it could be a better approach than the current memlock() and others. In the sense that an application can set itself as having higher priority than others (and/or lower than others) to stay in cached memory. For example some apps that should be very responsive even if the memory is highly used, should be swapped much less. Basically having a caching priority could would be adding another parameter to deal with for the Memory Management system, would this be very bad in terms of performance?

while there's no real way to

Anonymous (not verified)
on
January 25, 2007 - 4:53pm

while there's no real way to prioritize the first half of an interrupt handler, it might be possible to prioritize the second half, though it might be hard/impossible/waste-of-resources to do so. The first half just looks up the interrupt table and goes through the list of interrupt callbacks that kernel level drivers have addded (I think it checks it in order of first device to register with an interrupt number, but either way, it goes through the whole list and calls each driver call back before finishing even if the correct device has already been checked). The second half would be very hard to prioritize because lets say you have a hard drive that's responding to a dma transfer from a request from the block level driver. The block level driver doesn't associate this request with a process because it gets it from the filesystem driver (actually there's the block cache and page caches and all that other crap so I'm simplifying it a lot). So in my opinion, it would be almost impossible to priortize the second half of the interrupt handlers even though they are ran by the schedular (where the first half is called by the hardware itself).
It could be possible to priortize the drivers so the for example, the second half of a interrupt handler for a sound card driver is called before the second half of an interrupt handler for a disk driver. Though hopefully your interrupt handlers don't stack up that high that you would have to schedule them because then it would get a very laggy experenice for the user.

Note: I've only written a few small drivers so I could be wrong.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.