Linux: Debating Swap-Prefetch

Submitted by Jeremy
on May 4, 2007 - 4:51am

Ingo Molnar reviewed Con Kolivas's swap-prefetching patches and suggested that they were ready for inclusion in the mainline kernel: "I've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up." He went on to describe swap prefetch: "to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.)"

Nick Piggin expressed concern that the impact of the code still wasn't understood well enough: "I wanted to see some basic regression tests to show that it hasn't caused obvious problems, and some basic scenarios where it helps, so that we can analyse them. It is really simple, but I haven't got any since first asking." Ingo replied that the patch had generated a lot of positive feedback from users and that it would be best to merge it into the kernel, adding that merging would get more people actively involved: "really, we are likely be better off by risking the merge of _bad_ code (which in the swap-prefetch case is the exact opposite of the truth), than to let code stagnate. People are clearly unhappy about certain desktop aspects of swapping, and the only way out of that is to let more people hack that code. Merging code involves more people. It will cause 'noise' and could cause regressions, but at least in this case the only impact is 'performance' and the feature is trivial to disable."


From: Ingo Molnar [email blocked]
To: Andrew Morton [email blocked]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans
Date:	Thu, 3 May 2007 17:54:07 +0200


* Andrew Morton wrote:

> - If replying, please be sure to cc the appropriate individuals.  
>   Please also consider rewriting the Subject: to something 
>   appropriate.

i'm wondering about swap-prefetch:

  mm-implement-swap-prefetching.patch
  swap-prefetch-avoid-repeating-entry.patch
  add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch

The swap-prefetch feature is relatively compact:

   10 files changed, 745 insertions(+), 1 deletion(-)

it is contained mostly to itself:

   mm/swap_prefetch.c            |  581 ++++++++++++++++++++++++++++++++

i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a 
clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback 
i've seen so far was positive. Time to have this upstream and time for a 
desktop-oriented distro to pick it up.

I think this has been held back way too long. It's .config selectable 
and it is as ready for integration as it ever is going to be. So it's a 
win/win scenario.

Acked-by: Ingo Molnar [email blocked]

	Ingo


From: Con Kolivas [email blocked]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans
Date: Fri, 4 May 2007 08:14:26 +1000

On Friday 04 May 2007 01:54, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> > - If replying, please be sure to cc the appropriate individuals.
> >   Please also consider rewriting the Subject: to something
> >   appropriate.
>
> i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> i've seen so far was positive. Time to have this upstream and time for a
> desktop-oriented distro to pick it up.
>
> I think this has been held back way too long. It's .config selectable
> and it is as ready for integration as it ever is going to be. So it's a
> win/win scenario.
>
> Acked-by: Ingo Molnar [email blocked]

Thank you very much for code review, ack and support!

--
-ck

From: Nick Piggin [email blocked]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans
Date: Fri, 04 May 2007 17:34:03 +1000

Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
>
>>- If replying, please be sure to cc the appropriate individuals.
>>  Please also consider rewriting the Subject: to something
>>  appropriate.
>
>
> i'm wondering about swap-prefetch:

Well I had some issues with it that I don't think were fully discussed,
and Andrew prompted me to say something, but it went off list for a
couple of posts (my fault, sorry). Posting it below with Andrew's
permission...

>   mm-implement-swap-prefetching.patch
>   swap-prefetch-avoid-repeating-entry.patch
>   add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch
>
> The swap-prefetch feature is relatively compact:
>
>    10 files changed, 745 insertions(+), 1 deletion(-)
>
> it is contained mostly to itself:
>
>    mm/swap_prefetch.c            |  581 ++++++++++++++++++++++++++++++++
>
> i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> i've seen so far was positive. Time to have this upstream and time for a
> desktop-oriented distro to pick it up.
>
> I think this has been held back way too long. It's .config selectable
> and it is as ready for integration as it ever is going to be. So it's a
> win/win scenario.

Being able to config all these core heuristics changes is really not
that much of a positive. The fact that we might _need_ to config
something out, and double the configuration range isn't too pleasing.

Here were some of my concerns, and where our discussion got up to.

Andrew Morton wrote:
> On Fri, 04 May 2007 14:34:45 +1000 Nick Piggin [email blocked] wrote:
>
>
>>Andrew Morton wrote:
>>
>>>istr you had issues with swap-prefetch?
>>>
>>>If so, now's a good time to reiterate them ;)
>>
>>1) It is said to help with the updatedb overnight problem, however it
>>   actually _doesn't_ prefetch swap when there are low free pages, which
>>   is how updatedb will leave the system. So this really puzzles me how
>>   it would work. However if updatedb is causing excessive swapout, I
>>   think we should try improving use-once algorithms first, for example.
>
>
> Yes. Perhaps it just doesn't help with the updatedb thing. Or maybe with
> normal system activity we get enough free pages to kick the thing off and
> running. Perhaps updatedb itself has a lot of rss, for example.

Could be, but I don't know. I'd think it unlikely to allow _much_ swapin,
if huge amounts of the desktop have been swapped out. But maybe... as I
said, nobody seems to have a recipe for these things.

> Would be useful to see this claim substantiated with a real testcase,
> description of results and an explanation of how and why it worked.

Yes... and then try to first improve regular page reclaim and use-once
handling.

>>2) It is a _highly_ speculative operation, and in workloads where periods
>>   of low and high page usage with genuinely unused anonymous / tmpfs
>>   pages, it could waste power, memory bandwidth, bus bandwidth, disk
>>   bandwidth...
>
>
> Yes. I suspect that's a matter of waiting for the corner-case reporters to
> complain, then add more heuristics.

Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is
happy to do a _lot_ of work for these things which we have already decided
are least likely to be used again.

>>3) I haven't seen a single set of numbers out of it. Feedback seems to
>>   have mostly come from people who
>
>
> Yup. But can we come up with a testcase? It's hard.

I guess it is hard firstly because swapping is quite random to start with.
But I haven't even seen basic things like "make -jhuge swapstorm has no
regressions".

>>4) If this is helpful, wouldn't it be equally important for things like
>>   mapped file pages? Seems like half a solution.
>
>
> True.
>
> Without thinking about it, I almost wonder if one could do a userspace
> implementation with something which pokes around in /proc/pid/pagemap and
> /proc/pid/kpagemap, perhaps with some additional interfaces added to
> do a swapcache read. (Give userspace the ability to get at swapcache
> via a regular fd?)
>
> (otoh the akpm usersapce implementation is swapoff -a;swapon -a)

Perhaps. You may need a few indicators to see whether the system is idle...
but OTOH, we've already got a lot of indicators for memory, disk usage,
etc. So, maybe :)

>>5) New one: it is possibly going to interact badly with MADV_FREE lazy
>>   freeing. The more complex we make page reclaim, the worse IMO.
>
>
> That's a bit vague. What sort of problems do you envisage?

Well MADV_FREE pages aren't technically free, are they? So it might be
possible for a significant number of them to build up and prevent
swap prefetch from working. Maybe.

>>...) I had a few issues with implementation, like interaction with
>>     cpusets. Don't know if these are all fixed or not. I sort of gave
>>     up looking at it.
>
>
> Ah yes, I remember some mention of cpusets. I forget what it was though.

I could be wrong, but IIRC there is no good way to know which cpuset to
bring the page back into, (and I guess similarly it would be hard to know
what container to account it to, if doing account-on-allocate).

We could hope that users of these features would be mostly disjoint sets,
but that's an evil road to start heading down, where we have various core
bits of mm that don't play nice together.

--
SUSE Labs, Novell Inc.

From: Ingo Molnar [email blocked]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans
Date: Fri, 4 May 2007 10:52:01 +0200

* Nick Piggin [email blocked] wrote:

> > i'm wondering about swap-prefetch:

> Being able to config all these core heuristics changes is really not
> that much of a positive. The fact that we might _need_ to config
> something out, and double the configuration range isn't too pleasing.

Well, to the desktop user this is a speculative performance feature that
he is willing to potentially waste CPU and IO capacity, in expectation
of better performance.

On the conceptual level it is _precisely the same thing as regular file
readahead_. (with the difference that to me swapahead seems to be quite
a bit more intelligent than our current file readahead logic.)

This feature has no API or ABI impact at all, it's a pure performance
feature. (besides the trivial sysctl to turn it runtime on/off).

> Here were some of my concerns, and where our discussion got up to.

> > Yes. Perhaps it just doesn't help with the updatedb thing. Or
> > maybe with normal system activity we get enough free pages to kick
> > the thing off and running. Perhaps updatedb itself has a lot of
> > rss, for example.
>
> Could be, but I don't know. I'd think it unlikely to allow _much_
> swapin, if huge amounts of the desktop have been swapped out. But
> maybe... as I said, nobody seems to have a recipe for these things.

can i take this one as a "no fundamental objection"? There are really
only 2 maintainance options left:

  1) either you can do it better or at least have a _very_ clearly
     described idea outlined about how to do it differently

  2) or you should let others try it

#1 you've not done for 2-3 years since swap-prefetch was waiting for
integration so it's not an option at this stage anymore. Then you are
pretty much obliged to do #2. ;-)

And let me be really blunt about this, there is no option #3 to say: "I
have no real better idea, I have no code, I have no time, but hey, lets
not merge this because it 'could in theory' be possible to do it better"
=B-)

really, we are likely be better off by risking the merge of _bad_ code
(which in the swap-prefetch case is the exact opposite of the truth),
than to let code stagnate. People are clearly unhappy about certain
desktop aspects of swapping, and the only way out of that is to let more
people hack that code. Merging code involves more people. It will cause
'noise' and could cause regressions, but at least in this case the only
impact is 'performance' and the feature is trivial to disable.

The maintainance drag outside of swap_prefetch.c is essentially _zero_.
If the feature doesnt work it ends up on Con's desk. If it turns out to
not work at all (despite years of testing and happy users) it still only
ends up on Con's desk. A clear win/win scenario for you i think :-)

> > Would be useful to see this claim substantiated with a real
> > testcase, description of results and an explanation of how and why
> > it worked.
>
> Yes... and then try to first improve regular page reclaim and use-once
> handling.

agreed. Con, IIRC you wrote a testcase for this, right? Could you please
send us the results of that testing?

> >>2) It is a _highly_ speculative operation, and in workloads where periods
> >>   of low and high page usage with genuinely unused anonymous / tmpfs
> >>   pages, it could waste power, memory bandwidth, bus bandwidth, disk
> >>   bandwidth...
> >
> > Yes. I suspect that's a matter of waiting for the corner-case
> > reporters to complain, then add more heuristics.
>
> Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch
> is happy to do a _lot_ of work for these things which we have already
> decided are least likely to be used again.

i see no real problem here. We've had heuristics for a _long_ time in
various areas of the code. Sometimes they work, sometimes they suck.

the flow of this is really easy: distro looking for a feature edge turns
it on and announces it, if the feature does not work out for users then
user turns it off and complains to distro, if enough users complain then
distro turns it off for next release, upstream forgets about this
performance feature and eventually removes it once someone notices that
it wouldnt even compile in the past 2 main releases. I see no problem
here, we did that in the past too with performance features. The
networking stack has literally dozens of such small tunable things which
get experimented with, and whose defaults do get tuned carefully. Some
of the knobs help bandwidth, some help latency.

I do not even see any risk of "splitup of mindshare" - swap-prefetch is
so clearly speculative that it's not really a different view about how
to do swapping (which would split the tester base, etc.), it's simply a
"do you want your system to speculate about the future or not" add-on
decision. Every system has a pretty clear idea about that: desktops
generally want to do it, clusters generally dont want to do it.

> >>3) I haven't seen a single set of numbers out of it. Feedback seems to
> >>   have mostly come from people who
> >
> > Yup. But can we come up with a testcase? It's hard.

i think Con has a testcase.

> >>4) If this is helpful, wouldn't it be equally important for things like
> >>   mapped file pages? Seems like half a solution.

[...]

> > (otoh the akpm usersapce implementation is swapoff -a;swapon -a)
>
> Perhaps. You may need a few indicators to see whether the system is
> idle... but OTOH, we've already got a lot of indicators for memory,
> disk usage, etc. So, maybe :)

The time has passed for this. Let others play too. Please :-)

> I could be wrong, but IIRC there is no good way to know which cpuset
> to bring the page back into, (and I guess similarly it would be hard
> to know what container to account it to, if doing
> account-on-allocate).

(i think cpusets are totally uninteresting in this context: nobody in
their right mind is going to use swap-prefetch on a big NUMA box. Nor
can i see any fundamental impediment to making this more cpuset-aware,
just like other subsystems were made cpuset-aware, once the requests
from actual users came in and people started getting interested in it.)

I think the "lack of testcase and numbers" is the only valid technical
objection i've seen so far. Con might be able to help us with that?

	Ingo

From: Nick Piggin [email blocked]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans
Date: Fri, 04 May 2007 19:09:14 +1000

Ingo Molnar wrote:
> * Nick Piggin [email blocked] wrote:

>>Here were some of my concerns, and where our discussion got up to.
>
>
>>>Yes. Perhaps it just doesn't help with the updatedb thing. Or
>>>maybe with normal system activity we get enough free pages to kick
>>>the thing off and running. Perhaps updatedb itself has a lot of
>>>rss, for example.
>>
>>Could be, but I don't know. I'd think it unlikely to allow _much_
>>swapin, if huge amounts of the desktop have been swapped out. But
>>maybe... as I said, nobody seems to have a recipe for these things.
>
>
> can i take this one as a "no fundamental objection"? There are really
> only 2 maintainance options left:
>
>   1) either you can do it better or at least have a _very_ clearly
>      described idea outlined about how to do it differently
>
>   2) or you should let others try it
>
> #1 you've not done for 2-3 years since swap-prefetch was waiting for
> integration so it's not an option at this stage anymore. Then you are
> pretty much obliged to do #2. ;-)

The burden is not on me to get someone else's feature merged. If it
can be shown to work well and people's concerns addressed, then anything
will get merged. The reason Linux is so good is because of what we don't
merge, figuratively speaking.

I wanted to see some basic regression tests to show that it hasn't caused
obvious problems, and some basic scenarios where it helps, so that we can
analyse them. It is really simple, but I haven't got any since first
asking.

And note that I don't think I ever explicitly "nacked" anything, just
voiced my concerns. If my concerns had been addressed, then I couldn't
have stopped anybody from merging anything.

>>>>2) It is a _highly_ speculative operation, and in workloads where periods
>>>>   of low and high page usage with genuinely unused anonymous / tmpfs
>>>>   pages, it could waste power, memory bandwidth, bus bandwidth, disk
>>>>   bandwidth...
>>>
>>>Yes. I suspect that's a matter of waiting for the corner-case
>>>reporters to complain, then add more heuristics.
>>
>>Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch
>>is happy to do a _lot_ of work for these things which we have already
>>decided are least likely to be used again.
>
>
> i see no real problem here. We've had heuristics for a _long_ time in
> various areas of the code. Sometimes they work, sometimes they suck.

So that's one of my issues with the code. If all you have to support a
merge is anecodal evidence, then I find it interesting that you would
easily discount something like this.

>>>>4) If this is helpful, wouldn't it be equally important for things like
>>>>   mapped file pages? Seems like half a solution.
>
> [...]
>
>>>(otoh the akpm usersapce implementation is swapoff -a;swapon -a)
>>
>>Perhaps. You may need a few indicators to see whether the system is
>>idle... but OTOH, we've already got a lot of indicators for memory,
>>disk usage, etc. So, maybe :)
>
>
> The time has passed for this. Let others play too. Please :-)

Play with what? Prefetching mmaped file pages as well? Sure.

>>I could be wrong, but IIRC there is no good way to know which cpuset
>>to bring the page back into, (and I guess similarly it would be hard
>>to know what container to account it to, if doing
>>account-on-allocate).
>
>
> (i think cpusets are totally uninteresting in this context: nobody in
> their right mind is going to use swap-prefetch on a big NUMA box. Nor
> can i see any fundamental impediment to making this more cpuset-aware,
> just like other subsystems were made cpuset-aware, once the requests
> from actual users came in and people started getting interested in it.)

OK, so make it more cpuset aware. This isn't a new issue, I raised it a
long time ago. And trust me, it is a nightmare to just assume that nobody
will use cpusets on a small box for example (AFAIK the resource control
guys are looking at doing just that).

All core VM features should play nicely with each other without *really*
good reason.

> I think the "lack of testcase and numbers" is the only valid technical
> objection i've seen so far.

Well you're entitled to your opinion too.

--
SUSE Labs, Novell Inc.
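
Two points from the thread above lend themselves to a small illustration: Nick Piggin's observation that swap prefetch does not run when free pages are low, and Andrew Morton's musing about a userspace implementation driven by existing memory indicators. The Python sketch below is not taken from the swap-prefetch patch and does not reproduce its heuristics. It only parses /proc/meminfo-style text and applies an invented rule: prefetching is considered worthwhile when something is swapped out and a configurable share of RAM is free. The function names and the 25% threshold are hypothetical; only the MemTotal, MemFree, SwapTotal and SwapFree field names come from the real /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swapped_out_kb(info):
    """Amount of swap currently in use, in kB."""
    return max(info.get("SwapTotal", 0) - info.get("SwapFree", 0), 0)


def should_prefetch(info, min_free_fraction=0.25):
    """Hypothetical rule: prefetch only if swap is in use and enough RAM is free.

    The 0.25 default is an arbitrary illustration, not a value from the patch.
    """
    total = info.get("MemTotal", 0)
    if total <= 0:
        return False
    free_fraction = info.get("MemFree", 0) / total
    return swapped_out_kb(info) > 0 and free_fraction >= min_free_fraction


# Made-up sample in the /proc/meminfo format, so the sketch runs anywhere.
SAMPLE = """\
MemTotal:        1000000 kB
MemFree:          400000 kB
SwapTotal:        500000 kB
SwapFree:         300000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(swapped_out_kb(info), should_prefetch(info))  # prints: 200000 True
```

On a Linux system the same functions could be fed the contents of /proc/meminfo instead of SAMPLE. With little free memory, which is the state Nick expects updatedb to leave behind, the rule returns False. That is the situation his first objection describes.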

Bad page-replacement

Anonymous (not verified)
on May 10, 2007 - 8:04pm

The problem is the ingenious page-replacement algorithm in Linux, which, when pressure is high, requires an unattainably high usage rate for a page to avoid being paged out. This cannot be solved by swap readahead (which is a good idea, BTW).

Future

Anonymous (not verified)
on July 11, 2007 - 1:06pm

In the future, swap, swap-prefetching and caching won't be needed.

It sucks that hard disk drives (HDD) are so slow. In the future when everybody has solid-state drives, everything will be much faster.
