I'm running into a VERY frustrating bug with page caching and swap. I'm going to write a long post on this and forward it to the list. In the meantime, is it possible to disable the page cache or limit it to, say, 10% of system memory? I haven't found any way to do this yet. You could do this with RHEL 4.x but I don't recall this ever making it into 2.6 :-/ Feedback appreciated. -- Founder/CEO Tailrank.com Location: San Francisco, CA AIM/YIM: sfburtonator Skype: burtonator Work: http://spinn3r.com and http://tailrank.com Blog: http://feedblog.org Cell: 415-637-8078 Fax: 1-415-358-419 PIN: 0092 --
Hi, On Mon, 5 May 2008 00:12:56 -0700 You might want to look at /proc/sys/vm/ for tunables. /proc/sys/vm/vfs_cache_pressure and /proc/sys/vm/swappiness are probably what you are looking for. See this article for an explanation of swappiness : AFAIK RHEL 4.x runs 2.6.9. Cheers FDC --
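For anyone following along, the two tunables named above can be inspected from a script. What follows is only a minimal sketch: the helper names `read_vm_tunable` and `read_vm_tunables` are made up for illustration, and the `root` parameter exists purely so the code can be pointed at a test directory instead of the real /proc/sys/vm.

```python
import os


def read_vm_tunable(name, root="/proc/sys/vm"):
    """Return the integer value of one VM tunable, e.g. 'swappiness'."""
    with open(os.path.join(root, name)) as f:
        return int(f.read().strip())


def read_vm_tunables(names=("swappiness", "vfs_cache_pressure"), root="/proc/sys/vm"):
    """Read several tunables, skipping any that are missing or not plain integers."""
    values = {}
    for name in names:
        try:
            values[name] = read_vm_tunable(name, root)
        except (OSError, ValueError):
            continue
    return values


if __name__ == "__main__":
    for name, value in read_vm_tunables().items():
        print(f"{name} = {value}")
```

Changing a value means writing the new number to the same file as root; the sketch deliberately only reads.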
I haven't played with vfs_cache_pressure so I'll take a look at this. OK... maybe it was 3.x .. I might be getting confused with their kernel release numbers. Kevin --
On Mon, 5 May 2008 11:34:24 -0700 Could you explain the original problem ? That way we might help you better. Best, Francois --
Probably it is something like this: http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/ kloczek --
Yeah... I wanted to explain it in detail in a longer post. If necessary I'll do that as well. I have a database process that's using 3G of memory. Then I have a secondary process that's using about 500MB of memory. This leaves about 500MB free for anything else. The box has 4GB of memory total... No swap is needed. ... swappiness is set to zero. I've verified that overcommit or other issues aren't getting in the way as I can run the box without swap. This isn't possible when in production though. What's happening is that the kernel is deciding to swap out some memory to disk in the (false) belief that it can free some up for cache. The problem is that my database already maintains its own cache, so there are two conflicting philosophies here, which makes Linux perform in a pathological way. We're using O_DIRECT for our database, which helps reduce the problem a bit. O_DIRECT bypasses the page cache and does all IO directly. This means there isn't much pressure on the page cache and pages aren't swapped out to disk. However, MySQL still needs to do reads on misc files which aren't O_DIRECT, so I end up swapping, but at a much slower rate. What I want to do is either disable the page cache entirely or just tell the OS to cache at most 10% of the available memory. Thoughts? Kevin --
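To make the O_DIRECT point concrete, here is a rough sketch of a single read that asks the kernel to bypass the page cache. It is not how MySQL does it. The function name `read_block_direct` and the 4096-byte `BLOCK` alignment are assumptions for illustration (the real alignment requirement depends on the device and filesystem), and the code falls back to an ordinary buffered read when O_DIRECT is rejected.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on device and filesystem


def read_block_direct(path, offset=0, length=BLOCK):
    """Read up to `length` bytes at `offset`, trying to bypass the page cache.

    Returns (data, used_direct). Falls back to a normal buffered read when the
    platform or filesystem rejects O_DIRECT.
    """
    o_direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if o_direct:
        try:
            fd = os.open(path, os.O_RDONLY | o_direct)
        except OSError:
            fd = -1  # e.g. the filesystem does not support O_DIRECT
        if fd >= 0:
            buf = mmap.mmap(-1, length)  # anonymous mapping, page-aligned
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                n = os.readv(fd, [buf])
                return bytes(buf[:n]), True
            except OSError:
                pass  # e.g. EINVAL for a misaligned offset or length
            finally:
                os.close(fd)
                buf.close()
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length), False
```

The page-aligned buffer is the part that usually trips people up: O_DIRECT generally expects the buffer address, the file offset and the length to be aligned, which is why an anonymous `mmap` is used here instead of a plain `bytearray`.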
On Mon, 5 May 2008 12:42:31 -0700 One of the problems is that the process pages (anonymous memory) and page cache pages live on the same LRU, so the kernel cannot always easily find the page cache pages when it is trying to evict something. Once that is fixed, and replacement is biased towards evicting page cache pages, the system may do the right thing by itself. I realize that this is no quick fix for your issue, but I am working on a split LRU patch series to make sure Linux does the right thing in the future. You can find the latest patch at http://people.redhat.com/riel/splitvm/ I have no such tunable in my code (yet), because I would like the kernel to do the right thing automatically. I will post a 2.6.25 based kernel RPM for Fedora 9 soon with the split LRU patch series applied. If you feel like testing/breaking it, I would be interested to see if it does indeed do the right thing for your workload or if it needs more tuning. -- All Rights Reversed --
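The difference between one shared LRU and split lists can be illustrated with a toy model. This is emphatically not the kernel's algorithm or the code in the patch series: the function names are hypothetical, pages are just dictionary entries, and the "always take page cache pages first" rule is a simplification of the bias described above.

```python
from collections import OrderedDict


def evict_single_lru(pages, count):
    """pages: OrderedDict of page_id -> kind ('anon' or 'file'), oldest first.

    Evicts the `count` least recently used pages regardless of their kind.
    """
    victims = []
    for _ in range(min(count, len(pages))):
        victims.append(pages.popitem(last=False))
    return victims


def evict_split_lru(anon, file_pages, count):
    """anon and file_pages are separate OrderedDicts, oldest first.

    Prefers page cache ('file') pages and touches anonymous pages only once
    the file list is empty (a deliberately crude bias).
    """
    victims = []
    while len(victims) < count and (file_pages or anon):
        source = file_pages if file_pages else anon
        victims.append(source.popitem(last=False))
    return victims


if __name__ == "__main__":
    mixed = OrderedDict([("a1", "anon"), ("f1", "file"), ("f2", "file"), ("a2", "anon")])
    print(evict_single_lru(mixed, 2))  # the oldest anonymous page is a victim here

    anon = OrderedDict([("a1", "anon"), ("a2", "anon")])
    cache = OrderedDict([("f1", "file"), ("f2", "file")])
    print(evict_split_lru(anon, cache, 2))  # only page cache pages are taken
```

In the single-list case an old but important process page is evicted simply because it sits at the cold end of the list; with separate lists the reclaim code can find page cache pages without scanning past process pages.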
Given that the kernel doesn't currently do the right thing, what about providing a tuneable as a temporary fix for people that are currently having problems? Or are you concerned that they will then not be as willing to test the automatic behaviour? Chris --
I'd like to be able to disable the cache altogether..... It seems like an easy tunable to add and might help benefit future testing. --
On Mon, 5 May 2008 14:32:41 -0700 Executables, libraries, shared memory segments, directories and all kinds of other things live in the page cache. You cannot disable it without breaking the most basic functionality. -- All Rights Reversed --
With the same patch you linked to above? No new code.. We're on Debian so I need to dive in and see how easy it would be for us to compile our own kernel. I've done it in the past but I've avoided running anything non-stock for about 2 years now... Well, we'd be deploying it in production :).. what's the ETA for it making it into the official kernel? Kevin --
On Mon, 5 May 2008 14:00:51 -0700 I have created a 2.6.25 based split LRU kernel RPM for Fedora 9: http://people.redhat.com/riel/splitvm/ The srpm and x86-64 RPMs are there. The i686 kernel RPM is still compiling, I will upload it tomorrow morning. -- All rights reversed. --
On Mon, 5 May 2008 14:00:51 -0700 The code seems to work right. What is left now is convincing Andrew Morton and Linus Torvalds to merge it. Getting good test results with reasonable workloads where the vanilla kernel does the wrong thing (eg. your workload) is probably the most important factor for Andrew and Linus in deciding whether or not to merge the code. -- All Rights Reversed --
Note that this is part of a larger thread: http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/ ... and my older post on the subject: http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity... mlockall DOES help solve this problem but then MySQL is forced to stay in memory. If you have too many connections to the server and it mallocs more memory and the system can't page anything else it will OOM kill your mysqld process. It's MUCH better to have MySQL dip 100MB into swap first.... Kevin --
FYI page cache includes the memory your program uses. No page cache would mean no user space, 10% would mean user space uses only 10%. One way that would fulfil your request literally is to boot with mem=<10% of your ram>, but I guess you don't really want that. If you don't want your application to be swapped at all you should probably investigate mlock(2)/mlockall(2) -Andi --
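For reference, mlockall(2) can be exercised from a script through ctypes. This is a sketch only: the wrapper names are made up, and the `MCL_CURRENT` / `MCL_FUTURE` values are the ones from `<sys/mman.h>` on Linux x86 and x86-64, which is an assumption that does not hold on every architecture. The call commonly fails for unprivileged processes because of RLIMIT_MEMLOCK.

```python
import ctypes
import ctypes.util

# Assumed values from <sys/mman.h> on Linux x86/x86-64; some architectures differ.
MCL_CURRENT = 1
MCL_FUTURE = 2


def _libc():
    """Load the C library with errno capture enabled."""
    return ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)


def lock_all_memory(flags=MCL_CURRENT | MCL_FUTURE):
    """Call mlockall(2). Returns (True, 0) on success, else (False, errno)."""
    libc = _libc()
    if libc.mlockall(ctypes.c_int(flags)) == 0:
        return True, 0
    return False, ctypes.get_errno()


def unlock_all_memory():
    """Call munlockall(2); returns True on success."""
    return _libc().munlockall() == 0


if __name__ == "__main__":
    ok, err = lock_all_memory()
    print("locked" if ok else f"mlockall failed, errno={err}")
    unlock_all_memory()
```

A locked process can never be paged out, so this trades swapping for the risk of running out of physical memory outright.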
Maybe this is similar to the "well known" mysql dirty buffer problem? It looks strange to me that sysadmins are forced to swap to ramdisks: http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/ Regards, Bernd --
We're actually running mlock, which is native to MySQL.... of course I didn't think about whether it's our secondary process that is paging rather than our MySQL process. I'm going to have to look into that and figure out if it's possible to tell which apps are being paged. Kevin --
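One way to see which processes have pages in swap is to read the `VmSwap` line of /proc/<pid>/status. Note the assumption: that field only exists on kernels newer than the ones discussed in this thread, so treat this as a sketch for later systems. The helper names and the `proc_root` parameter are illustrative.

```python
import os
import re

_VMSWAP = re.compile(r"^VmSwap:\s+(\d+)\s+kB", re.MULTILINE)
_NAME = re.compile(r"^Name:\s+(.*)$", re.MULTILINE)


def parse_status(text):
    """Return (name, swap_kb) from /proc/<pid>/status text; swap_kb is None if absent."""
    name = _NAME.search(text)
    swap = _VMSWAP.search(text)
    return (name.group(1) if name else "?", int(swap.group(1)) if swap else None)


def swapped_processes(proc_root="/proc"):
    """List (swap_kb, pid, name) for processes with pages in swap, largest first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, swap_kb = parse_status(f.read())
        except OSError:
            continue  # process exited or is not readable
        if swap_kb:
            rows.append((swap_kb, int(entry), name))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    for swap_kb, pid, name in swapped_processes():
        print(f"{swap_kb:>10} kB  {pid:>7}  {name}")
```

Kernel threads have no `VmSwap` line at all, which is why a missing field is treated as "nothing to report" rather than an error.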
Though mlockall isn't going to fix this problem because the OOM killer will still kill it if it runs out of memory. I want to preserve the scenario where we go 10 bytes over memory and the kernel just swaps out those 10 bytes. Kevin --
... and here's another point I don't fully grok: http://www.westnet.com/~gsmith/content/linux-pdflush.htm "By default, Linux will aggressively swap processes out of physical memory onto disk in order to keep the disk cache as large as possible. This means that pages that haven't been used recently will be pushed into swap long before the system even comes close to running out of memory, which is an unexpected behavior compared to some operating systems. The /proc/sys/vm/swappiness parameter controls how aggressive Linux is in this area. " ... "A value of 0 will avoid ever swapping out just for caching space. Using 100 will always favor making the disk cache bigger. Most distributions set this value to be 60, tuned toward moderately aggressive swapping to increase disk cache. " ..... but we're running on zero and this machine still has 500MB of free memory. Why is it swapping? Kevin --
I would suggest you try to understand what I wrote in my first answer. From your replies you didn't seem to. -Andi --
Grok it now.. It seems counterintuitive for both to be included. Kevin --
OK....... I tried again and no such luck unfortunately. Set vfs_cache_pressure to 0 and 200 (default is 100) and no luck either way. A value of 0 is used for swappiness .... Any other ideas? The box has enough memory to do this as I can run it without swap.... Kevin --
