Re: Ability to limit or disable page caching?

To: <linux-kernel@...>
Date: Monday, May 5, 2008 - 3:12 am

I'm running into a VERY frustrating bug with page caching and swap.

I'm going to write a long post on this and forward it to the list.

In the meantime, is it possible to disable the page cache or limit it
to, say, 10% of system memory?

I haven't found any way to do this yet.

You could do this with RHEL 4.x but I don't recall this ever making it
into 2.6 :-/

Feedback appreciated.

-- 
Founder/CEO Tailrank.com
Location: San Francisco, CA
AIM/YIM: sfburtonator
Skype: burtonator
Work: http://spinn3r.com and http://tailrank.com
Blog: http://feedblog.org
Cell: 415-637-8078
Fax: 1-415-358-419 PIN: 0092
--
To: Kevin Burton <burton@...>
Cc: <linux-kernel@...>
Date: Monday, May 5, 2008 - 7:16 am

Hi,

On Mon, 5 May 2008 00:12:56 -0700

You might want to look at /proc/sys/vm/ for tunables.
/proc/sys/vm/vfs_cache_pressure and /proc/sys/vm/swappiness are probably
what you are looking for.
See this article for an explanation of swappiness:

AFAIK RHEL 4.x runs 2.6.9.
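
For readers following along, here is a minimal sketch of how the two
tunables named above could be inspected. The helper name
read_vm_tunables and its root parameter are illustrative only (root
exists so the function can be pointed at a test directory); the file
paths are the ones given in this message.

```python
from pathlib import Path

# Tunables named in the thread; both live under /proc/sys/vm on Linux.
TUNABLES = ("swappiness", "vfs_cache_pressure")


def read_vm_tunables(names=TUNABLES, root="/proc/sys/vm"):
    """Return {name: int value}, or None for files that cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # not Linux, or tunable missing on this kernel
    return values


if __name__ == "__main__":
    for name, value in read_vm_tunables().items():
        print(f"vm.{name} = {value}")
```

Changing a value requires root, for example by writing the new number
to the same file or running "sysctl -w vm.swappiness=0".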

Cheers

FDC
--
To: FD Cami <francois.cami@...>
Cc: <linux-kernel@...>
Date: Monday, May 5, 2008 - 2:34 pm

I haven't played with vfs_cache_pressure so I'll take a look at this.


OK... maybe it was 3.x .. I might be getting confused with their
kernel release numbers.

Kevin

--
To: Kevin Burton <burtonator@...>
Cc: <linux-kernel@...>
Date: Monday, May 5, 2008 - 3:28 pm

On Mon, 5 May 2008 11:34:24 -0700

Could you explain the original problem? That way we might help you better.

Best,

Francois
--
To: FD Cami <francois.cami@...>
Cc: Kevin Burton <burtonator@...>, <linux-kernel@...>
Date: Tuesday, May 6, 2008 - 7:02 am

Probably it is something like this:

http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/

kloczek
--
To: FD Cami <francois.cami@...>
Cc: <linux-kernel@...>
Date: Monday, May 5, 2008 - 3:42 pm

Yeah... I wanted to explain it in detail in a longer post.  If
necessary I'll do that as well.

I have a database process that's using 3G of memory. Then I have a
secondary process that's using about 500MB of memory.  This leaves
about 500MB free for anything else. The box has 4GB of memory total...
No swap is needed.

... swappiness is set to zero.

I've verified that overcommit or other issues aren't getting in the
way as I can run the box without swap.  This isn't possible when in
production though.

What's happening is that the kernel is deciding to swap out some
memory to disk in the (false) belief that it can free some up for
cache.

The problem is that my database already maintains its own cache, so
there are two conflicting caching philosophies here, which makes Linux
behave pathologically.

We're using O_DIRECT for our database, which helps reduce the problem a
bit. O_DIRECT bypasses the page cache and does all I/O directly.
This means there isn't much pressure on the page cache and pages
aren't swapped out to disk.  However, MySQL still needs to read
miscellaneous files that aren't opened with O_DIRECT, so I end up
swapping, but at a much slower rate.
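
As an illustration of what an O_DIRECT read looks like from user space,
here is a rough sketch. It is not how MySQL does it; the function name
read_first_block and the 4096-byte block size are assumptions made for
the example. O_DIRECT generally needs an aligned buffer, so the sketch
uses an anonymous mmap (which is page-aligned), and it falls back to a
normal buffered read where the filesystem rejects O_DIRECT.

```python
import mmap
import os

O_DIRECT = getattr(os, "O_DIRECT", 0)  # 0 on platforms without O_DIRECT


def read_first_block(path, block_size=4096):
    """Read up to block_size bytes from the start of path.

    Returns (data, used_direct). O_DIRECT is tried first; if the open or
    the read is rejected, a plain buffered read is used instead.
    """
    buf = mmap.mmap(-1, block_size)  # anonymous mapping, page-aligned
    try:
        for flags in (os.O_RDONLY | O_DIRECT, os.O_RDONLY):
            try:
                fd = os.open(path, flags)
            except OSError:
                continue  # e.g. filesystem refuses O_DIRECT at open time
            try:
                n = os.readv(fd, [buf])
                return buf[:n], bool(flags & O_DIRECT)
            except OSError:
                continue  # e.g. alignment rejected; try the buffered path
            finally:
                os.close(fd)
        raise OSError(f"could not read {path}")
    finally:
        buf.close()
```

Any file opened without the flag, such as the miscellaneous files
mentioned above, still goes through the page cache as usual.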

What I want to do is either disable the page cache entirely or just
tell the OS to cache at max 10% of the available memory.

Thoughts?

Kevin




--
To: Kevin Burton <burton@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 4:44 pm

On Mon, 5 May 2008 12:42:31 -0700

One of the problems is that the process pages (anonymous memory) and
page cache pages live on the same LRU, so the kernel cannot always
easily find the page cache pages when it is trying to evict something.

Once that is fixed, and replacement is biased towards evicting page
cache pages, the system may do the right thing by itself.
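
A toy model may help picture the difference. This is not kernel code
and not the split LRU patch itself; the function names and the page
labels are made up for illustration. It only counts how many pages have
to be examined before a page cache page is found, first with one mixed
list and then with page cache pages kept on their own list.

```python
from collections import deque


def evict_single_lru(lru):
    """One mixed list, coldest page first. Returns (page, pages_scanned)."""
    for scanned, (page, kind) in enumerate(lru, start=1):
        if kind == "file":
            lru.remove((page, kind))
            return page, scanned
    return None, len(lru)


def evict_split_lru(file_lru):
    """Page cache pages on their own list: the coldest one is at the head."""
    if not file_lru:
        return None, 0
    return file_lru.popleft(), 1


if __name__ == "__main__":
    mixed = [("db-heap-1", "anon"), ("db-heap-2", "anon"),
             ("db-heap-3", "anon"), ("log-file", "file")]
    print(evict_single_lru(mixed))              # scans past the anonymous pages
    print(evict_split_lru(deque(["log-file"])))  # finds the file page directly
```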

I realize that this is no quick fix for your issue, but I am working
on a split LRU patch series to make sure Linux does the right thing in
the future.

You can find the latest patch at http://people.redhat.com/riel/splitvm/

I have no such tunable in my code (yet), because I would like the
kernel to do the right thing automatically.  I will post a 2.6.25 based
kernel RPM for Fedora 9 soon with the split LRU patch series applied.
If you feel like testing/breaking it, I would be interested to see if
it does indeed do the right thing for your workload or if it needs more
tuning.

-- 
All Rights Reversed
--
To: Rik van Riel <riel@...>
Cc: Kevin Burton <burton@...>, FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 5:03 pm

Given that the kernel doesn't currently do the right thing, what about 
providing a tuneable as a temporary fix for people that are currently 
having problems?

Or are you concerned that they will then not be as willing to test the 
automatic behaviour?

Chris
--
To: Chris Friesen <cfriesen@...>
Cc: Rik van Riel <riel@...>, FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 5:32 pm

I'd like to be able to disable the cache altogether.....  It seems
like an easy tunable to add and might help benefit future testing.


--
To: Kevin Burton <burton@...>
Cc: Chris Friesen <cfriesen@...>, FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 6:06 pm

On Mon, 5 May 2008 14:32:41 -0700

Executables, libraries, shared memory segments, directories and all kinds
of other things live in the page cache.  You cannot disable it without
breaking the most basic functionality.

-- 
All Rights Reversed
--
To: Rik van Riel <riel@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 5:00 pm

With the same patch you linked to above?  No new code..

We're on Debian so I need to dive in and see how easy it would be for
us to compile our own kernel.

I've done it in the past but I've avoided running anything non-stock
for about 2 years now...


Well, we'd be deploying it in production :).. What's the ETA for it
making it into the official kernel?

Kevin


--
To: Kevin Burton <burton@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Wednesday, May 7, 2008 - 10:08 pm

On Mon, 5 May 2008 14:00:51 -0700

I have created a 2.6.25 based split LRU kernel RPM for Fedora 9:

http://people.redhat.com/riel/splitvm/

The srpm and x86-64 RPMs are there.  The i686 kernel RPM is
still compiling, I will upload it tomorrow morning.

-- 
All rights reversed.
--
To: Kevin Burton <burton@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 6:08 pm

On Mon, 5 May 2008 14:00:51 -0700


The code seems to work right. What is left now is convincing Andrew Morton
and Linus Torvalds to merge it.

Getting good test results with reasonable workloads where the vanilla
kernel does the wrong thing (eg. your workload) is probably the most
important factor for Andrew and Linus in deciding whether or not to
merge the code. 

-- 
All Rights Reversed
--
To: Rik van Riel <riel@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 6:15 pm

Note that this is part of a larger thread:

http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/

...

and my older post on the subject:

http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity...

mlockall DOES help solve this problem but then MySQL is forced to stay
in memory.

If you have too many connections to the server and it mallocs more
memory and the system can't page anything else it will OOM kill your
mysqld process.

It's MUCH better to have MySQL dip 100MB into swap first....

Kevin




--
To: Kevin Burton <burton@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 4:10 pm

FYI page cache includes the memory your program uses. No page 
cache would mean no user space, 10% would mean user space uses only
10%. One way that would fulfil your request literally is to boot with 
mem=<10% of your RAM>, but I guess you don't really want that.

If you don't want your application to be swapped at all you
should probably investigate mlock(2)/mlockall(2)
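
For reference, a minimal sketch of calling mlockall(2) from a script
through ctypes. The wrapper name lock_all_memory is hypothetical, and
the flag values are an assumption: they match <sys/mman.h> on x86
Linux, but some architectures define them differently. The call
usually needs root or a sufficiently high RLIMIT_MEMLOCK.

```python
import ctypes
import ctypes.util
import os

# Assumption: x86 Linux values from <sys/mman.h>; check your platform.
MCL_CURRENT = 1  # lock pages that are mapped now
MCL_FUTURE = 2   # also lock pages mapped later


def lock_all_memory(flags=MCL_CURRENT | MCL_FUTURE):
    """Ask the kernel to keep all of this process's pages resident in RAM.

    Raises OSError (with errno set) if the kernel refuses.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A locked process cannot be paged out at all, so if it keeps growing
the only remaining outcome is an allocation failure or the OOM killer.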

-Andi
--
To: <linux-kernel@...>
Date: Monday, May 5, 2008 - 5:24 pm

Maybe this is similar to the "well known" mysql dirty buffer problem?

It looks strange to me that sysadmins are forced to swap to ramdisks:

http://blogs.smugmug.com/don/2008/05/01/mysql-and-the-linux-swap-problem/

Gruss
Bernd
--
To: Andi Kleen <andi@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 4:28 pm

We're actually running mlock, which is native to MySQL.... Of course, I
didn't think about whether it's our secondary process that is paging
rather than our MySQL process.

I'm going to have to look into that and see whether it's possible to
find out which apps are being paged.
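
One possible way to check, sketched under an assumption: kernels newer
than the 2.6.25 discussed in this thread expose a VmSwap line in
/proc/<pid>/status, and this script relies on that field. The function
names and the proc_root parameter are illustrative only.

```python
import os
import re


def parse_vmswap_kb(status_text):
    """Extract the VmSwap value (in kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def swapped_processes(proc_root="/proc"):
    """Return [(swap_kb, pid, name)] for processes with pages in swap, largest first."""
    results = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        try:
            with open(os.path.join(proc_root, pid, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited, or status not readable
        swap_kb = parse_vmswap_kb(text)
        if swap_kb:
            # The first line of the status file is "Name:\t<command>".
            name = text.split("\n", 1)[0].partition(":")[2].strip()
            results.append((swap_kb, int(pid), name))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    for swap_kb, pid, name in swapped_processes():
        print(f"{swap_kb:>10} kB  {pid:>7}  {name}")
```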

Kevin




--
To: Andi Kleen <andi@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 4:31 pm

Though mlockall isn't going to fix this problem because the OOM killer
will still kill it if it runs out of memory.

I want to preserve the scenario where we go 10 bytes over memory and
the kernel just swaps out those 10 bytes.

Kevin



--
To: Andi Kleen <andi@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 4:40 pm

... and here's another point I don't fully grok:

http://www.westnet.com/~gsmith/content/linux-pdflush.htm

"By default, Linux will aggressively swap processes out of physical
memory onto disk in order to keep the disk cache as large as possible.
This means that pages that haven't been used recently will be pushed
into swap long before the system even comes close to running out of
memory, which is an unexpected behavior compared to some operating
systems. The /proc/sys/vm/swappiness parameter controls how aggressive
Linux is in this area. "

...

"A value of 0 will avoid ever swapping out just for caching space.
Using 100 will always favor making the disk cache bigger. Most
distributions set this value to be 60, tuned toward moderately
aggressive swapping to increase disk cache. "

..... but we're running on zero and this machine still has 500MB of
free memory.  Why is it swapping?
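
To see how "free" memory, page cache and swap usage relate on a given
box, something like the following sketch could be run. It parses the
text of /proc/meminfo; the function names are illustrative, and
treating Cached plus Buffers as "page cache" is a rough approximation.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Separate truly free memory from page cache and report swap in use."""
    return {
        "free_kb": info.get("MemFree", 0),
        "page_cache_kb": info.get("Cached", 0) + info.get("Buffers", 0),
        "swap_used_kb": info.get("SwapTotal", 0) - info.get("SwapFree", 0),
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        for name, kb in memory_summary(parse_meminfo(f.read())).items():
            print(f"{name}: {kb}")
```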

Kevin




--
To: Kevin Burton <burton@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 8:23 pm

I would suggest you try to understand what I wrote in my first answer.
From your replies you didn't seem to.

-Andi

--
To: Andi Kleen <andi@...>
Cc: FD Cami <francois.cami@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 8:30 pm

Grok it now.. It seems counterintuitive for both to be included.

Kevin




--
To: FD Cami <francois.cami@...>
Cc: <linux-kernel@...>
Date: Monday, May 5, 2008 - 3:33 pm

OK....... I tried again and no such luck unfortunately.

I set vfs_cache_pressure to 0 and to 200 (the default is 100) and had no luck either way.

A value of 0 is used for swappiness ....

Any other ideas...

The box has enough memory to do this as I can run it without swap....

Kevin

--