Re: Linux machines dying in swap storms

From: Richard Purdie
Date: Thursday, October 25, 2007 - 8:20 am

I've got a problem I keep running into. My computers have buggy software
which can sometimes run out of control. Two specific examples:

Evolution: Sometimes its memory usage decides to suddenly grow out of
control. It usually idles at around 300MB; you can watch it in top,
doubling, trebling and ending up past the 1200MB mark. My system
has 1.5GB of RAM and you notice it swapping heavily past, say, 800MB.

Spamassassin: If my mail server log files hit the 2GB file size limit of
the filesystem something strange happens and for whatever reason spamd
suddenly starts growing in memory usage until it uses up all available
system memory.

Arguably both pieces of software are buggy, I accept that, fine. 

In both machines in totally different circumstances what happens next is
bad. The systems swap more and more heavily trying to cope with these
out of control processes. Network interactivity stops. The swap storm
gets so bad you can't log onto the console any more. I've left machines
in this state for 1-2 hours and they don't come back. Watching the
console, the OOM killer does kick in but it never kills the problem
process (both spamd and evolution are long running processes that have
suddenly gone out of control). In the end, you have to hit the reset
switch :(. This happened to my desktop once again about 10 minutes ago
and it's *extremely* frustrating. Sometimes I can catch and kill the
offending process but I shouldn't have to.

This isn't a new problem. My mail server used to be running an ancient
2.6.12 kernel and I upgraded it to 2.6.22.X in an effort to solve this
problem, with no change. My desktop shows exactly the same kind of OOM
swap storm behaviour (2.6.20 based).

I realise that tuning the OOM killer is a really tricky problem but
something needs improving as the current user experience is broken.

I'm seriously tempted to add a "kill the process using the most memory"
key combination into SysRq which might let me save the desktop but won't
help with my remote ...
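
A minimal userspace sketch of the "kill the process using the most memory"
idea, assuming resident set size (the VmRSS line in /proc/PID/status) is an
acceptable measure of "most memory". The function name and the proc_root
parameter are hypothetical, introduced only for illustration; this is not
the SysRq handler proposed above.

```python
import os

def largest_rss_process(proc_root="/proc"):
    """Return (pid, name, rss_kb) for the process with the largest VmRSS,
    or None if no process with a VmRSS line was found."""
    best = None
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        fields = {}
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except OSError:
            continue  # the process exited while we were scanning
        if "VmRSS" not in fields:
            continue  # kernel threads report no VmRSS
        rss_kb = int(fields["VmRSS"].split()[0])
        if best is None or rss_kb > best[2]:
            best = (int(entry), fields.get("Name", "?"), rss_kb)
    return best
```

The returned pid could then be passed to os.kill(pid, signal.SIGKILL). Two
caveats: during a swap storm much of the runaway process may already be
swapped out, so RSS can undercount it, and a userspace tool like this may
itself be unable to run once the machine stops responding.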
From: Alan Cox
Date: Thursday, October 25, 2007 - 9:13 am

For specific applications you can set resource limits, you can also set
OOM priorities in current kernels to pick who dies.

Finally you can disable overcommit and go for a rigid "no overcommit"
policy where the system will fail any memory allocation which might lead
to out of memory situations later.

Alan
-
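
To make the "no overcommit" policy concrete: with vm.overcommit_memory set
to 2, the kernel refuses allocations once total committed memory would
exceed swap plus a percentage of RAM (vm.overcommit_ratio, default 50). A
rough sketch of that arithmetic follows; the function name is hypothetical,
huge pages are ignored, and the 2GB of swap in the usage note is an assumed
figure, since the thread does not state the swap size of the 1.5GB machine.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Approximate commit limit under strict overcommit (mode 2):
    swap plus overcommit_ratio percent of RAM, in kB."""
    return swap_kb + ram_kb * overcommit_ratio // 100
```

For a machine with 1.5GB of RAM (1572864 kB) and an assumed 2GB of swap
(2097152 kB), commit_limit_kb(1572864, 2097152) gives 2883584 kB, about
2.75GB. Allocations beyond that fail up front instead of pushing the system
into swap later.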

From: Richard Purdie
Date: Thursday, October 25, 2007 - 11:28 am

I couldn't seem to find much documentation on this. For the archive and
to confirm we're talking about the same thing, you mean:

echo 10 > /proc/PID/oom_adj

(and ulimit/setrlimit for the resource limits) ?

This assumes I know in advance which processes are likely to go mad.

It's certainly another option, but other processes then suffer because
certain applications have bugs in them?

Thanks,

Richard

-
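
For the archive, the echo command above can be wrapped in a few lines. This
is a sketch assuming the oom_adj interface of kernels from this period,
which accepts values from -17 (never kill) to 15 (kill first). The function
name and the proc_root parameter are hypothetical.

```python
import os

def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an OOM adjustment for pid, like `echo value > /proc/PID/oom_adj`."""
    if not -17 <= value <= 15:
        raise ValueError("oom_adj must be between -17 and 15")
    with open(os.path.join(proc_root, str(pid), "oom_adj"), "w") as f:
        f.write(f"{value}\n")
```

Lowering the value for another user's process, or below its current value,
generally needs privilege. Later kernels added /proc/PID/oom_score_adj
(range -1000 to 1000) and deprecated oom_adj, so check which file your
kernel provides.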

From: Rik van Riel
Date: Thursday, October 25, 2007 - 11:34 am

On Thu, 25 Oct 2007 16:20:41 +0100

I can't see any easy hacks or workarounds to fix the issue in the
current MM, except maybe activate the OOM killer if the amount of
page cache and buffer cache is really low and swap is full...

In the longer run, I'm working on:

http://linux-mm.org/PageReplacementDesign

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-
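
The condition suggested above (page cache and buffer cache really low, swap
full) can be approximated from userspace by reading /proc/meminfo. The
sketch below is only an illustration of that check, not the in-kernel
change being discussed; both function names and both thresholds (5% of RAM
for cache, 2% of swap free) are assumptions picked for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            info[key.strip()] = int(rest.split()[0])
    return info

def looks_like_swap_storm(info, cache_floor=0.05, swap_floor=0.02):
    """True when cache is nearly gone and swap is (almost) exhausted."""
    cache_kb = info["Cached"] + info["Buffers"]
    cache_low = cache_kb < cache_floor * info["MemTotal"]
    # "<=" means a machine with no swap at all counts as having swap full.
    swap_full = info["SwapFree"] <= swap_floor * info["SwapTotal"]
    return cache_low and swap_full
```

Typical use would be looks_like_swap_storm(parse_meminfo(open("/proc/meminfo").read())),
polled by a small watchdog that has been locked into memory.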

From: Simon Arlott
Date: Thursday, October 25, 2007 - 12:55 pm

I have no swap. If I accidentally start The GIMP and load a very large 
image, everything just freezes and I have to reboot - the OOM killer 
doesn't appear to care.

-- 
Simon Arlott
-

From: David Newall
Date: Thursday, October 25, 2007 - 7:08 pm

Ulimit them.
-
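
A sketch of what "ulimit them" can look like when launching a known
offender from a wrapper: cap the child's address space with setrlimit
before it execs. The function name is hypothetical and the cap you choose
is up to you; this limits virtual address space (RLIMIT_AS, what
`ulimit -v` sets in the shell), not resident memory.

```python
import resource
import subprocess

def run_with_address_space_cap(argv, max_bytes, **run_kwargs):
    """Run argv with RLIMIT_AS set to max_bytes in the child only."""
    def apply_limit():
        # Runs in the child between fork() and exec(), so the parent
        # process keeps its own limits.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return subprocess.run(argv, preexec_fn=apply_limit, **run_kwargs)
```

For example, run_with_address_space_cap(["gimp"], 1 << 30) would start the
program with a 1GiB cap, so an oversized allocation fails inside that
program instead of taking the machine down. As noted earlier in the thread,
this only helps for programs you already suspect.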

From: Lenar Lõhmus
Date: Friday, October 26, 2007 - 8:14 am

Hi,

It seems very similar to my case, where it is very easy to
completely trash the computer in 30 seconds.

Config: Core 2 CPU, 2GB memory, 2GB swap, 64-bit OS.
Software: latest stable xorg, firefox2, latest gnash (all from gutsy)

Go to the site http://www.epl.ee/ and almost immediately you lose
control of your computer - the mouse gets very jerky, clicks aren't
registered, switching consoles doesn't work. If I'm quick enough, I can
log on over ssh remotely and kill all the gnash processes. But some
moments later even that is not possible. It just thrashes the hard drive.

Yes, one could set up different limits and whatsoever ... but my point
is - it is too easy to kill your Linux computer (or your friend's server).
This should be changed.

With best,
Lenar


-
