OOM notifications

Previous thread: Re: [RFC PATCH 0/5] Shadow directories by Jan Engelhardt on Thursday, October 18, 2007 - 4:09 pm. (1 message)

Next thread: OOM notifications by Marcelo Tosatti on Thursday, October 18, 2007 - 4:25 pm. (10 messages)
To: <linux-kernel@...>
Cc: <drepper@...>, <riel@...>
Date: Thursday, October 18, 2007 - 4:15 pm

Hi,

AIX contains the SIGDANGER signal to notify applications to free up some
unused cached memory:

http://www.ussg.iu.edu/hypermail/linux/kernel/0007.0/0901.html

There have been a few discussions on implementing such an idea on Linux,
but nothing concrete has been achieved.

On the kernel side Rik suggested two notification points: "about to
swap" (for desktop scenarios) and "about to OOM" (for embedded-like
scenarios).

With that assumption in mind it would be necessary to either have two
special devices for notification, or somehow indicate both events
through the same file descriptor.
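As an illustration of the second option (both events over one file descriptor), here is a minimal sketch. The event codes, the 4-byte record format, and the helper names are assumptions made for this example; the thread does not specify any encoding, and a pipe stands in for the notification device.

```python
import os
import struct

# Hypothetical event codes; the thread names the two events but not an encoding.
EVENT_ABOUT_TO_SWAP = 1
EVENT_ABOUT_TO_OOM = 2

# Assumed wire format: one unsigned 32-bit event code per record.
RECORD = struct.Struct("=I")


def read_event(fd):
    """Read one fixed-size record from fd and return its event code."""
    data = os.read(fd, RECORD.size)
    if len(data) != RECORD.size:
        raise EOFError("short read on notification descriptor")
    (code,) = RECORD.unpack(data)
    return code


def describe(code):
    """Map an event code to a readable name."""
    names = {
        EVENT_ABOUT_TO_SWAP: "about to swap",
        EVENT_ABOUT_TO_OOM: "about to OOM",
    }
    return names.get(code, "unknown")


# A pipe stands in for the notification device, which does not exist here.
reader, writer = os.pipe()
os.write(writer, RECORD.pack(EVENT_ABOUT_TO_SWAP))
os.write(writer, RECORD.pack(EVENT_ABOUT_TO_OOM))
events = [describe(read_event(reader)), describe(read_event(reader))]
os.close(reader)
os.close(writer)
```

With two separate devices the reader would not need to decode anything, at the cost of a second descriptor in every interested process.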

Comments are more than welcome.

-

To: Marcelo Tosatti <marcelo@...>
Cc: <linux-kernel@...>, <drepper@...>, <riel@...>
Date: Tuesday, October 30, 2007 - 10:57 am

Actually, wouldn't a generic netlink interface be more elegant? Then
we could connect it with DBUS and it would be much easier for
applications (Desktop) to handle such events.
I agree that near-OOM conditions are quite volatile, and maybe we
want a technically simple (and thus more reliable) mechanism for the
notification, but I wanted to point out this possibility anyway.

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-

To: Jan Kara <jack@...>
Cc: Marcelo Tosatti <marcelo@...>, <linux-kernel@...>, <drepper@...>
Date: Tuesday, October 30, 2007 - 11:23 am

On Tue, 30 Oct 2007 15:57:20 +0100

There's nothing wrong with being able to get this info via DBUS,
but we cannot expect every database and JVM out there (big targets
for the "reduce your memory footprint" thing on servers) to grow
a DBUS interface.

--
All Rights Reversed
-

To: Rik van Riel <riel@...>
Cc: Marcelo Tosatti <marcelo@...>, <linux-kernel@...>, <drepper@...>
Date: Tuesday, October 30, 2007 - 11:55 am

Hmm, that's right, but the kernel->userspace interface could still be
netlink (which is much more flexible than signals etc.), and then in userspace
we could also implement some simple interface (UNIX socket?) for server-like
apps...

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-

To: Jan Kara <jack@...>
Cc: Marcelo Tosatti <marcelo@...>, <linux-kernel@...>, <drepper@...>
Date: Tuesday, October 30, 2007 - 1:31 pm

On Tue, 30 Oct 2007 16:55:25 +0100

I think we all agree that it should not be a Unix signal, if only
because glibc cannot manipulate memory pools from signal handlers :)

The low memory message (for lack of a better word) needs to get to
userspace over a file descriptor, which the process can select() or
poll() on from its main loop.

Whether that is a device node, a sysfs file, a netlink socket or
something else ... I don't particularly care :)
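
A minimal sketch of what that looks like from the application side, assuming only that the notification arrives as readability on some descriptor. The pipe, the cache dictionary, and the function names are stand-ins invented for this example, not an existing kernel interface.

```python
import os
import select


def main_loop(notify_fd, work_items, on_low_memory):
    """Process work items, checking notify_fd between items without blocking."""
    done = []
    for item in work_items:
        # Timeout 0 makes this a non-blocking readiness check.
        readable, _, _ = select.select([notify_fd], [], [], 0)
        if readable:
            os.read(notify_fd, 4096)  # drain the pending notification
            on_low_memory()           # runs in normal context, not a signal handler
        done.append(item)
    return done


cache = {"a": b"x" * 1024, "b": b"y" * 1024}  # stand-in for application caches

reader, writer = os.pipe()  # stand-in for the low-memory descriptor
os.write(writer, b"\x01")   # simulate the kernel signalling memory pressure
processed = main_loop(reader, ["req1", "req2"], cache.clear)
os.close(reader)
os.close(writer)
```

Because the callback runs from the main loop rather than from a signal handler, it is free to touch allocator state and release memory pools.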

--
All Rights Reversed
-

To: Marcelo Tosatti <marcelo@...>
Cc: <linux-kernel@...>, <drepper@...>, <riel@...>, Martin Bligh <mbligh@...>, <linux-mm@...>
Date: Friday, October 26, 2007 - 5:02 pm

On Thu, 18 Oct 2007 16:15:31 -0400

Martin was talking about some mad scheme wherein you'd create a bunch of
pseudo files (say, /proc/foo/0, /proc/foo/1, ..., /proc/foo/9) and each one
would become "ready" when the MM scanning priority reaches 10%, 20%, ...
100%.

Obviously there would need to be a lot of abstraction to unhook a permanent
userspace feature from a transient kernel implementation, but the basic
idea is that a process which wants to know when the VM is getting into the
orange zone would select() on the file "7" and a process which wants to
know when the VM is getting into the red zone would select on file "9".

It gets more complicated with NUMA memory nodes and cgroup memory
controllers.
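
The bucket idea can be sketched as a mapping from file name to threshold. This is only a model of the scheme described above: the single "pressure percent" input and all function names are assumptions for illustration, and no real /proc files are involved.

```python
NUM_BUCKETS = 10  # files "0" through "9", as in the example above


def bucket_threshold(bucket):
    """Pressure percentage at which a bucket file becomes ready."""
    return (bucket + 1) * 100 // NUM_BUCKETS


def ready_buckets(pressure_percent):
    """List the bucket files that would be ready at the given pressure."""
    return [
        b for b in range(NUM_BUCKETS)
        if pressure_percent >= bucket_threshold(b)
    ]


def should_wake(watched_bucket, pressure_percent):
    """True if a process select()ing on watched_bucket would be woken."""
    return watched_bucket in ready_buckets(pressure_percent)
```

In this model a process watching file "7" wakes at 80% and one watching file "9" only at 100%, which matches the orange-zone and red-zone split described above.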

-

To: Andrew Morton <akpm@...>
Cc: Marcelo Tosatti <marcelo@...>, <linux-kernel@...>, <drepper@...>, <riel@...>, Martin Bligh <mbligh@...>, <linux-mm@...>
Date: Sunday, October 28, 2007 - 5:16 pm

At OLS this year, users wanted userspace notification of OOM
for the cgroup memory controller. When a group is about to OOM,
a notification can help an external application re-adjust
memory limits across the system.

Keeping some memory reserved for handling OOM, this scheme could
be extended to handle global OOM conditions as well.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-

To: Andrew Morton <akpm@...>
Cc: Marcelo Tosatti <marcelo@...>, <linux-kernel@...>, <drepper@...>, <riel@...>, <linux-mm@...>
Date: Friday, October 26, 2007 - 5:05 pm

We ended up not doing that, but making a scanner that saw what
percentage of the LRU was touched in the last n seconds, and
printing that to userspace to deal with.

Turns out priority is a horrible metric to use for this - it
stays at default for ages, then falls off a cliff far too
quickly to react to.
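
A rough sketch of the metric described above, the share of tracked pages touched within the last n seconds. The per-page timestamp dictionary and the function name are assumptions for this example; the actual scanner is not shown in this thread.

```python
def touched_fraction(last_access, now, window_seconds):
    """Fraction of tracked pages accessed within the last window_seconds.

    last_access maps a page identifier to its last access time in seconds.
    """
    if not last_access:
        return 0.0
    cutoff = now - window_seconds
    recent = sum(1 for t in last_access.values() if t >= cutoff)
    return recent / len(last_access)


# Four hypothetical pages; two were touched in the last 10 seconds.
pages = {0: 100.0, 1: 95.0, 2: 40.0, 3: 10.0}
fraction = touched_fraction(pages, now=100.0, window_seconds=10.0)
```

Unlike the scanning priority, a value like this moves gradually as the working set grows, which gives userspace some time to react.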

-

To: Martin Bligh <mbligh@...>
Cc: <marcelo@...>, <linux-kernel@...>, <drepper@...>, <riel@...>, <linux-mm@...>
Date: Friday, October 26, 2007 - 5:11 pm

On Fri, 26 Oct 2007 14:05:47 -0700

Sure, but in terms of high-level userspace interface, being able to
select() on a group of priority buckets (spread across different nodes,
zones and cgroups) seems a lot more flexible than any signal-based
approach we could come up with.
-

To: Andrew Morton <akpm@...>
Cc: Martin Bligh <mbligh@...>, <marcelo@...>, <linux-kernel@...>, <drepper@...>, <linux-mm@...>
Date: Friday, October 26, 2007 - 5:35 pm

On Fri, 26 Oct 2007 14:11:12 -0700

Absolutely, the process needs to be able to just poll or
select on a file descriptor from the process main loop.

I am not convinced that the magic of NUMA memory distribution
and NUMA memory pressure should be visible to userspace. Due
to the thundering herd problem we cannot wake up all of the
processes that select on the file descriptor at the same time
anyway, so we can (later on) add NUMA magic to the process
selection logic in the kernel to only wake up processes on
the right NUMA nodes.

The initial patch probably does not need that.
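
To illustrate the kind of selection logic meant here, a small model: wake at most a fixed number of waiters per event, and only those allowed to run on the node under pressure. The queue layout, the names, and the rotation policy are all assumptions for this sketch, not a description of any kernel code.

```python
from collections import deque


def pick_waiters(waiters, pressured_node, max_wake):
    """Choose up to max_wake waiters whose allowed nodes include pressured_node.

    waiters is a deque of (name, allowed_nodes) pairs. Every waiter that is
    examined is rotated to the back, so repeated events work through the
    queue instead of always waking the same processes.
    """
    chosen = []
    for _ in range(len(waiters)):
        if len(chosen) == max_wake:
            break
        name, nodes = waiters.popleft()
        waiters.append((name, nodes))
        if pressured_node in nodes:
            chosen.append(name)
    return chosen


queue = deque([
    ("db", {0}),
    ("jvm", {0, 1}),
    ("cache", {1}),
    ("web", {0}),
])
first = pick_waiters(queue, pressured_node=0, max_wake=2)
second = pick_waiters(queue, pressured_node=0, max_wake=2)
```

Capping the number of wakeups per event avoids the thundering herd, and the node filter stays entirely on the kernel side of the interface, so userspace never sees it.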

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-

To: Rik van Riel <riel@...>
Cc: Andrew Morton <akpm@...>, <marcelo@...>, <linux-kernel@...>, <drepper@...>, <linux-mm@...>
Date: Friday, October 26, 2007 - 5:59 pm

Depends if you're using cpusets or not, I think?

-

To: Martin Bligh <mbligh@...>
Cc: Andrew Morton <akpm@...>, <marcelo@...>, <linux-kernel@...>, <drepper@...>, <linux-mm@...>
Date: Friday, October 26, 2007 - 6:30 pm

On Fri, 26 Oct 2007 14:59:01 -0700

The kernel knows on which cpuset a process can run.

The process itself may have been relocated to a different
cpuset at runtime, without it even knowing.

Because of that I think the magic of which process(es) to wake
up when there is memory pressure in some NUMA node should
live in the kernel.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-
