Re: [PATCH 0/8][for -mm] mem_notify v6

Previous thread: [PATCH] efs: move headers out of include/linux/ by Christoph Hellwig on Saturday, February 9, 2008 - 2:20 am. (1 message)

Next thread: [PATCH 1/8][for -mm] mem_notify v6: introduce poll_wait_exclusive() by KOSAKI Motohiro on Saturday, February 9, 2008 - 8:21 am. (1 message)
From: KOSAKI Motohiro
Date: Saturday, February 9, 2008 - 8:19 am

Hi

/dev/mem_notify is a low-memory notification device.
It can avoid swapping and OOM by cooperating with user processes.

The Linux Today article is a very nice description (great work by Jake Edge):
http://www.linuxworld.com/news/2008/020508-kernel.html

<quoted>
When memory gets tight, it is quite possible that applications have memory
allocated—often caches for better performance—that they could free.
After all, it is generally better to lose some performance than to face the
consequences of being chosen by the OOM killer.
But, currently, there is no way for a process to know that the kernel is
feeling memory pressure.
The patch provides a way for interested programs to monitor the /dev/mem_notify
 file to be notified if memory starts to run low.
</quoted>
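A minimal sketch of how an interested program might wait on such a descriptor and drop a cache when it fires. It assumes the device is opened read-only and signals pressure as readability (POLLIN); the `cache` argument and the function names are hypothetical, and /dev/mem_notify exists only with this patch series applied.

```python
import os
import select

# Path taken from the patch series; it is not present on unpatched kernels.
MEM_NOTIFY_PATH = "/dev/mem_notify"


def wait_for_low_memory(fd, timeout_ms):
    """Wait until fd is readable or the timeout expires.

    Returns True if the descriptor signalled, False on timeout.
    Assumption: memory pressure is reported as POLLIN.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(ready_fd == fd and bool(mask & select.POLLIN)
               for ready_fd, mask in events)


def react_once(fd, cache, timeout_ms):
    """Clear the (hypothetical) application cache if a notification arrives."""
    if wait_for_low_memory(fd, timeout_ms):
        cache.clear()
        return True
    return False
```

On a patched kernel the descriptor would come from `os.open(MEM_NOTIFY_PATH, os.O_RDONLY)`; any pollable descriptor, such as a pipe, exercises the same code path.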


You need not be annoyed by OOM any longer :)
Any comments are welcome!

patch list
       [1/8] introduce poll_wait_exclusive() new API
       [2/8] introduce wake_up_locked_nr() new API
       [3/8] introduce /dev/mem_notify new device (the core of this patch series)
       [4/8] memory_pressure_notify() caller
       [5/8] add new mem_notify field to /proc/zoneinfo
       [6/8] (optional) fixed incorrect shrink_zone
       [7/8] ignore very small zones to prevent incorrect low-memory notifications
       [8/8] support fasync feature


related discussion:
--------------------------------------------------------------
 LKML OOM notifications requirement discussion
    http://www.gossamer-threads.com/lists/linux/kernel/832802?nohighlight=1#832802
 OOM notifications patch [Marcelo Tosatti]
    http://marc.info/?l=linux-kernel&m=119273914027743&w=2
 mem notifications v3 [Marcelo Tosatti]
    http://marc.info/?l=linux-mm&m=119852828327044&w=2
 Thrashing notification patch  [Daniel Spang]
    http://marc.info/?l=linux-mm&m=119427416315676&w=2
 mem notification v4
    http://marc.info/?l=linux-mm&m=120035840523718&w=2
 mem notification v5
    ...
From: Jonathan Corbet
Date: Monday, February 11, 2008 - 8:36 am

Just for future reference...the above-mentioned article is from LWN,
syndicated onto LinuxWorld.  It has, so far as I know, never been near
Linux Today.

Glad you liked it, though :)

Thanks,

jon
-

From: KOSAKI Motohiro
Date: Monday, February 11, 2008 - 8:46 am

Oops, sorry.
I had a serious misunderstanding ;-)

Sorry again,
and thank you for your helpful message.
-

From: Jon Masters
Date: Saturday, February 9, 2008 - 9:02 am

Yo,

Interesting patch series (I am being yuppie and reading this thread  
from my iPhone on a treadmill at the gym - so further comments later).  
I think that this is broadly along the lines that I was thinking, but  
this should be an RFC only patch series for now.

Some initial questions:

Where is the netlink interface? Polling an FD is so last century :)

What testing have you done?

Still, it is good to start with some code - eventually we might just  
have a full reservation API created. Rik and I and others have bounced  
ideas around for a while and I hope we can pitch in. I will play with  
these patches later.

Jon.



-

From: KOSAKI Motohiro
Date: Saturday, February 9, 2008 - 9:33 am

Thank you.

To be honest, I don't know of anyone who uses netlink, or why anyone
would want to receive low-memory notifications through netlink.

poll() is an old way, but it works well enough.

Also, netlink has a weak point:
in the end, the netlink philosophy is a read/write model.

I am afraid that too many low-memory messages would queue up in the
netlink buffer under heavy pressure.

Great.
Any ideas and any discussion are welcome.
-

From: Rik van Riel
Date: Saturday, February 9, 2008 - 9:43 am

On Sun, 10 Feb 2008 01:33:49 +0900

More importantly, all gtk+ programs, as well as most databases and other
system daemons have a poll() loop as their main loop.

A file descriptor fits that main loop perfectly.
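As a rough illustration of that point, here is a sketch of one iteration of a poll()-style main loop where a notification descriptor sits next to an ordinary one. The callbacks and their return values are hypothetical; only the standard `selectors` module is used.

```python
import os
import selectors


def on_client_data(fd):
    """Ordinary work the program already does in its main loop."""
    return ("client", os.read(fd, 4096))


def on_memory_pressure(fd):
    """Hypothetical reaction; a real program would trim its caches here."""
    return ("low-memory", None)


def run_once(sel, timeout=0):
    """Dispatch every ready descriptor to the callback registered with it."""
    handled = []
    for key, _mask in sel.select(timeout):
        callback = key.data
        handled.append(callback(key.fileobj))
    return handled
```

The notification descriptor is registered exactly like any other, for example `sel.register(notify_fd, selectors.EVENT_READ, on_memory_pressure)`, so no separate thread or socket protocol is needed.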

-- 
All rights reversed.
-

From: KOSAKI Motohiro
Date: Saturday, February 9, 2008 - 9:49 am

Not only gtk+, but maybe all modern GUI programs :)
-

From: Paul Jackson
Date: Sunday, February 17, 2008 - 7:49 am

I just noticed this patchset, kosaki-san.  It looks quite interesting;
my apologies for not commenting earlier.

I see mention somewhere that mem_notify is of particular interest to
embedded systems.

I have what seems, intuitively, a similar problem at the opposite
end of the world, on big-honkin NUMA boxes (hundreds or thousands of
CPUs, terabytes of main memory.)  The problem there is often best
resolved if we can kill the offending task, rather than shrink its
memory footprint.  The situation is that several compute intensive
multi-threaded jobs are running, each in their own dedicated cpuset.

If one of these jobs tries to use more memory than is available in
its cpuset, then

  (1) we quickly lose any hope of that job continuing at the excellent
      performance needed of it, and

  (2) we rapidly get increased risk of that job starting to swap and
      unintentionally impact shared resources (kernel locks, disk
      channels, disk heads).

So we like to identify such jobs as soon as they begin to swap,
and kill them very very quickly (before the direct reclaim code
in mm/vmscan.c can push more than a few pages to the swap device.)

For a much earlier, unsuccessful, attempt to accomplish this, see:

	[Patch] cpusets policy kill no swap
	http://lkml.org/lkml/2005/3/19/148

Now, it may well be that we are too far apart to share any part of
a solution; one seldom uses the same technology to build a Tour de
France bicycle as one uses to build a Lockheed C-5A Galaxy heavy
cargo transport.

One clear difference is the policy of what action we desire to take
when under memory pressure: do we invite user space to free memory so
as to avoid the wrath of the oom killer, or do we go to the opposite
extreme, seeking a nearly instantaneous kill, faster than the oom
killer can even begin its search for a victim.

Another clear difference is the use of cpusets, which are a major and
vital part of administering the big NUMA boxes, and I presume are not
even compiled into ...
From: KOSAKI Motohiro
Date: Tuesday, February 19, 2008 - 12:36 am

Hi Paul,

Thank you for your wonderful and interesting comments.
They are really nice.

I was an HPC guy with large NUMA boxes in the past.
I promise I won't ignore HPC users.
Unfortunately, I have no experience using CPUSET,
because at that time it was still under development.

I hope to discuss the CPUSET use case and the mem_notify requirements with you.


You want to kill the process just after it starts to swap, right?
Unfortunately, almost all users hope to receive the notification before swap ;-)
because they want to avoid swapping.


Hmm, sorry.
I don't understand your patch yet, because I don't know CPUSET very well.


Yes, some embedded distributions (e.g. MontaVista) are distributed as source,
but embedded people strongly dislike bloated code size.
I think they never turn on CPUSET.


I think you are talking about a user-space OOM manager.
That model and the many-user-processes model are obviously different.

I suspect the memory manager daemon model doesn't work on desktops and
typical servers.
Thus, the current implementation is optimized for environments with no manager.

Of course, that doesn't mean I refuse to add code for an OOM manager.
It is a very interesting idea.


Excellent!
That is a really good idea.


Hmmm, I don't think so.
I think the timing of memory_pressure_notify(1) is already the best.

A page moving from the active list to the inactive list indicates that
swap I/O will happen a bit later.

But memory_pressure_notify(0) is a bit messy.

Disagreed.
That is too late.


That makes sense.
I will learn about cpusets and think about integrating mem_notify with them.


And,

please don't think I reject your idea.
Your proposal is very different from our past discussion, and
I don't know cpusets.

I think we probably can't drop all of the current design and accept all of your idea,
but we may be able to accept parts of it until the HPC guys are content enough.

I will learn more about CPUSET in a few days.
After that, we can discuss more.

Please wait for a while.

Thanks!



-

From: Paul Jackson
Date: Tuesday, February 19, 2008 - 8:00 am

There is not much my customers' HPC jobs can do with notification before
swap.  Their jobs either have the main memory they need to perform the
requested calculations with the desired performance, or their job is
useless and should be killed.  Unlike the applications you describe,
my customers' jobs have no way, once running, to adapt to less memory.
They can only adapt to less memory by being restarted with a different
set of resource requests to the job scheduler (the application that
manages job requests, assigns them CPU, memory and other resources,
and monitors, starts, stops and pauses jobs.)

The primary difficulty my HPC customers have is killing such jobs fast
enough, before a bad job (one that attempts to use more memory than it
signed up for) can harm the performance of other users and the rest of
the system.

I don't mind if pages are slowly or occasionally written to swap;
but as soon as the task wants to reclaim big chunks of memory by
writing thousands of pages at once to swap, it must die, and die

Yes - understood and agreed - as I guessed, cpusets are not configured

Yes - I agree that my ideas were quite different.  Please don't
hesitate to reject every one of them, like a Samurai slicing through

For your work, yes that hook is too late.  Agreed.

Depending on what we're trying to do:
 1) warn applications of swap coming soon (your case),
 2) show how close we are to swapping,
 3) show how much swap has happened already,
 4) kill instantly if try to swap (my hpc case),
 5) measure file i/o caused by memory pressure, or
 6) perhaps other goals,
we will need to hook different places in the kernel.

It may well be that your hooks for embedded are simply in different
places than my hooks for HPC.  If so, that's fine.

I look forward to your further thoughts.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
-

From: Rik van Riel
Date: Tuesday, February 19, 2008 - 12:02 pm

On Tue, 19 Feb 2008 09:00:08 -0600

Don't forget the "hooks for desktop" :)

Basically in all situations, the kernel needs to warn at the same point
in time: when the system is about to run out of RAM for anonymous pages.

In the desktop case, that leads to swapping (and programs can free memory).

In the embedded case, it leads to OOM (and a management program can kill or
restart something else, or a program can restart itself).

In the HPC case, it leads to swapping (and a management program can kill or
restart something else).

I do not see the kernel side being different between these situations, only
userspace reacts differently in the different scenarios.

Am I overlooking something?

-- 
All Rights Reversed
-

From: Paul Jackson
Date: Tuesday, February 19, 2008 - 1:18 pm

Thanks for stopping by ...

Perhaps with the cgroup based memory controller in progress, or with
other work I'm overlooking, this is or will no longer be a problem,
but on 2.6.16 kernels (the latest ones I have in major production HPC
use) this is not sufficient.

As of at least that point, we don't (didn't ?) have sufficiently
accurate numbers of when we were "about to run out".  We can only
detect when "we just did run out", as evidenced by entering the direct
reclaim code, or as by slightly later events such as starting to push
Anon pages to the swap device from direct reclaim.

Actually, even the point that we enter direct reclaim, near the bottom
of __alloc_pages(), isn't adequate either, as we could be there because
some thread in that cpuset is trying to write out a results file that
is larger than that cpuset's memory.  In that case, we really don't want
to kill the job ... it just needs to be (and routinely is) throttled
back to disk speeds as it completes the write out of dirty file system
pages.

So the first clear spot that we -know- serious swapping is commencing
is where the direct reclaim code calls a writepage op with an Anon
page. At that point, having a management program intervene is entirely
too late.  Even having the task at that instant, inline, tag itself
with a SIGKILL, as it queues that first Anon page to a swap device, is
too late.  The direct reclaim code can loop, pushing hundreds or
thousands of pages, on big memory systems, to the swapper, in the
current reclaim loop, before it pops the stack far enough back to even
notice that it has a SIGKILL pending on it.  The suppression of pushing
pages to the swapper has to happen right there, inline in some
mm/vmscan.c code, as part of the direct reclaim loops.

(Hopefully I said something stupid in that last paragraph, and you will
be able to correct it ... it sure would be useful ;).

A year or two ago, I added the 'memory_pressure' per-cpuset meter to
Linux, in an effort to realize just what you ...
From: Paul Jackson
Date: Tuesday, February 19, 2008 - 1:43 pm

I'm forgetting an important detail here.  Kosaki-san has clearly stated
that this hook, at vmscan's writepage, is too late for his embedded needs,
and that they need the feedback a bit earlier, when the page moves from
the active list to the inactive list.

However, except for the placement of such hooks in three or four
places, rather than just one, it may well be (if cpusets could be
factored out) that one mechanism would meet all needs ... except for
that pesky HPC need for throttling to more or less zero the swapping
from select cpusets.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
-

From: Pavel Machek
Date: Tuesday, February 19, 2008 - 3:28 pm

Sounds like a job for memory limits (ulimit?), not for OOM
notification, right?
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Paul Jackson
Date: Tuesday, February 19, 2008 - 6:54 pm

Er eh -- which one?

The only one I see that might help keep a multi-threaded job
using various kinds of memory on multiple nodes confined could
be the resident set size (RLIMIT_RSS; ulimit -m).  So far as
I can tell, that one is a pure no-op in Linux.

Here's the bash list of all available ulimit (setrlimit) options:

              -a     All current limits are reported
              -c     The maximum size of core files created
              -d     The maximum size of a process's data segment
              -e     The maximum scheduling priority ("nice")
              -f     The maximum size of files written by the shell and its children
              -i     The maximum number of pending signals
              -l     The maximum size that may be locked into memory
              -m     The maximum resident set size
              -n     The maximum number of open file descriptors (most systems do not allow this value to be set)
              -p     The pipe size in 512-byte blocks (this may not be set)
              -q     The maximum number of bytes in POSIX message queues
              -r     The maximum real-time scheduling priority
              -s     The maximum stack size
              -t     The maximum amount of cpu time in seconds
              -u     The maximum number of processes available to a single user
              -v     The maximum amount of virtual memory available to the shell
              -x     The maximum number of file locks

Did I miss seeing one that would be useful?

Actually, given the chronic problem we've had over the years accounting
for how much memory in total (including text, data, stack, mapped
files, locked pages, kernel memory structures that an application is
using many of, ...), I'd be surprised if any such ulimit existed that
actually worked for this purpose (confining an HPC job to using almost
exactly all the memory available to it, but no more.)
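For reference, the memory-related limits in that list can be read from a running process. The sketch below uses Python's standard `resource` module on Linux; the label strings and helper names are illustrative only. It only reports what is configured and says nothing about whether the kernel enforces a given limit.

```python
import resource

# Memory-related entries from the ulimit list, keyed by an illustrative label.
MEMORY_LIMITS = {
    "data segment (-d)": resource.RLIMIT_DATA,
    "locked memory (-l)": resource.RLIMIT_MEMLOCK,
    "resident set (-m)": resource.RLIMIT_RSS,
    "stack (-s)": resource.RLIMIT_STACK,
    "virtual memory (-v)": resource.RLIMIT_AS,
}


def format_limit(value):
    """Render a raw rlimit value the way `ulimit -a` would describe it."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memory_limits():
    """Return {label: (soft, hard)} for the current process, in raw units (bytes)."""
    return {
        label: tuple(format_limit(v) for v in resource.getrlimit(which))
        for label, which in MEMORY_LIMITS.items()
    }
```

Each of these limits applies per process, which is part of why none of them maps cleanly onto a multi-threaded job spread over several nodes.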

-- 
                  I won't rest till it's the best ...
                  ...
From: Rik van Riel
Date: Tuesday, February 19, 2008 - 7:07 pm

On Tue, 19 Feb 2008 23:28:28 +0100

I suspect one problem could be that an HPC job scheduling program
does not know exactly how much memory each job can take, so it can
sometimes end up making a mistake and overcommitting the memory on
one HPC node.

In that case the user is better off having that job killed and
restarted elsewhere, than having all of the jobs on that node
crawl to a halt due to swapping.

Paul, is this guess correct? :)

-- 
All rights reversed.
-

From: KOSAKI Motohiro
Date: Tuesday, February 19, 2008 - 7:48 pm

Yes.
Fujitsu HPC middleware watches the sum of the memory consumption of the job
and, if over-consumption happens, kills the processes and removes the job
from the schedule.

I think that is a common HPC requirement.
But we watch a user-defined memory limit, not swap.
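A rough sketch of that kind of watcher, not the Fujitsu middleware itself: it sums VmRSS from /proc/<pid>/status over the processes of one job and compares the total with a user-defined limit. The function names and the `read_status` hook are hypothetical, and summing per-process RSS counts shared pages more than once.

```python
def parse_vm_rss_kb(status_text):
    """Extract VmRSS in kB from the text of /proc/<pid>/status (0 if absent)."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0


def _read_proc_status(pid):
    with open(f"/proc/{pid}/status") as f:
        return f.read()


def job_rss_kb(pids, read_status=_read_proc_status):
    """Sum VmRSS over all processes that belong to one job."""
    total = 0
    for pid in pids:
        try:
            total += parse_vm_rss_kb(read_status(pid))
        except (FileNotFoundError, ProcessLookupError):
            pass  # the process exited between listing and reading
    return total


def over_limit(pids, limit_kb, read_status=_read_proc_status):
    """True if the job's summed RSS exceeds the user-defined limit."""
    return job_rss_kb(pids, read_status) > limit_kb
```

A scheduler would call `over_limit()` periodically and kill the job when it returns True; how often to poll is a policy choice this sketch leaves open.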

Thanks.


-

From: Paul Jackson
Date: Tuesday, February 19, 2008 - 9:57 pm

Did those jobs share nodes -- sometimes two or more jobs using the same
nodes?  I am sure SGI has such users too, though such job mixes make
the runtimes of specific jobs less obvious, so customers are more
tolerant of variations and some inefficiencies, as they get hidden in
the mix.

In other words, Rik, both yes and no ;).  Both sorts of HPC loads
exist, sharing nodes and a dedicated set of nodes for each job.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
-

From: KOSAKI Motohiro
Date: Tuesday, February 19, 2008 - 10:21 pm

Hm.
Our dedicated-node users set the memory limit to the machine's physical
memory size (minus a bit).

I don't think there is much difference between shared and dedicated nodes,
or between watching a user-defined limit and watching swap.
Am I misunderstanding?


-

From: Paul Jackson
Date: Tuesday, February 19, 2008 - 9:36 pm

Not for the loads I focus on.  Each job gets exclusive use of its own
dedicated set of nodes, for the duration of the job.  With that comes a
quite specific upper limit on how much memory, in total, including node
local kernel data, that job is allowed to use.

One problem with swapping is that nodes aren't entirely isolated.
They share buses, i/o channels, disk arms, kernel data cache lines and
kernel locks with other nodes, running other jobs.   A job thrashing
its swap is a drag on the rest of the system.

Another problem with swapping is that it's a waste of resources.  Once
a pure compute bound job goes into swapping when it shouldn't, that job
has near zero hope of continuing with the intended performance, as it
has just slowed from main memory speeds to disk speeds, which are
thousands of times slower.  Best to get it out of there, immediately.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
-

From: Tom May
Date: Tuesday, April 1, 2008 - 4:35 pm

On Sat, Feb 9, 2008 at 8:19 AM, KOSAKI Motohiro

Thanks for this patch set!  I ported it to 2.6.23.9 and tried it, on a
system with no swap since I'm evaluating this for an embedded system.
In practice, the criterion it uses for notifications wasn't sufficient to avoid
memory problems, including OOM, in a cyclic allocate/notify/free
sequence which is probably typical.

I tried it with a real-world program that, among other things, mmaps
anonymous pages and touches them at a reasonable speed until it gets
notified via /dev/mem_notify, releases most of them with
madvise(MADV_DONTNEED), then loops to start the cycle again.
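One cycle of that pattern can be sketched as follows. This is not the real-world program mentioned above; sizes and names are made up, and the wait on /dev/mem_notify is replaced by a comment. It assumes Linux and Python 3.8 or later for `mmap.madvise`.

```python
import mmap

PAGE = mmap.PAGESIZE


def touch_pages(region, n_pages):
    """Write one byte per page so the kernel must back it with real memory."""
    for i in range(n_pages):
        region[i * PAGE] = 1
    return n_pages


def release_pages(region, keep_pages, n_pages):
    """Hand all but the first keep_pages pages back with MADV_DONTNEED."""
    released = max(n_pages - keep_pages, 0)
    if released:
        region.madvise(mmap.MADV_DONTNEED, keep_pages * PAGE, released * PAGE)
    return released


def one_cycle(n_pages=256, keep_pages=16):
    """Allocate, touch, then release most of a private anonymous mapping."""
    region = mmap.mmap(-1, n_pages * PAGE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        touched = touch_pages(region, n_pages)
        # A real program would block here until /dev/mem_notify signals.
        released = release_pages(region, keep_pages, n_pages)
        return touched, released
    finally:
        region.close()
```

The mapping is private on purpose: Python's `mmap.mmap(-1, ...)` defaults to a shared mapping, for which MADV_DONTNEED behaves differently.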

What tends to happen is that I do indeed get notifications via
/dev/mem_notify when the kernel would like to be swapping, at which
point I free memory.  But the notifications come at a time when the
kernel needs memory, and it gets the memory by discarding some Cached
or Mapped memory (I can see these decreasing in /proc/meminfo with
each notification).  With each mmap/notify/madvise cycle the Cached
and Mapped memory gets smaller, until eventually while I'm touching
pages the kernel can't find enough memory and will either invoke the
OOM killer or return ENOMEM from syscalls.  This is precisely the
situation I'm trying to avoid by using /dev/mem_notify.

The criterion of "notify when the kernel would like to swap" feels
correct, but in addition I seem to need something like "notify when
cached+mapped+free memory is getting low".
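A user-space approximation of that second criterion could look like the sketch below, which parses /proc/meminfo text and sums MemFree, Cached and Mapped. The threshold and function names are hypothetical, and the three fields can overlap, so the sum is only a rough indicator.

```python
def parse_meminfo(text):
    """Parse the text of /proc/meminfo into {field: value}, values mostly in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def reclaimable_kb(fields):
    """Sum the three counters named in the message above."""
    return (fields.get("MemFree", 0)
            + fields.get("Cached", 0)
            + fields.get("Mapped", 0))


def is_low(fields, threshold_kb):
    """True when cached+mapped+free has dropped below the chosen threshold."""
    return reclaimable_kb(fields) < threshold_kb
```

In practice the text would come from `open("/proc/meminfo").read()`, sampled each time a notification arrives or on a timer.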

I'll need to be looking into doing this, so any comments or ideas are
welcome.

Thanks,
.tom
--
