Re: [PATCH 0/8][for -mm] mem_notify v6

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: <linux-mm@...>, <linux-kernel@...>, <kosaki.motohiro@...>, <marcelo@...>, <daniel.spang@...>, <riel@...>, <akpm@...>, <alan@...>, <linux-fsdevel@...>, <pavel@...>, <a1426z@...>, <jonathan@...>, <zlynx@...>
Date: Sunday, February 17, 2008 - 10:49 am

I just noticed this patchset, kosaki-san.  It looks quite interesting;
my apologies for not commenting earlier.

I see mention somewhere that mem_notify is of particular interest to
embedded systems.

I have what seems, intuitively, a similar problem at the opposite
end of the world, on big-honkin NUMA boxes (hundreds or thousands of
CPUs, terabytes of main memory.)  The problem there is often best
resolved if we can kill the offending task, rather than shrink its
memory footprint.  The situation is that several compute intensive
multi-threaded jobs are running, each in their own dedicated cpuset.

If one of these jobs tries to use more memory than is available in
its cpuset, then

  (1) we quickly loose any hope of that job continuing at the excellent
      performance needed of it, and

  (2) we rapidly get increased risk of that job starting to swap and
      unintentionally impact shared resources (kernel locks, disk
      channels, disk heads).

So we like to identify such jobs as soon as they begin to swap,
and kill them very very quickly (before the direct reclaim code
in mm/vmscan.c can push more than a few pages to the swap device.)

For a much earlier, unsuccessful, attempt to accomplish this, see:

	[Patch] cpusets policy kill no swap
	http://lkml.org/lkml/2005/3/19/148

Now, it may well be that we are too far apart to share any part of
a solution; one seldom uses the same technology to build a Tour de
France bicycle as one uses to build a Lockheed C-5A Galaxy heavy
cargo transport.

One clear difference is the policy of what action we desire to take
when under memory pressure: do we invite user space to free memory so
as to avoid the wrath of the oom killer, or do we go to the opposite
extreme, seeking a nearly instantant killing, faster than the oom
killer can even begin its search for a victim.

Another clear difference is the use of cpusets, which are a major and
vital part of administering the big NUMA boxes, and I presume are not
even compiled into embedded kernels (correct?).  This difference maybe
unbridgeable ... these big NUMA systems require per-cpuset mechanisms,
whereas embedded may require builds without cpusets.

However ... there might be some useful cross pollination of ideas.

I see in the latest posts to your mem_notify patchset v6, responding
to comments by Andrew and Andi on Feb 12 and 13, that you decided to
think more about the design of this, so perhaps this is a good time
for some random ideas from myself, even though I'm clearly coming from
a quite different problem space in some ways.

1) You have a little bit of code in the kernel to throttle the
   thundering herd problem.  Perhaps this could be moved to user space
   ... one user daemon that is always notified of such memory pressure
   alarms, and in turn notifies interested applications.  This might
   avoid the need to add poll_wait_exclusive() to the kernel.  And it
   moves any fussy details of how to tame the thundering herd out of
   the kernel.

2) Another possible mechanism for communicating events from
   the kernel to user space is inotify.  For example, I added
   the line:

   	fsnotify_modify(dentry);   # dentry is current tasks cpuset

   at an interesting spot in vmscan.c, and using inotify-tools
   <inotify-tools.sourceforge.net/> could easily watch all cpusets
   for these events from one user space daemon.

   At this point, I have no idea whether this odd use of inotify
   is better or worse than what your patchset has.  However using
   inotify did require less new kernel code, and with such user space
   mechanisms as inotify-tools already well developed, it made the
   problem I had, of watching an entire hierarcy of special files
   (beneath /dev/cpuset) very easy to implement.  At least inotify
   also presents events on a file descriptor that can be consumed
   using a poll() loop.

3) Perhaps, instead of sending simple events, one could update
   a meter of the rate of recent such events, such as the per-cpuset
   'memory_pressure' mechanism does.  This might lead to addressing
   Andrew Morton's comment:

	If this feature is useful then I'd expect that some
	applications would want notification at different times, or at
	different levels of VM distress.  So this semi-randomly-chosen
	notification point just won't be strong enough in real-world
	use.

4) A place that I found well suited for my purposes (watching for
   swapping from direct reclaim) was just before the lines in the
   pageout() routine in mm/vmscan.c:

   	if (clear_page_dirty_for_io(page)) {
		...
		res = mapping->a_ops->writepage(page, &wbc);

   It seemed that testing "PageAnon(page)" here allowed me to easily
   distinguish between dirty pages going back to the file system, and
   pages going to swap (this detail is from work on a 2.6.16 kernel;
   things might have changed.)

   One possible advantage of the above hook in the direct reclaim
   code path in vmscan.c is that pressure in one cpuset did not cause
   any false alarms in other cpusets.  However even this hook does
   not take into account the constraints of mm/mempolicy (the NUMA
   memory policy that Andi mentioned) nor of cgroup memory controllers.

5) I'd be keen to find an agreeable way that you could have the
   system-wide, no cpuset, mechanism you need, while at the same
   time, I have a cpuset interface that is similar and depends on the
   same set of hooks.  This might involve a single set of hooks in
   the key places in the memory and swapping code, that (1) updated
   the system wide state you need, and (2) if cpusets were present,
   updated similar state for the tasks current cpuset.  The user
   visible API would present both the system-wide connector you need
   (the special file or whatever) and if cpusets are present, similar
   per-cpuset connectors.

Anyhow ... just some thoughts.  Perhaps one of them will be useful.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
[PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Sat Feb 9, 11:19 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Tom May, (Tue Apr 1, 7:35 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Wed Apr 2, 3:31 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Tom May, (Mon Apr 14, 8:16 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Thu Apr 17, 5:30 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Tom May, (Thu Apr 17, 3:23 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Daniel Spång, (Wed Apr 23, 4:27 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Tom May, (Wed Apr 30, 10:07 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Thu May 1, 11:06 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Tom May, (Fri May 2, 6:21 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Sat May 3, 8:26 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Tom May, (Tue May 6, 1:22 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Fri Apr 18, 6:07 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Tom May, (Mon Apr 21, 4:32 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Tue Apr 15, 10:30 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Tom May, (Wed Apr 2, 1:45 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Paul Jackson, (Sun Feb 17, 10:49 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Tue Feb 19, 3:36 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Paul Jackson, (Tue Feb 19, 11:00 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Pavel Machek, (Tue Feb 19, 6:28 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Rik van Riel, (Tue Feb 19, 10:07 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Paul Jackson, (Wed Feb 20, 12:36 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Tue Feb 19, 10:48 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Paul Jackson, (Wed Feb 20, 12:57 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Wed Feb 20, 1:21 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Paul Jackson, (Tue Feb 19, 9:54 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Rik van Riel, (Tue Feb 19, 3:02 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Paul Jackson, (Tue Feb 19, 4:18 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Paul Jackson, (Tue Feb 19, 4:43 pm)
[PATCH 0/8][for -mm] mem_notify v6, Jonathan Corbet, (Mon Feb 11, 11:36 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Mon Feb 11, 11:46 am)
Re: [PATCH 0/8][for -mm] mem_notify v6, Jon Masters, (Sat Feb 9, 12:02 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Sat Feb 9, 12:33 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, Rik van Riel, (Sat Feb 9, 12:43 pm)
Re: [PATCH 0/8][for -mm] mem_notify v6, KOSAKI Motohiro, (Sat Feb 9, 12:49 pm)