Re: [discuss] memrlimit - potential applications that can use

From: Balbir Singh
Date: Tuesday, August 19, 2008 - 12:18 am

After having discussed memrlimit at the container mini-summit, I've been
investigating potential users of memrlimits. Here are the use cases that I have
so far.

1. To provide a soft landing mechanism for applications that exceed their memory
limit. Currently in the memory resource controller, we swap and on failure OOM.
2. To provide a mechanism similar to memory overcommit for control groups.
Overcommit has finer accounting; we just account for virtual address space usage.
3. Vserver will directly be able to port over on top of memrlimit (their address
space limitation feature)

The case against 1 has been that applications do not tolerate malloc failure.
That does not imply that applications should not have the capability, or will
never be allowed the flexibility, of doing so.

Other users of memory limits I found are

1. php - through php.ini allows setting of maximum memory limit
2. Apache - supports setting of memory limits for child processes (RLimitMEM
Directive)
3. Java/KVM all take hints about the maximum memory to be used by the application
4. google.com/codesearch for RLIMIT_AS will show up a big list of applications
that use memory limits.
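
For readers unfamiliar with RLIMIT_AS, here is a minimal sketch of the per-process behaviour these applications rely on: a child process lowers its own address-space limit and then sees an allocation refused instead of being OOM killed. It uses Python's standard `resource` module. The helper name `try_allocation` and the sizes are illustrative assumptions, not anything from the thread, and the sketch assumes a Linux system where RLIMIT_AS is enforced.

```python
import subprocess
import sys

# Code run in a throwaway child so the limit never affects the caller.
_CHILD = """
import resource, sys
limit, request = int(sys.argv[1]), int(sys.argv[2])
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)      # the soft limit cannot exceed the hard one
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
try:
    buf = bytearray(request)      # a large request needs new address space
    print("allocated")
except MemoryError:               # the underlying allocation failed
    print("refused")
"""


def try_allocation(limit_bytes, request_bytes):
    """Return 'allocated' or 'refused' for a request made under RLIMIT_AS."""
    result = subprocess.run(
        [sys.executable, "-c", _CHILD, str(limit_bytes), str(request_bytes)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling `try_allocation(1 << 30, 2 << 30)` asks for 2 GiB under a 1 GiB limit. The point of the sketch is that the failure arrives synchronously, at the allocation site, where the program can still react.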

With this background, I propose that we need a mechanism of providing a memory
overcommit feature for cgroups. The options are:

1. We keep memrlimit and use it. It's very flexible, but on the down side it
does simple total_vm based accounting and provides functionality similar to
RLIMIT_AS for control groups.
2. We port the overcommit feature (Andrea did post patches for this), it's
harder to implement, but provides functionality similar to what exists for
overcommit.
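
To make option 1 concrete, the following toy model sketches what "simple total_vm based accounting" for a group could look like: every mapping is charged against one shared counter and refused once the group limit would be exceeded. This is an illustration in Python, not the memrlimit patch itself; the class and method names are assumptions made for the example.

```python
class AddressSpaceGroup:
    """Toy model of a group-wide address-space limit (not kernel code)."""

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.total_vm = 0        # mapped bytes summed over all member tasks
        self.per_task = {}       # bytes currently charged to each task

    def charge(self, task, nbytes):
        """Account a new mapping; refuse it if the group would exceed its limit."""
        if self.total_vm + nbytes > self.limit:
            return False         # a real controller would fail the call with -ENOMEM
        self.total_vm += nbytes
        self.per_task[task] = self.per_task.get(task, 0) + nbytes
        return True

    def uncharge(self, task, nbytes):
        """Account an unmapping; never release more than the task holds."""
        held = self.per_task.get(task, 0)
        freed = min(nbytes, held)
        self.per_task[task] = held - freed
        self.total_vm -= freed
```

Note that the model counts address space, not resident pages, which is exactly the coarseness described as the down side of option 1.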


Comments?

-- 
	Warm Regards,
	Balbir
--

From: Dave Hansen
Date: Tuesday, August 19, 2008 - 8:58 am

Balbir,

This all seems like a little bit too much hand waving to me.  I don't
really see a single concrete user in the "potential applications" here.
I really don't understand why you're pushing this so hard if you don't
have anyone to actually use it.

I just don't see anyone that *needs* it.  There's a lot of "it would be
nice", but no "needs".

-- Dave

--

From: Balbir Singh
Date: Tuesday, August 19, 2008 - 9:45 am

Dave, there is no hand waving, just an honest discussion. Although you may not
see it in the background, we still need overcommit protection and we have it
enabled by default for the system. There are applications that can deal with the
constraints set up by the administrator and the constraints of the environment.

If you see the original email I sent, I mentioned that we need
overcommit support (either via memrlimit or by porting over the overcommit
feature), and the exploiters you are looking for are the same as the ones who need
overcommit and RLIMIT_AS support.

On the memory overcommit front, please see PostgreSQL Server Administrator's
Guide at
http://www.network-theory.co.uk/docs/postgresql/vol3/LinuxMemoryOvercommit.html

The guide discusses turning off memory overcommit so that the database is never
OOM killed. How do we provide these guarantees for a particular control group?
We can do it system wide, but ideally we want the control point to be per
control group.
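
As background for the system-wide case: with strict overcommit (vm.overcommit_memory = 2), the kernel refuses new commitments beyond a commit limit derived from RAM, swap and vm.overcommit_ratio. The sketch below computes that limit in Python. It deliberately ignores the small reserves the kernel keeps back, and the function names are illustrative assumptions.

```python
def commit_limit(ram_bytes, swap_bytes, overcommit_ratio=50, hugetlb_bytes=0):
    """Approximate CommitLimit under strict overcommit accounting.

    overcommit_ratio is a percentage of RAM (the kernel default is 50);
    memory set aside for huge pages is excluded before the ratio is applied.
    """
    return (ram_bytes - hugetlb_bytes) * overcommit_ratio // 100 + swap_bytes


def would_commit(committed_bytes, request_bytes, limit_bytes):
    """True if a new request still fits under the commit limit."""
    return committed_bytes + request_bytes <= limit_bytes
```

For example, a machine with 8 GiB of RAM, 2 GiB of swap and the default ratio gets a limit of 6 GiB. A per-control-group variant would need the same kind of check against a per-group limit, which is the gap this thread is discussing.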

As far as other users are concerned, I've listed users of the memory limit
feature, in the original email I sent out. To try and understand your viewpoint
better, could you please tell me if

1. You are opposed to overcommit and RLIMIT_AS as features

OR

2. You are opposed to expanding them to control groups

-- 
	Balbir
--

From: Dave Hansen
Date: Tuesday, August 19, 2008 - 10:41 am

OK, let's get back to describing the basic problem here.  What is the
basic problem being solved?  Applications basically want to get a
failure back from malloc() when the machine is (nearly?) out of memory
so they can stop consuming?

Is this the only way to do autonomic computing with memory?  Or, are
there other or better approaches?

Surely an autonomic computing app could keep track of its own memory

Heh.  That suggestion is, at best, working around a kernel bug.  The DB
guys are just saying to do that because they're the biggest memory users
and always seem to get OOM killed first.

The base problem here is the OOM killer, not an application that truly

I think that too many of the users of (1) probably fall into the
PostgreSQL category.  They found that turning it on "fixed" their bugs,
but it really just swept them under the rug.

So, before we expand the use of those features to control groups by
adding a bunch of new code, let's make sure that there will be users for
it and that those users have no better way of doing it.

-- Dave

--

From: Balbir Singh
Date: Wednesday, August 20, 2008 - 1:26 am

Yes, an application does know its memory footprint, but does it know how it is
supposed to consume resources in the system? Consider a linear algebra package
trying to do a multiplication of 1 million x 1 million rows. Depending on how
many resources it is allowed to consume, it could do so in one shot or, if there
was a restriction, it could multiply smaller matrices and then collate results.
The application wants to stretch itself (memory footprint) for performance, but
at the same time does not want to get killed because

1. Other applications came in and caused an OOM
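
The matrix example can be sketched as follows: given a memory budget, pick the largest tile that fits and multiply tile by tile, collating partial results. This Python sketch assumes 8-byte elements and that a blocked multiply keeps three square tiles resident at once (one each from A, B and the result); `tile_size` and `blocked_matmul` are hypothetical names, and the in-memory lists merely stand in for data that a real package would stream from disk.

```python
from math import isqrt


def tile_size(budget_bytes, n, itemsize=8):
    """Largest tile edge b such that three b x b tiles fit in budget_bytes."""
    b = isqrt(budget_bytes // (3 * itemsize))
    return max(1, min(n, b))


def blocked_matmul(a, b, tile):
    """Multiply square matrices a and b one tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Only the three tiles addressed below are needed at this step.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

With a budget large enough for the whole problem, `tile_size` returns `n` and the multiply happens in one shot; with a smaller budget the same code runs in smaller pieces. What the application lacks today is a reliable way to learn that budget.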

No, it is not a kernel BUG. Agreed, the database is using a lot of memory,
but how can it predict what else will run on the system? Why is it bad for a
database, for the sake of data integrity, to ensure that it does not get OOM
killed and thus make sure memory is never overcommitted? Yes, you need
performance, so the application expands its footprint, but at the same time,
the stretching should not cause it to be killed. How would you propose to solve


I am all ears to better ways of doing it. Are you suggesting that overcommit was
added even though we don't actually need it?

-- 
	Balbir
--

From: Dave Hansen
Date: Wednesday, August 20, 2008 - 9:29 am

So, in (2) it deserves to be oom'd.

If other applications came in and caused the oom, then we do
have /proc/$pid/oom_adj to help out.  That's a much better tunable than

I think that we're tying OOM'ing and overcommit a little too close
together here.  It's not like you can't have OOMs when strict overcommit
is being observed.

There are lots of other ways to lock memory down, and any one of those
can also cause an oom.

Yes, userspace mapped memory is usually the largest single consumer, but
the problem space is well beyond overcommit control.  Agreed?  Just look
at why beancounters were implemented and track things far beyond

It serves a purpose, certainly.  We have better ways of doing it
now, though.  "So, before we expand the use of those features to
control groups by adding a bunch of new code, let's make sure that there
will be users for it and that those users have no better way of doing
it."

The one concrete user that's been offered so far is postgres.  I've
suggested something that I hope will be more effective than enforcing
overcommit.  

-- Dave

--

From: Balbir Singh
Date: Wednesday, August 20, 2008 - 8:25 pm

Not really, how does an application know how to trade-off between maximum

And oom_adj is not a hack? What if several memory hungry applications striving

The other way of locking memory down is mlock(), which by default is limited on

I've looked at http://wiki.openvz.org/User_pages_accounting and it states

"Account a part of memory on mmap/brk and reject there, and account the rest of
the memory in page fault handlers without any rejects."
    This type of accounting is used in UBC.

I looked through the code in mm/mmap.c for beancounters, ub_memory_charge() is
called from almost the same places that the memrlimit controller does
accounting. Please see their git tree at git.openvz.org. My understanding of the
code is that only the private vm and locked vm pages are charged in that
implementation. Agreed, they have additional finer accounting of kernel data


Is your suggestion beancounters?

-- 
	Balbir

--

From: KAMEZAWA Hiroyuki
Date: Thursday, August 21, 2008 - 12:43 am

On Thu, 21 Aug 2008 08:55:52 +0530

I'm sorry, I missed the point. My concern with memrlimit (for overcommitting) is
that it's not fair, because an application which gets -ENOMEM at mmap() is just
unlucky. I think it's better to trigger some notifier to an application or daemon
rather than return -ENOMEM at mmap(). A notification like "Oh, it seems the total
VSZ of the applications exceeds the limit you set. Although you can continue your
operation, it's recommended that you should fix up the situation"
would be good.
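
The notification idea can be sketched as a variant of group accounting that never refuses a mapping but raises an event once the limit is crossed. This is only an illustration in Python: the class name, the callback signature and the idea of delivering the event through a callback are all assumptions, since the message does not specify a mechanism.

```python
class NotifyingGroup:
    """Toy model: admit every mapping, but report when the group is over limit."""

    def __init__(self, limit_bytes, notify):
        self.limit = limit_bytes
        self.total_vm = 0
        self.notify = notify     # hypothetical callback standing in for a daemon

    def charge(self, task, nbytes):
        """Always admit the mapping; raise an event when over the limit."""
        self.total_vm += nbytes
        if self.total_vm > self.limit:
            # Instead of -ENOMEM, tell someone who can fix up the situation.
            self.notify(task, self.total_vm, self.limit)
        return True
```

The trade-off is visible in the model: no task is singled out for failure, but nothing stops the group from continuing to grow if the listener does not act.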

Thanks,
-Kame

--

From: Balbir Singh
Date: Thursday, August 21, 2008 - 3:26 am

It can happen today with overcommit turned on. Why is it unlucky?


So you are suggesting that when we are running out of memory (as defined by our
current resource constraints), we don't return -ENOMEM, but instead we now
handle a new event that states that we are running out of memory?

NOTE: I am not opposed to the event, it can be useful for container
administrators to know how to size their containers, not to application
developers who want to auto-tune their applications (see my comment on autonomic
computing in an earlier thread) or to applications that want to make sure they
don't OOM without the system administrator having to do oom_adj for every


-- 
	Balbir

--

From: KAMEZAWA Hiroyuki
Date: Thursday, August 21, 2008 - 3:59 am

On Thu, 21 Aug 2008 15:56:41 +0530
Today's overcommit is also unlucky ;) 

For example, processes A and B are under a memrlimit.
 Process A has no memory leak; it often calls malloc() and free().
 Process B leaks memory, 100MB per night.

Process A cannot do anything when it notices malloc() returns NULL.
It controls its memory usage perfectly. It is unlucky and will die.
Process B can use up the VSZ which is freed by process A.
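
The scenario above can be played out with a few lines of Python. The sketch assumes a simple nightly cycle (A frees its working set, B leaks, A asks for its working set back) and a single shared limit in MB; the function name and the cycle itself are assumptions made for illustration.

```python
def first_night_a_fails(limit_mb, a_working_set_mb, b_leak_mb_per_night=100):
    """Return the first night on which well-behaved process A is refused memory."""
    if b_leak_mb_per_night <= 0 or a_working_set_mb <= 0:
        raise ValueError("both sizes must be positive")
    b_vm = 0
    night = 0
    while True:
        night += 1
        # Overnight A has freed its working set, so B's leak takes the space,
        # up to whatever the group limit still allows.
        b_vm = min(limit_mb, b_vm + b_leak_mb_per_night)
        # In the morning A asks for its usual working set again.
        if b_vm + a_working_set_mb > limit_mb:
            return night         # A's malloc() returns NULL, not B's
```

With a 1000MB limit and a 300MB working set for A, A is first refused on night 8, even though B is the process that misbehaves.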

(The OOM killer, though disliked by everyone, has some kind of fairness.)
Not "running out of memory". Just "VSZ is over the limit you set/expected".

My point is that an application which can handle NULL returned by malloc() is
not very common, I think.

Sorry for noise.

Thanks,


--

From: Balbir Singh
Date: Thursday, August 21, 2008 - 4:13 am

Yes, true, that will happen. Why will A die because it sees NULL? Yes, many
applications do die, but that is not how malloc == NULL is expected to be
handled. If that is a concern, do not use any memrlimits for A and B; if you do,
you will find the bug early.

Now consider the other scenario, if there really is a memory leak and process B
is using all that memory, two things to consider

1. Without swap controller, B will start swapping out A's memory and cause
excessive swapping and performance loss
2. With swap controller enabled, at some point we will hit the swap limit, what

Yes, and that's why we have the flexibility: if the application can't deal with
it, don't set memrlimits for those applications :)

-- 
	Balbir

--

From: righi.andrea
Date: Thursday, August 21, 2008 - 8:18 am

-ENOMEM should be considered by applications like "try again" (maybe
-EAGAIN would be more appropriate). When the notification of the
out-of-virtual-memory event occurs the dedicated userspace daemon can
do ehm... something... to resolve the situation. Just like the OOM
handling in userspace. Similar issues, but a common solution could
resolve both problems.
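
Treating an allocation failure as "try again" might look like the following sketch, in which the application backs off while some other agent (the userspace daemon mentioned above) resolves the situation. It is written in Python, where a failed allocation surfaces as MemoryError; the function name, the attempt count and the linear back-off are assumptions for illustration.

```python
import time


def allocate_with_retry(allocate, attempts=3, delay_s=0.1):
    """Call allocate(); on MemoryError wait and retry, up to `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            return allocate()
        except MemoryError:
            if attempt == attempts:
                raise                      # give up and let the caller degrade or exit
            time.sleep(delay_s * attempt)  # back off while memory is being freed
```

A caller would wrap only the allocations it can afford to delay, for example `allocate_with_retry(lambda: bytearray(64 << 20))`.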

-Andrea
--

From: righi.andrea
Date: Wednesday, August 20, 2008 - 6:25 am

Hi Dave,

IMHO there're two different problems, and both should be considered by
the kernel system wide as well as for each cgroup:

1) how to prevent OOM conditions
2) how to handle OOM conditions

The perfect solution for 2) doesn't exist IMHO, because there's no
clean way from the application's point of view to handle such a critical
condition post facto.

Containing the OOM within a cgroup is surely a great improvement, but
there's always the risk to kill the wrong applications (within the
cgroup). Another good improvement would be to handle the OOM condition
in userspace; Balbir is working on/discussing/planning something about
this, if I remember well.

An interesting solution, proposed in the past, was to send a special
signal to userspace apps to free up caches/buffers/unused mem when the
whole memory in the system goes under a critical threshold. But this
would require an active support by all the userspace applications,
that should implement the signal handler in a proper way. Maybe this
could be even considered a special case of the userspace OOM handling.

Memory overcommit protection, instead, is a way to *prevent* OOM
conditions (problem 1). This approach is safer for critical
applications that have a chance to cleanly handle the OOM at the time
they're requesting memory from the kernel, instead of receiving a
SIGKILL (or whatever signal) asynchronously during the execution path.
Unfortunately, this kind of prevention is not always acceptable,
because, in this case, userspace apps must request virtual memory
carefully, otherwise it would be quite easy to create memory DoS for
other applications (and probably the per-application/per-cgroup
RLIMIT_AS could help here).

As an example, an ideal solution I'd like to implement for a generic
enterprise environment is to create all the critical apps inside a
cgroup with never-overcommit memory policy and move all the other
userspace apps in another cgroup with oom-killer enabled. But for this
we need both 1) and 2) ...

--

From: Dave Hansen
Date: Wednesday, August 20, 2008 - 9:38 am

I completely disagree. :)

Think of all the work Eric Biederman did on pid namespaces.  One of his
motivations was to keep /proc from being able to pin task structs.  That
is one great example of a way a process can pin lots of memory without
mapping it, and overcommit has no effect on this!

Eric had a couple of other good examples, but I think task structs were
the biggest.

As I said to Balbir, there probably are some large-scale solutions to
this: things like beancounters.  

-- Dave

--
