After having discussed memrlimit at the container mini-summit, I've been investigating potential users of memrlimits. Here are the use cases that I have so far.

1. To provide a soft landing mechanism for applications that exceed their memory limit. Currently, in the memory resource controller, we swap and, on failure, OOM.
2. To provide a mechanism similar to memory overcommit for control groups. Overcommit has finer accounting; we just account for virtual address space usage.
3. Vserver will be able to port directly on top of memrlimit (their address space limitation feature).

The case against 1 has been that applications do not tolerate malloc failure. That does not imply that applications should not have the capability, or will never be allowed the flexibility, of doing so.

Other users of memory limits I found are

1. php - php.ini allows setting a maximum memory limit
2. Apache - supports setting memory limits for child processes (RLimitMEM directive)
3. Java/KVM all take hints about the maximum memory to be used by the application
4. A search on google.com/codesearch for RLIMIT_AS will show a big list of applications that use memory limits.

With this background, I propose that we need a mechanism for providing a memory overcommit feature for cgroups. The options are

1. We keep memrlimit and use it. It's very flexible, but on the downside it does simple total_vm based accounting and provides functionality similar to RLIMIT_AS for control groups.
2. We port the overcommit feature (Andrea did post patches for this). It's harder to implement, but provides functionality similar to what exists for overcommit.

Comments?

-- Warm Regards, Balbir --
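To make the "soft landing" in use case 1 concrete, here is a minimal sketch of the per-process analogue that exists today: with RLIMIT_AS set, an allocation that would exceed the address space limit makes malloc() return NULL instead of the process being OOM killed later. The helper name alloc_succeeds_under_limit() is made up for illustration, and the fork() is only there so that the lowered limit does not affect the caller. memrlimit is described above as providing similar functionality for a control group rather than a single process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run malloc(request_bytes) in a child whose soft
 * RLIMIT_AS is limit_bytes.
 * Returns 1 if the allocation succeeded, 0 if malloc() returned NULL,
 * and -1 if the experiment itself could not be run.
 */
static int alloc_succeeds_under_limit(rlim_t limit_bytes, size_t request_bytes)
{
    pid_t pid;
    int status;

    assert(request_bytes > 0);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit rl;
        void *p;

        if (getrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);
        /* Lower only the soft limit; never ask for more than the hard limit. */
        if (rl.rlim_max != RLIM_INFINITY && limit_bytes > rl.rlim_max)
            rl.rlim_cur = rl.rlim_max;
        else
            rl.rlim_cur = limit_bytes;
        if (setrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);

        p = malloc(request_bytes);
        if (p == NULL)
            _exit(1);               /* soft landing: the failure is visible */
        ((volatile char *)p)[0] = 1; /* touch it so the call is not elided */
        free(p);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:
        return 1;
    case 1:
        return 0;
    default:
        return -1;
    }
}
```

For example, with a 256 MB limit a 1 MB request is expected to succeed while a 1 GB request is expected to fail with NULL, which the application can then handle.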
Balbir,

This all seems like a little bit too much hand waving to me. I don't really see a single concrete user in the "potential applications" here. I really don't understand why you're pushing this so hard if you don't have anyone to actually use it.

I just don't see anyone that *needs* it. There's a lot of "it would be nice", but no "needs".

-- Dave --
Dave, there is no hand waving, just an honest discussion. Although you may not see it in the background, we still need overcommit protection, and we have it enabled by default for the system. There are applications that can deal with the constraints set up by the administrator and the constraints of the environment.

If you look at the original email I sent, I mentioned that we need overcommit support (either via memrlimit or by porting over the overcommit feature), and the exploiters you are looking for are the same as the ones who need overcommit and RLIMIT_AS support.

On the memory overcommit front, please see the PostgreSQL Server Administrator's Guide at http://www.network-theory.co.uk/docs/postgresql/vol3/LinuxMemoryOvercommit.html

The guide discusses turning off memory overcommit so that the database is never OOM killed. How do we provide these guarantees for a particular control group? We can do it system wide, but ideally we want the control point to be per control group.

As far as other users are concerned, I've listed users of the memory limit feature in the original email I sent out.

To try and understand your viewpoint better, could you please tell me if

1. You are opposed to overcommit and RLIMIT_AS as features, OR
2. You are opposed to expanding them to control groups

-- Balbir --
OK, let's get back to describing the basic problem here. What is the basic problem being solved? Applications basically want to get a failure back from malloc() when the machine is (nearly?) out of memory so they can stop consuming?

Is this the only way to do autonomic computing with memory? Or, are there other or better approaches? Surely an autonomic computing app could keep track of its own memory

Heh. That suggestion is, at best, working around a kernel bug. The DB guys are just saying to do that because they're the biggest memory users and always seem to get OOM killed first. The base problem here is the OOM killer, not an application that truly

I think that too many of the users of (1) probably fall into the PostgreSQL category. They found that turning it on "fixed" their bugs, but it really just swept them under the rug.

So, before we expand the use of those features to control groups by adding a bunch of new code, let's make sure that there will be users for it and that those users have no better way of doing it.

-- Dave --
Yes, an application does know its memory footprint, but does it know how it is supposed to consume resources in the system? Consider a linear algebra package trying to do a multiplication of 1 million x 1 million rows. Depending on how many resources it is allowed to consume, it could do so in one shot or, if there was a restriction, it could multiply smaller matrices and then collate the results. The application wants to stretch itself (memory footprint) for performance, but at the same time does not want to get killed because

1. Other applications came in and caused an OOM

No, it is not a kernel BUG. Agreed that the database is using a lot of memory, but how can it predict what else will run on the system? Why is it bad for a database, for the sake of data integrity, to ensure that it does not get OOM killed and thus make sure memory is never overcommitted?

Yes, you need performance, so the application expands its footprint, but at the same time, the stretching should not cause it to be killed. How would you propose to solve

I am all ears to better ways of doing it. Are you suggesting that overcommit was added even though we don't actually need it?

-- Balbir --
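The linear algebra example can be sketched roughly as follows: try to get scratch space for the whole problem, and if malloc() fails (as it can under RLIMIT_AS or strict overcommit), halve the tile edge and multiply tile by tile instead. This is only an illustration under stated assumptions; the names alloc_tile() and matmul_adaptive() and the start_edge knob are hypothetical, and a real package would use a tuned BLAS-style kernel rather than these plain loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate the largest square scratch tile we are
 * allowed to have. Starts at start_edge x start_edge doubles and halves
 * the edge after every malloc() failure.
 */
static double *alloc_tile(size_t start_edge, size_t *edge_out)
{
    for (size_t edge = start_edge; edge >= 1; edge /= 2) {
        double *buf;

        /* Skip sizes whose byte count would overflow size_t. */
        if (edge > SIZE_MAX / sizeof(double) / edge)
            continue;
        buf = malloc(edge * edge * sizeof *buf);
        if (buf != NULL) {
            *edge_out = edge;
            return buf;
        }
    }
    return NULL;
}

/*
 * c = a * b for row-major n x n matrices. start_edge caps the first tile
 * size that is attempted (0 means "try the whole matrix in one shot").
 * Returns 0 on success, -1 if not even a 1 x 1 tile could be allocated.
 */
static int matmul_adaptive(const double *a, const double *b, double *c,
                           size_t n, size_t start_edge)
{
    size_t edge;
    double *tile;

    assert(n > 0);
    if (start_edge == 0 || start_edge > n)
        start_edge = n;

    tile = alloc_tile(start_edge, &edge);
    if (tile == NULL)
        return -1;

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t kk = 0; kk < n; kk += edge) {
        size_t kmax = (kk + edge < n) ? kk + edge : n;

        for (size_t jj = 0; jj < n; jj += edge) {
            size_t jmax = (jj + edge < n) ? jj + edge : n;

            /* Copy one tile of b into the scratch buffer. */
            for (size_t k = kk; k < kmax; k++)
                for (size_t j = jj; j < jmax; j++)
                    tile[(k - kk) * edge + (j - jj)] = b[k * n + j];

            /* Accumulate this tile's contribution into c ("collate"). */
            for (size_t i = 0; i < n; i++)
                for (size_t k = kk; k < kmax; k++)
                    for (size_t j = jj; j < jmax; j++)
                        c[i * n + j] += a[i * n + k] *
                                        tile[(k - kk) * edge + (j - jj)];
        }
    }

    free(tile);
    return 0;
}
```

The result is the same whichever tile size ends up being used; only the memory footprint and the performance differ, which is the trade-off described above.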
So, in (2) it deserves to be oom'd. If other applications came in and caused the oom, then we do have /proc/$pid/oom_adj to help out. That's a much better tunable than

I think that we're tying OOM'ing and overcommit a little too close together here. It's not like you can't have OOMs when strict overcommit is being observed. There are lots of other ways to lock memory down, and any one of those can also cause an oom. Yes, userspace mapped memory is usually the largest single consumer, but the problem space is well beyond overcommit control. Agreed?

Just look at why beancounters were implemented and track things far beyond

It serves a purpose, certainly. We have better ways of doing it now, though.

"So, before we expand the use of those features to control groups by adding a bunch of new code, let's make sure that there will be users for it and that those users have no better way of doing it."

The one concrete user that's been offered so far is postgres. I've suggested something that I hope will be more effective than enforcing overcommit.

-- Dave --
Not really, how does an application know how to trade off between maximum

And oom_adj is not a hack? What if several memory hungry applications striving

The other way of locking memory down is mlock(), which by default is limited on

I've looked at http://wiki.openvz.org/User_pages_accounting and it states

"Account a part of memory on mmap/brk and reject there, and account the rest of the memory in page fault handlers without any rejects."

This type of accounting is used in UBC. I looked through the code in mm/mmap.c for beancounters; ub_memory_charge() is called from almost the same places where the memrlimit controller does accounting. Please see their git tree at git.openvz.org. My understanding of the code is that only the private vm and locked vm pages are charged in that implementation. Agreed, they have additional finer accounting of kernel data

Is your suggestion beancounters?

-- Balbir --
On Thu, 21 Aug 2008 08:55:52 +0530

I'm sorry I miss the point. My concern with memrlimit (for overcommitting) is that it's not fair, because an application which gets -ENOMEM at mmap() is just someone unlucky. I think it's better to trigger some notifier to the application or a daemon rather than return -ENOMEM at mmap().

A notification like "Oh, it seems the total VSZ of the applications exceeds the limit you set. Although you can continue your operation, it's recommended that you fix the situation" would be good.

Thanks,
-Kame --
It can happen today with overcommit turned on. Why is it unlucky?

So you are suggesting that when we are running out of memory (as defined by our current resource constraints), we don't return -ENOMEM, but instead we now handle a new event that states that we are running out of memory?

NOTE: I am not opposed to the event. It can be useful for container administrators to know how to size their containers, not to application developers who want to auto-tune their applications (see my comment on autonomic computing in an earlier thread) or to applications that want to make sure they don't OOM without the system administrator having to do oom_adj for every

-- Balbir --
On Thu, 21 Aug 2008 15:56:41 +0530

Today's overcommit is also unlucky ;)

For example, processes A and B are under a memrlimit.

Process A has no memory leak; it often calls malloc() and free().
Process B leaks memory, 100MB per night.

Process A cannot do anything when it notices that malloc() returns NULL. It controls its memory usage perfectly, yet it is unlucky and will die. Process B can use up the VSZ which is freed by process A.

(The OOM killer, though disliked by everyone, has some kind of fairness.)

Not "running out of memory", just "VSZ is over the limit you set/expected".

My point is that applications which can handle NULL returned by malloc() are not very common, I think.

Sorry for the noise.

Thanks, --
Yes, true, that will happen. Why will A die because it sees NULL? Yes, many applications do die, but that is not how malloc == NULL is expected to be handled. If that is a concern, do not use any memrlimits for A and B; if you do, you will find the bug early.

Now consider the other scenario. If there really is a memory leak and process B is using all that memory, there are two things to consider

1. Without the swap controller, B will start swapping out A's memory and cause excessive swapping and performance loss
2. With the swap controller enabled, at some point we will hit the swap limit, what

Yes, and that's why we have the flexibility. If the application can't deal with it, don't set memrlimits for those applications :)

-- Balbir --
-ENOMEM should be considered by applications like "try again" (maybe -EAGAIN would be more appropriate). When the notification of the out-of-virtual-memory event occurs the dedicated userspace daemon can do ehm... something... to resolve the situation. Just like the OOM handling in userspace. Similar issues, but a common solution could resolve both problems. -Andrea --
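On the application side, treating an allocation failure as "try again" could look roughly like the sketch below: retry a bounded number of times with a short pause, giving a userspace daemon a chance to resolve the situation in between. The name malloc_retry() and its parameters are hypothetical, and the notification mechanism itself is not shown since it is only a proposal in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper: call malloc(size) up to 'attempts' times, sleeping
 * delay_ms milliseconds between failed attempts. Returns NULL only if
 * every attempt failed, so the caller still has to handle that case.
 */
static void *malloc_retry(size_t size, int attempts, long delay_ms)
{
    assert(attempts > 0 && delay_ms >= 0);

    for (int i = 0; i < attempts; i++) {
        void *p = malloc(size);

        if (p != NULL)
            return p;

        if (i + 1 < attempts) {
            struct timespec ts;

            ts.tv_sec = delay_ms / 1000;
            ts.tv_nsec = (delay_ms % 1000) * 1000000L;
            nanosleep(&ts, NULL); /* give the daemon time to act */
        }
    }
    return NULL;
}
```

The retry count and delay are policy choices; an application that cannot make progress without the memory still needs a fallback path when NULL is finally returned.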
Hi Dave,

IMHO there are two different problems, and both should be considered by the kernel system wide as well as for each cgroup:

1) how to prevent OOM conditions
2) how to handle OOM conditions

The perfect solution for 2) doesn't exist IMHO, because there's no clean way, from the applications' point of view, to handle such a critical condition post-facto. Containing the OOM within a cgroup is surely a great improvement, but there's always the risk of killing the wrong applications (within the cgroup). Another good improvement would be to handle the OOM condition in userspace; Balbir is working on/discussing/planning something about this, if I remember well.

An interesting solution, proposed in the past, was to send a special signal to userspace apps to free up caches/buffers/unused memory when the free memory in the system goes under a critical threshold. But this would require active support from all the userspace applications, which would have to implement the signal handler in a proper way. Maybe this could even be considered a special case of userspace OOM handling.

Memory overcommit protection, instead, is a way to *prevent* OOM conditions (problem 1). This approach is safer for critical applications, which have a chance to cleanly handle the OOM at the time they're requesting memory from the kernel, instead of receiving a SIGKILL (or whatever signal) asynchronously during the execution path. Unfortunately, this kind of prevention is not always acceptable because, in this case, userspace apps must request virtual memory carefully, otherwise it would be quite easy to create a memory DoS for other applications (and probably the per-application/per-cgroup RLIMIT_AS could help here).

As an example, an ideal solution I'd like to implement for a generic enterprise environment is to create all the critical apps inside a cgroup with a never-overcommit memory policy and move all the other userspace apps into another cgroup with the oom-killer enabled. But for this we need both 1) and 2) ...
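The "special signal" idea would need cooperation from each application, roughly along the lines of the sketch below. No such dedicated signal is assumed to exist here: SIGUSR1 stands in for it (LOW_MEMORY_SIGNAL is a made-up name), and the single-buffer "cache" is a placeholder for whatever an application could actually give back. The handler only sets a flag; the freeing happens later in normal context, because free() is not async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption: SIGUSR1 stands in for the proposed low-memory signal. */
#define LOW_MEMORY_SIGNAL SIGUSR1

static volatile sig_atomic_t low_memory_pending;
static void *cache_buf;   /* placeholder for an application cache */
static size_t cache_len;

/* Signal handler: only record the request. */
static void on_low_memory(int signo)
{
    (void)signo;
    low_memory_pending = 1;
}

static int install_low_memory_handler(void)
{
    struct sigaction sa = {0};

    sa.sa_handler = on_low_memory;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(LOW_MEMORY_SIGNAL, &sa, NULL);
}

/* (Re)size the placeholder cache. Returns 0 on success, -1 on failure. */
static int cache_fill(size_t len)
{
    void *p;

    assert(len > 0);
    p = realloc(cache_buf, len);
    if (p == NULL)
        return -1;
    cache_buf = p;
    cache_len = len;
    return 0;
}

/*
 * Called from the application's main loop. If a low-memory signal has
 * arrived since the last call, drop the cache and return the number of
 * bytes released; otherwise return 0.
 */
static size_t cache_shrink_if_requested(void)
{
    size_t freed = 0;

    if (low_memory_pending) {
        low_memory_pending = 0;
        freed = cache_len;
        free(cache_buf);
        cache_buf = NULL;
        cache_len = 0;
    }
    return freed;
}
```

An application would call install_low_memory_handler() once at startup and cache_shrink_if_requested() at a convenient point in its main loop. As noted above, this only helps if the applications that hold the memory actually implement such a handler.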
I completely disagree. :)

Think of all the work Eric Biederman did on pid namespaces. One of his motivations was to keep /proc from being able to pin task structs. That is one great example of a way a process can pin lots of memory without mapping it, and overcommit has no effect on this!

Eric had a couple of other good examples, but I think task structs were the biggest.

As I said to Balbir, there probably are some large-scale solutions to this: things like beancounters.

-- Dave --
