This patchset adds RSS accounting and control, and limits on the number of tasks and files within a container. It is based on top of Paul Menage's container subsystem v7.

The RSS controller includes a per-container RSS accounter, reclamation and an OOM killer. It behaves like a standalone machine: when a container runs out of resources it tries to reclaim some pages, and if that does not succeed it kills some task whose mm_struct belongs to the container in question.

The num tasks and files containers are very simple and self-descriptive from the code.

As discussed before, when a task moves from one container to another no resources follow it - they keep holding the container they were allocated in.

The difficulties met while using Paul's containers were:

1. The container fork hook is placed before the new task changes. This makes it impossible to handle fork properly, i.e. the new mm_struct should have a pointer to the RSS container, but we don't have one at that early time.

2. Extended containers may register themselves too late. Kernel threads/helpers start forking, opening files and touching pages much earlier. This patchset works around this in a not-so-cute manner and I'm waiting for Paul's comments on this issue.
-
Introduce generic structures and routines for resource accounting. Each resource accounting container is supposed to aggregate it, container_subsystem_state and its resource-specific members within.
Is there any way to indicate that there are no limits on this container. LONG_MAX is quite huge, but still when the administrator wants to configure a container to *un-limited usage*, it becomes hard for These bits look a little out of sync, with no users for these routines in this patch. Won't you get a compiler warning, compiling this bit alone? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
Yes - LONG_MAX is essentially a "no limit" value as no

I'm afraid not. We have to atomically check the limit and alter either usage or failcnt depending on the result of the check. Doing this with atomic_xxx ops will require at least two ops. If we remove failcnt this would look like while (atomic_cmpxchg(...)) which is also not that good. Moreover - in the RSS accounting patches I perform page list

Nope - when you have a non-static function without users in a file, no compiler warning is produced.
-
-1 or ~0 is a viable choice for userspace to Linux-VServer does the accounting with atomic counters, so that works quite fine, just do the checks at the beginning of whatever resource allocation and the it still hasn't been shown that this kind of RSS limit doesn't add big time overhead to normal operations (inside and outside of such a resource container) note that the 'usual' memory accounting is much more lightweight and serves similar purposes ... best, -
account it kernel may preempt and let another process

It OOM-kills current in case of limit hit instead of
-
Atomic operations versus locks is only a granularity thing. You still need the cache line, which is the cost on SMP.

Are you using atomic_add_return or atomic_add_unless, or are you performing your actions in two separate steps, which is racy? What I have seen indicates you are using a racy two separate

Perhaps....

Eric
-
yes, this is the current implementation which is more than sufficient, but I'm aware of the potential issues here, and I have an experimental patch sitting here which removes this race with the following change:

 - doesn't store the accounted value but limit - accounted (i.e. the free resource)
 - uses atomic_add_return()
 - when negative, an error is returned and the resource amount is added back

changes to the limit have to adjust the 'current' value too, but that is again simple and atomic

best,
Herbert

PS: atomic_add_unless() didn't exist back then (at least I think so) but that might be an option
-
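A user-space sketch of the scheme Herbert describes above, using C11 <stdatomic.h> in place of the kernel's atomic_add_return(). The names res_free, res_free_charge() and friends are made up for illustration and are not taken from the Linux-VServer patch; the counter stores the free amount (limit - accounted) rather than the usage.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter: holds the free amount, i.e. limit - accounted. */
struct res_free {
    atomic_long free;
};

void res_free_init(struct res_free *r, long limit)
{
    atomic_init(&r->free, limit);
}

/* Returns true on success. On failure the amount is added back. */
bool res_free_charge(struct res_free *r, long amount)
{
    /*
     * atomic_fetch_sub() returns the old value, so old - amount is the
     * new value, which is what atomic_add_return() would hand back.
     */
    long left = atomic_fetch_sub(&r->free, amount) - amount;

    if (left < 0) {
        atomic_fetch_add(&r->free, amount);   /* roll back */
        return false;
    }
    return true;
}

void res_free_uncharge(struct res_free *r, long amount)
{
    atomic_fetch_add(&r->free, amount);
}

/* Changing the limit shifts the free amount by the same delta. */
void res_free_set_limit(struct res_free *r, long old_limit, long new_limit)
{
    atomic_fetch_add(&r->free, new_limit - old_limit);
}
```

One property worth noting: while an over-limit charge is briefly holding the counter negative, a concurrent smaller charge can fail even though it would have fit after the rollback. Whether that is acceptable is a policy question this sketch does not settle.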
I think as far as having this discussion if you can remove that race people will be more willing to talk about what vserver does. That said anything that uses locks or atomic operations (finer grained locks) because of the cache line ping pong is going to have scaling issues on large boxes. So in that sense anything short of per cpu variables sucks at scale. That said I would much rather get a simple correct version without the complexity of per cpu counters, before we optimize the counters that much. Eric -
fully agree with it. We need to get a working version first. FYI, in OVZ we recently added such optimizations: reserves like in TCP/IP, e.g. for kmemsize, numfile these reserves are done on task-basis for fast charges/uncharges w/o involving lock operations. On task exit reserves are returned back to the beancounter. As it demonstrated atomic counters can be replaced with task-reserves on the next step. Thanks, Kirill -
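A rough sketch of the task-reserve idea Kirill mentions: a task pre-charges a batch from the shared counter and then charges and uncharges against its private reserve without touching the shared cache line. The names (bc_sketch, task_reserve, RESERVE_BATCH), the batch size and the refill policy are all assumptions for illustration, not OpenVZ code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RESERVE_BATCH 8L          /* hypothetical refill size */

struct bc_sketch {                /* shared, per-container counter */
    atomic_long held;             /* used + reserved by all tasks */
    long limit;
};

struct task_reserve {             /* private to one task, no locking */
    struct bc_sketch *bc;
    long reserve;                 /* pre-charged but unused */
    long used;                    /* actually in use by the task */
};

/* Slow path: take amount from the shared counter, respecting the limit. */
bool bc_charge(struct bc_sketch *bc, long amount)
{
    long old = atomic_load(&bc->held);

    do {
        if (old + amount > bc->limit)
            return false;
    } while (!atomic_compare_exchange_weak(&bc->held, &old, old + amount));
    return true;
}

void bc_uncharge(struct bc_sketch *bc, long amount)
{
    atomic_fetch_sub(&bc->held, amount);
}

bool task_charge(struct task_reserve *t, long amount)
{
    if (t->reserve < amount) {
        long want = amount - t->reserve;
        long refill = want > RESERVE_BATCH ? want : RESERVE_BATCH;

        /* Try a full batch first, then fall back to the exact need. */
        if (bc_charge(t->bc, refill))
            t->reserve += refill;
        else if (bc_charge(t->bc, want))
            t->reserve += want;
        else
            return false;
    }
    /* Fast path: only task-private fields are touched. */
    t->reserve -= amount;
    t->used += amount;
    return true;
}

void task_uncharge(struct task_reserve *t, long amount)
{
    t->used -= amount;
    t->reserve += amount;         /* kept locally for the next charge */
}

/* On task exit the unused reserve goes back to the shared counter. */
void task_exit_return_reserve(struct task_reserve *t)
{
    bc_uncharge(t->bc, t->reserve);
    t->reserve = 0;
}
```

The trade-off in this sketch is that reserved-but-unused amounts count against the container limit, so the limit is enforced with batch-sized slack per task.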
BTW atomic_add_unless() is essentially a loop!!! Just like spin_lock() is, so why is one better than another? spin_lock() can go to schedule() on preemptive kernels -
well, shouldn't be a big deal to brush that patch up right, but atomic ops have much less impact on most

actually I thought about per cpu counters quite a lot, and we (Linux-VServer) use them for accounting, but please tell me how you use per cpu structures for implementing limits

TIA,
-
Right. But atomic_add_unless() is slower as it is

Did you ever look at how get_empty_filp() works? I agree that this is not a "strict" limit, but it limits the usage with some "precision".

/* off-the-topic */ Herbert, you've lost Balbir again: in this sub-thread, some letters up, Eric wrote a letter with Balbir in Cc:. The next reply from you doesn't include him.
-
If I am not mistaken, you shouldn't loop in normal cases, which means it boils down to an atomic_read() + atomic_cmpxchg() -- Regards, vatsa -
So does the lock - in the normal case (when it's not heavily contended) it will boil down to atomic_dec_and_test(). Nevertheless, making a charge like in this patchset requires two atomic ops with atomic_xxx and only one with spin_lock(). -
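For comparison, a minimal user-space sketch of the lock-based charge being debated in this sub-thread: the limit check and the update of either usage or failcnt happen under one lock. The struct and function names are hypothetical, and a C11 atomic_flag spin loop stands in for the kernel's spin_lock(); this is not the code from the patch.

```c
#include <stdatomic.h>

struct res_counter_sketch {
    atomic_flag lock;             /* toy spinlock, stand-in for spinlock_t */
    unsigned long usage;
    unsigned long limit;
    unsigned long failcnt;
};

void rc_init(struct res_counter_sketch *rc, unsigned long limit)
{
    atomic_flag_clear(&rc->lock);
    rc->usage = 0;
    rc->limit = limit;
    rc->failcnt = 0;
}

static void rc_lock(struct res_counter_sketch *rc)
{
    /* Uncontended case: a single atomic test-and-set. */
    while (atomic_flag_test_and_set_explicit(&rc->lock, memory_order_acquire))
        ;
}

static void rc_unlock(struct res_counter_sketch *rc)
{
    atomic_flag_clear_explicit(&rc->lock, memory_order_release);
}

/* Returns 0 on success, -1 if the charge would exceed the limit. */
int rc_charge(struct res_counter_sketch *rc, unsigned long val)
{
    int ret = 0;

    rc_lock(rc);
    if (val > rc->limit - rc->usage) {
        rc->failcnt++;            /* check and failcnt update are one unit */
        ret = -1;
    } else {
        rc->usage += val;
    }
    rc_unlock(rc);
    return ret;
}

void rc_uncharge(struct res_counter_sketch *rc, unsigned long val)
{
    rc_lock(rc);
    rc->usage -= val;
    rc_unlock(rc);
}
```

The sketch assumes usage never exceeds limit (the limit is not lowered below current usage), so limit - usage cannot wrap.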
To be very clear. If you care about optimization, cache lines and lock hold times (to keep contention down) are the important things. With spin locks you have to be a little more careful to put them on the same cache line as your data and to keep hold times short. With atomic ops you get that automatically. There is really no significant advantage in either approach.

The number of atomic ops doesn't matter. You bring in the cache line and manipulate it. The expensive part is acquiring the cache line exclusively. This is expensive even if things are never contended but there are many users.

Sorry for the rant, but I just wanted to set the record straight. spin_locks vs atomic ops is a largely meaningless debate.

Eric
-
fine, nobody actually uses atomic_add_unless(), or am I missing something? using two locks will be slower than using a single lock, adding a loop which counts from 0 to 100 will I can happily add him to every email I reply to, but he definitely isn't removed by my mailer (as I already stated, it might be the mailing list which does this), fact is, the email arrives here without him in the cc, so a reply does not contain it either ... best, Herbert -
This includes setup of RSS container within generic process containers, all the declarations used in RSS accounting, and core code responsible for accounting.
On Tue, 06 Mar 2007 17:55:29 +0300 ah. This looks good. I'll find a hunk of time to go through this work and through Paul's patches. It'd be good to get both patchsets lined up in -mm within a couple of weeks. But.. We need to decide whether we want to do per-container memory limitation via these data structures, or whether we do it via a physical scan of some software zone, possibly based on Mel's patches. -
doesn't look so good for me, mainly because of the additional per page data and per page processing

on 4GB memory, with 100 guests, 50% shared for each guest, this basically means ~1mio pages, 500k shared and 1500k x sizeof(page_container) entries, which roughly boils down to ~25MB of wasted memory ...

increase the amount of shared pages and it starts

why not do simple page accounting (as done currently in Linux) and use that for the limits, without keeping the reference from container to page?

best,
-
You are. Each page has only one page_container associated with it despite the number of containers it is shared

As I've already answered in my previous letter, simple limiting w/o per-container reclamation and a per-container oom killer isn't good memory management. It doesn't allow resource shortage to be handled gracefully.

This patchset provides a more graceful way to handle this, but full memory management includes accounting of VMA-length as well (returning ENOMEM from system call) but we've decided
-
per container OOM killer does not require any container page reference, you know _what_ tasks belong to the container, and you know their _badness_ from the normal OOM calculations, so doing them for a container is really straight forward without having any page 'tagging' for the reclamation part, please elaborate how that will differ in a (shared memory) guest from what the kernel currently does ... TIA, -
That's true. If you look at the patches you'll This is all described in the code and in the -
so what do we keep the context -> page reference must have missed some of them, please can you point me to the relevant threads ... TIA, -
We need this for 1. keeping page's owner to uncharge to IT when page goes away. Or do you propose to uncharge it to current (i.e. ANY) container like you do all across Vserver accounting which screws up accounting with pages sharing? 2. managing LRU lists for good reclamation. See Balbir's patches for details. 3. possible future uses - correct sharing accounting, -
Herbert, You lost me in the cc list and I almost missed this part of the thread. Could you please not modify the "cc" list. Thanks, Balbir -
hmm, it is very unlikely that this would happen,
for several reasons ... and indeed, checking the
thread in my mailbox shows that akpm dropped you ...
--------------------------------------------------------------------
Subject: [RFC][PATCH 2/7] RSS controller core
From: Pavel Emelianov <xemul@sw.ru>
To: Andrew Morton <akpm@osdl.org>, Paul Menage <menage@google.com>,
Srivatsa Vaddagiri <vatsa@in.ibm.com>,
Balbir Singh <balbir@in.ibm.com>
Cc: containers@lists.osdl.org,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Date: Tue, 06 Mar 2007 17:55:29 +0300
--------------------------------------------------------------------
Subject: Re: [RFC][PATCH 2/7] RSS controller core
From: Andrew Morton <akpm@linux-foundation.org>
To: Pavel Emelianov <xemul@sw.ru>
Cc: Kirill@smtp.osdl.org, Linux@smtp.osdl.org, containers@lists.osdl.org,
Paul Menage <menage@google.com>,
List <linux-kernel@vger.kernel.org>
Date: Tue, 6 Mar 2007 14:00:36 -0800
--------------------------------------------------------------------
I never modify the cc unless explicitly asked
to do so. I wish others would have it that way
too :)
best,
-
That's good to know, but my mailer shows

Andrew Morton <akpm@linux-foundation.org>
to         Pavel Emelianov <xemul@sw.ru>
cc         Paul Menage <menage@google.com>, Srivatsa Vaddagiri <vatsa@in.ibm.com>, Balbir Singh <balbir@in.ibm.com> (see I am <<HERE>>), devel@openvz.org, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, containers@lists.osdl.org, Kirill Korotaev <dev@sw.ru>
date       Mar 7, 2007 3:30 AM
subject    Re: [RFC][PATCH 2/7] RSS controller core
mailed-by  vger.kernel.org

On Tue, 06 Mar 2007 17:55:29 +0300

and your reply as

Andrew Morton <akpm@linux-foundation.org>, Pavel Emelianov <xemul@sw.ru>, Kirill@smtp.osdl.org, Linux@smtp.osdl.org, containers@lists.osdl.org, Paul Menage <menage@google.com>, List <linux-kernel@vger.kernel.org>
to         Andrew Morton <akpm@linux-foundation.org>
cc         Pavel Emelianov <xemul@sw.ru>, Kirill@smtp.osdl.org, Linux@smtp.osdl.org, containers@lists.osdl.org, Paul Menage <menage@google.com>, List <linux-kernel@vger.kernel.org>
date       Mar 9, 2007 10:18 PM
subject    Re: [RFC][PATCH 2/7] RSS controller core
mailed-by  vger.kernel.org

I am not sure what went wrong. Could you please check your mail client, because it seemed to even change email addresses to smtp.osdl.org

Cheers,
Balbir
-
I have a problem doing a group-reply in mutt to Herbert's mails. His email id gets dropped from the To or Cc list. Is that his email setting? Don't know. -- Regards, vatsa -
my mail client is not involved in receiving the emails, so the email I replied to did already miss you in the cc (i.e. I doubt that mutt would hide you from the cc, if it would be present in the mailbox :)

maybe one of the mailing lists is removing recipients according to some strange scheme?

here are the full headers for the email I replied to:

-8<------------------------------------------------------------------------
i.e. a separate memzone for each container?

imho the memzone approach is inconvenient for page sharing and shares accounting. it also makes memory management more strict, forbids per-container overcommitting etc.

Maybe you have some ideas how we can decide on this?

Thanks,
Kirill
-
Yep. Straightforward machine partitioning. An attractive thing is that it

We need to work out what the requirements are before we can settle on an implementation.

Sigh. Who is running this show? Anyone?

You can actually do a form of overcommitment by allowing multiple containers to share one or more of the zones. Whether that is sufficient or suitable I don't know. That depends on the requirements, and we haven't even discussed those, let alone agreed to them.
-
well, I guess all existing OS-Level virtualizations (Linux-VServer, OpenVZ, and FreeVPS) have stated more than one time that _sharing_ of resources is a central element, and one especially important resource to share is memory (RAM) ...

if your aim is full partitioning, we do not need to bother with OS-Level isolation, we can simply use

Linux-VServer (and probably OpenVZ):

 - shared mappings of 'shared' files (binaries and libraries) to allow for reduced memory footprint when N identical guests are running

 - virtual 'physical' limit should not cause swap out when there are still pages left on the host system (but pages of over limit guests can be preferred for swapping)

 - accounting and limits have to be consistent and should roughly represent the actual used memory/swap (modulo optimizations, I can go into detail here, if necessary)

 - OOM handling on a per guest basis, i.e. some out of memory condition in guest A must not affect guest B

HTC,
-
How about we drill down on these a bit more. So, it sounds like this can be phrased as a requirement like: "Guests must be able to share pages." Can you give us an idea why this is so? On a typical vserver system, how much memory would be lost if guests were not permitted to share Is this a really hard requirement? It seems a bit fluffy to me. An added bonus if we can do it, but certainly not the most important requirement in the bunch. What are the consequences if this isn't done? Doesn't a loaded system eventually have all of its pages used anyway, so won't this always be a temporary situation? This also seems potentially harmful if we aren't able to get pages *back* that we've given to a guest. Tasks can pin pages in lots of So, consistency is important, but is precision? If we, for instance, used one of the hashing schemes, we could have some imprecise decisions made but the system would stay consistent overall. This requirement also doesn't seem to push us in the direction of having distinct page owners, or some sharing mechanism, because both would be I'll agree that this one is important and well stated as-is. Any disagreement on this one? -- Dave -
sure, one reason for this is that guests tend to be similar (or almost identical) which results in quite a lot of 'shared' libraries and executables which would otherwise get cached for each guest and

let me give a real world example here:

 - typical guest with 600MB disk space
 - about 100MB guest specific data (not shared)
 - assumed that 80% of the libs/tools are used gives 400MB of shared read only data

assumed you are running 100 guests on a host, that makes ~39GB of virtual memory which will get paged in and out over and over again ...

well, let's look at the overall memory resource function with the above assumptions:

 with sharing:    f(N) = N*80M + 400M
 without sharing: g(N) = N*480M

 so the decrease N->inf: g/f -> 6 (factor)

which is quite realistic, if you consider that there are only so many distributions, OTOH, the factor might become less important when the

no, not hard, but a reasonable optimization ...

let me note once again, that for full isolation you better go with Xen or some other Hypervisor because if you make it work like Xen, it will become as slow and resource hungry as any other

most optimizations might look strange at first glance, but when you check what the limiting factors for OS-Level virtualizations are, you will find that it looks like this:

 (in order of decreasing relevance)

 - I/O subsystem
 - available memory
 - network performance
 - CPU performance

note: this is for 'typical' guests, not for number crunching or special database, or pure

nope, not the _most_ important one, but it

let's consider a quite limited guest (or several of them) which have a 'RAM' limit of 64MB and additional 64MB of 'virtual swap' assigned ...

if they use roughly 96MB (memory footprint) then having this 'fluffy' optimization will keep them running without any effect on the host side, but without, they will continuously swap in and out which will affect not only the host, but also the

no, the idea is not to ...
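The memory functions f(N) and g(N) from the mail above can be evaluated directly. This is only a small sketch: the 80MB private and 400MB shared figures are the assumptions stated in that mail, and the function names are made up here.

```c
/* All sizes in MB; both constants are the assumptions from the mail. */
#define PRIVATE_MB 80L            /* per-guest, unshared */
#define SHARED_MB  400L           /* read-only data shared by all guests */

/* f(N): one shared copy plus N private parts. */
long mem_with_sharing(long n)
{
    return n * PRIVATE_MB + SHARED_MB;
}

/* g(N): every guest carries its own copy of the shared data. */
long mem_without_sharing(long n)
{
    return n * (PRIVATE_MB + SHARED_MB);
}

/* g(N)/f(N); approaches 480/80 = 6 as N grows. */
double sharing_gain(long n)
{
    return (double)mem_without_sharing(n) / (double)mem_with_sharing(n);
}
```

For N = 100 this gives 8400MB with sharing against 48000MB without, a factor of about 5.7, already close to the limit of 6.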
I get the general idea here, but I just don't think those numbers are
very accurate. My laptop has a bunch of gunk open (xterm, evolution,
firefox, xchat, etc...). I ran this command:
lsof | egrep '/(usr/|lib.*\.so)' | awk '{print $9}' | sort | uniq | xargs du -Dcs
and got:
113840 total
On a web/database server that I have (ps aux | wc -l == 128), I just ran
the same:
39168 total
That's assuming that all of the libraries are fully read in and
populated, just by their on-disk sizes. Is that not a reasonable measure
of the kinds of things that we can expect to be shared in a vserver? If
so, it's a long way from 400MB.
Could you try a similar measurement on some of your machines? Perhaps
I don't doubt this, but doing this two-level page-out thing for
containers/vservers over their limits is surely something that we should
consider farther down the road, right?
It's important to you, but you're obviously not doing any of the
All workloads that use $limit+1 pages of memory will always pay the
price, right? :)
-- Dave
-
Think shell scripts and the like. From what I have seen I would agree that it is typical for application code not to dominate application memory usage. However on the flip side it is not uncommon for application code to dominate disk usage. Some of us have giant music, video or code databases that consume a lot of disk space, but in many instances servers don't have enormous chunks of private files, and even when they do they share the files from the distribution.

The result of this is that there are a lot of unmapped pages cached in the page cache for rarely run executables, that are cached just in case we need them.

So while Herbert's numbers may be a little off, the general principle of the entire system doing better if you can share the page cache is very real.

That the page cache isn't accounted for here isn't terribly important, we still

It is what the current VM of linux does. There is removing a page from processes and then there is writing it out to disk. I think the normal term is second chance replacement. The idea is that once you remove a page from being mapped you let it age a little before it is paged back in. This allows pages in high demand to avoid being written

Tread carefully here. Herbert may not be doing a lot of mainline coding or extremely careful review of potential patches, but he does seem to have a decent grasp of the basic issues, in addition to a reasonable amount of experience, so it is worth listening to what he says.

In addition Herbert does seem to be doing some testing of the mainline

Ugh. You really want swap > RAM here. Because there are real cases when you are swapping when all of your pages in RAM can be cached in the page cache. 96MB with 64MB RSS and 64MB swap is

They should. When you remove an anonymous page from the page tables it needs to be allocated and placed in the swap cache. Once you do that it can sit in the page cache like any file backed page. So the container that hits $limit+1 should get the paging ...
nooooooo. What you're saying there amounts to text replication. There is no proposal here to create duplicated copies of pagecache pages: the VM just doesn't support that (Nick has some protopatches which do this as a possible NUMA optimisation).

So these mmapped pages will continue to be shared across all guests. The problem boils down to "which guest(s) get charged for each shared page".

A simple and obvious and easy-to-implement answer is "the guest which paged it in". I think we should firstly explain why that is insufficient.
-
I guess by "paged it in" you essentially mean "mapped the page into address space for the *first* time"? i.e. how many times the same page mapped into 2 address spaces in the same container should be accounted for? We believe ONE. It is better due to: - it allows better estimate how much RAM container uses. - if one container mapped a single page 10,000 times, it doesn't mean it is worse than a container which mapped only 200 pages and that it should be killed in case of OOM. Thanks, Kirill -
Not really - I mean "first allocated the page". ie: major fault(), read(),

I'm not sure that we need to account for pages at all, nor care about rss.

If we use a physical zone-based containment scheme: fake-numa, variable-sized zones, etc then it all becomes moot. You set up a container which has 1.5GB of physical memory then toss processes into it. As that process set increases in size it will toss out stray pages which shouldn't be there, then it will start reclaiming and swapping out its own pages and eventually it'll get an oom-killing.

No RSS accounting or page accounting in sight, because we already *have* that stuff, at the physical level, in the zone.

Overcommitment can be performed by allowing different containers to share the same zone set, or by dynamically increasing or decreasing the size of a physical container.

This all works today with fake-numa and cpusets, no kernel changes needed.

It could be made to work fairly simply with a multi-zone approach, or with resizeable zones.

I'd be interested in knowing what you think the shortcomings of this are likely to be.
-
sounds good to me, just not sure it provides what we
okay, let me ask a few naive questions about this scheme:
how does this work for a _file_ which is shared between
two guests (e.g. an executable like bash, hardlinked
between guests) when both guests are in a different
zone-based container?
+ assumed that the file is read in the first time,
will it be accounted to the first guest doing so?
+ assumed it is accessed in the second guest, will
it cause any additional cache/mapping besides the
dentry stuff?
+ will container A be able to 'toss out' pages
'shared' with container B (assumed sharing is
possible :)
+ when the container A tosses out the pages for this
executable, will guest B still be able to use them?
+ when the pages are tossed out, will they require
the system to read them in again, or will they
here the question is, can a guest have several of
those 'virtual zones' assigned, so that there is a
will do so once I have a better understanding how this
approach will work ...
TIA,
Herbert
-
I was just reading through the (comprehensive) thread about this from last week, so forgive me if I missed some of it. The idea is really tempting, precisely because I don't think anyone really wants to have to screw with the reclaim logic. I'm just brain-dumping here, hoping that somebody has already thought through some of this stuff. It's not a bitch-fest, I promise. :) How do we determine what is shared, and goes into the shared zones? Once we've allocated a page, it's too late because we already picked. Do we just assume all page cache is shared? Base it on filesystem, mount, ...? Mount seems the most logical to me, that a sysadmin would have to set up a container's fs, anyway, and will likely be doing special things to shared data, anyway (r/o bind mounts :). There's a conflict between the resize granularity of the zones, and the storage space their lookup consumes. We'd want a container to have a limited ability to fill up memory with stuff like the dcache, so we'd appear to need to put the dentries inside the software zone. But, that gets us to our inability to evict arbitrary dentries. After a while, would containers tend to pin an otherwise empty zone into place? We could resize it, but what is the cost of keeping zones that can be resized down to a small enough size that we don't mind keeping it there? We could merge those "orphaned" zones back into the shared zone. Were there any requirements about physical contiguity? What about minimum zone sizes? If we really do bind a set of processes strongly to a set of memory on a set of nodes, then those really do become its home NUMA nodes. If the CPUs there get overloaded, running it elsewhere will continue to grab pages from the home. Would this basically keep us from ever being able to move tasks around a NUMA system? -- Dave -
Assuming we had a means of creating a zone that was assigned to a container, and a second zone for shared data between a set of containers. For shared data, the time the pages are being allocated is at page fault time. At that point, the faulting VMA is known and you also know if it's MAP_SHARED or not.

The caller allocating the page would select (or create) a zonelist that is appropriate for the container. For shared mappings, it would be one zone - the shared zone for the set. For private mappings, it would be one zone - the zone assigned to the container. For overcommit, the allowable zones for overcommit could be included. Allowing overcommit opens the possibility for containers to interfere with each other but I'm guessing that if overcommit is enabled, the administrator is willing to live with that interference.

This has the awkward possibility of having two "shared" zones for two container sets and one file that needs sharing. Similarly, there is a possibility for having a container that has no shared zone and faulted in shared data. In that case, the page ends up in the first faulting container set and it's too bad it got "charged" for the page use on behalf of other containers. I'm not sure there is a sane way of accounting this situation fairly.

I think that it's important to note that once data is shared between containers at all that they have the potential to interfere with each other (by reclaiming

We'd choose the appropriate zonelist before faulting. Once allocated,

I have no strong feelings here. To me, it's "who do I assign this fake zone to?" I guess you would have at least one zone per container mount

Stuff like shrinking dentry caches is already pretty coarse-grained. Last I looked, we couldn't even shrink within a specific node, let alone

Merging "orphaned" zones back into the "main" zone would seem a sensible

For the lookup to software zone to be efficient, it would be easiest to have them as MAX_ORDER_NR_PAGES contiguous. This would avoid having to ...
Well, but MAP_SHARED does not necessarily mean shared outside of the container, right? Somebody wishing to get around resource limits could just MAP_SHARED any data they wished to use, and get it into the shared area before their initial use, right?

I shouldn't have used dentries as an example. I'm just saying that if we end up with (or can end up with) a whole ton of these software zones, we might have troubles storing them. I would imagine the issue would

OK, but merging wouldn't be possible if they're not physically contiguous. I guess this could be worked around by just calling it a

I was mostly wondering about zones spanning other zones. We _do_

I know we _try_ to avoid this these days, but I'm not sure how taking it away as an option will affect anything.

-- Dave
-
Well, the data could also be shared outside of the container. I would see

They would only be able to impact other containers in a limited sense. Specifically, if 5 containers have one shared area, then any process in those 5 containers could exceed their container limits at the expense of

A normal read/write, if it's the first reader of a file, would get charged to the container, not to the shared area. It is less likely that a file that is read()

That is an immediate problem. There needs to be a way of mapping an arbitrary page to a software zone. page_zone() as it is could only resolve the "main" zone. If additional bits were used in page->flags, there would be very hard limits on the number of containers that can exist.

If zones were physically contiguous to MAX_ORDER, pageblock flags from the anti-fragmentation could be used to record that a block of pages was in a container and what the ID is. If non-contiguous software zones were required, page->zone could be reintroduced for software zones to be used when a page belongs to a container. It's not ideal, but the proper way of mapping pages to software zones might be more obvious then, when we'd see where page->zone was used.

With either approach, the important thing that occurred to me is to be sure that pages only come from the same hardware zone. For example, do not mix HIGHMEM pages with DMA pages because it'll fail miserably. For RSS accounting, this is not much of a restriction but it does have an impact on

In practice, overlapping zones never happen today so a few new bugs based on assumptions about MAX_ORDER_NR_PAGES being aligned in a zone

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
-
I played with an approach where you can bind a dentry to a set of memory zones, and any children of that dentry would inherit the mempolicy; I was envisaging that most data wouldn't be shared between different containers/jobs, and that userspace would set up "shared" zones for big shared regions such as /lib, /usr, /bin, and for move_pages() will let you shuffle tasks from one node to another without too much intrusion. Paul -
Here is a wacky one.

Suppose there is some NFS server that exports something that most machines want to mount, like company home directories.

Suppose multiple containers mount that NFS server based on local policy. (If we can allow non-root users to mount filesystems, a slightly more trusted guest admin certainly will be able to.)

The NFS code as currently written (unless I am confused) will do everything in its power to share the filesystem cache between the different mounts (including the dentry tree).

How do we handle big shared areas like that?

Dynamic programming solutions where we discover the areas of sharing at runtime seem a lot more general than a priori solutions where you have to predict what will come next.

If a priori planning and knowledge about sharing is the best we can do, it is the best we can do and we will have to live with the limits that imposes. Given the inflexibility in use and setup I'm not yet ready to concede that this is the best we can do.

Eric
-
My first worry was that this approach is unfair to the poor bastard that happened to get started up first. If we have a bunch of containerized web servers, the poor guy who starts Apache first will pay the price for keeping it in memory for everybody else. That said, I think this is naturally worked around. The guy charged unfairly will get reclaim started on himself sooner. This will tend to page out those pages that he was being unfairly charged for. Hopefully, they will eventually get pretty randomly (eventually evenly) spread among all users. We just might want to make sure that we don't allow ptes (or other new references) to be re-established to pages like this when we're trying to reclaim them. Either that, or force the next toucher to take ownership of the thing. But, that kind of arbitrary ownership transfer can't happen if we have rigidly defined boundaries for the containers. The other concern is that the memory load on the system doesn't come from the first user ("the guy who paged it in"). The long-term load comes from "the guy who keeps using it." The best way to exemplify this is somebody who read()s a page in, followed by another guy mmap()ing the same page. The guy who did the read will get charged, and the mmap()er will get a free ride. We could probably get an idea when this kind of stuff is happening by comparing page->count and page->_mapcount, but it certainly wouldn't be conclusive. But, does this kind of nonsense even happen in practice? -- Dave -
"Is it useful for me as a bad guy to make it happen ?" Alan -
A very fine question. ;) To exploit this, you'd need to:
1. access common data with another user
2. be patient enough to wait
3. determine when one of those users had actually pulled a page in from disk, which sys_mincore() can do, right?
I guess that might be a decent reason to not charge the guy who brings the page in for the page's entire lifetime. So, unless we can change page ownership after it has been allocated, anyone accessing shared data can get around resource limits if they are patient. -- Dave -
To create a DOS attack.
- Allocate some memory you know your victim will want in the future, (shared libraries and the like).
- Wait until your victim is using the memory you allocated.
- Terminate your memory resource group.
- Victim is pushed over memory limits by your exiting.
- Victim can no longer allocate memory
- Victim dies
It's not quite that easy unless your victim calls mlockall(MCL_FUTURE), but the potential is clearly there. Am I missing something? Or is this fundamental to any first touch scenario? I just know I have problems with first touch because it is darn hard to reason about. Eric -
I think it's fundamental to any case where two containers share the use of the page, but either one _can_ be charged but does not receive a _full_ charge for it. I don't think it's uniquely associated with first-touch schemes. The software zones approach where there would be a set of "shared" zones would not have this problem, because any sharing would have to occur on data on which neither one was being charged. http://linux-mm.org/SoftwareZones -- Dave -
True. The "shared" zones approach would simply have the problem that it would make sharing hard and thus reduce the effectiveness of the page cache. The "shared" zone approach also would seem to interact in very weird ways with real NUMA and memory hotplug or process migration. The fact that we actually have to care about the real memory size on the machine makes me look at it strange. Zones should definitely be penalized in some category for the reduction in efficiency of the page cache. It took us decades to learn that the most efficient page cache was one that could resize and reallocate memory on demand based on the current usage. Zones and possibly anything else with the concept of page ownership seems to be trying to be ignoring Looking at your page, and I'm too lazy to figure out how to update it I have a couple of comments. - Why do limits have to apply to the unmapped page cache? - Could you mention proper multi process RSS limits? (I.e. we count the number of pages each group of processes has mapped and limit that). It is the same basic idea as partial page ownership, but instead of page ownership you just count how many pages each group is using and strictly limit that. There is no page ownership or partial charges. The overhead is just walking the rmap list at map and unmap time to see if this is the first user in the container. No additional kernel data structures are needed. Eric -
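The multi-process RSS limit Eric describes (count a page once per group, on the first mapping from that group, and drop it on the last unmapping) can be sketched as a toy model. This is not kernel code: `GroupRssCounter`, `map_page` and `unmap_page` are hypothetical names, and a plain per-page list stands in for the rmap walk.

```python
from collections import defaultdict


class GroupRssCounter:
    """Toy model: per-group RSS derived from mappings, with no page owner."""

    def __init__(self, limit):
        self.limit = limit                # max distinct pages a group may map
        self.rss = defaultdict(int)       # group -> distinct pages mapped
        self.mappers = defaultdict(list)  # page -> groups mapping it (rmap stand-in)

    def map_page(self, group, page):
        # Walk the page's mappers to see whether this group already has it.
        first_in_group = group not in self.mappers[page]
        if first_in_group and self.rss[group] >= self.limit:
            return False                  # over limit: caller would reclaim or fail
        self.mappers[page].append(group)
        if first_in_group:
            self.rss[group] += 1
        return True

    def unmap_page(self, group, page):
        self.mappers[page].remove(group)
        # Only the last reference from this group lowers its count.
        if group not in self.mappers[page]:
            self.rss[group] -= 1
```

In this model a page shared by two groups counts fully against both, so one group's usage never depends on what the other happens to have mapped.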
You just need to create an account by clicking the Login button. It lets you edit things after that. But, I'd be happy to put anything in To me, it is just because it consumes memory. Unmapped cache is, of course, much more easily reclaimed than mapped files, but it still fundamentally causes pressure on the VM. To me, a process sitting there doing constant reads of 10 pages has the same overhead to the VM as a process sitting there with a 10 page file I've tried to capture this. Let me know what else you think it needs. http://linux-mm.org/SoftwareZones -- Dave -
I can see temporarily accounting for pages in use for such a read/write and possibly during things such as read ahead. However I doubt it is enough memory to be significant, and as such is probably a waste of time accounting for it. A memory limit is not about accounting for memory pressure, so I think the reasoning for wanting to account for unmapped pages as a hard requirement is still suspect. A memory limit is to prevent one container from hogging all of the memory in the system, and denying it to other containers. The page cache by definition is a global resource that facilitates global kernel optimizations. If we kill those optimizations we are on the wrong track. By requiring limits there I think we are very likely to kill our very important global optimizations, and bring
Requirements:
- The current kernel global optimizations are preserved and useful. This does mean one container can affect another when the optimizations go awry but on average it means much better performance. For many the global optimizations are what make the in-kernel approach attractive over paravirtualization.
Very nice to have:
- Limits should be on things user space has control of. Saying you can only have X bytes of kernel memory for file descriptors and the like is very hard to work with. Saying you can have only N file descriptors open is much easier to deal with.
- SMP Scalability. The final implementation should have per cpu counters or per task reservations so in most instances we don't need to bounce a global cache line around to perform the accounting.
Nice to have:
- Perfect precision. Having every last byte always accounted for is nice but a little bit of bounded fuzziness in the accounting is acceptable if that makes the accounting problem more tractable.
We need several more limits in this discussion to get a full picture, otherwise we may try to build the all singing all dancing limit.
- A limit on the number of ...
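The trade-off between SMP scalability and "bounded fuzziness" mentioned above can be illustrated with a batched counter. This is a sketch under assumptions, not an existing kernel interface: `BatchedCounter`, its `batch` threshold and the `shared_updates` statistic are made up for illustration.

```python
class BatchedCounter:
    """Toy model of a shared counter fed by per-CPU local deltas."""

    def __init__(self, ncpus, batch):
        self.total = 0             # shared value (the contended cache line)
        self.local = [0] * ncpus   # per-CPU deltas not yet folded in
        self.batch = batch
        self.shared_updates = 0    # how often the shared value was touched

    def add(self, cpu, amount):
        self.local[cpu] += amount
        # Fold into the shared total only when the local delta gets large.
        if abs(self.local[cpu]) >= self.batch:
            self.total += self.local[cpu]
            self.local[cpu] = 0
            self.shared_updates += 1

    def approximate(self):
        # Cheap read; with unit-sized updates it is off by less than ncpus * batch.
        return self.total

    def exact(self):
        # Precise read; has to visit every CPU's local delta.
        return self.total + sum(self.local)
```

With 4 CPUs and a batch of 8, a hundred single-page charges touch the shared total only 12 times, at the cost of an approximate reading that can lag the exact one by a bounded amount.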
exactly! nevertheless, you might want to extend that to swapping that is my major concern for most of the 'straight forward' agreed, we want to optimize for small systems as well as for large ones, and SMP/NUMA is quite as long as the accounting is consistent, i.e. you do not lose resources by repetitive operations inside the guest (or through guest-guest interaction) with shared files, otherwise an lvm partition does I/O and CPU limits are special, as they have the temporal component, i.e. you are not interested in 10s CPU time, instead you want 0.5s/s CPU (same for I/O) note: this is probably also true for page in/out - sockets - locks - dentries HTH, -
Let's say you have an mmap'd file. It has zero pages brought in right now. You do a write to it. It is well within the kernel's rights to let you write one word to an mmap'd file, then unmap it, write it to disk, and free the page. To me, mmap() is an interface, not a directive to tell the kernel to keep things in memory. The fact that two reads of a byte from an mmap()'d file tend to not go to disk or even cause a fault for the second read is because the page is in the page cache. The fact that two consecutive read()s of the same disk page tend to not cause two trips to the disk is because the page is in the page cache. Anybody who wants to get data in and out of a file can choose to use either of these interfaces. A page being brought into the system for either a read or touch of an mmap()'d area causes the same kind of memory pressure. So, I think we have a difference of opinion. I think it's _all_ about memory pressure, and you think it is _not_ about accounting for memory pressure. :) Perhaps we mean different things, but we appear to disagree greatly on the surface. Can we agree that there must be _some_ way to control the amounts of unmapped page cache? Whether that's related somehow to the same way we control RSS or done somehow at the I/O level, there must be some way to ... I've tried to capture this: Definitely. I think we've all agreed that memory is the hard one, though. If we can make progress on this one, we're set! :) -- Dave -
I think it is about preventing a badly behaved container from having a significant effect on the rest of the system, and in particular other containers on the system. See below. I think to reach agreement we should start by discussing the algorithm that we see being used to keep the system functioning well and the theory behind that algorithm. Simply limiting memory is not A lot depends on what we measure and what we try and control. Currently what we have been measuring are amounts of RAM, and thus what we are trying to control is the amount of RAM. If we want to control memory pressure we need a definition and a way to measure it. I think there may be potential if we did that but we would still need a memory limit to keep things like mlock in check. So starting with some definitions and theory. RSS is short for resident set size. The resident set being how many pages are currently in memory and not on disk and used by the application. This includes the memory in page tables, but can reasonably be extended to include any memory a process can be shown to be using. In theory there is some minimal RSS that you can give an application at which it will get productive work done. Below the minimal RSS the application will spend the majority of real time waiting for pages to come in from disk, so it can execute the next instruction. The ultimate worst case here is a read instruction appearing on one page and its datum on another. You have to have both pages in memory at the same time for the read to complete. If you set the RSS hard limit to one page the program will be continually restarting either because the page it is on is not in memory or the page it is reading from is not in memory. What we want to accomplish is to have a system that runs multiple containers without problems. As a general memory management policy we can accomplish this by ensuring each container has at least its minimal RSS quota of pages. By watching the paging activity of a ...
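The worst case described above (an instruction on one page reading a datum on another, under a one-page RSS limit) can be made concrete with a small fault-counting sketch. It assumes an LRU resident set and only counts faults; `count_faults` and the page names are hypothetical, and a real CPU under a one-page limit would never complete the instruction at all.

```python
from collections import OrderedDict


def count_faults(accesses, rss_limit):
    """Count page faults for an access sequence under an LRU resident set."""
    resident = OrderedDict()               # page -> None, oldest first
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)     # mark as most recently used
            continue
        faults += 1
        if len(resident) >= rss_limit:
            resident.popitem(last=False)   # evict the least recently used page
        resident[page] = None
    return faults


# One instruction on page "code" reading a datum on page "data", attempted 5 times.
accesses = ["code", "data"] * 5
faults_below_minimum = count_faults(accesses, rss_limit=1)
faults_at_minimum = count_faults(accesses, rss_limit=2)
```

Below the minimal RSS of two pages every access faults; at two pages only the first touch of each page does.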
that is exactly what we (Linux-VServer) want ... (sounds good to me, please keep up the good work in this direction) there is nothing wrong with hard limits if somebody really wants them, even if they hurt the system as a whole, but those limits shouldn't be the default .. best, -
That's Dave's point, I believe. Limiting mapped memory may be mostly OK for well behaved applications, but it doesn't do anything to stop bad ones from effectively DoSing the system or ruining any guarantees you might proclaim (not that hard guarantees are always possible without using virtualisation anyway). This is why I'm surprised at efforts that go to such great lengths to get accounting "just right" (but only for mmaped memory). You may as well not even bother, IMO. Give me an RSS limit big enough to run a couple of system calls and a loop... -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -
Would any of them work on a system on which every filesystem was on ramfs, and there was no swap? If not then they are not memory attacks but I/O attacks. I completely concede that you can DOS the system with I/O if that is not limited as well. My point is that is not a memory problem but a disk I/O problem which is much easier and cheaper to solve. Disk I/O is fundamentally a slow path which makes it hard to modify it in a way that negatively affects system performance. I don't think with a memory RSS limit you can DOS the system in a way that is purely about memory. You have to pick a different kind of DOS attack. As for virtualization, that is what a kernel is about: virtualizing its resources so you can have multiple users accessing them at the same time. You don't need some hypervisor or virtual machine to give you that. That is where we start. However it was found long ago that global optimizations give better system throughput than the rigid systems you can get with hypervisors. Although things are not quite as deterministic when you optimize globally. They should be sufficiently deterministic you can avoid the worst of the DOS attacks. The real practical problem with the current system is that nearly all of our limits are per process and applications now span more than one process so the limits provided by linux are generally useless to limit real world applications. This isn't generally a problem until we start trying to run multiple applications on the same system because the hardware is so powerful. Which the namespace work which will allow you to run several different instances of user space simultaneously is likely to allow. At the moment I am very much in a position of doing review not implementing this part of it. I'm trying to get the people doing the implementation to make certain they have actually been paying attention to how their proposed limits will interact with the rest of the system. So far generally the conversation has ...
It can be done trivially without performing any IO or swap, yes. -- SUSE Labs, Novell Inc. -
Please give me a rough sketch of how to do so. Or is this about DOS'ing the system by getting the kernel to allocate a large number of data structures (struct file, struct inode, or the like)? Eric -
Reading sparse files is just one I had in mind. But I'm not very That works too. And I don't believe hand-accounting and limiting all these things individually as a means to limit RAM usage is sane, when you have a much more comprehensive and relatively unintrusive page level scheme. -- SUSE Labs, Novell Inc. -
I truly understand your point here. But, I don't think this thought exercise is really helpful here. In a pure sense, nothing is keeping an unmapped page cache file in memory, other than the user's prayers. But, please don't discount their prayers, it's what they want! I seem to remember a quote attributed to Alan Cox around OLS time last year, something about any memory controller being able to be fair, fast, and accurate. Please pick any two, but only two. Alan, did I get close? To me, one of the keys of Linux's "global optimizations" is being able to use any memory globally for its most effective purpose, globally (please ignore highmem :). Let's say I have a 1GB container on a machine that is at least 100% committed. I mmap() a 1GB file and touch the entire thing (I never touch it again). I then go open another 1GB file and r/w to it until the end of time. I'm at or below my RSS limit, but that 1GB of RAM could surely be better used for the second file. How do we do this if we only account for a user's RSS? Does this fit into Alan's unfair bucket? ;) Also, in a practical sense, it is also a *LOT* easier to describe to a customer that they're getting 1GB of RAM than >=20GB/hr of bandwidth from the disk. -- Dave P.S. Do we have any quotas on ramfs? If we have any ramfs filesystems, what keeps the containerized users from just filling up RAM? -
what's the difference to a normal Linux system here? when low on memory, the system will reclaim pages, and if you want something which is easy to describe for the 'customer', then a VM is what you are looking for, it has a perfectly well defined amount of resources which will tmpfs has hard limits, you simply specify it on mount
none /tmp tmpfs size=16m,mode=1777 0 0
best, -
But would it not bias application writers towards using read()/write() calls over mmap()? They know that their calls are likely to be faster when the application is run in a container. Without page cache control we'll end up creating an asymmetrical container, where certain usage is charged and some usage is not. Also, please note that when a page is unmapped and moved to swap cache; the swap cache uses the page cache. Without page cache control, we could end up with too many pages moving over to the swap cache and still occupying memory, while the original intention was to avoid this scenario. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
I think it would be very difficult in practice to exploit a situation where an evil guy forces another container to hold shared pages that the container Exactly. That said, the "poor bastard" will have to be pretty determined to page out because the pages will appear active but it should happen I don't think anything like that currently exists. It's almost the opposite of what the current reclaim algorithm would be trying to do because it has no notion of containers. Currently, the idea of paging out something in active use is a mad plan. Maybe what would be needed is something where the shared page is unmapped from page tables and the next faulter must copy the page instead of reestablishing the PTE. The data copy is less than ideal but it'd be cheaper than reclaim and help the accounting. However, it would require a counter to track "how Right, charging the next toucher would not work in the zones case. The next toucher would establish a PTE to the page which is still in the zone of the I think this problem would happen with other accounting mechanisms as well. However, it's more pronounced with zones because there are harder limits on memory usage. If the counter existed to track "how many processes in this container have mapped the page", the problem of free-riders could be investigated by comparing _mapcount to the container count. That would determine if additional steps are required or not to force another container to assume the accounting cost. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
So what to do when virtual physical limit is hit? This is true for current implementation for both - this patchset and OpenVZ beancounters. If you sum up the physpages values for all containers This is done in current patches. Herbert, did you look at the patches before sending this mail or do you just want to 'take part' in conversation w/o understanding -
nice, but the question was about _requirements_ when the RSS limit is hit, but there _are_ enough pages left on the physical system, there is no good reason to swap out the page at all - there is no benefit in doing so (performance wise, that is) - it actually hurts performance, and could become a separate source for DoS what should happen instead (in an ideal world :) is that the page is considered swapped out for the guest (add guest penalty for swapout), and when the page would be swapped in again, the guest takes a penalty (for the 'virtual' page in) and the page is returned to the guest, possibly kicking again, the question was about requirements, not your patches, and yes, I had a look at them _and_ the OpenVZ implementations ... best, Herbert -
Does the page stay mapped for the container or not? If yes then what's the use of limits? Container mapped pages more than the limit is but all the pages are -
sounds weird, but makes sense if you look at the full picture just because the guest is over its page limit doesn't mean that you actually want the system to swap stuff out, what you really want to happen is the following: - somehow mark those pages as 'gone' for the guest - penalize the guest (and only the guest) for the 'virtual' swap/page operation - penalize the guest again for paging in the page - drop/swap/page out those pages when the host system you tell me? or is that an option in OpenVZ? best, -
Yeah! And slow down the container which caused global limit hit (w/o hitting its own limit!) by swapping In OpenVZ we account resources in host system as well. -
great. I agree with that. Just curious why current vserver code kills arbitrary depends on whether you will include beancounter 0 usages or not :) Kirill -
because it obviously lacks the finesse of OpenVZ code :) seriously, handling the OOM kills inside a container has never been a real world issue, as once you are really out of memory (and OOM starts killing) you usually have lost the game anyways (i.e. a guest restart or similar is required to get your services up and running again) and OOM killer decisions are not perfect in mainline either, but, you've probably seen the FIXME and TODO entries in the code showing that this so that is an option then? best, -
I'm talking not about the finesse of the code, but rather about the lack of isolation, i.e. one VE can affect others. Kirill -
We discussed zones for resource control and some of the disadvantages at
http://lkml.org/lkml/2006/10/30/222
I need to look at Mel's patches to determine if they are suitable for
control. But in a thread of discussion on those patches, it was agreed
We discussed some of the requirements in the RFC: Memory Controller
requirements thread
All the stakeholders involved in the RFC discussion :-) We've been
talking and building on top of each other's patches. I hope that was a
There are other things like resizing a zone, finding the right size,
etc. I'll look
at Mel's patches to see what is supported.
Warm Regards,
Balbir Singh
-
And misses every resource sharing opportunity in sight. Except for filtering which pages are eligible for reclaim an RSS limit should not need to change the existing reclaim logic, and with things like the memory zones we have had that kind of restriction in the reclaim logic If you are talking about RSS limits the term is well defined. The number of pages you can have mapped into your set of address space at any given time. Unless I'm totally blind that isn't what the patchset implements. A true RSS limit over multiple processes has a lot of potential to be generally useful, is very understandable, doesn't affect kernel cache decisions so largely performance should not be affected. There is a little more overhead in the fault logic but that is a moderately Another really nasty issue is the container term as the resource guys are using the term in a subtly different way than it has been used with namespaces leading to several threads where the participants talked past each other. We need a different term to designate the group of tasks a resource controller is dealing with. The whole filesystem interface also is over general and makes it too easy to express the hard things (like move an existing task from one group of tasks to another) leading to code complications. On the up side I think the focus is likely in the right place to start delivering usable code. Eric -
exactly this is implemented in the current patches from Pavel. the only difference is that filtering is not done in general LRU list, which is not effective, but via per-container LRU list. So the pointer on the page structure does 2 things: - fast reclamation - correct uncharging of page from where it was charged (e.g. shared pages can be mapped first in one container, but the last unmap Ouch, what makes you think so? The fact that a page mapped into 2 different processes is charged only once? Imho it is much more correct than the sum of process' RSS within container, due to: 1. it is clear how much container uses physical pages, not abstract items 2. shared pages are charged only once, so the sum of containers RSS is still Thanks, Kirill -
No the fact that a page mapped into 2 separate mm_structs in two separate accounting domains is counted only once. This is very likely to happen with things like glibc if you have a read-only shared copy of your distro. There appears to be no technical reason for such a restriction. A page should not be owned. Going further unless the limits are draconian I don't expect users to hit the rss limits often or frequently. So in 99% of all cases page reclaim should continue to be global. Which makes me question messing with the general page reclaim lists. Now if the normal limits turn out to be draconian it may make sense to split the first level of page lists by some reasonable approximation Maybe. The extra locking complexity gives me fits. But in the grand scheme of things it is minor as long as it is not user perceptible we can fix it later. I'm still wrapping my head around the weird fs concepts. Eric -
I would be happy to propose OVZ approach then, where a page is tracked with page_beancounter data structure, which ties together a page with beancounters which use it like this: page -> page_beancounter -> list of beancounters which have the page mapped This gives a number of advantages: - the page is accounted to all the VEs which actually use it. - allows almost accurate tracking of page fractions used by VEs depending on how many VEs mapped the page. - allows to track dirty pages, i.e. which VE dirtied the page and implement correct disk I/O accounting and CFQ write scheduling It is not that rare when containers hit their limits, believe me :/ In trusted environments - probably you are right, in hosting - no. Thanks, Kirill -
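The page -> page_beancounter -> list of beancounters idea, with each page split into equal fractions among the VEs mapping it, might look like the following toy model. It is not the OpenVZ implementation (which the message itself calls only "almost accurate"): `FractionalAccounting` and its methods are hypothetical, and exact rational fractions are used for clarity.

```python
from collections import defaultdict
from fractions import Fraction


class FractionalAccounting:
    """Toy model: each page is split evenly among the beancounters mapping it."""

    def __init__(self):
        self.users = defaultdict(set)   # page -> beancounters that have it mapped

    def map_page(self, bc, page):
        self.users[page].add(bc)

    def unmap_page(self, bc, page):
        self.users[page].discard(bc)
        if not self.users[page]:
            del self.users[page]        # nobody maps the page any more

    def usage(self, bc):
        # A page mapped by n beancounters contributes 1/n to each of them.
        shares = (Fraction(1, len(u)) for u in self.users.values() if bc in u)
        return sum(shares, Fraction(0))
```

One property of this scheme is that the per-VE usages always add up to the number of distinct pages in use, so nothing is counted twice and nothing is free.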
The wording looks very familiar :-). It would be useful to add "The reclaim logic is now container aware, when the container goes overlimit the page reclaimer reclaims pages belonging to this container. If we are unable to reclaim enough pages to satisfy the request, the process is Yes, this is what I was planning to get to -- a per container LRU list. But you have just one list, don't you need active and inactive lists? When the global LRU is manipulated, shouldn't this list be updated as The return codes of the functions are a bit confusing, ideally container_try_to_free_pages() should return 0 on success. Also res_counter_charge() has a WARN_ON(1) if the limit is exceeded. The system administrator can figure out the details from failcnt, I suspect when the container is running close to its limit, dmesg will have too many WARNING messages. How much memory do you try to reclaim in container_try_to_free_pages()? With my patches, I was planning to export this knob to userspace with a default value. This will help the administrator decide how much of the working set/container LRU should be freed on reaching the limit. I cannot find the definition of container_try_to_free_pages() in This is not good, it won't give us LRU behaviour which is Which part of the working set are we pushing out, this looks like we are using FIFO to determine which pages to reclaim. This needs This would lead to LRU churning, I would recommend using list_splice_tail() instead. Since this code has a lot in common with isolate_lru_pages, it would be nice to reuse the code in vmscan.c NOTE: Code duplication is a back door for subtle bugs and solving the same I see that the charges are not migrated. Is that good? If a user could find a way of migrating his/her task from one container to another, it could create an issue with the user's task taking up a big chunk of the RSS limit. Can we migrate any task or just the thread group leader. In my patches, I allowed migration of just the ...
Nope - res_counter_uncharge() has - this is an absolutely At least one page. This is enough to make one page charge. That's the difference from general try_to_free_pages() that This is in patch #5. Why not - recently used pages are in the head of the list. Active/inactive state of the page is determined from its flags. The idea of this list is to decrease the number of pages scanned This algo works exactly like general try_to_free_pages() does. Anyway - page migration may be done later with a -
Adds needed pointers to mm_struct and page struct, places hooks to core code for mm_struct initialization and hooks in container_init_early() to preinitialize RSS accounting subsystem.
An extra pointer in struct page is unlikely to fly. Both because it increases the size of a size critical structure, and because conceptually it is ridiculous. If you are limiting the RSS size you are counting the number of pages in the page tables. You don't care about the page itself. With the rmap code it is relatively straight forward to see if this is the first time a page has been added to a page table in your rss group, or if this is the last reference to a particular page in your rss group. The counters should only increment the first time a particular page is added to your rss group. The counters should only decrement when it is the last reference in your rss subsystem. This allows important little cases like glibc to be properly accounted for. One of the key features of a rss limit is that the kernel can still keep pages that you need in-core, that are accessible with just a minor fault. Directly owning pages works directly against that Eric -
as it was discussed multiple times (and according OLS):
- it is not critical nowadays to expand struct page a bit in case
accounting is on.
- it can be done w/o extending, e.g. via mapping page <-> container
using hash or some other data structure.
You are fundamentally wrong if shared pages are concerned.
Imagine a glibc page shared between 2 containers - VE1 and VE2.
VE1 was the first who mapped it, so it is accounted to VE1
(rmap count was increased by it).
now VE2 maps the same page. You can't determine whether this page is mapped
to this container or another one w/o page->container pointer.
All the choices you have are:
a) do not account this page, since it is already accounted to some other VE.
b) account this page again to current container.
(a) is bad, since VE1 can unmap this page first, and the last user will be VE2.
Which means VE1 will be charged for it, while VE2 uncharged. Accounting screws up.
(b) is bad, since:
- the same page is accounted multiple times, which makes impossible
to understand how much real memory pages container needs/consumes
- and because on container enter the process and its pages
are essentially moved to another context, while accounting
Sorry, can't understand what you mean. It doesn't work against.
Each container has its own LRU. So if glibc has the most
often used pages - it won't be thrashed out.
Thanks,
Kirill
-
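The failure in choice (a) of the VE1/VE2 example can be reproduced with a toy model that charges on the first mapping and uncharges on the last unmapping, using only a global per-page map count and no record of who was charged. `MapcountOnly` and its fields are hypothetical names, not code from the patchset.

```python
from collections import defaultdict


class MapcountOnly:
    """First-touch charging driven only by a global per-page map count."""

    def __init__(self):
        self.mapcount = defaultdict(int)  # page -> total number of mappings
        self.charged = defaultdict(int)   # container -> pages charged to it

    def map_page(self, container, page):
        if self.mapcount[page] == 0:
            self.charged[container] += 1  # first toucher pays
        self.mapcount[page] += 1

    def unmap_page(self, container, page):
        self.mapcount[page] -= 1
        if self.mapcount[page] == 0:
            self.charged[container] -= 1  # last unmapper gets the refund


acct = MapcountOnly()
acct.map_page("VE1", "glibc")    # charged to VE1
acct.map_page("VE2", "glibc")    # already accounted, no charge
acct.unmap_page("VE1", "glibc")  # not the last user, no uncharge
acct.unmap_page("VE2", "glibc")  # last user: VE2 is uncharged instead of VE1
```

After this sequence VE1 is still charged for a page nobody maps and VE2 has a negative balance, which is the inconsistency the message describes.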
Hi Kirill, I thought we can always get from the page to the VMA. rmap provides this to us via page->mapping and the 'struct address_space' or anon_vma. Do we agree on that? We can also get from the vma to the mm very easily, via vma->vm_mm, right? We can also get from a task to the container quite easily. So, the only question becomes whether there is a 1:1 relationship between mm_structs and containers. Does each mm_struct belong to one and only one container? Basically, can a threaded process have different threads in different containers? It seems that we could bridge the gap pretty easily by either assigning each mm_struct to a container directly, or putting some kind of task-to-mm lookup. Perhaps just a list like mm->tasks_using_this_mm_list. Not rocket science, right? -- Dave -
Not completely. When page is unmapped from the *very last* user its *first* toucher may already be dead. So we'll never No. The question is "how to get a container that touched the page first" which is the same as "how to find mm_struct which touched the page first". Obviously there's no answer to this question unless we hold some direct page->container reference. This may be a hash, a direct on-page pointer, or mirrored This could work for reclamation: we scan through all the mm_struct-s within the container and shrink its pages, but -
OK, but this is assuming that we didn't *un*account for the page when Or, you keep track of when the last user from the container goes away, and you effectively account it to another one. Are there problems with shifting ownership around like this? -- Dave -
That's exactly what we agreed on during our discussions: When a page is touched it is charged to this container. When the page is touched again by a new container it is NOT charged to the new container, but keeps holding the old one till it (the page) is completely freed. Nobody worried about the fact that a single page can hold a container for good. OpenVZ beancounters work the other way (and we proposed this solution when we first sent the patches). We keep track of We can migrate page to another user but we decided -
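The agreed rule (charge the first toucher, never the later ones, and release the charge only when the page is freed) can be sketched with an explicit page -> container reference. This is a toy model under assumptions: `OwnedPages` is a hypothetical name, a dict stands in for whatever hash or on-page pointer holds the reference, and "completely freed" is simplified to "the last mapping went away".

```python
from collections import defaultdict


class OwnedPages:
    """First-touch charging with a remembered page -> container reference."""

    def __init__(self):
        self.owner = {}                   # page -> container charged for it
        self.mapcount = defaultdict(int)  # page -> total number of mappings
        self.charged = defaultdict(int)   # container -> pages charged to it

    def map_page(self, container, page):
        if page not in self.owner:
            self.owner[page] = container  # first toucher is charged
            self.charged[container] += 1
        self.mapcount[page] += 1

    def unmap_page(self, container, page):
        self.mapcount[page] -= 1
        if self.mapcount[page] == 0:
            # Simplification: treat the last unmap as the page being freed.
            first_toucher = self.owner.pop(page)
            self.charged[first_toucher] -= 1
```

In this model the uncharge always lands on the container that was charged, at the price that a page keeps its first toucher charged for as long as anyone else still maps it.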
These patches are very similar to what I posted at
http://lwn.net/Articles/223829/
In my patches, the thread group leader owns the mm_struct and all
threads belong to the same container. I did not have a per-container
LRU; walking the global list for reclaim was a bit slow, but otherwise
my patches did not add anything to struct page.
I used rmap information to get to the VMA and then the mm_struct.
Kirill, it is possible to determine all the containers that map the
page. Please see the page_in_container() function of
http://lkml.org/lkml/2007/2/26/7.
I was also thinking of using the page table(s) to identify all pages
belonging to a container, by obtaining all the mm_structs of tasks
belonging to a container. But this approach would not work well for
the page cache controller, when we add that to our memory controller.
Balbir
-
Pages are charged to their first touchers which are determined using pages' mapcount manipulations in rmap calls.
NAK. Pages should be charged to every rss group whose mm_struct they are mapped into. Eric -
For these you essentially need a per-container page->_mapcount counter, otherwise you can't detect whether an rss group still has the page in question mapped in its processes' address spaces or not. 1. This was discussed before and considered to be OK by all the people involved in resource management. 2. This can be done with something like the page beancounters which are used in OVZ for shared fractions accounting. It's a next step forward. If you know how to get "pages should be charged to every rss group whose mm_struct they are mapped into" w/o an additional pointer in struct page, please throw me an idea. Thanks, Kirill -
What do you mean by this? You can always tell whether a process has a particular page mapped. Could you explain the issue a bit more. I'm not sure I get it. -- Dave -
OpenVZ wants to account _shared_ pages in a guest differently from separate pages, so that the RSS accounted values reflect the actual used RAM instead of the sum of all processes' RSS pages, which for sure is more relevant to the administrator, but IMHO not so terribly important as to justify memory-consuming structures and sacrificing performance to get it right. YMMV, but maybe we can find a smart solution to the issue too :) best, -
I will tell you what I want. I want a shared page cache that has nothing to do with RSS limits. I want an RSS limit such that once I know I can run a deterministic application with a fixed set of inputs in it, I know it will always run. First touch page ownership does not give me anything useful for knowing if I can run my application or not. Because of page sharing my application might run inside the rss limit only because I got lucky and happened to share a lot of pages with another running application. If the next time I run, that other application isn't running, my application will fail. That is ridiculous. I don't want sharing between vservers/VE/containers to affect how many pages I can have mapped into my processes at once. Now sharing is sufficiently rare that I'm pretty certain that problems come up rarely. So maybe these problems have not shown up in testing yet. But until I see proof that actually doing the accounting for sharing properly has intolerable overhead, I want proper accounting, not this hand waving that is only accurate on the third Tuesday of the month. Ideally all of this will be followed by smarter rss based swapping. There are some very cool things that can be done to eliminate machine overload once you have the ability to track real rss values. Eric -
Let's be practical here, what you're asking is basically impossible. Unless by deterministic you mean that it never enters a non-trivial syscall, in which case, you just want to know about maximum It is basically handwaving anyway. The only approach I've seen with a sane (not perfect, but good) way of accounting memory use is this one. If you care to define "proper", then we could discuss that. -- SUSE Labs, Novell Inc. -
Not per process. I want this on a group of processes, and yes, that is all I want. I just want accounting of the maximum RSS of No. I don't want the meaning of my rss limit to be affected by what other processes are doing. We have constraints of how many resources the box actually has. But I don't want accounting so sloppy that processes outside my group of processes can artificially I will agree that this patchset is probably in the right general ballpark. But the fact that pages are assigned exactly one owner is pure nonsense. We can do better. That is all I am asking: for someone to at least attempt to actually account for the rss of a group of processes and get the numbers right when we have pages shared between different groups of processes. We have the data structures to support this with rmap. Let me describe the situation where I think the accounting in the patchset goes totally wonky. Gcc, as I recall, maps the pages it is compiling with mmap. If in a single kernel tree I do: make -jN O=../compile1 & make -jN O=../compile2 & but set it up so that the two compiles are in different rss groups. If I run them concurrently they will use the same files at the same time, and most likely, because of the first touch rss limit rule, even if I have a draconian rss limit the compiles will both be able to complete. However, if I run either of them alone with the most draconian rss limit that still allows both concurrent compiles to finish, I won't be able to compile a single kernel tree. The reason for the failure with a single tree (in my thought experiment) is that the rss limit was set below what is actually needed for the code to work. When we were compiling two kernels and they were mapping the same pages at the same time, we could put the rss limit below the minimum rss needed for the compile to execute and still have it complete, because with first touch only one group accounted for the pages and the other just leeched off the first, as long as ...
Well don't you just sum up the maximum for each process? Or do you want to only count shared pages inside a container once, So what are you going to do about all the shared caches and slabs Yeah it is not perfect. Fortunately, there is no perfect solution, so we don't have to be too upset about that. And strangely, this example does not go outside the parameters of what you asked for AFAIKS. In the worst case of one container getting _all_ the shared pages, they will still remain inside their maximum rss limit. So they might get penalised a bit on reclaim, but maximum rss limits will work fine, and you can (almost) guarantee X amount of memory for a given container, and it will _work_. But I also take back my comments about this being the only design I have seen that gets everything, because the node-per-container idea is a really good one on the surface. And it could mean even less impact I think it is simplistic. Sure you could probably use some of the rmap stuff to account shared mapped _user_ pages once for each container that touches them. And this patchset isn't preventing that. But how do you account kernel allocations? How do you account unmapped pagecache? What's the big deal so many accounting people have with just RSS? I'm not a container person, this is an honest question. Because from my POV if you conveniently ignore everything else... you may as well just not do any accounting at all. -- SUSE Labs, Novell Inc. -
When that does happen and if a container hits its limit, with a LRU per-container, if the container is not actually using those pages, they'll get thrown out of that container and get mapped into the With the proposed node-per-container, we will need to make massive core VM changes to reorganize zones and nodes. We would want to allow 1. For sharing of nodes 2. Resizing nodes 3. Maybe more With the node-per-container idea, it will be hard to control page cache limits, independent of RSS limits or mlock limits. We decided to implement accounting and control in phases 1. RSS control 2. unmapped page cache control 3. mlock control 4. Kernel accounting and limits This has several advantages 1. The limits can be individually set and controlled. 2. The code is broken down into simpler chunks for review and merging. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
Exactly. Statistically, first touch will work OK. It may mean some reclaim inefficiencies in corner cases, but things will tend to But a lot of that is happening anyway for other reasons (eg. memory plug/unplug). And I don't consider node/zone setup to be part of the "core VM" as such... it is _good_ if we can move extra work into setup rather than have it in the mm. I don't know that it would be particularly harder than any other first-touch scheme. If one container ends up being charged with too much pagecache, eventually they'll reclaim a bit of it and the pages But this patch gives the groundwork to handle 1-4, and it is in a small chunk, and one would be able to apply different limits to different types of pages with it. Just using rmap to handle 1 does not really seem like a viable alternative because it fundamentally isn't going to handle 2 or 4. I'm not saying that you couldn't _later_ add something that uses rmap or our current RSS accounting to tweak container-RSS semantics. But isn't it sensible to lay the groundwork first? Get a clear path to something that is good (not perfect), but *works*? -- SUSE Labs, Novell Inc. -
Thanks, that's one of our goals, to keep it simple, understandable and Yes, true, but what if a user does not want to control the page cache usage in a particular container or wants to turn off For (2), we have the basic setup in the form of a per-container LRU list and a pointer from struct page to the container that first brought in I agree with your development model suggestion. One of the things we are going to do in the near future is to build (2) and then add (3) and (4). So far, we've not encountered any difficulties in building on top of (1). Vaidy, any comments? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
Accounting becomes easy if we have a container pointer in struct page. This can form the groundwork for building controllers, since any memory-related controller would be interested in tracking pages. However, we still want to evaluate whether we can build them without bloating struct page. The pagecache controller (2) can be implemented with a container pointer in struct page or a container pointer in struct address_space. Building on this patchset is much simpler, and we hope the bloat in struct page will be compensated by the benefits in memory controllers in terms of performance and simplicity. Adding too many controllers and accounting parameters to start with will make the patch too big and complex. As Balbir mentioned, we have a plan and we shall add new control parameters in stages. --Vaidy -
The thing is, you have to worry about actually getting anything in the kernel rather than trying to do fancy stuff. The approaches I have seen that don't have a struct page pointer, do intrusive things like try to put hooks everywhere throughout the kernel where a userspace task can cause an allocation (and of course end up missing many, so they aren't secure anyway)... and basically just nasty stuff that will never get merged. Struct page overhead really isn't bad. Sure, nobody who doesn't use containers will want to turn it on, but unless you're using a big PAE system you're actually unlikely to notice. But again, I'll say the node-container approach of course does avoid this nicely (because we already can get the node from the page). So definitely that approach needs to be discredited before going with this Everyone seems to have a plan ;) I don't read the containers list... does everyone still have *different* plans, or is any sort of consensus being reached? -- SUSE Labs, Novell Inc. -
Consensus? I believe at this point we have a sort of consensus on the base container infrastructure and the need for memory controller to control RSS, pagecache, mlock, kernel memory etc. However the implementation and approach taken is still being discussed :) --Vaidy -
User beancounters patch has got through all these... The approach where each charged object has a pointer to the owner container which has charged it is the easiest/cleanest way to handle all the problems with dynamic context change, races, etc. big PAE doesn't make any difference IMHO But it lacks some other features: 1. A page can't be shared easily with another container. 2. A shared page can't be accounted honestly to containers as fraction=PAGE_SIZE/containers-using-it. 3. It doesn't help accounting of kernel memory structures, e.g. in OpenVZ we use exactly the same pointer on the page to track which container owns it, e.g. pages used for page tables are accounted this way. 4. I guess container destroy requires destroying the memory zone, which means writing out dirty data. Which doesn't sound good to me either. 5. Memory reclamation in case of global memory shortage becomes a tricky/unfair task. 6. You cannot overcommit. AFAIU, the memory should be granted to a node for exclusive usage and cannot be used by other containers, hope we'll have it soon :) Thanks, Kirill -
The pointer in struct page approach is a decent one, which I have liked since this whole container effort came up. IIRC Linus and Alan also thought that was a reasonable way to go. I haven't reviewed the rest of the beancounters patch since looking at it quite a few months ago... I probably don't have time for a The issue is just that struct pages use low memory, which is a really scarce commodity on PAE. One more pointer in the struct page means 64MB less lowmem. But PAE is crap anyway. We've already made enough concessions in the kernel to support it. I agree: struct page overhead is not really I think they could be shared. You allocate _new_ pages from your own node, but you can definitely use existing pages allocated to other Yes there would be some accounting differences. I think it is hard to say exactly what containers are "using" what page anyway, though. ? I haven't looked at any implementation, but I think it is fine for I don't understand why? You can much more easily target a specific container for reclaim with this approach than with others (because I'm not sure about that. If you have a larger number of nodes, then you could assign more free nodes to a container on demand. But I think there would definitely be less flexibility with nodes... I don't know... and seeing as I don't really know where the google Good luck ;) -- SUSE Labs, Novell Inc. -
This patch is not really beancounters. 1. It uses the containers framework 2. It is similar to my RSS controller (http://lkml.org/lkml/2007/2/26/8) Yes, but we break the global LRU. With these RSS patches, reclaim not triggered by containers still uses the global LRU, by using nodes, I think we have made some forward progress on the consensus. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
If we used Beancounters as Pavel and Kirill mentioned, that would
keep track of each container that has referenced a page, not just the
first container. It sounds like beancounters can return a usage count
where each page is divided by the number of referencing containers (e.g.
1/3rd if 3 containers share a page). Presumably it could also return a
full count of 1 to each container.
If we look at data in the latter form, i.e. each container must pay
fully for each page used, then Eric could use that to determine real
usage needs of the container. However we could also use the fractional
count in order to do things such as charging the container for its
actual usage. i.e. full count for setting guarantees, fractional for
actual usage.
-- Ethan
-
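The two views Ethan describes (a full page charged to every sharer versus PAGE_SIZE divided by the number of sharing containers) can be put side by side in a small sketch. Everything here is hypothetical: `struct shared_page`, the fixed-size `mapped_by` array and the 4096-byte page size are assumptions made for illustration, and the sketch says nothing about how beancounters actually store this information.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096L     /* assumed page size in bytes */
#define SKETCH_MAX_CONTAINERS 8    /* arbitrary bound for the sketch */

/* Hypothetical per-page record: how many mappings each container holds. */
struct shared_page {
    int mapped_by[SKETCH_MAX_CONTAINERS];
};

/* Number of distinct containers that currently map the page. */
int page_sharers(const struct shared_page *pg)
{
    int n = 0;
    for (int i = 0; i < SKETCH_MAX_CONTAINERS; i++)
        if (pg->mapped_by[i] > 0)
            n++;
    return n;
}

/* "Full count" view: every container using the page pays for all of it. */
long full_charge(const struct shared_page *pg, int id)
{
    assert(id >= 0 && id < SKETCH_MAX_CONTAINERS);
    return pg->mapped_by[id] > 0 ? SKETCH_PAGE_SIZE : 0;
}

/* "Fractional" view: the page's bytes are split evenly among sharers
 * (integer division, so a few bytes may go unaccounted). */
long fractional_charge(const struct shared_page *pg, int id)
{
    assert(id >= 0 && id < SKETCH_MAX_CONTAINERS);
    if (pg->mapped_by[id] <= 0)
        return 0;
    return SKETCH_PAGE_SIZE / page_sharers(pg);
}
```

With three sharers, the full view charges each of them 4096 bytes while the fractional view charges each 1365, which matches the "guarantees versus actual usage" split suggested above.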
When we do charge/uncharge we have to answer another question: "whether *any* task from the *container* has this page mapped", not "whether *this* task has this page mapped". Thanks, Kirill -
That's a bit more clear. ;) OK, just so I make sure I'm getting your argument here. It would be too expensive to go looking through all of the rmap data for _any_ other task that might be sharing the charge (in the same container) with the current task that is doing the unmapping.
The requirements you're presenting so far appear to be:
1. The first user of a page in a container must be charged
2. The second user of a page in a container must not be charged
3. A container using a page must take a diminished charge when another container is already using the page.
4. Additional fields in data structures (including 'struct page') are permitted
What have I missed? What are your requirements for performance? I'm not quite sure how the page->container stuff fits in here, though. page->container would appear to be strictly assigning one page to one container, but I know that beancounters can do partial page charges. Care to fill me in? -- Dave -
Which is a questionable assumption. Worst case we are talking about a list several thousand entries long, and generally if the page is used by the same container you will hit one of your processes long before you traverse the whole list. So at least the average case performance should be good. It is only in the case when a page is shared between multiple containers that this matters. Eric -
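The list walk being debated here can be sketched as a linear scan that stops at the first mapper belonging to the container in question. This is a userspace illustration only: `struct mapper` is a hypothetical stand-in for whatever the reverse-mapping data would yield, and the real rmap structures are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reverse-mapping entry: one per address space that maps
 * the page, tagged with the container that address space belongs to. */
struct mapper {
    int container_id;
    struct mapper *next;
};

/* Answer "does *any* task from this container still map the page?".
 * Returns 1 on the first hit, so a page used mostly by one container
 * is usually resolved after a few entries. */
int container_maps_page(const struct mapper *head, int container_id)
{
    for (const struct mapper *m = head; m != NULL; m = m->next)
        if (m->container_id == container_id)
            return 1;
    return 0;
}

/* Call after one mapping has been removed: 'remaining' is the list of
 * mappers that are left. The container is uncharged only if none of
 * them belongs to it. Returns the container's new usage. */
long uncharge_if_last(const struct mapper *remaining, int container_id,
                      long usage)
{
    if (container_maps_page(remaining, container_id))
        return usage;
    assert(usage > 0);
    return usage - 1;
}
```

The full traversal only happens when the unmapping container has no other mapper on the list, which is the shared-between-containers case singled out above.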
you missed out an include in mm/migrate.c

cheers,

C.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
---
 mm/migrate.c |    1 +
 1 file changed, 1 insertion(+)

Index: 2.6.20/mm/migrate.c
===================================================================
--- 2.6.20.orig/mm/migrate.c
+++ 2.6.20/mm/migrate.c
@@ -28,6 +28,7 @@
 #include <linux/mempolicy.h>
 #include <linux/vmalloc.h>
 #include <linux/security.h>
+#include <linux/rss_container.h>
 
 #include "internal.h"
-
* container_try_to_free_pages() walks the container's page list and tries to shrink pages. This is based on try_to_free_pages() and Co code. Called from core code when no resources are left at the moment of page touching.
* container_out_of_memory() selects a process to be killed whose mm_struct belongs to the container in question. Called from core code when no resources are left and no pages were reclaimed.
Hi, Pavel, Please find my patch to add LRU behaviour to your latest RSS controller. Balbir Singh Linux Technology Center IBM, ISTL
Thanks for participation and additional testing :) -
Small and simple - each fork()/clone() is accounted and rejected when limit is hit.
Hi Pavel, Why do you need a pointer added to task_struct? One of the main points of the generic containers is to avoid every different subsystem and There's no need to hold a reference here - by definition, the task's container can't go away while the task is in it. Also, shouldn't you have an attach() method to move the count from one container to another when a task moves? Paul -
The idea is: a task may be both "the entity that allocates the resources" and "the entity that is a resource allocated". When a task is the first kind of entity it may move across containers (that is implemented in your patches). When a task is a resource it shouldn't move across containers, just as files and pages don't. More generally, allocated resources hold a reference to the original container till they die. No resource migration is performed. -
Yes, but I disagree with the premise. The title of your patch is "Account for the number of tasks within container", but that's not what the subsystem does, it accounts for the number of forks within the container that aren't directly accompanied by an exit. Ideally, resources like files and pages would be able to follow tasks as well. The reason that files and pages aren't easily migrated from one container to another is that there could be sharing involved; figuring out the sharing can be expensive, and it's not clear what to do if two users are in different containers. But in the case of a task count, there are no such issues with sharing, so it seems to me to be more sensible (and more efficient) to just limit the number of tasks in a container. i.e. when moving a task into a container or forking a task within a container, increment the count; when moving a task out of a container or when it exits, decrement the count. With your approach, if you were to set the task limit of an empty container A to 1, and then move a process P from B into A, P would be able to fork a new child, since the "task count" would be 0 (as P was being charged to B still). Surely the fact that there's 1 process in A should prevent P from forking? Paul -
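Paul's alternative (count tasks on fork and on attach, uncount on exit and on detach) is easy to sketch. The names `struct task_counter`, `tc_fork()`, `tc_exit()` and `tc_attach()` are hypothetical, and refusing an attach when the destination is full is an assumption of this sketch, not something stated in the thread.

```c
#include <assert.h>

/* Hypothetical per-container task counter. */
struct task_counter {
    long usage;  /* tasks currently in the container */
    long limit;  /* maximum number of tasks allowed */
};

/* Account one more task; returns -1 when the limit is already reached. */
int tc_charge(struct task_counter *c)
{
    if (c->usage >= c->limit)
        return -1;
    c->usage++;
    return 0;
}

/* Account one task less. */
void tc_uncharge(struct task_counter *c)
{
    assert(c->usage > 0);
    c->usage--;
}

/* fork() inside container c: the child is charged to c. */
int tc_fork(struct task_counter *c)
{
    return tc_charge(c);
}

/* A task in container c exits. */
void tc_exit(struct task_counter *c)
{
    tc_uncharge(c);
}

/* Move a task between containers: the count follows the task.
 * Assumption: the move is refused if the destination is full. */
int tc_attach(struct task_counter *from, struct task_counter *to)
{
    if (tc_charge(to) != 0)
        return -1;
    tc_uncharge(from);
    return 0;
}
```

In the scenario from the message above, container A has a limit of 1 and process P moves from B into A. After `tc_attach(&b, &a)` the usage of A is 1, so a subsequent `tc_fork(&a)` is rejected, which is the behaviour Paul argues for.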
Sounds reasonable. I'll take this into account when I make the next iteration. -
Simple again - increment usage counter at file open and decrement at file close. Reject opening if limit is hit.
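A minimal sketch of the open/close accounting described above, combined with the "resources keep holding the container they were allocated in" rule from earlier in the thread. `struct files_container`, `struct file_sketch` and both helpers are hypothetical names used only for illustration; they are not the patch's interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-container open-file counter. */
struct files_container {
    long usage;  /* files currently charged */
    long limit;  /* maximum number of open files */
};

/* Hypothetical file object that remembers which container paid for it. */
struct file_sketch {
    struct files_container *owner;
};

/* Open: charge the opener's current container, or reject at the limit. */
int file_open_charge(struct file_sketch *f, struct files_container *cur)
{
    if (cur->usage >= cur->limit)
        return -1;
    cur->usage++;
    f->owner = cur;
    return 0;
}

/* Close: uncharge the container recorded at open time, even if the
 * task has moved to another container since then. */
void file_close_uncharge(struct file_sketch *f)
{
    assert(f->owner != NULL && f->owner->usage > 0);
    f->owner->usage--;
    f->owner = NULL;
}
```

Because the file records its owner, no bookkeeping is needed when a task changes containers; the charge is released wherever it was originally taken.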
I have one problem with the patchset: I cannot compile the patches individually, and some of the code is hard to read as it depends on functions from future patches. Patches 2, 3 and 4 fail to compile without patch 5 applied. Patch 1 failed to apply with a reject in kernel/Makefile. I applied it on top of 2.6.20 with all of Paul Menage's patches (all 7). -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL -
This sounds weird to me :( I've taken a stock 2.6.20 and applied Paul's patches. This is what this patchset applies to. -
maybe Paul's patch should be taken w/o subsystems examples (CKRM, UBC), i.e. first 3 patches only? Kirill -
Can we not make sure that each subsystem registers itself before any of its resources become usable? So the file counting subsystem should register at some point before filp_open() becomes usable, and the process counting subsystem should register before it's possible to fork, etc. Paul -
Actually all the subsystems I've sent become usable very early. Much earlier than initcalls start. I didn't find where exactly -
