Re: [RFC][PATCH 6/7] Account for the number of tasks within container

From: Pavel Emelianov
Date: Tuesday, March 6, 2007 - 7:42 am

This patchset adds RSS accounting and control, plus limits
on the number of tasks and files within a container.

Based on top of Paul Menage's container subsystem v7.

The RSS controller includes a per-container RSS accounter,
reclamation and an OOM killer. It behaves like a standalone
machine: when a container runs out of resources it tries
to reclaim some pages, and if that fails it kills a task
whose mm_struct belongs to the container in question.

The tasks and files containers are very simple and
self-explanatory from the code.

As discussed before, when a task moves from one container
to another no resources follow it - they stay charged to the
container they were allocated in.

The difficulties I ran into while using Paul's containers were:

1. The container fork hook is placed before the new task
   is modified. This makes it impossible to handle fork
   properly, i.e. the new mm_struct should have a pointer
   to the RSS container, but we don't have one at that
   early time.

2. Extended containers may register themselves too late.
   Kernel threads/helpers start forking, opening files
   and touching pages much earlier. This patchset
   works around this in a not-so-cute manner and I'm waiting
   for Paul's comments on this issue.
-

From: Pavel Emelianov
Date: Tuesday, March 6, 2007 - 7:49 am

Introduce generic structures and routines for
resource accounting.

Each resource accounting container is expected to
embed this structure, together with container_subsystem_state
and its own resource-specific members.
-

From: Balbir Singh
Date: Tuesday, March 6, 2007 - 9:03 pm

Is there any way to indicate that there are no limits on this container?
LONG_MAX is quite huge, but still when the administrator wants to
configure a container to *un-limited usage*, it becomes hard for



These bits look a little out of sync, with no users for these routines in
this patch. Won't you get a compiler warning, compiling this bit alone?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Pavel Emelianov
Date: Wednesday, March 7, 2007 - 12:19 am

Yes - LONG_MAX is essentially a "no limit" value as no

I'm afraid not. We have to atomically check the limit and update
either usage or failcnt depending on the result of the check. Doing
this with atomic_xxx ops will require at least two ops.

If we remove failcnt, this would look like
   while (atomic_cmpxchg(...))
which is also not that good.
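
A minimal userspace sketch of the locked charge path described above, assuming a counter with the three fields named in this thread (usage, limit, failcnt). The struct layout, the helper names and the C11 atomic_flag standing in for a kernel spinlock are assumptions for illustration, not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in; the field names follow the thread
 * (usage, limit, failcnt), the layout and helpers are assumed. */
struct res_counter {
	atomic_flag lock;	/* stands in for a kernel spinlock */
	unsigned long usage;
	unsigned long limit;
	unsigned long failcnt;
};

static void res_counter_lock(struct res_counter *c)
{
	while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
		;	/* spin until the flag was clear */
}

static void res_counter_unlock(struct res_counter *c)
{
	atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

static void res_counter_init(struct res_counter *c, unsigned long limit)
{
	atomic_flag_clear(&c->lock);
	c->usage = 0;
	c->limit = limit;
	c->failcnt = 0;
}

/* Check the limit and update exactly one of usage or failcnt while
 * holding the lock, so the check and the update cannot be split. */
static int res_counter_charge(struct res_counter *c, unsigned long val)
{
	int ret = 0;

	res_counter_lock(c);
	if (c->usage > c->limit || val > c->limit - c->usage) {
		c->failcnt++;
		ret = -ENOMEM;
	} else {
		c->usage += val;
	}
	res_counter_unlock(c);
	return ret;
}

static void res_counter_uncharge(struct res_counter *c, unsigned long val)
{
	res_counter_lock(c);
	assert(val <= c->usage);	/* uncharging more than charged is a bug */
	c->usage -= val;
	res_counter_unlock(c);
}
```

The point of the single lock is that a failed charge bumps failcnt and a successful one bumps usage, with no window in which another charger can see a half-updated counter.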

Moreover - in RSS accounting patches I perform page list

Nope - when you have a non-static function without users in a
file, no compiler warning is produced.
-

From: Herbert Poetzl
Date: Friday, March 9, 2007 - 9:37 am

-1 or ~0 is a viable choice for userspace to

Linux-VServer does the accounting with atomic counters,
so that works quite fine, just do the checks at the
beginning of whatever resource allocation and the

it still hasn't been shown that this kind of RSS limit
doesn't add big time overhead to normal operations
(inside and outside of such a resource container)

note that the 'usual' memory accounting is much more
lightweight and serves similar purposes ...

best,
-

From: Pavel Emelianov
Date: Sunday, March 11, 2007 - 2:01 am

account it kernel may preempt and let another process

It OOM-kills current in case of limit hit instead of

-

From: Eric W. Biederman
Date: Sunday, March 11, 2007 - 12:00 pm

Atomic operations versus locks is only a granularity thing.
You still need the cache line which is the cost on SMP.

Are you using atomic_add_return or atomic_add_unless or
are you performing your actions in two separate steps, which
is racy?  What I have seen indicates you are using a racy two separate

Perhaps....

Eric
-

From: Herbert Poetzl
Date: Sunday, March 11, 2007 - 6:16 pm

yes, this is the current implementation which
is more than sufficient, but I'm aware of the
potential issues here, and I have an experimental
patch sitting here which removes this race with
the following change:

 - doesn't store the accounted value but
   limit - accounted (i.e. the free resource)
 - uses atomic_add_return() 
 - when negative, an error is returned and
   the resource amount is added back

changes to the limit have to adjust the 'current'
value too, but that is again simple and atomic
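
The experimental scheme listed above can be sketched in userspace roughly as follows. This is an illustration only, not the Linux-VServer patch: all names are hypothetical, and C11 atomic_fetch_sub() is used where kernel code would call atomic_add_return() with a negative amount.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Stores limit - accounted, i.e. the free amount (hypothetical name). */
struct free_counter {
	atomic_long free;
};

static void free_counter_init(struct free_counter *c, long limit)
{
	atomic_init(&c->free, limit);
}

static int free_counter_charge(struct free_counter *c, long val)
{
	/* atomic_fetch_sub() returns the old value, so old - val is what
	 * the kernel's atomic_add_return(-val, ...) would hand back. */
	long left = atomic_fetch_sub(&c->free, val) - val;

	if (left < 0) {
		atomic_fetch_add(&c->free, val);	/* add the amount back */
		return -ENOMEM;
	}
	return 0;
}

static void free_counter_uncharge(struct free_counter *c, long val)
{
	atomic_fetch_add(&c->free, val);
}

/* Raising or lowering the limit shifts the free amount by the same delta. */
static void free_counter_adjust_limit(struct free_counter *c, long delta)
{
	atomic_fetch_add(&c->free, delta);
}
```

One property of this shape is worth keeping in mind: between the subtraction and the add-back, a concurrent charger can observe a transiently lower free value and fail even though room would have been available a moment later.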

best,
Herbert

PS: atomic_add_unless() didn't exist back then
(at least I think so) but that might be an option
-

From: Eric W. Biederman
Date: Tuesday, March 13, 2007 - 2:09 am

I think, as far as having this discussion goes, that if you can remove
that race people will be more willing to talk about what vserver does.

That said, anything that uses locks or atomic operations (finer grained locks)
is going to have scaling issues on large boxes because of the cache line
ping pong.

So in that sense anything short of per cpu variables sucks at scale.  That said,
I would much rather get a simple correct version without the complexity of
per cpu counters, before we optimize the counters that much.

Eric
-

From: Kirill Korotaev
Date: Tuesday, March 13, 2007 - 2:49 am

fully agree with it. We need to get a working version first.

FYI, in OVZ we recently added such optimizations: reserves like in TCP/IP,
e.g. for kmemsize, numfile these reserves are done on task-basis for
fast charges/uncharges w/o involving lock operations.
On task exit reserves are returned back to the beancounter.

As this demonstrates, atomic counters can be replaced with
task reserves as a next step.
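
A rough userspace sketch of the task-reserve idea, under stated assumptions: the struct names, the RESERVE_BATCH refill size and the atomic_flag lock are all hypothetical and are not taken from the OpenVZ code. It only shows the shape: a task pre-charges a batch, later charges and uncharges hit the private reserve, and the reserve is returned on exit.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define RESERVE_BATCH 8	/* hypothetical refill size */

struct beancounter {
	atomic_flag lock;	/* stands in for a kernel spinlock */
	long held;		/* charged units, task reserves included */
	long limit;
};

struct task_reserve {
	struct beancounter *bc;
	long reserve;		/* pre-charged units private to one task */
};

static void bc_init(struct beancounter *bc, long limit)
{
	atomic_flag_clear(&bc->lock);
	bc->held = 0;
	bc->limit = limit;
}

/* Shared slow path: adjust held under the lock, refusing to pass limit. */
static int bc_adjust(struct beancounter *bc, long delta)
{
	int ret = -ENOMEM;

	while (atomic_flag_test_and_set_explicit(&bc->lock, memory_order_acquire))
		;	/* spin */
	if (bc->held + delta <= bc->limit) {
		bc->held += delta;
		ret = 0;
	}
	atomic_flag_clear_explicit(&bc->lock, memory_order_release);
	return ret;
}

static int task_charge(struct task_reserve *t, long val)
{
	if (t->reserve >= val) {	/* fast path: no shared state touched */
		t->reserve -= val;
		return 0;
	}
	/* Try to charge the request plus a refill, else the bare request. */
	if (bc_adjust(t->bc, val + RESERVE_BATCH) == 0) {
		t->reserve += RESERVE_BATCH;
		return 0;
	}
	return bc_adjust(t->bc, val);
}

/* Uncharges stay local; a real implementation would cap the reserve. */
static void task_uncharge(struct task_reserve *t, long val)
{
	t->reserve += val;
}

/* On task exit the whole reserve goes back to the beancounter. */
static void task_exit(struct task_reserve *t)
{
	bc_adjust(t->bc, -t->reserve);
	t->reserve = 0;
}
```

The trade-off is that units parked in reserves count against the limit while unused, so the limit is enforced slightly early in exchange for lock-free fast paths.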

Thanks,
Kirill
-

From: Pavel Emelianov
Date: Tuesday, March 13, 2007 - 2:27 am

BTW atomic_add_unless() is essentially a loop!!! Just
like spin_lock() is, so why is one better than another?

spin_lock() can go to schedule() on preemptive kernels

-

From: Herbert Poetzl
Date: Tuesday, March 13, 2007 - 8:21 am

well, shouldn't be a big deal to brush that patch up

right, but atomic ops have much less impact on most

actually I thought about per cpu counters quite a lot, and
we (Linux-VServer) use them for accounting, but please
tell me how you use per cpu structures for implementing 
limits

TIA,
-

From: Pavel Emelianov
Date: Tuesday, March 13, 2007 - 8:41 am

Right. But atomic_add_unless() is slower as it is

Did you ever look at how get_empty_filp() works?
I agree, that this is not a "strict" limit, but it
limits the usage with some "precision".
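
A userspace sketch of a non-strict limit built on per-CPU deltas, in the spirit of the approach referred to above. NR_CPUS, BATCH and every name here are assumptions for illustration; this is not the kernel's code, and the plain array stands in for real per-CPU storage.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define NR_CPUS 2	/* assumed for the sketch */
#define BATCH   4	/* assumed flush threshold */

struct pcpu_counter {
	atomic_long global;	/* approximate total, shared by all CPUs */
	long local[NR_CPUS];	/* per-CPU deltas; each slot has one writer */
	long limit;
};

static void pcpu_init(struct pcpu_counter *c, long limit)
{
	atomic_init(&c->global, 0);
	for (int i = 0; i < NR_CPUS; i++)
		c->local[i] = 0;
	c->limit = limit;
}

/* Fold the local delta into the shared total only once it reaches BATCH. */
static void pcpu_add(struct pcpu_counter *c, int cpu, long val)
{
	long d = c->local[cpu] + val;

	if (d >= BATCH || d <= -BATCH) {
		atomic_fetch_add(&c->global, d);
		d = 0;
	}
	c->local[cpu] = d;
}

/* Slow path: walks every CPU slot. Exact only if nobody is updating
 * concurrently. */
static long pcpu_sum(struct pcpu_counter *c)
{
	long sum = atomic_load(&c->global);

	for (int i = 0; i < NR_CPUS; i++)
		sum += c->local[i];
	return sum;
}

static int pcpu_charge(struct pcpu_counter *c, int cpu)
{
	/* Cheap check first; pay for the full sum only near the limit. */
	if (atomic_load(&c->global) >= c->limit && pcpu_sum(c) >= c->limit)
		return -ENFILE;
	pcpu_add(c, cpu, 1);
	return 0;
}
```

Because the cheap check only sees the shared total, usage can overshoot the limit by an amount on the order of NR_CPUS * BATCH, which is the "precision" being traded for fewer shared cache line writes.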

/* off-the-topic */ Herbert, you've lost Balbir again:
In this sub-thread some letters up Eric wrote a letter with
Balbir in Cc:. The next reply from you doesn't include him.
-

From: Srivatsa Vaddagiri
Date: Tuesday, March 13, 2007 - 9:07 am

If I am not mistaken, you shouldn't loop in normal cases, which means
it boils down to an atomic_read() + atomic_cmpxchg()
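
For reference, the read-plus-compare-exchange shape can be sketched with C11 atomics as below. This is a userspace illustration with a hypothetical name (add_unless), not the kernel implementation; it follows the semantics of adding a to *v unless *v equals u.

```c
#include <assert.h>
#include <stdatomic.h>

/* Add a to *v unless *v == u. Returns 1 if the add happened, 0 if not. */
static int add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);	/* the "atomic_read()" step */

	while (old != u) {
		/* The compare-exchange step: succeeds on the first try
		 * unless another thread changed *v in between, in which
		 * case old is reloaded and the loop runs again. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return 1;
	}
	return 0;
}
```

With unit charges this gives a limit check directly: add_unless(&usage, 1, limit) refuses the increment once usage has reached limit.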


-- 
Regards,
vatsa
-

From: Pavel Emelianov
Date: Wednesday, March 14, 2007 - 12:12 am

So does the lock - in a normal case (when it's not
heavily contended) it will boil down to atomic_dec_and_test().

Nevertheless, charging the way this patchset does
requires two atomic ops with atomic_xxx and only
one with spin_lock().
-

From: Eric W. Biederman
Date: Thursday, March 15, 2007 - 9:51 am

To be very clear: if you care about optimization, cache lines
and lock hold times (to keep contention down) are the important
things.

With spin locks you have to be a little more careful to put them
on the same cache line as your data and to keep hold times
short.  With atomic ops you get that automatically.

There is really no significant advantage in either approach.
The number of atomic ops doesn't matter.  You bring in
the cache line and manipulate it.  The expensive part is
acquiring the cache line exclusively.  This is expensive even if
things are never contended but there are many users.

Sorry for the rant, but I just wanted to set the record straight.
spin_locks vs atomic ops is a largely meaningless debate.

Eric
-

From: Herbert Poetzl
Date: Tuesday, March 13, 2007 - 9:32 am

fine, nobody actually uses atomic_add_unless(), or am I
missing something?

using two locks will be slower than using a single
lock, adding a loop which counts from 0 to 100 will

I can happily add him to every email I reply to, but he
definitely isn't removed by my mailer (as I already stated,
it might be the mailing list which does this), fact is, the
email arrives here without him in the cc, so a reply does
not contain it either ...

best,
Herbert

-

From: Pavel Emelianov
Date: Tuesday, March 6, 2007 - 7:55 am

This includes setup of RSS container within generic
process containers, all the declarations used in RSS
accounting, and core code responsible for accounting.
-

From: Andrew Morton
Date: Tuesday, March 6, 2007 - 3:00 pm

On Tue, 06 Mar 2007 17:55:29 +0300

ah.  This looks good.  I'll find a hunk of time to go through this work
and through Paul's patches.  It'd be good to get both patchsets lined
up in -mm within a couple of weeks.  But..

We need to decide whether we want to do per-container memory limitation via
these data structures, or whether we do it via a physical scan of some
software zone, possibly based on Mel's patches.

-

From: Herbert Poetzl
Date: Friday, March 9, 2007 - 9:48 am

doesn't look so good to me, mainly because of the
additional per page data and per page processing

on 4GB memory, with 100 guests, 50% shared for each
guest, this basically means ~1 million pages, 500k shared
and 1500k x sizeof(page_container) entries, which
roughly boils down to ~25MB of wasted memory ...

increase the amount of shared pages and it starts

why not do simple page accounting (as done currently
in Linux) and use that for the limits, without
keeping the reference from container to page?

best,
-

From: Pavel Emelianov
Date: Sunday, March 11, 2007 - 2:08 am

You are. Each page has only one page_container associated
with it despite the number of containers it is shared

As I've already answered in my previous letter, simple
limiting w/o per-container reclamation and a per-container
OOM killer isn't good memory management. It doesn't allow
resource shortage to be handled gracefully.

This patchset provides a more graceful way to handle this, but
full memory management includes accounting of VMA-length
as well (returning ENOMEM from system call) but we've decided

-

From: Herbert Poetzl
Date: Sunday, March 11, 2007 - 7:32 am

per container OOM killer does not require any container
page reference, you know _what_ tasks belong to the 
container, and you know their _badness_ from the normal
OOM calculations, so doing them for a container is really
straightforward without having any page 'tagging'

for the reclamation part, please elaborate how that will
differ in a (shared memory) guest from what the kernel
currently does ...

TIA,
-

From: Pavel Emelianov
Date: Sunday, March 11, 2007 - 8:04 am

That's true. If you look at the patches you'll

This is all described in the code and in the

-

From: Herbert Poetzl
Date: Sunday, March 11, 2007 - 5:41 pm

so what do we keep the context -> page reference

must have missed some of them, please can you
point me to the relevant threads ...

TIA,
-

From: Pavel Emelianov
Date: Monday, March 12, 2007 - 1:31 am

We need this for
1. keeping the page's owner, to uncharge IT when the page
   goes away. Or do you propose to uncharge the
   current (i.e. ANY) container like you do all across
   Vserver accounting, which screws up accounting with
   page sharing?
2. managing LRU lists for good reclamation. See Balbir's
   patches for details.
3. possible future uses - correct sharing accounting,

-

From: Balbir Singh
Date: Monday, March 12, 2007 - 2:55 am

Herbert,

You lost me in the cc list and I almost missed this part of the
thread. Could you please not modify the "cc" list.

Thanks,
Balbir
-

From: Herbert Poetzl
Date: Monday, March 12, 2007 - 4:43 pm

hmm, it is very unlikely that this would happen,
for several reasons ... and indeed, checking the 
thread in my mailbox shows that akpm dropped you ...

--------------------------------------------------------------------
Subject: [RFC][PATCH 2/7] RSS controller core
From: Pavel Emelianov <xemul@sw.ru>
To: Andrew Morton <akpm@osdl.org>, Paul Menage <menage@google.com>,
       	Srivatsa Vaddagiri <vatsa@in.ibm.com>,
       	Balbir Singh <balbir@in.ibm.com>
Cc: containers@lists.osdl.org,
       	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Date: Tue, 06 Mar 2007 17:55:29 +0300
--------------------------------------------------------------------
Subject: Re: [RFC][PATCH 2/7] RSS controller core
From: Andrew Morton <akpm@linux-foundation.org>
To: Pavel Emelianov <xemul@sw.ru>
Cc: Kirill@smtp.osdl.org, Linux@smtp.osdl.org, containers@lists.osdl.org,
       	Paul Menage <menage@google.com>,
       	List <linux-kernel@vger.kernel.org>
Date: Tue, 6 Mar 2007 14:00:36 -0800
--------------------------------------------------------------------

I never modify the cc unless explicitly asked
to do so. I wish others would have it that way
too :)

best,
-

From: Balbir Singh
Date: Monday, March 12, 2007 - 6:57 pm

That's good to know, but my mailer shows


from:      Andrew Morton <akpm@linux-foundation.org>
to:        Pavel Emelianov <xemul@sw.ru>
cc:        Paul Menage <menage@google.com>,
           Srivatsa Vaddagiri <vatsa@in.ibm.com>,
           Balbir Singh <balbir@in.ibm.com> (see I am <<HERE>>),
           devel@openvz.org,
           Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
           containers@lists.osdl.org,
           Kirill Korotaev <dev@sw.ru>
date:      Mar 7, 2007 3:30 AM
subject:   Re: [RFC][PATCH 2/7] RSS controller core
mailed-by: vger.kernel.org

On Tue, 06 Mar 2007 17:55:29 +0300

and your reply as

to:        Andrew Morton <akpm@linux-foundation.org>
cc:        Pavel Emelianov <xemul@sw.ru>,
           Kirill@smtp.osdl.org,
           Linux@smtp.osdl.org,
           containers@lists.osdl.org,
           Paul Menage <menage@google.com>,
           List <linux-kernel@vger.kernel.org>
date:      Mar 9, 2007 10:18 PM
subject:   Re: [RFC][PATCH 2/7] RSS controller core
mailed-by: vger.kernel.org

I am not sure what went wrong. Could you please check your mail
client, because it even seems to have changed email addresses to smtp.osdl.org

Cheers,
Balbir
-

From: Srivatsa Vaddagiri
Date: Monday, March 12, 2007 - 7:24 pm

I have a problem doing a group-reply in mutt to Herbert's mails. His
email id gets dropped from the To or Cc list. Is that his email setting?
Don't know.

-- 
Regards,
vatsa
-

From: Herbert Poetzl
Date: Tuesday, March 13, 2007 - 9:06 am

my mail client is not involved in receiving the emails,
so the email I replied to did already miss you in the cc
(i.e. I doubt that mutt would hide you from the cc, if
it would be present in the mailbox :)

maybe one of the mailing lists is removing recipients
according to some strange scheme?

here are the full headers for the email I replied to:

-8<------------------------------------------------------------------------
-

From: Kirill Korotaev
Date: Sunday, March 11, 2007 - 5:26 am

i.e. a separate memzone for each container?
imho the memzone approach is inconvenient for page sharing and share accounting.
it also makes memory management more strict, forbids overcommitting
per-container etc.
Maybe you have some ideas how we can decide on this?

Thanks,
Kirill

-

From: Andrew Morton
Date: Sunday, March 11, 2007 - 5:51 am

Yep.  Straightforward machine partitioning.  An attractive thing is that it


We need to work out what the requirements are before we can settle on an
implementation.

Sigh.  Who is running this show?   Anyone?

You can actually do a form of overcommitment by allowing multiple
containers to share one or more of the zones.  Whether that is sufficient
or suitable I don't know.  That depends on the requirements, and we haven't
even discussed those, let alone agreed to them.  

-

From: Herbert Poetzl
Date: Sunday, March 11, 2007 - 6:00 pm

well, I guess all existing OS-Level virtualizations
(Linux-VServer, OpenVZ, and FreeVPS) have stated more
than one time that _sharing_ of resources is a central
element, and one especially important resource to share
is memory (RAM) ...

if your aim is full partitioning, we do not need to
bother with OS-Level isolation, we can simply use

Linux-VServer (and probably OpenVZ):

 - shared mappings of 'shared' files (binaries 
   and libraries) to allow for reduced memory
   footprint when N identical guests are running

 - virtual 'physical' limit should not cause
   swap out when there are still pages left on
   the host system (but pages of over limit guests
   can be preferred for swapping)

 - accounting and limits have to be consistent
   and should roughly represent the actual used
   memory/swap (modulo optimizations, I can go
   into detail here, if necessary)

 - OOM handling on a per guest basis, i.e. some
   out of memory condition in guest A must not
   affect guest B

HTC,
-

From: Dave Hansen
Date: Monday, March 12, 2007 - 11:42 am

How about we drill down on these a bit more.


So, it sounds like this can be phrased as a requirement like:

	"Guests must be able to share pages."

Can you give us an idea why this is so?  On a typical vserver system,
how much memory would be lost if guests were not permitted to share

Is this a really hard requirement?  It seems a bit fluffy to me.  An
added bonus if we can do it, but certainly not the most important
requirement in the bunch.

What are the consequences if this isn't done?  Doesn't a loaded system
eventually have all of its pages used anyway, so won't this always be a
temporary situation?

This also seems potentially harmful if we aren't able to get pages
*back* that we've given to a guest.  Tasks can pin pages in lots of

So, consistency is important, but is precision?  If we, for instance,
used one of the hashing schemes, we could have some imprecise decisions
made but the system would stay consistent overall.

This requirement also doesn't seem to push us in the direction of having
distinct page owners, or some sharing mechanism, because both would be

I'll agree that this one is important and well stated as-is.  Any
disagreement on this one?

-- Dave

-

From: Herbert Poetzl
Date: Monday, March 12, 2007 - 3:41 pm

sure, one reason for this is that guests tend to
be similar (or almost identical) which results
in quite a lot of 'shared' libraries and executables
which would otherwise get cached for each guest and


let me give a real world example here:

 - typical guest with 600MB disk space
 - about 100MB guest specific data (not shared)
 - assumed that 80% of the libs/tools are used

gives 400MB of shared read only data

assumed you are running 100 guests on a host,
that makes ~39GB of virtual memory which will
get paged in and out over and over again ...


well, let's look at the overall memory resource
function with the above assumptions:

 with sharing:		f(N) = N*80M + 400M
 without sharing: 	g(N) = N*480M

so the decrease N->inf:	g/f -> 6 (factor)

which is quite realistic, if you consider that
there are only so many distributions, OTOH, the
factor might become less important when the 
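
The two functions f(N) and g(N) above can be evaluated directly. The sketch below only restates the arithmetic from this message (80MB private per guest, 400MB shared, all values in MB); the function names are made up for illustration.

```c
#include <assert.h>

/* Memory in MB for n guests when the 400MB of read-only data is shared. */
static double mem_shared(long n)
{
	return n * 80.0 + 400.0;
}

/* Memory in MB for n guests when every guest caches its own copy. */
static double mem_unshared(long n)
{
	return n * 480.0;
}

/* g(N)/f(N): how much more memory the unshared case needs. */
static double sharing_factor(long n)
{
	return mem_unshared(n) / mem_shared(n);
}
```

For N = 100 this gives 48000MB against 8400MB, a factor of about 5.7, and the factor approaches 480/80 = 6 as N grows.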

no, not hard, but a reasonable optimization ...

let me note once again, that for full isolation
you better go with Xen or some other Hypervisor
because if you make it work like Xen, it will
become as slow and resource hungry as any other

most optimizations might look strange at first
glance, but when you check what the limiting
factors for OS-Level virtualizations are, you
will find that it looks like this:

(in order of decreasing relevance)

 - I/O subsystem
 - available memory 
 - network performance
 - CPU performance

note: this is for 'typical' guests, not for
number crunching or special database, or pure

nope, not the _most_ important one, but it

let's consider a quite limited guest (or several
of them) which have a 'RAM' limit of 64MB and 
additional 64MB of 'virtual swap' assigned ...

if they use roughly 96MB (memory footprint) then
having this 'fluffy' optimization will keep them
running without any effect on the host side, but
without, they will continuously swap in and out
which will affect not only the host, but also the

no, the idea is not to ...
-

From: Dave Hansen
Date: Monday, March 12, 2007 - 4:02 pm

I get the general idea here, but I just don't think those numbers are
very accurate.  My laptop has a bunch of gunk open (xterm, evolution,
firefox, xchat, etc...).  I ran this command:

lsof | egrep '/(usr/|lib.*\.so)' | awk '{print $9}' | sort | uniq | xargs du -Dcs

and got:

113840  total

On a web/database server that I have (ps aux | wc -l == 128), I just ran
the same:

39168   total

That's assuming that all of the libraries are fully read in and
populated, just by their on-disk sizes. Is that not a reasonable measure
of the kinds of things that we can expect to be shared in a vserver?  If
so, it's a long way from 400MB.

Could you try a similar measurement on some of your machines?  Perhaps


I don't doubt this, but doing this two-level page-out thing for
containers/vservers over their limits is surely something that we should
consider farther down the road, right?

It's important to you, but you're obviously not doing any of the

All workloads that use $limit+1 pages of memory will always pay the
price, right?  :)

-- Dave

-

From: Eric W. Biederman
Date: Sunday, March 18, 2007 - 9:58 am

Think shell scripts and the like.  From what I have seen I would agree
that it is typical for application code not to dominate application memory
usage.  However, on the flip side it is not uncommon for application code to
dominate disk usage.  Some of us have giant music, video or code databases
that consume a lot of disk space, but in many instances servers don't have
enormous chunks of private files, and even when they do they share the files
from the distribution.

The result of this is that there are a lot of unmapped pages cached in the page
cache for rarely run executables, that are cached just in case we need them.

So while Herbert's numbers may be a little off, the general principle of the entire
system doing better if you can share the page cache is very real.

That the page cache isn't accounted for here isn't terribly important we still

It is what the current VM of linux does.  There is removing a page from
processes and then there is writing it out to disk.  I think the normal
term is second chance replacement.  The idea is that once you remove
a page from being mapped you let it age a little before it is paged
back in.  This allows pages in high demand to avoid being written

Tread carefully here.  Herbert may not be doing a lot of mainline coding
or extremely careful review of potential patches but he does seem to have
a decent grasp of the basic issues, in addition to a reasonable amount
of experience, so it is worth listening to what he says.

In addition Herbert does seem to be doing some testing of the mainline

Ugh.  You really want swap > RAM here.  Because there are real
cases when you are swapping when all of your pages in RAM can
be cached in the page cache.  96MB with 64MB RSS and 64MB swap is

They should.  When you remove an anonymous page from the page tables it
needs to be allocated and placed in the swap cache.  Once you do that
it can sit in the page cache like any file backed page.  So the
container that hits $limit+1 should get the paging ...
-

From: Andrew Morton
Date: Monday, March 12, 2007 - 11:04 pm

nooooooo.  What you're saying there amounts to text replication.  There is
no proposal here to create duplicated copies of pagecache pages: the VM
just doesn't support that (Nick has some protopatches which do this as a
possible NUMA optimisation).

So these mmapped pages will continue to be shared across all guests.  The
problem boils down to "which guest(s) get charged for each shared page".

A simple and obvious and easy-to-implement answer is "the guest which paged
it in".  I think we should firstly explain why that is insufficient.

-

From: Kirill Korotaev
Date: Tuesday, March 13, 2007 - 3:19 am

I guess by "paged it in" you essentially mean
"mapped the page into address space for the *first* time"?

i.e. how many times should the same page, mapped into 2 address spaces
in the same container, be accounted for?

We believe ONE. It is better because:
- it allows a better estimate of how much RAM a container uses.
- if one container mapped a single page 10,000 times,
  it doesn't mean it is worse than a container which mapped only 200 pages
  and that it should be killed in case of OOM.
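
A minimal sketch of "account ONE" as argued above, assuming the page carries a map count and an owner. All names are hypothetical and locking is left out; the only point is that the container's RSS moves on the first map and the last unmap, not on every mapping.

```c
#include <assert.h>
#include <stddef.h>

struct rss_container {
	long rss;			/* pages charged to this container */
};

struct page {
	struct rss_container *owner;	/* set on first map */
	long mapcount;			/* number of live mappings */
};

/* Only the first mapping charges the page; later ones just count. */
static void page_map(struct page *pg, struct rss_container *c)
{
	if (pg->mapcount++ == 0) {
		pg->owner = c;
		c->rss++;
	}
}

/* Only the last unmapping uncharges the page. */
static void page_unmap(struct page *pg)
{
	assert(pg->mapcount > 0);
	if (--pg->mapcount == 0) {
		pg->owner->rss--;
		pg->owner = NULL;
	}
}
```

With this shape a container that maps one page 10,000 times shows an RSS of 1, so it does not look worse to an OOM decision than a container that maps 200 distinct pages.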

Thanks,
Kirill
-

From: Andrew Morton
Date: Tuesday, March 13, 2007 - 4:48 am

Not really - I mean "first allocated the page".  ie: major fault(), read(),

I'm not sure that we need to account for pages at all, nor care about rss.

If we use a physical zone-based containment scheme: fake-numa,
variable-sized zones, etc then it all becomes moot.  You set up a container
which has 1.5GB of physical memory then toss processes into it.  As that
process set increases in size it will toss out stray pages which shouldn't
be there, then it will start reclaiming and swapping out its own pages and
eventually it'll get an oom-killing.

No RSS accounting or page accounting in sight, because we already *have* that
stuff, at the physical level, in the zone.

Overcommitment can be performed by allowing different containers to share
the same zone set, or by dynamically increasing or decreasing the size of
a physical container.

This all works today with fake-numa and cpusets, no kernel changes needed. 

It could be made to work fairly simply with a multi-zone approach, or with
resizeable zones.

I'd be interested in knowing what you think the shortcomings of this are
likely to be.

-

From: Herbert Poetzl
Date: Tuesday, March 13, 2007 - 7:59 am

sounds good to me, just not sure it provides what we 

okay, let me ask a few naive questions about this scheme:

how does this work for a _file_ which is shared between 
two guests (e.g. an executable like bash, hardlinked 
between guests) when both guests are in a different 
zone-based container?

   + assumed that the file is read in the first time,
     will it be accounted to the first guest doing so?

   + assumed it is accessed in the second guest, will
     it cause any additional cache/mapping besides the
     dentry stuff?

   + will container A be able to 'toss out' pages
     'shared' with container B (assumed sharing is
     possible :)

   + when the container A tosses out the pages for this 
     executable, will guest B still be able to use them?

   + when the pages are tossed out, will they require
     the system to read them in again, or will they


here the question is, can a guest have several of
those 'virtual zones' assigned, so that there is a


will do so once I have a better understanding how this
approach will work ...

TIA,
Herbert

-

From: Dave Hansen
Date: Tuesday, March 13, 2007 - 10:05 am

I was just reading through the (comprehensive) thread about this from
last week, so forgive me if I missed some of it.  The idea is really
tempting, precisely because I don't think anyone really wants to have to
screw with the reclaim logic.  

I'm just brain-dumping here, hoping that somebody has already thought
through some of this stuff.  It's not a bitch-fest, I promise. :)

How do we determine what is shared, and goes into the shared zones?
Once we've allocated a page, it's too late because we already picked.
Do we just assume all page cache is shared?  Base it on filesystem,
mount, ...?  Mount seems the most logical to me, that a sysadmin would
have to set up a container's fs, anyway, and will likely be doing
special things to shared data, anyway (r/o bind mounts :).

There's a conflict between the resize granularity of the zones, and the
storage space their lookup consumes.  We'd want a container to have a
limited ability to fill up memory with stuff like the dcache, so we'd
appear to need to put the dentries inside the software zone.  But, that
gets us to our inability to evict arbitrary dentries.  After a while,
would containers tend to pin an otherwise empty zone into place?  We
could resize it, but what is the cost of keeping zones that can be
resized down to a small enough size that we don't mind keeping it there?
We could merge those "orphaned" zones back into the shared zone. Were
there any requirements about physical contiguity?  What about minimum
zone sizes?

If we really do bind a set of processes strongly to a set of memory on a
set of nodes, then those really do become its home NUMA nodes.  If the
CPUs there get overloaded, running it elsewhere will continue to grab
pages from the home.  Would this basically keep us from ever being able
to move tasks around a NUMA system?

-- Dave

-

From: Mel Gorman
Date: Wednesday, March 14, 2007 - 8:38 am

Assume we had a means of creating a zone that was assigned to a container,
and a second zone for shared data between a set of containers.  For shared data,
the time the pages are being allocated is at page fault time. At that point,
the faulting VMA is known and you also know if it's MAP_SHARED or not.

The caller allocating the page would select (or create) a zonelist that
is appropriate for the container. For shared mappings, it would be one
zone - the shared zone for the set. For private mappings, it would be
one zone - the zone assigned to the container.
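
The fault-time choice can be sketched as below, assuming private mappings are served from the container's own zone and shared mappings from the set's shared zone when one exists. Every type, field and flag name here is hypothetical; real zonelists hold several zones and none of this is existing kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define VMA_SHARED 0x1u	/* hypothetical flag: the VMA was mapped MAP_SHARED */

struct zone {
	int id;
};

struct container_set {
	struct zone shared_zone;	/* shared by every container in the set */
};

struct container {
	struct zone own_zone;		/* zone assigned to this container */
	struct container_set *set;	/* NULL if there is no shared zone */
};

struct vma {
	unsigned int flags;
};

/* Pick the zone to allocate from at page fault time. */
static struct zone *select_zone(struct container *c, const struct vma *vma)
{
	if ((vma->flags & VMA_SHARED) && c->set != NULL)
		return &c->set->shared_zone;
	/* Private mapping, or no shared zone: the faulting container pays. */
	return &c->own_zone;
}
```

The fallback branch corresponds to the awkward case of a container without a shared zone faulting in shared data: the page simply lands in, and is charged to, the first container that touches it.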

For overcommit, the allowable zones for overcommit could be included.
Allowing overcommit opens the possibility for containers to interfere with
each other but I'm guessing that if overcommit is enabled, the administrator
is willing to live with that interference.

This has the awkward possibility of having two "shared" zones for two container
sets and one file that needs sharing. Similarly, there is a possibility for
having a container that has no shared zone and faulted in shared data. In
that case, the page ends up in the first faulting container set and it's
too bad it got "charged" for the page use on behalf of other containers. I'm
not sure there is a sane way of accounting this situation fairly.

I think that it's important to note that once data is shared between containers
at all that they have the potential to interfere with each other (by reclaiming

We'd choose the appropriate zonelist before faulting. Once allocated,

I have no strong feelings here. To me, it's "who do I assign this fake
zone to?" I guess you would have at least one zone per container mount

Stuff like shrinking dentry caches is already pretty coarse-grained.
Last I looked, we couldn't even shrink within a specific node, let alone

Merging "orphaned" zones back into the "main" zone would seem a sensible

For the lookup to software zone to be efficient, it would be easiest to have
them as MAX_ORDER_NR_PAGES contiguous. This would avoid having to ...
-

From: Dave Hansen
Date: Wednesday, March 14, 2007 - 1:42 pm

Well, but MAP_SHARED does not necessarily mean shared outside of the
container, right?  Somebody wishing to get around resource limits could
just MAP_SHARED any data they wished to use, and get it into the shared
area before their initial use, right?


I shouldn't have used dentries as an example.  I'm just saying that if
we end up with (or can end up with) a whole ton of these software zones,
we might have troubles storing them.  I would imagine the issue would

OK, but merging wouldn't be possible if they're not physically
contiguous.  I guess this could be worked around by just calling it a

I was mostly wondering about zones spanning other zones.  We _do_

I know we _try_ to avoid this these days, but I'm not sure how taking it
away as an option will affect anything.

-- Dave

-

From: Mel Gorman
Date: Tuesday, March 20, 2007 - 11:57 am

Well, the data could also be shared outside of the container. I would see

They would only be able to impact other containers in a limited sense.
Specifically, if 5 containers have one shared area, then any process in
those 5 containers could exceed their container limits at the expense of

A normal read/write if it's the first reader of a file would get charged to the
container, not to the shared area. It is less likely that a file that is read()

That is an immediate problem. There needs to be a way of mapping an arbitrary
page to a software zone. page_zone() as it is could only resolve the "main"
zone. If additional bits were used in page->flags, there would be very hard
limits on the number of containers that can exist.

If zones were physically contiguous to MAX_ORDER, pageblock flags from the
anti-fragmentation could be used to record that a block of pages was in a
container and what the ID is.  If non-contiguous software zones were required,
page->zone could be reintroduced for software zones to be used when a page
belongs to a container. It's not ideal, but the proper way of mapping pages to
software zones might become more obvious once we'd see where page->zone
was used.

With either approach, the important thing that occurred to me is to be
sure that pages only come from the same hardware zone. For example, do
not mix HIGHMEM pages with DMA pages because it'll fail miserably. For RSS
accounting, this is not much of a restriction but it does have an impact on


In practice, overlapping zones never happen today so a few new bugs
based on assumptions about MAX_ORDER_NR_PAGES being aligned in a zone

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Paul Menage
Date: Sunday, March 18, 2007 - 3:44 pm

I played with an approach where you can bind a dentry to a set of
memory zones, and any children of that dentry would inherit the
mempolicy; I was envisaging that most data wouldn't be shared between
different containers/jobs, and that userspace would set up "shared"
zones for big shared regions such as /lib, /usr, /bin, and for

move_pages() will let you shuffle tasks from one node to another
without too much intrusion.

Paul
-

From: Eric W. Biederman
Date: Monday, March 19, 2007 - 10:41 am

Here is a wacky one.

Suppose there is some NFS server that exports something that most machines
want to mount like company home directories.

Suppose multiple containers mount that NFS server based on local policy.
(If we can allow non-root users to mount filesystems a slightly more trusted
 guest admin certainly will be able to).

The NFS code as currently written (unless I am confused) will do
everything in its power to share the filesystem cache between the
different mounts (including the dentry tree).

How do we handle big shared areas like that?

Dynamic programming solutions where we discover the areas of sharing
at runtime seem a lot more general than a priori solutions where you
have to predict what will come next.

If a priori planning and knowledge about sharing is the best we can do
it is the best we can do and we will have to live with the limits that
imposes.  Given the inflexibility in use and setup I'm not yet ready
to concede that this is the best we can do.

Eric
-

From: Dave Hansen
Date: Tuesday, March 13, 2007 - 10:26 am

My first worry was that this approach is unfair to the poor bastard that
happened to get started up first.  If we have a bunch of containerized
web servers, the poor guy who starts Apache first will pay the price for
keeping it in memory for everybody else.

That said, I think this is naturally worked around.  The guy charged
unfairly will get reclaim started on himself sooner.  This will tend to
page out those pages that he was being unfairly charged for.  Hopefully,
they will eventually get pretty randomly (eventually evenly) spread
among all users.  We just might want to make sure that we don't allow
ptes (or other new references) to be re-established to pages like this
when we're trying to reclaim them.  Either that, or force the next
toucher to take ownership of the thing.  But, that kind of arbitrary
ownership transfer can't happen if we have rigidly defined boundaries
for the containers.

The other concern is that the memory load on the system doesn't come
from the first user ("the guy who paged it in").  The long-term load
comes from "the guy who keeps using it."  The best way to exemplify this
is somebody who read()s a page in, followed by another guy mmap()ing the
same page.  The guy who did the read will get charged, and the mmap()er
will get a free ride.  We could probably get an idea when this kind of
stuff is happening by comparing page->count and page->_mapcount, but it
certainly wouldn't be conclusive.  But, does this kind of nonsense even
happen in practice?  
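
The free-ride case above can be made concrete with a toy model. This is
only a sketch: the class, its method names and the container names are
made up for illustration and are not taken from any patch in this thread.
It assumes a pure first-touch policy where whoever brings a page in is
charged and every later user is not.

```python
from collections import defaultdict

class FirstTouchAccounting:
    """Toy model: a page is charged to whichever container first brings it in."""

    def __init__(self):
        self.owner = {}                    # page id -> container charged for it
        self.charged = defaultdict(int)    # container -> pages charged
        self.users = defaultdict(set)      # page id -> containers using the page

    def touch(self, container, page):
        # Only the first toucher pays; later users ride for free.
        if page not in self.owner:
            self.owner[page] = container
            self.charged[container] += 1
        self.users[page].add(container)

    def free_riders(self, page):
        # Containers that use the page without being charged for it.
        return self.users[page] - {self.owner[page]}


acct = FirstTouchAccounting()
acct.touch("reader", page=7)    # read() pulls the page in and is charged
acct.touch("mapper", page=7)    # mmap() of the same page afterwards is free
```

Here "reader" ends up with one page charged and "mapper" with none, even
if "mapper" is the one who keeps using the page.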

-- Dave

-

From: Alan Cox
Date: Tuesday, March 13, 2007 - 12:09 pm

"Is it useful for me as a bad guy to make it happen?"

Alan
-

From: Dave Hansen
Date: Tuesday, March 13, 2007 - 1:28 pm

A very fine question. ;)

To exploit this, you'd need to:
1. access common data with another user
2. be patient enough to wait
3. determine when one of those users had actually pulled
   a page in from disk, which sys_mincore() can do, right?

I guess that might be a decent reason to not charge the guy who brings
the page in for the page's entire lifetime.  

So, unless we can change page ownership after it has been allocated,
anyone accessing shared data can get around resource limits if they are
patient.  

-- Dave

-

From: Eric W. Biederman
Date: Thursday, March 15, 2007 - 5:55 pm

To create a DOS attack.

- Allocate some memory you know your victim will want in the future,
  (shared libraries and the like).
- Wait until your victim is using the memory you allocated.
- Terminate your memory resource group.
- Victim is pushed over memory limits by your exiting.
- Victim can no longer allocate memory
- Victim dies

It's not quite that easy unless your victim calls mlockall(MCL_FUTURE),
but the potential is clearly there.
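
A rough simulation of the steps above, as a sketch only. It assumes that
when a group exits, the charge for any page still in use moves to one of
the remaining users; that transfer rule, the Group class and all names
here are hypothetical and only serve to show how the victim's usage can
jump without the victim doing anything.

```python
class Group:
    def __init__(self, name, limit):
        self.name, self.limit, self.charged = name, limit, set()

    def over_limit(self):
        return len(self.charged) > self.limit


def first_touch(group, page, owners):
    # Charge the page to `group` only if nobody owns it yet.
    if page not in owners:
        owners[page] = group
        group.charged.add(page)


def exit_group(group, owners, users):
    # Assumption: pages still in use are re-charged to a remaining user.
    for page in list(group.charged):
        group.charged.discard(page)
        remaining = [g for g in users.get(page, []) if g is not group]
        if remaining:
            owners[page] = remaining[0]
            remaining[0].charged.add(page)
        else:
            del owners[page]


owners, users = {}, {}
attacker = Group("attacker", limit=100)
victim = Group("victim", limit=4)

libs = [f"lib{i}" for i in range(3)]      # shared pages, e.g. shared libraries
for page in libs:
    first_touch(attacker, page, owners)   # attacker pays first
    users[page] = [attacker, victim]      # victim later uses them for free
    first_touch(victim, page, owners)     # no charge: already owned

for page in ["priv0", "priv1", "priv2"]:  # victim's own working set
    first_touch(victim, page, owners)
    users[page] = [victim]

before = victim.over_limit()              # 3 pages charged, limit 4
exit_group(attacker, owners, users)
after = victim.over_limit()               # 6 pages charged, limit 4
```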

Am I missing something?  Or is this fundamental to any first touch scenario?

I just know I have problems with first touch because it is darn hard to
reason about.

Eric
-

From: Dave Hansen
Date: Friday, March 16, 2007 - 9:31 am

I think it's fundamental to any case where two containers share the use
of the page, but either one _can_ be charged but does not receive a
_full_ charge for it.

I don't think it's uniquely associated with first-touch schemes.

The software zones approach where there would be a set of "shared" zones
would not have this problem, because any sharing would have to occur on
data on which neither one was being charged.

http://linux-mm.org/SoftwareZones

-- Dave

-

From: Eric W. Biederman
Date: Friday, March 16, 2007 - 11:54 am

True.   The "shared" zones approach would simply have the problem that it
would make sharing hard and thus reduce the effectiveness of the page cache.

The "shared" zone approach also would seem to interact in very weird ways
with real NUMA and memory hotplug or process migration.  The fact that we
actually have to care about the real memory size on the machine makes me
look at it strange.

Zones should definitely be penalized in some category for the reduction
in efficiency of the page cache.  It took us decades to learn that the
most efficient page cache was one that could resize and reallocate memory
on demand based on the current usage.  Zones and possibly anything else
with the concept of page ownership seem to be trying to ignore


Looking at your page (and I'm too lazy to figure out how to update it),
I have a couple of comments.

- Why do limits have to apply to the unmapped page cache?

- Could you mention proper multi-process RSS limits?
  (I.e. we count the number of pages each group of processes has mapped
   and limit that.)
  It is the same basic idea as partial page ownership, but instead of
  page ownership you just count how many pages each group is using and
  strictly limit that.  There is no page ownership or partial charges.
  The overhead is just walking the rmap list at map and unmap time to
  see if this is the first user in the container.  No additional kernel
  data structures are needed.
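
A minimal sketch of such a multi-process RSS count, in userspace Python
rather than kernel code. The GroupRss class is hypothetical, and its
per-(group, page) dictionary only stands in for the rmap walk described
above; the proposal itself needs no such extra structure.

```python
from collections import defaultdict

class GroupRss:
    """Count, per group, how many distinct pages the group has mapped."""

    def __init__(self, limits):
        self.limits = limits
        # (group, page) -> number of mappings of that page from the group;
        # stands in for walking the rmap list to find other mappers.
        self.mappers = defaultdict(int)
        self.rss = defaultdict(int)

    def map_page(self, group, page):
        first_in_group = self.mappers[(group, page)] == 0
        if first_in_group and self.rss[group] >= self.limits[group]:
            return False                  # would exceed the group's RSS limit
        self.mappers[(group, page)] += 1
        if first_in_group:
            self.rss[group] += 1
        return True

    def unmap_page(self, group, page):
        self.mappers[(group, page)] -= 1
        if self.mappers[(group, page)] == 0:
            self.rss[group] -= 1          # last mapping in this group is gone


rss = GroupRss({"a": 2, "b": 2})
rss.map_page("a", "glibc")    # first mapping in group a: counted
rss.map_page("a", "glibc")    # another process in group a: not counted again
rss.map_page("b", "glibc")    # group b counts the same page too; nobody owns it
```

Note that the shared page shows up in both groups' counts, so the sum over
groups can exceed the number of physical pages in use.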

Eric
-

From: Dave Hansen
Date: Friday, March 16, 2007 - 12:46 pm

You just need to create an account by clicking the Login button.  It
lets you edit things after that.  But, I'd be happy to put anything in

To me, it is just because it consumes memory.  Unmapped cache is, of
course, much more easily reclaimed than mapped files, but it still
fundamentally causes pressure on the VM.  

To me, a process sitting there doing constant reads of 10 pages has the
same overhead to the VM as a process sitting there with a 10 page file

I've tried to capture this.  Let me know what else you think it needs.

http://linux-mm.org/SoftwareZones

-- Dave

-

From: Eric W. Biederman
Date: Sunday, March 18, 2007 - 10:42 am

I can see temporarily accounting for pages in use for such a
read/write and possibly during things such as read ahead.

However I doubt it is enough memory to be significant, and as
such is probably a waste of time accounting for it.

A memory limit is not about accounting for memory pressure, so I think
the reasoning for wanting to account for unmapped pages as a hard
requirement is still suspect.  A memory limit is to prevent one container
from hogging all of the memory in the system, and denying it to other
containers.

The page cache by definition is a global resource that facilitates
global kernel optimizations.  If we kill those optimizations we
are on the wrong track.  By requiring limits there I think we are
very likely to kill our very important global optimizations, and bring

Requirements:
- The current kernel global optimizations are preserved and useful.

  This does mean one container can affect another when the
  optimizations go awry but on average it means much better
  performance.  For many the global optimizations are what make
  the in-kernel approach attractive over paravirtualization.

Very nice to have:
- Limits should be on things user space has control of.
  
  Saying you can only have X bytes of kernel memory for file
  descriptors and the like is very hard to work with.  Saying you
  can have only N file descriptors open is much easier to deal with.

- SMP Scalability.

  The final implementation should have per cpu counters or per task
  reservations so in most instances we don't need to bounce a global
  cache line around to perform the accounting.

Nice to have:

- Perfect precision.

  Having every last byte always accounted for is nice but a
  little bit of bounded fuzziness in the accounting is acceptable
  if that makes the accounting problem more tractable.
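
The SMP scalability and bounded-fuzziness points fit together, and a small
sketch may help show why. The BatchedCounter class below is hypothetical:
each CPU accumulates a local delta and only folds it into the shared total
once it reaches a batch size, so the shared value is rarely written and a
cheap read is wrong by a bounded amount.

```python
class BatchedCounter:
    """Per-CPU deltas folded into a shared total once they reach `batch`."""

    def __init__(self, ncpus, batch):
        self.total = 0                 # shared value (the contended cache line)
        self.local = [0] * ncpus       # per-CPU deltas, cheap to update
        self.batch = batch
        self.global_updates = 0

    def add(self, cpu, amount):
        self.local[cpu] += amount
        if abs(self.local[cpu]) >= self.batch:
            self.total += self.local[cpu]
            self.local[cpu] = 0
            self.global_updates += 1

    def read_fast(self):
        # Approximate: off by less than ncpus * batch.
        return self.total

    def read_exact(self):
        return self.total + sum(self.local)


counter = BatchedCounter(ncpus=4, batch=32)
for i in range(1000):
    counter.add(cpu=i % 4, amount=1)
```

With these numbers the shared total is written 28 times instead of 1000,
and the fast read is short by less than 4 * 32 pages.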

We need several more limits in this discussion to get a full picture,
otherwise we may try to build the all-singing, all-dancing limit.
- A limit on the number of ...
From: Herbert Poetzl
Date: Monday, March 19, 2007 - 8:48 am

exactly!

nevertheless, you might want to extend that to swapping

that is my major concern for most of the 'straight forward'



agreed, we want to optimize for small systems
as well as for large ones, and SMP/NUMA is quite

as long as the accounting is consistent, i.e.
you do not lose resources by repetitive operations
inside the guest (or through guest-guest interaction)

with shared files, otherwise an lvm partition does

I/O and CPU limits are special, as they have the temporal
component, i.e. you are not interested in 10s CPU time,
instead you want 0.5s/s CPU (same for I/O)

note: this is probably also true for page in/out
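
One common way to express a limit with a temporal component is a token
bucket. The sketch below is only an illustration of "0.5s/s CPU" as a
rate; the RateLimit class and its parameters are made up here and do not
describe any existing scheduler or I/O controller.

```python
class RateLimit:
    """Token bucket: `rate` units of resource per second, up to `burst` saved."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = 0.0

    def consume(self, now, amount):
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if amount > self.tokens:
            return False          # over the rate: throttle the guest
        self.tokens -= amount
        return True


cpu = RateLimit(rate=0.5, burst=0.5)    # 0.5 s of CPU per second of wall time
first = cpu.consume(now=0.0, amount=0.5)     # allowed, bucket is now empty
second = cpu.consume(now=0.25, amount=0.25)  # refused, only 0.125 s refilled
third = cpu.consume(now=2.0, amount=0.5)     # allowed again after waiting
```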

- sockets 
- locks
- dentries

HTH,
-

From: Dave Hansen
Date: Tuesday, March 20, 2007 - 9:15 am

Let's say you have an mmap'd file.  It has zero pages brought in right
now.  You do a write to it.  It is well within the kernel's rights to
let you write one word to an mmap'd file, then unmap it, write it to
disk, and free the page.

To me, mmap() is an interface, not a directive to tell the kernel to
keep things in memory.  The fact that two reads of a byte from an
mmap()'d file tends to not go to disk or even cause a fault for the
second read is because the page is in the page cache.  The fact that two
consecutive read()s of the same disk page tend to not cause two trips to
the disk is because the page is in the page cache.

Anybody who wants to get data in and out of a file can choose to use
either of these interfaces.  A page being brought into the system for
either a read or touch of an mmap()'d area causes the same kind of
memory pressure.

So, I think we have a difference of opinion.  I think it's _all_ about
memory pressure, and you think it is _not_ about accounting for memory
pressure. :)  Perhaps we mean different things, but we appear to
disagree greatly on the surface.

Can we agree that there must be _some_ way to control the amounts of
unmapped page cache?  Whether that's related somehow to the same way we
control RSS or done somehow at the I/O level, there must be some way to
...

I've tried to capture this:


Definitely.  I think we've all agreed that memory is the hard one,
though.  If we can make progress on this one, we're set! :)

-- Dave

-

From: Eric W. Biederman
Date: Tuesday, March 20, 2007 - 2:19 pm

I think it is about preventing a badly behaved container from having a
significant effect on the rest of the system, and in particular other
containers on the system.

See below.  I think to reach agreement we should start by discussing
the algorithm that we see being used to keep the system functioning well
and the theory behind that algorithm.  Simply limiting memory is not

A lot depends on what we measure and what we try and control.
Currently what we have been measuring are amounts of RAM, and thus
what we are trying to control is the amount of RAM.  If we want to
control memory pressure we need a definition and a way to measure it.
I think there may be potential if we did that but we would still need
a memory limit to keep things like mlock in check.


So starting with some definitions and theory.
RSS is short for resident set size.  The resident set being how many
pages are currently in memory and not on disk and used by the
application.  This includes the memory in page tables, but can
reasonably be extended to include any memory a process can be shown to
be using.

In theory there is some minimal RSS that you can give an application
at which it will get productive work done.  Below the minimal RSS
the application will spend the majority of real time waiting for
pages to come in from disk, so it can execute the next instruction.
The ultimate worst case here is a read instruction appearing on one
page and its datum on another.  You have to have both pages in memory
at the same time for the read to complete.  If you set the RSS hard
limit to one page the program will be continually restarting either
because the page it is on is not in memory or the page it is reading
from is not in memory.
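
The worst case above is easy to reproduce with a toy fault counter. This
is a sketch under simplifying assumptions: LRU eviction, one access per
page touch, and a made-up count_faults helper. In the real case the
instruction could never complete with a one-page limit; the model only
shows that every single access faults.

```python
from collections import OrderedDict

def count_faults(accesses, rss_limit):
    """Count page faults for an access trace under a hard RSS limit (LRU)."""
    resident = OrderedDict()
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)        # recently used
            continue
        faults += 1
        if len(resident) >= rss_limit:
            resident.popitem(last=False)      # evict least recently used
        resident[page] = True
    return faults


# Each loop iteration fetches an instruction from "code" and reads "data".
trace = ["code", "data"] * 100
faults_limit_1 = count_faults(trace, rss_limit=1)   # every access faults
faults_limit_2 = count_faults(trace, rss_limit=2)   # only the two cold faults
```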

What we want to accomplish is to have a system that runs multiple
containers without problems.  As a general memory management policy
we can accomplish this by ensuring each container has at least
its minimal RSS quota of pages.  By watching the paging activity
of a ...
From: Herbert Poetzl
Date: Thursday, March 22, 2007 - 5:51 pm

that is exactly what we (Linux-VServer) want ...
(sounds good to me, please keep up the good work in
this direction)

there is nothing wrong with hard limits if somebody
really wants them, even if they hurt the system as a
whole, but those limits shouldn't be the default ..

best,
-

From: Nick Piggin
Date: Thursday, March 22, 2007 - 10:57 pm

That's Dave's point, I believe. Limiting mapped memory may be
mostly OK for well behaved applications, but it doesn't do anything
to stop bad ones from effectively DoSing the system or ruining any
guarantees you might proclaim (not that hard guarantees are always
possible without using virtualisation anyway).

This is why I'm surprised at efforts that go to such great lengths
to get accounting "just right" (but only for mmaped memory). You
may as well not even bother, IMO.

Give me an RSS limit big enough to run a couple of system calls and
a loop...

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 
-

From: Eric W. Biederman
Date: Friday, March 23, 2007 - 3:12 am

Would any of them work on a system on which every filesystem was on
ramfs, and there was no swap?  If not then they are not memory attacks
but I/O attacks.

I completely concede that you can DOS the system with I/O if that is
not limited as well.

My point is that is not a memory problem but a disk I/O problem which is
much easier and cheaper to solve.  Disk I/O is fundamentally a slow
path which makes it hard to modify it in a way that negatively affects
system performance.

I don't think with a memory RSS limit you can DOS the system in a way
that is purely about memory.  You have to pick a different kind of DOS
attack.

As for virtualization, that is what a kernel is about: virtualizing its
resources so you can have multiple users accessing them at the same
time.  You don't need some hypervisor or virtual machine to give you
that.  That is where we start.  However it was found long ago that
global optimizations give better system throughput than the rigid
systems you can get with hypervisors.  Although things are not
quite as deterministic when you optimize globally.  They should be
sufficiently deterministic you can avoid the worst of the DOS
attacks.

The real practical problem with the current system is that nearly
all of our limits are per process and applications now span more than
one process so the limits provided by linux are generally useless
to limit real world applications.  This isn't generally a problem
until we start trying to run multiple applications on the same system
because the hardware is so powerful.  Which the namespace work which
will allow you to run several different instances of user space
simultaneously is likely to allow.

At the moment I am very much in a position of doing review, not
implementing this part of it.  I'm trying to get the people doing the
implementation to make certain they have actually been paying
attention to how their proposed limits will interact with the rest of
the system.  So far generally the conversation has ...
From: Nick Piggin
Date: Friday, March 23, 2007 - 3:47 am

It can be done trivially without performing any IO or swap, yes.

-- 
SUSE Labs, Novell Inc.
-

From: Eric W. Biederman
Date: Friday, March 23, 2007 - 5:21 am

Please give me a rough sketch of how to do so.

Or is this about DOS'ing the system by getting the kernel to allocate
a large number of data structures (struct file, struct inode, or the like)?

Eric
-

From: Nick Piggin
Date: Wednesday, March 28, 2007 - 12:33 am

Reading sparse files is just one I had in mind. But I'm not very

That works too. And I don't believe hand-accounting and limiting
all these things individually as a means to limit RAM usage is sane,
when you have a much more comprehensive and relatively unintrusive
page level scheme.

-- 
SUSE Labs, Novell Inc.
-

From: Dave Hansen
Date: Friday, March 23, 2007 - 9:41 am

I truly understand your point here.  But, I don't think this thought
exercise is really helpful here.  In a pure sense, nothing is keeping an
unmapped page cache file in memory, other than the user's prayers.  But,
please don't discount their prayers, it's what they want!

I seem to remember a quote attributed to Alan Cox around OLS time last
year, something about any memory controller being able to be fair, fast,
and accurate.  Please pick any two, but only two.  Alan, did I get
close?

To me, one of the keys of Linux's "global optimizations" is being able
to use any memory globally for its most effective purpose, globally
(please ignore highmem :).  Let's say I have a 1GB container on a
machine that is at least 100% committed.  I mmap() a 1GB file and touch
the entire thing (I never touch it again).  I then go open another 1GB
file and r/w to it until the end of time.  I'm at or below my RSS limit,
but that 1GB of RAM could surely be better used for the second file.
How do we do this if we only account for a user's RSS?  Does this fit
into Alan's unfair bucket? ;)

Also, in a practical sense, it is also a *LOT* easier to describe to a
customer that they're getting 1GB of RAM than >=20GB/hr of bandwidth
from the disk.  

-- Dave

P.S. Do we have any quotas on ramfs?  If we have any ramfs filesystems,
what keeps the containerized users from just filling up RAM?

-

From: Herbert Poetzl
Date: Friday, March 23, 2007 - 11:16 am

what's the difference to a normal Linux system here?
when low on memory, the system will reclaim pages, and

if you want something which is easy to describe for the
'customer', then a VM is what you are looking for, it has
a perfectly well defined amount of resources which will

tmpfs has hard limits, you simply specify it on mount

 none	/tmp		tmpfs	size=16m,mode=1777	0 0

best,
-

From: Balbir Singh
Date: Wednesday, March 28, 2007 - 2:18 am

But would it not bias application writers towards using read()/write()
calls over mmap()? They know that their calls are likely to be faster
when the application is run in a container. Without page cache control
we'll end up creating an asymmetrical container, where certain usage is 
charged and some usage is not.

Also, please note that when a page is unmapped and moved to swap cache;
the swap cache uses the page cache. Without page cache control, we could
end up with too many pages moving over to the swap cache and still
occupying memory, while the original intention was to avoid this
scenario.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Mel Gorman
Date: Wednesday, March 14, 2007 - 9:47 am

I think it would be very difficult in practice to exploit a situation where
an evil guy forces another container to hold shared pages that the container

Exactly. That said, the "poor bastard" will have to be pretty determined
to page out because the pages will appear active but it should happen

I don't think anything like that currently exists. It's almost the opposite
of what the current reclaim algorithm would be trying to do because it has no
notion of containers. Currently, the idea of paging out something in active
use is a mad plan.

Maybe what would be needed is something where the shared page is unmapped from
page tables and the next faulter must copy the page instead of reestablishing
the PTE. The data copy is less than ideal but it'd be cheaper than reclaim
and help the accounting. However, it would require a counter to track "how

Right, charging the next toucher would not work in the zones case. The next
toucher would establish a PTE to the page which is still in the zone of the

I think this problem would happen with other accounting mechanisms as
well. However, it's more pronounced with zones because there are harder
limits on memory usage.

If the counter existed to track "how many processes in this container have
mapped the page", the problem of free-riders could be investigated by comparing
_mapcount to the container count. That would determine if additional steps
are required or not to force another container to assume the accounting cost.
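
As a sketch of that comparison: if we knew, per page, how many of the
mappings come from the container being charged, the difference from the
total map count would be the number of free-riding mappings. The function
and its inputs below are hypothetical; no such per-container counter
exists in the patches discussed here.

```python
def free_riders(mappers, charged_container):
    """mappers: list of container names, one entry per process mapping the page."""
    mapcount = len(mappers)
    # Hypothetical per-container counter: mappings from the charged container.
    container_count = sum(1 for c in mappers if c == charged_container)
    return mapcount - container_count


shared = free_riders(["a", "a", "b"], charged_container="a")   # one outsider
private = free_riders(["a", "a"], charged_container="a")       # no outsiders
```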

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Pavel Emelianov
Date: Monday, March 12, 2007 - 2:02 am

So what to do when virtual physical limit is hit?

This is true for the current implementation of
both - this patchset and OpenVZ beancounters.

If you sum up the physpages values for all containers

This is done in current patches.

Herbert, did you look at the patches before
sending this mail or do you just want to
'take part' in conversation w/o understanding

-

From: Herbert Poetzl
Date: Monday, March 12, 2007 - 2:11 pm

nice, but the question was about _requirements_

when the RSS limit is hit, but there _are_ enough
pages left on the physical system, there is no
good reason to swap out the page at all

 - there is no benefit in doing so (performance
   wise, that is)

 - it actually hurts performance, and could
   become a separate source for DoS

what should happen instead (in an ideal world :)
is that the page is considered swapped out for
the guest (add guest penalty for swapout), and
when the page would be swapped in again, the guest
takes a penalty (for the 'virtual' page in) and
the page is returned to the guest, possibly kicking


again, the question was about requirements, not
your patches, and yes, I had a look at them _and_
the OpenVZ implementations ...

best,
Herbert

-

From: Pavel Emelianov
Date: Tuesday, March 13, 2007 - 12:17 am

Does the page stay mapped for the container or not?
If yes then what's the use of limits? The container has mapped
more pages than the limit allows, but all the pages are


-

From: Herbert Poetzl
Date: Tuesday, March 13, 2007 - 8:05 am

sounds weird, but makes sense if you look at the full picture

just because the guest is over its page limit doesn't 
mean that you actually want the system to swap stuff
out, what you really want to happen is the following:

 - somehow mark those pages as 'gone' for the guest
 - penalize the guest (and only the guest) for the
   'virtual' swap/page operation
 - penalize the guest again for paging in the page
 - drop/swap/page out those pages when the host system

you tell me? or is that an option in OpenVZ?

best,
-

From: Pavel Emelianov
Date: Tuesday, March 13, 2007 - 8:32 am

Yeah! And slow down the container which caused global
limit hit (w/o hitting its own limit!) by swapping

In OpenVZ we account resources in host system as well.

-

From: Kirill Korotaev
Date: Tuesday, March 13, 2007 - 8:10 am

great. I agree with that.
Just curious why current vserver code kills arbitrary

depends on whether you will include beancounter 0 usages or not :)

Kirill
-

From: Herbert Poetzl
Date: Tuesday, March 13, 2007 - 8:11 am

because it obviously lacks the finesse of OpenVZ code :)

seriously, handling the OOM kills inside a container
has never been a real world issue, as once you are
really out of memory (and OOM starts killing) you 
usually have lost the game anyways (i.e. a guest restart
or similar is required to get your services up and
running again) and OOM killer decisions are not perfect
in mainline either, but, you've probably seen the 
FIXME and TODO entries in the code showing that this

so that is an option then?

best,
-

From: Kirill Korotaev
Date: Tuesday, March 13, 2007 - 8:54 am

I'm talking not about the finesse of the code,
but rather about the lack of isolation,
i.e. one VE can affect others.

Kirill
-

From: Balbir Singh
Date: Sunday, March 11, 2007 - 8:51 am

We discussed zones for resource control and some of the disadvantages at
              http://lkml.org/lkml/2006/10/30/222

I need to look at Mel's patches to determine if they are suitable for
control. But in a thread of discussion on those patches, it was agreed

We discussed some of the requirements in the RFC: Memory Controller
requirements thread

All the stake holders involved in the RFC discussion :-) We've been
talking and building on top of each others patches. I hope that was a

There are other things like resizing a zone, finding the right size,
etc. I'll look at Mel's patches to see what is supported.

Warm Regards,
Balbir Singh
-

From: Eric W. Biederman
Date: Sunday, March 11, 2007 - 12:34 pm

And misses every resource sharing opportunity in sight.  Except for
filtering which pages are eligible for reclaim an RSS limit should
not need to change the existing reclaim logic, and with things like the
memory zones we have had that kind of restriction in the reclaim logic

If you are talking about RSS limits the term is well defined.  The
number of pages you can have mapped into your set of address space at
any given time.

Unless I'm totally blind that isn't what the patchset implements.  A
true RSS limit over multiple processes has a lot of potential to be
generally useful, is very understandable, doesn't affect kernel cache
decisions so largely performance should not be affected.  There is a
little more overhead in the fault logic but that is a moderately


Another really nasty issue is the container term as the resource guys
are using the term in a subtly different way than it has been used
with namespaces leading to several threads where the participants talked
past each other.  We need a different term to designate the group of
tasks a resource controller is dealing with.

The whole filesystem interface also is over general and makes it too
easy to express the hard things (like move an existing task from one
group of tasks to another) leading to code complications.

On the up side I think the code's focus is likely in the right place
to start delivering usable code.

Eric
-

From: Kirill Korotaev
Date: Monday, March 12, 2007 - 2:23 am

exactly this is implemented in the current patches from Pavel.
the only difference is that filtering is not done in general LRU list,
which is not effective, but via per-container LRU list.
So the pointer on the page structure does 2 things:
- fast reclamation
- correct uncharging of page from where it was charged
  (e.g. shared pages can be mapped first in one container, but the last unmap

Ouch, what makes you think so?
The fact that a page mapped into 2 different processes is charged only once?
Imho it is much more correct than the sum of process' RSS within container, due to:
1. it is clear how much container uses physical pages, not abstract items
2. shared pages are charged only once, so the sum of containers RSS is still


Thanks,
Kirill
-

From: Eric W. Biederman
Date: Tuesday, March 13, 2007 - 2:26 am

No the fact that a page mapped into 2 separate mm_structs in two
separate accounting domains is counted only once.  This is very likely
to happen with things like glibc if you have a read-only shared copy
of your distro.  There appears to be no technical reason for such a
restriction.

A page should not be owned.  

Going further unless the limits are draconian I don't expect users to
hit the rss limits often or frequently.  So in 99% of all cases page
reclaim should continue to be global.  Which makes me question messing
with the general page reclaim lists.

Now if the normal limits turn out to be draconian it may make sense to
split the first level of page lists by some reasonable approximation

Maybe.  The extra locking complexity gives me fits.  But in the grand
scheme of things it is minor as long as it is not user perceptible we
can fix it later.  I'm still wrapping my head around the weird fs concepts.

Eric
-

From: Kirill Korotaev
Date: Tuesday, March 13, 2007 - 8:43 am

I would be happy to propose OVZ approach then, where a page is tracked
with page_beancounter data structure, which ties together
a page with beancounters which use it like this:

page -> page_beancounter -> list of beancounters which have the page mapped

This gives a number of advantages:
- the page is accounted to all the VEs which actually use it.
- allows almost accurate tracking of page fractions used by VEs
  depending on how many VEs mapped the page.
- allows to track dirty pages, i.e. which VE dirtied the page
  and implement correct disk I/O accounting and CFQ write scheduling
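
To illustrate the page fractions idea, here is a toy calculation. It
assumes the simplest possible rule, an equal share of each page for every
VE that maps it; the function name and the equal-split rule are
assumptions for illustration and are not a description of the actual
OpenVZ page_beancounter code.

```python
from collections import defaultdict
from fractions import Fraction

def fractional_charges(page_users):
    """page_users: page -> set of containers (VEs) that have the page mapped."""
    charge = defaultdict(Fraction)
    for page, users in page_users.items():
        for ve in users:
            charge[ve] += Fraction(1, len(users))   # equal share per VE
    return dict(charge)


charges = fractional_charges({
    "glibc": {"ve1", "ve2"},      # shared page: half a page each
    "heap1": {"ve1"},             # private page: fully charged to ve1
})
```

With this rule the charges always sum to the number of physical pages in
use, which a per-group count of mapped pages does not guarantee.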

It is not that rare when containers hit their limits, believe me :/
In trusted environments - probably you are right, in hosting - no.

Thanks,
Kirill

-

From: Balbir Singh
Date: Tuesday, March 6, 2007 - 10:37 pm

The wording looks very familiar :-). It would be useful to add
"The reclaim logic is now container aware, when the container goes overlimit
the page reclaimer reclaims pages belonging to this container. If we are
unable to reclaim enough pages to satisfy the request, the process is

Yes, this is what I was planning to get to -- a per container LRU list.
But you have just one list, don't you need active and inactive lists?
When the global LRU is manipulated, shouldn't this list be updated as

The return codes of the functions are a bit confusing, ideally
container_try_to_free_pages() should return 0 on success. Also
res_counter_charge() has a WARN_ON(1) if the limit is exceeded.
The system administrator can figure out the details from failcnt,
I suspect when the container is running close to its limit,
dmesg will have too many WARNING messages.

How much memory do you try to reclaim in container_try_to_free_pages()?
With my patches, I was planning to export this knob to userspace with
a default value. This will help the administrator decide how much
of the working set/container LRU should be freed on reaching the limit.
I cannot find the definition of container_try_to_free_pages() in

This is not good, it won't give us LRU behaviour which is

Which part of the working set are we pushing out, this looks like
we are using FIFO to determine which pages to reclaim. This needs

This would lead to LRU churning, I would recommend using list_splice_tail()
instead. Since this code has a lot in common with isolate_lru_pages, it
would be nice to reuse the code in vmscan.c

NOTE: Code duplication is a back door for subtle bugs and solving the same

I see that the charges are not migrated. Is that good?
If a user could find a way of migrating his/her task from
one container to another, it could create an issue with
the user's task taking up a big chunk of the RSS limit.

Can we migrate any task or just the thread group leader?
In my patches, I allowed migration of just the ...
From: Pavel Emelianov
Date: Wednesday, March 7, 2007 - 12:27 am

Nope - res_counter_uncharge() has - this is an absolutely

At least one page. This is enough to make one page charge.
That's the difference from general try_to_free_pages() that

This is in patch #5.

Why not - recently used pages are in the head of the list.
Active/inactive state of the page is determined from its flags.

The idea of this list is to decrease the number of pages scanned

This algo works exactly like general try_to_free_pages() does.



Anyway - page migration may be done later with a

-

From: Pavel Emelianov
Date: Tuesday, March 6, 2007 - 7:58 am

Adds needed pointers to mm_struct and page struct,
places hooks to core code for mm_struct initialization
and hooks in container_init_early() to preinitialize
RSS accounting subsystem.
From: Eric W. Biederman
Date: Sunday, March 11, 2007 - 12:13 pm

An extra pointer in struct page is unlikely to fly.
Both because it increases the size of a size critical structure,
and because conceptually it is ridiculous.

If you are limiting the RSS size you are counting the number of pages in
the page tables.  You don't care about the page itself.

With the rmap code it is relatively straightforward to see if this is
the first time a page has been added to a page table in your rss
group, or if this is the last reference to a particular page in your
rss group.  The counters should only increment the first time a
particular page is added to your rss group.  The counters should only
decrement when it is the last reference in your rss subsystem.
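
The increment-on-first, decrement-on-last rule above can be illustrated with a small sketch. Note the substitution: instead of deriving the answer from rmap as suggested, this example keeps an explicit per-group, per-page mapping count, which is exactly the extra state debated later in the thread. All names and the fixed table sizes are hypothetical.

```c
#include <assert.h>

#define RS_GROUPS 2
#define RS_PAGES  4

/* Hypothetical bookkeeping: how many times each rss group maps each page. */
struct rss_sketch {
    unsigned int maps[RS_GROUPS][RS_PAGES];
    unsigned long rss[RS_GROUPS];   /* pages charged to each group */
};

/* Charge the group only when this is its first mapping of the page. */
void rss_map(struct rss_sketch *s, int grp, int page)
{
    assert(grp >= 0 && grp < RS_GROUPS && page >= 0 && page < RS_PAGES);
    if (s->maps[grp][page]++ == 0)
        s->rss[grp]++;
}

/* Uncharge the group only when its last mapping of the page goes away. */
void rss_unmap(struct rss_sketch *s, int grp, int page)
{
    assert(s->maps[grp][page] > 0);
    if (--s->maps[grp][page] == 0)
        s->rss[grp]--;
}
```

A glibc page mapped by several processes in each of two groups then counts once in each group's rss, however many processes share it inside a group.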

This allows important little cases like glibc to be properly accounted
for. One of the key features of a rss limit is that the kernel can
still keep pages that you need in-core, that are accessible with just
a minor fault.  Directly owning pages works directly against that

Eric
-

From: Kirill Korotaev
Date: Monday, March 12, 2007 - 9:16 am

as it was discussed multiple times (and according to OLS):
- it is not critical nowadays to expand struct page a bit in case
  accounting is on.
- it can be done w/o extending, e.g. via mapping page <-> container
  using hash or some other data structure.
You are fundamentally wrong if shared pages are concerned.
Imagine a glibc page shared between 2 containers - VE1 and VE2.
VE1 was the first who mapped it, so it is accounted to VE1
(rmap count was increased by it).
now VE2 maps the same page. You can't determine whether this page is mapped
to this container or another one w/o page->container pointer.
All the choices you have are:
a) do not account this page, since it is already accounted to some other VE.
b) account this page again to current container.

(a) is bad, since VE1 can unmap this page first, and the last user will be VE2.
Which means VE1 will be charged for it, while VE2 uncharged. Accounting screws up.

(b) is bad, since:
  - the same page is accounted multiple times, which makes it impossible
    to understand how many real memory pages a container needs/consumes
  - and because on container enter the process and its pages
    are essentially moved to another context, while accounting
Sorry, can't understand what you mean. It doesn't work against.
Each container has its own LRU. So if glibc has the most
often used pages - it won't be thrashed out.

Thanks,
Kirill
-

From: Dave Hansen
Date: Monday, March 12, 2007 - 9:48 am

Hi Kirill,

I thought we can always get from the page to the VMA.  rmap provides
this to us via page->mapping and the 'struct address_space' or anon_vma.
Do we agree on that?

We can also get from the vma to the mm very easily, via vma->vm_mm,
right?

We can also get from a task to the container quite easily.  

So, the only question becomes whether there is a 1:1 relationship
between mm_structs and containers.  Does each mm_struct belong to one
and only one container?  Basically, can a threaded process have
different threads in different containers?

It seems that we could bridge the gap pretty easily by either assigning
each mm_struct to a container directly, or putting some kind of
task-to-mm lookup.  Perhaps just a list like
mm->tasks_using_this_mm_list.
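
The page -> VMA -> mm -> container chain described above could look roughly like this. It is a sketch under the stated assumption that each mm_struct belongs to exactly one container. The struct names are stand-ins and not kernel types, and the simple `next` chain replaces the real rmap walk through page->mapping. This is also not the page_in_container() function referenced elsewhere in the thread.

```c
#include <assert.h>
#include <stddef.h>

struct container_sketch { int id; };

/* Assumes a 1:1 mm -> container relationship. */
struct mm_sketch { struct container_sketch *container; };

/* vm_mm mirrors the real field name; next links other VMAs mapping the page. */
struct vma_sketch {
    struct mm_sketch *vm_mm;
    struct vma_sketch *next;
};

/* Stand-in for what an rmap walk would yield for one page. */
struct page_sketch { struct vma_sketch *mappers; };

/* Returns 1 if any mapping of the page belongs to container c, else 0. */
int page_mapped_in(const struct page_sketch *page,
                   const struct container_sketch *c)
{
    assert(page != NULL && c != NULL);
    for (const struct vma_sketch *v = page->mappers; v; v = v->next)
        if (v->vm_mm->container == c)
            return 1;
    return 0;
}
```

The walk answers "does any task in this container map the page", but it cannot say which container touched the page first.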

Not rocket science, right?

-- Dave

-

From: Pavel Emelianov
Date: Monday, March 12, 2007 - 10:19 am

Not completely. When page is unmapped from the *very last*
user its *first* toucher may already be dead. So we'll never

No. The question is "how to get a container that touched the
page first" which is the same as "how to find mm_struct which
touched the page first". Obviously there's no answer to this
question unless we hold some direct page->container reference.
This may be a hash, a direct on-page pointer, or mirrored

This could work for reclamation: we scan through all the
mm_struct-s within the container and shrink their pages, but

-

From: Dave Hansen
Date: Monday, March 12, 2007 - 10:27 am

OK, but this is assuming that we didn't *un*account for the page when

Or, you keep track of when the last user from the container goes away,
and you effectively account it to another one.

Are there problems with shifting ownership around like this?

-- Dave

-

From: Pavel Emelianov
Date: Tuesday, March 13, 2007 - 12:10 am

That's exactly what we agreed on during our discussions:
When a page gets touched it is charged to this container.
When the page gets touched again by a new container it is NOT
charged to the new container, but keeps holding the old one
till it (the page) is completely freed. Nobody worried about the
fact that a single page can hold container for good.
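
A minimal sketch of the first-touch rule just described, with hypothetical names throughout: the direct owner pointer stands in for whatever page->container reference the patchset uses, and usage is counted in pages.

```c
#include <assert.h>
#include <stddef.h>

struct ft_container { unsigned long usage; };   /* pages charged */

struct ft_page {
    struct ft_container *owner;   /* first toucher; NULL while the page is free */
    unsigned int mapcount;        /* mappings from any container */
};

/* Map the page on behalf of container c. */
void ft_touch(struct ft_page *pg, struct ft_container *c)
{
    assert(pg != NULL && c != NULL);
    if (pg->mapcount++ == 0) {    /* first touch: charge c */
        pg->owner = c;
        c->usage++;
    }
    /* later touches, even from other containers, are not charged */
}

/* Drop one mapping; the charge is released only with the last one. */
void ft_untouch(struct ft_page *pg)
{
    assert(pg != NULL && pg->mapcount > 0);
    if (--pg->mapcount == 0) {    /* page completely freed */
        pg->owner->usage--;
        pg->owner = NULL;
    }
}
```

If VE1 touches a page, VE2 maps it too, and VE1 then unmaps, VE1 stays charged until VE2 also lets go. That is the "page can hold container for good" effect.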

OpenVZ beancounters work the other way (and we proposed this
solution when we first sent the patches). We keep track of

We can migrate page to another user but we decided

-

From: Balbir Singh
Date: Monday, March 12, 2007 - 10:21 am

These patches are very similar to what I posted at
                    http://lwn.net/Articles/223829/
In my patches, the thread group leader owns the mm_struct and all
threads belong to the same container. I did not have a per container
LRU, walking the global list for reclaim was a bit slow, but otherwise
my patches did not add anything to struct page.

I used rmap information to get to the VMA and then the mm_struct.
Kirill, it is possible to determine all the containers that map the
page. Please see the page_in_container() function of
http://lkml.org/lkml/2007/2/26/7.

I was also thinking of using the page table(s) to identify all pages
belonging to a container, by obtaining all the mm_structs of tasks
belonging to a container. But this approach would not work well for
the page cache controller, when we add that to our memory controller.

Balbir
-

From: Pavel Emelianov
Date: Tuesday, March 6, 2007 - 8:00 am

Pages are charged to their first touchers which are
determined using pages' mapcount manipulations in
rmap calls.
From: Eric W. Biederman
Date: Sunday, March 11, 2007 - 12:14 pm

NAK pages should be charged to every rss group whose mm_struct they
are mapped into.

Eric
-

From: Kirill Korotaev
Date: Monday, March 12, 2007 - 9:23 am

For these you essentially need per-container page->_mapcount counter,
otherwise you can't detect whether rss group still has the page in question being mapped
in its processes' address spaces or not.

1. This was discussed before and considered to be ok by all the resource management
   involved people.
2. this can be done with a-la page beancounters which are used in OVZ for shared
   fractions accounting. It's a next step forward.

If you know how to get "pages should be charged to every rss group whose mm_struct they
are mapped into" w/o additional pointer in struct page, please throw me an idea.

Thanks,
Kirill
-

From: Dave Hansen
Date: Monday, March 12, 2007 - 9:50 am

What do you mean by this?  You can always tell whether a process has a
particular page mapped.  Could you explain the issue a bit more.  I'm
not sure I get it.

-- Dave

-

From: Herbert Poetzl
Date: Monday, March 12, 2007 - 4:54 pm

OpenVZ wants to account _shared_ pages in a guest
different than separate pages, so that the RSS
accounted values reflect the actual used RAM instead
of the sum of all processes RSS' pages, which for
sure is more relevant to the administrator, but IMHO
not so terribly important to justify memory consuming
structures and sacrifice performance to get it right

YMMV, but maybe we can find a smart solution to the
issue too :)

best,
-

From: Eric W. Biederman
Date: Tuesday, March 13, 2007 - 2:58 am

I will tell you what I want.

I want a shared page cache that has nothing to do with RSS limits.

I want an RSS limit such that, once I know I can run a deterministic
application with a fixed set of inputs within it, I know it will
always run.

First touch page ownership does not give me anything useful
for knowing if I can run my application or not.  Because of page
sharing my application might run inside the rss limit only because
I got lucky and happened to share a lot of pages with another running
application.  If the next time I run it that other application isn't
running, my application will fail.  That is ridiculous.

I don't want sharing between vservers/VE/containers to affect how many
pages I can have mapped into my processes at once.

Now sharing is sufficiently rare that I'm pretty certain that problems
come up rarely.  So maybe these problems have not shown up in testing
yet.  But until I see proof that actually doing the accounting for
sharing properly has intolerable overhead, I want proper accounting,
not this hand waving that is only accurate on the third Tuesday of the
month.

Ideally all of this will be followed by smarter rss based swapping.
There are some very cool things that can be done to eliminate machine
overload once you have the ability to track real rss values.  

Eric
-

From: Nick Piggin
Date: Tuesday, March 13, 2007 - 3:25 am

Let's be practical here, what you're asking is basically impossible.

Unless by deterministic you mean that it never enters a non-trivial
syscall, in which case, you just want to know about maximum


It is basically handwaving anyway. The only approach I've seen with
a sane (not perfect, but good) way of accounting memory use is this
one. If you care to define "proper", then we could discuss that.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 
-

From: Eric W. Biederman
Date: Tuesday, March 13, 2007 - 9:01 am

Not per process; I want this on a group of processes, and yes, that
is all I want.  I just want accounting of the maximum RSS of

No.  I don't want the meaning of my rss limit to be affected by what
other processes are doing.  We have constraints of how many resources
the box actually has.  But I don't want accounting so sloppy that
processes outside my group of processes can artificially

I will agree that this patchset is probably in the right general ballpark.
But the fact that pages are assigned exactly one owner is pure nonsense.
We can do better.  That is all I am asking: for someone to at least attempt
to actually account for the rss of a group of processes and get the numbers
right when we have shared pages, between different groups of
processes.  We have the data structures to support this with rmap.

Let me describe the situation where I think the accounting in the
patchset goes totally wonky. 


Gcc as I recall maps the pages it is compiling with mmap.
If in a single kernel tree I do:
make -jN O=../compile1 &
make -jN O=../compile2 &

But set it up so that the two compiles are in different rss groups.
If I run them concurrently they will use the same files at the same
time and most likely because of the first touch rss limit rule even
if I have a draconian rss limit the compiles will both be able to
complete and finish.   However if I run either of them alone if I
use the most draconian rss limit I can that allows both compiles to
finish I won't be able to compile a single kernel tree.
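
The arithmetic behind this thought experiment can be sketched as below. The numbers are made up, and the even split of shared pages between first touchers is an assumption (the lucky case); nothing in first-touch accounting guarantees it.

```c
#include <assert.h>

/*
 * Largest charge any one group sees if shared_pages are split evenly
 * among nr_groups first touchers (rounded up). The even split is an
 * assumption made for illustration.
 */
unsigned long cmp_peak_charge(unsigned long shared_pages, unsigned int nr_groups)
{
    assert(nr_groups > 0);
    return (shared_pages + nr_groups - 1) / nr_groups;
}

/* Nonzero if every group stays within the rss limit under that split. */
int cmp_fits(unsigned long shared_pages, unsigned int nr_groups,
             unsigned long limit)
{
    return cmp_peak_charge(shared_pages, nr_groups) <= limit;
}
```

With a hypothetical 1000 shared pages and a limit of 600, cmp_fits(1000, 2, 600) holds while cmp_fits(1000, 1, 600) does not: two concurrent compiles fit, and a lone compile exceeds the same limit.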

The reason for the failure with a single tree (in my thought
experiment) is that the rss limit was set below what is actually
needed for the code to work.  When we were compiling two kernels and
they were mapping the same pages at the same time we could put the rss
limit below the minimum rss needed for the compile to execute and
still have it complete, because with first touch only one group
accounted for the pages and the other just leeched off the first, as
long as ...
From: Nick Piggin
Date: Tuesday, March 13, 2007 - 8:51 pm

Well don't you just sum up the maximum for each process?

Or do you want to only count shared pages inside a container once,

So what are you going to do about all the shared caches and slabs


Yeah it is not perfect. Fortunately, there is no perfect solution,
so we don't have to be too upset about that.

And strangely, this example does not go outside the parameters of
what you asked for AFAIKS. In the worst case of one container getting
_all_ the shared pages, they will still remain inside their maximum
rss limit.

So they might get penalised a bit on reclaim, but maximum rss limits
will work fine, and you can (almost) guarantee X amount of memory for
a given container, and it will _work_.

But I also take back my comments about this being the only design I
have seen that gets everything, because the node-per-container idea
is a really good one on the surface. And it could mean even less impact

I think it is simplistic.

Sure you could probably use some of the rmap stuff to account shared
mapped _user_ pages once for each container that touches them. And
this patchset isn't preventing that.

But how do you account kernel allocations? How do you account unmapped
pagecache?

What's the big deal so many accounting people have with just RSS? I'm
not a container person, this is an honest question. Because from my
POV if you conveniently ignore everything else... you may as well just
not do any accounting at all.

-- 
SUSE Labs, Novell Inc.
-

From: Balbir Singh
Date: Tuesday, March 13, 2007 - 11:42 pm

When that does happen and if a container hits its limit, with a LRU
per-container, if the container is not actually using those pages,
they'll get thrown out of that container and get mapped into the

With the proposed node-per-container, we will need to make massive core
VM changes to reorganize zones and nodes. We would want to allow

1. For sharing of nodes
2. Resizing nodes
3. Maybe more

With the node-per-container idea, it will be hard to control page cache
limits, independent of RSS limits or mlock limits.


We decided to implement accounting and control in phases

1. RSS control
2. unmapped page cache control
3. mlock control
4. Kernel accounting and limits

This has several advantages

1. The limits can be individually set and controlled.
2. The code is broken down into simpler chunks for review and merging.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Nick Piggin
Date: Tuesday, March 13, 2007 - 11:57 pm

Exactly. Statistically, first touch will work OK. It may mean some
reclaim inefficiencies in corner cases, but things will tend to

But a lot of that is happening anyway for other reasons (eg. memory
plug/unplug). And I don't consider node/zone setup to be part of the
"core VM" as such... it is _good_ if we can move extra work into setup
rather than have it in the mm.


I don't know that it would be particularly harder than any other
first-touch scheme. If one container ends up being charged with too
much pagecache, eventually they'll reclaim a bit of it and the pages

But this patch gives the groundwork to handle 1-4, and it is in a small
chunk, and one would be able to apply different limits to different types
of pages with it. Just using rmap to handle 1 does not really seem like a
viable alternative because it fundamentally isn't going to handle 2 or 4.

I'm not saying that you couldn't _later_ add something that uses rmap or
our current RSS accounting to tweak container-RSS semantics. But isn't it
sensible to lay the groundwork first? Get a clear path to something that
is good (not perfect), but *works*?

-- 
SUSE Labs, Novell Inc.
-

From: Balbir Singh
Date: Wednesday, March 14, 2007 - 12:48 am

Thanks, that's one of our goals, to keep it simple, understandable and

Yes, true, but what if a user does not want to control the page
cache usage in a particular container or wants to turn off

For (2), we have the basic setup in the form of a per-container LRU list
and a pointer from struct page to the container that first brought in

I agree with your development model suggestion. One of things we are going 
to do in the near future is to build (2) and then add (3) and (4). So far,
we've not encountered any difficulties on building on top of (1).

Vaidy, any comments?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Vaidyanathan Srinivasan
Date: Wednesday, March 14, 2007 - 6:25 am

Accounting becomes easy if we have a container pointer in struct page.
This can form the base ground for building controllers since any memory
related controller would be interested in tracking pages.  However we
still want to evaluate if we can build them without bloating the
struct page.  Pagecache controller (2) we can implement with container
pointer in struct page or container pointer in struct address space.

Building on this patchset is much simpler and we hope the bloat in
struct page will be compensated by the benefits in memory controllers
in terms of performance and simplicity.

Adding too many controllers and accounting parameters to start with
will make the patch too big and complex.  As Balbir mentioned, we have
a plan and we shall add new control parameters in stages.

--Vaidy
-

From: Nick Piggin
Date: Wednesday, March 14, 2007 - 6:49 am

The thing is, you have to worry about actually getting anything in the
kernel rather than trying to do fancy stuff.

The approaches I have seen that don't have a struct page pointer, do
intrusive things like try to put hooks everywhere throughout the kernel
where a userspace task can cause an allocation (and of course end up
missing many, so they aren't secure anyway)... and basically just
nasty stuff that will never get merged.

Struct page overhead really isn't bad. Sure, nobody who doesn't use
containers will want to turn it on, but unless you're using a big PAE
system you're actually unlikely to notice.

But again, I'll say the node-container approach of course does avoid
this nicely (because we already can get the node from the page). So
definitely that approach needs to be discredited before going with this

Everyone seems to have a plan ;) I don't read the containers list...
does everyone still have *different* plans, or is any sort of consensus
being reached?

-- 
SUSE Labs, Novell Inc.
-

From: Vaidyanathan Srinivasan
Date: Wednesday, March 14, 2007 - 7:43 am

Consensus?  I believe at this point we have a sort of consensus on the
base container infrastructure and the need for memory controller to
control RSS, pagecache, mlock, kernel memory etc.  However the
implementation and approach taken is still being discussed :)

--Vaidy

-

From: Kirill Korotaev
Date: Wednesday, March 14, 2007 - 9:16 am

User beancounters patch has got through all these...
The approach where each charged object has a pointer to the owner container,
who has charged it - is the easiest/cleanest way to handle
all the problems with dynamic context change, races, etc.

big PAE doesn't make any difference IMHO

But it lacks some other features:
1. page can't be shared easily with another container
2. shared page can't be accounted honestly to containers
   as fraction=PAGE_SIZE/containers-using-it
3. It doesn't help accounting of kernel memory structures.
   e.g. in OpenVZ we use exactly the same pointer on the page
   to track which container owns it, e.g. pages used for page
   tables are accounted this way.
4. I guess container destroy requires destroy of memory zone,
   which means write out of dirty data. Which doesn't sound
   good to me as well.
5. memory reclamation in case of global memory shortage
   becomes a tricky/unfair task.
6. You cannot overcommit. AFAIU, the memory should be granted
   to node exclusive usage and cannot be used by other containers,

hope we'll have it soon :)

Thanks,
Kirill

-

From: Nick Piggin
Date: Wednesday, March 14, 2007 - 10:01 pm

The pointer in struct page approach is a decent one, which I have
liked since this whole container effort came up. IIRC Linus and Alan
also thought that was a reasonable way to go.

I haven't reviewed the rest of the beancounters patch since looking
at it quite a few months ago... I probably don't have time for a

The issue is just that struct pages use low memory, which is a really
scarce commodity on PAE. One more pointer in the struct page means
64MB less lowmem.
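
One set of inputs that reproduces the 64MB figure is sketched below. The message does not state its inputs, so the 64GB PAE maximum, 4KB pages, and a 4-byte pointer on 32-bit are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extra memory consumed when every struct page grows by
 * extra_bytes_per_page, for a machine with ram_bytes of RAM.
 */
uint64_t struct_page_growth_bytes(uint64_t ram_bytes, uint64_t page_size,
                                  uint64_t extra_bytes_per_page)
{
    assert(page_size > 0);
    return (ram_bytes / page_size) * extra_bytes_per_page;
}
```

64GB / 4KB gives 2^24 struct pages; at 4 extra bytes each that is 2^26 bytes, or 64MB, all of it taken from lowmem on a 32-bit kernel.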

But PAE is crap anyway. We've already made enough concessions in the
kernel to support it. I agree: struct page overhead is not really

I think they could be shared. You allocate _new_ pages from your own
node, but you can definitely use existing pages allocated to other

Yes there would be some accounting differences. I think it is hard
to say exactly what containers are "using" what page anyway, though.

?

I haven't looked at any implementation, but I think it is fine for

I don't understand why? You can much more easily target a specific
container for reclaim with this approach than with others (because

I'm not sure about that. If you have a larger number of nodes, then
you could assign more free nodes to a container on demand. But I
think there would definitely be less flexibility with nodes...

I don't know... and seeing as I don't really know where the google

Good luck ;)

-- 
SUSE Labs, Novell Inc.
-

From: Balbir Singh
Date: Wednesday, March 14, 2007 - 10:44 pm

This patch is not really beancounters.

1. It uses the containers framework
2. It is similar to my RSS controller (http://lkml.org/lkml/2007/2/26/8)


Yes, but we break the global LRU. With these RSS patches, reclaim not
triggered by containers still uses the global LRU, by using nodes,

I think we have made some forward progress on the consensus.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Ethan Solomita
Date: Wednesday, March 28, 2007 - 1:15 pm

If we used Beancounters as Pavel and Kirill mentioned, that would 
keep track of each container that has referenced a page, not just the 
first container. It sounds like beancounters can return a usage count 
where each page is divided by the number of referencing containers (e.g. 
1/3rd if 3 containers share a page). Presumably it could also return a 
full count of 1 to each container.

    If we look at data in the latter form, i.e. each container must pay 
fully for each page used, then Eric could use that to determine real 
usage needs of the container. However we could also use the fractional 
count in order to do things such as charging the container for its 
actual usage. i.e. full count for setting guarantees, fractional for 
actual usage.
    -- Ethan
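
The two views described in this message, a full count per container and a fractional count divided among sharers, can be sketched as follows. The mapping table, its dimensions and the 4096-byte page size are hypothetical, and this is not how beancounters store the data.

```c
#include <assert.h>

#define EC_CONTAINERS 3
#define EC_PAGES      2
#define EC_PAGE_SIZE  4096ul   /* assumed page size */

/* maps[c][p] is nonzero when container c has page p mapped. */

/* Full view: each container pays the whole page for every page it maps. */
unsigned long ec_full_bytes(int maps[EC_CONTAINERS][EC_PAGES], int c)
{
    unsigned long total = 0;

    assert(c >= 0 && c < EC_CONTAINERS);
    for (int p = 0; p < EC_PAGES; p++)
        if (maps[c][p])
            total += EC_PAGE_SIZE;
    return total;
}

/* Fractional view: each page's cost is split among the containers mapping it. */
unsigned long ec_fractional_bytes(int maps[EC_CONTAINERS][EC_PAGES], int c)
{
    unsigned long total = 0;

    assert(c >= 0 && c < EC_CONTAINERS);
    for (int p = 0; p < EC_PAGES; p++) {
        unsigned long users = 0;

        for (int i = 0; i < EC_CONTAINERS; i++)
            users += maps[i][p] != 0;
        if (maps[c][p])            /* users >= 1 whenever c maps the page */
            total += EC_PAGE_SIZE / users;
    }
    return total;
}
```

The full view would feed limits and guarantees, while the fractional view approximates actual usage, matching the split proposed above.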

-

From: Kirill Korotaev
Date: Monday, March 12, 2007 - 10:07 am

When we do charge/uncharge we have to answer on another question:
"whether *any* task from the *container* has this page mapped", not the
"whether *this* task has this page mapped".

Thanks,
Kirill
-

From: Dave Hansen
Date: Monday, March 12, 2007 - 10:33 am

That's a bit more clear. ;)

OK, just so I make sure I'm getting your argument here.  It would be too
expensive to go looking through all of the rmap data for _any_ other
task that might be sharing the charge (in the same container) with the
current task that is doing the unmapping.  

The requirements you're presenting so far appear to be:

1. The first user of a page in a container must be charged
2. The second user of a page in a container must not be charged
3. A container using a page must take a diminished charge when 
   another container is already using the page.
4. Additional fields in data structures (including 'struct page') are
   permitted

What have I missed?  What are your requirements for performance?

I'm not quite sure how the page->container stuff fits in here, though.
page->container would appear to be strictly assigning one page to one
container, but I know that beancounters can do partial page charges.
Care to fill me in?

-- Dave

-

From: Eric W. Biederman
Date: Tuesday, March 13, 2007 - 2:43 am

Which is a questionable assumption.  Worst case we are talking a list
several thousand entries long, and generally if you are used by the same
container you will hit one of your processes long before you traverse
the whole list.

So at least the average case performance should be good.

It is only in the case when a page is shared between multiple
containers when this matters.

Eric
-

From: Cedric Le Goater
Date: Wednesday, March 14, 2007 - 8:37 am

you missed out an include in mm/migrate.c

cheers,

C.
Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
---
 mm/migrate.c |    1 +
 1 file changed, 1 insertion(+)

Index: 2.6.20/mm/migrate.c
===================================================================
--- 2.6.20.orig/mm/migrate.c
+++ 2.6.20/mm/migrate.c
@@ -28,6 +28,7 @@
 #include <linux/mempolicy.h>
 #include <linux/vmalloc.h>
 #include <linux/security.h>
+#include <linux/rss_container.h>
 
 #include "internal.h"

-

From: Pavel Emelianov
Date: Wednesday, March 14, 2007 - 8:45 am

From: Pavel Emelianov
Date: Tuesday, March 6, 2007 - 8:03 am

* container_try_to_free_pages() walks the container's
  page list and tries to shrink pages. This is based
  on try_to_free_pages() and Co code.
  Called from core code when no resource is left at the
  moment of page touching.

* container_out_of_memory() selects a process to be
  killed whose mm_struct belongs to the container in question.
  Called from core code when no resources are left and no
  pages were reclaimed.
From: Balbir Singh
Date: Friday, March 9, 2007 - 2:21 pm

Hi, Pavel,

Please find my patch to add LRU behaviour to your latest RSS controller.

Balbir Singh
Linux Technology Center
IBM, ISTL
From: Pavel Emelianov
Date: Sunday, March 11, 2007 - 1:41 am

Thanks for participation and additional testing :)

-

From: Pavel Emelianov
Date: Tuesday, March 6, 2007 - 8:04 am

Small and simple - each fork()/clone() is accounted
and rejected when limit is hit.
From: Paul Menage
Date: Tuesday, March 6, 2007 - 7:00 pm

Hi Pavel,


Why do you need a pointer added to task_struct? One of the main points
of the generic containers is to avoid every different subsystem and

There's no need to hold a reference here - by definition, the task's
container can't go away while the task is in it.

Also, shouldn't you have an attach() method to move the count from one
container to another when a task moves?

Paul
-

From: Pavel Emelianov
Date: Wednesday, March 7, 2007 - 12:13 am

The idea is:

Task may be "the entity that allocates the resources" and "the
entity that is a resource allocated".

When task is the first entity it may move across containers
(that is implemented in your patches). When task is a resource
it shouldn't move across containers like files or pages do.

More generally - allocated resources hold reference to original
container till they die. No resource migration is performed.


-

From: Paul Menage
Date: Thursday, March 8, 2007 - 6:49 am

Yes, but I disagree with the premise. The title of your patch is
"Account for the number of tasks within container", but that's not
what the subsystem does, it accounts for the number of forks within
the container that aren't directly accompanied by an exit.

Ideally, resources like files and pages would be able to follow tasks
as well. The reason that files and pages aren't easily migrated from
one container to another is that there could be sharing involved;
figuring out the sharing can be expensive, and it's not clear what to
do if two users are in different containers.

But in the case of a task count, there are no such issues with
sharing, so it seems to me to be more sensible (and more efficient) to
just limit the number of tasks in a container.

i.e. when moving a task into a container or forking a task within a
container, increment the count; when moving a task out of a container
or when it exits, decrement the count.

With your approach, if you were to set the task limit of an empty
container A to 1, and then move a process P from B into A, P would be
able to fork a new child, since the "task count" would be 0 (as P was
being charged to B still). Surely the fact that there's 1 process in A
should prevent P from forking?
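
The counting scheme proposed in this message could be sketched as below. The names are hypothetical and unrelated to the res_counter API in the patchset, and rejecting an attach when the destination is already at its limit is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct tc_container {
    unsigned long nr_tasks;   /* tasks currently in the container */
    unsigned long limit;
};

/* Called on fork within c, or when a task is moved into c. */
int tc_charge(struct tc_container *c)
{
    assert(c != NULL);
    if (c->nr_tasks >= c->limit)
        return -1;            /* limit hit: reject */
    c->nr_tasks++;
    return 0;
}

/* Called on exit, or when a task is moved out of c. */
void tc_uncharge(struct tc_container *c)
{
    assert(c != NULL && c->nr_tasks > 0);
    c->nr_tasks--;
}

/* Move one task from src to dst; the count follows the task. */
int tc_attach(struct tc_container *src, struct tc_container *dst)
{
    if (tc_charge(dst) != 0)
        return -1;
    tc_uncharge(src);
    return 0;
}
```

In the A/B example above, moving P into A brings A's count to 1, so a subsequent tc_charge() on A for P's fork is rejected.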

Paul
-

From: Pavel Emelianov
Date: Sunday, March 11, 2007 - 1:36 am

Sounds reasonable.
I'll take this into account when I make the next iteration.

-

From: Pavel Emelianov
Date: Tuesday, March 6, 2007 - 8:07 am

Simple again - increment usage counter at file open and
decrement at file close. Reject opening if limit is hit.
From: Balbir Singh
Date: Tuesday, March 6, 2007 - 11:52 pm

I have one problem with the patchset, I cannot compile
the patches individually and some of the code is hard
to read as it depends on functions from future patches.
Patch 2, 3 and 4 fail to compile without patch 5 applied.

Patch 1 failed to apply with a reject in kernel/Makefile
I applied it on top of 2.6.20 with all of Paul Menage's
patches (all 7).



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
-

From: Pavel Emelianov
Date: Wednesday, March 7, 2007 - 12:32 am

This sounds weird to me :( I've taken a stock 2.6.20
and applied Paul's patches. This is what this patchset
is applicable for.
-

From: Kirill Korotaev
Date: Wednesday, March 7, 2007 - 2:43 am

maybe Paul's patch should be taken w/o subsystems examples
(CKRM, UBC), i.e. first 3 patches only?

Kirill
-

From: Paul Menage
Date: Tuesday, March 6, 2007 - 7:02 pm

Can we not make sure that each subsystem registers itself before any
of its resources become usable? So the file counting subsystem should
register at some point before filp_open() becomes usable, and the
process counting subsystem should register before it's possible to
fork, etc.

Paul
-

From: Pavel Emelianov
Date: Wednesday, March 7, 2007 - 12:30 am

Actually all the subsystems I've sent became usable very early.
Much earlier than initcalls started. I didn't find where exactly

-
