Serge Hallyn posted a patchset introducing a per-process utsname namespace, explaining, "[it] can be used by openvz, vserver, and application migration to virtualize and isolate utsname info (i.e. hostname)." Andrew Morton replied with some of his own views on OS virtualization: "generally, I think that the whole approach of virtualising the OS so it can run multiple independent instances of userspace is a good one. It's an extension and a strengthening of things which Linux is already doing and it pushes further along a path we've been taking for many years. If done right, it's even possible that each of these featurettes could improve the kernel in its own right - better layering, separation, etc."
Andrew went on to discuss four possible ways of getting complete OS virtualization support merged into the mainline Linux kernel. His first suggestion was to decide up-front that Linux will have this capability, draw up a list of the necessary features, agree on which pieces are useful in their own right, and then begin merging them even if some are not immediately useful. His second suggestion was to merge only those features that prove useful to and usable by "a sufficiently broad group of existing Linux users". His third idea was to queue all the code in his own -mm tree, then merge it all at once into the mainline kernel when ready. Finally, he noted the possibility of someone maintaining a git tree which could be redistributed within -mm for wider testing until ready for merging, though he said he did not really like this approach because nobody else would review or understand the new code.
The thread continued with a discussion of the next logical piece to focus on. Andrey Savochkin suggested starting with the virtualization of network containers, "virtualization of networking presents a lot of challenges and decision-making points with respect to user-visible interfaces: proc, sysctl, netlink events (and netlink sockets themselves), and so on. This code will also become immediately useful as an improvement over chroot. I am sure that when we come to a mutually acceptable solution with respect to networking, virtualization of all other subsystems can be implemented and merged without many questions." Andrew agreed with the suggestion, "it sounds like that feature might be the most-likely-to-cause-maintainer-revolt one, in which case yes, it is absolutely definitely the one to start with."
From: Serge E. Hallyn [email blocked]
To: linux-kernel
Subject: [PATCH 0/9] namespaces: Introduction
Date: Thu, 18 May 2006 10:47:00 -0500
This patchset introduces a per-process utsname namespace. These can
be used by openvz, vserver, and application migration to virtualize and
isolate utsname info (i.e. hostname). More resources will follow, until
hopefully most or all vserver and openvz functionality can be implemented
by controlling resource namespaces from userspace.
Previous utsname submissions placed a pointer to the utsname namespace
straight in the task_struct. This patchset (and the last one) moves
it and the filesystem namespace pointer into struct nsproxy, which is
shared by processes sharing all namespaces. The intent is to keep
the taskstruct smaller as the number of namespaces grows.
Changes:
- the reference count on fs namespace and uts namespace now
refers to the number of nsproxies pointing to it
- some consolidation of namespace cloning and exit code to
clean up kernel/{fork,exit}.c
- passed ltp and ltpstress on smp power, x86, and x86-64
boxes.
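The nsproxy arrangement Serge describes can be pictured with a small toy model. This is a sketch in Python, not the kernel code: the class names, the `tasks` counter, and the `unshare_uts` helper are all hypothetical and exist only to illustrate the idea that a task holds one pointer to a shared bundle of namespaces, and that each namespace counts the bundles pointing at it rather than the tasks.

```python
class UtsNamespace:
    """Toy uts namespace: just a hostname plus a reference count."""
    def __init__(self, hostname):
        self.hostname = hostname
        self.refs = 0  # number of NsProxy objects pointing here, not tasks


class FsNamespace:
    """Toy filesystem namespace."""
    def __init__(self, root="/"):
        self.root = root
        self.refs = 0  # number of NsProxy objects pointing here


class NsProxy:
    """Bundle of namespace pointers shared by tasks that share all namespaces."""
    def __init__(self, uts, fs):
        self.uts = uts
        self.fs = fs
        self.tasks = 0  # number of tasks using this bundle (illustrative)
        uts.refs += 1
        fs.refs += 1


class Task:
    """Toy task: carries a single pointer regardless of how many namespaces exist."""
    def __init__(self, nsproxy):
        self.nsproxy = nsproxy
        nsproxy.tasks += 1

    def fork(self):
        # A plain fork shares the parent's bundle: one pointer copy.
        return Task(self.nsproxy)

    def unshare_uts(self):
        # Hypothetical helper: give this task a private copy of the uts
        # namespace while continuing to share the filesystem namespace.
        old = self.nsproxy
        new = NsProxy(UtsNamespace(old.uts.hostname), old.fs)
        new.tasks += 1
        self.nsproxy = new
        old.tasks -= 1
        if old.tasks == 0:
            # Last user of the old bundle: drop its namespace references.
            old.uts.refs -= 1
            old.fs.refs -= 1
```

In this model two tasks created by `fork()` leave the uts namespace's count at one, because only one bundle points at it; the count changes only when a task asks for a private namespace and a second bundle is created.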
From: Andrew Morton [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Thu, 18 May 2006 10:34:30 -0700
"Serge E. Hallyn" [email blocked] wrote:
>
> This patchset introduces a per-process utsname namespace. These can
> be used by openvz, vserver, and application migration to virtualize and
> isolate utsname info (i.e. hostname). More resources will follow, until
> hopefully most or all vserver and openvz functionality can be implemented
> by controlling resource namespaces from userspace.
>
Generally, I think that the whole approach of virtualising the OS so it can
run multiple independent instances of userspace is a good one. It's an
extension and a strengthening of things which Linux is already doing and
it pushes further along a path we've been taking for many years. If done
right, it's even possible that each of these featurettes could improve the
kernel in its own right - better layering, separation, etc.
The approach which you appear to be taking is to separate the bits of
functionality apart and to present them as separate works each of which is
reviewed-by, acceptable-to and will-be-used-by all of the interested
projects. That's ideal, and is very much appreciated.
All of which begs the question "now what?".
What we do _not_ want to do is to merge up a pile of infrastructural stuff
which never gets used. On the other hand, we don't want to be in a
position where nothing is merged into mainline until the entirety of
vserver &&/|| openvs is ready to be merged.
I see two ways of justifying a mainline merge of things such as this
a) We make an up-front decision that Linux _will_ have OS-virtualisation
capability in the future and just start putting in place the pieces for
that, even if some of them are not immediately useful.
I suspect that'd be acceptable, although I worry that we'd get
partway through and some issues would come up which are irreconcilable
amongst the various groups.
It would help set minds at ease if someone could produce a
bullet-point list of what features the kernel will need to get it to the
stage where "most or all vserver and openvz functionality can be
implemented by controlling resource namespaces from userspace." Then we
can discuss that list, make sure that everyone's pretty much in
agreement.
It would be good if that list were to identify which features are
useful to Linux in their own right, and which ones only make sense within
a whole virtualise-the-OS setup.
b) Only merge into mainline those feature which make sense in a
standalone fashion. eg, we don't merge this patchset unless the
"per-process utsname namespace" feature is useful to and usable by a
sufficiently broad group of existing Linux users.
I suspect this will be a difficult approach.
The third way would be to buffer it all up in -mm until everything is
sufficiently in place and then slam it all in. That might not be feasible
for various reasons - please advise..
A fourth way would be for someone over there to run a git tree - you all
happily work away, I redistribute it in -mm for testing and one day it's
all ready to merge. I don't really like this approach. It ends up meaning
that nobody else reviews the new code, nobody else understands what it's
doing, etc. It's generally subversive of the way we do things.
Eric, Kirill, Herbert: let us know your thoughts, please.
From: Herbert Poetzl [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Fri, 19 May 2006 14:42:35 +0200
On Thu, May 18, 2006 at 10:34:30AM -0700, Andrew Morton wrote:
> "Serge E. Hallyn" [email blocked] wrote:
> >
> > This patchset introduces a per-process utsname namespace. These can
> > be used by openvz, vserver, and application migration to virtualize and
> > isolate utsname info (i.e. hostname). More resources will follow, until
> > hopefully most or all vserver and openvz functionality can be implemented
> > by controlling resource namespaces from userspace.
>
> Generally, I think that the whole approach of virtualising the OS
> so it can run multiple independent instances of userspace is a good
> one. It's an extension and a strengthening of things which Linux is
> already doing and it pushes further along a path we've been taking
> for many years.
yes, I too think that Linux has been moving in that
direction for a long time now, maybe even too long, so
that other OSes (like BSD or Solaris) already managed
to get a virtualization layer up and running (although
maybe at a much simpler level than we plan to do)
> If done right, it's even possible that each of these featurettes could
> improve the kernel in its own right - better layering, separation,
> etc.
agreed, most 'features' will require a cleanup of some
otherwise completely untouched areas, which (hopefully)
will improve those areas ...
> The approach which you appear to be taking is to separate the bits
> of functionality apart and to present them as separate works each of
> which is reviewed-by, acceptable-to and will-be-used-by all of the
> interested projects. That's ideal, and is very much appreciated.
IMHO many things will make perfect sense on their own
even without the 'other' virtualizations or isolations.
With Linux-VServer it's an every day occurance, that
folks just 'cherry pick' the isolation features and build
their own level of virtual/isolated environment.
at this point, many thanks to Sam, Eric and Serge
who do a really good job in massaging patches :)
> All of which begs the question "now what?".
>
> What we do _not_ want to do is to merge up a pile of infrastructural
> stuff which never gets used. On the other hand, we don't want to be in
> a position where nothing is merged into mainline until the entirety of
> vserver &&/|| openvs is ready to be merged.
yes, I agree here, and I'm pretty sure that we are
still missing many 'stakeholders' here just because
we do not see all possible areas of use ... let me
give a simple example here:
"pid virtualization"
- Linux-VServer doesn't really need that right now.
we are perfectly fine with "pid isolation" here, we
only "virtualize" the init pid to make pstree happy
- Snapshot/Restart and Migration will require "full"
pid virtualization (that's where Eric and OpenVZ
are heading towards)
- OpenSSI and *Mosix require system wide pid spaces
which probably could be implemented with virtual
pid spaces as well
- many security addons provide something called pid
randomization, and I think they could probably
benefit from a virtual pid space, too
now does that mean that e.g. Linux-VServer is against
"pid virtualization"? well, we are mainly against all
_unnecessary_ overhead and strictly against losing the
ability to keep it simple for the user, i.e. somebody
who does not require all that stuff should be able to
pick the features (or spaces) she really needs ...
> I see two ways of justifying a mainline merge of things such as this
>
> a) We make an up-front decision that Linux _will_ have
> OS-virtualisation capability in the future and just start putting
> in place the pieces for that, even if some of them are not
> immediately useful.
as long as this doesn't automatically mean bloat, I'm
more than happy with such a decision ...
> I suspect that'd be acceptable, although I worry that we'd get
> partway through and some issues would come up which are
> irreconcilable amongst the various groups.
I'm pretty sure that we _will_ hit some issues, but
I'm also sure that we will be able to work out those
issues, after all the 'end user' has to decide what
should be in mainline and what not ...
> It would help set minds at ease if someone could produce a
> bullet-point list of what features the kernel will need to get
> it to the stage where "most or all vserver and openvz functionality
> can be implemented by controlling resource namespaces from
> userspace." Then we can discuss that list, make sure that
> everyone's pretty much in agreement.
excellent idea, will start preparing such a list
from our PoV, so that we can merge that with the
other lists ...
> It would be good if that list were to identify which features are
> useful to Linux in their own right, and which ones only make
> sense within a whole virtualise-the-OS setup.
that's probably the hardest part ...
> b) Only merge into mainline those feature which make sense in a
> standalone fashion. eg, we don't merge this patchset unless the
> "per-process utsname namespace" feature is useful to and usable
> by a sufficiently broad group of existing Linux users.
the question here is, who are the users and _how_
will they get the feature? I see the following cases
here, which might overlap ...
- the feature makes perfectly sense in mainline as
standalone feature (maybe even adds no overhead
and/or simplifies/generalizes design) but has no
direct 'user' per se (think private namespaces
or linux capabilities)
- the feature is used by a number of projects in
very different ways to improve or even realize
certain 'other' features (think ext2/3 xattrs)
- the feature (although it adds pretty much overhead
and/or complicates the design) is really useful
for everyday use, and most folks who discover it
do not understand how they could live without it
(think various attributes on vfs mounts :)
> I suspect this will be a difficult approach.
yes, but should 'we' decide to take this approach, we
can at least guarantee that we (Linux-VServer) will
try to make use of those new features as soon as they
appear in mainline (as we did until now)
> The third way would be to buffer it all up in -mm until everything
> is sufficiently in place and then slam it all in.
for me, that sounds like a pretty bad idea, at least
for the first steps -- though we might consider this
approach for the last 10 or 20 percent, when we just
have to put the pieces together ...
> That might not be feasible for various reasons - please advise..
>
> A fourth way would be for someone over there to run a git tree - you
> all happily work away, I redistribute it in -mm for testing and one
> day it's all ready to merge. I don't really like this approach. It
> ends up meaning that nobody else reviews the new code, nobody else
> understands what it's doing, etc. It's generally subversive of the way
> we do things.
let me say that I'm strictly against such an approach
as it would be very similar to merging any of the
existing projects without further mainline consideration
> Eric, Kirill, Herbert: let us know your thoughts, please.
thanks for your work and time, we appreciate it
best,
Herbert
From: Andrew Morton [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Fri, 19 May 2006 08:13:34 -0700
Herbert Poetzl [email blocked] wrote:
>
> let me
> give a simple example here:
Examples are useful.
> "pid virtualization"
>
> - Linux-VServer doesn't really need that right now.
> we are perfectly fine with "pid isolation" here, we
> only "virtualize" the init pid to make pstree happy
>
> - Snapshot/Restart and Migration will require "full"
> pid virtualization (that's where Eric and OpenVZ
> are heading towards)
snapshot/restart/migration worry me. If they require complete
serialisation of complex kernel data structures then we have a problem,
because it means that any time anyone changes such a structure they need to
update (and test) the serialisation.
This may be a show-stopper, in which case maybe we only need to virtualise
pid #1.
> - OpenSSI and *Mosix require system wide pid spaces
> which probably could be implemented with virtual
> pid spaces as well
>
> - many security addons provide something called pid
> randomization, and I think they could probably
> benefit from a virtual pid space, too
ok.
Anyway. Thanks, guys. It sound like most of this work will be nicely
separable so we can think about each bit as it comes along.
From: [email blocked] (Eric W. Biederman)
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Fri, 19 May 2006 10:27:32 -0600
Andrew Morton [email blocked] writes:
> Herbert Poetzl [email blocked] wrote:
>>
>> let me
>> give a simple example here:
>
> Examples are useful.
>
>> "pid virtualization"
>>
>> - Linux-VServer doesn't really need that right now.
>> we are perfectly fine with "pid isolation" here, we
>> only "virtualize" the init pid to make pstree happy
>>
>> - Snapshot/Restart and Migration will require "full"
>> pid virtualization (that's where Eric and OpenVZ
>> are heading towards)
>
> snapshot/restart/migration worry me. If they require complete
> serialisation of complex kernel data structures then we have a problem,
> because it means that any time anyone changes such a structure they need to
> update (and test) the serialisation.
There is a strict limit to what is user visible, and if it isn't user visible
we will never need it in a checkpoint. So internal implementation details
should not matter.
> This may be a show-stopper, in which case maybe we only need to virtualise
> pid #1.
Except we do need something for pid isolation, and a pid namespace is
quite possibly the light weight solution. If you can't see the pid it is
clearly isolated from you.
> Anyway. Thanks, guys. It sound like most of this work will be nicely
> separable so we can think about each bit as it comes along.
Yes, and there are enough issues it is significant.
Eric
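Eric's remark that a pid you cannot see is isolated from you, and Herbert's earlier note about virtualizing the init pid, can both be illustrated with a toy mapping. The following is a sketch in Python under stated assumptions: `PidNamespace`, `attach`, `lookup`, and `visible` are hypothetical names, and the real kernel design was still under discussion in this thread.

```python
class PidNamespace:
    """Toy pid namespace: maps container-local pids to host pids."""

    def __init__(self):
        self._next = 1        # the first process attached appears as pid 1
        self._to_host = {}    # local pid -> host pid

    def attach(self, host_pid):
        """Register a host process and return the pid seen inside the namespace."""
        vpid = self._next
        self._next += 1
        self._to_host[vpid] = host_pid
        return vpid

    def lookup(self, vpid):
        """Resolve a local pid; pids outside the map simply do not exist here."""
        return self._to_host.get(vpid)

    def visible(self):
        """List the pids a process inside the namespace could enumerate."""
        return sorted(self._to_host)
```

A process inside such a namespace can only name the pids in the map, so anything outside it is unreachable by construction, and the first process attached is seen as pid 1, which is the part Herbert says is needed to keep tools like pstree working.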
From: Andrew Morton [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Fri, 19 May 2006 09:40:47 -0700
[email blocked] (Eric W. Biederman) wrote:
>
> > Herbert Poetzl [email blocked] wrote:
> >>
> >> let me
> >> give a simple example here:
> >
> > Examples are useful.
> >
> >> "pid virtualization"
> >>
> >> - Linux-VServer doesn't really need that right now.
> >> we are perfectly fine with "pid isolation" here, we
> >> only "virtualize" the init pid to make pstree happy
> >>
> >> - Snapshot/Restart and Migration will require "full"
> >> pid virtualization (that's where Eric and OpenVZ
> >> are heading towards)
> >
> > snapshot/restart/migration worry me. If they require complete
> > serialisation of complex kernel data structures then we have a problem,
> > because it means that any time anyone changes such a structure they need to
> > update (and test) the serialisation.
>
> There is a strict limit to what is user visible, and if it isn't user visible
> we will never need it in a checkpoint. So internal implementation details
> should not matter.
Migration of currently-open sockets (for example) would require storing of
a lot of state, wouldn't it?
From: Dave Hansen [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Fri, 19 May 2006 13:17:12 -0700
On Fri, 2006-05-19 at 09:40 -0700, Andrew Morton wrote:
> Migration of currently-open sockets (for example) would require storing of
> a lot of state, wouldn't it?
In a word, yes. :)
I don't think the networking guys from either the OpenVZ project or IBM
were cc'd on this. Alexey, Daniel, can you elaborate, or point us to
any existing code?
-- Dave
From: Alexey Kuznetsov [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Sat, 20 May 2006 00:52:47 +0400
Hello!
> > Migration of currently-open sockets (for example) would require storing of
> > a lot of state, wouldn't it?
>
> In a word, yes. :)
Yes. But, actually, it is not "for example". Socket state is really far more
complicated thing than all the rest. I would say, migration of another
objects is mostly trivial thing.
Actually, what Andrew worried about:
> snapshot/restart/migration worry me. If they require complete
> serialisation of complex kernel data structures then we have a problem,
> because it means that any time anyone changes such a structure they need to
> update (and test) the serialisation.
The answer is: after user space processes referring to objects are suspended,
_surprizingly_, not so much of places, which have trouble with serialization
remain. Actually, no serialization additional to existing one is required.
Sockets are the most complicated, to suspend networking state, after
processes are frozen, we have to:
1. Block access from network.
2. Stop socket timers.
Only after this we can make a coherent snapshot. But it is an exception,
most of objects are in coherent state (all the VM, files etc. etc),
when processes are frozen.
> I don't think the networking guys from either the OpenVZ project or IBM
> were cc'd on this. Alexey, Daniel, can you elaborate, or point us to
> any existing code?
http://git.openvz.org
linux-2.6-openvz/kernel/cpt/. Particularly, kernel/cpt/cpt_socket*.c.
Hairy, but straighforward.
Alexey
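The ordering Alexey describes (freeze the processes, block access from the network, stop socket timers, and only then take the snapshot) can be sketched as a guard around the snapshot step. This is a toy model in Python, not the OpenVZ checkpoint code: the `Container` class, its flags, and the `checkpoint` function are hypothetical and only show why the socket snapshot must come last.

```python
class Container:
    """Toy container holding some socket state and three quiescence flags."""

    def __init__(self, sockets):
        self.sockets = dict(sockets)   # socket name -> state string
        self.frozen = False            # user space processes suspended
        self.network_blocked = False   # no packets can reach the sockets
        self.timers_stopped = False    # no timer can change socket state

    def snapshot_sockets(self):
        """Copy socket state, refusing to do so while it can still change."""
        if not (self.frozen and self.network_blocked and self.timers_stopped):
            raise RuntimeError("socket state may still change; snapshot would be incoherent")
        return dict(self.sockets)


def checkpoint(container):
    """Apply the steps in the order given in the thread, then snapshot."""
    container.frozen = True            # suspend processes first
    container.network_blocked = True   # 1. block access from network
    container.timers_stopped = True    # 2. stop socket timers
    return container.snapshot_sockets()
```

Calling `snapshot_sockets()` before the three flags are set raises an error, which mirrors the point that sockets, unlike most other objects, are not automatically in a coherent state once processes are frozen.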
From: Andrey Savochkin [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Fri, 19 May 2006 17:47:57 +0400
Hi Andrew,
What you are saying indeed makes a lot of sense.
We want to start merging virtualization code some way or another.
Yet, if we merge code step-by-step, we do not want a pile of unused
infrastructure for the beginning, which may happen to be not entirely useful
in the future, or even create obstacles for development.
And in the course of merging, we would like to overcome differences in
opinions of various group, and live happily ever after.
I have a practical proposal.
We can start with presenting and merging the most interesting part, network
containers. We discuss details, possible approaches, and related subsystems,
until networking is finished to its utmost detail.
This will create an example of virtualization of a non-trivial subsystem,
and we will have to agree on basic principles of virtualization of related
subsystems like proc.
Virtualization of networking presents a lot of challenges and decision-making
points with respect to user-visible interfaces: proc, sysctl, netlink events
(and netlink sockets themselves), and so on. This code will also become
immediately useful as an improvement over chroot.
I am sure that when we come to a mutually acceptable solution with respect to
networking, virtualization of all other subsystems can be implemented and
merged without many questions.
What do people think about this plan?
Best regards,
Andrey
From: Andrew Morton [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Fri, 19 May 2006 08:25:16 -0700
Andrey Savochkin [email blocked] wrote:
>
> I have a practical proposal.
> We can start with presenting and merging the most interesting part, network
> containers. We discuss details, possible approaches, and related subsystems,
> until networking is finished to its utmost detail.
> This will create an example of virtualization of a non-trivial subsystem,
> and we will have to agree on basic principles of virtualization of related
> subsystems like proc.
>
> Virtualization of networking presents a lot of challenges and decision-making
> points with respect to user-visible interfaces: proc, sysctl, netlink events
> (and netlink sockets themselves), and so on. This code will also become
> immediately useful as an improvement over chroot.
> I am sure that when we come to a mutually acceptable solution with respect to
> networking, virtualization of all other subsystems can be implemented and
> merged without many questions.
>
> What do people think about this plan?
It sounds like that feature might be the
most-likely-to-cause-maintainer-revolt one, in which case yes, it is
absolutely definitely the one to start with.
Because if it ends up that an acceptable approach cannot be found, and if
this feature is compulsory for any sane virtualisation implementation then
that's it - game over. We want to discover such blockers as early in the
process as possible.
From: Herbert Poetzl [email blocked]
Subject: Re: [PATCH 0/9] namespaces: Introduction
Date: Sat, 20 May 2006 23:24:55 +0200
On Fri, May 19, 2006 at 08:25:16AM -0700, Andrew Morton wrote:
> Andrey Savochkin [email blocked] wrote:
>>
>> I have a practical proposal. We can start with presenting and
>> merging the most interesting part, network containers. We discuss
>> details, possible approaches, and related subsystems, until
>> networking is finished to its utmost detail. This will create an
>> example of virtualization of a non-trivial subsystem, and we will
>> have to agree on basic principles of virtualization of related
>> subsystems like proc.
>>
>> Virtualization of networking presents a lot of challenges and
>> decision-making points with respect to user-visible interfaces:
>> proc, sysctl, netlink events (and netlink sockets themselves),
>> and so on. This code will also become immediately useful as an
>> improvement over chroot. I am sure that when we come to a mutually
>> acceptable solution with respect to networking, virtualization of
>> all other subsystems can be implemented and merged without many
>> questions.
>>
>> What do people think about this plan?
well, I think it is interesting ...
> It sounds like that feature might be the
> most-likely-to-cause-maintainer-revolt one, in which case yes,
> it is absolutely definitely the one to start with.
yes, I absolutely agree here, this will be one
of the tougher nuts to crack, and therefore it
might be an excellent candidate to proove that
the different virtualization camps can find an
acceptable solution .. together.
> Because if it ends up that an acceptable approach cannot be found,
> and if this feature is compulsory for any sane virtualisation
> implementation then that's it - game over.
this, OTOH is something I'm not convinced of,
because looking at BSD jails, I see a very simple
approach (only one IP, limiting binds) which seems
to be sufficient for all the BSD jails out there
this is probably something which does not meet the
requirements of fully blown distro virtualizations
but actually it might be more than sufficient for
'mainline' linux jails
> We want to discover such blockers as early in the process as
> possible.
yes, I would also appreciate if we could get some
support from the network folks, as I think, most
of them are already working into that direction
(think Van Jacobson's net channels, routing tables)
especially as the network virtualization brings up
a number of questions, which are not easily answered
like the following:
- what policy will be applied inside guests?
+ allow arbitrary packets/rules/routes
+ have some generic limits/basic rules
+ put policy into userspace
- how to 'connect' the virtual interfaces to
the real network?
+ via routing and bridging?
(means duplicate stack traversal and
therefore twice the overhead)
+ via split personality interfaces?
(less overhead, more complicated cases)
+ directly (only by isolation)
- at what level should the virtualization happen?
+ ethernet level (all protocols)
+ ip level (all ip based and control protocols)
+ udp/tcp level
best,
Herbert
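The simple jail-style approach Herbert mentions ("only one IP, limiting binds") can be sketched as a single check applied when a socket is bound. This is a minimal illustration in Python, not BSD or Linux-VServer code: the function name is hypothetical, and rewriting a wildcard bind to the jail's address is an assumption made here for the sake of the example.

```python
import ipaddress


def jail_bind_address(requested, jail_ip):
    """Return the address a jailed process actually binds to, or refuse."""
    req = ipaddress.ip_address(requested)
    jail = ipaddress.ip_address(jail_ip)
    if req.is_unspecified:
        # Wildcard bind (0.0.0.0 or ::): assumed to be narrowed to the jail's IP.
        return str(jail)
    if req == jail:
        # Binding to the jail's own address is allowed unchanged.
        return str(jail)
    # Any other local address belongs to the host or to another jail.
    raise PermissionError(f"bind to {req} not permitted inside jail {jail}")
```

For example, `jail_bind_address("0.0.0.0", "10.0.0.5")` yields the jail's own address, while a request for any other address is rejected. No separate network stack is involved, which is why this style of isolation is cheap but falls short of the full virtualization discussed elsewhere in the thread.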
Smells like 2.7
Doesn't this smell like a 2.7 fork?
There are pieces that will fit right into 2.6, but the "most-likely-to-cause-maintainer-revolt" features are interesting too. For the structure-change problem, 2.7 would be great if it were the "great stabilization" release: a clean and stable ABI for drivers (hardware manufacturers may not like to rewrite their drivers for each ABI change), clean and stable internal structures for the whole 2.7 series, ... Virtualization as presented here is also extremely interesting.
In the end, I would like to see a Debian-Sarge-like release, very long to stabilize, but very carefully checked.
I hope 2.7 will happen, because there are changes that I really want: cleaner VT separation, high quality software suspend, a multi-user device layer, ...
2.7 is here
2.7 has been here all along; we just happened to call it 2.6.9. Really, no feature is considered too big for 2.6 any more.
Please fork 2.7
There is a growing need for a fork. WLAN and this virtualization are two of many areas that should be cleaned up.
It would be nice if some work were started to port all drivers to cleaned-up infrastructures. Something like the WLAN architecture of OpenBSD and FreeBSD.
The best would be to drop all drivers that were not ported to the new infrastructures. The drivers could be added in later 2.8.x releases.
2.8.x.y is still a nice idea, but for smaller changes.
Virtualization (jail in FreeBSD), general infrastructures (WLAN in FreeBSD and OpenBSD) and KISS are the main reasons why I use BSDs at all.
I have used Linux as my primary OS both privately and at work for more than 10 years. The changes these days make me a bit worried.
Please clean up and give us some peace ;-)
tty
A tty rewrite is long awaited as well. Preferably in conjunction with X11.
I do not want any new feature
For all my (limited) needs, linux is pretty complete. I do not need any new feature to be happy (ok, maybe new drivers). I want (need/wish/whatever) some cleaning.
I saw that the VT problem is old code remaining from 1.x.y times, that suspend problems are drivers that do not fully comply with suspending interface, that multi-user input devices or user-specific USB hubs (device hubs) need more cleaning.
Linux IS a great, fast, high quality program. It does not really need more features. But it may need some cleaning (I am not a kernel hacker, I only relay other people's point of view).
Another example comes to mind: the progressive merge of the IDE driver into the more generic libATA. That's a good thing.
I do not want to criticize, because everything I see takes the right path. I am pretty confident in maintainers to keep up the good work.