Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.

From: Patrick McHardy
Date: Thursday, January 14, 2010 - 7:05 am

The attached largish patch adds support for "conntrack zones",
which are virtual conntrack tables that can be used to separate
connections from different zones, making it possible to handle multiple
connections with equal identities in conntrack and NAT.

A zone is simply a numerical identifier associated with a network
device that is incorporated into the various hashes and used to
distinguish entries in addition to the connection tuples. Additionally
it is used to separate conntrack defragmentation queues. An iptables
target for the raw table could be used alternatively to the network
device for assigning conntrack entries to zones.
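
A minimal sketch in C of what "incorporated into the various hashes" can look like: the lookup key carries the zone next to the connection tuple, and both the hash and the equality test take it into account. The struct and function names (zone_key, zone_key_hash, zone_key_equal), the IPv4-only tuple, and the FNV-1a hash are illustrative assumptions, not code from the patch; the kernel's conntrack hashing is different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified IPv4 connection tuple. */
struct tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Hypothetical lookup key: the tuple plus the zone identifier. */
struct zone_key {
    uint16_t     zone;
    struct tuple tuple;
};

/* FNV-1a over raw bytes; fields are fed one by one so struct padding
 * never leaks into the hash. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

uint32_t zone_key_hash(const struct zone_key *k)
{
    uint32_t h = 2166136261u;

    assert(k != NULL);
    h = fnv1a(h, &k->zone, sizeof k->zone);
    h = fnv1a(h, &k->tuple.src_ip, sizeof k->tuple.src_ip);
    h = fnv1a(h, &k->tuple.dst_ip, sizeof k->tuple.dst_ip);
    h = fnv1a(h, &k->tuple.src_port, sizeof k->tuple.src_port);
    h = fnv1a(h, &k->tuple.dst_port, sizeof k->tuple.dst_port);
    h = fnv1a(h, &k->tuple.proto, sizeof k->tuple.proto);
    return h;
}

/* Two entries match only if both the zone and the tuple match. */
bool zone_key_equal(const struct zone_key *a, const struct zone_key *b)
{
    return a->zone == b->zone &&
           a->tuple.src_ip == b->tuple.src_ip &&
           a->tuple.dst_ip == b->tuple.dst_ip &&
           a->tuple.src_port == b->tuple.src_port &&
           a->tuple.dst_port == b->tuple.dst_port &&
           a->tuple.proto == b->tuple.proto;
}
```

With a key like this, two connections with identical tuples but different zones are distinct entries in the same table.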

This is mainly useful when connecting multiple private networks using
the same addresses (which unfortunately happens occasionally) to pass
the packets through a set of veth devices and SNAT each network to a
unique address, after which they can pass through the "main" zone and
be handled like regular non-clashing packets and/or have NAT applied a
second time based f.i. on the outgoing interface.

Something like this, with multiple tunl and veth devices, each pair
using a unique zone:

  <tunl0 / zone 1>
     |
  PREROUTING
     |
  FORWARD
     |
  POSTROUTING: SNAT to unique network
     |
  <veth1 / zone 1>
  <veth0 / zone 0>
     |
  PREROUTING
     |
  FORWARD
     |
  POSTROUTING: SNAT to eth0 address
     |
  <eth0>

As probably everyone has noticed, this is quite similar to what you
can do using network namespaces. The main reason for not using
network namespaces is that it's an all-or-nothing approach: you can't
virtualize just connection tracking. Besides the difficulties in
managing different namespaces from f.i. an IKE or PPP daemon running
in the initial namespace, network namespaces have a quite large
overhead, especially when used with a large conntrack table.

I'm not too fond of this partial feature duplication myself, but I
couldn't think of a better way to do this without the downsides of
using namespaces. Having ...
From: jamal
Date: Thursday, January 14, 2010 - 8:05 am

I've had an equivalent discussion with B Greear (CCed) at one point on
something similar, curious if you solve things differently - couldn't
tell from the patch if you address it.
Comments inline:


Agreed that this would be a main driver of such a feature.
Which means that you need zones (or whatever noun other people use) to
work on not just netfilter, but also routing, ipsec etc.

The fundamental question i have is:
how do you deal with overlapping addresses?
i.e zone1 uses 10.0.0.1 and zone2 uses 10.0.0.1 but they are for

Unless there is a clever approach for overlapping IP addresses (my
question above), i dont see a way around essentially virtualizing the

This is a valid concern against the namespace approach. Existing tools
of course could be taught to know about namespaces - and one could
argue that if you can resolve the overlap IP address issue, then you

Elaboration needed.
You said the size in 64 bit increases to 152B per conntrack i think?
Do you have a hand-wave figure we can use as a metric to elaborate this
point? What would a typical user of this feature have in number of
"zones" and how many conntracks per zone? Actually we could also look
at extremes (huge number vs low numbers)...

You may also wanna look as a metric at code complexity/maintainability
of this scheme vs namespace (which adds zero changes to the kernel).
I am pretty sure you will soon be "zoning" on other pieces of the net

My opinions above.

BTW, why not use skb->mark instead of creating a new semantic construct?

cheers,
jamal

--

From: Patrick McHardy
Date: Thursday, January 14, 2010 - 8:37 am

It's basically the same, except that this patch uses ct_extend

Routing already works fine. I believe IPsec should also work already,

The zone is set based on some other criteria (in this case the
incoming device). The packets make one pass through the stack
to a veth device and are SNATed in POSTROUTING to non-clashing
addresses. When they come out of the other side of the veth
device, they make a second pass through the network stack and
can be handled like any other packet.

So the setup would be (with 10.0.0.0/24 on if0 and if1):

ip rule add iif if0 lookup t0
ip route add default dev veth0 table t0
iptables -t nat -A POSTROUTING -o veth0 -j NETMAP --to 10.1.0.0/24
echo 1 >/sys/class/net/if0/nf_ct_zone
echo 1 >/sys/class/net/veth0/nf_ct_zone

ip rule add iif if1 lookup t1
ip route add default dev veth2 table t1
iptables -t nat -A POSTROUTING -o veth2 -j NETMAP --to 10.1.1.0/24
echo 2 >/sys/class/net/if1/nf_ct_zone
echo 2 >/sys/class/net/veth2/nf_ct_zone

The mapped packets are received on veth1 and veth3 with non-clashing
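
The NETMAP target in the rules above maps addresses one-to-one into another network, replacing the network part and keeping the host part. A small sketch of that arithmetic in C, assuming IPv4 addresses in host byte order; the function name netmap_addr is hypothetical and this is not the netfilter implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Map addr into new_net/prefix_len, keeping the host bits unchanged.
 * Addresses are IPv4 in host byte order (an assumption of this sketch). */
uint32_t netmap_addr(uint32_t addr, uint32_t new_net, unsigned prefix_len)
{
    uint32_t mask;

    assert(prefix_len <= 32);
    /* Shifting a 32-bit value by 32 is undefined, so handle /0 separately. */
    mask = prefix_len == 0 ? 0 : 0xffffffffu << (32 - prefix_len);
    return (new_net & mask) | (addr & ~mask);
}
```

For example, 10.0.0.5 arriving on if0 becomes 10.1.0.5 with --to 10.1.0.0/24, while the same 10.0.0.5 arriving on if1 becomes 10.1.1.5 with --to 10.1.1.0/24, so the two hosts no longer clash after the first pass.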


I don't think that's true. In any case it's completely impractical
to modify every userspace tool that does something with networking
and potentially make complex configuration changes to have all
those namespaces interact nicely. Currently they are simply not


I'm not sure whether there is a typical user for overlapping
networks :) I know of setups with ~150 overlapping networks.

The number of conntracks per zone doesn't matter since the
table is shared between all zones. network namespaces would
allocate 150 tables, each of the same size, which might be

There's not a lot of complexity, it's basically passing a numeric
identifier around in a few spots and comparing it. Something like

I've thought about that and I don't think that's necessary for this
use case. It's enough to resolve overlapping address ranges, everything

Because people are already using it for different purposes.
--

From: jamal
Date: Thursday, January 14, 2010 - 10:33 am

If you are using a netdev as a reference point, then I take it
if you add vlans it should be possible to do multiple zones on a single

Ok - makes sense. 
i.e NAT would work; and policy routing as well as arp would be fine.
Also it looks to be sufficiently useful to fit a specific use case you
are interested in.
But back to my question on routing, ipsec etc (and you may not be
interested in solving this problem, but it is what i was getting to
earlier). Lets take for example: 
a) network tables like SAD/SPD tables: how would you separate those on a
per-zone basis? i.e 10.0.0.1/zone1 could use different
policy/association than 10.0.0.1/zone2
b) dynamic protocols (routing, IKE etc): how do you do that without 


Agreed. But the major ones like iproute2 etc could be taught. We have
namespaces in the kernel already, over a period of time I think changing

My contention is that it is a lot less headache to just virtualize 
all the network stack and then use what you want than it is to go and
selectively change the network objects.
Note: if i wanted today i could run racoon on every namespace 
unchanged and it would work or i could modify racoon to understand

Thats what i was looking for ..
So the difference, to pick the 150 zones example so as to put a number
around it, is namespaces will consume 150 * X bytes (where X is the
overhead of a conntrack table) and your approach will be (X + 152) bytes,
correct?

I think the challenge is whether zones will have to encroach on other
net stack objects or not. You are already touching structure netdev...
A digression: TOS is different really - it has network level semantic. This 

tru dat - it only gives you one semantical axis and you need an
additional dimension in your case (namespace have that resolved via
struct net).

cheers,
jamal

--

From: Patrick McHardy
Date: Friday, January 15, 2010 - 3:15 am

Yes, you can assign a zone to each netdev. macvlan will also work.

Using a netfilter target for the raw table might be a better choice
on second thought though, it provides more flexibility and avoids

The selectors include an ifindex, which could be used to

In case of IPsec the outer addresses are different, it's only the
selectors which will have similar addresses. A keying daemon should
have no trouble with this. The ifindex would be needed in the
selectors though to make sure each policy is used for the correct
traffic.

A routing daemon is unrealistic to be used in this scenario, at

Yes, that might be useful in any case. But I don't think it would
even work for iproute or other standalone programs, a process can't
associate to an existing namespace except through clone(). So it
needs to run as child of a process already associated with the


No, to give some correct numbers. Assuming a conntrack table of
10MB (large, but reasonable depending on the number of connections)
we get an overhead of:

namespaces: 150 * 10MB memory use
"zones": 152 bytes increased code size

Both approaches additionally need one extra connection tracking
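
The comparison above can be written out as a short calculation. This is a sketch in C under the stated figures (150 overlapping networks, a 10MB table); treating "MB" as 2^20 bytes and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ull * 1024ull)

/* Network namespaces: one full-size conntrack table per namespace. */
uint64_t namespaces_table_bytes(unsigned networks, uint64_t table_bytes)
{
    assert(networks > 0);
    return (uint64_t)networks * table_bytes;
}

/* Zones: a single table shared by all zones, independent of their number. */
uint64_t zones_table_bytes(unsigned networks, uint64_t table_bytes)
{
    (void)networks;
    return table_bytes;
}
```

With these inputs the namespace approach allocates 150 * 10MB = 1500MB of table space, while the zone approach keeps the single 10MB table and adds the 152 bytes of code mentioned above.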

That will go away once I add a target for classification. I completely
agree that it's undesirable to add this in more spots, but this is meant
purely for being able to pass traffic through conntrack/NAT more than
once.
--

From: jamal
Date: Friday, January 15, 2010 - 8:19 am

you need to have user space knowledgeable of the mapping between an
ifindex and a zone. It may work with perhaps that info explicitly in

I think in general, it would be hard to deal with anything that requires
dynamic control where one or more peers have to discover each other once
you have IP overlap. You will have to change those user space apps.


The mechanics are not there, yet. But if i had sufficient permission,
and was able to find the namespaces when i ask and/or get events when it
is created it should be an issue of sending it a message.
The current approach to say migrate a veth via iproute2 requires we 
know the pid of the target namespace. Thats a usability issue.
I tried to muck with namespaces and if you use a library like lxc
you can do it - but it is a hack as it stands today (and merging

That is substantial if you are doing an embedded device.
But otherwise, RAM is so cheap that i would take usability
any day for an extra $5.

BTW, I think the zones approach will still use more than 10MB
in this case given it encompasses all "zones" whereas namespace only


Makes sense 
On a side note: I wouldnt mind seeing some field in struct
netdev for some general purpose grouping/IDing which could be
set from user space. 

cheers,
jamal



--

From: Eric W. Biederman
Date: Monday, February 22, 2010 - 1:46 pm

This is one of the long standing issues that we have always known
we needed to solve, but have not taken the time to do it.  Now that
the need is more real it looks about time to solve this one.

There are currently two problems.
1) A process is needed to hold a reference to the network namespace.
2) We use pids which are an awkward way of talking about network
   namespaces.

The solution I have been playing with involves.
- Using a file descriptor to refer to a network namespace.
- Using a trivial virtual filesystem to persistently hold onto
  a namespace without the need of a process.
- Have a convention of mounting the fs at something like
  /var/run/netns/<name>

That solves the naming problem, and it should allow iproute and
its kin to have support without being closely integrated with
lxc or anything else that creates namespaces.
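
Under such a convention, discovering namespaces is an ordinary directory read, with no namespace-specific API involved. A minimal sketch in C, assuming one directory entry per named namespace under a directory such as /var/run/netns; the function name count_ns_names is hypothetical.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Count the names found under dir (one entry per namespace by convention).
 * Returns -1 if the directory cannot be opened. */
int count_ns_names(const char *dir)
{
    DIR *d;
    struct dirent *e;
    int n = 0;

    assert(dir != NULL);
    d = opendir(dir);
    if (d == NULL)
        return -1;

    while ((e = readdir(d)) != NULL) {
        /* Skip the two entries every directory has. */
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        n++;
    }
    closedir(d);
    return n;
}
```

Because the names live in the filesystem, which namespaces a tool can see is decided by ordinary permissions and by what is mounted where.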

It is a big conversation, and it is something that has to be done
right but it looks like the problem is finally real enough that
it is time to solve it.

Eric
--

From: jamal
Date: Monday, February 22, 2010 - 2:55 pm

I didn't quite follow how i could use the above to do:
"ip ns <name/id> route add blah" from namespace0.

I tend to think in packets and wires instead of files;
How about just allowing a "control" channel from which
i could discover the namespace?
Example, assuming i have the right permissions:
1) listen to async events example on a multicast bus when
a namespace is created or destroyed. Provide me a little more info on
the created namespace such as its pid, name(?), types of namespace, etc
2) send a query to dump existing namespace or query by name, id etc.
I get the same details as above.

using genetlink should provide you with sufficient ability to do this.

cheers,
jamal

--

From: Eric W. Biederman
Date: Monday, February 22, 2010 - 4:17 pm

What I am thinking is:

"ip ns <name> route add blah" is:
fd = open("/var/run/netns/<name>");
sys_setns(fd);  /* Like unshare but takes an existing namespace */
/* Then the rest of the existing ip command */

"ip ns list" is:
dfd = open("/var/run/netns", O_DIRECTORY);
getdents(dfd, buf, count);

"ip ns new <name>" is:
unshare(CLONE_NEWNET);
fd = nsfd(NETNS);
mkdir("/var/run/netns/<name>");
mount("none", "/var/run/netns/<name>", "ns", 0, fd);

Using unix domain names means that which namespaces you see is under
control of userspace.  Which allows for nested containers (something I
use today), and ultimately container migration.
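
For comparison, a setns() call was later merged into mainline Linux (3.0) with the signature int setns(int fd, int nstype), so the argument order differs from the prototypes discussed in this thread. The sketch below uses that merged call, not the interface proposed here, to show the "open, then switch" sequence. The helper name enter_named_netns is hypothetical, and a successful call needs sufficient privilege (CAP_SYS_ADMIN).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Switch the calling thread into the network namespace pinned at
 * /var/run/netns/<name>. Returns 0 on success, -1 on any failure. */
int enter_named_netns(const char *name)
{
    char path[PATH_MAX];
    int len, fd, rc;

    assert(name != NULL);
    /* Accept a single path component only. */
    if (name[0] == '\0' || strchr(name, '/') != NULL)
        return -1;

    len = snprintf(path, sizeof path, "/var/run/netns/%s", name);
    if (len < 0 || (size_t)len >= sizeof path)
        return -1;

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* CLONE_NEWNET makes the kernel reject an fd of another namespace type. */
    rc = setns(fd, CLONE_NEWNET);
    close(fd);
    return rc;
}
```

After a successful call, sockets created by the caller belong to the named namespace, which is the behaviour the "ip ns <name> route add blah" example relies on.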

Using genetlink userspace doesn't result in a nestable implementation
unless I introduce yet another namespace, ugh.

Eric
--

From: jamal
Date: Tuesday, February 23, 2010 - 6:27 am

The other two below make some sense; For the above:
Does the point after sys_setns(fd) allow me to do io inside

The only problem that i see is events are not as nice. I take it i am 

Is it not just a naming convention that you are dealing with?
Example in your scheme above a nested namespace shows up as:
/var/run/netns/<name>/<nestedname>, no?

cheers,
jamal

--

From: Eric W. Biederman
Date: Tuesday, February 23, 2010 - 7:07 am

Yes.  My intention is that current->nsproxy->net_ns be changed.

Yes.  Inotify would at the very least see that mkdir.  You could also

No.  More like:

For the outer namespace:
/var/run/netns/<name>

For the inner namespace:
/some/random/fs/path/to/a/chroot/var/run/netns/<name>

For a doubly nested scenario:
/some/random/fs/path/to/a/chroot/some/other/random/fs/path/to/another/chroot/var/run/netns/<name>

Since I would be using mount namespaces instead of chroot it is not
strictly required that the fs paths nest at all.

Eric




--

From: jamal
Date: Tuesday, February 23, 2010 - 7:20 am

Added Daniel to the discussion..


I like it if it makes it as easy as it sounds;-> With lxc,
i essentially have to create a proxy process inside the
namespace that i use unix domain to open fds inside the ns.

It is not as nice but livable. I suppose attributes of the specific

Ok.

cheers,
jamal

--

From: Eric W. Biederman
Date: Tuesday, February 23, 2010 - 1:00 pm

The point of the mount is to hold a persistent reference to the
namespace without using a process.

The point of the to-be-written set_ns call is to change
the default network namespace of the process such that all future
open/bind/socket calls happen in the referenced network namespace.

There are a few stray places like sysfs where it is the mount point

Attributes of the specific namespace?

Eric


--

From: jamal
Date: Tuesday, February 23, 2010 - 4:09 pm

Well, example what is being un/shared etc. 

cheers,
jamal


--

From: Eric W. Biederman
Date: Tuesday, February 23, 2010 - 6:43 pm

My target will be 2.6.35.   There is an old prototype implementation

Mine is ;) I had a bad cold and didn't get through all of the patches
this development cycle, just all the prereqs.  I plan on getting that

Got it.  Implementation wise I'm going to stash a pointer
to the namespace in an inode or super block, simple.

Eric
--

From: Eric W. Biederman
Date: Thursday, February 25, 2010 - 1:57 pm

Introduce two new system calls:
int nsfd(pid_t pid, unsigned long nstype);
int setns(unsigned long nstype, int fd);

These two new system calls address three specific problems that can
make namespaces hard to work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the
  child of the original creator.
- Namespaces don't have names that userspace can use to talk
  about them.

The nsfd() system call returns a file descriptor that can
be used to talk about a specific namespace, and to keep
the specified namespace alive.

The fd returned by nsfd() can be bind mounted as:
mount --bind /proc/self/fd/N /some/filesystem/path
to keep the namespace alive indefinitely as long as
it is mounted.

open works on the fd returned by nsfd() so another
process can get a hold of it and do interesting things.

Overall that allows for persistent naming of namespaces
according to userspace policy.

setns() allows changing the namespace of the current process
to a namespace that originates with nsfd().

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

This is just my first pass at this, and not yet compile tested.
I was pleasantly surprised at how easy all of this was to implement.

I have verified mount will let me bind mount /proc/self/fd/N so
there is nothing special needed for the mount case, except
getting the reference counting and lifetime rules correct for
my filesystem objects.

 arch/x86/ia32/ia32entry.S          |    2 +
 arch/x86/include/asm/unistd_32.h   |    4 +-
 arch/x86/include/asm/unistd_64.h   |    4 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 fs/Makefile                        |    2 +-
 fs/nsfd.c                          |  278 ++++++++++++++++++++++++++++++++++++
 include/linux/magic.h              |    1 +
 include/linux/nsproxy.h            |    1 +
 include/linux/nstype.h             |    6 +
 kernel/nsproxy.c                   |   17 +++
 10 files changed, 315 ...
From: Daniel Lezcano
Date: Thursday, February 25, 2010 - 2:31 pm

Is it planned to support all the namespaces for 'nsfd' ?
 I mean will it be possible to specify an Or'ed combination of nstype to 
grab a reference for several namespaces at a time of the targeted process ?

for example : nsfd(1234, NSTYPE_NET | NSTYPE_IPC | NSTYPE_MNT)
--

From: Eric W. Biederman
Date: Thursday, February 25, 2010 - 2:49 pm

No, the plan is only one namespace at a time.

It would not be much of a change to support multiple namespaces,
but I don't think I want to go there.  Bitmaps filling up are
ugly and I don't see what would be gained.

It does make sense to support all of the namespaces we can support
with unshare, but with nstype as an enumeration not as a bitmap.

This is slightly better than the earlier version that used a netlink
socket as the reference as I can give it the semantics of a deleted
file and only when that file goes away drop the reference on the
namespace.  It is also better in that this interface can support all
of the namespaces, without adding yet another syscall.

Eric
--

From: Daniel Lezcano
Date: Thursday, February 25, 2010 - 3:13 pm

The idea I had in mind when I asked this question was if we can "move" a 
I suppose when you say "to support all of the namespaces we can support 
with *unshare*", you exclude the pid namespace which is created only 
with clone, right ? Do you think we can extend the concept to all the 
I like the idea :)

--

From: Eric W. Biederman
Date: Thursday, February 25, 2010 - 3:31 pm

Yes, and I think also the credential/uid namespace.

It is possible that this could be the basis for a general purpose
enter, but that is not the primary motivation.  I am after the
easy, simple cases.  So I can modify /sbin/ip to take advantage
of it.

Eric

--

From: Eric W. Biederman
Date: Friday, February 26, 2010 - 1:35 pm

Looking at this a bit more I am frustrated and relieved.

I was looking at what it would take to join an arbitrary mount
namespace and I realized it is completely non-obvious what fs->root
and fs->pwd should be set to.

If I leave them untouched the new mount namespace is useless,
as all path lookups will give results in a different mount namespace,
so not even mount or umount can be used.

I can not change fs->root to mnt_ns->root as that is rootfs and there
is always something mounted on top so I can not use that.

In comparison an unshare of the mount namespace doesn't have to move
fs->root or fs->pwd at all and only has to update their mounts to
the corresponding mounts in the new mount namespace.

I might be able to find the topmost root filesystem and put at least
root there, but I'm not particularly fond of that option.

Eric





--

From: Matt Helsley
Date: Thursday, February 25, 2010 - 2:46 pm

Hi Eric,

	Seems like an ok concept to me. Did you try doing this with
anon_inodes and bind mounting the /proc/<pid>/fd/N as above to keep
them alive and name them?

Cheers,
	-Matt Helsley
--

From: Eric W. Biederman
Date: Thursday, February 25, 2010 - 2:54 pm

I used a normal file.  anon_inodes strictly speaking might work, but they
keep their state in the struct file not in the struct dentry.  So even
if the anon_inodes survived they would not be good for anything.  Otherwise
I would have just reused the anon_inodes.

Eric
--

From: Eric W. Biederman
Date: Thursday, February 25, 2010 - 5:53 pm

Of course this part doesn't work in my patch because I have the wrong
mnt_ns on my mount and MS_NOUSER on my superblock.

MS_NOUSER is easy to get past.  Getting a vfsmount in the proper mnt
namespace could be tricky.

Eric
--

From: Matt Helsley
Date: Thursday, February 25, 2010 - 6:09 pm

Is this check preliminary? In the future would we check against the
owner of the target namespace too? Naturally that will require tagging
each namespace with an owner but I thought that was already part of the
plan...

Cheers,
	-Matt Helsley
--

From: Eric W. Biederman
Date: Thursday, February 25, 2010 - 6:26 pm

We aren't modifying the namespace here so namespace owners are
irrelevant here.

We are modifying the process so we need to have CAP_SYS_ADMIN in the
processes credential/uid namespace.

Eric
--

From: Eric W. Biederman
Date: Thursday, February 25, 2010 - 8:15 pm

Introduce two new system calls:
int nsfd(pid_t pid, unsigned long nstype);
int setns(unsigned long nstype, int fd);

These two new system calls address three specific problems that can
make namespaces hard to work with.
- Namespaces require a dedicated process to pin them in memory.
- It is not possible to use a namespace unless you are the
  child of the original creator.
- Namespaces don't have names that userspace can use to talk
  about them.

The nsfd() system call returns a file descriptor that can
be used to talk about a specific namespace, and to keep
the specified namespace alive.

The file descriptor returned from nsfd has the lifetime
semantics of a deleted file.  As long as the fd is
open or it is bind mounted into the filesystem
namespace the namespace will be kept alive.
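
The "lifetime semantics of a deleted file" can be seen with any regular file: once the name is unlinked, the object stays alive for as long as a descriptor refers to it. The sketch below demonstrates this in C with a temporary file. It illustrates the analogy only; it is not the nsfd implementation, and the function name unlinked_file_survives is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if data written to a file is still readable through the open
 * descriptor after the file's name has been removed, 0 if not, and -1 if
 * the temporary file could not be set up. */
int unlinked_file_survives(void)
{
    char path[] = "/tmp/nsfd-demo-XXXXXX";
    const char msg[] = "still here";
    char buf[sizeof msg] = { 0 };
    ssize_t n;
    int fd, name_gone;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (write(fd, msg, sizeof msg) != (ssize_t)sizeof msg) {
        close(fd);
        unlink(path);
        return -1;
    }
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }

    /* The name is gone, but the open descriptor still pins the object. */
    name_gone = access(path, F_OK) != 0;
    n = pread(fd, buf, sizeof buf, 0);
    close(fd);                      /* last reference: now it is freed */

    return name_gone && n == (ssize_t)sizeof msg &&
           memcmp(buf, msg, sizeof msg) == 0;
}
```

In the same way, a namespace referenced by an nsfd() descriptor would persist until the last open descriptor and the last bind mount are gone.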

The fd returned by nsfd() can be bind mounted as:
mount --bind /proc/self/fd/N /some/filesystem/path

open works on the fd returned by nsfd() so another
process can get a hold of it and do interesting things.

Overall that allows for naming of namespaces with
userspace policy.

setns() allows changing the namespace of the current process
to a namespace that originates with nsfd().

v2: The code is tested and works in the common case.
    The vfs has some of the strangest rules...

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

Enough for one day.  This code works, now it just needs
some more use/testing and careful scrutiny before 2.6.35 rolls
around.

 arch/x86/ia32/ia32entry.S          |    2 +
 arch/x86/include/asm/unistd_32.h   |    4 +-
 arch/x86/include/asm/unistd_64.h   |    4 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 fs/Makefile                        |    2 +-
 fs/nsfd.c                          |  320 ++++++++++++++++++++++++++++++++++++
 include/linux/magic.h              |    1 +
 include/linux/nsproxy.h            |    1 +
 include/linux/nstype.h             |    6 +
 kernel/nsproxy.c                   |   17 ++
 10 files changed, 357 ...
From: Eric W. Biederman
Date: Wednesday, March 3, 2010 - 1:50 pm

But userspace might not know for certain and want to check that it is
getting what it expected.  It could be confusing if you think you are
changing your network stack and all of a sudden sysv ipc shared memory
was changed instead.

As for the check that nstype is valid that happens earlier in setns.
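
A sketch of the two checks being discussed: first that nstype names a known namespace type, then that the object behind the fd is of that type. Everything here is hypothetical. The NSTYPE_* names follow the ones used elsewhere in this thread, but their values, the ns_handle struct, and the function name are assumptions, and this is not the code from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical enumeration (not a bitmap) of namespace types. */
enum { NSTYPE_NET = 1, NSTYPE_IPC, NSTYPE_MNT, NSTYPE_LAST = NSTYPE_MNT };

/* Hypothetical stand-in for what a namespace fd refers to. */
struct ns_handle {
    unsigned long type;     /* which kind of namespace this handle pins */
};

/* Returns 0 if the caller's expectation matches the handle, -EINVAL if
 * nstype is unknown or does not match the handle's actual type. */
int check_nstype(const struct ns_handle *h, unsigned long nstype)
{
    assert(h != NULL);
    if (nstype < NSTYPE_NET || nstype > NSTYPE_LAST)
        return -EINVAL;     /* not a namespace type at all */
    if (h->type != nstype)
        return -EINVAL;     /* e.g. asked for net, fd refers to ipc */
    return 0;
}
```

The second check is what protects a caller who believes it is switching its network stack from silently switching its IPC namespace instead.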

The plan is to post a patch series with all of the namespace types.


Eric
--

From: Jonathan Corbet
Date: Wednesday, March 3, 2010 - 1:29 pm

I assume that, at some future point when more than one namespace type
is supported, there will be a check to ensure that the type of the given
namespace matches nstype?  I can imagine all kinds of mayhem that could
result in the case of an accidental (or intentional) mismatch.

Actually, why does setns() require the nstype parameter at all?  A
namespace fd is certainly going to have to know what sort of namespace
it represents...

Thanks,

jon
--

From: Eric W. Biederman
Date: Friday, February 26, 2010 - 2:24 pm

The CLONE_NEWXXX series of bits has been a royal pain to work with,
and it appears to be an unnecessary complication for no gain.

Eric

--

From: Pavel Emelyanov
Date: Friday, February 26, 2010 - 2:34 pm

That's the answer for the "Yet another set..." question.

--

From: Eric W. Biederman
Date: Friday, February 26, 2010 - 2:42 pm

I am not certain which question you are asking:

Why don't we have an ability to enter all namespaces with one syscall
invocation?

Why don't we have a syscall that allows us to enter every namespace?

Eric

--

From: Oren Laadan
Date: Friday, February 26, 2010 - 2:58 pm

That's how I understood the question, and I, too, wonder why not ?

By the way, an alternative to using bitmap is to change the prototype
of setns() to accept an array of FD's:

	int setns(int *fds, int nfds);

So the process will atomically enter all the namespaces as specified
by the FDs.
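
The all-or-nothing behaviour can be modelled in userspace as "validate everything against a copy, then commit once". The following is a toy model in C, not kernel code: struct task_ns stands in for a task's set of namespaces, struct ns_handle for a namespace fd, and all names and values are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { NSTYPE_NET, NSTYPE_IPC, NSTYPE_MNT, NSTYPE_MAX };

/* Toy stand-in for a namespace fd: a type and the namespace it refers to. */
struct ns_handle {
    int type;
    int ns_id;
};

/* Toy stand-in for the set of namespaces a task currently belongs to. */
struct task_ns {
    int ns_id[NSTYPE_MAX];
};

/* Enter every namespace in h[0..n-1], or none of them.
 * Returns 0 on success, -EINVAL if any handle is invalid. */
int enter_all(struct task_ns *task, const struct ns_handle *h, size_t n)
{
    struct task_ns next;
    size_t i;

    assert(task != NULL && (h != NULL || n == 0));
    next = *task;                       /* work on a private copy */
    for (i = 0; i < n; i++) {
        if (h[i].type < 0 || h[i].type >= NSTYPE_MAX)
            return -EINVAL;             /* task is left untouched */
        next.ns_id[h[i].type] = h[i].ns_id;
    }
    *task = next;                       /* single commit point */
    return 0;
}
```

A loop of single setns() calls in userspace has no such commit point, so a failure halfway through leaves the process in a mix of old and new namespaces.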

--

From: Eric W. Biederman
Date: Friday, February 26, 2010 - 3:16 pm

We could.  Mostly I implemented things in the simplest way possible.

Semantically I know of no reason why we need to enter more than one namespace
at once, and I don't expect entering a namespace to be on anyone's fast
path so every last drop of performance was not crucial.

The only justification I can think of for more than one namespace at a
time is that because we have a synchronize_rcu() in the kernel we can
loop in the kernel much more quickly than we can loop in userspace.

Eric
--

From: Oren Laadan
Date: Friday, February 26, 2010 - 3:52 pm

Can't think of a specific scenario, but I wonder if there would
be a problem (security or otherwise) with a process that only
partly belongs to a container, even if for a short time ?

Oren.
--

From: Eric W. Biederman
Date: Friday, February 26, 2010 - 4:13 pm

If we can find an instance of that then there are fundamental problems
with setns.

The driving use case right now is for things like network namespaces where
userspace really wants to have several at once, and wants to be able to
control them all.

Eric

--

From: Pavel Emelyanov
Date: Saturday, February 27, 2010 - 1:30 am

Exactly. Please add at least the NSTYPE_NSPROXY or whatever, that will

This one is done in the patch, no?

Although the approach is OK for me, there's one design issue that came
to my mind recently: can we use this fd to wait for a namespace to
stop? I currently don't see this ability, but this is something I require
badly.


--

From: Eric W. Biederman
Date: Saturday, February 27, 2010 - 2:04 am

I have designed these file descriptors to pin the namespaces, so
waiting for them to exit isn't something they can do now.  It makes a
lot of sense to have similar ones that take weak references to the namespaces
that we can use to wait for a namespace to exit.

Eric
--

From: Pavel Emelyanov
Date: Saturday, February 27, 2010 - 2:21 am

Yes, I saw this from patches. Eric, I'd very much appreciate if we
work out a solution that will allow us to kill two birds with one stone.
I do not want to invent yet another bunch of system calls for "taking
weak reference".

As a "brain storm" start up. Can we use inotify/dnotify for this? 
Or maybe we should better equip the nsfd call with flags argument and 
add a flag for weak reference? In that case - how shall we get a 
notification that the namespace is dead? With poll? Maybe worth making
the sys_close return only when the namespace is dead (by providing a

--

From: Eric W. Biederman
Date: Saturday, February 27, 2010 - 2:42 am

joining a preexisting namespace is roughly the same problem as
unsharing a namespace.  We simply haven't figured out how to do it

Definitely.  I only consider the current interface to be a mushy not

We would want poll to work, anything else is a weird work-around.
The challenging part is that we don't have any infrastructure for
notifying when a namespace goes away.  So that has to be built before
we can wire it up to userspace.  I don't expect it is too difficult
but there is work to be done.
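
From the userspace side, "poll works" would look like waiting for a hang-up on a descriptor. The sketch below imitates that in C with a pipe: closing the write end plays the role of the namespace going away. It shows only the waiting pattern. No such notification exists for namespace fds in this thread, and the function names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd reports readable or hang-up within timeout_ms,
 * 0 on timeout, -1 on error. */
static int wait_gone(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int r = poll(&p, 1, timeout_ms);

    if (r < 0)
        return -1;
    return (r > 0 && (p.revents & (POLLIN | POLLHUP))) ? 1 : 0;
}

/* Returns 1 if no event is seen while the writer is alive and an event is
 * seen after it closes, 0 otherwise, -1 if the pipe cannot be created. */
int demo_exit_notification(void)
{
    int fds[2];
    int before, after;

    if (pipe(fds) != 0)
        return -1;

    before = wait_gone(fds[0], 0);      /* writer still open: nothing yet */
    close(fds[1]);                      /* stands in for "namespace exits" */
    after = wait_gone(fds[0], 1000);    /* hang-up is reported */
    close(fds[0]);

    assert(before >= 0 && after >= 0);
    return before == 0 && after == 1;
}
```

The missing piece described above is on the kernel side: something has to signal the waiters when the last strong reference to the namespace is dropped.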

Eric

--

From: Pavel Emelyanov
Date: Saturday, February 27, 2010 - 9:16 am

The pid may change after this for sure. What problems do you know
about it? What if we try to allocate the same PID in a new space
or return -EBUSY? This will be a good starting point. If we manage

OK. The interface is good. I just don't want you to send it for an inclusion

Poll is OK with me. As far as the notification is concerned - that's also
done in OpenVZ. If you are OK to wait for a week or two I can do it for net

--

From: Eric W. Biederman
Date: Saturday, February 27, 2010 - 12:08 pm

Parentage.  The pid is the identity of a process and all kinds of things
make assumptions in all kinds of strange places.  I don't see how
waitpid can work if you change the pid.


Sure.  I am getting a jump on 2.6.35, not aiming for inclusion this merge

Seems reasonable.

Eric
--

From: Pavel Emelyanov
Date: Saturday, February 27, 2010 - 12:29 pm

Agree. But what if we enter a pid space, which is a subnamespace of a current
one? In that case parent will still see the task by its old pid. We can restrict
first version of entering with this rule as well and this restriction will not

OK, but what if we try to allocate the same pid returning -EBUSY on failure?

My aim is to provide even a restricted enter. For most of the cases this
should work and make our lives easier. So two restrictions currently:
a) enter a sub namespace
b) allocate the same pid as we have now




Thanks,
Pavel
--

From: Eric W. Biederman
Date: Saturday, February 27, 2010 - 12:44 pm

When I was thinking about pid namespaces and unshare last time, the idea I came
to was that an unshare of the pid namespace should only affect which pid namespace
your children are in.

I remember that to do that there were a few cases where you would have to access
task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty

Replacing struct pid is guaranteed to do all kinds of nasty things with
signal handling and the like, de_thread is nasty enough and you are talking
something worse.  So if we can change pid namespaces without changing
the pid I am for it.


Eric
--

From: Daniel Lezcano
Date: Sunday, February 28, 2010 - 3:05 pm

I agree with all the points you and Pavel talked about but I don't
feel comfortable to have the current process to switch the pid namespace 
because of the process tree hierarchy (what will be the parent of the 
process when you enter the pid namespace for example). What is the 
difference with the sys_bindns or the sys_hijack, proposed a couple of 
years ago ?

I did a suggestion some weeks ago about a new syscall 'cloneat' where 
the child process becomes the child of the targeted process specified in 
the syscall. Maybe it would be interesting to replace the 'setns' by, or 
add, a 'cloneat' syscall with the file descriptor passed as parameter. 
The copy_process function shall not use the nsproxy of the caller but 
the one provided in the fd argument.

The newly created process becomes the child of the process where we 
retrieve the namespace with nsfd and this one has to 'waitpid' it (the 
caller of 'cloneat' cannot wait for it). It's a bit similar to the 
CLONE_PARENT flag, except the creation order is inverted (the father 
creates for the child).

So when entering the container, we specify the pid 1 of the container 
which is usually a child reaper.

Does it make sense ?

Thanks
   -- Daniel




--

From: Eric W. Biederman
Date: Monday, March 1, 2010 - 12:24 pm

I was not aiming at the general enter case.  There is a very specific case
in networking where we only need a network namespace, not full blown containers
so I was seeing what could be done to handle the easy case.

The big idea is solving the namespace naming issues with bind mounts and file
descriptors.  All of the rest is window dressing for that idea.

setns looks like the easy way but what is really needed for the network namespace

Essentially.  I am not hugely interested in solving the general case
if it takes us off into tangents about pid namespace semantics.

I have just realized that while the original use case for having unix
domain sockets able to work across network namespaces was a little
weak, there are much better arguments.  Operationally it is a game
changer.  In the case where you don't need to support migration it
allows direct access to your X server and greatly simplifies the
design of a server designed to start processes in your container.

Eric
--

From: Eric W. Biederman
Date: Monday, March 1, 2010 - 2:42 pm

I think what has changed is:
- We have mostly completed most of the namespace work.
- We have operational experience with the current namespaces.
- We have people not in the core containers group feeling the pain
  of not having some of these features.

So I think we are at a point where we can perhaps talk about these
things and finally solve some of these issues.

Clearly how to enter a container is on your and Pavel's mind as big
concerns.  I am aiming a little lower.

I am of two minds about my patches.  Right now they are a brilliant
proof of concept that we can name namespaces without needing a
namespace for the names of namespaces, and start to be a practical
solution to the join problem.   At the same time, I'm not certain
I like a solution that requires yet more syscalls so I ask myself
is there not yet a simpler way.

Hopefully we can resolve something before the next merge window.

Eric
--

From: Cedric Le Goater
Date: Tuesday, March 2, 2010 - 6:10 am

thinking aloud,

what if you made the nsproxy a vfs_inode ? we could then mount the nsfs
to do all sorts of fs operations on the object, like notifying easily
its deletion. we would need to find a meaningful name, probably the inode
number.

one syscall (nsfd) would be required to get the nsproxy of a task (pid).
you can't guess that from an inode number.


C.
--

From: Pavel Emelyanov
Date: Tuesday, March 2, 2010 - 8:03 am

The answer is - the one, that used to be. I see no problems with it.
Do you?
--

From: Jan Engelhardt
Date: Tuesday, March 2, 2010 - 8:14 am

But perhaps it could be named "namespacefd" instead of nsfd, to reduce 
potential clashes (because glibc will usually just use the same name 
when making the syscall available as a C function).
--

From: Eric W. Biederman
Date: Tuesday, March 2, 2010 - 2:45 pm

Maybe.  namespacefd seems like a real mouthful.  I agree nsfd might be
a bit non-obvious for a rarish syscall.

Eric

--

From: Sukadev Bhattiprolu
Date: Tuesday, March 2, 2010 - 2:19 pm

Pavel Emelyanov [xemul@parallels.com] wrote:
| > I agree with all the points you and Pavel you talked about but I don't 
| > feel comfortable to have the current process to switch the pid namespace 
| > because of the process tree hierarchy (what will be the parent of the 
| > process when you enter the pid namespace for example).
| 
| The answer is - the one, that used to be. I see no problems with it.
| Do you?

Just to be clear, when a process unshares its pid namespace, it takes
on additional pid nr (== 1) in the new namespace but retains its original
pid nr(s) in the parent (ancestor) namespaces right ?

i.e the process becomes the container-init of the new namespace. When it
exits, all its children belonging to the new namespace are killed too,
but any children in the parent namespace (i.e children created before
unshare()) are not killed.

After the unshare() the process will not be able to signal any children
it created before the unshare() (bc their active pid namespaces are
different)

Sukadev
--

From: Eric W. Biederman
Date: Tuesday, March 2, 2010 - 3:13 pm

The only case that I see as being simple and unsurprising worked a bit
differently:

We currently have:

ns_of_pid(task_pid(tsk))
tsk->nsproxy->pid_ns


I would reduce the usage of tsk->nsproxy->pid_ns as much as possible,
and use ns_of_pid(task_pid(tsk)) for all of the routine things that
need to know the pid namespace of a process.  Possibly even to the point
of reversing the order of the upid array so using it is more efficient.

I would leave tsk->nsproxy->pid_ns for use by fork/clone when allocating
a child's pid number.

The unsharing process would have to become the child reaper.  I think the first
child would become pid 1 in that pid namespace.


From an implementation point of view who gets pid 1 when the child_reaper is
not visible inside the pid namespace doesn't make much difference but we would
want to carefully look at the details so we minimize userspace confusion.


I don't think a process tree rooted at pid 0 is a show stopper.  It is
somewhat confusing but we already have a forked process tree today,
and user space certainly hasn't fallen over.  In the case of a join if you want
to live properly in the process tree you can daemonize and become a child
of init.




I think replacing a struct pid for another struct pid allocated in
descendant pid_namespace (but has all of the same struct upid values
as the first struct pid) is a disastrous idea.  It destroys the
uniqueness of struct pid and we have a lot of places where we check
that for equality of pid pointers, and that now would be broken.
Other things like proc directories also use a cached struct pid and
would start thinking the process was gone when it was not.

Eric
--

From: Sukadev Bhattiprolu
Date: Tuesday, March 2, 2010 - 5:07 pm

Eric W. Biederman [ebiederm@xmission.com] wrote:
| 
| I think replacing a struct pid for another struct pid allocated in
| descendant pid_namespace (but has all of the same struct upid values
| as the first struct pid) is a disastrous idea.  It destroys the

True. Sorry, I did not mean we would need a new 'struct pid' for an
existing process. I think we talked earlier of finding a way of attaching
additional pid numbers to the same struct pid.

Sukadev

--

From: Eric W. Biederman
Date: Tuesday, March 2, 2010 - 5:46 pm

I just played with this and if you make the semantics of unshare(CLONE_NEWPID)
to be that you become the idle task aka pid 0, and not the init task pid 1 the
implementation is trivial.

Eric
----

 arch/powerpc/platforms/cell/spufs/sched.c |    2 +-
 arch/um/drivers/mconsole_kern.c           |    2 +-
 fs/proc/root.c                            |    2 +-
 init/main.c                               |    9 ---------
 kernel/cgroup.c                           |    2 +-
 kernel/fork.c                             |   16 +++++++++++++---
 kernel/nsproxy.c                          |    2 +-
 kernel/perf_event.c                       |    2 +-
 kernel/pid.c                              |    8 ++++----
 kernel/signal.c                           |    9 ++++-----
 kernel/sysctl_binary.c                    |    2 +-
 11 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 4678078..b7f2026 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1094,7 +1094,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
 		LOAD_INT(c), LOAD_FRAC(c),
 		count_active_contexts(),
 		atomic_read(&nr_spu_contexts),
-		current->nsproxy->pid_ns->last_pid);
+		task_active_pid_ns(current)->last_pid);
 	return 0;
 }
 
diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index 3b3c366..4e6985e 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -125,7 +125,7 @@ void mconsole_log(struct mc_request *req)
 void mconsole_proc(struct mc_request *req)
 {
 	struct nameidata nd;
-	struct vfsmount *mnt = current->nsproxy->pid_ns->proc_mnt;
+	struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
 	struct file *file;
 	int n, err;
 	char *ptr = req->request.data, *buf;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b080b79..fbcd3f8 100644
--- a/fs/proc/root.c
+++ ...
From: Serge E. Hallyn
Date: Wednesday, March 3, 2010 - 8:38 am

Heh, and then (browsing through your copy_process() patch hunks) the next
forked task becomes the child reaper for the new pidns?  <shrug>  why not
I guess.

Now if that child reaper then gets killed, will the idle task get killed too?
And if not, then the idle task can just re-populate the new pidns with new
idle tasks...

If this brought us a step closer to entering an existing pidns that would
be one thing, but is there actually any advantage to being able to
unshare a new pidns?  Oh, I guess there is - PAM can then use it at
login, which might be neat.

-serge
--

From: Eric W. Biederman
Date: Wednesday, March 3, 2010 - 12:47 pm

I have to say that the semantics of my patch are unworkable for
unshare.  Unless I am mistaken for PAM to use it requires that the
current process fully change and become what it needs to be.
Requiring an extra fork to fully complete the process is a problem.

Scratch one bright idea.

Eric
--

From: Eric W. Biederman
Date: Thursday, March 4, 2010 - 2:45 pm

Maybe not.  I just looked and in the vast majority of cases the login
process goes like this.

{
	setup stuff include pam
	child = fork();
	if (!child) {
		setuid()
		exec /bin/bash
	}
	waitpid(child);

	pam and other cleanup
}

So an unshare of the pid namespace that doesn't really take effect
until we fork may actually be usable from pam, and in fact is probably
the preferred implementation.  It looks like neither openssh nor login
from util-linux-ng will cope properly with getting any pid back from
wait() except the pid of their child.  It looks like they both will
terminate.  Which means if you log in to a new pid namespace (where the
unsharing process becomes pid 1) and call nohup everything will get
killed and you will be logged out.

Eric
--

From: Jan Engelhardt
Date: Thursday, March 4, 2010 - 3:55 pm

Correct; I can tell from experience with pam_mount. GDM for example is 
very unhappy if you fork/exit processes in PAM modules and don't hide 
the fact by bending SIGCHLD from gdm_handler to mypam_handler (which 
itself is racy, suppose GDM re-set the SIGCHLD handler midway through).

(In this particular case however, I'd prefer if login programs like GDM 
just ignored any PIDs they did not spawn in the first place instead of 
moaning around.)
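
A wait loop of the kind suggested above might look like the following. This is only a sketch and is not taken from GDM, login or openssh; the function name wait_for_own_child() is made up for illustration. It keeps calling wait() and skips any pid other than the one child the caller forked.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Wait until `child` terminates, silently reaping any other pid that
 * wait() reports (for example processes reparented to us because we
 * are pid 1 of a pid namespace).
 * Returns the raw wait status of `child`, or -1 if wait() fails.
 */
int wait_for_own_child(pid_t child)
{
	int status;
	pid_t got;

	for (;;) {
		got = wait(&status);
		if (got == child)
			return status;
		if (got < 0 && errno != EINTR)
			return -1;
		/* Some other process exited; it is not ours to act on. */
	}
}
```

A caller would fork its session child, pass the returned pid to wait_for_own_child() and inspect the result with WIFEXITED()/WEXITSTATUS() as usual.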
--

From: Pavel Emelyanov
Date: Wednesday, March 3, 2010 - 9:50 am

This is not ... handy - if after enter you have pid 0 you obviously
can't perform 2 parallel enters. The way I see it:

As far as the numbers reported to the userspace are concerned:
1. task, that enters is still visible by its old parent by old pid
2. task, that enters gets some pid within the entering namespace
   and reports its parent pid to have pid 1 (init obviously doesn't
   care)
3. we _can_ try to allocate new pid equal to the old one so that
   glibc stays happy


As far as the pointers are concerned:
1. parent pointer doesn't change
2. task_pid(tsk) one (i.e. struct pid * one) _can_ change if
   a) we don't allow threads to enter (de_thread problem is handled)
   b) we don't allow leave the group/session, i.e. check, that there
      is the only one task that enters lives in its pgid/sid
   c) we wait for the quiescent state to pass by before destroying
      the old pid to handle race with sys_kill()

Thoughts/questions? ("This is a nasty problem" answer is not acceptable,
the real code problems/races please)
--

From: Eric W. Biederman
Date: Wednesday, March 3, 2010 - 1:16 pm

2 parallel enters?  I meant you have pid 0 in the entered pid namespace.
You have pid 0 because your pid simply does not map.

There is nothing that makes two parallel enters impossible in that.
Even today we have one thread per cpu that has task->pid == &init_struct_pid
which is pid 0.
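
To illustrate what "does not map" means, here is a small userspace model. It is a sketch only: the structs are cut-down stand-ins for the kernel's struct pid, struct upid and struct pid_namespace, MAX_LEVEL is an arbitrary limit chosen for the example, and the lookup is written from memory of how pid_nr_ns() behaves rather than copied from the tree.

```c
#include <stddef.h>

#define MAX_LEVEL 4			/* arbitrary depth for this model */

/* Toy stand-ins for the kernel structures; only the fields needed here. */
struct pid_namespace {
	int level;			/* 0 for the initial namespace */
	struct pid_namespace *parent;
};

struct upid {
	int nr;				/* pid number as seen in ns */
	struct pid_namespace *ns;
};

struct pid {
	int level;			/* deepest namespace the pid lives in */
	struct upid numbers[MAX_LEVEL];	/* one entry per level, 0..level */
};

/* The namespace a struct pid was allocated in (its "active" namespace). */
struct pid_namespace *ns_of_pid(const struct pid *pid)
{
	return pid->numbers[pid->level].ns;
}

/* Number of pid as seen from ns, or 0 when pid does not map into ns. */
int pid_nr_ns(const struct pid *pid, const struct pid_namespace *ns)
{
	if (ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
		return pid->numbers[ns->level].nr;
	return 0;
}
```

In this model a task that joins a child namespace without getting a new struct pid has no entry at the child's level, so any number of such tasks can all report 0 there while keeping distinct numbers in the namespace they came from.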

For the case of unshare where we are designed to be used with PAM I don't
think my proposed semantics work.  For a join needed an extra fork before

That doesn't handle the case of cached struct pids.  A good example is
waitpid, where it waits for a specific struct pid.  Which means that
allocating a new struct pid and changing task->pid will cause
waitpid(pid) to wait forever...

To change struct pid would require the refcount on struct pid to show
no references from anywhere except the task_struct.

At the cost of a little memory we can solve that problem for unshare
if we have an extra upid in struct pid, how we verify there is space
in struct pid I'm not certain.

I do think that at least until someone calls exec the namespace pids that are
reported to the process itself should not change.  That is kill and
waitpid etc.  Which suggests an implementation the opposite of what
I proposed.  With ns_of_pid(task_pid(current)) being used as the
pid namespace of children, and current->nsproxy->pid_ns not changing
in the case of unshare.

Shrug.

Or perhaps this is a case where we can implement join with
an extra process but we can't implement unshare, because the effect
cannot be immediate.

Eric
--

From: Pavel Emelyanov
Date: Friday, March 5, 2010 - 12:18 pm

Hm... One more proposal - can we adopt the planned new fork_with_pids system


I think this is OK to return -EBUSY for this. And fix the waitpid
respectively not to block this common case. All the others I think



--

From: Eric W. Biederman
Date: Friday, March 5, 2010 - 1:26 pm

The normal rules of parentage apply.   So the child will simply
see its parent as ppid == 0.  If that child daemonizes it will become
a child of the pid namespace's init.

This is a lot like something that gets started from call_usermodehelper.  Its
parent process is not a descendant of init either.


The implementation of the join is to simply change current->nsproxy->pid_ns.

In a lot of ways I like this idea of sys_hijack/sys_cloneat, and I
don't think anything I am doing fundamentally undermines it.  The use
case of doing things in fork is that there is automatic inheritance of
everything.  All of the namespaces and all of the control groups, and
possibly also the parent process.  It does have the high cost that the
process we are copying from must be stopped because there are no locks
that let us take everything.  I haven't looked at the recent proposals
to see if anyone has solved that problem cleanly.



If we can do a sys_hijack/sys_cloneat style of join, that means we can
afford a fork.  At which point my proposed pid namespace semantics
should be fine.

aka:
setns(NSTYPE_PID);
pid = fork();
if (pid == 0) {
	getpid() == 2; /* Or whatever the first free pid is in the joined pid namespace */
	getppid() == 0;
} else {
	pid == 6400; /* Or whatever the first free pid is in the original pid namespace */
	waitpid(pid);
}

That would probably work.  setsid() and setpgrp() have similar sorts
of restrictions.  That is both more challenging and more limiting than
the semantics that come out of my unshare(CLONE_NEWPID) patch.  So I

If all we do is populate an unused struct upid in struct pid there

Overall it sounds like the semantics I have proposed with
unshare(CLONE_NEWPID) are workable, and simple to implement.  The
extra fork is a bit surprising but it certainly does not
look like a show stopper for implementing a pid namespace join.
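
The "takes effect at the next fork" behaviour can be probed from userspace with something like the sketch below. It assumes a kernel that accepts unshare(CLONE_NEWPID) with the semantics discussed in this thread and a caller with enough privilege; neither is guaranteed. The function name is invented for the example, and the raw syscall() is used in the same spirit as the test programs posted in this thread.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_NEWPID
#define CLONE_NEWPID 0x20000000		/* value from the Linux uapi headers */
#endif

/*
 * Fork a helper, let it unshare its pid namespace and fork once more.
 * Returns 1 if that first child sees itself as pid 1, 2 if it sees some
 * other pid, and -1 if the experiment could not run (unshare refused,
 * fork failed, ...).  The helper keeps the caller's own setup untouched.
 */
int first_child_pid_after_unshare(void)
{
	int status;
	pid_t helper = fork();

	if (helper < 0)
		return -1;
	if (helper == 0) {
		pid_t child;

		if (syscall(SYS_unshare, CLONE_NEWPID) != 0)
			_exit(255);	/* not permitted or not supported */
		child = fork();
		if (child < 0)
			_exit(255);
		if (child == 0)
			_exit(getpid() == 1 ? 1 : 2);
		if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
			_exit(255);
		_exit(WEXITSTATUS(status));
	}
	if (waitpid(helper, &status, 0) != helper || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

Note that the helper waits for the child by the pid that fork() returned, which is a number from the helper's original pid namespace, as in the pseudocode above.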

Eric
--

From: Daniel Lezcano
Date: Saturday, March 6, 2010 - 7:47 am

If the normal rules of parentage apply, that means pid 0 has to wait for 
its child.
If we are in the scenario of pid 0, its child pid 1234 and we kill the 
pid 1 of the pid namespace, I suppose pid 1234 will be killed too.
The pid 0 will stay in the pid namespace and will be able to fork again a 
new pid 1.

I think Serge already reported that...

And also the rootfs for executing the command inside the container (eg. 
shutdown), the uid/gid (if there is a user namespace), the mount points, ...
But I suppose we can do the same with setns for all the namespaces and 
chrooting within the container rootfs.

What I see is a problem with the tty. For example, we cloneat the init 
process of the container which is usually /sbin/init but this one has 
its tty mapped to /dev/console, so the output of the exec'ed command 
I agree, it's some kind of "ghost" process.
IMO, with a bit of userspace code it would be possible to enter or exec 
a command inside a container with nsfd, setns.

+1 to test your patchset Eric :)

Just a mindless suggestion, the "nsopen" / "nsattach" syscall names 
should be more clear no ?

Jumping back, one question about the nsfd and the poll for waiting the 
end of the namespace.
If we have an opened file descriptor on a specific namespace, we grab 
a reference on this one, so the namespace won't be destroyed until we 
close the fd which is used to poll the end of the namespace, no ? Did I 
miss something ?

Thanks
  -- Daniel
--

From: Eric W. Biederman
Date: Saturday, March 6, 2010 - 1:48 pm

I expect zap_pid_ns_processes should also arrange so we cannot allocate any
more processes.  We certainly need to do something explicit or pid 1 won't
be allocated.  It might make sense to resurrect a pid namespace after it's



Not bad suggestions.

I am going to explore a bit more.  Given that nsfd is using the same
permission checks as a proc file, I think I can just make it a proc
file.  Something like "/proc/<pid>/ns/net".  With a little luck that

Not really.  The assumption was that there would be a very similar
file descriptor that we could use with poll.

Eric

--

From: Daniel Lezcano
Date: Saturday, March 6, 2010 - 2:26 pm

Mmh, yes. But that was just an idea, maybe a bit out of the scope you 
Ah ! yes. Good idea.




--

From: Eric W. Biederman
Date: Monday, March 8, 2010 - 1:32 am

I have taken a snapshot of my development tree and placed it at.



It is a hair more code to use proc files but nothing worth counting.

Probably the biggest thing I am aware of right now in my development
tree is in getting uids to pass properly between unix domain sockets.
I wound up writing this cred_to_ucred function.

Serge can you take a look and check my logic, and do you have
any idea of where we should place something like pid_vnr but
for the uid namespace?

void cred_to_ucred(struct pid *pid, const struct cred *cred,
		   struct ucred *ucred)
{
	ucred->pid = pid_vnr(pid);
	ucred->uid = ucred->gid = -1;
	if (cred) {
		struct user_namespace *cred_ns = cred->user->user_ns;
		struct user_namespace *current_ns = current_user_ns();
		struct user_namespace *tmp;

		if (likely(cred_ns == current_ns)) {
			ucred->uid = cred->euid;
			ucred->gid = cred->egid;
		} else {
			/* Is cred in a child user namespace */
			tmp = cred_ns;
			do {
				tmp = tmp->creator->user_ns;
				if (tmp == current_ns) {
					ucred->uid = tmp->creator->uid;
					ucred->gid = overflowgid;
					return;
				}
			} while (tmp != &init_user_ns);

			/* Is cred the creator of my user namespace,
			 * or the creator of one of its parents?
			 */
			for (tmp = current_ns; tmp != &init_user_ns;
			     tmp = tmp->creator->user_ns) {
				if (cred->user == tmp->creator) {
					ucred->uid = 0;
					ucred->gid = 0;
					return;
				}
			}

			/* No user namespace relationship so no mapping */
			ucred->uid = overflowuid;
			ucred->gid = overflowgid;
		}
	}
}

Eric
--

From: Daniel Lezcano
Date: Monday, March 8, 2010 - 9:54 am

Hi Eric,

thanks for the pointer.

I tried to boot the kernel under qemu and I got this oops:

Loading /lib/kbd/keymaps/i386/azerty/fr.map
Creating block device nodes.
Creating character device nodes.
Making device-mapper control node
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff812df7e7>] netlink_broadcast+0x1bd/0x384
PGD 3cfd0067 PUD 3cfc1067 PMD 0
Oops: 0002 [#1] DEBUG_PAGEALLOC
last sysfs file: /sys/class/firmware/timeout
CPU 0
Pid: 841, comm: modprobe Not tainted 2.6.33 #1 /
RIP: 0010:[<ffffffff812df7e7>]  [<ffffffff812df7e7>] 
netlink_broadcast+0x1bd/0x384
RSP: 0018:ffff88003cfc3ca8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88003ce947f0 RCX: ffff88003cf877f0
RDX: ffff88003f939dd0 RSI: ffff88003f939ef0 RDI: ffff88003f939e00
RBP: ffff88003cfc3d18 R08: ffff88003f939d98 R09: ffff88003f939e88
R10: ffff88003cf87818 R11: 0000000000000286 R12: ffff88003f939d98
R13: ffff88003f939e88 R14: ffff88003ce94818 R15: ffff88003ce94800
FS:  00007f23a90a06f0(0000) GS:ffffffff8161b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000003cfcd000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 841, threadinfo ffff88003cfc2000, task 
ffff88003d1db058)
Stack:
 0000000000000270 00000000000000d0 ffff88003f9377f0 ffffffff8203d630
<0> ffff88003f939f5c 0000000000000000 0000000000000001 0000000000000000
<0> ffff88003cf03000 0000000000000020 0000000000000004 ffff88003f939e88
Call Trace:
 [<ffffffff8121c1bd>] kobject_uevent_env+0x414/0x59b
 [<ffffffff81117432>] ? sysfs_create_file+0x25/0x27
 [<ffffffff8105eb13>] ? module_add_modinfo_attrs+0xd6/0xfc
 [<ffffffff8121c34f>] kobject_uevent+0xb/0xd
 [<ffffffff8105eba4>] mod_sysfs_setup+0x6b/0x99
 [<ffffffff810602aa>] load_module+0x12a2/0x16f1
 [<ffffffff81060b34>] sys_init_module+0x60/0x230
 [<ffffffff81002928>] ...
From: Eric W. Biederman
Date: Monday, March 8, 2010 - 10:29 am

I am clearly running an old userspace on my test machine.  No udev.
It looks like udev has a long standing netlink misfeature, where
it does not initialize NETLINK_CB....


From 8d85e3ab88718eda3d94cf8e1be14b69dae2b8f1 Mon Sep 17 00:00:00 2001
From: Eric W. Biederman <ebiederm@xmission.com>
Date: Mon, 8 Mar 2010 09:25:20 -0800
Subject: [PATCH] kobject_uevent:  Use the netlink allocator helper...

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 lib/kobject_uevent.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 920a3ca..b8229cc 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -216,7 +216,7 @@ int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
 
 		/* allocate message with the maximum possible size */
 		len = strlen(action_string) + strlen(devpath) + 2;
-		skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+		skb = nlmsg_new(len + env->buflen, GFP_KERNEL);
 		if (skb) {
 			char *scratch;
 
-- 
1.6.5.2.143.g8cc62
--

From: Daniel Lezcano
Date: Monday, March 8, 2010 - 12:57 pm

Thanks.

I was able to boot but I have the following warning:

------------[ cut here ]------------
WARNING: at net/netlink/af_netlink.c:198 netlink_sock_destruct+0x72/0xac()
Hardware name:
Modules linked in: [last unloaded: scsi_wait_scan]
Pid: 840, comm: nash-hotplug Tainted: G        W  2.6.33 #2
Call Trace:
 [<ffffffff812df182>] ? netlink_sock_destruct+0x72/0xac
 [<ffffffff8102ca29>] warn_slowpath_common+0x77/0xa4
 [<ffffffff8102ca65>] warn_slowpath_null+0xf/0x11
 [<ffffffff812df182>] netlink_sock_destruct+0x72/0xac
 [<ffffffff812bb2a4>] __sk_free+0x1e/0x118
 [<ffffffff812bb40d>] sk_free+0x19/0x1b
 [<ffffffff812e0dc2>] netlink_release+0x246/0x253
 [<ffffffff812b825a>] sock_release+0x1a/0x6b
 [<ffffffff812b82cd>] sock_close+0x22/0x26
 [<ffffffff810c7823>] __fput+0x11b/0x1d7
 [<ffffffff810c78f6>] fput+0x17/0x19
 [<ffffffff810c4ae2>] filp_close+0x67/0x72
 [<ffffffff8102e75c>] put_files_struct+0x6a/0xd4
 [<ffffffff8102e80d>] exit_files+0x47/0x4f
 [<ffffffff8102fe59>] do_exit+0x1eb/0x693
 [<ffffffff813864c2>] ? _raw_spin_unlock_irq+0x2b/0x31
 [<ffffffff81030373>] do_group_exit+0x72/0x9b
 [<ffffffff8103f37c>] get_signal_to_deliver+0x3a1/0x3c1
 [<ffffffff81001e8e>] do_notify_resume+0x8d/0x6ea
 [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
 [<ffffffff8102851e>] ? finish_task_switch+0x6a/0xb3
 [<ffffffff810284b4>] ? finish_task_switch+0x0/0xb3
 [<ffffffff813867aa>] ? retint_signal+0x11/0x87
 [<ffffffff810538c9>] ? trace_hardirqs_on_caller+0x110/0x13a
 [<ffffffff813867df>] retint_signal+0x46/0x87
---[ end trace d4a1e4cbaa70d63d ]---


And I have a kernel panic when exiting a network namespace using a macvlan:

linux-swk0 login: BUG: unable to handle kernel paging request at 
ffff880035475678
IP: [<ffffffff8128dbef>] macvlan_stop+0x54/0x7a
PGD 160b063 PUD 160f063 PMD 2aa067 PTE 35475160
Oops: 0002 [#1] DEBUG_PAGEALLOC
last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/net/eth0/flags
CPU 0
Pid: 10, comm: netns Tainted: G        W  2.6.33 #2 /
RIP: ...
From: Eric W. Biederman
Date: Monday, March 8, 2010 - 1:24 pm

Thanks for the bug report.

For the moment you might want to drop:
af_netlink:  Allow credentials to work across namespaces.
af_netlink: Debugging in case I have missed something.

Although I am curious if you hit my debugging messages in
netlink recv.

I guess if the goal is to test my nsfd bits you can drop everything
starting with my 'scm: Reorder scm_cookie.' commit.  The rest is what
it takes to get uids, gids and pids translated when they cross
namespaces on an af_unix or an af_netlink socket.

At least in the af_netlink case it appears clear I have missed
something.

This is a warning that netlink throws when the packet accounting is messed
up.  So it sounds like you are exercising another path that I failed

I wonder/hope this is simply the result of corruption from earlier problems.

Eric
--

From: Daniel Lezcano
Date: Monday, March 8, 2010 - 1:42 pm

I will look forward if I find more clues for this warning.

In the meantime I was able to enter the container with the following ugly 
program:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/param.h>

#define __NR_setns 300

int setns(int nstype, int fd)
{
    return syscall (__NR_setns, nstype, fd);
}

int main(int argc, char *argv[])
{
    char path[MAXPATHLEN];
    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
    const int size = sizeof(ns) / sizeof(char *);
    int fd[size];
    int i;

    if (argc != 3) {
        fprintf(stderr, "mynsenter <pid> <command>\n");
        exit(1);
    }

    for (i = 0; i < size; i++) {
       
        sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);

        fd[i] = open(path, O_RDONLY);
        if (fd[i] < 0) {
            perror("open");
            return -1;
        }

    }

    for (i = 0; i < size; i++) {

        if (setns(0, fd[i])) {
            perror("setns");
            return -1;
        }
    }

    execve(argv[2], &argv[2], NULL);
    perror("execve");

    return 0;
}

At first glance, no problem :)
--

From: Eric W. Biederman
Date: Monday, March 8, 2010 - 1:47 pm

No fork() so your process is completely in the pid namespace?

Eric

--

From: Daniel Lezcano
Date: Monday, March 8, 2010 - 2:12 pm

What I do is to attach "/bin/sh" to the container with this program.
The container is a VPS running busybox with the full isolation.

echo $$ gives the real pid.
All the forked processes appear in the pid namespace, they are visible 
through /proc with the virtual pid.
I am not able to change to the /proc/self directory (I assume this is 
normal).


--

From: Eric W. Biederman
Date: Monday, March 8, 2010 - 2:25 pm

I guess my meaning is I was expecting:
child = fork();
if (child == 0) {
	execve(...);
}
waitpid(child);

This puts /bin/sh in the container as well.

I'm not certain about the /proc/self thing I have never encountered that.
But I guess if your pid is outside of the pid namespace of that instance
of proc /proc/self will be a broken symlink.

Eric

--

From: Serge E. Hallyn
Date: Monday, March 8, 2010 - 2:49 pm

Hmm, worse than a broken symlink, will it be a wrong symlink if just
the right pid is created in the container?

-serge
--

From: Eric W. Biederman
Date: Monday, March 8, 2010 - 3:24 pm

It won't happen. readlink and followlink are both based on 
task_tgid_nr_ns(current, ns_of_proc).

Which fails if your process is not known in that pid namespace.
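
From userspace this can be checked with a few lines of C. This is a sketch only; proc_self_pid() is an illustrative name. It resolves /proc/self and reports -1 when the link cannot be read, which is the result one would expect for a process that is not known in the pid namespace of that proc instance.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Resolve the /proc/self symlink and return the pid it names,
 * or -1 if the link cannot be read or does not hold a number.
 */
long proc_self_pid(void)
{
	char target[32];
	char *end;
	long pid;
	ssize_t n = readlink("/proc/self", target, sizeof(target) - 1);

	if (n <= 0)
		return -1;
	target[n] = '\0';		/* readlink() does not terminate */
	pid = strtol(target, &end, 10);
	if (end == target || *end != '\0' || pid <= 0)
		return -1;
	return pid;
}
```

Comparing the result with getpid() shows whether the proc instance and the caller agree on the caller's pid.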

Eric
--

From: Daniel Lezcano
Date: Tuesday, March 9, 2010 - 3:03 am

Eric W. Biederman wrote:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <syscall.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/param.h>
#include <sys/wait.h>

#define __NR_setns 300

int setns(int nstype, int fd)
{
    return syscall (__NR_setns, nstype, fd);
}

int main(int argc, char *argv[])
{
    char path[MAXPATHLEN];
    char *ns[] = { "pid", "mnt", "net", "pid", "uts" };
    const int size = sizeof(ns) / sizeof(char *);
    int fd[size];
    int i;
    pid_t pid;
   
    if (argc != 3) {
        fprintf(stderr, "mynsenter <pid> <command>\n");
        exit(1);
    }

    for (i = 0; i < size; i++) {
       
        sprintf(path, "/proc/%s/ns/%s", argv[1], ns[i]);

        fd[i] = open(path, O_RDONLY | O_CLOEXEC);
        if (fd[i] < 0) {
            perror("open");
            return -1;
        }

    }
   
    for (i = 0; i < size; i++)
        if (setns(0, fd[i])) {
            perror("setns");
            return -1;
        }

    pid = fork();
    if (!pid) {

        fprintf(stderr, "mypid is %ld\n", syscall(__NR_getpid));

        execve(argv[2], &argv[2], NULL);
        perror("execve");

    }

    if (pid < 0) {
        perror("fork");
        return -1;
    }

    if (waitpid(pid, NULL, 0) < 0) {
        perror("waitpid");
    }

    return 0;
}

Waitpid returns an error:

waitpid: No child processes

The pid number returned by fork is the pid from the init pid namespace 
but it seems waitpid is waiting for a pid belonging to the child pid 
namespace.

waitpid
 -> wait4
   -> find_get_pid
     -> find_vpid
       -> find_pid_ns(nr, current->nsproxy->pid_ns);

The current->nsproxy->pid_ns is the one of the namespace we attached to. 
So the real pid returned by the fork does not exist in this pid namespace.
Maybe fork should return a pid number belonging to the current pid 
namespace we are attached to, no ?




--

From: Eric W. Biederman
Date: Tuesday, March 9, 2010 - 3:13 am

But it isn't.  It is.
           find_pid_ns(nr, task_active_pid_ns(current));
Which is:
           find_pid_ns(nr, ns_of_pid(task_pid(current)));
           

Do you not have my patch that changed that?

Eric

--

From: Daniel Lezcano
Date: Tuesday, March 9, 2010 - 3:26 am

argh ! right :)

Sorry for the noise. Works well now.

--

From: Daniel Lezcano
Date: Wednesday, March 10, 2010 - 2:16 pm

Eric,

at this point I did not run into any obvious bug and I was able to enter 
/ execute commands directly inside the container.

Excellent !

Thanks
  -- Daniel


--

From: Serge E. Hallyn
Date: Monday, March 8, 2010 - 10:07 am

Well my first thought was user_namespace, but I'm thinking kernel/cred.c is

	Hmm, I think you want to catch one level up - so the creator itself
	is in current_user_ns, so

	do {
		if (tmp->creator->user_ns == current_ns) {
			ucred->uid = tmp->creator->uid;
			ucred->gid = tmp->creator_gid;
			return;
		}
		tmp = tmp->creator->user_ns;

			should we start recording a user_ns->creator_gid
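
A userspace model of the lookup with that check moved one level up could look like this. It is a sketch under assumptions: struct user and struct user_ns are toy types rather than the kernel's, OVERFLOW_UID stands in for overflowuid, gids are left out, and the root of the hierarchy is passed in explicitly instead of using &init_user_ns.

```c
#include <stddef.h>

#define OVERFLOW_UID 65534		/* stand-in for the kernel's overflowuid */

struct user_ns;

/* Toy user: a uid that is meaningful inside one user namespace. */
struct user {
	unsigned int uid;
	struct user_ns *ns;
};

/* Toy user namespace: it only remembers who created it. */
struct user_ns {
	struct user *creator;
};

/*
 * Translate `who` into the uid a viewer in `viewer_ns` should see.
 * `init_ns` is the root of the hierarchy; its creator lives in init_ns.
 */
unsigned int map_uid(const struct user *who, const struct user_ns *viewer_ns,
		     const struct user_ns *init_ns)
{
	const struct user_ns *tmp;

	if (who->ns == viewer_ns)
		return who->uid;

	/* who lives below the viewer: report the creator of that branch,
	 * taken from the namespace whose creator is in viewer_ns. */
	for (tmp = who->ns; tmp != init_ns; tmp = tmp->creator->ns)
		if (tmp->creator->ns == viewer_ns)
			return tmp->creator->uid;

	/* who created the viewer's namespace or one of its ancestors. */
	for (tmp = viewer_ns; tmp != init_ns; tmp = tmp->creator->ns)
		if (tmp->creator == who)
			return 0;

	/* No user namespace relationship, so no mapping. */
	return OVERFLOW_UID;
}
```

With a chain init -> A (created by uid 1000) -> B (created by uid 500 in A), a user inside B is reported as 500 to viewers in A and as 1000 to viewers in init.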

--

From: Eric W. Biederman
Date: Monday, March 8, 2010 - 10:35 am

Eric

--

From: Serge E. Hallyn
Date: Monday, March 8, 2010 - 10:47 am

Oh, yeah, make user_ns->creator a cred, excellent idea - then we have
the LSM and capability fields cached as well.

-serge
--

From: Oren Laadan
Date: Wednesday, March 3, 2010 - 1:59 pm

For what it's worth, I think that this suggestion (cloneat) is so far
the cleanest way to allow a process to enter an existing namespace.

Oren.

--

From: Eric W. Biederman
Date: Wednesday, March 3, 2010 - 2:05 pm

If the goal is to enter a container you are probably right.  I don't
think I have seen how scary the cloneat code is.

At least for the network namespace there is a lot of value in being
able to just change that single namespace.  Having multiple logical
network stacks has its challenges but has a lot of practical
applications.  Especially when there is the possibility of private
ipv4 addresses overlapping, or you have interfaces where you never
want to forward between them but you want forwarding enabled.

Eric
--

From: Pavel Emelyanov
Date: Friday, February 26, 2010 - 2:35 pm

Worth changing them that way?
--

From: Eric W. Biederman
Date: Friday, February 26, 2010 - 2:49 pm

I don't think so.  They keep all of their state in struct file.  To be
usefully bind mounted you need to keep your state in the dentry or the
inode.

Ultimately what I have done is fix rootfs so it supports bind mounts and
used rootfs inodes.

Eric

--

From: Pavel Emelyanov
Date: Friday, February 26, 2010 - 2:13 pm

Yet another set of per-namespace IDs along with CLONE_NEWXXX ones?
I currently have a way to create all namespaces we have with one
syscall. Why don't we have an ability to enter them all with one syscall?
--

From: Matt Helsley
Date: Tuesday, February 23, 2010 - 4:49 pm

I think technically it's still held using processes, only now it's
much more indirect:

netns <- mount <- mount namespace(s) <- process(es)

The big difference is we'd be waiting for all the processes
sharing that mount (or dups of it in multiple mount namespaces) to
exit too -- not just those sharing the netns.

Using a mount requires keeping names for the namespaces themselves
in the kernel which is a problem we've largely avoided so far.
The nscgroup is an example of the messes that creates, I think. And it
further complicates c/r -- we'd need to checkpoint and recreate the
names of the namespaces too. So we'll need a namespace for the names of
the namespaces to make restart reliable, won't we? Makes my head spin...

Cheers,
	-Matt Helsley
--

From: Eric W. Biederman
Date: Tuesday, February 23, 2010 - 6:32 pm

True. The practical difference is that it doesn't require a dedicated

This is strictly different.  It may require a bit of extra support from
checkpoint/restart because it introduces some more user visible objects
but the names themselves are nothing special.  The name that userspace
sees and deals with is the name of the mount point.  No new namespaces
are required.

Eric
--

From: Serge E. Hallyn
Date: Tuesday, February 23, 2010 - 6:39 pm

Shouldn't be a big deal - assuming the mount is of a special type
for a network ns, we just record the objref for the checkpointed
network ns.  We don't need a namespace for the namespaces - we just
need unique names for the checkpoint image (the objref, which is
unique per netns).

Guess it really is about time that I work on some clean patches
for checkpoint/restart of mount namespaces and mounts.

-serge
--

From: Ben Greear
Date: Thursday, January 14, 2010 - 11:32 am

For small or simple cases this may be true, but it takes a lot of work
to make a complex user-space app that manages an arbitrary number of
interfaces and routing tables in an arbitrary number of network namespaces.
With the conntrack-zones approach, user-space apps do not require any
significant changes, and you do not need the rest of the namespace overhead
to accomplish the task.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

--

From: jamal
Date: Friday, January 15, 2010 - 8:03 am

I think for your use case what you state is true. In the general case,
it is not. 
Note: I am not arguing against the patch - just that, compared to
namespaces, it is not the solution for the generic scenario.

cheers,
jamal

--
