Re: [PATCH 00/10] sysfs tagged directories

Previous thread: [00/37] 2.6.25-stable review by Greg KH on Tuesday, April 29, 2008 - 10:17 am. (43 messages)

Next thread: [PATCH 01/10] sysfs: Support for preventing unmounts. by Benjamin Thery on Tuesday, April 29, 2008 - 10:10 am. (1 message)
From: Benjamin Thery
Date: Tuesday, April 29, 2008 - 10:10 am

This is a port to 2.6.25-mm1 of Eric Biederman's patchset to implement
tagged directories in sysfs. This patchset was discussed a few months
ago on lkml@.

I re-post this patchset as most of network namespace work has been
merged in net-2.6.26 and the issue is it is currently easy to test
because of this missing piece.
(today sysfs must be disabled to test netns).


This feature is needed for network namespaces where we have to be able
to have multiple network devices with the same name in different network
namespaces. 

On top of that, the final patch from Serge Hallyn also implements 
tagging for user namespaces.


Here is the announcement Eric wrote back in December to introduce his 
patchset:

"
Now that we have network namespace support merged it is time to
revisit the sysfs support so we can remove the dependency on !SYSFS.

[...]

The bulk of the patches are the changes to allow multiple sysfs
superblocks.

Then comes the tagged directory sysfs support which uses information
captured at mount time to decide which object with which tag will
appear in a directory.

Then the support for renaming and deleting objects where the source
may be ambiguous because of tagging.

Then finally the network namespace support so it is clear how all
of this tied together.
"

Regards,
Benjamin

-- 
--

From: Greg KH
Date: Tuesday, April 29, 2008 - 10:36 am

<snip>

Are the objections that Al Viro made to this patchset when it was last
sent out addressed in this new series?

thanks,

greg k-h
--

From: Serge E. Hallyn
Date: Tuesday, April 29, 2008 - 11:04 am

Which objections were those?  The last submission which I see by Eric
was http://lkml.org/lkml/2007/12/1/15 this past December.  I see no
response from Al and get the feeling you were ok with them.

So my hunch would be that Eric had addressed those before that last
submission, but if not I'm sorry, and please do set me straight.

thanks,
-serge
--

From: Greg KH
Date: Tuesday, April 29, 2008 - 11:41 am

See the thread from Al starting with:
	Date: Mon, 7 Jan 2008 10:24:17 +0000
	From: Al Viro <viro@ZenIV.linux.org.uk>
	To: "Eric W. Biederman" <ebiederm@xmission.com>
	Cc: linux-kernel@vger.kernel.org, htejun@gmail.com,
	        linux-fsdevel@vger.kernel.org, gregkh@suse.de
	Subject: [RFC] netns / sysfs interaction
	Message-ID: <20080107072301.GW27894@ZenIV.linux.org.uk>

He had a lot of questions and objections to this way forward, and I
share those objections.

thanks,

greg k-h
--

From: Serge E. Hallyn
Date: Tuesday, April 29, 2008 - 12:34 pm

Ah I see it, thanks.

All Al's questions appear to be about how a task migration will be handled
in the face of funky userspace usage of sysfs files.  But it seems clear the
first use of these will not be for migration but for vservers.  The key
thing to remember is that we don't (as decided at kernel-summit 06) aim
to hide from userspace the fact that it's in a vserver, we just give it
what it needs so that it can pretend.

As we start implementing checkpoint and restart to effect migration,
*clearly* if we're trying to restart a task which has cwd or an open fd
in /sys/class/net/eth42/, but that directory doesn't exist on the target
machine, then the restart (and hence migrate) fails.

There was a concern about
/sys/devices/pci0000\:00/0000\:00\:0a.0/net:eth0.  Since that's a
symlink to ../../../class/net/eth0, it will either point nowhere or
point to the virtualized eth0, if veth1 (or vethN) was renamed to eth0
in the container.  (see below)  If that is the wrong thing to do we
could try to address it in this patchset, but I suspect it is better
left until device namespace are implemented.  Does that sounds sane?


I think this is answered in patch 4.  So yeah, it does d_move() in each
sysfs mount.  It's all done under the sysfs_rename_mutex.  Judging by
the phrasing of the question, is that not acceptable?

Finally, to give an idea about how the trees end up looking, here is
what I just did on my test box;

/usr/sbin/ip link add type veth
mount --bind /mnt /mnt
mkdir /mnt/sys
mount --make-shared /mnt
ns_exec -cmn /bin/sh  # unshare netns and mounts ns
 # At this point, I still see eth0 and friends under /sys/class/net etc
mount -t sysfs none /sys
 # At this point, /sys/class/net has only lo0 and sit0, and
 # /sys/devices/pci0000:00/0000:00:03.0/net:eth0 is a dead link
mount --bind /sys /mnt/sys
echo $$
	3050

(back in another shell):
/usr/sbin/ip link set veth1 netns 3050

(back in container shell):
/usr/sbin/ip link set veth1 name eth0
 # Now ...
From: Eric W. Biederman
Date: Tuesday, April 29, 2008 - 1:10 pm

We also have to call sysfs_grab_supers to ensure none of the superblocks
we know about will go away during the rename.  I believe that is the
only locking change from the current code.

Eric
--

From: Greg KH
Date: Wednesday, April 30, 2008 - 8:12 pm

I really don't think so, but I'll wait for the reworked patches to
review them and see how badly they mess the code up :)

thanks,

greg k-h
--

From: Greg KH
Date: Wednesday, April 30, 2008 - 8:13 pm

What does this all look like without CONFIG_SYSFS_DEPRECATED enabled,
which is what all sane distros do these days.  That's going to change
the look of the tree for stuff like this a lot I think...

thanks,

greg k-h
--

From: Serge E. Hallyn
Date: Thursday, May 1, 2008 - 8:10 am

Now before moving veth1 to the new netns, we have in the container:
/sys/class/net:
lo  sit0

/sys/devices/virtual/net:
lo  sit0

and after moving veth1, we have in the container:

/sys/class/net:
lo  sit0  veth1

/sys/devices/virtual/net:
lo  sit0

In the parent network namespace, veth1 is removed from /sys/class/net
but remains in /sys/devices/virtual/net.

I'm not sure whether this is the renaming bug that Daniel Lezcano's
patch addresses.  If not (as I suspect) then that clearly needs to be
fixed.

Benjamin can you play around with this and test it with Daniel's
patch?

thanks,
-serge
--

From: Eric W. Biederman
Date: Thursday, May 1, 2008 - 11:34 am

Darn.  It appears we have a regression in this patchset.
That part used to work.

I was thinking of blaming sysfs_rename_link.  But it the
links are fine so it looks more likely that sysfs has morphed once
again and we have  a reference counting issue or something similar.
Yuck. d_move and the other moves should have worked.

From a purely get the good less controversial parts of this
patchset in.  I suggest we look at patches 7/10 and 8/10 (without
the tag_ops).  And introduce and start using sysfs_delete_link
and sysfs_rename_link.  That code seems pretty stable and is
generally a code reduction all by itself by reducing a common
idiom into a single function.

Eric
--

From: Serge E. Hallyn
Date: Thursday, May 1, 2008 - 2:05 pm

Ok as it turns out Benjamin Thery does in fact have a patch to fix this.
Benjamin, please send your renaming patch out tomorrow if you can?

I guess this is why I didn't really see the problem Benjamin said there
was - it just works with SYSFS_DEPRECATED  :)

-serge
--

From: Eric W. Biederman
Date: Thursday, May 1, 2008 - 2:58 pm

It will be nice to see that patch.  I'm curious what the problem was.

Eric
--

From: Serge E. Hallyn
Date: Friday, May 2, 2008 - 10:42 am

Here is the patch he had privately sent.  As the comments indicate he's
still refining it so yell at me not him...   but this on top of the
tagged dir patchset makes everything work for me.

(I'd *really* like to see network namespaces be generally usable sometime
in 2.6.26.)

thanks,
-serge


[This is a proposal to fix the issue with network devices movement 
 and sysfs not being updated correctly.
 Please comment.
 There is no locking in sysfs_tag_will_change(), this may be wrong.]


When a network device is moved between namespaces its sysfs entry isn't
updated correctly because kobject_rename() fails before sysfs_rename_dir()
is called. kobject_rename() prints a warning and returns when it detects 
a kobject in the same kset already has the same name.

When moving between namespaces the device keeps its name but the sysfs 
tag associated with its sysfs_dirent should be updated to reflect the fact 
it now belongs to another namespace. This is done in sysfs_rename_dir().


* The simplest change we can make to "fix" kobject_rename() is to allow 
  a kobject to be renamed to the same name without complaining: we allow
  dumb renaming. We only check the name is actually "owned" by it.

(lib/kobject.c: kobject_rename())
                temp_kobj = kset_find_obj(kobj->kset, new_name);
                if (temp_kobj) {
-                       printk(KERN_WARNING "kobject '%s' cannot be renamed "
-                              "to '%s' as '%s' is already in existence.\n",
-                              kobject_name(kobj), new_name, new_name);
+                       if (kobj != temp_kobj) {
+                               printk(KERN_WARNING "kobject '%s' cannot be renamed "
+                                      "to '%s' as '%s' is already in existence.\n",
+                                      kobject_name(kobj), new_name, new_name);
+                               kobject_put(temp_kobj);
+                               return -EINVAL;
+                       }
      ...
From: Daniel Lezcano
Date: Sunday, May 4, 2008 - 4:13 pm

Serge,

I posted a fix to Dave Miller which was included in net-2.6 and merged 
in 2.6.26-rc1.

http://marc.info/?l=linux-netdev&m=120959135229360&w=2

This one should fix your problem for sysfs.

There is another fix to do in order to avoid name conflicts with network 
interfaces belonging to different namespaces, eg. allows to rename 
veth1234 to eth0 without conflicting with the network device belonging 
in init_net, I send this patch to container@ first to have it merged 
with the sysfs patchset.

Thanks,

   -- Daniel
--

From: Serge E. Hallyn
Date: Monday, May 5, 2008 - 9:18 am

Ok.

Benjamin, what are you doing with Eric's sysfs patchset then?  Are you
going to re-post it asap on top of whatever tree has these two patches,
or are you planning to wait longer?

thanks,
-serge
--

From: Benjamin Thery
Date: Tuesday, May 6, 2008 - 9:53 am

I will resend the patchset on top of 2.6.26-rc1,
with Daniel's last patch merged.

(If anyone prefer to have it merged on top of net-2.6, say so)



-- 
B e n j a m i n   T h e r y  - BULL/DT/Open Software R&D

    http://www.bull.com
--

From: Eric W. Biederman
Date: Tuesday, April 29, 2008 - 12:35 pm

If either of you will take those objections and see how they actually
apply to the patches I would be happy.

We are quite successfully using multiple mounts in proc for the
pid namespace with no large issues.

Similarly for the network namespace we have the same set of files
showing up in different places in /proc with no large coherency
issues despite the same files being in multiple places in the
dcache.

The only piece of the puzzle new in sysfs is directory rename
support.  Which takes a little work but seems sane.


Odd I thought I sent Al a reply to that bold statement but it doesn't
appear in the archive.

At any rate I'm not afraid of more code review and testing.

At the time Al made those statements his concerns about coherency
and locking nightmares did not seem to apply to these patches and 
they don't seem to apply now.

I will happily admit the VFS does not like to work with filesystems
where the state changes behind it's back.  That is the position we are
in fundamentally with sysfs and proc, and the locking works today.
Multiple superblocks for sysfs does not change that in any significant
way.

So bring on the tough code review and concrete objections.
The code can take it and it can only get better for it.

Eric

--

From: Eric W. Biederman
Date: Tuesday, April 29, 2008 - 12:14 pm

I'm trying to recall.  Al was nervous when the approach was described
to him but I don't remember him looking at specific patches and
objecting.

There was also an issue about races in sysfs between rename
and unlink that Al brought up, that causes real problems in
at least one piece of code that uses that functionality.  I have been
busy so I don't know if anyone has addressed that issue.  It is
independent but this patchset may make that issue slightly harder
to fix.

If the concern is Al nervousness with respect to locking order
(and that is complex) the only really sane way to fix that is
to dig into the VFS and change the lock ordering so that is 
friendlier to distributed filesystems like sysfs.

This patchset does not introduce any new lock ordering issues
but it may make the existing convolutions we have to go through
to keep the dcache for existing file handles in sync with the
internal sysfs tree more visible.  As of my last posting I am
not aware of any locking problems in the code itself.

Greg I had thought you had dropped the patchset simply because
you got busy.  I know it languished for a long time because of
that.

Eric
--

Previous thread: [00/37] 2.6.25-stable review by Greg KH on Tuesday, April 29, 2008 - 10:17 am. (43 messages)

Next thread: [PATCH 01/10] sysfs: Support for preventing unmounts. by Benjamin Thery on Tuesday, April 29, 2008 - 10:10 am. (1 message)