This is a port to 2.6.25-mm1 of Eric Biederman's patchset to implement tagged directories in sysfs. This patchset was discussed a few months ago on lkml@. I re-post this patchset as most of network namespace work has been merged in net-2.6.26 and the issue is it is currently easy to test because of this missing piece. (today sysfs must be disabled to test netns). This feature is needed for network namespaces where we have to be able to have multiple network devices with the same name in different network namespaces. On top of that, the final patch from Serge Hallyn also implements tagging for user namespaces. Here is the announcement Eric wrote back in December to introduce his patchset: " Now that we have network namespace support merged it is time to revisit the sysfs support so we can remove the dependency on !SYSFS. [...] The bulk of the patches are the changes to allow multiple sysfs superblocks. Then comes the tagged directory sysfs support which uses information captured at mount time to decide which object with which tag will appear in a directory. Then the support for renaming and deleting objects where the source may be ambiguous because of tagging. Then finally the network namespace support so it is clear how all of this tied together. " Regards, Benjamin -- --
<snip> Are the objections that Al Viro made to this patchset when it was last sent out addressed in this new series? thanks, greg k-h --
Which objections were those? The last submission which I see by Eric was http://lkml.org/lkml/2007/12/1/15 this past December. I see no response from Al and get the feeling you were ok with them. So my hunch would be that Eric had addressed those before that last submission, but if not I'm sorry, and please do set me straight. thanks, -serge --
See the thread from Al starting with: Date: Mon, 7 Jan 2008 10:24:17 +0000 From: Al Viro <viro@ZenIV.linux.org.uk> To: "Eric W. Biederman" <ebiederm@xmission.com> Cc: linux-kernel@vger.kernel.org, htejun@gmail.com, linux-fsdevel@vger.kernel.org, gregkh@suse.de Subject: [RFC] netns / sysfs interaction Message-ID: <20080107072301.GW27894@ZenIV.linux.org.uk> He had a lot of questions and objections to this way forward, and I share those objections. thanks, greg k-h --
Ah I see it, thanks. All Al's questions appear to be about how a task migration will be handled in the face of funky userspace usage of sysfs files. But it seems clear the first use of these will not be for migration but for vservers. The key thing to remember is that we don't (as decided at kernel-summit 06) aim to hide from userspace the fact that it's in a vserver, we just give it what it needs so that it can pretend. As we start implementing checkpoint and restart to effect migration, *clearly* if we're trying to restart a task which has cwd or an open fd in /sys/class/net/eth42/, but that directory doesn't exist on the target machine, then the restart (and hence migrate) fails. There was a concern about /sys/devices/pci0000\:00/0000\:00\:0a.0/net:eth0. Since that's a symlink to ../../../class/net/eth0, it will either point nowhere or point to the virtualized eth0, if veth1 (or vethN) was renamed to eth0 in the container. (see below) If that is the wrong thing to do we could try to address it in this patchset, but I suspect it is better left until device namespace are implemented. Does that sounds sane? I think this is answered in patch 4. So yeah, it does d_move() in each sysfs mount. It's all done under the sysfs_rename_mutex. Judging by the phrasing of the question, is that not acceptable? Finally, to give an idea about how the trees end up looking, here is what I just did on my test box; /usr/sbin/ip link add type veth mount --bind /mnt /mnt mkdir /mnt/sys mount --make-shared /mnt ns_exec -cmn /bin/sh # unshare netns and mounts ns # At this point, I still see eth0 and friends under /sys/class/net etc mount -t sysfs none /sys # At this point, /sys/class/net has only lo0 and sit0, and # /sys/devices/pci0000:00/0000:00:03.0/net:eth0 is a dead link mount --bind /sys /mnt/sys echo $$ 3050 (back in another shell): /usr/sbin/ip link set veth1 netns 3050 (back in container shell): /usr/sbin/ip link set veth1 name eth0 # Now ...
We also have to call sysfs_grab_supers to ensure none of the superblocks we know about will go away during the rename. I believe that is the only locking change from the current code. Eric --
I really don't think so, but I'll wait for the reworked patches to review them and see how badly they mess the code up :) thanks, greg k-h --
What does this all look like without CONFIG_SYSFS_DEPRECATED enabled, which is what all sane distros do these days. That's going to change the look of the tree for stuff like this a lot I think... thanks, greg k-h --
Now before moving veth1 to the new netns, we have in the container: /sys/class/net: lo sit0 /sys/devices/virtual/net: lo sit0 and after moving veth1, we have in the container: /sys/class/net: lo sit0 veth1 /sys/devices/virtual/net: lo sit0 In the parent network namespace, veth1 is removed from /sys/class/net but remains in /sys/devices/virtual/net. I'm not sure whether this is the renaming bug that Daniel Lezcano's patch addresses. If not (as I suspect) then that clearly needs to be fixed. Benjamin can you play around with this and test it with Daniel's patch? thanks, -serge --
Darn. It appears we have a regression in this patchset. That part used to work. I was thinking of blaming sysfs_rename_link. But it the links are fine so it looks more likely that sysfs has morphed once again and we have a reference counting issue or something similar. Yuck. d_move and the other moves should have worked. From a purely get the good less controversial parts of this patchset in. I suggest we look at patches 7/10 and 8/10 (without the tag_ops). And introduce and start using sysfs_delete_link and sysfs_rename_link. That code seems pretty stable and is generally a code reduction all by itself by reducing a common idiom into a single function. Eric --
Ok as it turns out Benjamin Thery does in fact have a patch to fix this. Benjamin, please send your renaming patch out tomorrow if you can? I guess this is why I didn't really see the problem Benjamin said there was - it just works with SYSFS_DEPRECATED :) -serge --
It will be nice to see that patch. I'm curious what the problem was. Eric --
Here is the patch he had privately sent. As the comments indicate he's
still refining it so yell at me not him... but this on top of the
tagged dir patchset makes everything work for me.
(I'd *really* like to see network namespaces be generally usable sometime
in 2.6.26.)
thanks,
-serge
[This is a proposal to fix the issue with network devices movement
and sysfs not being updated correctly.
Please comment.
There is no locking in sysfs_tag_will_change(), this may be wrong.]
When a network device is moved between namespaces its sysfs entry isn't
updated correctly because kobject_rename() fails before sysfs_rename_dir()
is called. kobject_rename() prints a warning and returns when it detects
a kobject in the same kset already has the same name.
When moving between namespaces the device keeps its name but the sysfs
tag associated with its sysfs_dirent should be updated to reflect the fact
it now belongs to another namespace. This is done in sysfs_rename_dir().
* The simplest change we can make to "fix" kobject_rename() is to allow
a kobject to be renamed to the same name without complaining: we allow
dumb renaming. We only check the name is actually "owned" by it.
(lib/kobject.c: kobject_rename())
temp_kobj = kset_find_obj(kobj->kset, new_name);
if (temp_kobj) {
- printk(KERN_WARNING "kobject '%s' cannot be renamed "
- "to '%s' as '%s' is already in existence.\n",
- kobject_name(kobj), new_name, new_name);
+ if (kobj != temp_kobj) {
+ printk(KERN_WARNING "kobject '%s' cannot be renamed "
+ "to '%s' as '%s' is already in existence.\n",
+ kobject_name(kobj), new_name, new_name);
+ kobject_put(temp_kobj);
+ return -EINVAL;
+ }
...Serge, I posted a fix to Dave Miller which was included in net-2.6 and merged in 2.6.26-rc1. http://marc.info/?l=linux-netdev&m=120959135229360&w=2 This one should fix your problem for sysfs. There is another fix to do in order to avoid name conflicts with network interfaces belonging to different namespaces, eg. allows to rename veth1234 to eth0 without conflicting with the network device belonging in init_net, I send this patch to container@ first to have it merged with the sysfs patchset. Thanks, -- Daniel --
Ok. Benjamin, what are you doing with Eric's sysfs patchset then? Are you going to re-post it asap on top of whatever tree has these two patches, or are you planning to wait longer? thanks, -serge --
I will resend the patchset on top of 2.6.26-rc1,
with Daniel's last patch merged.
(If anyone prefer to have it merged on top of net-2.6, say so)
--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
http://www.bull.com
--
If either of you will take those objections and see how they actually apply to the patches I would be happy. We are quite successfully using multiple mounts in proc for the pid namespace with no large issues. Similarly for the network namespace we have the same set of files showing up in different places in /proc with no large coherency issues despite the same files being in multiple places in the dcache. The only piece of the puzzle new in sysfs is directory rename support. Which takes a little work but seems sane. Odd I thought I sent Al a reply to that bold statement but it doesn't appear in the archive. At any rate I'm not afraid of more code review and testing. At the time Al made those statements his concerns about coherency and locking nightmares did not seem to apply to these patches and they don't seem to apply now. I will happily admit the VFS does not like to work with filesystems where the state changes behind it's back. That is the position we are in fundamentally with sysfs and proc, and the locking works today. Multiple superblocks for sysfs does not change that in any significant way. So bring on the tough code review and concrete objections. The code can take it and it can only get better for it. Eric --
I'm trying to recall. Al was nervous when the approach was described to him but I don't remember him looking at specific patches and objecting. There was also an issue about races in sysfs between rename and unlink that Al brought up, that causes real problems in at least one piece of code that uses that functionality. I have been busy so I don't know if anyone has addressed that issue. It is independent but this patchset may make that issue slightly harder to fix. If the concern is Al nervousness with respect to locking order (and that is complex) the only really sane way to fix that is to dig into the VFS and change the lock ordering so that is friendlier to distributed filesystems like sysfs. This patchset does not introduce any new lock ordering issues but it may make the existing convolutions we have to go through to keep the dcache for existing file handles in sync with the internal sysfs tree more visible. As of my last posting I am not aware of any locking problems in the code itself. Greg I had thought you had dropped the patchset simply because you got busy. I know it languished for a long time because of that. Eric --
