sysfs: Implement sysfs tagged directory support.
The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.To achieve a race free design all tagged directories are created
and managed by sysfs itself. The upper level code that knows what
tagged directories we need provides just two methods that enable
this:
sb_tag() - that returns a "void *" tag that identifies the context of
the process that mounted sysfs.
kobject_tag(kobj) - that returns a "void *" tag that identifies the context
a kobject should be in.
Everything else is left up to sysfs.For the network namespace sb_tag and kobject_tag are essentially
one line functions, and look to remain that.The work needed in sysfs is more extensive. At each directory
or symlink creating I need to check if the directory it is being
created in is a tagged directory and if so generate the appropriate
tag to place on the sysfs_dirent. Likewise at each symlink or
di...
As only directories can be tagged, I suppose handling tags explicitly
isn't necessary here, right? Can we please add a comment explaningTaking both @kobj and @dir_sd is ugly but it isn't your fault. I'll
As before, I can't bring myself to like this interface. Is computing
tags dynamically really necessary? Can't we do the followings?tag = sysfs_allocate_tag(s);
sysfs_enable_tag(kobj (or sd), tag);
sysfs_sb_show_tag(sb, tag);Where tags are allocated using ida and each sb has bitmap of enabled
tags so that sysfs ops can simply use something like the following to
test whether it's enabled.bool sysfs_tag_enabled(sb, tag)
{
return sysfs_info(sb)->tag_map & (1 << tag);
}Tags which can change dynamically seems too confusing to me and it
makes things difficult to verify as it's unclear how those tags are
gonna to change.Thanks.
--
tejun
--
Tejun thank you for the review, and my apologies for the delayed
reply.It isn't so much computing tags dynamically but rather it is reading them
Youch that seems limiting. The expectation is that we could have
as many as 100 different containers in use on a single system at one
time. So 100 apparent copies of the network stack.There is also a second dimension here we multiplex different
directories based on different sets of tags. One directory based
on user namespaces another on the network namespaces.The tags in practice are just pointers to the namespace pointers.
So while we could use the ida technique to specify which set of tags
we are talking about for a directory it isn't sufficient.The question sysfs_tag_enabled(sb, tag) makes no sense to me.
Especially in the context of needed a sysfs_sb_show_tag(sb, tag);The current structure is because of all of the darn fool races and
magic that sysfs does. We have to say for a given directory: Your
contents will always be tagged, and only those that one tag that
matches what was captured by the superblock when sysfs is mountedWe have a fundamental issue that we have to handle, and it sounds like
you are proposing something that will not handle it.- network devices can move between namespaces.
- network devices have driver specific sysfs attributes hanging off of them.So we have to move the network devices and their sysfs attributes
between namespaces, and I implemented that in kobject_rename,
sysfs_rename path.The tags on a kobject can only change during a rename operation.
So when the change happens is well defined. Further there is a
set of functions: sysfs_creation_tag, sysfs_removal_tag,
sysfs_lookup_tag, sysfs_dirent_tag which makes it clear what we
are doing.If you really don't like how the tags are managed we need to talk
about how we store the tags on kobjects and on the super block.Registering a set of tags could easily make the sb_tag function
obsolete, and that is one sma...
Hello, Eric.
It's still dynamic from sysfs's POV and I think that will make
100 netns would mean 100 bits and 100 different views of them would mean
100 sb's where each sb would need bitmap larger than 100 bits. I don'tNo matter which criteria is used to select ns, it should end up being
mapped to a set of tags (here, ida allocated numbers). Unless tags canI failed to follow here. Can you please elaborate a bit? If you can
sysfs_tag_enabled() was meant to test whether a directory which is
What you described is pretty much what I'm talking about. The only
difference is whether to use caller-provided pointer as tag or an
ida-allocated integer. The last sentence of the above paragraph is
basically sys_tag_enabled() function (maybe misnamed).The main reason why I'm whining about this so much is because I think
tag should be something abstracted inside sysfs proper. It's something
which affects very internal operation of sysfs and I really want to keep
the implementation details inside sysfs. Spreading implementation over
kobject and sysfs didn't turn out too pretty after all.Thank you.
--
tejun
--
Ah. When we are doing readdir or lookup. Yes that makes sense.
So some concrete code examples here. In the current code in lookup
what I am doing is:tag = sysfs_lookup_tag(parent_sd, parent->d_sb);
sd = sysfs_find_dirent(parent_sd, tag, dentry->d_name.name);With the proposed change of adding tag types sysfs_lookup_tag becomes:
const void *sysfs_lookup_tag(struct sysfs_dirent *dir_sd, struct super_block *sb)
{
const void *tag = NULL;if (dir_sd->s_flags & SYSFS_FLAG_TAGGED)
tag = sysfs_info(sb)->tag[dir_sd->tag_type];return tag;
}Which means that in practice I can lookup that tag that I am displaying
once.Then in sysfs_find_dirent we do:
for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling) {
if ((parent_sd->s_flags & SYSFS_FLAG_TAGGED) &&
(sd->s_tag.tag != tag))
continue;
if (!strcmp(sd->s_name, name))
return sd;
}That should keep the implementation sufficiently inside of sysfs for there
to be no guessing. In addition as a practical matter we can only allow
one tag to be visible in a directory at once or else we can not check
for duplicate names. Which is the problem I see with a bitmap based test
too unnecessary many degrees of freedom.The number of tag types will be low as it is the number of subsystems
that use the feature. Simple enough that I expect statically allocating
the tag types in an enumeration is a safe and sane way to operate.
i.e.enum sysfs_tag_types {
SYSFS_TAG_NETNS,
SYSFS_TAG_USERNS,
SYSFS_TAG_MAXI agree. Most of the implementation is in sysfs already. We just have
a few corner cases.Fundamentally it is the subsystems responsibility that creates the
kobjects and the sysfs entries. The only case where I can see an
ida generated number being a help is if we start having lifetime
issues. Further the extra work to allocate and free tags ida based
tags seems unnecessary.I don'...
Hello, Eric.
Having enumed tag types limits that a sb can have map to only one tag
but it doesn't really prevent multiple possibly visible entries which is
the real unnecessary degrees of freedom. That said, I don't reallyI still would prefer something which is more generic. The abstraction
is clearer too. A sb shows untagged and a set of tags. A sd can eitherUsing ida (or idr if a pointer for private data is necessary) is really
easy. It'll probably take a few tens of lines of code. That said, I
don't think I have enough rationale to nack what you described. So, as
long as the tags are made static, I won't object.Thanks.
--
tejun
--
Having a single tag type per directory and thus a single tag visible per
directory does prevent multiple possible visible entries.That is we can check when we add the sd if there will be a conflict in
That is the abstraction now.
The only difference is how we represent the set of tags.
I use and array of the valid tags.
You use a bitmap.And array allows the lookup of the tag I am looking for before
I search for the sd. An bitmap requires me to compare each entry.For me that is a deal breaker. Currently in certain pathological
cases we have scaling issues with sysctl and sysfs that we can
have enormous directories that start running slowly. To fix
lookup performance requires that we know the full name before
we do the directory search which is the name string and the
tag.So I having a type of tag as being of fundamental importance in
the interface now so we don't need to refactor all of the users
later. In addition to the fact that we need the type to know
how to set the tags when mounting a superblock and when
given a new kobject to create an sd for.We could make the types dynamic rather then a static enumeration but
Sounds good. The only justification I can think of for ida tags is that
they are smaller, and so can keep the sysfs_dirents smaller. Which
occasionally is a significant concern. Still that should be an optimization
that we can apply later, as it is not a structural difference in the code.Just to confirm. Do you the two operations:
mount_tag - called only when the sb is mounted
kobject_tag - called when we create new sd or rename an sdCause you to view an the tags as dynamic?
Eric
--
Hello,
How so? sysfs_sb->bitmap which contains enough bits for all the defined
tags and determining whether a sd should be shown or not is as simple asWhat I'm feeling unease about is the extra level of abstraction added by
tag types. A sd is given a tag. A sb shows a set of tags. The most
straight forward to implement that is to give sd a tag and test the tag
against sb's set of tags. The type is added because pointer tag
requires sequential matching which is usually best to avoid. It'sThe thing is that I don't really see why there's tagged_dir_ops at all.
What's needed is tagged sd's and sb's which can show subset of those
tags, so adding callback ops for tags just doesn't make much sense to
me. The interface should ideally be...1. alloc/release tag
2. set / change / remove tag on sd
3. enable / disable tag on a sbThis has been my opinion from the beginning. Unless the tags need to be
changed dynamically on demand (which I hope is not the case), there just
is plainly no reason to have callbacks for tags.Thanks.
--
tejun
--
What we are implementing is not, a sb with a set of tags that are displayed,
but directories with a single tag that is displayed. The sb just happens
to hold the state for the directories.A directory displaying only a single tag is an necessary constraint for
Yes. The compare happens to be test_bit.
With a bitmap you must visit each dirent with a given name and see if
it has a tag that is displayed.With an array you can lookup the tag aprori and can potentially do a
hash table lookup or a tree lookup and are not required to visit eachThat is just one important aspect of it. We need a way to describe
We need callbacks for interfacing with the kobject layer, and for
selecting our set of tags at mount time. Not tagged_dir_ops so much
Essentially agreed.Create an sd with a tag, change the tag on a sd.
Having an untagged sd in a directory that requires tags should
Disagree that is too flexible. Tags on a sb need to be
unchanging or else we get vfs layer issues.Further the abstraction is logically exactly one tag on a
(sb,directory) pair.The operations needed are.
- Select the set of tags on a sb (at mount time)
This requires we call a set of callbacks. [ My mount_sb callback ]- release a tag (which implies removing all tagged entries and
removing the sb reference)4. Interface with the kobject layer.
kobject_add calls sysfs_create_dir
kboject_rename calls sysfs_rename_dir
kobject_del calls sysfs_remove_dirFor the first two operations we need a helper function to go from a
kobject to a tag.We don't need callbacks to poll to see if the tags on a sd have
changed.We need helper functions for interfacing with the rest of the kernel.
Eric
--
Hello,
A few things...
1. The lookup is currently done linearly and is fast enough for now.
Also, most lookup ops are cached by vfs layer. I'm not sure how
probable it is that we're gonna need hash or tree based sd lookup.2. I don't think it's gonna be too difficult to speed up bitmap based
lookup. It would require a bit more intelligence but there's no
fundamental restriction. Just organizing the tree by tag first would
give us the same order of magnitude lookup given that the tags are usedFor netns, yes. I just think it would be better if the sysfs mechanism
to support that concept is more generic especially because it doesn'tThe kobject op seems a bit strange way to interface to me. For mount,
I'm not so sure here. As a policy, maybe but I don't really see a
Yes, that's why I view it as strange. These can be done in forward way
(by passing in mount options and/or arguments) but it's done by first
going into the sysfs and then calling back out to outer layer.Thanks.
--
tejun
--
Well one of those reasons is not having duplicate entries in your directory listing.
I don't know how bad sysfs is. On the sysctl side I have people complaining
because I am doing a lookup during insert and that lookup is linear. SysfsWell the envisioned use is for other namespaces and they all are similar
I will look how if there is a place in the kobject layer to put it. With
I added it where it was easiest. Adding a parameter to sysfs_create_dir
simply means I have to add the function to the kobject layer. It is certainlyWell in the case of mount the default parameter at least is current, and
there are good reasons for that.On the other side I can't pass a tag through from the device layer to
the kobject layer. It isn't a concept the kobject layer supports.At least though the conversation is in relative agreement. I will refresh
the patches shortly and see where we are at.Eric
--
Hello,
Something I've been curious about is a directory which contains both the
untagged entries and tagged ones. I can definitely imagine somethingThis pretty much defines the interface and is likely to force future
Is it difficult to just export it via kobject and device layer? If
changing the default function is too much of a hassle (and I'm sure it
would be), just add an extended version which takes @tag. The current
implementation feels like it tried too hard to not add intermediateI was imagining something like...
mount -t sysfs -o ns=0,4,5 /my/sys
And let the userland control which ns's are visible in the particular
Thanks a lot for the patience. :-)
--
tejun
--
Well gregkh thought it wasn't a good idea last time I tried exploring
It tried for something that was simple to use and that worked.
Also the way things work. I have to use all of the intermediate layers
and their calls to various functions. So just passing a parameter through
doesn't work to well.It looks to me like the clean solution is move kobject_tag into
kobj_type, and have it call some higher level function.We also need to remove the maintenance disaster that is
kobject_set_name from sysfs_rename_dir. And push it into
kobject_rename instead. The error handling is harder inAssuming Greg will accept it when he sees reasonable patches.
Eric
--
Hello, Eric.
There is rather large possibility that I'm just being dumb here
especially because I haven't reviewed the users of this facility, so all
the comments I'm making are from the POV of interfaces of sysfs and the
related layers. I think I've made my concerns clear by now. If you
still think the callbacks are the best way to go, please try to
enlighten me. I really don't wanna be stopping something which is
better from ignorance. Just give me some concrete examples or point me
to codes which show how and why the current interface is the best forHeh... I personally think kobject layer as a whole should just be hidden
under the cabinet of device driver model but I'm having difficult time
convincing other people of it. Anyways, fully agree the interactionGreg says he would. :-)
Thanks a lot for your patience.
--
tejun
--
Currently I think a callback on to get the tag from a kobject is the
best way to go. That way we don't need to add a field to struct
kobject (and don't need the associated redundancy), and we can lookup
up the tag when we need it.I have been playing with the code and just about have it ready
to go. I just need to refactor all of my changes into clean
patches at this point, plus a bit of review and test. Ben & Daniel
have given me a version of the previous patchset rebased unto the
latest -mm so that should help for the unchanged parts.Introducing the sysfs_tag_type thing and pushing the functions to
the edges helps. It especially cleans up the ugly mount/umount
situation allowing us to handle that with generic code.Moving the kobject_tag into struct ktype works and looks roughly
as clean as what happens with attributes. So I that seems reasonable,
and doesn't result in a significant change in the users.The result of which means that I only have the helper function sysfs_creation_tag
left in sysfs/dir.c Left in there are some of the nasties in dealing with symlinks.At this point I believe I have achieved a nice degree of simplifying the sysfs
code in the current patches without really changing the users or
making it more complex for them.I have not implemented ida tags, and I don't plan to. That is just
unnecessary work right now. The users are simple and the meat of theI would be happy if we could remove all nonsense kobject that are there just
for structural purposes but have no purpose otherwise. Things like kobjects
for symlinks. The kobject layer doesn't seem to have a clear identityWelcome. The code reached a point a while ago where it didn't make sense
to change it without review feedback.Eric
--
The kobject events are sent through a netlink message which is not
currently per network namespace. Shouldn't be useful to have a way to
retrieve from the kobject the network namespace or the uevent socket
associated with it ? IMHO having idr in the kobject + netns pointer
associated may help to handle the sysfs isolation and makes the uevent--
Sauf indication contraire ci-dessus:
Compagnie IBM France
Siège Social : Tour Descartes, 2, avenue Gambetta, La Défense 5, 92400
Courbevoie
RCS Nanterre 552 118 465
Forme Sociale : S.A.S.
Capital Social : 542.737.118 ?
SIREN/SIRET : 552 118 465 02430
--
Grumble. I have been conveniently been forgetting about that socket.
Similarly we have the user mode helpers to deal with.For this conversation there is a simple answer. All of that is in the
kobject layer, and works even when you compile sysfs out of your kernel.
Therefore it is a separate problem. And sysfs idr tags have nothing
to do with it.It is most definitely something we need to come back to. I bet there
are some interesting interactions when you have multiple network devices
with the same name generating events.Eric
--
Hello,
Related delta: I've been thinking that uevents should be part of sysfs
not kobject as that's what the userland is gonna associate the event
with. Would that solve the problem you're thinking about?Thanks.
--
tejun
--
When multiple namespaces are in use we can get multiple kernel objects
with the same name which is currently impossible to represent in sysfs.
In particular directories like /sys/class/net and /sys/kernel/uids have
significant problems.Not wanting to change the user space interface and wanting to have a simple
implementation where all objects are in the kobject and sysfs trees. The
decision has been made to tag objects with the namespace they live in,
and in a particular mount of sysfs only display objects with the tag
that corresponds to the namespaces in effect when sysfs was mounted.After the last round of reviews the mount/umount logic is
significantly cleaned up and easier to maintain. From a 10,000
foot view the code and the way it functions has remained the
same since we settled on tagged directories a year or so ago. I
intend any future cleanups to be as incremental patches on top
of this existing set.Lack of these patches are keeping the generally complete network
namespace work in 2.6.26 from being used and tested more heavily.
Can we please get the patches merged?These patches are based off of 2.6.26-rc8 + the -gregkh tree from
last night. Hopefully that means they apply -mm -gregkh and
-linux-next.Eric
--
A quick update. My patchset conflicts with the recently added
This patch is unnecessary as that path is never exercised anymore.
as: dev_change_name returns early in the case of a noop rename.In addition my introduction sysfs_rename_link handles this case
cleanly by first removing the old link and then creating the new
link. Preventing false positives when the link names are the same.So it should be safe to drop Cornelia patch without a reoccurance
of scary errors.Eric
--
On Sat, 05 Jul 2008 21:42:57 -0700,
My impression was that the networking folks didn't want any warnings for
Hm, the description looks badly worded - I unfortunately left the old
text unchanged when I respun the patch :( The patch re-introduces the
warning in sysfs_add_one() which had been removed in the meanwhile and
makes device_rename() use a non-warning version. I still think we want
a warning for the general case since this is usually caused be some
problems in the calling code (and the alternative would be to add
checks to all callers.)
--
Which would be reasonable. Because all of the checks have been done
Right. We just need to get the sysfs paths clean enough that we don't
emit false positives. I think I have accomplished that for rename.Eric
--
I should mention patches 1-12 are the core of the work.
Patches 13-15 are the users. Included in complete form primarily to
aid review.Eric
--
It finally dawned on me what the clean fix to sysfs_rename_dir
calling kobject_set_name is. Move the work into kobject_rename
where it belongs. The callers serialize us anyway so this is
safe.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/sysfs/dir.c | 6 +-----
include/linux/sysfs.h | 4 +---
lib/kobject.c | 17 +++++++++++++++--
3 files changed, 17 insertions(+), 10 deletions(-)diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 8c0e4b9..146b86a 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -799,16 +799,12 @@ int sysfs_rename_dir(struct kobject * kobj, const char *new_name)
if (!new_dentry)
goto out_unlock;- /* rename kobject and sysfs_dirent */
+ /* rename sysfs_dirent */
error = -ENOMEM;
new_name = dup_name = kstrdup(new_name, GFP_KERNEL);
if (!new_name)
goto out_unlock;- error = kobject_set_name(kobj, "%s", new_name);
- if (error)
- goto out_unlock;
-
dup_name = sd->s_name;
sd->s_name = new_name;diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 84d92bb..f7e43ed 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -20,8 +20,6 @@
struct kobject;
struct module;-extern int kobject_set_name(struct kobject *kobj, const char *name, ...)
- __attribute__((format(printf, 2, 3)));
/* FIXME
* The *owner field is no longer used, but leave around
* until the tree gets cleaned up fully.
@@ -140,7 +138,7 @@ static inline void sysfs_remove_dir(struct kobject *kobj)static inline int sysfs_rename_dir(struct kobject *kobj, const char *new_name)
{
- return kobject_set_name(kobj, "%s", new_name);
+ return 0;
}static inline int sysfs_move_dir(struct kobject *kobj,
diff --git a/lib/kobject.c b/lib/kobject.c
index 829b839..49b3bc4 100644
--- a/lib/kobject.c
+++ b/lib/kobject.c
@@ -451,6 +451,7 @@ int kobject_rename(struct kobject *kobj, const char *new_name)
{
int error = 0;
const char *devpath = NULL;
+ const char *dup...
Nice clean up. Acked-by: Tejun Heo <tj@kernel.org>
--
tejun
--
To support mounting multiple instances of sysfs occassionally I
need to walk through all of the currently present sysfs super blocks.To allow this iteration this patch adds sysfs_grab_supers
and sysfs_release_supers. While a piece of code is in
a section surrounded by these no more sysfs super blocks
will be either created or destroyed.So the fundamentals.
- The data in sysfs fundamentally changes behind the back of the
VFS and we need to keep the VFS in sync. Essentially this is the
distributed filesystem problem.- In particular for sysfs_rename and sysfs_move_dir we need to support finding
the dcache entries and calling d_move. So that the dcache does not
get into an inconsistent state. Timeouts and invalidates like NFS
uses are to be avoided if at all possible.- Coming through the vfs we are guaranteed that the filesystem will
not be unmounted while we have a reference on a dentry, and with
multiple mounts we do not get that guarantee. Therefore to get that
guarantee for all of the superblocks we need the blunt instrument.- Since mount/unmount are rare blocking them is no big deal.
I believe any distributed filesystem that is together enough to tell
us about renames (so we can update the dcache) instead of doing the
NFS timeout will need the ability to block mount/unmount while it is
executing d_move.Currently sysfs does not need to block mounts only because we perform
an internal mount and then never unmount sysfs.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/sysfs/mount.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++-----
fs/sysfs/sysfs.h | 10 +++++++
2 files changed, 81 insertions(+), 8 deletions(-)diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 9f328d2..c812cc4 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ ...
In preparation for multiple mounts of sysfs add a superblock parameter to
sysfs_get_dentry.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/sysfs/dir.c | 12 +++++++-----
fs/sysfs/file.c | 2 +-
fs/sysfs/sysfs.h | 3 ++-
3 files changed, 10 insertions(+), 7 deletions(-)diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 146b86a..69c40ed 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -85,6 +85,7 @@ static void sysfs_unlink_sibling(struct sysfs_dirent *sd)/**
* sysfs_get_dentry - get dentry for the given sysfs_dirent
+ * @sb: superblock of the dentry to return
* @sd: sysfs_dirent of interest
*
* Get dentry for @sd. Dentry is looked up if currently not
@@ -97,9 +98,10 @@ static void sysfs_unlink_sibling(struct sysfs_dirent *sd)
* RETURNS:
* Pointer to found dentry on success, ERR_PTR() value on error.
*/
-struct dentry *sysfs_get_dentry(struct sysfs_dirent *sd)
+struct dentry *sysfs_get_dentry(struct super_block *sb,
+ struct sysfs_dirent *sd)
{
- struct dentry *dentry = dget(sysfs_sb->s_root);
+ struct dentry *dentry = dget(sb->s_root);while (dentry->d_fsdata != sd) {
struct sysfs_dirent *cur;
@@ -777,7 +779,7 @@ int sysfs_rename_dir(struct kobject * kobj, const char *new_name)
goto out; /* nothing to rename *//* get the original dentry */
- old_dentry = sysfs_get_dentry(sd);
+ old_dentry = sysfs_get_dentry(sysfs_sb, sd);
if (IS_ERR(old_dentry)) {
error = PTR_ERR(old_dentry);
old_dentry = NULL;
@@ -841,7 +843,7 @@ int sysfs_move_dir(struct kobject *kobj, struct kobject *new_parent_kobj)
goto out; /* nothing to move *//* get dentries */
- old_dentry = sysfs_get_dentry(sd);
+ old_dentry = sysfs_get_dentry(sysfs_sb, sd);
if (IS_ERR(old_dentry)) {
error = PTR_ERR(old_dentry);
old_dentry = NULL...
This function is similar but much simpler to sysfs_get_dentry
returns a sysfs dentry if one curently exists.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/sysfs/dir.c | 39 +++++++++++++++++++++++++++++++++++++++
1 files changed, 39 insertions(+), 0 deletions(-)diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 69c40ed..df9934a 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -764,6 +764,45 @@ void sysfs_remove_dir(struct kobject * kobj)
__sysfs_remove_dir(sd);
}+/**
+ * __sysfs_get_dentry - get dentry for the given sysfs_dirent
+ * @sb: superblock of the dentry to return
+ * @sd: sysfs_dirent of interest
+ *
+ * Get dentry for @sd. Only return a dentry if one currently
+ * exists.
+ *
+ * LOCKING:
+ * Kernel thread context (may sleep)
+ *
+ * RETURNS:
+ * Pointer to found dentry on success, NULL on failure.
+ */
+static struct dentry *__sysfs_get_dentry(struct super_block *sb,
+ struct sysfs_dirent *sd)
+{
+ struct inode *inode;
+ struct dentry *dentry = NULL;
+
+ inode = ilookup5_nowait(sysfs_sb, sd->s_ino, sysfs_ilookup_test, sd);
+ if (inode && !(inode->i_state & I_NEW)) {
+ struct dentry *alias;
+ spin_lock(&dcache_lock);
+ list_for_each_entry(alias, &inode->i_dentry, d_alias) {
+ if (!IS_ROOT(alias) && d_unhashed(alias))
+ continue;
+ if (alias->d_sb != sb)
+ continue;
+ dentry = alias;
+ dget_locked(dentry);
+ break;
+ }
+ spin_unlock(&dcache_lock);
+ }
+ iput(inode);
+ return dentry;
+}
+
int sysfs_rename_dir(struct kobject * kobj, const char *new_name)
{
struct sysfs_dirent *sd = kobj->sd;
--
1.5.3.rc6.17.g1911--
This patch modifies the sysfs_rename_dir and sysfs_move_dir routines
to support multiple sysfs dentry tries rooted in different
sysfs superblocks.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/sysfs/dir.c | 193 +++++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 136 insertions(+), 57 deletions(-)diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index df9934a..b2d92ea 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -803,43 +803,113 @@ static struct dentry *__sysfs_get_dentry(struct super_block *sb,
return dentry;
}+struct sysfs_rename_struct {
+ struct list_head list;
+ struct dentry *old_dentry;
+ struct dentry *new_dentry;
+ struct dentry *old_parent;
+ struct dentry *new_parent;
+};
+
+static void post_rename(struct list_head *head)
+{
+ struct sysfs_rename_struct *srs;
+ while (!list_empty(head)) {
+ srs = list_entry(head->next, struct sysfs_rename_struct, list);
+ dput(srs->old_dentry);
+ dput(srs->new_dentry);
+ dput(srs->old_parent);
+ dput(srs->new_parent);
+ list_del(&srs->list);
+ kfree(srs);
+ }
+}
+
+static int prep_rename(struct list_head *head,
+ struct sysfs_dirent *sd, struct sysfs_dirent *new_parent_sd,
+ const char *name)
+{
+ struct sysfs_rename_struct *srs;
+ struct super_block *sb;
+ struct dentry *dentry;
+ int error;
+
+ list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
+ dentry = sysfs_get_dentry(sb, sd);
+ if (dentry == ERR_PTR(-EXDEV))
+ continue;
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto err_out;
+ }
+
+ srs = kzalloc(sizeof(*srs), GFP_KERNEL);
+ if (!srs) {
+ error = -ENOMEM;
+ dput(dentry);
+ goto err_out;
+ }
+
+ INIT_LIST_HEAD(&srs->list);
+ list_add(head, &srs->list);
+ srs->old_dentry = dentry;
...
Currently sysfs_chmod calls sys_setattr which in turn calls
inode_change_ok which checks to see if it is ok for the current user
space process to change tha attributes. Since sysfs_chmod_file has
only kernel mode clients denying them permission if user space is the
problem is completely inappropriate.Therefore factor out sysfs_sd_setattr which does not call
inode_change_ok and modify sysfs_chmod_file to call it.In addition setting victim_sd->s_mode explicitly in sysfs_chmod_file
is redundant so remove that as well.Thanks to Tejun Heo <htejun@gmail.com>, and
Daniel Lezcano <dlezcano@fr.ibm.com> for working on this
and spotting this case.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/sysfs/file.c | 5 +----
fs/sysfs/inode.c | 23 ++++++++++++++++-------
fs/sysfs/sysfs.h | 1 +
3 files changed, 18 insertions(+), 11 deletions(-)diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index cb5dd3f..1304b3a 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -600,13 +600,10 @@ int sysfs_chmod_file(struct kobject *kobj, struct attribute *attr, mode_t mode)
newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
newattrs.ia_ctime = current_fs_time(inode->i_sb);
- rc = sysfs_setattr(victim, &newattrs);
+ rc = sysfs_sd_setattr(victim_sd, inode, &newattrs);if (rc == 0) {
fsnotify_change(victim, newattrs.ia_valid);
- mutex_lock(&sysfs_mutex);
- victim_sd->s_mode = newattrs.ia_mode;
- mutex_unlock(&sysfs_mutex);
}mutex_unlock(&inode->i_mutex);
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index eb53c63..80f8fd4 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -42,10 +42,9 @@ int __init sysfs_inode_init(void)
return bdi_init(&sysfs_backing_dev_info);
}-int sysfs_setattr(struct dentry * dentry, struct iattr * iattr)
+int sysfs_sd_setattr(struct sysfs_dirent *sd, struct inode *inode,
+ ...
Acked-by: Tejun Heo <tj@kernel.org>
--
tejun
--
Teach sysfs_chmod_file how to handle multiple sysfs superblocks.
Since we only have one inode per sd the only thing we have to deal
with is multiple dentries for sending fs notifications. This might
dup the inode notifications oh well.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/sysfs/file.c | 39 +++++++++++++++++++++++++--------------
1 files changed, 25 insertions(+), 14 deletions(-)diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 1304b3a..5955ae9 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -574,8 +574,8 @@ EXPORT_SYMBOL_GPL(sysfs_add_file_to_group);
int sysfs_chmod_file(struct kobject *kobj, struct attribute *attr, mode_t mode)
{
struct sysfs_dirent *victim_sd = NULL;
- struct dentry *victim = NULL;
- struct inode * inode;
+ struct super_block *sb;
+ struct inode * inode = NULL;
struct iattr newattrs;
int rc;@@ -584,31 +584,42 @@ int sysfs_chmod_file(struct kobject *kobj, struct attribute *attr, mode_t mode)
if (!victim_sd)
goto out;- mutex_lock(&sysfs_rename_mutex);
- victim = sysfs_get_dentry(sysfs_sb, victim_sd);
- mutex_unlock(&sysfs_rename_mutex);
- if (IS_ERR(victim)) {
- rc = PTR_ERR(victim);
- victim = NULL;
- goto out;
- }
-
- inode = victim->d_inode;
+ rc = -ENOENT;
+ mutex_lock(&sysfs_mutex);
+ inode = sysfs_get_inode(victim_sd);
+ mutex_unlock(&sysfs_mutex);
+ if (!inode)
+ goto out;+ mutex_lock(&sysfs_rename_mutex);
+ sysfs_grab_supers();
mutex_lock(&inode->i_mutex);newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
newattrs.ia_ctime = current_fs_time(inode->i_sb);
rc = sysfs_sd_setattr(victim_sd, inode, &newattrs);
+ if (rc)
+ goto out_unlock;
+
+ list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
+ /* Ignore it when the dentry does not exist on the
+ * target superblock.
+ */
+ struct dentry * victim = sysfs_get_d...
Great, Acked-by: Tejun Heo <tj@kernel.org>
--
tejun
--
Accessing the internal sysfs_mount is error prone in the context
of multiple super blocks, and nothing needs it. Not even the
sysfs crash debugging patch (although it did in an earlier version).Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/sysfs/mount.c | 2 +-
fs/sysfs/sysfs.h | 1 -
2 files changed, 1 insertions(+), 2 deletions(-)diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index c812cc4..99974f0 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -22,7 +22,7 @@
/* Random magic number */
#define SYSFS_MAGIC 0x62656572-struct vfsmount *sysfs_mount;
+static struct vfsmount *sysfs_mount;
struct super_block * sysfs_sb = NULL;
struct kmem_cache *sysfs_dir_cachep;diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 5ee5d0a..33b3c73 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -97,7 +97,6 @@ struct sysfs_super_info {
extern struct sysfs_dirent sysfs_root;
extern struct super_block *sysfs_sb;
extern struct kmem_cache *sysfs_dir_cachep;
-extern struct vfsmount *sysfs_mount;
extern struct file_system_type sysfs_fs_type;void sysfs_grab_supers(void);
--
1.5.3.rc6.17.g1911--
Acked-by: Tejun Heo <tj@kernel.org>
--
tejun
--
The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.To achieve a reace free design all tagged directories are created and
managed by sysfs itself.Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_tag_types with the type and it's operations
- call sysfs_make_tagged_dir with the tag type on directories
to be managed by this tag type
- sysfs_exit_tag when an individual tag is no longer valid- Implement mount_tag() which returns the tag of the calling process
so we can attach it to a sysfs superblock.
- Implement ktype.sysfs_tag() which returns the tag of a syfs kobject.Everything else is left up to sysfs and the driver layer.
For the network namespace mount_tag and sysfs_tag are essentially
one line functions, and look to remain that.Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equal...
<snip>
I applied the patches up to here (well, patch 8 was no longer needed at
all anymore), but this one doesn't apply at all.Care to respin these last few patches?
thanks,
greg k-h
--
Greg the first 4 patches are the rest of the infrastructure.
Everything rebased quite nicely. All of the conflicts appear
to have been false positives.With the addition of sysfs_rename_link sysfs_create_link_nowarn
is never called so we can remove it.I'm not really certain whose tree the last netns or the user
namespace changes should live in, but I am continuing to
have those patches in this patchset for completeness.Eric W. Biederman (7):
1 sysfs: Implement sysfs tagged directory support.
2 sysfs: Merge sysfs_rename_dir and sysfs_move_dir
3 sysfs: Implement sysfs_delete_link and sysfs_rename_link
4 driver core: Implement tagged directory support for device classes.5 sysfs: Remove sysfs_create_link_nowarn
6 Revert "netns: Fix device renaming for sysfs"
7 netns: Enable tagging for net_class directories in sysfsSerge Hallyn (1):
8 sysfs: user namespaces: fix bug with clone(CLONE_NEWUSER) with fairsched
--
The problem. When implementing a network namespace I need to be able
to have multiple network devices with the same name. Currently this
is a problem for /sys/class/net/*, /sys/devices/virtual/net/*, and
potentially a few other directories of the form /sys/ ... /net/*.What this patch does is to add an additional tag field to the
sysfs dirent structure. For directories that should show different
contents depending on the context such as /sys/class/net/, and
/sys/devices/virtual/net/ this tag field is used to specify the
context in which those directories should be visible. Effectively
this is the same as creating multiple distinct directories with
the same name but internally to sysfs the result is nicer.I am calling the concept of a single directory that looks like multiple
directories all at the same path in the filesystem tagged directories.For the networking namespace the set of directories whose contents I need
to filter with tags can depend on the presence or absence of hotplug
hardware or which modules are currently loaded. Which means I need
a simple race free way to setup those directories as tagged.To achieve a reace free design all tagged directories are created
and managed by sysfs itself.Users of this interface:
- define a type in the sysfs_tag_type enumeration.
- call sysfs_register_tag_types with the type and it's operations
- call sysfs_make_tagged_dir with the tag type on directories
to be managed by this tag type
- sysfs_exit_tag when an individual tag is no longer valid- Implement mount_tag() which returns the tag of the calling process
so we can attach it to a sysfs superblock.
- Implement ktype.sysfs_tag() which returns the tag of a syfs kobject.Everything else is left up to sysfs and the driver layer.
For the network namespace mount_tag and sysfs_tag are essentially
one line functions, and look to remain that.Tags are currently represented a const void * pointers as that is
both generic, prevides enough information for equal...
______^
This typo is still present in your patch in the CONFIG_SYSFS=n case.
Otherwise the patchset, combined with the patches Greg has already
merged in his tree, still works great for me with network namespaces.--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
Hello,
I hope that this patch (from 4.7.08) was not forgetten... I don't see
it for example in linux-net (I have an up-to-date linux-next git
tree).Regards,
Mark--
I was about to ask the same thing.
Greg, what is the plan with these remaining patches for sysfs tagged
dirs? Will you have some time to merge them in your tree soon?I ran my network namespace tests last week with this latest patchset
from Eric applied on top of your tree (which already contains the first
batch of patches). Everything looked good.Regards,
--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
Yes, they should go into my tree this week, sorry been busy with 2.6.27
work and Novell's Hackweek.thanks,
greg k-h
--
I am also interested in this patch; may I ask - what do you mean by
"my tree" ?I am a little newbie in the kernel, as you might understand.
I looked into http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/
for candidates for this git tree you are talking about.May I ask: what is the **exact** URL for this git tree you are talking about ?
Thanks,
Regards,
DS--
Greg's tree is there:
http://git.kernel.org/?p=linux/kernel/git/gregkh/patches.git;a=summaryYou can clone it from here:
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/patches.gitIt contains the patches serie to be applied on top of
linux-2.6 (with 'quilt' for example).--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
Hello,
So I wonder: the sysfs tagged directory support patch is in GerKH
tree for more than a month.
I cloned today latest Linus tree (2.6.27-rc9) and
it is not there as far as I can see.
It is also not in linux-next tree (from september).
Now, I wonder what is the process of merging this
GregKH tree ? should I watch the LKML list for a pull request
from GregKH? or will it be first merged into the linux-next tree?Regards,
Mark--
They have been dropped.
I dropped it from my tree 2 days ago, see the thread on lkml for why.
thanks,
greg k-h
--
Greg why did you drop it?
Eric
--
Because Al said it was full of problems and for us not to accept it.
thanks,
greg k-h
--
Yes. Al found problems.
Al reviewed sysfs with my patchset on top of it.
Al's review found problems in sysfs with my patchset on top of it.
If you look at what Al found the majority of those problems exist in sysfs
without my patches.Does the following sound like a workable path going forward?
- Not merge for 2.6.28 (we are to close to the merge window for everyone's comfort).
- Fix the small issues specific to tagged directory support that showed
up in Al's review (patches sent).- Keep the patches the entire time in a public tree that merges into
linux-next so that people treat this code base seriously.- Resolve the sysfs/vfs lock ordering problem mess that makes
locking in sysfs excruciatingly difficult.- Merge other patches to fix Als other issues with sysfs.
- Merge to 2.6.29 or wherever we are.
Eric
--
And when something is crap your fix it firdt before piling up more shit
on top of it. And sysfs is a really severe case of that, and you're
piling a _lot_ of shit ontop.--
Chistoph, your comments and Al's would have been much more productive
if you have had said:"I didn't like sysfs because it doesn't do things the way other filesystems
with similar problems do things. Can you please use common idioms?
Making the code easier to read and making the code easier to maintain.
Some of those constructs look awfully complex can you recheck you code
and see if there is a simpler way to implement them."That would have been honest and productive. As it sits. I have partially
inaccurate feedback from Al, useless feedback from you, and only my own
tough skin and determination to keep me going..The fact that you and Al look at the code and can't easily make sense of
is a good sign that the code as written will be hard to maintain. Al's
recent breakage of sysctl is a good example of that.Eric
--
Hi,
Well, I believe that again it was somehow forgotten or was not applied for
other reasons, since polling the following URL several times in last
days did not
show this patch (url:
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/patches.git)Regards,
MR--
These two functions do 90% of the same work and it doesn't significantly
obfuscate the function to allow both the parent dir and the name to change
at the same time. So merge them together to simplify maintenance, and
increase testing.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/sysfs/dir.c | 121 +++++++++++++++++--------------------------------------
1 files changed, 38 insertions(+), 83 deletions(-)diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index dec7586..a76fb54 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -955,44 +955,57 @@ err_out:
return error;
}-int sysfs_rename_dir(struct kobject * kobj, const char *new_name)
+static int sysfs_mv_dir(struct sysfs_dirent *sd,
+ struct sysfs_dirent *new_parent_sd, const char *new_name)
{
- struct sysfs_dirent *sd = kobj->sd;
struct list_head todo;
struct sysfs_rename_struct *srs;
- struct inode *parent_inode = NULL;
+ struct inode *old_parent_inode = NULL, *new_parent_inode = NULL;
const char *dup_name = NULL;
const void *old_tag, *tag;
int error;INIT_LIST_HEAD(&todo);
+ BUG_ON(!sd->s_parent);
mutex_lock(&sysfs_rename_mutex);
+ if (!new_parent_sd)
+ new_parent_sd = &sysfs_root;
+
old_tag = sd->s_tag;
tag = sysfs_creation_tag(sd->s_parent, sd);error = 0;
- if ((old_tag == tag) && (strcmp(sd->s_name, new_name) == 0))
- goto out; /* nothing to rename */
+ if ((sd->s_parent == new_parent_sd) && (old_tag == tag) &&
+ (strcmp(sd->s_name, new_name) == 0))
+ goto out; /* nothing to do */sysfs_grab_supers();
if (old_tag == tag) {
- error = prep_rename(&todo, sd, sd->s_parent, new_name);
+ error = prep_rename(&todo, sd, new_parent_sd, new_name);
if (error)
goto out_release;
}error = -ENOMEM;
mutex_lock(&sysfs_mutex);
- parent_inode = sysfs_get_inode(sd->s_parent);
+ old_parent_inode = sysfs_get_inode(sd->s_parent);
+ new_parent_inode = sysfs_...
When removing a symlink sysfs_remove_link does not provide
enough information to figure out which tagged directory the symlink
falls in. So I need sysfs_delete_link which is passed the target
of the symlink to delete.Further half the time when we are removing a symlink the code is
actually renaming the symlink but not doing so explicitly because
we don't have a symlink rename method. So I have added sysfs_rename_link
as well.Both of these functions now have enough information to find a symlink
in a tagged directory. The only restriction is that they must be called
before the target kobject is renamed or deleted. If they are called
later I loose track of which tag the target kobject was marked with
and can no longer find the old symlink to remove it.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/sysfs/symlink.c | 31 +++++++++++++++++++++++++++++++
include/linux/sysfs.h | 17 +++++++++++++++++
2 files changed, 48 insertions(+), 0 deletions(-)diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index 54b2e5f..2a64645 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -105,6 +105,21 @@ int sysfs_create_link_nowarn(struct kobject *kobj, struct kobject *target,
}/**
+ * sysfs_delete_link - remove symlink in object's directory.
+ * @kobj: object we're acting for.
+ * @targ: object we're pointing to.
+ * @name: name of the symlink to remove.
+ *
+ * Unlike sysfs_remove_link sysfs_delete_link has enough information
+ * to successfully delete symlinks in tagged directories.
+ */
+void sysfs_delete_link(struct kobject *kobj, struct kobject *targ,
+ const char *name)
+{
+ sysfs_hash_and_remove(targ, kobj->sd, name);
+}
+
+/**
* sysfs_remove_link - remove symlink in object's directory.
* @kobj: object we're acting for.
* @name: name of the symlink t...
This patch enables tagging on every class directory if struct class
has a tag_type.In addition device_del and device_rename were modified to use
sysfs_delete_link and sysfs_rename_link respectively to ensure
when these operations happen on devices whose classes have
tag_ops that they work properly.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
---
drivers/base/class.c | 30 ++++++++++++++++++++++---
drivers/base/core.c | 56 ++++++++++++++++++++++++++++++------------------
include/linux/device.h | 3 ++
3 files changed, 64 insertions(+), 25 deletions(-)diff --git a/drivers/base/class.c b/drivers/base/class.c
index cc5e28c..0cd5704 100644
--- a/drivers/base/class.c
+++ b/drivers/base/class.c
@@ -135,6 +135,17 @@ static void remove_class_attrs(struct class *cls)
}
}+static int class_setup_tagging(struct class *cls)
+{
+ enum sysfs_tag_type type;
+
+ type = cls->tag_type;
+ if (type == SYSFS_TAG_TYPE_NONE)
+ return 0;
+
+ return sysfs_make_tagged_dir(&cls->p->class_subsys.kobj, type);
+}
+
int __class_register(struct class *cls, struct lock_class_key *key)
{
struct class_private *cp;
@@ -171,13 +182,24 @@ int __class_register(struct class *cls, struct lock_class_key *key)
cls->p = cp;error = kset_register(&cp->class_subsys);
- if (error) {
- kfree(cp);
- return error;
- }
+ if (error)
+ goto out_free_cp;
+
+ error = class_setup_tagging(cls);
+ if (error)
+ goto out_unregister;
+
error = add_class_attrs(class_get(cls));
class_put(cls);
+ if (error)
+ goto out_unregister;
+out:
return error;
+out_unregister:
+ kset_unregister(&cp->class_subsys);
+out_free_cp:
+ kfree(cp);
+ goto out;
}
EXPORT_SYMBOL_GPL(__class_register);diff --git a/drivers/base/core.c b/drivers/base/core.c
index 2bf7116..4fb9b00 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -122,9 +122,21 @@ static void ...
All of the uses have been replaced by sysfs_rename_link which
is a clearer primitive to is also needed for the tagged directory
support.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/sysfs/symlink.c | 15 ---------------
include/linux/sysfs.h | 10 ----------
2 files changed, 0 insertions(+), 25 deletions(-)diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index 2a64645..3c7a338 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -90,21 +90,6 @@ int sysfs_create_link(struct kobject *kobj, struct kobject *target,
}/**
- * sysfs_create_link_nowarn - create symlink between two objects.
- * @kobj: object whose directory we're creating the link in.
- * @target: object we're pointing to.
- * @name: name of the symlink.
- *
- * This function does the same as sysf_create_link(), but it
- * doesn't warn if the link already exists.
- */
-int sysfs_create_link_nowarn(struct kobject *kobj, struct kobject *target,
- const char *name)
-{
- return sysfs_do_create_link(kobj, target, name, 0);
-}
-
-/**
* sysfs_delete_link - remove symlink in object's directory.
* @kobj: object we're acting for.
* @targ: object we're pointing to.
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 1204d45..4e1bfdb 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -112,9 +112,6 @@ void sysfs_remove_bin_file(struct kobject *kobj, struct bin_attribute *attr);int __must_check sysfs_create_link(struct kobject *kobj, struct kobject *target,
const char *name);
-int __must_check sysfs_create_link_nowarn(struct kobject *kobj,
- struct kobject *target,
- const char *name);
void sysfs_remove_link(struct kobject *kobj, const char *name);int sysfs_rename_link(struct kobject *kobj, struct kobject *target,
@@ -211,13 +208,6 @@ static inline int sysfs_create_link(struct kobject *kobj,
return 0;
}-static inline int sysfs_create_link_nowarn(struct kobject *kobj,
- struct kob...
This reverts commit aaf8cdc34ddba08122f02217d9d684e2f9f5d575.
Drivers like the ipw2100 call device_create_group when they
are initialized and device_remove_group when they are shutdown.
Moving them between namespaces deletes their sysfs groups early.In particular the following call chain results.
netdev_unregister_kobject -> device_del -> kobject_del -> sysfs_remove_dir
With sysfs_remove_dir recursively deleting all of it's subdirectories,
and nothing adding them back.Ouch!
Therefore we need to call something that ultimate calls sysfs_mv_dir
as that sysfs function can move sysfs directories between namespaces
without deleting their subdirectories or their contents. Allowing
us to avoid placing extra boiler plate into every driver that does
something interesting with sysfs.Currently the function that provides that capability is device_rename.
That is the code works without nasty side effects as originally written.So remove the misguided fix for moving devices between namespaces. The
bug in the kobject layer that inspired it has now been recognized and
fixed.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
net/core/dev.c | 4 +---
net/core/net-sysfs.c | 7 +------
net/core/net-sysfs.h | 1 -
3 files changed, 2 insertions(+), 10 deletions(-)diff --git a/net/core/dev.c b/net/core/dev.c
index 600bb23..a7e236c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3989,7 +3989,6 @@ int register_netdevice(struct net_device *dev)
if (dev->features & NETIF_F_SG)
dev->features |= NETIF_F_GSO;- netdev_initialize_kobject(dev);
ret = netdev_register_kobject(dev);
if (ret)
goto err_uninit;
@@ -4451,8 +4450,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
}/* Fixup kobjects */
- netdev_unregister_kobject(dev);
- err = netdev_register_kobject(dev);
+ err = device_rename(&dev->dev, dev->name);
WARN_ON(err);/* Add the device back in t...
From: ebiederm@xmission.com (Eric W. Biederman)
Acked-by: David S. Miller <davem@davemloft.net>
--
The problem. Network devices show up in sysfs and with the network
namespace active multiple devices with the same name can show up in
the same directory, ouch!To avoid that problem and allow existing applications in network namespaces
to see the same interface that is currently presented in sysfs, this
patch enables the tagging directory support in sysfs.By using the network namespace pointers as tags to separate out the
the sysfs directory entries we ensure that we don't have conflicts
in the directories and applications only see a limited set of
the network devices.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/linux/sysfs.h | 1 +
net/Kconfig | 2 +-
net/core/net-sysfs.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 35 insertions(+), 1 deletions(-)diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 4e1bfdb..6d7eb50 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -82,6 +82,7 @@ struct sysfs_dirent;enum sysfs_tag_type {
SYSFS_TAG_TYPE_NONE = 0,
+ SYSFS_TAG_TYPE_NETNS,
SYSFS_TAG_TYPES
};diff --git a/net/Kconfig b/net/Kconfig
index 7612cc8..29e7e09 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -27,7 +27,7 @@ menu "Networking options"
config NET_NS
bool "Network namespace support"
default n
- depends on EXPERIMENTAL && !SYSFS && NAMESPACES
+ depends on EXPERIMENTAL && NAMESPACES
help
Allow user space to create what appear to be multiple instances
of the network stack.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c9968f1..e4687af 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -13,7 +13,9 @@
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/if_arp.h>
+#include <linux/nsproxy.h>
#include <net/sock.h>
+#include <net/net_namespace.h>
#include <linux/rtnetlink.h>
#include <linux/wireless.h>
#include <n...
From: ebiederm@xmission.com (Eric W. Biederman)
Acked-by: David S. Miller <davem@davemloft.net>
--
From: Serge Hallyn <serge@us.ibm.com>
Mark the /sys/kernel/uids directory to be tagged so that processes in
different user namespaces can remount /sys and see their own uid
listings.Without this patch, having CONFIG_FAIR_SCHED=y makes user namespaces
unusable, because when you
clone(CLONE_NEWUSER)
it will auto-create the root userid and try to create
/sys/kernel/uids/0. Since that already exists from the parent user
namespace, the create fails, and the clone misleadingly ends up
returning -ENOMEM.This patch fixes the issue by allowing each user namespace to remount
/sys, and having /sys filter the /sys/kernel/uid/ entries by user
namespace.Changelong:
v2 - Reworked for the updated sysfs apiSigned-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/linux/sched.h | 1 +
include/linux/sysfs.h | 1 +
kernel/user.c | 22 ++++++++++++++++++++++
kernel/user_namespace.c | 1 +
4 files changed, 25 insertions(+), 0 deletions(-)diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5850bfb..b0fe15a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -600,6 +600,7 @@ struct user_struct {
/* Hash table maintenance information */
struct hlist_node uidhash_node;
uid_t uid;
+ struct user_namespace *user_ns;#ifdef CONFIG_USER_SCHED
struct task_group *tg;
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 6d7eb50..ac88374 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -83,6 +83,7 @@ struct sysfs_dirent;
enum sysfs_tag_type {
SYSFS_TAG_TYPE_NONE = 0,
SYSFS_TAG_TYPE_NETNS,
+ SYSFS_TAG_TYPE_USERNS,
SYSFS_TAG_TYPES
};diff --git a/kernel/user.c b/kernel/user.c
index 865ecf5..ca29fbc 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -53,6 +53,7 @@ struct user_struct ro...
From: ebiederm@xmission.com (Eric W. Biederman)
You can send it all through Greg or whoever, don't feel obligated to
push it through me since it depends upon the earlier bits in this
series.
--
Sounds like a plan.
Looks like we are just talking about the driver core patches so while a pain
it actually should not be too hard.I will get a copy of your gregkh tree from kernel.org tomorrow and see
what I can do.Eric
--
These two functions do 90% of the same work and it doesn't
significantly obfuscate the function to allow both the parent dir and
the name to change at the same time. So merge them together to
simplify maintenance, and increase testing.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/sysfs/dir.c | 121 +++++++++++++++++--------------------------------------
1 files changed, 38 insertions(+), 83 deletions(-)diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 6dc3376..fe2bb1c 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -924,44 +924,57 @@ err_out:
return error;
}-int sysfs_rename_dir(struct kobject * kobj, const char *new_name)
+static int sysfs_mv_dir(struct sysfs_dirent *sd,
+ struct sysfs_dirent *new_parent_sd, const char *new_name)
{
- struct sysfs_dirent *sd = kobj->sd;
struct list_head todo;
struct sysfs_rename_struct *srs;
- struct inode *parent_inode = NULL;
+ struct inode *old_parent_inode = NULL, *new_parent_inode = NULL;
const char *dup_name = NULL;
const void *old_tag, *tag;
int error;INIT_LIST_HEAD(&todo);
+ BUG_ON(!sd->s_parent);
mutex_lock(&sysfs_rename_mutex);
+ if (!new_parent_sd)
+ new_parent_sd = &sysfs_root;
+
old_tag = sd->s_tag;
tag = sysfs_creation_tag(sd->s_parent, sd);error = 0;
- if ((old_tag == tag) && (strcmp(sd->s_name, new_name) == 0))
- goto out; /* nothing to rename */
+ if ((sd->s_parent == new_parent_sd) && (old_tag == tag) &&
+ (strcmp(sd->s_name, new_name) == 0))
+ goto out; /* nothing to do */sysfs_grab_supers();
if (old_tag == tag) {
- error = prep_rename(&todo, sd, sd->s_parent, new_name);
+ error = prep_rename(&todo, sd, new_parent_sd, new_name);
if (error)
goto out_release;
}error = -ENOMEM;
mutex_lock(&sysfs_mutex);
- parent_inode = sysfs_get_inode(sd->s_parent);
+ old_parent_inode = sysfs_get_inode(sd->s_parent);
+ new_parent_inode = sysfs_...
When removing a symlink sysfs_remove_link does not provide
enough information to figure out which tagged directory the symlink
falls in. So I need sysfs_delete_link which is passed the target
of the symlink to delete.Further half the time when we are removing a symlink the code is
actually renaming the symlink but not doing so explicitly because
we don't have a symlink rename method. So I have added sysfs_rename_link
as well.Both of these functions now have enough information to find a symlink
in a tagged directory. The only restriction is that they must be called
before the target kobject is renamed or deleted. If they are called
later I loose track of which tag the target kobject was marked with
and can no longer find the old symlink to remove it.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
fs/sysfs/symlink.c | 31 +++++++++++++++++++++++++++++++
include/linux/sysfs.h | 17 +++++++++++++++++
2 files changed, 48 insertions(+), 0 deletions(-)diff --git a/fs/sysfs/symlink.c b/fs/sysfs/symlink.c
index de9a5c0..ed9c52c 100644
--- a/fs/sysfs/symlink.c
+++ b/fs/sysfs/symlink.c
@@ -80,6 +80,21 @@ int sysfs_create_link(struct kobject * kobj, struct kobject * target, const char
}/**
+ * sysfs_delete_link - remove symlink in object's directory.
+ * @kobj: object we're acting for.
+ * @targ: object we're pointing to.
+ * @name: name of the symlink to remove.
+ *
+ * Unlike sysfs_remove_link sysfs_delete_link has enough information
+ * to successfully delete symlinks in tagged directories.
+ */
+void sysfs_delete_link(struct kobject *kobj, struct kobject *targ,
+ const char *name)
+{
+ sysfs_hash_and_remove(targ, kobj->sd, name);
+}
+
+/**
* sysfs_remove_link - remove symlink in object's directory.
* @kobj: object we're acting for.
* @name: name of the symli...
This patch enables tagging on every class directory if struct class
has a tag_type.In addition device_del and device_rename were modified to uses
sysfs_delete_link and sysfs_rename_link respectively to ensure
when these operations happen on devices whose classes have
tag_ops that they work properly.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
---
drivers/base/class.c | 30 ++++++++++++++++++++++---
drivers/base/core.c | 56 +++++++++++++++++++++++++++++++++--------------
include/linux/device.h | 3 ++
3 files changed, 68 insertions(+), 21 deletions(-)diff --git a/drivers/base/class.c b/drivers/base/class.c
index 839d27c..cf4e03f 100644
--- a/drivers/base/class.c
+++ b/drivers/base/class.c
@@ -135,6 +135,17 @@ static void remove_class_attrs(struct class *cls)
}
}+static int class_setup_tagging(struct class *cls)
+{
+ enum sysfs_tag_type type;
+
+ type = cls->tag_type;
+ if (type == SYSFS_TAG_TYPE_NONE)
+ return 0;
+
+ return sysfs_make_tagged_dir(&cls->p->class_subsys.kobj, type);
+}
+
int __class_register(struct class *cls, struct lock_class_key *key)
{
struct class_private *cp;
@@ -171,13 +182,24 @@ int __class_register(struct class *cls, struct lock_class_key *key)
cls->p = cp;error = kset_register(&cp->class_subsys);
- if (error) {
- kfree(cp);
- return error;
- }
+ if (error)
+ goto out_free_cp;
+
+ error = class_setup_tagging(cls);
+ if (error)
+ goto out_unregister;
+
error = add_class_attrs(class_get(cls));
class_put(cls);
+ if (error)
+ goto out_unregister;
+out:
return error;
+out_unregister:
+ kset_unregister(&cp->class_subsys);
+out_free_cp:
+ kfree(cp);
+ goto out;
}
EXPORT_SYMBOL_GPL(__class_register);diff --git a/drivers/base/core.c b/drivers/base/core.c
index 90621a4..b009d5b 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -124,9 +124,21 @@ static void ...
Okay, I went through the users this time but I still think
determine-tags-by-callbacks is a bad idea. Please just add const void
*tag to kobject and set it during initialization. If you want to move a
device from one tag to another, implement kobject_rename_tagged(kobj,
new_name, new_tag).The determine-tag-by-callback basically multiplexes basic functions to
do tag-specific things which are determined by ktype callback called
back from down the layer. It's simply a bad interface. Those
operations become something else depending on how those callbacks
behave. That's unnecessarily subtle. Advertising what it's gonna do in
the function name and as arguments is way more straight forward and it's
not like determining or renaming tags should be done asynchronously.I personally think it would be better to make tags explicit in the mount
interface too but if extracting ns information from the mounting process
is what's currently being done, well...I'm sorry but Nacked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
--
Thank you for your opinion.
Incremental patches to make things more beautiful are welcome.
Please remember we are not building lisp. The goal is code that works today.
Since we are not talking about correctness of the code. Since we are not
talking about interfaces with user space. Since we are talking something
that is currently about 100 lines of code, and so will be easy to change
even after it is merged. I don't understand how discussing this further
is useful. Especially when I get a NAK based on the feel that the code
is ugly.As for your main objection. Adding a accessor method to an object versus
adding a data field that contain the same thing. The two are effectively
identical. With the practical difference in my eyes that an accessor method
prevents data duplication which reduces maintenance and reduces skew problems,
and it keeps the size of struct kobject small. Since you think methods are
horrible I must respectfully disagree with you.Eric
--
Hello, Eric.
I'm sorry if I gave you the impression of being draconian. Explanations
Yeah, it seems we should agree to disagree here. I think using callback
for static values is a really bad idea. It obfuscates the code and
opens up a big hole for awful misuses. Greg, what do you think?As we're very close to rc1 window, I think we can work out a solution
here. The reason why I nack'd was because the change wouldn't take too
much effort and I thought it could be done before -rc1. Unless you
disagree with making tags static values, I'll try to write up a patch to
do so. If you (and Greg) think the callback interface is better, we can
merge the code as-is and update (or not) later.Thanks.
--
tejun
--
The misuse argument is small because currently all users must be compiled
into the kernel and must add to the static enumeration. I'm afraid weMaking a change and pushing down into the patches is much more time intensive
then I would like. The last round of changes simple as they were took something
between 16 and 30 hours, and has left me sapped. Keeping all of the other
pieces in flight in all of the other patches so I can't just focus on the
change at hand is what makes it difficult at this point.Adding an additional patch on top isn't too bad, but my creativity is sapped on
this right now. I agree that a function called device_rename isn't the best
possible name when we are changing tags, but I can't think of anything that
seems better.I know in the users that the tags are already quite static and that I call
kobject_rename in the one case where they change (which is a significant
exception). So that part doesn't concern me as I have not intention of using
the interface like that. Ultimately I don't care as long as we have code
that works.Eric
--
Sorry, Greg is walking out the door in 30 minutes for a much needed week
long vacation and can't look into this right now :(I'll be able to review it next weekend, sorry for the delay.
greg k-h
--
Any progress in reviewing these changes, and seeing if you can stand
to merge them?Eric
--
Greg, please disregard my earlier NACKs and commit the patches if you're
okay with them. I'm working on cleaning it up but I don't think I'll be
able to make it in time for merge window and as Eric said getting the
functionality in place is more important at this point as it doesn't
affect user visible interface.Eric, with the multiple superblocks, sysfs now uses inode from the
default sysfs_sb with dentries from other sb's. Is this okay? Are
there any other filesystems which do this?Thanks.
--
tejun
--
Ok, I'll work to get these in where applicable.
thanks,
greg k-h
--
Did this get into 2.6.27? Or will this have to wait till .28?
Thanks.
--
tejun
--
I haven't sent any patches to Linus yet for 2.6.27, so it hasn't gotten
there yet.And due to the intrusiveness, and the fact that this hasn't been tested
at all in any build tree yet, I can't in good concious submit this for
.27 either.Part of this is my fault, I know, due to vacation and work, but also
lots is due to the fact that the code showed up so late in the
development cycle.I'll add it to my tree, and get it some testing in the next cycle of
linux-next until 2.6.27 is out, and then if it's still looking good, go
into .28.Hope this helps,
greg k-h
--
Greg. Where are we at with this?
I just looked at your tree and I don't perhaps I'm blind but I don't see
this patches, and it looks like you at least refreshed your patches
a day or two ago.Eric
--
Any progress here? Nothing seems reached lkml...
--
Have a look at Greg's tree. :)
He started to merge the last patches today.--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
Cool! I pulled it 10 mins before I sent the email and it wasn't there. 5
mins after it, it was...Thanks.
--
Hello, Greg.
Heh... A lot of it is my fault too. Probably my share is bigger than
I was just curious how the merge will turn out as I'm about to do a
refresh pass on the sysfs locking and stuff. I'm not sure yet whether
to put those changes on top of or below ns patches but if I'll always
keep the ns patches updated.Thank you.
--
tejun
--
The code showed up at least by the beginning of may, and the code has
been around for 9 months or more. I just haven't always had the
energy to retransmit every time it has gotten dropped. The last spinI think that is unnecessarily precautions but reasonable. That should at
least keep other sysfs patches from coming in and causing bit rot.Eric
--
I don't know of any other filesystems where this unique challenge arises.
/proc almost qualifies but it never needs to be modified.It is certainly ok to go from multiple dentries to a single inode.
I'm trying to remember why I choose to do that. I think both because it simplifies
the locking and keeps us more efficient in the icache.Eric
--
Hello,
It's a bit scary tho. Working inode->i_dentry or dentry->d_alias
crosses multiple sb's. sysfs isn't too greedy about dcache/icache.
Only open files and directories hold them and only single copy of
sysfs_dirent is there for most nodes. Wouldn't it be better to stay on
the safer side and use separate inode hierarchy?Thanks.
--
tejun
--
To do that I believe we would need to ensure sysfs does not use
the inode->i_mutex lock except to keep the VFS layer out. Allowing us
to safely change the directory structure, without holding it.You raise a good point about inode->i_dentry and dentry->d_alias.
Generally they are used by fat like filesystems but I am starting to
see uses in generic pieces of code. I don't see any problems today
but yes it would be good to do the refactoring to allow us to duplicate
the inodes.Eric
--
Hello, Eric.
I don't think sysfs is depending on i_mutex anymore but I need to go
Yeah, I can't spot any place which can cause actual problem yet but it's
still scary as we're breaking a vfs assumption and even if it's not a
problem now, future seemingly unrelated changes can break things subtly.Thanks.
--
tejun
--
The vfs still does. So at least for directory tree manipulation we
need to hold i_mutex before we grab sysfs_mutex.I think that means we need to unscramble the whole set of locking
order issues.In lookup we have:
local_vfs_lock -> fs_global_lockIn modifications we have:
fs_global_lock -> local_vfs_lockWhich is the definition of a lock ordering problem.
Currently we play jump through some significant hoops to keep things
in local_vfs_lock -> fs_global_lock order.If we also take the rename_mutex on directory adds and deletes we
may be able to keep jumping through those hoops. However I expect
we would be in a much better situation if we could figure out how
to avoid the problem.It looks like the easy way to handle this is to make the sysfs_dirent
list rcu protected. Which means we can fix our lock ordering problem
without VFS modifications. Allowing the locking to always
be: sysfs_mutex ... i_mutex.After that it would be safe and a good idea to have unshared
inodes between superblocks, just so we don't surprise anyone
making generic VFS assumptions.Eric
--
Okay, one small problem spotted. It seems invalidate_inodes() can fail
which will make generic_shutdown_super() complain. It's not a fatal
failure tho.--
tejun
--
How when the inode list is empty? We don't unmount the superblock that
has the inodes.Eric
--
Understood and no problem.
Eric
--
This reverts commit aaf8cdc34ddba08122f02217d9d684e2f9f5d575.
Drivers like the ipw2100 call device_create_group when they
are initialized and device_remove_group when they are shutdown.
Moving them between namespaces deletes their sysfs groups early.In particular the following call chain results.
netdev_unregister_kobject -> device_del -> kobject_del -> sysfs_remove_dir
With sysfs_remove_dir recursively deleting all of it's subdirectories,
and nothing adding them back.Ouch!
Therefore we need to call something that ultimate calls sysfs_mv_dir
as that sysfs function can move sysfs directories between namespaces
without deleting their subdirectories or their contents. Allowing
us to avoid placing extra boiler plate into every driver that does
something interesting with sysfs.Currently the function that provides that capability is device_rename.
That is the code works without nasty side effects as originally written.So remove the misguided fix for moving devices between namespaces. The
bug in the kobject layer that inspired it has now been recognized and
fixed.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
net/core/dev.c | 4 +---
net/core/net-sysfs.c | 7 +------
net/core/net-sysfs.h | 2 +-
3 files changed, 3 insertions(+), 10 deletions(-)diff --git a/net/core/dev.c b/net/core/dev.c
index fca23a3..585584d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3806,7 +3806,6 @@ int register_netdevice(struct net_device *dev)
}
}- netdev_initialize_kobject(dev);
ret = netdev_register_kobject(dev);
if (ret)
goto err_uninit;
@@ -4239,8 +4238,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
}/* Fixup kobjects */
- netdev_unregister_kobject(dev);
- err = netdev_register_kobject(dev);
+ err = device_rename(&dev->dev, dev->name);
WARN_ON(err);/* Add the device back in the hashes */
diff --git a/net/core/net-sysfs.c b/net/core/net-sysf...
The problem. Network devices show up in sysfs and with the network
namespace active multiple devices with the same name can show up in
the same directory, ouch!To avoid that problem and allow existing applications in network namespaces
to see the same interface that is currently presented in sysfs, this
patch enables the tagging directory support in sysfs.By using the network namespace pointers as tags to separate out the
the sysfs directory entries we ensure that we don't have conflicts
in the directories and applications only see a limited set of
the network devices.Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/linux/sysfs.h | 1 +
net/Kconfig | 2 +-
net/core/net-sysfs.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 35 insertions(+), 1 deletions(-)diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c3a30ce..1ed31bb 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -80,6 +80,7 @@ struct sysfs_ops {enum sysfs_tag_type {
SYSFS_TAG_TYPE_NONE = 0,
+ SYSFS_TAG_TYPE_NETNS,
SYSFS_TAG_TYPES
};diff --git a/net/Kconfig b/net/Kconfig
index acbf7c6..9aad03b 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -30,7 +30,7 @@ menu "Networking options"
config NET_NS
bool "Network namespace support"
default n
- depends on EXPERIMENTAL && !SYSFS && NAMESPACES
+ depends on EXPERIMENTAL && NAMESPACES
help
Allow user space to create what appear to be multiple instances
of the network stack.
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 4e7b847..6227a28 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -13,7 +13,9 @@
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/if_arp.h>
+#include <linux/nsproxy.h>
#include <net/sock.h>
+#include <net/net_namespace.h>
#include <linux/rtnetlink.h>
#include <linux/wireless.h>
#include <net...
Mark the /sys/kernel/uids directory to be tagged so that processes in
different user namespaces can remount /sys and see their own uid
listings.Without this patch, having CONFIG_FAIR_SCHED=y makes user namespaces
unusable, because when you
clone(CLONE_NEWUSER)
it will auto-create the root userid and try to create
/sys/kernel/uids/0. Since that already exists from the parent user
namespace, the create fails, and the clone misleadingly ends up
returning -ENOMEM.This patch fixes the issue by allowing each user namespace to remount
/sys, and having /sys filter the /sys/kernel/uid/ entries by user
namespace.Changelong:
v2 - Reworked for the updated sysfs apiSigned-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/linux/sched.h | 1 +
include/linux/sysfs.h | 1 +
kernel/user.c | 22 ++++++++++++++++++++++
kernel/user_namespace.c | 1 +
4 files changed, 25 insertions(+), 0 deletions(-)diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d3f84..d2be6a5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -598,6 +598,7 @@ struct user_struct {
/* Hash table maintenance information */
struct hlist_node uidhash_node;
uid_t uid;
+ struct user_namespace *user_ns;#ifdef CONFIG_USER_SCHED
struct task_group *tg;
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 1ed31bb..ecb942c 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -81,6 +81,7 @@ struct sysfs_ops {
enum sysfs_tag_type {
SYSFS_TAG_TYPE_NONE = 0,
SYSFS_TAG_TYPE_NETNS,
+ SYSFS_TAG_TYPE_USERNS,
SYSFS_TAG_TYPES
};diff --git a/kernel/user.c b/kernel/user.c
index 865ecf5..ca29fbc 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -53,6 +53,7 @@ struct user_struct root_user = {
.files = ATOMIC_INIT(0),
.si...
The good news is that uevent_sock is currently restricted to just
the initial network namespace (so the functionality completely
disappears in the other namespaces), and that it is broadcast only.So it should be possible to look at who the client is and by some
magic criterian decide if it should receive the broadcast message.The call to the user mode helper is trickier. How do we setup the
proper user space context.None of this is fundamentally hard just different work, for a different
day.Eric
--
uevents can work with the network namespaces being compiled in and the
sysfs compiled out. AFAICS, uevents will be unable to handle multiple
network namespaces if it is tied with sysfs, no ?
--
Ah Ok, I am not really familiar with kobject/sysfs so I thought there
was a proposition to store the id in the kobject instead of using the
tag callbacks, so I figured, perhaps, the idr could have been used inYes as mentionned Benjamin, we have the eth0 in the init_net which is
shut down when a network namespace with a netdev with the same name
exits. There is a udev rule which ifdown eth0 :)
--
Indeed, we observed some fun things with one distro (which defines some
particular udev rules) when a device called eth0 in a namespace comes
back to init net :)--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D
Speaking of. Don't let me forget but I have a patch I need to send out
that deletes pseudo devices instead of sending them back to eth0. We can't
do that for real hardware obviously but for things like veth and macvlan
devices it greatly simplifies the cleanup.Eric
--
I always accept "reasonable patches" :)
thanks,
greg k-h
--
Now that the iproute2 patch is upstream, this patchset really is the
only thing keeping us from using network namespaces. Given that the
details of the tagging are trivially changeable with no abi changes, I'd
personally much rather see the patches go in as is, with whatever new
tagging patches Benjamin whips up, using ida or some new idea, being
applied later if we feel the need.-serge
--
My point exactly. No one seems to contest the userspace semantics so
as long as we don't put ourselves into a real mess we should be fine.Eric
--
| Greg Kroah-Hartman | [PATCH 006/196] Chinese: add translation of oops-tracing.txt |
| Jan Engelhardt | intel iommu (Re: -mm merge plans for 2.6.23) |
| James Bottomley | Re: Integration of SCST in the mainstream Linux kernel |
| Borislav Petkov | 2.6.23-rc1: no setup signature found... |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | [GIT]: Networking |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | Re: [BUG] New Kernel Bugs |
