put_css_set_taskexit may be called when find_css_set is called on
other cpu. And the race will occur:
put_css_set_taskexit side find_css_set side
|
atomic_dec_and_test(&kref->refcount) |
/* kref->refcount = 0 */ |
....................................................................
| read_lock(&css_set_lock)
| find_existing_css_set
| get_css_set
| read_unlock(&css_set_lock);
....................................................................
__release_css_set |
....................................................................
| /* use a released css_set */
|
[put_css_set is the same. But in the current code, all put_css_set are
put into cgroup mutex critical region as the same as find_css_set.]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 13932ab..003912e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -241,7 +241,6 @@ static void unlink_css_set(struct css_set *cg)
struct cg_cgroup_link *link;
struct cg_cgroup_link *saved_link;
- write_lock(&css_set_lock);
hlist_del(&cg->hlist);
css_set_count--;
@@ -251,8 +250,6 @@ static void unlink_css_set(struct css_set *cg)
list_del(&link->cgrp_link_list);
kfree(link);
}
-
- write_unlock(&css_set_lock);
}
static void __release_css_set(struct kref *k, int taskexit)
@@ -260,7 +257,13 @@ static void __release_css_set(struct kref *k, int taskexit)
int i;
struct css_set *cg = container_of(k, struct css_set, ref);
+ write_lock(&css_set_lock);
+ if (atomic_read(&k->refcount) > 0) {
+ write_unlock(&css_set_lock);
+ return;
+ }
...Sorry I didn't respond to this when it originally came out - I was on vacation.
I agree that it's a race that needs to be fixed, but I'm not sure that
I like the fix that can generate kref warnings.
I can see two possible fixes:
1) avoid the race entirely by introducing some new primitives:
atomic_dec_and_write_lock() (like atomic_dec_and_lock(), but for an rwlock)
and kref_put_and_write_lock() which would be something like:
int kref_put_and_write_lock(struct kref *kref, void (*release)(struct
kref *kref), rwlock*lock)
{
if(atomic_dec_and_write_lock(&kref->refcount, lock)) {
release(kref);
return 1;
}
return 0;
}
We'd then use kref_put_and_write_lock(), and enter __release_css_set()
with the lock already held
2) Use atomic_inc_not_zero() in find_existing_css_set(), to ensure
that we only return a referenced css, and remove the get_css_set()
call from find_css_set(). (Possibly wrapping this in a new
kref_get_not_zero() function)
Paul
--
[CC: Greg Kroah-Hartman <greg@kroah.com>]
There are indeed several ways fix this race by Using the
atomic-functions directly. I prefer the second one, i makes all
code clearly. And put_css_set[_taskexit] do not need to be changed.
I don't think adding kref_get_not_zero() API is a good idea.
It will bring kref APIs to a little chaos, kref_get_not_zero() is
hard to be used, for this function needs a special lock held.
But I tried:
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
---
diff --git a/include/linux/kref.h b/include/linux/kref.h
index 0cef6ba..400ffab 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -25,6 +25,7 @@ struct kref {
void kref_set(struct kref *kref, int num);
void kref_init(struct kref *kref);
void kref_get(struct kref *kref);
+int kref_get_not_zero(struct kref *kref);
int kref_put(struct kref *kref, void (*release) (struct kref *kref));
#endif /* _KREF_H_ */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 13932ab..0bbb98d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -347,6 +347,8 @@ static struct css_set *find_existing_css_set(
hlist_for_each_entry(cg, node, hhead, hlist) {
if (!memcmp(template, cg->subsys, sizeof(cg->subsys))) {
/* All subsystems matched */
+ if (!kref_get_not_zero(&cg->ref))
+ return NULL;
return cg;
}
}
@@ -410,8 +412,6 @@ static struct css_set *find_css_set(
* the desired set */
read_lock(&css_set_lock);
res = find_existing_css_set(oldcg, cgrp, template);
- if (res)
- get_css_set(res);
read_unlock(&css_set_lock);
if (res)
diff --git a/lib/kref.c b/lib/kref.c
index 9ecd6e8..b8c1ce6 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -46,6 +46,25 @@ void kref_get(struct kref *kref)
}
/**
+ * kref_get_not_zero - increment refcount for object if current refcount
+ * is not zero.
+ * @kref: object.
+ *
+ * Beware, the object maybe be being released, so we need a special lock held
+ * to ensure the object's refcount is ...need better wording... what do you mean by "remaining access"? to ensure the object --
Maybe it would be better to do: if (!memcmp(...) && kref_get_not_zero(&cg->ref)) return cg; Probably clearer to have a bool return type. Paul --
What are you trying to solve here with this change? I agree, it does seem a bit "chaotic" :) I thought we used to have something like this for kref in the past, but I must be mistaken as it's no longer there... thanks, greg k-h --
There's a place in cgroups that uses kref_put() to release an object; the release function *then* takes a write-lock and removes the object from a lookup table; it could race with another thread that searches the lookup table (while holding a read-lock) and does kref_get() on the same object. The current fix is for the release function to recheck inside the lock that the object's refcount is still zero, and only actually unlink/free it if so. And actually I've just realised that this isn't actually even safe, since the thread that just acquired the object could kref_put() it almost immediately, which would leave two threads both trying to unlink/free the object. The two solutions being considered are: - add a kref_put_and_write_lock(), similar to atomic_dec_and_lock(), which would ensure that the final refcount on the object was only released inside the lock - add a kref_get_if_not_zero(), which would prevent a lookup from succeeding if another thread had just dropped the last reference on the object. Paul --
Ick, yeah that's not good. What about the way everyone else solves this, grab the lock before you Yeah, don't do that :) thanks, greg k-h --
do_exit()
cgroup_exit()
put_css_set_taskexit()
kref_put()
If we grab the lock before kref_put(), we add overhead to do_exit(), which
--
But you can't put such logic in the release() function as you are finding out, that's not going to work either. Maybe you need to just "open-code" an atomic counter here and not use kref as it sounds like you are needing to do something very "special" here. thanks, greg k-h --
Right - that's why a kref_put_and_write_lock() function would be handy - it would take the lock only when necessary. It would delegate most of its work to the (as-yet nonexistant) atomic_dec_and_write_lock() It's basically the same situation as the dentry cache - ref-counted objects that are also referenced from a hash table protected by a global lock. If you're opposed to the addition of kref_put_and_write_lock() then yes, I'll replace kref with a custom refcount. Paul --
It just seems messy, but if you want to try it, I'll be glad to look at the code. Oh wait, was that the patch that you sent out last time? thanks, greg k-h --
Yes - that patch replaced kref with a plain refcount. We weren't really using kref as anything other than an atomic_dec_and_test() wrapper anyway. Paul --
