This series tidies up the refcount of sysfs_dirents in sysfs, using kref where appropriate and a new karef for s_active. This achieves significant code simplification, especially the first patch. This is in part inspired by http://lwn.net/Articles/336224/ :-) NeilBrown --- NeilBrown (3): sysfs: simplify handling for s_active refcount sysfs: make s_count a kref kref: create karef and use for sysfs_dirent->s_active fs/sysfs/dir.c | 74 ++++++++++++++++++++------------------------------ fs/sysfs/mount.c | 3 +- fs/sysfs/sysfs.h | 19 +++++-------- include/linux/kref.h | 38 ++++++++++++++++++++++++++ lib/kref.c | 13 +++++++++ 5 files changed, 90 insertions(+), 57 deletions(-) -- Signature --
s_active counts the number of active references to a 'sysfs_direct'.
When we wish to deactivate a sysfs_direct, we subtract a large
number for the refcount so it will always appear negative. When
it is negative, new references will not be taken.
After that subtraction, we wait for all the active references to
drain away.
The subtraction of the large number contains exactly the same
information as the setting of the flag SYSFS_FLAG_REMOVED.
(We know this as we already assert that SYSFS_FLAG_REMOVED is set
before adding the large-negative-bias).
So doing both is pointless.
By starting s_active with a value of 1, not 0 (as is typical of
reference counts) and using atomic_inc_not_zero, we can significantly
simplify the code while keeping exactly the same functionality.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/sysfs/dir.c | 60 ++++++++++++++----------------------------------------
fs/sysfs/sysfs.h | 2 --
2 files changed, 16 insertions(+), 46 deletions(-)
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 5907178..76a2d10 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -95,26 +95,13 @@ static void sysfs_unlink_sibling(struct sysfs_dirent *sd)
*/
struct sysfs_dirent *sysfs_get_active(struct sysfs_dirent *sd)
{
- if (unlikely(!sd))
+ if (likely(sd)
+ && (sd->s_flags & SYSFS_FLAG_REMOVED) == 0
+ && atomic_inc_not_zero(&sd->s_active)) {
+ rwsem_acquire_read(&sd->dep_map, 0, 1, _RET_IP_);
+ return sd;
+ } else
return NULL;
-
- while (1) {
- int v, t;
-
- v = atomic_read(&sd->s_active);
- if (unlikely(v < 0))
- return NULL;
-
- t = atomic_cmpxchg(&sd->s_active, v, v + 1);
- if (likely(t == v)) {
- rwsem_acquire_read(&sd->dep_map, 0, 1, _RET_IP_);
- return sd;
- }
- if (t < 0)
- return NULL;
-
- cpu_relax();
- }
}
/**
@@ -126,34 +113,26 @@ struct sysfs_dirent *sysfs_get_active(struct sysfs_dirent *sd)
*/
void sysfs_put_active(struct sysfs_dirent *sd)
{
- struct completion *cmpl;
- int v;
-
...Overall your logic appears correct but in detail this patch scares me. sd->s_flags is protected by the sysfs_mutex, and you aren't taking it when you read it. So in general I don't see the new check if (sd->s_flags & SYSFS_FLAG_REMOVED) == 0 providing any guarantee of progress whatsoever with user space applications repeated reading from a sysfs file when that sysfs file is being removed. They could easily have the sd->s_flags value cached and never see the new value, given a crazy enough cache architecture. So as attractive as this patch is I don't think it is correct. Eric --
On Thu, 25 Mar 2010 21:24:43 -0700 As you say, this is only a liveness issue. The atomic_inc_not_zero guarantees that we don't take a new reference after the last one is gone. The test on SYSFS_FLAG_REMOVED is only there to ensure that the count does eventually get to zero. There could only be a problem here if the change to s_flags was not propagated to all CPUs in some reasonably small time. I'm not expert on these things but it was my understanding that interesting cache architectures could arbitrarily re-order accesses, but does not delay them indefinitely. Inserting barriers could possibly make this more predictable, but that would just delay certain loads/stores until a known state was reached - it would not make the data visible at another CPU any faster (or would it?). So unless there is no cache-coherency protocol at all, I think that SYSFS_FLAG_REMOVED will be seen promptly and that s_active will drop to zero I'm pleased you find it attractive - I certainly think the "atomic_inc_not_zero" is much more readable than the code it replaces. Hopefully if there are really problems (maybe I've fundamentally misunderstood caches) they can be easily resolved (a couple of memory barriers at worst?). Thanks for the review, NeilBrown --
Hello, Neil. Oh, no, please don't do that. That will inject a whole lot more of complexity into the mix. Now, it's only a problem of logical complexity. If you add memory barriers there, it not only complicates the problem itself beyond recognition but also reduces problem reproducibility (especially because x86/64 are relatively strongly ordered) and thus test coverage significantly. Now the refcounting can be understood by most people who put some time into it but if you put memory barriers into it, only Oleg would be able to identify problems. :-P If you really want to kill the bias and in an easy readable way, please put it inside a struct w/ dedicated spinlock but if you are gonna do that it might as well be better to simply implement the bias anti-pattern correctly there once. Thanks. -- tejun --
Having this handled correctly is a key part of the abstraction implemented with sysfs_get_active sysfs_put_active and sysfs_deactivate. It is a requirement that userspace can not cuause This is my problem with your patch. You have replaced something that used an existing abstraction properly, with something that does not use any abstraction at all and we are reduced to talking about cpu memory ordering requirements. Furthermore this code you are changing is not used raw but it is the building blocks for a well defined abstraction. The users of sysfs_get_active sysfs_put_active and sysfs_deactivate don't have to know the details of how they are implemented to think about and reason about it accurately. I am very much in favor of a general primitive that we can use for blocking ref counts. We have them in sysfs, proc, sysctl, and a few other places with similar requirements. Each location has a different implementation, but essentially the same semantics. What is really needed is something with the semantics of: read_trylock read_unlock write_lock() perhaps synchronize? With multiple reads in side the read side critical section, and biased so that the read side is very cheap, and the data structure size is very cheap. We have an ugly but otherwise near optimal implementation in sysfs. srcu is close the data structure costs too much to use on every sysfs dirent. Improving and getting it right is very attractive. The code that exists today is correct and signals that something fishy is going on so there is no point in doing something less than right. Eric --
On Fri, 26 Mar 2010 00:53:48 -0700
We are not talking about memory ordering requirements. There are no memory
ordering requirements introduced in the patch. It really doesn't matter if
the setting of SYSFS_FLAG_REMOVED is seen before or after anything else.
All that matters is that it does get seen, and it will be seen long before
you could possibly describe the situation as 'denial of service'.
However if we do consider memory ordering guarantees we can describe a clear
limit to the possibly delay between SYSFS_FLAG_REMOVED being set, and being
seen. The atomic_inc_not_zero serves as a memory barrier in exactly the same
way that the current code requires atomic_dec_return. So while
if (likely(sd)
&& (sd->s_flags & SYSFS_FLAG_REMOVED) == 0
&& atomic_inc_not_zero(&sd->s_active)) {
could possibly gain a reference even 'after' SYS_FLAG_REMOVED as been set,
a second call to this on the same processor will see SYSFS_FLAG_REMOVED.
So at the absolute most, we could see NCPUS active references gained and
dropped after SYSFS_FLAG_REMOVED was set - a clear limit which is all we need.
I'm still not sure we even need to argue in terms of memory barriers to be
sure the code is correct, but it seems they are sufficient to give a simple
proof.
The sysfs code already has a dependency on implicit memory barriers to ensure
that ->s_sibling points to a completion. My change does not significantly
True. However it is raw in the sense that if some other developer wanted
similar functionality in their own code they might copy the low level stuff
as they wouldn't be able to use the sys_* routines directly. Copying
Yes - I think the synergy between locks an refcounts is really interesting...
As you say, simply using a general abstraction which provides these functions
would incur a space cost that you don't want. You avoid this cost by
re-using s_sibling to store a reference to a completion.
Finding a primitive that allows such optimisations is an ...On Mon, 29 Mar 2010 15:43:25 +1100 It just occurred to me that this 'proof' isn't quite complete in itself. I need to also show that there is a suitable memory barrier after SYSFS_FLAG_REMOVED is set. There is as it is always set under sysfs_mutex, so the mutex_unlock provides a barrier. So after sysfs_mutex is unlocked, it is conceivable that each CPU could grant one active reference against the sysfs_dirent before SYSFS_FLAG_REMOVED was NeilBrown --
s_count in sysfs behaves exactly like a kref, so change it to
be one.
This requires adding a KREF_INIT macro to kref.h
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/sysfs/dir.c | 12 +++++++++---
fs/sysfs/mount.c | 2 +-
fs/sysfs/sysfs.h | 15 +++++++--------
include/linux/kref.h | 1 +
4 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 76a2d10..63790ac 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -177,8 +177,14 @@ static void sysfs_free_ino(ino_t ino)
spin_unlock(&sysfs_ino_lock);
}
-void release_sysfs_dirent(struct sysfs_dirent * sd)
+static void no_recurse(struct kref *kref)
{
+}
+void release_sysfs_dirent(struct kref *count)
+{
+ struct sysfs_dirent *sd = container_of(count,
+ struct sysfs_dirent,
+ s_count);
struct sysfs_dirent *parent_sd;
repeat:
@@ -199,7 +205,7 @@ void release_sysfs_dirent(struct sysfs_dirent * sd)
kmem_cache_free(sysfs_dir_cachep, sd);
sd = parent_sd;
- if (sd && atomic_dec_and_test(&sd->s_count))
+ if (sd && kref_put(&sd->s_count, no_recurse))
goto repeat;
}
@@ -289,7 +295,7 @@ struct sysfs_dirent *sysfs_new_dirent(const char *name, umode_t mode, int type)
if (sysfs_alloc_ino(&sd->s_ino))
goto err_out2;
- atomic_set(&sd->s_count, 1);
+ kref_init(&sd->s_count);
atomic_set(&sd->s_active, 1);
sd->s_name = name;
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 0cb1088..07bff03 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -33,7 +33,7 @@ static const struct super_operations sysfs_ops = {
struct sysfs_dirent sysfs_root = {
.s_name = "",
- .s_count = ATOMIC_INIT(1),
+ .s_count = KREF_INIT,
.s_flags = SYSFS_DIR,
.s_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
.s_ino = 1,
diff --git a/fs/sysfs/sysfs.h b/fs/sysfs/sysfs.h
index 6a2a60e..f003a88 100644
--- a/fs/sysfs/sysfs.h
+++ b/fs/sysfs/sysfs.h
@@ -49,7 +49,7 @@ struct sysfs_inode_attrs {
* requires ...Except where it doesn't. The whole no_recurse thing is definitely not idiomatic kref usage. I could find only 3 instances of someone even looking at the return value. I would argue based on that the return value of kref_put should be removed. KREF_INIT if we want it should be added in a separate patch. kref should be kept for the stupid simple cases where so we don't have to think about refcounting, and just know it works. kref should not be where we are getting clever, and it this feels like getting clever to me. It isn't like the atomic primitives are any worse than kref. Eric --
->s_active is almost a kref, but needs atomic_inc_not_zero which
is not generally appropriate for a kref, as you can only access a
kreffed object if you hold a reference, and in that case the counter
cannot possibly be zero.
So introduce 'karef' - a counter of active references. An active
reference is separate from an existential reference and normally
implies one.
For a karef, get_not_zero have be appropriate providing a separate
existential reference is held.
Then change sysfs_dirent->s_active to be the first (and so-far only)
karef. super_block->s_active is another candidate to be a karef.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/sysfs/dir.c | 18 ++++++++++++------
fs/sysfs/mount.c | 1 +
fs/sysfs/sysfs.h | 2 +-
include/linux/kref.h | 37 +++++++++++++++++++++++++++++++++++++
lib/kref.c | 13 +++++++++++++
5 files changed, 64 insertions(+), 7 deletions(-)
diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index 63790ac..f0e2303 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -97,13 +97,22 @@ struct sysfs_dirent *sysfs_get_active(struct sysfs_dirent *sd)
{
if (likely(sd)
&& (sd->s_flags & SYSFS_FLAG_REMOVED) == 0
- && atomic_inc_not_zero(&sd->s_active)) {
+ && karef_get_not_zero(&sd->s_active)) {
rwsem_acquire_read(&sd->dep_map, 0, 1, _RET_IP_);
return sd;
} else
return NULL;
}
+
+static void sysfs_release(struct karef *k)
+{
+ struct sysfs_dirent *sd = container_of(k, struct sysfs_dirent, s_active);
+ struct completion *cmpl = (void*)sd->s_sibling;
+
+ complete(cmpl);
+}
+
/**
* sysfs_put_active - put an active reference to sysfs_dirent
* @sd: sysfs_dirent to put an active reference to
@@ -117,10 +126,7 @@ void sysfs_put_active(struct sysfs_dirent *sd)
return;
rwsem_release(&sd->dep_map, 1, _RET_IP_);
- if (atomic_dec_and_test(&sd->s_active)) {
- struct completion *cmpl = (void*)sd->s_sibling;
- complete(cmpl);
- }
+ karef_put(&sd->s_active, ...If this actually captured the semantics of active references I would find this interesting. But you currently miss the inconvenient but crucial part of preventing new references after deletion, and you miss the lockdep bits. Given that with the completion we act enough like a lock we can cause deadlocks I would expect a generic version for dummies (aka a kref flavor) to include lockdep annotations from the start. Lock ordering issues combined with ref counts are rare but they really suck, are hard to find (without lockdep) and hard to reason about. The logical semantics are: read_trylock() sysfs_get_active read_unlock() sysfs_put_active write_lock() sysfs_deactivate Maybe your karef stuff is good for something but it models the sysfs usage poorly. Eric --
Neil I would appreciate if you would cc' Tejun and myself on these kind of patches as we wrote the code and the best canidates to review it. Thanks, Eric --
On Thu, 25 Mar 2010 20:10:15 -0700 Yes, you are right of course. I just pulled names out of MAINTAINERS figuring the maintainer would know what best to do with the patch. NeilBrown --
Hello, Nice article. In general, yeap, I agree it would be nice to have a working reference count abstraction. However, kref along with kobject is a good example of obscurity by abstraction anti pattern. :-) kobject API incorrectly suggests that it deals with the last put problem. There still are large number of code paths which do the following, if (!(kob = kobject_get(kobj))) return; I believe (or at least hope) the actual problem cases are mostly fixed now but there still are a lot of misconceptions around how stuff built on kref/kobject is synchronized and they sometimes lead to race conditions buried deep under several layers of abstractions and it becomes very hard to see those race conditions when they are buried deep. If you want to kill refcounts w/ bias based off switch, please put it inside an abstraction which at least synchronizes itself properly. Open coding w/ bias at least warns you that there is some complex stuff going on and you need to trade carefully. Putting the switch on a separate flag - people often forget how bits in a flag field are synchronized - and the rest of refcount in a nice looking kref bundle is very likely to lead to subtle race conditions which are *very* difficult to notice. Thanks. -- tejun --
One thing to add. I think one of the reasons why k* hasn't really worked as well as it was orginally imagined to do is the way we - the kernel programmers - think and work. We add abstractions when something is functionally necessary, so in a lot of cases the functional requirements become the implementation and the communication among us (ie. serves as implied documentation). The k* stuff detracts from this principle. Those abstractions are there for the purpose of abstracting and our usual mindset becomes very susceptible to misinterpretations as no easily identifiable functional requirements are there - we either end up imaginging something up or waste time frustrated trying to figure out why the hell that abstraction is there. This actualy is a very generic problem. When a LOT of people are trying to work together sharing a lot of infrastructures, it is very deterimental to impose certain paradigm upon them. People can easily agree upon functional necessities but one guy's wildest, most ambitious paradigmatic vision looks like a complete bull to another gal. So, let's keep the abstractions to the just necessary level and communicate at the functional layer. In this case, mount and sysfs shares the requirement for a refcount w/ a kill switch. I'm not sure it warrants common abstraction at this stage but if the *function* can be wrapped nicely along with lockdep annotations and all, why not? Thanks. -- tejun --
On Fri, 26 Mar 2010 13:49:36 +0900 I'm not at all sure that opinion would be universal.... refcounting is something that it is quite easy to get wrong. There are several slightly different models for refcounting and if you don't have a clear understanding of the different use cases it is easy to get confused about exactly what model is being used and so use a refcount wrongly. kref certainly doesn't cover all models for refcounting but it does cover one fairly common one very well and I think that it's use bring clarity rather than obscurity. Of course if it is used for a refcount which should really follow a different kobject_get *always* returns exactly the argument that was passed to it. (kref_get doesn't have a return value.) I don't see how the code above has any bearing on the last-put problem, which I agree that there probably misconceptions about how kref works and they are probably based on a lack of appreciation of the subtle differences in flavours of refcounts. Hence my desire to create and document different The only other use of a BIAS that I am aware of is in struct super_block, and Al Viro recently removed that in his bleeding edge tree (two days before I sent him a patch to do the same thing:-) It is dangerous to build too much into an abstraction else you will find that no-one uses it as it is too specific. The s_count and s_active in struct super_block are very similar to s_count and s_active in struct sysfs_dirent, however they are also quite different. super_block uses a non-atomic s_count (because a spinlock is always held anyway) and has a separate way of preventing new s_active references (s_root becomes NULL). The only real similarity is that they both have an 'active' refcount that *can* become zero and still be visible, which is different to a kref but still a model worth encapsulating (I think) in karef. BTW I'd be perfectly happy if the first patch was taken and subsequent ones not. I think they are a good idea, but I'm happy to ...
Hello, Neil.
I don't know. After spending some time with k* and device model, I
grew a pretty strong dislike for abstractions without clear functional
necessities. kref being much simpler than kobject, the abuse is much
less severe but there have been kref misuses (in kobject and SCSI
midlayer) which could have been avoided or at least easily located if
they simply had used atomic_t instead of dreaming up some mystical
Oh, I should have been more explicit. It's not directly related to
kref but just something that always comes to my mind when thinking
about k* abstractions. The above bogus condition checks used to be
used quite widely. The programmer for some reason believed the last
kobject_put() somehow will magically make future kobject_get()s return
NULL, which of course doesn't make any sense but hey the
implementation is buried kilometers deep, the API and other usages
seem to suggest that and it's easy to imagine something up when you're
tired.
As an another example, please take a look at the kref API.
int kref_put(struct kref *kref, void (*release) (struct kref *kref));
The function takes @release callback but under which context is it
called? If you look at the source code, it's called in-line which
isn't clear from the API at all (why the hell take a callback if
you're gonna call it in-line?) and there have been *several* subtle
bugs which could have been avoided or at least would have been caught
much easier if it were not for that meaningless separate release
callback. It's just too easy to forget about the executing context
when people write and review stuff over function boundaries.
void put_something(something)
{
if (kref_put(&something->kref))
do something which might dead lock;
}
is way easier to avoid bugs and review than
void really_kill_something(struct kref *kref)
{
struct my_something *something = container_of(...);
do something which might dead lock;
}
void put_something(something)
{
...On Fri, 26 Mar 2010 15:32:51 +0900
Hi Tejun,
this strikes me as really valuable experience that it would be great to
share. While I generally like kref and see value in at least some of
kobject I don't for a moment suppose they are perfect. Fixing them requires
a good understanding of the problems they cause. If you have that knowledge,
it would be a great resource for anyone wanting to 'fix' kobject.
This would argue that having a return value from kobject_get violates Rusty's
law that interfaces should be hard to misuse.
We could probably change that - it is only used 19 times in the current
kernel.
It probably doesn't help that Documentation/kobject.txt includes the text:
A successful call to kobject_get() will increment the kobject's reference
counter and return the pointer to the kobject.
I'm not immediately convinced by this, though maybe I haven't seen enough
examples yet.
The deadlocks that I have come across would not have been any more obvious in
either of the above - they were caused because sysfs_remove deadlocks when
called from inside a sysfs attribute action...
Also, while this re-write is possible for kref uses it isn't really possible
in kobject as the 'final_put' function must be included in the ktype (though
maybe you don't like that either).
What would be really helpful here is a survey of what sorts of things are
actually done in final_put functions so would could create some guidelines
about how to write the release functions.
Thanks,
--
Hello, Neil. I'm not sure I have enough to say to write something up about. :-) One thing I've been hoping to do is to hide the k* abstraction behind device model abstraction so that device model API users only have to deal with device model abstractions - devices and drivers. They are what the driver writers need and have pretty good tangible concrete concept about. I think my opinions can be compressed into * Add abstractions which serve concrete functional necessities as doing so makes communication among peer programmers much easier by reducing confusion. * Don't get too attached to software engineering theories. They're not bad in themselves but often end up labeling only certain small subset of all the things which need to be considered, with the adverse side effect of emphasizing those named ones out of Yeap, the return value is only for convenience and I think it's more harmful overall, but then again I suppose most people got used to it now, which is another important factor to consider. Given that there are not many users of the return value, I agree that removing it would Ah, right, I probably was mixing kobject and kref release functions when talking about the problems I've seen. I still don't like the kref API and have chosen atomic_t over it in several cases. To me, the separate ->release seems way too heavy compared to the functionality the abstraction actually provides and hinders writing For kobject, I'm not sure what should be done other than hiding it as much as possible behind device model abstractions. For kref, I think it would be great if it can be made simpler and dumber so that it only clarifies the intention of the usage instead of forcing separate ->release. If fancier reference helper is needed, capability to defer release to a separate context and handle module symbol dereference problem automatically would be nice provided they can be done in clean manner, but these are just something I've been vaguely hoping for ...
