Re: 2.6.21-rc suspend regression: sysfs deadlock

Previous thread: [PATCH] make elv_register() output atomic by Thibaut VARENE on Thursday, March 8, 2007 - 6:06 am. (2 messages)

Next thread: Question about memory mapping mechanism by Martin Drab on Thursday, March 8, 2007 - 6:16 am. (5 messages)
From: Oliver Neukum
Date: Thursday, March 8, 2007 - 6:05 am

Hi,

after a lightning bolt from high above I've been looking into refcounting
the data structures drivers use to provide the data used to refill sysfs
buffers. I've come to the following conclusion.

1. struct sysfs_buffer must have a struct kref * and probably a destructor
pointer
2. drivers must be able to pass these pointers through an extended
device_create_file()
3. Drivers must use refcounting if they want to use attributes
4. read/write/poll must do refcounting

I am not sure where to store the pointers. struct sysfs_dirent looks
like the obvious choice. Comments?

	Regards
		Oliver
-
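
The four points above can be illustrated with a minimal userspace sketch. It is not kernel code and not the patch under discussion: `struct ref`, `ref_get()`, `ref_put()` and `struct drv_private` are hypothetical stand-ins for `struct kref`, a destructor pointer, and a driver's private data, and the counter is a plain int because the sketch is single-threaded.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct kref plus a destructor. */
struct ref {
	int count;                      /* plain int: single-threaded sketch only */
	void (*release)(struct ref *);  /* the "destructor pointer" from point 1 */
};

static void ref_init(struct ref *r, void (*release)(struct ref *))
{
	r->count = 1;                   /* the creator holds the first reference */
	r->release = release;
}

static void ref_get(struct ref *r)
{
	r->count++;
}

/* Returns 1 if this put dropped the last reference and ran release(). */
static int ref_put(struct ref *r)
{
	if (--r->count == 0) {
		r->release(r);
		return 1;
	}
	return 0;
}

/* Hypothetical driver-private data that attribute methods would read. */
struct drv_private {
	struct ref ref;                 /* must stay first for the cast below */
	int value;
};

static int released_count;          /* lets a caller observe the free */

static void drv_private_release(struct ref *r)
{
	struct drv_private *p = (struct drv_private *)r;

	free(p);
	released_count++;
}

static struct drv_private *drv_private_alloc(int value)
{
	struct drv_private *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	ref_init(&p->ref, drv_private_release);
	p->value = value;
	return p;
}
```

In this model a sysfs access would call `ref_get()` before touching the data and `ref_put()` afterwards, while disconnect drops the driver's own reference; the data is freed only by whichever put comes last.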

From: Alan Stern
Date: Thursday, March 8, 2007 - 9:02 am

Can you explain the reasoning that led to these conclusions?  And what 
exactly was your lightning bolt?

Alan Stern

-

From: Oliver Neukum
Date: Thursday, March 8, 2007 - 5:45 pm

The old race between disconnect and I/O to an attribute via sysfs again.
If I cannot dissociate the drivers from the buffers, drivers
must not deallocate the data necessary to answer sysfs callbacks while
a buffer exists. Reading from a buffer must take a reference in the driver's
data structures. The question becomes how to get a pointer to the buffer.
And it cannot live in the dentry as the dentry can go away while files
are still open. This leaves the inode or the buffer.

	Regards
		Oliver
-

From: Alan Stern
Date: Friday, March 9, 2007 - 9:32 am

Why wouldn't you be able to dissociate a driver from a buffer?  That was 
the whole point of adding .orphan to sysfs_buffer and creating 
sysfs_buffer_collection -- it was supposed to solve exactly this race.

Alan Stern

-

From: Oliver Neukum
Date: Friday, March 9, 2007 - 9:44 am

It did solve the race but deadlocked when unbinding devices through sysfs.
Linus therefore asked for the patch to be reverted and wants the issue solved
with refcounting.

	Regards
		Oliver
-

From: Dmitry Torokhov
Date: Friday, March 9, 2007 - 10:02 am

I think we already have all refcounting that is needed. What is
missing is subsystem-provided ->release() hooks for drivers to release
driver-specific resources when a device finally goes away.

-- 
Dmitry
-

From: Oliver Neukum
Date: Friday, March 9, 2007 - 10:18 am

This is an interesting idea. Is it nice to pass through release()
but not open() ?

	Regards
		Oliver
-

From: Dmitry Torokhov
Date: Friday, March 9, 2007 - 10:34 am

Not sure if I follow... Generally speaking, open is not a mandatory
operation; however, every object in the driver model has a release method.
What I am saying is that certain drivers need to have their disconnect
method split into two parts: one that shuts down the device and a second that
releases resources that might be accessed through sysfs (and other
kernel parts). That second part will have to be called from the
subsystem's core ->release() method, so we need a release() hook.

-- 
Dmitry
-

From: Alan Stern
Date: Friday, March 9, 2007 - 12:32 pm

Dmitry, you're not viewing this correctly.

Adding a new release() callback would solve the problem by creating 
another.  Drivers need to release their data as soon as possible after
they unbind from a device, not when the device itself goes away.  Think
about what would happen if you tried to rmmod a driver.  The rmmod process 
would block until the device was unregistered.

Oliver, your idea won't work either.  Think about what would happen if 
someone did

	rmmod driver_module </sys/devices/.../attribute_file

The rmmod process would never actually read the attribute, so until it 
exited the private data structure would have a positive refcount.  But 
rmmod can't exit until the driver has been unloaded from memory, and it 
can't be unloaded while its data structure is still allocated.  Thus we 
would end up with deadlock; rmmod would hang forever.

It might be better to keep your earlier patch and fix the deadlock you
mentioned earlier, the one that occurs when unbinding a driver through
sysfs.  How exactly does that deadlock work?

Alan Stern

-

From: Oliver Neukum
Date: Friday, March 9, 2007 - 1:05 pm

Wait, the callback from closing the file in sysfs is the earliest we can safely


http://lkml.org/lkml/2007/3/6/364
http://lkml.org/lkml/2007/3/6/528

	Regards
		Oliver
-

From: Alan Stern
Date: Friday, March 9, 2007 - 1:27 pm

It is _not_ the earliest we can safely free the data structure.

Dmitry's callback occurs when _all_ the sysfs attributes have been
released -- including ones that don't have anything to do with the
driver's private data structure.  Think of the bInterfaceClass attribute,
for example.

But even aside from that, Dmitry's suggestion is wrong.  He wants to add a
second release() method to the driver, which can be called from the
subsystem's release() method -- which doesn't run until the device is
unregistered, possibly long after the driver has been unbound from it.  
Then how could the subsystem even know which drivers need their second


I get the picture, thanks.

Alan Stern


-

From: Oliver Neukum
Date: Friday, March 9, 2007 - 1:39 pm

Ok, yes I see. It is by far too late.

	Regards
		Oliver
-

From: Alan Stern
Date: Friday, March 9, 2007 - 1:08 pm

I take this back.  Redirecting stdin to the attribute file would increase 
the module's refcount and cause rmmod to exit immediately with an error.

After some more thought, I basically agree with what Oliver wrote
originally.  sysfs_dirent is indeed the logical place to store the kref
pointer.  However it needs to be used during open and release, not during
read, write, and poll.  Another point, which Oliver didn't think of, is
that the kref pointer needs to be passed to the driver as an argument in
the show() and store() method calls.

Implementing this will be difficult.  One possibility is to change the 
definition of sysfs_ops, adding the new struct kref * argument to the 
prototypes.  This will involve changing _lots_ of source files, adding an 
unused argument to many functions, which isn't attractive.

The other possibility is to test at runtime whether the kref pointer is 
NULL, and if it is, don't pass it.  This would work, but it isn't 
type-safe.

Finally, there's added complexity in each driver which wants to use the 
new facility.  The module_exit routine will need to be smart enough to 
block until all the private data structures have been released.  
usb-storage does something like that now; it's kind of ugly (although it 
could be improved if appropriate support were added to the core kernel).

Alan Stern

-

From: Oliver Neukum
Date: Friday, March 9, 2007 - 1:48 pm

If we up the module count for every bound device, all device attributes
should be gone before we ever get that far.

	Regards
		Oliver
-

From: Alan Stern
Date: Saturday, March 10, 2007 - 12:19 pm

It's the same old problem: the race between unbind and sysfs I/O.  What
good does holding a reference to the private data structure do if the
show/store method gets called after the driver has been unbound from the
device?  dev_get_drvdata() will no longer provide a valid pointer to the
private data, so the method will have no way to access it.  Hence the
method needs another argument.

(BTW, the sysfs core would actually need more than a kref.  It would also 
need a pointer to a release routine -- the kref contains only the atomic 
counter.  The more you think about it, the more complicated this approach 

Not quite right.  However, since every open sysfs file holds a module
reference, if the driver's module_exit has been called then there can be
no open sysfs files, hence no private data still pinned.  Thus this isn't
a problem at all.


But never mind all the above.  I'm going to post another message on this
thread in which I argue that Oliver's original approach was a good one and
should not have been reverted.  The specific problem identified by Hugh
Dickins can be fixed in the way Dmitry first suggested, by doing the real
operation from a workqueue routine.

Alan Stern

-

From: Oliver Neukum
Date: Monday, March 12, 2007 - 1:54 am

It does half the job. You can make sure the driver is not asked to access
freed memory.
It is true that a driver will have to mark that device "disconnected"
and return errors if that device's attributes are referenced, but this can
be done internally.

Yes, this is a bit more complicated.
{rant mode}
Who came up with the idea of making life simpler by adding a code path?
All these problems were already solved for device nodes. Ioctl is ugly, but
at least a known code path.

Yes, this is implied.

	Regards
		Oliver
-

From: Alan Stern
Date: Monday, March 12, 2007 - 7:57 am

No, you're missing the point.  Let's say driver A's disconnect() is
called, so the driver marks its private data structure as "disconnected"
and does dev_set_drvdata(NULL).  Then driver B is probed and bound to the
device, and it does its own dev_set_drvdata().  Then a user still holding
an open sysfs file reference for driver A calls a show() or store()  
method.  The method will do dev_get_drvdata(), receiving the pointer to
driver B's private data.  Now you're in trouble, because A's method will

I'll let Greg give the complete answer.  :-)  Bear in mind, however, that
the aim was probably to make life simpler for userspace -- which does not
mean making life simpler for the kernel.

(Incidentally, I'm not so sure that all these problems really were solved 
by ioctl on device nodes.  I bet you could find plenty of cases where 
ioctl races with disconnect if you looked.)

Alan Stern

-

From: Oliver Neukum
Date: Monday, March 12, 2007 - 8:23 am

Yes, I was missing the point. In consequence, drivers must not use
dev_get_drvdata() to get their references to their private data. It's
probably necessary to store it in struct sysfs_buffer and include that
in the store/show callbacks.

That doesn't mean that the method needed to be thrown out.
Sysfs could simply pass through the syscalls for a device, like
it is done in character devices. I am tempted to recommend

I will look. Death to all race conditions.

	Regards
		Oliver
-

From: Dmitry Torokhov
Date: Monday, March 12, 2007 - 8:42 am

Or drivers could verify that they are still bound to the device they are
about to operate on (psmouse does this by taking a lock on the device and
then checking whether the bound driver is the same address as psmouse). But I'd
rather get rid of all this clutter if we could sever sysfs access
after removing the corresponding attributes.

-- 
Dmitry
-
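
The rebinding hazard and the "check that we are still bound" workaround can be sketched in userspace as follows. All names here (`struct device`, `dev_bind()`, `a_show()` and so on) are hypothetical stand-ins rather than the kernel's definitions, and the lock that psmouse is said to take around the check is omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver-model structures. */
struct driver {
	const char *name;
};

struct device {
	const struct driver *driver;    /* currently bound driver, NULL if none */
	void *drvdata;                  /* private data of the bound driver */
};

struct a_private {
	int reading;
};

static const struct driver driver_a = { "a" };
static const struct driver driver_b = { "b" };

static void dev_bind(struct device *dev, const struct driver *drv, void *data)
{
	dev->driver = drv;
	dev->drvdata = data;
}

static void dev_unbind(struct device *dev)
{
	dev->driver = NULL;
	dev->drvdata = NULL;
}

/*
 * Driver A's show() method.  Without the check, a call arriving after
 * driver B has bound would interpret B's private data as struct a_private.
 */
static int a_show(struct device *dev, int *out)
{
	const struct a_private *p;

	if (dev->driver != &driver_a)
		return -ENODEV;         /* A is no longer the bound driver */
	p = dev->drvdata;
	*out = p->reading;
	return 0;
}
```

The check catches the case where a different driver has taken over the device. It cannot tell an old binding of driver A from a new one, so a file opened before an unbind/rebind cycle of the same driver would still pass it.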

From: Oliver Neukum
Date: Monday, March 12, 2007 - 8:59 am

No, the call has to fail if the driver is rebound to the device.

	Regards
		Oliver
-

From: Alan Stern
Date: Monday, March 12, 2007 - 9:21 am

You do realize how foolish that sounds?  Why do you think 

I'm with Dmitry; the whole thing becomes much, much simpler if we put back
your patch and prevent sysfs access after unregistering an attribute 
file.  No API changes are needed, no driver changes are needed, no radical 
core changes are needed,...  All we would have to do is fix the one SCSI 
method to make it use a workqueue.

Alan Stern

-

From: Oliver Neukum
Date: Monday, March 12, 2007 - 11:25 am

It's still useful in disconnect/suspend/resume/etc...
If everything were alright with the design, we wouldn't be discussing

Try. I don't like reverting my own code. But I predict he'll tell you that a
driver's bond with a device should be represented in a data structure
that is to be refcounted.

	Regards
		Oliver
-

From: Alan Stern
Date: Monday, March 12, 2007 - 12:31 pm

I did.  Didn't you see this message from Saturday:

http://marc.theaimsgroup.com/?l=linux-kernel&m=117355959020831&w=2

I sent it to Linus as well as to all of you.  No replies received so far 

Alan Stern

-

From: Oliver Neukum
Date: Monday, March 12, 2007 - 12:49 pm

Yes. In this case, silence is partial agreement. However, convincing me
is futile if Linus rejects the approach.

I wrote the original patch. But this problem must be solved. If the

Coming to think about it, he might be right there.

	Regards
		Oliver
-

From: Alan Stern
Date: Monday, March 12, 2007 - 1:03 pm

There still would be a synchronization problem.  Refcounts don't solve
races; they only solve lifetime problems.  And you would still have to
change the sysfs API, plus all the other stuff...

Do you think Linus would listen if all three of us (plus maybe Greg) tried 
to convince him?

Alan Stern

-

From: Oliver Neukum
Date: Monday, March 12, 2007 - 1:15 pm

No. He'd tell you that a crap API should be changed.

	Regards
		Oliver
-

From: Dmitry Torokhov
Date: Monday, March 12, 2007 - 1:31 pm

If we'd accompany the argument with a patch that changes scsi to use a
wq to perform the deletion, so we don't have a deadlock regression in the
kernel, he might be more receptive... He is right about lifetime
issues, but this is not strictly a lifetime issue, as you correctly point
out. Plus, refcounting also bloats the kernel, so I don't really want to
use a refcount for every integer I happen to export through sysfs if I
can simply "revoke" access.

-- 
Dmitry
-

From: Alan Stern
Date: Monday, March 12, 2007 - 1:45 pm

I wrote that patch over the weekend but forgot to bring it in to work.  

Agreed.

Alan Stern

-

From: Richard Purdie
Date: Monday, March 12, 2007 - 2:31 pm

For what it's worth, I think it makes sense if the driver no longer has
to worry about sysfs attributes after they've been removed. This is
something the core should look after, not each and every driver.

http://marc.theaimsgroup.com/?l=linux-kernel&m=117355959020831&w=2

makes a lot of sense, particularly that "No driver callbacks occur after
unregistration". When writing the backlight class code, I remember
checking into this, concluding that seemed to be the design of sysfs and
thinking it a sane design.

The alternative is to force each and every driver to do its own
refcounting. My experience with locking in the extremely simple
backlight class shows nobody reads the documentation or writes the code
correctly. With that, I've given up and added suitable locking to the
core even if not every driver needs it. In doing so, I made a net
removal of a few hundred lines of broken "ticking timebomb" style code.
I dread to think what would happen if every driver had to deal with
sysfs refcounting.

So count me as a vote for handling this in the sysfs core, not the
drivers.

Richard


-

From: Alan Stern
Date: Tuesday, March 13, 2007 - 8:00 am

Hugh, there has been a long discussion among several people concerning 
this issue.  See for example this thread:

http://marc.info/?t=117335935200001&r=1&w=2

and also:

http://marc.info/?l=linux-kernel&m=117355959020831&w=2

The consensus is that we would be better off keeping Oliver's original 
patch without your silly change, and instead fixing the particular method 
call that deadlocked.  Can you please try out the patch below with 
everything else as it was before?  It should solve your problem.

Alan Stern


Index: usb-2.6/drivers/scsi/scsi_sysfs.c
===================================================================
--- usb-2.6.orig/drivers/scsi/scsi_sysfs.c
+++ usb-2.6/drivers/scsi/scsi_sysfs.c
@@ -452,10 +452,39 @@ store_rescan_field (struct device *dev, 
 }
 static DEVICE_ATTR(rescan, S_IWUSR, NULL, store_rescan_field);
 
+/* An attribute method cannot unregister itself, so this workaround for
+ * sdev_store_delete() is necessary.
+ */
+struct sdev_work_struct {
+	struct scsi_device *sdev;
+	struct work_struct work;
+};
+
+static void sdev_store_delete_work(struct work_struct *work)
+{
+	struct sdev_work_struct *sdw = container_of(work,
+			struct sdev_work_struct, work);
+
+	scsi_remove_device(sdw->sdev);
+	scsi_device_put(sdw->sdev);
+	kfree(sdw);
+}
+
 static ssize_t sdev_store_delete(struct device *dev, struct device_attribute *attr, const char *buf,
 				 size_t count)
 {
-	scsi_remove_device(to_scsi_device(dev));
+	struct scsi_device *sdev = to_scsi_device(dev);
+	struct sdev_work_struct *sdw;
+
+	sdw = kmalloc(sizeof(*sdw), GFP_KERNEL);
+	if (!sdw)
+		return -ENOMEM;
+	sdw->sdev = sdev;
+	INIT_WORK(&sdw->work, sdev_store_delete_work);
+	if (scsi_device_get(sdev) != 0)
+		kfree(sdw);
+	else
+		schedule_work(&sdw->work);
 	return count;
 };
 static DEVICE_ATTR(delete, S_IWUSR, NULL, sdev_store_delete);

-
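
Why deferring the removal avoids the deadlock can be shown with a small single-threaded model. This is only a sketch under stated assumptions: `schedule_callback()` and `run_pending()` are hypothetical stand-ins for a workqueue and its worker thread, the `in_method` flag models the mutual exclusion between attribute methods and unregistration, and a real unregistration would block forever where the model returns -EDEADLK. The reference the patch takes with scsi_device_get() to keep the device alive until the work runs is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_WORK 8

/* Hypothetical stand-in for a workqueue: a fixed array of callbacks. */
struct work {
	void (*func)(void *);
	void *data;
};

static struct work queue[MAX_WORK];
static size_t queued;

static int schedule_callback(void (*func)(void *), void *data)
{
	if (queued == MAX_WORK)
		return -ENOMEM;
	queue[queued].func = func;
	queue[queued].data = data;
	queued++;
	return 0;
}

/* Stand-in for the worker thread: runs everything queued so far. */
static void run_pending(void)
{
	size_t i;

	for (i = 0; i < queued; i++)
		queue[i].func(queue[i].data);
	queued = 0;
}

struct toy_device {
	int in_method;   /* set while an attribute method is running */
	int registered;
};

/* Unregistration must wait for running methods; the model reports
 * the would-be deadlock instead of hanging. */
static int toy_unregister(struct toy_device *dev)
{
	if (dev->in_method)
		return -EDEADLK;
	dev->registered = 0;
	return 0;
}

static void toy_unregister_cb(void *data)
{
	(void)toy_unregister(data);
}

/* A "suicidal" store method: unregisters its own device directly. */
static int store_delete_direct(struct toy_device *dev)
{
	int ret;

	dev->in_method = 1;
	ret = toy_unregister(dev);
	dev->in_method = 0;
	return ret;
}

/* The workaround: hand the unregistration to another context. */
static int store_delete_deferred(struct toy_device *dev)
{
	int ret;

	dev->in_method = 1;
	ret = schedule_callback(toy_unregister_cb, dev);
	dev->in_method = 0;
	return ret;
}
```

In the model the direct variant fails while the method is still running, whereas the deferred variant returns first and the device disappears only once `run_pending()` executes the queued callback.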

From: Cornelia Huck
Date: Tuesday, March 13, 2007 - 11:42 am

On Tue, 13 Mar 2007 11:00:21 -0400 (EDT),

Another call that deadlocked with Oliver's patch is ungroup for s390
ccwgroup devices. It can be made to work again with a similar patch.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>

---
 drivers/s390/cio/ccwgroup.c |   35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

--- linux-2.6.orig/drivers/s390/cio/ccwgroup.c
+++ linux-2.6/drivers/s390/cio/ccwgroup.c
@@ -67,22 +67,49 @@ __ccwgroup_remove_symlinks(struct ccwgro
 	
 }
 
+struct ccwgroup_work_struct {
+	struct ccwgroup_device *gdev;
+	struct work_struct work;
+};
+
+static void ccwgroup_ungroup_work(struct work_struct *work)
+{
+	struct ccwgroup_work_struct *ungroup_work
+		= container_of(work, struct ccwgroup_work_struct, work);
+
+	__ccwgroup_remove_symlinks(ungroup_work->gdev);
+	device_unregister(&ungroup_work->gdev->dev);
+	put_device(&ungroup_work->gdev->dev);
+	kfree(ungroup_work);
+}
+
 /*
  * Provide an 'ungroup' attribute so the user can remove group devices no
  * longer needed or accidentially created. Saves memory :)
+ * Note that we cannot unregister the device from one of its attribute
+ * methods, so we have to delay it.
  */
-static ssize_t
-ccwgroup_ungroup_store(struct device *dev, struct device_attribute *attr, const char *buf, size_t count)
+static ssize_t ccwgroup_ungroup_store(struct device *dev,
+				      struct device_attribute *attr,
+				      const char *buf, size_t count)
 {
 	struct ccwgroup_device *gdev;
+	struct ccwgroup_work_struct *ungroup_work;
 
 	gdev = to_ccwgroupdev(dev);
 
 	if (gdev->state != CCWGROUP_OFFLINE)
 		return -EINVAL;
 
-	__ccwgroup_remove_symlinks(gdev);
-	device_unregister(dev);
+	ungroup_work = kmalloc(sizeof(*ungroup_work), GFP_KERNEL);
+	if (!ungroup_work)
+		return -ENOMEM;
+	ungroup_work->gdev = gdev;
+	INIT_WORK(&ungroup_work->work, ccwgroup_ungroup_work);
+	if ...
From: Linus Torvalds
Date: Tuesday, March 13, 2007 - 2:20 pm

Could we please make this easier to use by having some common sysfs helper 
routine for this kind of "delayed_store()" functionality.

I'm not a huge fan of delayed work at all, but if we have to have it, at 
least make it one generic function rather than having multiple functions 
all doing their own workqueue logic for it.

		Linus
-

From: Alan Stern
Date: Wednesday, March 14, 2007 - 9:12 am

This seems more elegant (not yet tested).  Cornelia, does it look okay to 
you?

Alan Stern


Index: usb-2.6/include/linux/sysfs.h
===================================================================
--- usb-2.6.orig/include/linux/sysfs.h
+++ usb-2.6/include/linux/sysfs.h
@@ -78,6 +78,9 @@ struct sysfs_ops {
 
 #ifdef CONFIG_SYSFS
 
+extern int sysfs_access_in_other_task(struct kobject *kobj,
+		void (*func)(void *), void *data);
+
 extern int __must_check
 sysfs_create_dir(struct kobject *, struct dentry *);
 
@@ -133,6 +136,12 @@ extern int __must_check sysfs_init(void)
 
 #else /* CONFIG_SYSFS */
 
+static inline int sysfs_access_in_other_task(struct kobject *kobj,
+		void (*func)(void *), void *data)
+{
+	return -ENOSYS;
+}
+
 static inline int sysfs_create_dir(struct kobject * k, struct dentry *shadow)
 {
 	return 0;
Index: usb-2.6/fs/sysfs/file.c
===================================================================
--- usb-2.6.orig/fs/sysfs/file.c
+++ usb-2.6/fs/sysfs/file.c
@@ -643,6 +643,59 @@ void sysfs_remove_file_from_group(struct
 }
 EXPORT_SYMBOL_GPL(sysfs_remove_file_from_group);
 
+struct other_task_struct {
+	struct kobject 		*kobj;
+	void			(*func)(void *);
+	void			*data;
+	struct work_struct	work;
+};
+
+static void other_task_work(struct work_struct *work)
+{
+	struct other_task_struct *ots = container_of(work,
+			struct other_task_struct, work);
+
+	(ots->func)(ots->data);
+	kobject_put(ots->kobj);
+	kfree(ots);
+}
+
+/**
+ * sysfs_access_in_other_task - delay access from an attribute method.
+ * @kobj: object we're acting for.
+ * @func: callback function to invoke later.
+ * @data: argument to pass to @func.
+ *
+ * sysfs attribute methods must not unregister themselves or their parent
+ * kobject (which would amount to the same thing).  Attempts to do so will
+ * deadlock, since unregistration is mutually exclusive with driver
+ * callbacks.
+ *
+ * Instead methods can call this routine, which will attempt to ...
From: Cornelia Huck
Date: Wednesday, March 14, 2007 - 11:43 am

On Wed, 14 Mar 2007 12:12:37 -0400 (EDT),

Works for me (grouping & ungrouping ctc) and looks sane. Some more

The naming seems a bit unintuitive, but I don't have a good


device_delay_access()?

-

From: Alan Stern
Date: Wednesday, March 14, 2007 - 12:23 pm

sysfs_work_struct is too generic; other parts of sysfs might also want to
use workqueues for different purposes.

I don't like calling it "delayed"-anything, because the operations aren't
necessarily delayed!  On an SMP system they might even execute before the
sysfs_access_in_other_task() call returns.  (Although the two examples we
have so far can't do that because of lock contention.)

The major feature added here is that the work takes place in a different 
task's context, not that it is delayed.  Hence the choice of names.

Alan Stern

-

From: Cornelia Huck
Date: Thursday, March 15, 2007 - 3:27 am

On Wed, 14 Mar 2007 15:23:10 -0400 (EDT),

Sure. But then you shouldn't refer to "delay" in the comments for the

Hm. Perhaps device_schedule_access()?
-

From: Hugh Dickins
Date: Thursday, March 15, 2007 - 5:31 am

It's really none of my business, I'm merely the reporter of the
deadlock being fixed, and I don't know my way around sysfs at all ...

... but I have to say I share your discomfort with Alan's
"sysfs_access_in_other_task" naming, it sounded very weird to me.

Quite apart from this mysterious "other task", I don't understand
"access" either.

Perhaps "defer" would best capture the idea of another-task and
maybe-delay?  sysfs_defer_work(), struct sysfs_deferred_work?

Hugh
-

From: Oliver Neukum
Date: Thursday, March 15, 2007 - 6:02 am

But we do not wish to defer or delay anything.
How about: sysfs_action_from_neutral_context

	Regards
		Oliver
-

From: Dmitry Torokhov
Date: Thursday, March 15, 2007 - 6:22 am

How about sysfs_schedule_work? That is what it does - schedules a work
on a sysfs object and everyone here knows what schedule_work() does.

-- 
Dmitry
-

From: Hugh Dickins
Date: Thursday, March 15, 2007 - 6:59 am

I'm ashamed to have suggested anything else: certainly gets my vote.

Hugh
-

From: Alan Stern
Date: Thursday, March 15, 2007 - 7:27 am

Fair enough.  One use of "delay" is in a comment you wrote; I'll change it 





Personally I don't understand what was wrong with my name.  What's weird 
or unintuitive about doing something in a different task's context?

Dmitry's suggestion is slightly inappropriate because the function doesn't
take a workstruct as an argument and it isn't itself a workqueue callback.  

Would people be happier with sysfs_schedule_callback() and
device_schedule_callback()?  At least the functions do take a callback 
pointer as an argument, even though they aren't callbacks themselves.

Alan Stern

-

From: Cornelia Huck
Date: Thursday, March 15, 2007 - 8:32 am

On Thu, 15 Mar 2007 10:27:19 -0400 (EDT),


Count one happy person here.
-

From: Hugh Dickins
Date: Thursday, March 15, 2007 - 9:29 am

The only thing wrong with sysfs_do_something_in_a_different_task_context()
is the length of the name.  "do", that's good, much better than "access".

sysfs_access_in_other_task() left me wondering what this "other" task
was, and what kind of "access" it's trying to get - or is the calling
task the other, and it's trying to access something it wouldn't

True, though since he's saying "work" rather than "workstruct",

A lot happier than with sysfs_access_in_other_task() -
if you prefer this to Dmitry's, it's okay by me.

Hugh
-

From: Linus Torvalds
Date: Thursday, March 15, 2007 - 9:51 am

For naming clashes, I'd suggest:

 - try to name according to *why* something is done, not necessarily what 
   it does.

   For example, is it really in "another task"? Maybe it's just an 
   on-demand thread of the same task?  Do you actually care how the 
   deferred work is done?

 - avoid being vague. I agree with not liking the name much, and the 
   "other" thing bothers me. Like Hugh, it makes me ask "_What_ other 
   task?"

So I would suggest not concentrating on some implementation issue, but on 
the reason why you need it in the first place. Namely that you want to 
defer the actual action to avoid deadlock due to recursive locking. So 
that "why do I actually do this" thing implies something like 
"sysfs_store_async()" or "sysfs_store_deferred()" or maybe actually 
concentrate on the locking angle and say something like 
"sysfs_store_needs_to_reacquire_lock()".

(That last one wasn't really serious - it's too long and cumbersome, but 
it's an example of not caring _how_ you do it, just about what you want 
done).

		Linus
-

From: Alan Stern
Date: Thursday, March 15, 2007 - 12:51 pm

This patch (as869) reinstates the mutual exclusion between sysfs
attribute method calls and attribute unregistration.  The
previously-reported deadlocks have been fixed, and this exclusion is
by far the simplest way to avoid races during driver unbinding.

The check for orphaned read-buffers has been moved down slightly, so
that the remainder of a partially-read buffer will still be available
to userspace even after the attribute has been unregistered.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>

---

Index: 2.6.21-rc3-git9/fs/sysfs/inode.c
===================================================================
--- 2.6.21-rc3-git9.orig/fs/sysfs/inode.c
+++ 2.6.21-rc3-git9/fs/sysfs/inode.c
@@ -222,13 +222,17 @@ const unsigned char * sysfs_get_name(str
 
 static inline void orphan_all_buffers(struct inode *node)
 {
-	struct sysfs_buffer_collection *set = node->i_private;
+	struct sysfs_buffer_collection *set;
 	struct sysfs_buffer *buf;
 
 	mutex_lock_nested(&node->i_mutex, I_MUTEX_CHILD);
-	if (node->i_private) {
-		list_for_each_entry(buf, &set->associates, associates)
+	set = node->i_private;
+	if (set) {
+		list_for_each_entry(buf, &set->associates, associates) {
+			down(&buf->sem);
 			buf->orphaned = 1;
+			up(&buf->sem);
+		}
 	}
 	mutex_unlock(&node->i_mutex);
 }
Index: 2.6.21-rc3-git9/fs/sysfs/file.c
===================================================================
--- 2.6.21-rc3-git9.orig/fs/sysfs/file.c
+++ 2.6.21-rc3-git9/fs/sysfs/file.c
@@ -168,12 +168,12 @@ sysfs_read_file(struct file *file, char 
 	ssize_t retval = 0;
 
 	down(&buffer->sem);
-	if (buffer->orphaned) {
-		retval = -ENODEV;
-		goto out;
-	}
 	if (buffer->needs_read_fill) {
-		if ((retval = fill_read_buffer(file->f_path.dentry,buffer)))
+		if (buffer->orphaned)
+			retval = -ENODEV;
+		else
+			retval = fill_read_buffer(file->f_path.dentry,buffer);
+		if (retval)
 			goto out;
 	}
 	pr_debug("%s: count = %zd, ppos = %lld, buf = %s\n",

-
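
The effect of moving the orphan check inside the needs_read_fill branch can be modeled in userspace. This is a sketch, not the sysfs code: `struct toy_buffer`, `toy_fill()` and `toy_read()` are hypothetical names, the buffer semaphore is not modeled, and `toy_fill()` stands in for calling the driver's show() method.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model of a sysfs read buffer. */
struct toy_buffer {
	char page[32];
	size_t len;             /* bytes of valid data in page */
	size_t pos;             /* how much userspace has consumed */
	int needs_read_fill;    /* page has not been filled yet */
	int orphaned;           /* attribute has been unregistered */
};

/* Stand-in for the driver callback that fills the page. */
static int toy_fill(struct toy_buffer *b)
{
	strcpy(b->page, "hello\n");
	b->len = strlen(b->page);
	b->needs_read_fill = 0;
	return 0;
}

/* Returns bytes copied, 0 at end of data, or a negative errno. */
static long toy_read(struct toy_buffer *b, char *dst, size_t count)
{
	size_t avail;

	if (b->needs_read_fill) {
		/* Only a fresh fill needs the driver, so only a fresh
		 * fill has to fail once the attribute is gone. */
		int ret = b->orphaned ? -ENODEV : toy_fill(b);

		if (ret)
			return ret;
	}
	avail = b->len - b->pos;
	if (count > avail)
		count = avail;
	memcpy(dst, b->page + b->pos, count);
	b->pos += count;
	return (long)count;
}
```

A reader that has already consumed part of the page can drain the rest after the buffer is orphaned, since no driver callback is involved; a reader that has not yet filled its page gets -ENODEV.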

From: Alan Stern
Date: Thursday, March 15, 2007 - 12:50 pm

This patch (as868) adds a helper routine for device drivers that need
to set up a callback to perform some action in a different process's
context.  This is intended for use by attribute methods that want to
unregister themselves or their parent device.  Attribute method calls
are mutually exclusive with unregistration, so such actions cannot be
taken directly.

Two attribute methods are converted to use the new helper routine: one
for SCSI device deletion and one for System/390 ccwgroup devices.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>

---

Index: 2.6.21-rc3-git9/include/linux/sysfs.h
===================================================================
--- 2.6.21-rc3-git9.orig/include/linux/sysfs.h
+++ 2.6.21-rc3-git9/include/linux/sysfs.h
@@ -78,6 +78,9 @@ struct sysfs_ops {
 
 #ifdef CONFIG_SYSFS
 
+extern int sysfs_schedule_callback(struct kobject *kobj,
+		void (*func)(void *), void *data);
+
 extern int __must_check
 sysfs_create_dir(struct kobject *, struct dentry *);
 
@@ -132,6 +135,12 @@ extern int __must_check sysfs_init(void)
 
 #else /* CONFIG_SYSFS */
 
+static inline int sysfs_schedule_callback(struct kobject *kobj,
+		void (*func)(void *), void *data)
+{
+	return -ENOSYS;
+}
+
 static inline int sysfs_create_dir(struct kobject * k, struct dentry *shadow)
 {
 	return 0;
Index: 2.6.21-rc3-git9/fs/sysfs/file.c
===================================================================
--- 2.6.21-rc3-git9.orig/fs/sysfs/file.c
+++ 2.6.21-rc3-git9/fs/sysfs/file.c
@@ -629,6 +629,60 @@ void sysfs_remove_file_from_group(struct
 }
 EXPORT_SYMBOL_GPL(sysfs_remove_file_from_group);
 
+struct sysfs_schedule_callback_struct {
+	struct kobject 		*kobj;
+	void			(*func)(void *);
+	void			*data;
+	struct work_struct	work;
+};
+
+static void sysfs_schedule_callback_work(struct work_struct *work)
+{
+	struct sysfs_schedule_callback_struct *ss = container_of(work,
+			struct sysfs_schedule_callback_struct, ...
From: Hugh Dickins
Date: Tuesday, March 13, 2007 - 12:00 pm

Yep, it works fine with your patch in and my silly reverted, thanks.
But (I was about to say, even before seeing Cornelia's reply, honest!)
I think you do need to check (audit the source? or is some runtime
check possible?) for other such "suicidal" sysfs files, which
seemed to (sysfs-ignorant) me to pose the real problem.

Hugh
-

From: Alan Stern
Date: Tuesday, March 13, 2007 - 1:09 pm

A runtime check wouldn't detect anything until someone tried to use the 
file -- at which point the process would deadlock anyway.

On the other hand, a quick survey of the kernel source shows that
DEVICE_ATTR is used over 1500 times.  Auditing all of them is not a job
for the faint-of-heart!

Alan Stern

-

From: Hugh Dickins
Date: Tuesday, March 13, 2007 - 1:55 pm

Indeed, and faint-hearted Hugh wasn't intending to do so: but
stout-hearted Alan will need to, won't he, before his patch can go in?
-

From: Dmitry Torokhov
Date: Tuesday, March 13, 2007 - 2:08 pm

I think we could rely on subsystems maintainers to let us know if
there are potential problems. For example I can tell that neither
input, serio nor gameport subsystems use sysfs to destroy their
devices (action on sysfs may cause some other device to be destroyed
but that should be ok, only self-destruction is not allowed, right?)

-- 
Dmitry
-

From: Alan Stern
Date: Tuesday, March 13, 2007 - 2:20 pm

Allow me to point out that the original patch is Oliver's (although I
helped), and it doesn't need to go in -- it needs not to be removed.

Furthermore, I have better things to do with the next month of my time 
than auditing hundreds of routines I don't understand for behavior I 
probably won't be able to recognize.  (Although at 50 a day... hmmm, 
maybe.)

This sounds more like a job for kernel-janitors!



Very good points.  USB doesn't do anything like that either.  And right, 
it's okay for a method to destroy other devices; it just can't do anything 
that would lead to its own unregistration.

Alan Stern

-
