[PATCH] DMA: Fix broken device refcounting

To: Shannon Nelson <shannon.nelson@...>
Cc: Dan Williams <dan.j.williams@...>, <linux-kernel@...>, <akpm@...>, Haavard Skinnemoen <hskinnemoen@...>
Date: Friday, October 26, 2007 - 12:12 pm

When a DMA device is unregistered, its reference count is decremented
twice for each channel: once in dma_class_dev_release() and once in
dma_chan_cleanup(). This may result in the DMA device driver's
remove() function completing before all channels have been cleaned
up, causing lots of use-after-free fun.

Fix it by incrementing the device's reference count twice for each
channel during registration.

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
---
I'm not sure if this is the correct way to solve it, but it seems to
work. The remove() function does not hang, which indicates that the
device's reference count does drop all the way to zero on
unregistration, which in turn indicates that it did actually drop
_below_ zero before.
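
The arithmetic behind that observation can be sketched in userspace C. This is only a model under stated assumptions: struct model_ref, model_ref_get(), model_ref_put() and model_teardown() are hypothetical stand-ins, not the kernel's kref API or the dmaengine code, and the device is assumed to start with a single registration reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for a kernel kref; a model, not the real API. */
struct model_ref {
	int count;
	bool released;	/* set when the count first reaches zero */
};

struct model_result {
	int final_count;	/* count before the registration reference is dropped */
	bool released_early;	/* release fired while that reference was still held */
};

static void model_ref_get(struct model_ref *r)
{
	r->count++;
}

static void model_ref_put(struct model_ref *r)
{
	if (--r->count == 0)
		r->released = true;
}

/*
 * Take gets_per_chan device references for each of nchans channels,
 * then tear every channel down with two puts.
 */
static struct model_result model_teardown(int nchans, int gets_per_chan)
{
	struct model_ref dev = { .count = 1, .released = false };
	struct model_result res;
	int c, g;

	assert(nchans >= 0 && gets_per_chan >= 0);

	for (c = 0; c < nchans; c++)
		for (g = 0; g < gets_per_chan; g++)
			model_ref_get(&dev);

	for (c = 0; c < nchans; c++) {
		model_ref_put(&dev);	/* stands in for dma_class_dev_release() */
		model_ref_put(&dev);	/* stands in for dma_chan_cleanup() */
	}

	res.final_count = dev.count;
	res.released_early = dev.released;
	return res;
}
```

In this model, model_teardown(2, 1) ends with a count of -1 and an early release, while model_teardown(2, 2) ends with a count of 1 and no early release, leaving the registration reference to trigger the release.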

drivers/dma/dmaengine.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 8248992..302eded 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -397,6 +397,8 @@ int dma_async_device_register(struct dma_device *device)
goto err_out;
}

+ /* One for the channel, one for the class device */
+ kref_get(&device->refcount);
kref_get(&device->refcount);
kref_init(&chan->refcount);
chan->slow_ref = 0;
--
1.5.2.5

-

To: Haavard Skinnemoen <hskinnemoen@...>
Cc: Williams, Dan J <dan.j.williams@...>, <linux-kernel@...>, <akpm@...>
Date: Friday, October 26, 2007 - 12:59 pm

As Dan said, we've been discussing this offline, and hadn't come to an
agreement yet. My version of the patch is the opposite of yours -
instead of adding a kref_get(), I remove one of the kref_put() calls.
--

When a channel is removed from dmaengine, too many kref_put() calls
are made and the device removal happens too soon, usually causing
a panic.

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
---

drivers/dma/dmaengine.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 8248992..144a1b7 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -131,7 +131,6 @@ static void dma_async_device_cleanup(struct kref *kref);
static void dma_class_dev_release(struct class_device *cd)
{
struct dma_chan *chan = container_of(cd, struct dma_chan,
class_dev);
- kref_put(&chan->device->refcount, dma_async_device_cleanup);
}

static struct class dma_devclass = {
-

To: Haavard Skinnemoen <hskinnemoen@...>
Cc: Andrew Morton <akpm@...>, linux-kernel <linux-kernel@...>, Shannon Nelson <shannon.nelson@...>
Date: Friday, October 26, 2007 - 12:36 pm

Yeah, Shannon ran into this too... I'd like to be able to clean this up by
reducing the number of times we take the device reference, but the
following patch is still showing problems in Shannon's environment, so I
missed one...

---

dmaengine: fix up dma_device refcounting

From: Dan Williams <dan.j.williams@intel.com>

Currently the code drops too many references on the parent device. Change
the scheme to:

+ take a reference at registration:
dma_async_device_register()
+ take a reference for each channel device registered:
device_register(&chan->dev)
- drop a reference for each channel device unregistered:
device_unregister(&chan->dev)
- drop a reference at unregistration:
dma_async_device_unregister()
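
The four steps above can be checked with a small userspace count. This is a sketch only: scheme_release_step() and scheme_total_steps() are hypothetical helpers, the counter is a plain int rather than a kref, and it assumes every channel device is unregistered before the final drop.

```c
#include <assert.h>

/* Number of get/put steps the scheme performs for nchans channels. */
static int scheme_total_steps(int nchans)
{
	assert(nchans >= 0);
	return 2 * nchans + 2;
}

/*
 * Walk the scheme in order and return the 1-based step at which the
 * device count first reaches zero, or 0 if it never does.
 */
static int scheme_release_step(int nchans)
{
	int count, step = 0, release_at = 0;
	int i;

	assert(nchans >= 0);

	count = 1;	/* + reference taken in dma_async_device_register() */
	step++;

	for (i = 0; i < nchans; i++) {
		count++;	/* + device_register(&chan->dev) */
		step++;
	}

	for (i = 0; i < nchans; i++) {
		step++;		/* - device_unregister(&chan->dev) */
		if (--count == 0 && !release_at)
			release_at = step;
	}

	step++;			/* - dma_async_device_unregister() */
	if (--count == 0 && !release_at)
		release_at = step;

	return release_at;
}
```

For any channel count, scheme_release_step() equals scheme_total_steps(): the count reaches zero on the last drop and not before, which is the balance the scheme is aiming for.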

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

drivers/dma/dmaengine.c | 16 ++++------------
1 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 84257f7..d2b600b 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -186,10 +186,9 @@ static void dma_client_chan_alloc(struct dma_client *client)
/* we are done once this client rejects
* an available resource
*/
- if (ack == DMA_ACK) {
+ if (ack == DMA_ACK)
dma_chan_get(chan);
- kref_get(&device->refcount);
- } else if (ack == DMA_NAK)
+ else if (ack == DMA_NAK)
return;
}
}
@@ -221,7 +220,6 @@ void dma_chan_cleanup(struct kref *kref)
{
struct dma_chan *chan = container_of(kref, struct dma_chan, refcount);
chan->device->device_free_chan_resources(chan);
- kref_put(&chan->device->refcount, dma_async_device_cleanup);
}
EXPORT_SYMBOL(dma_chan_cleanup);

@@ -276,11 +274,8 @@ static void dma_clients_notify_removed(struct dma_chan *chan)
/* client was holding resources for this channel so
* free it
*/
- if (ack == DMA_ACK) {
+ if (ack == DMA_ACK)
dma_chan_put(chan);
- kref_pu...

To: Dan Williams <dan.j.williams@...>
Cc: Andrew Morton <akpm@...>, linux-kernel <linux-kernel@...>, Shannon Nelson <shannon.nelson@...>
Date: Saturday, October 27, 2007 - 9:49 am

On Fri, 26 Oct 2007 09:36:17 -0700

While I can't see any problems with the rest of the patch, I think this
part is wrong for the same reasons removing the kref_put() from the
class device cleanup function is. I don't see any constraint that
guarantees that dma_chan_cleanup() will always be called before
dma_dev_release(), which means that "chan" may have been freed before
this function gets a chance to run. Please correct me if I'm wrong.

Håvard
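
The ordering concern can be modeled in userspace C as follows. This is a sketch under assumptions, not the dmaengine code: the three teardown events and touches_freed_memory() are hypothetical, the device starts with one registration reference plus one per-channel reference, and "freed" simply means the model count has reached zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum teardown_event {
	CHAN_CLEANUP,		/* stands in for dma_chan_cleanup() */
	CLASS_DEV_RELEASE,	/* stands in for the class device release callback */
	FINAL_PUT		/* registration reference dropped at unregistration */
};

/* Two possible orders for the same teardown. */
static const enum teardown_event class_release_last[3] = {
	CHAN_CLEANUP, FINAL_PUT, CLASS_DEV_RELEASE
};
static const enum teardown_event class_release_first[3] = {
	CLASS_DEV_RELEASE, CHAN_CLEANUP, FINAL_PUT
};

/*
 * Return true if any event runs after the device count has already
 * reached zero, i.e. after the driver may have freed the channel.
 * class_dev_holds_ref selects whether the class device release path
 * owns (and drops) its own device reference.
 */
static bool touches_freed_memory(const enum teardown_event order[3],
				 bool class_dev_holds_ref)
{
	int count = 2 + (class_dev_holds_ref ? 1 : 0);
	bool freed = false, bad = false;
	int i;

	assert(order != NULL);

	for (i = 0; i < 3; i++) {
		bool drops = order[i] != CLASS_DEV_RELEASE || class_dev_holds_ref;

		if (freed)
			bad = true;	/* this event runs against freed memory */
		if (drops && --count == 0)
			freed = true;
	}
	return bad;
}
```

When the class device holds no reference of its own, the model is safe only if its release happens to run early: touches_freed_memory(class_release_last, false) is true. With one reference per holder, both orders are safe, because the count cannot reach zero until the last of the three events.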
-

To: Haavard Skinnemoen <hskinnemoen@...>
Cc: Andrew Morton <akpm@...>, linux-kernel <linux-kernel@...>, Nelson, Shannon <shannon.nelson@...>
Date: Saturday, October 27, 2007 - 3:12 pm

Absolutely right, the driver, not dmaengine, frees the memory so there
must be a per channel reference on the device to hold off the driver's

So how about this...

---snip---
dmaengine: Fix broken device refcounting

From: Haavard Skinnemoen <hskinnemoen@atmel.com>

When a DMA device is unregistered, its reference count is decremented
twice for each channel: once in dma_class_dev_release() and once in
dma_chan_cleanup(). This may result in the DMA device driver's
remove() function completing before all channels have been cleaned
up, causing lots of use-after-free fun.

Fix it by incrementing the device's reference count twice for each
channel during registration.

Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
[dan.j.williams@intel.com: kill unnecessary client refcounting]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

drivers/dma/dmaengine.c | 17 ++++++-----------
1 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c
index 84257f7..ec7e871 100644
--- a/drivers/dma/dmaengine.c
+++ b/drivers/dma/dmaengine.c
@@ -186,10 +186,9 @@ static void dma_client_chan_alloc(struct dma_client *client)
/* we are done once this client rejects
* an available resource
*/
- if (ack == DMA_ACK) {
+ if (ack == DMA_ACK)
dma_chan_get(chan);
- kref_get(&device->refcount);
- } else if (ack == DMA_NAK)
+ else if (ack == DMA_NAK)
return;
}
}
@@ -276,11 +275,8 @@ static void dma_clients_notify_removed(struct dma_chan *chan)
/* client was holding resources for this channel so
* free it
*/
- if (ack == DMA_ACK) {
+ if (ack == DMA_ACK)
dma_chan_put(chan);
- kref_put(&chan->device->refcount,
- dma_async_device_cleanup);
- }
}

mutex_unlock(&dma_list_mutex);
@@ -320,11 +316,8 @@ void dma_async_client_unregister(struct dma_client *client)
ack = client->event_callback(client, chan,...

To: Williams, Dan J <dan.j.williams@...>, Haavard Skinnemoen <hskinnemoen@...>
Cc: Andrew Morton <akpm@...>, linux-kernel <linux-kernel@...>
Date: Monday, October 29, 2007 - 12:02 pm

I tested this in my ioatdma setup and no longer get the panic. I'm good with this if you two are happy with it.

Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>

sln
-

To: Nelson, Shannon <shannon.nelson@...>
Cc: Williams, Dan J <dan.j.williams@...>, Andrew Morton <akpm@...>, linux-kernel <linux-kernel@...>
Date: Monday, October 29, 2007 - 12:11 pm

On Mon, 29 Oct 2007 09:02:34 -0700

Looks good to me too, although I haven't had a chance to test it yet.

Thanks,
Håvard
-

To: Dan Williams <dan.j.williams@...>
Cc: Haavard Skinnemoen <hskinnemoen@...>, Andrew Morton <akpm@...>, linux-kernel <linux-kernel@...>, Nelson, Shannon <shannon.nelson@...>
Date: Sunday, October 28, 2007 - 3:17 pm

Thanks - when I get back in tomorrow morning I'll test this to see
that it gets rid of the panic that I've been getting.

sln
--
==============================================
Mr. Shannon Nelson Parents can't afford to be squeamish.
-
