fw_device.node_id and fw_device.generation are accessed without mutexes.
We have to ensure that all readers will get to see node_id updates
before generation updates.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
drivers/firewire/fw-device.c | 6 ++++++
drivers/firewire/fw-topology.c | 1 +
2 files changed, 7 insertions(+)
Index: linux/drivers/firewire/fw-device.c
===================================================================
--- linux.orig/drivers/firewire/fw-device.c
+++ linux/drivers/firewire/fw-device.c
@@ -808,6 +813,7 @@ void fw_node_event(struct fw_card *card,
device = node->data;
device->node_id = node->node_id;
+ wmb();
device->generation = card->generation;
if (atomic_read(&device->state) == FW_DEVICE_RUNNING) {
PREPARE_DELAYED_WORK(&device->work, fw_device_update);
Index: linux/drivers/firewire/fw-topology.c
===================================================================
--- linux.orig/drivers/firewire/fw-topology.c
+++ linux/drivers/firewire/fw-topology.c
@@ -518,6 +518,7 @@ fw_core_handle_bus_reset(struct fw_card
card->bm_retries = 0;
card->node_id = node_id;
+ wmb();
card->generation = generation;
card->reset_jiffies = jiffies;
schedule_delayed_work(&card->work, 0);
--
Stefan Richter
-=====-=-=== =-== ----=
http://arcgraph.de/sr/
-
read_rom() obtained a fresh new fw_device.generation for each read
transaction (unless the compiler performed very aggressive inlining in
read_bus_info_block). It's unlikely but not impossible that we could
end up with a corrupt fetched ROM image if there was a generation change
during the ROM reading.
We now restart reading the ROM if the bus generation changed.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
drivers/firewire/fw-device.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
Index: linux/drivers/firewire/fw-device.c
===================================================================
--- linux.orig/drivers/firewire/fw-device.c
+++ linux/drivers/firewire/fw-device.c
@@ -391,7 +391,8 @@ complete_transaction(struct fw_card *car
complete(&callback_data->done);
}
-static int read_rom(struct fw_device *device, int index, u32 * data)
+static int
+read_rom(struct fw_device *device, int generation, int index, u32 *data)
{
struct read_quadlet_callback_data callback_data;
struct fw_transaction t;
@@ -401,7 +402,7 @@ static int read_rom(struct fw_device *de
offset = 0xfffff0000400ULL + index * 4;
fw_send_request(device->card, &t, TCODE_READ_QUADLET_REQUEST,
- device->node_id, device->generation, device->max_speed,
+ device->node_id, generation, device->max_speed,
offset, NULL, 4, complete_transaction, &callback_data);
wait_for_completion(&callback_data.done);
@@ -411,7 +412,14 @@ static int read_rom(struct fw_device *de
return callback_data.rcode;
}
-static int read_bus_info_block(struct fw_device *device)
+/*
+ * Read the bus info block, perform a speed probe, and read all of the rest of
+ * the config ROM. We do all this with a cached bus generation. If the bus
+ * generation changes under us, read_bus_info_block will fail and get retried.
+ * It's better to start all over in this case because the node from which we
+ * are reading the ROM may have changed the ROM during the ...Two fixes:
- There was a small window where a login or reconnect job could use an
already updated card generation with an outdated node ID. We better
use the fw_device.generation here, not the fw_card.generation.
- Insert a memory barrier to ensure that the device generation is read
before the node ID. This is to guarantee that the generation is not
newer than the node ID.
A small optimization:
- The target's and initiator's node IDs can be obtained from fw_device
and fw_card. Dereferencing their underlying topology objects is not
necessary.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
drivers/firewire/fw-sbp2.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -643,9 +643,10 @@ static void sbp2_login(struct work_struc
struct sbp2_login_response response;
int generation, node_id, local_node_id;
- generation = device->card->generation;
- node_id = device->node->node_id;
- local_node_id = device->card->local_node->node_id;
+ generation = device->generation;
+ rmb();
+ node_id = device->node_id;
+ local_node_id = device->card->node_id;
if (sbp2_send_management_orb(lu, node_id, generation,
SBP2_LOGIN_REQUEST, lu->lun, &response) < 0) {
@@ -900,9 +901,10 @@ static void sbp2_reconnect(struct work_s
struct fw_device *device = fw_device(unit->device.parent);
int generation, node_id, local_node_id;
- generation = device->card->generation;
- node_id = device->node->node_id;
- local_node_id = device->card->local_node->node_id;
+ generation = device->generation;
+ rmb();
+ node_id = device->node_id;
+ local_node_id = device->card->node_id;
if (sbp2_send_management_orb(lu, node_id, generation,
SBP2_RECONNECT_REQUEST,
--
Stefan ...Hi, a few points: - can change it to use spinlocks instead? This would be most preferable. - if not, you need comments. - you also seem to be missing rmb()s now. I see a couple in the firewire directory, but nothing that seems to be ordering loads of these particular fields. - use smp_*mb() if you are just ordering regular cacheable RAM accesses. -
It could be done, but I don't find it preferable for this purpose.
Although the struct members in question are basically part of fw-core's
API (kernel-internal API), there won't be a lot of places which access
these members.
(Besides, lockless algorithms are essential to FireWire driver code
everywhere where CPU and controller interact. So IMO these lockless
So far I tended to disagree every time when checkpatch.pl bugged me
about barriers without comments. But maybe that was because I had a
wrong idea of what kind of comment should go there. There would
certainly be no value in a comment which effectively says "trust me, I
know we need a barrier here". However, thinking about it again, it
occurs to me that I should add the following comments:
1.) At respective struct type definitions: A comment on how struct
members are to be accessed and why.
2.) At the occurrences of rmb() and wmb(): A comment on which
accesses the particular barrier divides. This is in order to get
the barriers properly updated (i.e. moved or removed) when
surrounding code is reordered, APIs reworked, or whatever.
Of course this division of the Why and the What only applies to accesses
like those in the patch --- i.e. API-like data types which aren't
abstracted by accessor functions --- while there may be other
requirements on comments for other usage types of barriers.
That's right. Of course the wmb()s don't make sense if the reader side
isn't fixed up accordingly. I did this for all places which I spotted
yesterday in the patches
"firewire: fw-core: react on bus resets while the config ROM is being
fetched", http://lkml.org/lkml/2007/10/31/464 (Hmm, this has an unclear
changelog.)
"firewire: fw-sbp2: enforce read order of device generation and node
ID", http://lkml.org/lkml/2007/10/31/465
I didn't fold all these patches into one because these two other patches
include also other changes to the respective read sides. I should have
pointed this out ...Yes, this is the main thing to comment at the actual site of the barirer. Although it isn't always 100% clear with locks either (because you might do some non-critical operatoins inside a lock section), it just tends to be clearer what it is being locked, and what other paths are being locked against. So describing what accesses are being ordered is good. Also a brief description of how the lockless read-side works, if it's not obvious I think what you've said sounds good. I think that even memory barriers Sure. I might personally still separate out the memory ordering fix and put all the rmb and wmb in a single patch, but that's no big Ah, that's no problem. I manage to screw up patches in much worse ways than this, don't worry ;) Thanks, Nick -
I now updated some patches from November according to comments I received from Nick back then. As Jarod found out now while testing the November versions, parts of these patches actually fix multiply reported bugs. While updating this stuff, I also found two more read sites where the read order was not properly enforced. Incoming: 1/4 firewire: fw-sbp2: use device generation, not card generation 2/4 firewire: fw-cdev: use device generation, not card generation 3/4 firewire: enforce access order between generation and node ID 4/4 firewire: fw-core: react on bus resets while the config ROM is being fetched drivers/firewire/fw-cdev.c | 3 ++- drivers/firewire/fw-device.c | 33 ++++++++++++++++++++++++--------- drivers/firewire/fw-device.h | 12 ++++++++++++ drivers/firewire/fw-sbp2.c | 14 ++++++++------ drivers/firewire/fw-topology.c | 5 +++++ 5 files changed, 51 insertions(+), 16 deletions(-) -- Stefan Richter -=====-==--- ---= ==--- http://arcgraph.de/sr/ --
There was a small window where a login or reconnect job could use an
already updated card generation with an outdated node ID. We have to
use the fw_device.generation here, not the fw_card.generation, because
the generation must never be newer than the node ID when we emit a
transaction. This cannot be guaranteed with fw_card.generation.
Furthermore, the target's and initiator's node IDs can be obtained from
fw_device and fw_card. Dereferencing their underlying topology objects
is not necessary.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
Rework of patch "firewire: fw-sbp2: enforce read order of device
generation and node ID" from November 1 2007.
This code also needs barriers to work precisely as intended. They will
be added by a subsequent patch which consistently updates readers and
writers of .generation and .node_id.
drivers/firewire/fw-sbp2.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
Index: linux/drivers/firewire/fw-sbp2.c
===================================================================
--- linux.orig/drivers/firewire/fw-sbp2.c
+++ linux/drivers/firewire/fw-sbp2.c
@@ -671,9 +671,9 @@ static void sbp2_login(struct work_struc
struct sbp2_login_response response;
int generation, node_id, local_node_id;
- generation = device->card->generation;
- node_id = device->node->node_id;
- local_node_id = device->card->local_node->node_id;
+ generation = device->generation;
+ node_id = device->node_id;
+ local_node_id = device->card->node_id;
if (sbp2_send_management_orb(lu, node_id, generation,
SBP2_LOGIN_REQUEST, lu->lun, &response) < 0) {
@@ -921,9 +921,9 @@ static void sbp2_reconnect(struct work_s
struct fw_device *device = fw_device(unit->device.parent);
int generation, node_id, local_node_id;
- generation = device->card->generation;
- node_id = device->node->node_id;
- local_node_id = device->card->local_node->node_id;
+ generation = ...Verified in concert with parts 2 and 3 as well as with 2, 3 and 4, to fix 'giving up on config rom' issues on multiple system and drive combinations that were previously affected. Signed-off-by: Jarod Wilson <jwilson@redhat.com> -- Jarod Wilson jwilson@redhat.com --
We have to use the fw_device.generation here, not the fw_card.generation, because the generation must never be newer than the node ID when we emit a transaction. This cannot be guaranteed with fw_card.generation. Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> --- This code also needs barriers to work precisely as intended. They will be added by a subsequent patch which consistently updates readers and writers of .generation and .node_id. drivers/firewire/fw-cdev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux/drivers/firewire/fw-cdev.c =================================================================== --- linux.orig/drivers/firewire/fw-cdev.c +++ linux/drivers/firewire/fw-cdev.c @@ -206,12 +206,12 @@ fill_bus_reset_event(struct fw_cdev_even event->closure = client->bus_reset_closure; event->type = FW_CDEV_EVENT_BUS_RESET; + event->generation = client->device->generation; event->node_id = client->device->node_id; event->local_node_id = card->local_node->node_id; event->bm_node_id = 0; /* FIXME: We don't track the BM. */ event->irm_node_id = card->irm_node->node_id; event->root_node_id = card->root_node->node_id; - event->generation = card->generation; } static void -- Stefan Richter -=====-==--- ---= ==--- http://arcgraph.de/sr/ --
Verified in concert with parts 1 and 3 as well as with 1, 3 and 4, to fix 'giving up on config rom' issues on multiple system and drive combinations that were previously affected. Signed-off-by: Jarod Wilson <jwilson@redhat.com> -- Jarod Wilson jwilson@redhat.com --
fw_device.node_id and fw_device.generation are accessed without mutexes. We have to ensure that all readers will get to see node_id updates before generation updates. An earlier incarnation of this patch fixes an inability to recognize devices after "giving up on config rom", https://bugzilla.redhat.com/show_bug.cgi?id=429950 Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> --- Rework of patches firewire: fw-core: enforce write order when updating fw_device.generation and parts of firewire: fw-core: react on bus resets while the config ROM is being fetched firewire: fw-sbp2: enforce read order of device generation and node ID from November 1 2007. Update: - write site and read sites folded into one patch - added fix to fw_device_enable_phys_dma() and fill_bus_reset_event() - smp_ barriers are sufficient - comments, changelog drivers/firewire/fw-cdev.c | 1 + drivers/firewire/fw-device.c | 13 +++++++++++-- drivers/firewire/fw-device.h | 12 ++++++++++++ drivers/firewire/fw-sbp2.c | 2 ++ drivers/firewire/fw-topology.c | 5 +++++ 5 files changed, 31 insertions(+), 2 deletions(-) Index: linux/drivers/firewire/fw-device.c =================================================================== --- linux.orig/drivers/firewire/fw-device.c +++ linux/drivers/firewire/fw-device.c @@ -182,9 +182,13 @@ static void fw_device_release(struct dev int fw_device_enable_phys_dma(struct fw_device *device) { + int generation = device->generation; + + /* device->node_id, accessed below, must not be older than generation */ + smp_rmb(); return device->card->driver->enable_phys_dma(device->card, device->node_id, - device->generation); + generation); } EXPORT_SYMBOL(fw_device_enable_phys_dma); @@ -389,12 +393,16 @@ static int read_rom(struct fw_device *de struct read_quadlet_callback_data callback_data; struct fw_transaction t; u64 offset; + int generation = ...
I don't know the firewire subsystem at all, but the barriers seem right (in that they match your description of the problem), and comments for them are really good. Thanks, Nick --
Verified in concert with parts 1 and 2 as well as with 1, 2 and 4, to fix 'giving up on config rom' issues on multiple system and drive combinations that were previously affected. Signed-off-by: Jarod Wilson <jwilson@redhat.com> -- Jarod Wilson jwilson@redhat.com --
I'm going to commit it with a whitespace adjustment and a comment typo fixed. Jarod, thanks for the rigorous testing; Nick, thanks for helping in making this a solid improvement over the initial submission. -- Stefan Richter -=====-==--- ---= ==--= http://arcgraph.de/sr/ --
Also forgot to #include <asm/system.h> in some of the files. Here is the final product, for the record. From: Stefan Richter <stefanr@s5r6.in-berlin.de> Subject: firewire: enforce access order between generation and node ID, fix "giving up on config rom" fw_device.node_id and fw_device.generation are accessed without mutexes. We have to ensure that all readers will get to see node_id updates before generation updates. Fixes an inability to recognize devices after "giving up on config rom", https://bugzilla.redhat.com/show_bug.cgi?id=429950 Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Reviewed by Nick Piggin <nickpiggin@yahoo.com.au>. Verified to fix 'giving up on config rom' issues on multiple system and drive combinations that were previously affected. Signed-off-by: Jarod Wilson <jwilson@redhat.com> --- drivers/firewire/fw-cdev.c | 1 + drivers/firewire/fw-device.c | 15 +++++++++++++-- drivers/firewire/fw-device.h | 12 ++++++++++++ drivers/firewire/fw-sbp2.c | 3 +++ drivers/firewire/fw-topology.c | 6 ++++++ 5 files changed, 35 insertions(+), 2 deletions(-) Index: linux/drivers/firewire/fw-device.c =================================================================== --- linux.orig/drivers/firewire/fw-device.c +++ linux/drivers/firewire/fw-device.c @@ -27,6 +27,7 @@ #include <linux/idr.h> #include <linux/rwsem.h> #include <asm/semaphore.h> +#include <asm/system.h> #include <linux/ctype.h> #include "fw-transaction.h" #include "fw-topology.h" @@ -182,9 +183,14 @@ static void fw_device_release(struct dev int fw_device_enable_phys_dma(struct fw_device *device) { + int generation = device->generation; + + /* device->node_id, accessed below, must not be older than generation */ + smp_rmb(); + return device->card->driver->enable_phys_dma(device->card, device->node_id, - device->generation); + generation); } EXPORT_SYMBOL(fw_device_enable_phys_dma); @@ -389,12 ...
read_rom() obtained a fresh new fw_device.generation for each read
transaction. Hence it was able to continue reading in the middle of the
ROM even if a bus reset happened. However the device may have modified
the ROM during the reset. We would end up with a corrupt fetched ROM
image then.
Although all of this is quite unlikely, it is not impossible.
Therefore we now restart reading the ROM if the bus generation changed.
Side note: The barrier in read_rom(), inserted by patch "firewire:
enforce access order between generation and node ID" is not necessary
anymore because the sequence of calls
fw_device_init() ->
read_bus_info_block() ->
read_rom()
read_rom()
read_rom()
...
will take care that generation is read before node_id, won't it?
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
Refreshed version of the patch from November 1 2007.
drivers/firewire/fw-device.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
Index: linux/drivers/firewire/fw-device.c
===================================================================
--- linux.orig/drivers/firewire/fw-device.c
+++ linux/drivers/firewire/fw-device.c
@@ -388,15 +388,12 @@ complete_transaction(struct fw_card *car
complete(&callback_data->done);
}
-static int read_rom(struct fw_device *device, int index, u32 * data)
+static int
+read_rom(struct fw_device *device, int generation, int index, u32 *data)
{
struct read_quadlet_callback_data callback_data;
struct fw_transaction t;
u64 offset;
- int generation = device->generation;
-
- /* device->node_id, accessed below, must not be older than generation */
- smp_rmb();
init_completion(&callback_data.done);
@@ -412,7 +409,14 @@ static int read_rom(struct fw_device *de
return callback_data.rcode;
}
-static int read_bus_info_block(struct fw_device *device)
+/*
+ * Read the bus info block, perform a speed probe, and read all of the rest of
+ * the config ROM. We ...Based on a quick read through the code path, coupled with empirical evidence, yes, it appears safe to remove the barrier in read_rom(). Signed-off-by: Jarod Wilson <jwilson@redhat.com> -- Jarod Wilson jwilson@redhat.com --
Crap. My earlier 'empirical evidence' seems to have been happy coincidence. After a fresh boot, I'm consistently hitting 'failed to read config rom' failures with this patch applied. Simply putting the barrier back in gets things working again though. Interestingly, subsequent reloading of firewire-* modules less the barrier also tend to work until the system is rebooted. -- Jarod Wilson jwilson@redhat.com --
read_rom() obtained a fresh new fw_device.generation for each read
transaction. Hence it was able to continue reading in the middle of the
ROM even if a bus reset happened. However the device may have modified
the ROM during the reset. We would end up with a corrupt fetched ROM
image then.
Although all of this is quite unlikely, it is not impossible.
Therefore we now restart reading the ROM if the bus generation changed.
Note, the memory barrier in read_rom() is still necessary according to
tests by Jarod Wilson, despite of the ->generation access being moved up
in the call chain.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
---
First posted on 2007-11-01. Update: Barrier stays.
drivers/firewire/fw-device.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
Index: linux-2.6.24/drivers/firewire/fw-device.c
===================================================================
--- linux-2.6.24.orig/drivers/firewire/fw-device.c
+++ linux-2.6.24/drivers/firewire/fw-device.c
@@ -389,12 +389,12 @@ complete_transaction(struct fw_card *car
complete(&callback_data->done);
}
-static int read_rom(struct fw_device *device, int index, u32 * data)
+static int
+read_rom(struct fw_device *device, int generation, int index, u32 *data)
{
struct read_quadlet_callback_data callback_data;
struct fw_transaction t;
u64 offset;
- int generation = device->generation;
/* device->node_id, accessed below, must not be older than generation */
smp_rmb();
@@ -413,7 +413,14 @@ static int read_rom(struct fw_device *de
return callback_data.rcode;
}
-static int read_bus_info_block(struct fw_device *device)
+/*
+ * Read the bus info block, perform a speed probe, and read all of the rest of
+ * the config ROM. We do all this with a cached bus generation. If the bus
+ * generation changes under us, read_bus_info_block will fail and get retried.
+ * It's better to start all over in this case because the node from which ...This is essentially what I've been beating on locally, and I've yet to hit another config rom read failure with it. Signed-off-by: Jarod Wilson <jwilson@redhat.com> -- Jarod Wilson jwilson@redhat.com --
Thanks. I already pushed to linux1394-2.6.git but I will add your sign-off before I ask Linus to pull. -- Stefan Richter -=====-==--- ---= ==--= http://arcgraph.de/sr/ --
