I'm debugging a blk_rq_map_sg() crash that i'm getting on sparc64 as
root is mounted over IDE. I think I know what is happening now.
The IDE sg table is allocated and initialized like this in
drivers/ide/ide-probe.c:
x = kmalloc(sizeof(struct scatterlist) * nents, GFP_XXX);
sg_init_table(x, nents);
So far, so good.
Now, ide_map_sg() passes requests down to blk_rq_map_sg() like this in
drivers/block/ide-io.c:
hwif->sg_nents = blk_rq_map_sg(drive->queue, rq, sg);
Ok, so what does blk_rq_map_sg() do?
sg = NULL;
rq_for_each_segment(bvec, rq, iter) {
...
if (bvprv && cluster) {
...
} else {
new_segment:
if (!sg)
sg = sglist;
else
sg = sg_next(sg);
...
}
bvprv = bvec;
} /* segments in rq */
if (sg)
__sg_mark_end(sg);
So let's say the first request comes in and needs 2 segs.
This will mark sg[1].page_link with 0x2
If the next request from IDE needs 4 segs, we'll OOPS because
sg_next() on &sg[1] will see page_link bit 0x2 is set and
therefore return NULL.
A quick look shows that if you're testing on SCSI (or something
layered on top of it like SATA or PATA) you won't see this seemingly
guarenteed crash because the SCSI mid-layer allocates a fresh sglist
via mempool_alloc() and runs sg_init_table() on it for every I/O
request.
-
We should never see the end pointer in blk_rq_map_sg(), or that's a bug in the driver. So it should be OK to just clear the end pointer always in there, even if it's not the prettiest solution... This just needs to be wrapped up in some scatterlist.h macro/function. diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c index 61c2e39..a3bda2f 100644 --- a/block/ll_rw_blk.c +++ b/block/ll_rw_blk.c @@ -1354,6 +1354,12 @@ new_segment: else sg = sg_next(sg); + /* + * Clear end-of-table pointer, we'll mark a new one + * at the end + */ + sg->page_link &= ~0x2; + sg_dma_len(sg) = 0; sg_dma_address(sg) = 0; sg_set_page(sg, bvec->bv_page); -- Jens Axboe -
Eh this wont work, it's the wrong entry... Here's a temporary
work-around.
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index c89f0d3..108202b 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -822,6 +822,7 @@ void ide_map_sg(ide_drive_t *drive, struct request *rq)
return;
if (rq->cmd_type != REQ_TYPE_ATA_TASKFILE) {
+ sg_init_table(hwif->sg_table, hwif->sg_max_nents);
hwif->sg_nents = blk_rq_map_sg(drive->queue, rq, sg);
} else {
sg_init_one(sg, rq->buffer, rq->nr_sectors * SECTOR_SIZE);
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index ec55a17..cbfb113 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -1324,8 +1324,6 @@ static int hwif_init(ide_hwif_t *hwif)
goto out;
}
- sg_init_table(hwif->sg_table, hwif->sg_max_nents);
-
if (init_irq(hwif) == 0)
goto done;
--
Jens Axboe
-
That should work - once you verify that, would you mind testing this one
as well? Thanks!
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 61c2e39..290836f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1352,7 +1352,7 @@ new_segment:
if (!sg)
sg = sglist;
else
- sg = sg_next(sg);
+ sg = sg_next_force(sg);
sg_dma_len(sg) = 0;
sg_dma_address(sg) = 0;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 42daf5e..a98a2ee 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -99,6 +99,22 @@ static inline struct scatterlist *sg_next(struct scatterlist *sg)
return sg;
}
+/**
+ * sg_next_force - return the next scatterlist entry in a list
+ * @sg: The current sg entry
+ *
+ * Description:
+ * Must only be used when more entries beyond this one is known to exist,
+ * as it clears the termination bit. Useful to avoid adding a full sg
+ * table init on every mapping.
+ *
+ **/
+static inline struct scatterlist *sg_next_force(struct scatterlist *sg)
+{
+ sg->page_link &= ~0x02;
+ return sg_next(sg);
+}
+
/*
* Loop over each sg element, following the pointer to a new list if necessary
*/
--
Jens Axboe
-
Great, thanks for testing that as well. Thinking a bit more about it, I think the forced clear should stay in blk_rq_map_sg() since we don't want to advertise it to drivers. So I'll just open-code it in there. -- Jens Axboe -
>>>>> "Jens" == Jens Axboe <jens.axboe@oracle.com> writes: Jens> Great, thanks for testing that as well. Thinking a bit more Jens> about it, I think the forced clear should stay in Jens> blk_rq_map_sg() since we don't want to advertise it to Jens> drivers. So I'll just open-code it in there. Should there be more bounds checking here, so that if you try to do a _force() you at least check to make sure that there's something beyond there and useable on the list? We're saving the time from not reallocating from scratch, but let's make it robust by doing at least some more checks here. John -
We have to rely on the caller passing in an sgtable that is big enough to map the request, we have always made that assumption. Anything else would be a bug, of course. Now, catching that bug would indeed be nice. Any suggestions? -- Jens Axboe -
>>>>> "Jens" == Jens Axboe <jens.axboe@oracle.com> writes: Jens> Great, thanks for testing that as well. Thinking a bit more Jens> about it, I think the forced clear should stay in Jens> blk_rq_map_sg() since we don't want to advertise it to Jens> We have to rely on the caller passing in an sgtable that is big enough Jens> to map the request, we have always made that assumption. Anything else Jens> would be a bug, of course. Now, catching that bug would indeed be nice. Jens> Any suggestions? Poor mans slab poisoning maybe? As Alan Cox was saying about keeping it simple with a NULL entry on the end of the SG list, having an easy way to not fall of the end seems key. But from reading and re-reading the thread and some of the code, it looks like this list is really allocated, then the driver can just re-use the list (or a subset) as much as it likes without clearing and then reallocating. So if you've got a chain of 127 items, does it make sense when you only use 16 of them to poison the next three or four in the chain to really make sure you don't fall off the end? Sorry I don't have a concrete example, my C-foo is quite weak and I haven't had the time to really dive into this code. So take my comments as just an interested but ignorant bystander butting in. :] John -
Yes, that is indeed how most drivers work. It's both cheaper and faster to keep the list than allocating and initializing it every time. With the current termination bit approach, only that bits needs to be set/cleared to properly terminate the sgtable. So it's free, since it happens with setting page/length/offset. -- Jens Axboe -
On Tue, 23 Oct 2007 00:43:21 -0700 (PDT) BTW, we avoid calling sg_init_table() for every I/O request due to Jens' trick. And Jens will remove the code to clear sg_dma_len/addr in blk_rq_map_sg(). So sparc64 iommu needs to do that for itself? http://marc.info/?l=linux-scsi&m=119261920425120&w=2 -
It does - Dave, do you agree with the patch? I think it should be added, so that sparc64 isn't the odd-arch-out in this respect. I've tentatively applied the patch, Davem? -- Jens Axboe -
On Tue, 23 Oct 2007 09:09:33 +0200 Yeah, it won't work. Now we must call sg_init_table for every I/O request (it's not nice). -
I think the fix would be to have a sg_next_and_clear() or something that
doesn't honor the 0x02 termination bit and clears it, for the cases
Possibly, I did do quite a few of them. Alternatively, we can remove
__sg_mark_end() and leave that up to the driver.
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 61c2e39..290836f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1352,7 +1352,7 @@ new_segment:
if (!sg)
sg = sglist;
else
- sg = sg_next(sg);
+ sg = sg_next_force(sg);
sg_dma_len(sg) = 0;
sg_dma_address(sg) = 0;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 42daf5e..a98a2ee 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -99,6 +99,22 @@ static inline struct scatterlist *sg_next(struct scatterlist *sg)
return sg;
}
+/**
+ * sg_next_force - return the next scatterlist entry in a list
+ * @sg: The current sg entry
+ *
+ * Description:
+ * Must only be used when more entries beyond this one is known to exist,
+ * as it clears the termination bit. Useful to avoid adding a full sg
+ * table init on every mapping.
+ *
+ **/
+static inline struct scatterlist *sg_next_force(struct scatterlist *sg)
+{
+ sg->page_link &= ~0x02;
+ return sg_next(sg);
+}
+
/*
* Loop over each sg element, following the pointer to a new list if necessary
*/
--
Jens Axboe
-
