Re: IDE crash...

Previous thread: [PATCH] arm: build fix by FUJITA Tomonori on Monday, October 22, 2007 - 11:35 pm. (2 messages)

Next thread: [PATCH] fix typo in per_cpu_offset by Luming Yu on Tuesday, October 23, 2007 - 12:15 am. (6 messages)
From: David Miller
Subject: IDE crash...
Date: Monday, October 22, 2007 - 11:50 pm

I'm debugging a blk_rq_map_sg() crash that i'm getting on sparc64 as
root is mounted over IDE.  I think I know what is happening now.

The IDE sg table is allocated and initialized like this in
drivers/ide/ide-probe.c:

	x = kmalloc(sizeof(struct scatterlist) * nents, GFP_XXX);
	sg_init_table(x, nents);

So far, so good.

Now, ide_map_sg() passes requests down to blk_rq_map_sg() like this in
drivers/block/ide-io.c:

		hwif->sg_nents = blk_rq_map_sg(drive->queue, rq, sg);

Ok, so what does blk_rq_map_sg() do?

	sg = NULL;
	rq_for_each_segment(bvec, rq, iter) {
 ...
		if (bvprv && cluster) {
 ...
		} else {
new_segment:
			if (!sg)
				sg = sglist;
			else
				sg = sg_next(sg);
 ...
		}
		bvprv = bvec;
	} /* segments in rq */

	if (sg)
		__sg_mark_end(sg);

So let's say the first request comes in and needs 2 segs.
This will mark sg[1].page_link with 0x2

If the next request from IDE needs 4 segs, we'll OOPS because
sg_next() on &sg[1] will see page_link bit 0x2 is set and
therefore return NULL.

A quick look shows that if you're testing on SCSI (or something
layered on top of it like SATA or PATA) you won't see this seemingly
guarenteed crash because the SCSI mid-layer allocates a fresh sglist
via mempool_alloc() and runs sg_init_table() on it for every I/O
request.
-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 12:02 am

We should never see the end pointer in blk_rq_map_sg(), or that's a bug
in the driver. So it should be OK to just clear the end pointer always
in there, even if it's not the prettiest solution...

This just needs to be wrapped up in some scatterlist.h macro/function.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 61c2e39..a3bda2f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1354,6 +1354,12 @@ new_segment:
 			else
 				sg = sg_next(sg);
 
+			/*
+			 * Clear end-of-table pointer, we'll mark a new one
+			 * at the end
+			 */
+			sg->page_link &= ~0x2;
+
 			sg_dma_len(sg) = 0;
 			sg_dma_address(sg) = 0;
 			sg_set_page(sg, bvec->bv_page);

-- 
Jens Axboe

-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 12:09 am

Eh this wont work, it's the wrong entry... Here's a temporary
work-around.

diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index c89f0d3..108202b 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -822,6 +822,7 @@ void ide_map_sg(ide_drive_t *drive, struct request *rq)
 		return;
 
 	if (rq->cmd_type != REQ_TYPE_ATA_TASKFILE) {
+		sg_init_table(hwif->sg_table, hwif->sg_max_nents);
 		hwif->sg_nents = blk_rq_map_sg(drive->queue, rq, sg);
 	} else {
 		sg_init_one(sg, rq->buffer, rq->nr_sectors * SECTOR_SIZE);
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index ec55a17..cbfb113 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -1324,8 +1324,6 @@ static int hwif_init(ide_hwif_t *hwif)
 		goto out;
 	}
 
-	sg_init_table(hwif->sg_table, hwif->sg_max_nents);
-	
 	if (init_irq(hwif) == 0)
 		goto done;
 

-- 
Jens Axboe

-

From: David Miller
Date: Tuesday, October 23, 2007 - 12:18 am

From: Jens Axboe <jens.axboe@oracle.com>

That's the exact patch I'm about to boot test :-)
-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 12:23 am

That should work - once you verify that, would you mind testing this one
as well? Thanks!

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 61c2e39..290836f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1352,7 +1352,7 @@ new_segment:
 			if (!sg)
 				sg = sglist;
 			else
-				sg = sg_next(sg);
+				sg = sg_next_force(sg);
 
 			sg_dma_len(sg) = 0;
 			sg_dma_address(sg) = 0;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 42daf5e..a98a2ee 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -99,6 +99,22 @@ static inline struct scatterlist *sg_next(struct scatterlist *sg)
 	return sg;
 }
 
+/**
+ * sg_next_force - return the next scatterlist entry in a list
+ * @sg:		   The current sg entry
+ *
+ * Description:
+ *   Must only be used when more entries beyond this one is known to exist,
+ *   as it clears the termination bit. Useful to avoid adding a full sg
+ *   table init on every mapping.
+ *
+ **/
+static inline struct scatterlist *sg_next_force(struct scatterlist *sg)
+{
+	sg->page_link &= ~0x02;
+	return sg_next(sg);
+}
+
 /*
  * Loop over each sg element, following the pointer to a new list if necessary
  */

-- 
Jens Axboe

-

From: David Miller
Date: Tuesday, October 23, 2007 - 12:43 am

From: Jens Axboe <jens.axboe@oracle.com>

This one works here too, thanks.
-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 12:45 am

Great, thanks for testing that as well. Thinking a bit more about it, I
think the forced clear should stay in blk_rq_map_sg() since we don't
want to advertise it to drivers. So I'll just open-code it in there.

-- 
Jens Axboe

-

From: John Stoffel
Date: Tuesday, October 23, 2007 - 8:10 am

>>>>> "Jens" == Jens Axboe <jens.axboe@oracle.com> writes:


Jens> Great, thanks for testing that as well. Thinking a bit more
Jens> about it, I think the forced clear should stay in
Jens> blk_rq_map_sg() since we don't want to advertise it to
Jens> drivers. So I'll just open-code it in there.

Should there be more bounds checking here, so that if you try to do a
_force() you at least check to make sure that there's something beyond
there and useable on the list?  

We're saving the time from not reallocating from scratch, but let's
make it robust by doing at least some more checks here.

John
-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 11:49 pm

We have to rely on the caller passing in an sgtable that is big enough
to map the request, we have always made that assumption. Anything else
would be a bug, of course. Now, catching that bug would indeed be nice.
Any suggestions?

-- 
Jens Axboe

-

From: John Stoffel
Date: Wednesday, October 24, 2007 - 9:27 am

>>>>> "Jens" == Jens Axboe <jens.axboe@oracle.com> writes:

Jens> Great, thanks for testing that as well. Thinking a bit more
Jens> about it, I think the forced clear should stay in
Jens> blk_rq_map_sg() since we don't want to advertise it to

Jens> We have to rely on the caller passing in an sgtable that is big enough
Jens> to map the request, we have always made that assumption. Anything else
Jens> would be a bug, of course. Now, catching that bug would indeed be nice.
Jens> Any suggestions?

Poor mans slab poisoning maybe?  As Alan Cox was saying about keeping
it simple with a NULL entry on the end of the SG list, having an easy
way to not fall of the end seems key. 

But from reading and re-reading the thread and some of the code, it
looks like this list is really allocated, then the driver can just
re-use the list (or a subset) as much as it likes without clearing and
then reallocating.

So if you've got a chain of 127 items, does it make sense when you
only use 16 of them to poison the next three or four in the chain to
really make sure you don't fall off the end?

Sorry I don't have a concrete example, my C-foo is quite weak and I
haven't had the time to really dive into this code.  So take my
comments as just an interested but ignorant bystander butting in.  :]

John

-

From: Jens Axboe
Date: Wednesday, October 24, 2007 - 11:10 am

Yes, that is indeed how most drivers work. It's both cheaper and faster
to keep the list than allocating and initializing it every time. With
the current termination bit approach, only that bits needs to be
set/cleared to properly terminate the sgtable. So it's free, since it
happens with setting page/length/offset.

-- 
Jens Axboe

-

From: FUJITA Tomonori
Date: Tuesday, October 23, 2007 - 3:52 am

On Tue, 23 Oct 2007 00:43:21 -0700 (PDT)

BTW, we avoid calling sg_init_table() for every I/O request due to Jens'
trick. And Jens will remove the code to clear sg_dma_len/addr in
blk_rq_map_sg(). So sparc64 iommu needs to do that for itself?

http://marc.info/?l=linux-scsi&m=119261920425120&w=2

-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 3:57 am

It does - Dave, do you agree with the patch? I think it should be added,
so that sparc64 isn't the odd-arch-out in this respect. I've tentatively
applied the patch, Davem?

-- 
Jens Axboe

-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 3:58 am

BTW, you sure that should not include to check whether sg_next() returns
non-NULL?

-- 
Jens Axboe

-

From: FUJITA Tomonori
Date: Tuesday, October 23, 2007 - 4:10 am

On Tue, 23 Oct 2007 12:58:11 +0200

Yeah, the following part checks that:

+	if (dma_sg != sg) {
-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 4:43 am

Ah good, thanks for confirming.

-- 
Jens Axboe

-

From: David Miller
Date: Tuesday, October 23, 2007 - 2:18 pm

From: Jens Axboe <jens.axboe@oracle.com>

No objections.
-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 2:44 pm

Great, thanks Dave.

-- 
Jens Axboe

-

From: FUJITA Tomonori
Date: Tuesday, October 23, 2007 - 12:14 am

On Tue, 23 Oct 2007 09:09:33 +0200

Yeah, it won't work. Now we must call sg_init_table for every I/O
request (it's not nice).

-

From: Jens Axboe
Date: Tuesday, October 23, 2007 - 12:23 am

I think the fix would be to have a sg_next_and_clear() or something that
doesn't honor the 0x02 termination bit and clears it, for the cases

Possibly, I did do quite a few of them. Alternatively, we can remove
__sg_mark_end() and leave that up to the driver.


diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 61c2e39..290836f 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1352,7 +1352,7 @@ new_segment:
 			if (!sg)
 				sg = sglist;
 			else
-				sg = sg_next(sg);
+				sg = sg_next_force(sg);
 
 			sg_dma_len(sg) = 0;
 			sg_dma_address(sg) = 0;
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 42daf5e..a98a2ee 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -99,6 +99,22 @@ static inline struct scatterlist *sg_next(struct scatterlist *sg)
 	return sg;
 }
 
+/**
+ * sg_next_force - return the next scatterlist entry in a list
+ * @sg:		   The current sg entry
+ *
+ * Description:
+ *   Must only be used when more entries beyond this one is known to exist,
+ *   as it clears the termination bit. Useful to avoid adding a full sg
+ *   table init on every mapping.
+ *
+ **/
+static inline struct scatterlist *sg_next_force(struct scatterlist *sg)
+{
+	sg->page_link &= ~0x02;
+	return sg_next(sg);
+}
+
 /*
  * Loop over each sg element, following the pointer to a new list if necessary
  */

-- 
Jens Axboe

-

Previous thread: [PATCH] arm: build fix by FUJITA Tomonori on Monday, October 22, 2007 - 11:35 pm. (2 messages)

Next thread: [PATCH] fix typo in per_cpu_offset by Luming Yu on Tuesday, October 23, 2007 - 12:15 am. (6 messages)