Re: [bug] ata subsystem related crash with latest -git

Previous thread: 2.6.23-git11 compile issues by Badari Pulavarty on Wednesday, October 17, 2007 - 8:33 am. (7 messages)

Next thread: Anyone else out there have an ExpressCard to try? by Mark Lord on Wednesday, October 17, 2007 - 9:04 am. (1 message)
From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 8:46 am

Jens, just got this crash on a testbox:

[   37.701628] warning: process `kmodule' used the removed sysctl system call with 1.23.
[   39.409390] BUG: unable to handle kernel paging request at virtual address 7ca76000
[   39.416892] printing eip: 78406669 *pde = 00dda027 *pte = 04a76000 
[   39.423132] Oops: 0000 [#1] DEBUG_PAGEALLOC
[   39.427292] 
[   39.428766] Pid: 431, comm: fsck.ext3 Not tainted (2.6.23 #45)
[   39.434571] EIP: 0060:[<78406669>] EFLAGS: 00010006 CPU: 0
[   39.440035] EIP is at blk_rq_map_sg+0xb9/0x170
[   39.444450] EAX: 04aed000 EBX: 7ca1f380 ECX: 04aee000 EDX: 00095da0
[   39.450689] ESI: 7ca75ff0 EDI: 79180da0 EBP: 00001000 ESP: 7caa1cac
[   39.456929]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[   39.462303] Process fsck.ext3 (pid: 431, ti=7caa0000 task=7ca80000 task.ti=7caa0000)
[   39.469841] Stack: 00000020 7b520000 00002000 04aef000 7ca75fe0 0000001f 00000001 00000000 
[   39.478161]        7ca1f300 01000002 7ca44980 7b522b94 7ca0de00 7b520000 784c89b5 7b520000 
[   39.486480]        784c843a 7ca0de00 7b520f20 7b522b94 7ca0de00 784e54a0 784f394b 00000000 
[   39.494799] Call Trace:
[   39.497400]  [<784c89b5>] scsi_init_io+0x55/0xe0
[   39.501992]  [<784c843a>] scsi_get_cmd_from_req+0x2a/0x40
[   39.507366]  [<784e54a0>] sd_prep_fn+0x80/0x940
[   39.511872]  [<784f394b>] ata_bmdma_start+0xb/0x20
[   39.516638]  [<784ef344>] ata_qc_issue_prot+0x164/0x1e0
[   39.521839]  [<78405c63>] elv_dispatch_sort+0x23/0xe0
[   39.526865]  [<784057d0>] elv_next_request+0xa0/0x130
[   39.531891]  [<787715b8>] _spin_lock_irq+0x38/0x50
[   39.536657]  [<784c9af4>] scsi_request_fn+0x1e4/0x370
[   39.541684]  [<78120e72>] del_timer+0x62/0x70
[   39.546016]  [<78408a45>] __generic_unplug_device+0x25/0x30
[   39.551563]  [<78408d15>] generic_unplug_device+0x15/0x30
[   39.556936]  [<78406430>] blk_backing_dev_unplug+0x0/0x10
[   39.562309]  [<7840643c>] blk_backing_dev_unplug+0xc/0x10
[   39.567682]  [<7818080d>] block_sync_page+0x2d/0x40
[   39.572535]  ...
From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 8:50 am

managed to trigger it a second time, so it seems rather reproducible:

[  328.771333] kjournald starting.  Commit interval 5 seconds
[  328.776963] EXT3 FS on sda5, internal journal
[  328.781172] EXT3-fs: mounted filesystem with ordered data mode.
[  329.689493] BUG: unable to handle kernel paging request at virtual address 7d516000
[  329.696990] printing eip: 78406669 *pde = 00ddd027 *pte = 05516000 
[  329.703230] Oops: 0000 [#1] DEBUG_PAGEALLOC
[  329.707390] 
[  329.708863] Pid: 0, comm: swapper Not tainted (2.6.23 #45)
[  329.714321] EIP: 0060:[<78406669>] EFLAGS: 00010006 CPU: 0
[  329.719787] EIP is at blk_rq_map_sg+0xb9/0x170
[  329.724202] EAX: 2f6df000 EBX: 7d6af880 ECX: 34d55000 EDX: 005edbe0
[  329.730441] ESI: 7d515ff0 EDI: 796d8be0 EBP: 00001000 ESP: 78a13db0
[  329.736680]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[  329.742054] Process swapper (pid: 0, ti=78a12000 task=7893c400 task.ti=78a12000)
[  329.749246] Stack: 00000020 7b520000 00002000 34d56000 7d515fe0 00000007 00000001 00000000 
[  329.757565]        7d6af800 01000000 7d515e00 7b522b94 7d84f58c 7b520000 784c89b5 7b520000 
[  329.765885]        784c843a 7d84f58c 7b520f20 7b522b94 7d84f58c 784e54a0 784f394b 00000000 
[  329.774204] Call Trace:
[  329.776804]  [<784c89b5>] scsi_init_io+0x55/0xe0
[  329.781398]  [<784c843a>] scsi_get_cmd_from_req+0x2a/0x40
[  329.786770]  [<784e54a0>] sd_prep_fn+0x80/0x940
[  329.791277]  [<784f394b>] ata_bmdma_start+0xb/0x20
[  329.796043]  [<784ef344>] ata_qc_issue_prot+0x164/0x1e0
[  329.801243]  [<78405c63>] elv_dispatch_sort+0x23/0xe0
[  329.806268]  [<784057d0>] elv_next_request+0xa0/0x130
[  329.811295]  [<787715b8>] _spin_lock_irq+0x38/0x50
[  329.816062]  [<784c9af4>] scsi_request_fn+0x1e4/0x370
[  329.821088]  [<784097f6>] blk_run_queue+0x36/0x80
[  329.825768]  [<784c8370>] scsi_next_command+0x30/0x50
[  329.830794]  [<784c84fb>] scsi_end_request+0xab/0xe0
[  329.835733]  [<784c9249>] scsi_io_completion+0xa9/0x3d0
[  329.840933]  [<78135de7>] ...
From: Jens Axboe
Date: Wednesday, October 17, 2007 - 9:32 am

Can you pull

git://git.kernel.dk/linux-2.6-block.git for-linus

and see if it still reproduces?

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 9:50 am

The code in question is:

     mov    %edx,0xc(%esp)
     mov    (%ebx),%edi
     mov    %edi,%edx
     sub    %eax,%edx
     mov    %edx,%eax
     sar    $0x5,%eax
     shl    $0xc,%eax
     add    0x8(%ebx),%eax
     cmp    %eax,0xc(%esp)
     je     +126
     mov    0x10(%esi),%eax	<----- Oops
     lea    0x10(%esi),%edx
     test   $0x1,%al
     jne    +76
     mov    %edi,(%esi)
     mov    %ebp,0xc(%esi)
     mov    0x8(%ebx),%eax
     mov    %eax,0x4(%esi)


and it looks like %esi is overflowing from one page to the next one, ie:

	BUG: unable to handle kernel paging request at virtual address 7ca76000
	ESI: 7ca75ff0

and you caught this thanks to page-alloc debugging again.

I think I can match that up with the source code: that's "sg_next()". It's 
doing:

        sg++;

        if (unlikely(sg_is_chain(sg)))
                sg = sg_chain_ptr(sg);

        return sg;

and the oopsing instruction is that load of "sg->page" in the assembly 
code:

	mov    0x10(%esi),%eax		# %eax = sg->page
	lea    0x10(%esi),%edx		# %edx = sg+1;
	test   $0x1,%al			# if (unlikely(sg_is_chain()))
	jne    +76

Jens?

		Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 9:59 am

Yep, that's what I came up with as well - I asked Ingo for a dump in
private, but ended up just using ksymoops to decode the line.

The way blk_rq_map_sg() operates is that it ends up doing a

        next_sg = sg_next(sg);

even though sg may be the last entry. Perhaps this is crapping out,
although if sg is a valid address, then sg + 1 should be as well.
next_sg may end up being crap, in fact it will, but we'll never use that
unless there are more entries to fill. And if there is, then both sg and
next_sg were valid.

So nothing in for-linus should fix it, I'll try and come up with an
alternate way to assign next_sg so it's always valid.

-- 
Jens Axboe

-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 10:08 am

OK, the below should actually be safe, I don't know why I talked myself
into the next_sg stuff in the beginning. It's always safe to zero sg,
since it's a valid entry - nothing to save in ->page. Ingo, does this
work for you?

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 9e3f3cc..3935469 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1322,8 +1322,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 		  struct scatterlist *sglist)
 {
 	struct bio_vec *bvec, *bvprv;
-	struct scatterlist *next_sg, *sg;
 	struct req_iterator iter;
+	struct scatterlist *sg;
 	int nsegs, cluster;
 
 	nsegs = 0;
@@ -1333,7 +1333,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 	 * for each bio in rq
 	 */
 	bvprv = NULL;
-	sg = next_sg = &sglist[0];
+	sg = NULL;
 	rq_for_each_segment(bvec, rq, iter) {
 		int nbytes = bvec->bv_len;
 
@@ -1349,8 +1349,10 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 			sg->length += nbytes;
 		} else {
 new_segment:
-			sg = next_sg;
-			next_sg = sg_next(sg);
+			if (!sg)
+				sg = sglist;
+			else
+				sg = sg_next(sg);
 
 			memset(sg, 0, sizeof(*sg));
 			sg->page = bvec->bv_page;

-- 
Jens Axboe

-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 10:21 am

Scratch that, it cannot work... I'll think up a different approach.

-- 
Jens Axboe

-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 10:29 am

OK, it is fine, as long as the sglist is cleared initially. And I don't
think there's anyway around that, clearly I didn't think long enough
before including the memset() removal from Tomo.

Ingo, please try this rolled up version.

Linus, this should work. It would probably be best if you first did a
git revert on f5c0dde4c66421a3a2d7d6fa604a712c9b0744e5 and then applied
the ll_rw_blk.c bit alone. Do you want me to stuff that (revert + patch)
into a branch for you to pull?

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 9e3f3cc..3935469 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1322,8 +1322,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 		  struct scatterlist *sglist)
 {
 	struct bio_vec *bvec, *bvprv;
-	struct scatterlist *next_sg, *sg;
 	struct req_iterator iter;
+	struct scatterlist *sg;
 	int nsegs, cluster;
 
 	nsegs = 0;
@@ -1333,7 +1333,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 	 * for each bio in rq
 	 */
 	bvprv = NULL;
-	sg = next_sg = &sglist[0];
+	sg = NULL;
 	rq_for_each_segment(bvec, rq, iter) {
 		int nbytes = bvec->bv_len;
 
@@ -1349,8 +1349,10 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 			sg->length += nbytes;
 		} else {
 new_segment:
-			sg = next_sg;
-			next_sg = sg_next(sg);
+			if (!sg)
+				sg = sglist;
+			else
+				sg = sg_next(sg);
 
 			memset(sg, 0, sizeof(*sg));
 			sg->page = bvec->bv_page;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0c86be7..aac8a02 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -764,6 +764,8 @@ struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		if (unlikely(!sgl))
 			goto enomem;
 
+		memset(sgl, 0, sizeof(*sgl) * sgp->size);
+
 		/*
 		 * first loop through, set initial index and return value
 		 */

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 10:34 am

yep, this one did the trick, it booted up fine twice in a row already!

 Tested-by: Ingo Molnar <mingo@elte.hu>

thanks!

	Ingo
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 10:36 am

Great! Thanks a lot for reporting and testing... Linus, care to pull

  git://git.kernel.dk/data/git/linux-2.6-block.git for-linus

Jens Axboe (2):
      Revert "[SCSI] Remove full sg table memset()"
      [BLOCK] blk_rq_map_sg() next_sg fixup

 block/ll_rw_blk.c       |   10 ++++++----
 drivers/scsi/scsi_lib.c |    2 ++
 2 files changed, 8 insertions(+), 4 deletions(-)

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 10:45 am

ok, here's a different but similar crash that triggers on the testbox:

[  233.438890] BUG: unable to handle kernel paging request at virtual address 7d93e000
[  233.446390] printing eip: 784e9480 *pde = 01000067 *pte = 0593e000 
[  233.452630] Oops: 0000 [#1] DEBUG_PAGEALLOC
[  233.456790] 
[  233.458264] Pid: 0, comm: swapper Not tainted (2.6.23 #5)
[  233.463637] EIP: 0060:[<784e9480>] EFLAGS: 00010087 CPU: 0
[  233.469101] EIP is at ata_qc_issue+0x90/0x380
[  233.473429] EAX: 7d93dff0 EBX: 0000001f ECX: 7d93dff0 EDX: 798daf80
[  233.479668] ESI: 00000020 EDI: 7d93de00 EBP: 7b54007c ESP: 78a13e14
[  233.485908]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[  233.491282] Process swapper (pid: 0, ti=78a12000 task=789753e0 task.ti=78a12000)
[  233.498473] Stack: 7d93de00 7b540000 7b540000 00000000 7d93dfe0 7b54007c 7d93db00 7b5417a4 
[  233.506793]        784c2490 784ef69e 784f21f3 7b52de98 7d93db00 7b540000 7b5417a4 7d93db00 
[  233.515112]        7b540000 7b524004 784f22e0 784ef380 784c2490 7d93db00 00000202 7b524004 
[  233.523432] Call Trace:
[  233.526033]  [<784c2490>] scsi_done+0x0/0x20
[  233.530279]  [<784ef69e>] ata_scsi_translate+0xbe/0x140
[  233.535478]  [<784f21f3>] ata_scsi_queuecmd+0x33/0x200
[  233.540591]  [<784f22e0>] ata_scsi_queuecmd+0x120/0x200
[  233.545791]  [<784ef380>] ata_scsi_rw_xlat+0x0/0x220
[  233.550730]  [<784c2490>] scsi_done+0x0/0x20
[  233.554976]  [<784c2d12>] scsi_dispatch_cmd+0x152/0x290
[  233.560177]  [<78135c67>] trace_hardirqs_on+0x67/0xb0
[  233.565202]  [<784c8c7e>] scsi_request_fn+0x1be/0x370
[  233.570229]  [<78408086>] blk_run_queue+0x36/0x80
[  233.574909]  [<784c7520>] scsi_next_command+0x30/0x50
[  233.579935]  [<784c76ab>] scsi_end_request+0xab/0xe0
[  233.584875]  [<784c83f9>] scsi_io_completion+0xa9/0x3d0
[  233.590075]  [<78135c67>] trace_hardirqs_on+0x67/0xb0
[  233.595100]  [<78405125>] blk_done_softirq+0x45/0x80
[  233.600040]  [<78405153>] blk_done_softirq+0x73/0x80
[  233.604981]  [<7811d4c3>] ...
From: Jens Axboe
Date: Wednesday, October 17, 2007 - 10:53 am

You must have a magic test box :-)

Will investigate... libata doesn't actually enable chaining, but since
i386 supports it, it ends up using the chain helpers anyway.

There seems to be some automatic inlining involved here, it must be
dying inside ata_sg_setup().

-- 
Jens Axboe

-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 10:55 am

Also, can you send a dmesg from that system so I can see which libata
drivers are involved?

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 10:58 am

same system for which i sent the earlier bootlog - but here's the second 
crashlog too. Crash is reproducible - crashed twice in a row already. 
(NOTE: the attached log is from this second crash.)

	Ingo
From: Jens Axboe
Date: Wednesday, October 17, 2007 - 11:37 am

OK, something to try out - libata doesn't enable sg chaining, since the
support isn't complete yet. Here's a dumb check just to verify that
scsi_alloc_sgtable() will NEVER return a chain entry for a host that
doesn't have it enabled. If this triggers for you, then that could
explain your problem.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index aac8a02..cc89c86 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -777,8 +777,12 @@ struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		 * sglist must be the biggest one, or we would not have
 		 * ended up doing another loop.
 		 */
-		if (prev)
+		if (prev) {
+			struct Scsi_Host *shost = cmd->device->host;
+
+			BUG_ON(!shost->use_sg_chaining);
 			sg_chain(prev, SCSI_MAX_SG_SEGMENTS, sgl);
+		}
 
 		/*
 		 * don't allow subsequent mempool allocs to sleep, it would

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 12:04 pm

btw., i get the following build warning:

drivers/scsi/scsi_lib.c: In function 'scsi_free_sgtable':
drivers/scsi/scsi_lib.c:708: warning: 'index' may be used uninitialized in this function
drivers/scsi/scsi_lib.c:708: note: 'index' was declared here
drivers/scsi/scsi_lib.c:708: warning: 'index' may be used uninitialized in this function
drivers/scsi/scsi_lib.c:708: note: 'index' was declared here
drivers/scsi/scsi_lib.c: In function 'scsi_alloc_sgtable':
drivers/scsi/scsi_lib.c:708: warning: 'index' may be used uninitialized in this function
drivers/scsi/scsi_lib.c:708: note: 'index' was declared here

not sure it matters.

	Ingo
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 12:08 pm

Hmm, I don't see them here (gcc 4.2.1). Must be the BUG(), does it go
away if you add a index = -1 in the default: case? Just to be absolutely
sure...

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 12:14 pm

NOTE: i get the same warning even without the BUG_ON() patch, and 
without your other fix patch as well. I'm using gcc 4.2.2. (Do you get 
the warning if you pick up the config i sent you earlier today?)

	Ingo
-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 12:17 pm

btw., i changed the initialization to -1 and the crash still occurs 
(as expected).

	Ingo
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 12:25 pm

Would think so, thanks for checking though.

-- 
Jens Axboe

-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 12:25 pm

Just making 100% sure that was the missing place, I try not to take

Let me check... Yep, I do!

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 12:09 pm

nope, this did not help. First bootup went fine, second bootup crashed 
again (see below), without hitting the BUG_ON().

	Ingo

[   39.351198] BUG: unable to handle kernel paging request at virtual address 7ca58000
[   39.358698] printing eip: 784e9480 *pde = 00dda027 *pte = 04a58000 
[   39.364939] Oops: 0000 [#1] DEBUG_PAGEALLOC
[   39.369099] 
[   39.370573] Pid: 0, comm: swapper Not tainted (2.6.23 #9)
[   39.375944] EIP: 0060:[<784e9480>] EFLAGS: 00010087 CPU: 0
[   39.381408] EIP is at ata_qc_issue+0x90/0x380
[   39.385737] EAX: 7ca57ff0 EBX: 0000001f ECX: 7ca57ff0 EDX: 79181c20
[   39.391977] ESI: 00000020 EDI: 7ca57e00 EBP: 7b54007c ESP: 78a13e10
[   39.398217]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[   39.403590] Process swapper (pid: 0, ti=78a12000 task=789753e0 task.ti=78a12000)
[   39.410781] Stack: 7ca57e00 7b540000 7b540000 00000000 7ca57fe0 7b54007c 7c991980 7b5417a4 
[   39.419101]        784c2490 784ef69e 784f21f3 7b52de98 7c991980 7b540000 7b5417a4 7c991980 
[   39.427421]        7b540000 7b524004 784f22e0 784ef380 784c2490 7c991980 00000206 7b524004 
[   39.435740] Call Trace:
[   39.438340]  [<784c2490>] scsi_done+0x0/0x20
[   39.442586]  [<784ef69e>] ata_scsi_translate+0xbe/0x140
[   39.447785]  [<784f21f3>] ata_scsi_queuecmd+0x33/0x200
[   39.452899]  [<784f22e0>] ata_scsi_queuecmd+0x120/0x200
[   39.458099]  [<784ef380>] ata_scsi_rw_xlat+0x0/0x220
[   39.463038]  [<784c2490>] scsi_done+0x0/0x20
[   39.467285]  [<784c2d12>] scsi_dispatch_cmd+0x152/0x290
[   39.472484]  [<78135c67>] trace_hardirqs_on+0x67/0xb0
[   39.477511]  [<784c8c7e>] scsi_request_fn+0x1be/0x370
[   39.482538]  [<78408086>] blk_run_queue+0x36/0x80
[   39.487217]  [<784c7520>] scsi_next_command+0x30/0x50
[   39.492243]  [<784c76ab>] scsi_end_request+0xab/0xe0
[   39.497183]  [<784c83f9>] scsi_io_completion+0xa9/0x3d0
[   39.502383]  [<78135c67>] trace_hardirqs_on+0x67/0xb0
[   39.507409]  [<78405125>] blk_done_softirq+0x45/0x80
[   39.512348]  [<78405153>] ...
From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 12:28 pm

I think you'll always hit it if you have a scatter-gather list that is 
exactly filled up.

Why? Because those things do "sg_next()" on the last entry, and as 
mentioned, that ends up actually accessing one past the end - even if the 
end result is not actually ever *used* (because we just effectively 
incremented it to past the last entry when the code was done with the SG 
list).

So I think the sg_next() interface is fundamentally mis-designed. It 
should do the scatter-gather link following on *starting* to use the SG 
entry, not after finishing with it.

Put another way: I suspect pretty much every single sg_next() out there is 
likely to hit this issue. The way that blk_rq_map_sg() fixed its problem 
was exactly to move the "sg_next()" to *before* the use of the SG (and 
even that one is somewhat bogus, in that it just blindly assumes that the 
first entry is not a link entry).

I suspect the "the next entry is a link" bit should be in the *previous* 
entry, and then sg_next() could look like

	if (next_is_link(sg))
		sg = sg_chain_ptr(sg+1);
	else
		sg++;
	return sg;

and that would work. 

The alternative is to always make sure to allocate one more SG entry than 
required, so that the last entry is always either the link, or an unused 
entry!

		Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 12:35 pm

OK, I think you have a very good point here. Ingo, can you verify it
goes away if apply the below? I have to tend to Other Life stuff but
will take this up again tomorrow morning and fix the sg_next() usage.

Linus, please still pull the branch I asked you to earlier. Thanks!


diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index aac8a02..58ede7e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -39,7 +39,7 @@
  * (unless chaining is used). Should ideally fit inside a single page, to
  * avoid a higher order allocation.
  */
-#define SCSI_MAX_SG_SEGMENTS	128
+#define SCSI_MAX_SG_SEGMENTS	129
 
 struct scsi_host_sg_pool {
 	size_t		size;


-- 
Jens Axboe

-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 12:49 pm

Well, hang on - where does it end up doing sg_next() on the LAST sg
entry? I'd argue that this is a bug, like it was in ll_rw_blk.c. I still
agree that I should make the interface more robust, I just don't see
where libata ends up doing the sg_next() on the last entry.

I'm assuming that Ingo is using the previous patches, so that
blk_rq_map_sg() is using this construct:

new_segment:
       if (!sg)
               sg = sglist;
       else
               sg = sg_next(sg);

and the memset() in scsi_alloc_sgtable(), which I'm pretty sure he is.
I'm assuming we're not hitting pio paths, so there are no manual
sg_next() calls. Ingo, if you care, can you test this one as well?

No rush, as mentioned I'll be back tomorrow morning...

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index bbaa545..0246b61 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4664,7 +4664,7 @@ static int ata_sg_setup(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
 	struct scatterlist *sg = qc->__sg;
-	struct scatterlist *lsg = sg_last(qc->__sg, qc->n_elem);
+	struct scatterlist *lsg = &qc->__sg[qc->n_elem - 1];
 	int n_elem, pre_n_elem, dir, trim_sg = 0;
 
 	VPRINTK("ENTER, ata%u\n", ap->print_id);

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 1:05 pm

hm, it still crashes - find the log below.

	Ingo

[   34.653682] EXT3-fs: mounted filesystem with ordered data mode.
[   34.659487] VFS: Mounted root (ext3 filesystem) readonly.
[   34.665208] Freeing unused kernel memory: 372k freed
[   35.809988] BUG: unable to handle kernel paging request at virtual address 7c8d3000
[   35.817483] printing eip: 784e9480 *pde = 00dda027 *pte = 048d3000 
[   35.823723] Oops: 0000 [#1] DEBUG_PAGEALLOC
[   35.827883] 
[   35.829357] Pid: 47, comm: rc.sysinit Not tainted (2.6.23 #14)
[   35.835162] EIP: 0060:[<784e9480>] EFLAGS: 00010006 CPU: 0
[   35.840626] EIP is at ata_qc_issue+0xd0/0x340
[   35.844954] EAX: 3fd79000 EBX: 7c8d3000 ECX: 00000020 EDX: 00000020
[   35.851194] ESI: 7c8a1a80 EDI: 7c8d2e00 EBP: 7b54007c ESP: 7c8b3d5c
[   35.857434]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[   35.862807] Process rc.sysinit (pid: 47, ti=7c8b2000 task=7c86d000 task.ti=7c8b2000)
[   35.870346] Stack: 7c8d3000 7b540000 7b540000 00000000 7c8d2ff0 7b54007c 7c8a1a80 7b5417a4 
[   35.878665]        784c2490 784ef61e 784f2173 7b52de98 7c8a1a80 7b540000 7b5417a4 7c8a1a80 
[   35.886985]        7b540000 7b524004 784f2260 784ef300 784c2490 7c8a1a80 00000216 7b524004 
[   35.895304] Call Trace:
[   35.897905]  [<784c2490>] scsi_done+0x0/0x20
[   35.902151]  [<784ef61e>] ata_scsi_translate+0xbe/0x140
[   35.907350]  [<784f2173>] ata_scsi_queuecmd+0x33/0x200
[   35.912463]  [<784f2260>] ata_scsi_queuecmd+0x120/0x200
[   35.917663]  [<784ef300>] ata_scsi_rw_xlat+0x0/0x220
[   35.922602]  [<784c2490>] scsi_done+0x0/0x20
[   35.926849]  [<784c2d12>] scsi_dispatch_cmd+0x152/0x290
[   35.932049]  [<78135c9c>] trace_hardirqs_on+0x9c/0xb0
[   35.937075]  [<784c8c3e>] scsi_request_fn+0x1be/0x370
[   35.942102]  [<78120f02>] del_timer+0x62/0x70
[   35.946434]  [<784072d5>] __generic_unplug_device+0x25/0x30
[   35.951981]  [<784075a5>] generic_unplug_device+0x15/0x30
[   35.957354]  [<78404e00>] blk_backing_dev_unplug+0x0/0x10
[   35.962726]  [<78404e0c>] ...
From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 1:10 pm

They pretty much *all* do, as far as I can tell.

For example, to look at the very first one:

	#define for_each_sg(sglist, sg, nr, __i)        \
		for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))

let's say that "nr" is 1 (and that's also the allocation size), and look 
at what that causes.

Right. It causes us to do "sg = sg_next(sg)" once. Which will do what? It 
will increment sg (so that it now points past the single-entry array!) and 
then it will dereference that invalid entry to see if it's a chain entry!


And no, "1" is not the special case. The special case is *any* time when 
you iterate over as many sg entries as you allocated. You *always* have to 
leave the last one unused in order to avoid this "access past the end" 
problem.

The alternative is to rewrite the loop, but it's going to be ugly as hell, 
and you need to do that for *every*single*user* of sg_next(). You can 
re-write the above loop as something like

	define for_each_sg(sglist, sg, nr, __i)
		for (__i = 0, sg = NULL ;
			__i < (nr) && sg = (sg ? sglist : sg_next(sg) ;
			__i++))

but the important part here is that you must do that "sg_next()" *after* 
you have broken out of the loop, and you must basically do it one less 
time than you go through the loop.

IOW, any loop that leaves "sg" pointing to past the array is inevitably 
buggy, because it will have accessed that last past-tne-end entry as part 

See above. I think the exact sequence may be:

    ata_qc_issue()
	(implicitly inlined) ata_sg_setup()
		(explicitly inlined) dma_map_sg()
			(macro) for_each_sg()

but I didn't see if there are other possible chains that get you to one of 
those invalid sg loops.

And no, it's *not* just for_each_sg(). Pretty much any "natural" loop over 
the SG list will cause it, because that's how you write loops in C: you 
almost always end up pointing to one past the last entry after the loop.

			Linus
-

From: Ingo Molnar
Date: Thursday, October 18, 2007 - 12:07 am

this one finally made the trick and it's booting fine now, without any 
crashes!

Tested-by: Ingo Molnar <mingo@elte.hu>

	Ingo
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 12:10 am

Super, that validates the theory the theory that Linus put forth. So the
bug is clear now, this morning I'll work on proper sg looping.

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Thursday, October 18, 2007 - 4:03 am

hm, spoke too soon - i just got the ata_qc_issue() crash below. I've 
attached further below the current fixes/workarounds that i have applied 
at the moment. Any ideas?

	Ingo

------------->
[  155.259466] kjournald starting.  Commit interval 5 seconds
[  155.265103] EXT3 FS on sda5, internal journal
[  155.269319] EXT3-fs: mounted filesystem with ordered data mode.
[  156.458225] BUG: unable to handle kernel paging request at virtual address 7d5ac000
[  156.465723] printing eip: 784e9300 *pde = 00ddd027 *pte = 055ac000 
[  156.471964] Oops: 0000 [#1] DEBUG_PAGEALLOC
[  156.476123] 
[  156.477597] Pid: 0, comm: swapper Not tainted (2.6.23 #40)
[  156.483055] EIP: 0060:[<784e9300>] EFLAGS: 00010006 CPU: 0
[  156.488520] EIP is at ata_qc_issue+0xd0/0x340
[  156.492848] EAX: 3d328000 EBX: 7d5ac000 ECX: 00000020 EDX: 00000020
[  156.499087] ESI: 7d5ab480 EDI: 7d5abe00 EBP: 7b54007c ESP: 78a13e1c
[  156.505328]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[  156.510700] Process swapper (pid: 0, ti=78a12000 task=789753e0 task.ti=78a12000)
[  156.517893] Stack: 7d5ac000 7b540000 7b540000 00000000 7d5abff0 7b54007c 7d5ab480 7b5417a4 
[  156.526213]        784c2330 784ef49e 784f1ff3 7b52de98 7d5ab480 7b540000 7b5417a4 7d5ab480 
[  156.534531]        7b540000 7b524004 784f20e0 784ef180 784c2330 7d5ab480 00000216 7b524004 
[  156.542851] Call Trace:
[  156.545452]  [<784c2330>] scsi_done+0x0/0x20
[  156.549698]  [<784ef49e>] ata_scsi_translate+0xbe/0x140
[  156.554897]  [<784f1ff3>] ata_scsi_queuecmd+0x33/0x200
[  156.560010]  [<784f20e0>] ata_scsi_queuecmd+0x120/0x200
[  156.565210]  [<784ef180>] ata_scsi_rw_xlat+0x0/0x220
[  156.570150]  [<784c2330>] scsi_done+0x0/0x20
[  156.574395]  [<784c2bb2>] scsi_dispatch_cmd+0x152/0x290
[  156.579596]  [<78135aa7>] trace_hardirqs_on+0x67/0xb0
[  156.584622]  [<784c8abe>] scsi_request_fn+0x1be/0x370
[  156.589649]  [<78407ef6>] blk_run_queue+0x36/0x80
[  156.594328]  [<784c73c0>] scsi_next_command+0x30/0x50
[  156.599354]  [<784c754b>] ...
From: Jens Axboe
Date: Thursday, October 18, 2007 - 4:05 am

Hang on Ingo, will post an updated patch soonish!

-- 
Jens Axboe

-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 1:22 am

Alas, this didn't help me here.  I did manage to capture the error 
messages this time, and in the two machines I'm actively testing on, 
sata_mv is the driver that's dying in both cases.  Machine A 
additionally has sata_nv, which is working.  Machine B additionally has 
ata_piix, which is working.  So in both cases, its sata_mv that is 

Still digging...  this behavior showed up after libata changes went in, 
FWIW.

	Jeff


-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 1:32 am

Theory - ata_sg_is_last() isn't returning true for the last entry. Can
you double check that it correcly marks the last entry in mv_fill_sg()?
Alternatively, just try this patch.


diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c
index 4df8311..b858183 100644
--- a/drivers/ata/sata_mv.c
+++ b/drivers/ata/sata_mv.c
@@ -1139,6 +1139,7 @@ static void mv_fill_sg(struct ata_queued_cmd *qc)
 	struct mv_port_priv *pp = qc->ap->private_data;
 	struct scatterlist *sg;
 	struct mv_sg *mv_sg;
+	int end_marked = 0;
 
 	mv_sg = pp->sg_tbl;
 	ata_for_each_sg(sg, qc) {
@@ -1159,13 +1160,15 @@ static void mv_fill_sg(struct ata_queued_cmd *qc)
 			sg_len -= len;
 			addr += len;
 
-			if (!sg_len && ata_sg_is_last(sg, qc))
+			if (!sg_len && ata_sg_is_last(sg, qc)) {
 				mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
+				end_marked++;
+			}
 
 			mv_sg++;
 		}
-
 	}
+	BUG_ON(end_marked != 1);
 }
 
 static inline void mv_crqb_pack_cmd(__le16 *cmdw, u8 data, u8 addr, unsigned last)

-- 
Jens Axboe

-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 1:38 am

Will check in a few minutes, after my current test:  I noticed that 
sg_tablesize in sata_mv is not LIBATA_MAX_PRD.  This is expected 
behavior, but I wonder if that difference -- most notably being smaller 
than SCSI_MAX_SG_SEGMENTS -- would trigger any latent bugs.

Anyway, we will both have answers momentarily...

	Jeff



-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 1:51 am

Another sata_mv difference from the rest: the chip does not support 
ATAPI, so we never care about qc->pad

	Jeff


-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 2:01 am

Your BUG_ON() does indeed trip, here.

Its surprising that other folks don't explode, considering that 
mv_fill_sg() intentionally mirrors the logic in ata_fill_sg().

	Jeff


-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 2:17 am

The sata_mv construct looks a bit odd. Does this work? That last
end_mv_sg test should always be true, just being paranoid...

diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c
index 4df8311..5397eea 100644
--- a/drivers/ata/sata_mv.c
+++ b/drivers/ata/sata_mv.c
@@ -1138,8 +1138,9 @@ static void mv_fill_sg(struct ata_queued_cmd *qc)
 {
 	struct mv_port_priv *pp = qc->ap->private_data;
 	struct scatterlist *sg;
-	struct mv_sg *mv_sg;
+	struct mv_sg *mv_sg, *end_mv_sg;
 
+	end_mv_sg = NULL;
 	mv_sg = pp->sg_tbl;
 	ata_for_each_sg(sg, qc) {
 		dma_addr_t addr = sg_dma_address(sg);
@@ -1158,14 +1159,12 @@ static void mv_fill_sg(struct ata_queued_cmd *qc)
 
 			sg_len -= len;
 			addr += len;
-
-			if (!sg_len && ata_sg_is_last(sg, qc))
-				mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
-
+			end_mv_sg = mv_sg;
 			mv_sg++;
 		}
-
 	}
+	if (end_mv_sg)
+		end_mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
 }
 
 static inline void mv_crqb_pack_cmd(__le16 *cmdw, u8 data, u8 addr, unsigned last)

-- 
Jens Axboe

-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 2:32 am

I'm testing a similar patch based on ata_fill_sg()'s method, which 
basically does something similar to what you've done here (see 
attached).  I had noticed that ata_fill_sg() did not call ata_sg_is_last().

If this fixes the problem, I think the best solution would be to delete 
ata_sg_is_last().  In the few users that exist, we should be able to 
eliminate the test programmatically as you and ata_fill_sg() have done 
-- thereby eliminating a branch per loop in a hotpath.

Off to test the attached...  if that doesn't work I'll try your version, 
though there shouldn't be much difference.

	Jeff


From: Jens Axboe
Date: Thursday, October 18, 2007 - 2:41 am

Yes I know, but I'm trying to works towards getting rid of sg_last() and

That should work as well. WRT ata_sg_is_last(), if we go ahead with my
recent sg chaining updates, we can keep the test as it would be a single
conditional and not require any looping.

Let me know when you have tested this!

-- 
Jens Axboe

-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 3:04 am

The patch I attached to the last email got both sata_mv test boxes 
working reliably (so far).

I worked up a patch that kills ata_sg_is_last() (plus the 
max_phys_segments sata_mv fix), see attached.  I'm thinking this is what 
I like to see in upstream.

Of course, this doesn't explain why ata_sg_is_last() was broken, but 
since it's working _and_ slightly more efficient, I don't really care :)

	Jeff


From: Jens Axboe
Date: Thursday, October 18, 2007 - 3:10 am

Tomo and I agreed to kill sg_last() a few days ago anyways, so this is
perfectly fine with me.

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Thursday, October 18, 2007 - 3:13 am

it would still be nice to figure out exactly why it was broken.

	Ingo
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 3:16 am

It would always be nice. For this case I don't think it's very
interesting, if we pursue the improved sg iteration setup.

-- 
Jens Axboe

-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 3:17 am

BTW, I think it's pretty clear that ata_sg_is_last() is broken. It's
likely a one-off in the n_iter test.

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Thursday, October 18, 2007 - 3:49 am

fixing that would be a -stable candidate, as a potential data corruptor 
- or is it more benign?

	Ingo
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 3:56 am

I think it's safe to say that it was sg chaining introduced breakage, so
it should work fine in 2.6.23.x.

-- 
Jens Axboe

-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 3:50 am

It is confirmed working prior to the sg-chaining stuff, so 2.6.23.1 is OK...

	Jeff



-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 3:42 am

Yep, the [attached] patch that kills ata_sg_is_last() is working here on 
both machines that were previously croaking.

It would be nice to get pdc_adma, sata_sil24 and ipr it-works test done, 
but IMO the patch is pretty straightforward and should be OK.

	Jeff


From: Ingo Molnar
Date: Thursday, October 18, 2007 - 3:54 am

just a quick question: i have Jens's workarounds applied right now (see 
patch below). Am i now crash/corruption-safe, or do i need your patch 
too? And once your patch [and the other sg_*() patches] are upstream i 
dont need the workaround anymore, correct?

	Ingo

---
 block/ll_rw_blk.c         |    2 +-
 drivers/ata/libata-core.c |    2 +-
 drivers/scsi/scsi_lib.c   |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

Index: linux/block/ll_rw_blk.c
===================================================================
--- linux.orig/block/ll_rw_blk.c
+++ linux/block/ll_rw_blk.c
@@ -631,7 +631,7 @@ void blk_queue_max_phys_segments(struct 
 		printk("%s: set to minimum %d\n", __FUNCTION__, max_segments);
 	}
 
-	q->max_phys_segments = max_segments;
+	q->max_phys_segments = max_segments - 1;
 }
 
 EXPORT_SYMBOL(blk_queue_max_phys_segments);
Index: linux/drivers/ata/libata-core.c
===================================================================
--- linux.orig/drivers/ata/libata-core.c
+++ linux/drivers/ata/libata-core.c
@@ -4664,7 +4664,7 @@ static int ata_sg_setup(struct ata_queue
 {
 	struct ata_port *ap = qc->ap;
 	struct scatterlist *sg = qc->__sg;
-	struct scatterlist *lsg = sg_last(qc->__sg, qc->n_elem);
+	struct scatterlist *lsg = &qc->__sg[qc->n_elem - 1];
 	int n_elem, pre_n_elem, dir, trim_sg = 0;
 
 	VPRINTK("ENTER, ata%u\n", ap->print_id);
Index: linux/drivers/scsi/scsi_lib.c
===================================================================
--- linux.orig/drivers/scsi/scsi_lib.c
+++ linux/drivers/scsi/scsi_lib.c
@@ -39,7 +39,7 @@
  * (unless chaining is used). Should ideally fit inside a single page, to
  * avoid a higher order allocation.
  */
-#define SCSI_MAX_SG_SEGMENTS	128
+#define SCSI_MAX_SG_SEGMENTS	129
 
 struct scsi_host_sg_pool {
 	size_t		size;
-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 4:02 am

You need my patch if and only if you use one of the drivers touched by 
the patch.  ata_sg_is_last() was a driver helper function, so my fix 
never really touched core code.

I never had to apply the changes you included, to fix problems here.


I wonder if libata should be doing

	blk_queue_max_phys_segments(q, q->max_phys_segments - 1)

to account for the pad entry that libata owns.

	Jeff



-

From: Ingo Molnar
Date: Thursday, October 18, 2007 - 4:40 am

perhaps because you are not running a CONFIG_DEBUG_PAGEALLOC=y kernel?

I recently fixed DEBUG_PAGEALLOC (it would crash upon bootup on x86 most 
of the time on any real hardware - so i doubt people were able to use it 
all that much). As long as you try Linus' latest -git tree (which has 
the latest arch/x86 merge) and use the 32-bit x86 kernel it should work 
fine for you too, and you will probably be able to trigger similar 
crashes too.

	Ingo
-

From: Olof Johansson
Date: Thursday, October 18, 2007 - 7:52 am

Tested-by: Olof Johansson <olof@lixom.net>

Looks ok on SATA_SIL24 and SATA_MV here on PPC (together with Jens' latest
patch). Both barfed before.


Thanks!

-Olof
-

From: Torsten Kaiser
Date: Saturday, October 20, 2007 - 4:55 am

[Just catching with reading lkml to this post]


I "hate" to point this out, but I already reported that sata_sil24
fails on 1. Sep.:
http://lkml.org/lkml/2007/9/1/95

In the thread "sata_sil24 broken since 2.6.23-rc4-mm1" I spent over a
week to trace this back to ata_sg_is_last (finally found on 7. Oct):
http://lkml.org/lkml/2007/10/7/43

Thanks for finally patching this now...

Torsten
-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 12:45 pm

This one will only fix the ones that use the SCSI-lib routines, so it will 

Already did.

		Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 12:56 pm

Thanks

-- 
Jens Axboe

-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 1:06 pm

So until that is fixed up, how about this? Should cover all block
devices, and I don't think sg_next()/for_each_sg() has spread outside of
that yet.

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 3935469..8708ad0 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -631,7 +631,7 @@ void blk_queue_max_phys_segments(struct request_queue *q,
 		printk("%s: set to minimum %d\n", __FUNCTION__, max_segments);
 	}
 
-	q->max_phys_segments = max_segments;
+	q->max_phys_segments = max_segments - 1;
 }
 
 EXPORT_SYMBOL(blk_queue_max_phys_segments);

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 1:51 pm

this one crashes too, if added ontop of all the other fixes so far.

	Ingo

[   34.711185] EXT3-fs: mounted filesystem with ordered data mode.
[   34.716991] VFS: Mounted root (ext3 filesystem) readonly.
[   34.722712] Freeing unused kernel memory: 372k freed
[   36.112742] BUG: unable to handle kernel paging request at virtual address 7c8b1000
[   36.120246] printing eip: 784e9490 *pde = 00dda027 *pte = 048b1000 
[   36.126486] Oops: 0000 [#1] DEBUG_PAGEALLOC
[   36.130645] 
[   36.132119] Pid: 0, comm: swapper Not tainted (2.6.23 #15)
[   36.137579] EIP: 0060:[<784e9490>] EFLAGS: 00010006 CPU: 0
[   36.143043] EIP is at ata_qc_issue+0xd0/0x340
[   36.147371] EAX: 3f896000 EBX: 7c8b1000 ECX: 00000008 EDX: 00000008
[   36.153610] ESI: 7c86de80 EDI: 7c8b0f80 EBP: 7b54007c ESP: 78a13e28
[   36.159851]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[   36.165223] Process swapper (pid: 0, ti=78a12000 task=789753e0 task.ti=78a12000)
[   36.172416] Stack: 7c8b1000 7b540000 7b540000 00000000 7c8b0ff0 7b54007c 7c86de80 7b5417a4 
[   36.180735]        784c24a0 784ef62e 784f2183 7b52de98 7c86de80 7b540000 7b5417a4 7c86de80 
[   36.189055]        7b540000 7b524004 784f2270 784ef310 784c24a0 7c86de80 00000206 7b524004 
[   36.197374] Call Trace:
[   36.199975]  [<784c24a0>] scsi_done+0x0/0x20
[   36.204221]  [<784ef62e>] ata_scsi_translate+0xbe/0x140
[   36.209420]  [<784f2183>] ata_scsi_queuecmd+0x33/0x200
[   36.214533]  [<784f2270>] ata_scsi_queuecmd+0x120/0x200
[   36.219733]  [<784ef310>] ata_scsi_rw_xlat+0x0/0x220
[   36.224672]  [<784c24a0>] scsi_done+0x0/0x20
[   36.228918]  [<784c2d22>] scsi_dispatch_cmd+0x152/0x290
[   36.234118]  [<78135c67>] trace_hardirqs_on+0x67/0xb0
[   36.239145]  [<784c8c4e>] scsi_request_fn+0x1be/0x370
[   36.244171]  [<78408096>] blk_run_queue+0x36/0x80
[   36.248850]  [<784c7530>] scsi_next_command+0x30/0x50
[   36.253877]  [<784c76bb>] scsi_end_request+0xab/0xe0
[   36.258817]  [<784c83c9>] scsi_io_completion+0xa9/0x3d0
[   36.264016]  ...
From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 1:24 pm

Heh. Not good. There are drivers that set max phys segmsnts to 1. Either 
through some host-specific setting (MMC), or explicitly (eg a grep shows 
viodasd uses just a constant 1 there).

		Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 1:31 pm

That would hurt... Care to commit your for_each_sg() uglification fixup
for now then? Or disable the allocation debug config entry, so that the
sg+1 deref wont crash? Whichever your prefer, just don't want to cause
crash pains for the unfortunate souls that happen to run into this
before the real fix is done.

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 2:11 pm

Well, in practice, it will only crash with DEBUG_PAGEALLOC, so few enough 
are going to be hit by it. In that sense I don't think we're in any deep 
trouble yet.

That said, maybe this is an acceptable, if hacky, replacement for the 
current "for_each_sg()" loop.

It does:
 - starts at one *before* the sglist
 - does the sg_next() at the *top* of the loop rather than the bottom of it
 - has a "--count" before that sg_next, so that we don't do it for the 
   case when we break out and have used up all segments.

Totally untested, but it *may* work, and it doesn't look horribly ugly.

Ingo, does this actually make any difference?

		Linus

---
 include/linux/scatterlist.h |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 2dc7464..f5c8e11 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -51,11 +51,18 @@ static inline struct scatterlist *sg_next(struct scatterlist *sg)
 	return sg;
 }
 
+static inline struct scatterlist *sg_safe_next(struct scatterlist *sg, int left)
+{
+	if (left < 0)
+		return NULL;
+	return sg_next(sg);
+}
+
 /*
  * Loop over each sg element, following the pointer to a new list if necessary
  */
 #define for_each_sg(sglist, sg, nr, __i)	\
-	for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))
+	for (__i = (nr), sg = (sglist)-1; (sg = sg_safe_next(sg, --__i)) != NULL ; )
 
 /**
  * sg_last - return the last scatterlist entry in a list
-

From: FUJITA Tomonori
Date: Wednesday, October 17, 2007 - 4:00 pm

On Wed, 17 Oct 2007 14:11:34 -0700 (PDT)

Looks that (sglist) - 1 isn't initialized and we use sg_next for it?
-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 6:07 pm

sg_next() - as it stands now - never actually looks at the SG that its 
argument points to: it explicitly *only* looks at the next one.

That's the bug. If sg_next() looked at the actual *current* sg entry, we 
wouldn't have any issues to begin with, and that's what I'm arguing we 
should do in the longer run (where "longer run" is defined as "when Jens 
does it asap").

So the hacky solution as it stands right now is to actually use the 
behaviour of "sg_next()" to our advantage in for_each_sg(), and starting 
off by setting sg to (sglist)-1. That way we can do the "sg_next()" (which 
will *not* look at the uninitialized space before the array) before 
entering the loop, regardless of whether it's the first time through the 
loop or not.

So no, it's not technically "strictly conforming ANSI C", because we use a 
pointer (not do not dereference it!) outside of its allocation, which is 
strictly speaking against the standard. However, the kernel does those 
kinds of things all the time, since the kernel does its own memory 
allocation, so that's not actually relevant.

The *standard* may say that you cannot decrement a pointer to before the 
beginning of the object and then increment it back up again and be 
strictly conforming, but the fact is, we very much depend on contiguous 
and flat kernel pointer model, and always have (and probably always will)

		Linus
-

From: David Miller
Date: Wednesday, October 17, 2007 - 6:19 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

What the thing really wants is some kind of indication of state,
without having to bloat up the scatterlist structure.

I believe that we have enough of a limited set of accessors to
sg->page that we can more aggressively encode things in the lower
bits.

I'm thinking of encoding the low two bits of sg->page as
follows:

1) bits == 0

   then the SG list is linear and sg_next() is sg++

2) bits == 1

   the nest SG is an indirect chunk, sg_next() is
   therefore something like:

	next = *((struct scatterlist **)(sg + 1));

3) bits == 2

   this is the last entry in the scatterlist, sg_next() is NULL

So for the cases of ARCH_HAS_SG_CHAIN not being set (ie. back
compatible), we can do no bit encoding in page->flags and just do
sg_next() == sg++, as is done now.

When doing SG chaining, in each non-linear chunk we have to allocate
one more pointer past the end of the scatterlist array (instead of a
full extra scatterlist entry for the indirect pointer encode).  Next,
all sg->page accesses have to be guarded to clear the state bits
out first.

I don't know, maybe it would work, and would make the loop termination
issues easier to handle properly.
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 1:21 am

I like it. Basically the only real change is using bit 2 as a
termination point, so we avoid going beyond the end of the sgtable.
Here's a starting point, it actually booted for me in the first go
(boggle). Only x86 so far, archs will need to be converted. And lots
more drivers I'm sure, I only fixed up the ones that botched my compile.

So just consider this a directional patch.

diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index 5cdfab6..daaf636 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -302,7 +302,7 @@ static int dma_map_sg_nonforce(struct device *dev, struct scatterlist *sg,
 #endif
 
 	for_each_sg(sg, s, nents, i) {
-		unsigned long addr = page_to_phys(s->page) + s->offset; 
+		unsigned long addr = page_to_phys(sg_page(s)) + s->offset; 
 		if (nonforced_iommu(dev, addr, s->length)) { 
 			addr = dma_map_area(dev, addr, s->length, dir);
 			if (addr == bad_dma_address) { 
@@ -397,7 +397,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	start_sg = sgmap = sg;
 	ps = NULL; /* shut up gcc */
 	for_each_sg(sg, s, nents, i) {
-		dma_addr_t addr = page_to_phys(s->page) + s->offset;
+		dma_addr_t addr = page_to_phys(sg_page(s)) + s->offset;
 		s->dma_address = addr;
 		BUG_ON(s->length == 0); 
 
diff --git a/arch/x86/kernel/pci-nommu_64.c b/arch/x86/kernel/pci-nommu_64.c
index e85d436..d64a4a5 100644
--- a/arch/x86/kernel/pci-nommu_64.c
+++ b/arch/x86/kernel/pci-nommu_64.c
@@ -62,8 +62,8 @@ static int nommu_map_sg(struct device *hwdev, struct scatterlist *sg,
 	int i;
 
 	for_each_sg(sg, s, nents, i) {
-		BUG_ON(!s->page);
-		s->dma_address = virt_to_bus(page_address(s->page) +s->offset);
+		BUG_ON(!sg_page(s));
+		s->dma_address = virt_to_bus(page_address(sg_page(s)) +s->offset);
 		if (!check_addr("map_sg", hwdev, s->dma_address, s->length))
 			return 0;
 		s->dma_length = s->length;
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 3935469..d02783c ...
From: David Miller
Date: Thursday, October 18, 2007 - 4:55 am

From: Jens Axboe <jens.axboe@oracle.com>

Here are some sparc64 bits:

diff --git a/arch/sparc64/kernel/iommu.c b/arch/sparc64/kernel/iommu.c
index 29af777..73852a2 100644
--- a/arch/sparc64/kernel/iommu.c
+++ b/arch/sparc64/kernel/iommu.c
@@ -473,7 +473,7 @@ static void dma_4u_unmap_single(struct device *dev, dma_addr_t bus_addr,
 }
 
 #define SG_ENT_PHYS_ADDRESS(SG)	\
-	(__pa(page_address((SG)->page)) + (SG)->offset)
+	(__pa(page_address(sg_page(SG))) + (SG)->offset)
 
 static void fill_sg(iopte_t *iopte, struct scatterlist *sg,
 		    int nused, int nelems,
@@ -566,7 +566,7 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 	if (nelems == 1) {
 		sglist->dma_address =
 			dma_4u_map_single(dev,
-					  (page_address(sglist->page) +
+					  (page_address(sg_page(sglist)) +
 					   sglist->offset),
 					  sglist->length, direction);
 		if (unlikely(sglist->dma_address == DMA_ERROR_CODE))
diff --git a/arch/sparc64/kernel/iommu_common.c b/arch/sparc64/kernel/iommu_common.c
index d7ca900..ec863e0 100644
--- a/arch/sparc64/kernel/iommu_common.c
+++ b/arch/sparc64/kernel/iommu_common.c
@@ -73,7 +73,7 @@ static int verify_one_map(struct scatterlist *dma_sg, struct scatterlist **__sg,
 
 	daddr = dma_sg->dma_address;
 	sglen = sg->length;
-	sgaddr = (unsigned long) (page_address(sg->page) + sg->offset);
+	sgaddr = (unsigned long) (page_address(sg_page(sg)) + sg->offset);
 	while (dlen > 0) {
 		unsigned long paddr;
 
@@ -123,7 +123,7 @@ static int verify_one_map(struct scatterlist *dma_sg, struct scatterlist **__sg,
 		sg = sg_next(sg);
 		if (--nents <= 0)
 			break;
-		sgaddr = (unsigned long) (page_address(sg->page) + sg->offset);
+		sgaddr = (unsigned long) (page_address(sg_page(sg)) + sg->offset);
 		sglen = sg->length;
 	}
 	if (dlen < 0) {
@@ -191,7 +191,7 @@ void verify_sglist(struct scatterlist *sglist, int nents, iopte_t *iopte, int np
 			printk("sg(%d): page_addr(%p) off(%x) length(%x) "
 			       "dma_address[%016x] ...
From: Jens Axboe
Date: Thursday, October 18, 2007 - 4:57 am

Thanks a lot, Dave! The patch is a monster right now, I'll work on
splitting it into a 3-step process. Any arch help is greatly
appreciated.

-- 
Jens Axboe

-

From: David Miller
Date: Thursday, October 18, 2007 - 5:05 am

From: Jens Axboe <jens.axboe@oracle.com>

I have some other bits that my compile hit, such as some things in the
crypto layer.

But I hesitate to send them to you because I think the on-stack cases
need some helpers such that DEBUG_SG works for them.

BTW, you missed a case in drivers/usb/core/message.c because of
the config used in your build.  This thing below is a good
argument for trying to avoid HIGHMEM et al. ifdefs in drivers :-)

--- drivers/usb/core/message.c~	2007-10-18 01:46:44.000000000 -0700
+++ drivers/usb/core/message.c	2007-10-18 03:15:20.000000000 -0700
@@ -438,7 +438,7 @@
 			io->urbs[i]->transfer_buffer = NULL;
 #else
 			io->urbs[i]->transfer_buffer =
-				page_address(sg[i].page) + sg[i].offset;
+				page_address(sg_page(&sg[i])) + sg[i].offset;
 #endif
 		} else {
 			/* hc may use _only_ transfer_buffer */
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 5:09 am

Yeah, I have tons of that so far. I hope to have an allyesconfig

Indeed. I convert where appropriate, but I'm not anal about it. If they
don't use sg_next() and/or for_each_sg() on their list, it should be ok.

Thanks :-)

-- 
Jens Axboe

-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 5:15 am

OK here goes, this compiles with allyesconfig on x86-64. Not too bad,
the scsi/ drivers were by far the worst.

 arch/sparc64/kernel/iommu.c                 |    4 -
 arch/sparc64/kernel/iommu_common.c          |   12 ++--
 arch/sparc64/kernel/pci_sun4v.c             |    4 -
 arch/x86/kernel/pci-calgary_64.c            |   12 ++--
 arch/x86/kernel/pci-gart_64.c               |    4 -
 arch/x86/kernel/pci-nommu_64.c              |    4 -
 block/ll_rw_blk.c                           |    5 +
 crypto/digest.c                             |    2 
 crypto/hmac.c                               |    3 -
 crypto/scatterwalk.c                        |    2 
 crypto/scatterwalk.h                        |    6 +-
 crypto/tcrypt.c                             |    4 -
 crypto/xcbc.c                               |    2 
 drivers/ata/libata-core.c                   |   10 +--
 drivers/ata/libata-scsi.c                   |    2 
 drivers/block/cciss.c                       |    4 -
 drivers/block/cpqarray.c                    |    3 -
 drivers/block/cryptoloop.c                  |   12 ++--
 drivers/block/ub.c                          |    8 +-
 drivers/ide/ide-probe.c                     |    4 +
 drivers/ide/ide-taskfile.c                  |    2 
 drivers/ieee1394/dma.c                      |    2 
 drivers/ieee1394/sbp2.c                     |    2 
 drivers/infiniband/core/umem.c              |   11 ++-
 drivers/infiniband/hw/ipath/ipath_dma.c     |    4 -
 drivers/infiniband/hw/ipath/ipath_mr.c      |    2 
 drivers/infiniband/hw/mthca/mthca_memfree.c |   24 +++++---
 drivers/infiniband/ulp/iser/iser_memory.c   |    6 +-
 drivers/md/dm-crypt.c                       |   21 +++----
 drivers/media/common/saa7146_core.c         |    3 -
 drivers/media/video/ivtv/ivtv-udma.c        |    4 -
 drivers/media/video/videobuf-dma-sg.c       |    8 +-
 drivers/mmc/card/queue.c                    |   15 ++---
 drivers/mmc/host/mmc_spi.c                  |    8 +-
 drivers/mmc/host/sdhci.c      ...
From: David Miller
Date: Thursday, October 18, 2007 - 5:36 am

From: Jens Axboe <jens.axboe@oracle.com>

It build cleanly here on sparc64 too.

It's late and I'm about to hit bed so I'm too chicken to do a test
boot it right now :-)

-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 5:39 am

Don't boot it Dave, odds are that something will break and you'll then
be stuck debugging that since you can't relax and sleep until it's
working :-)

-- 
Jens Axboe

-

From: Benny Halevy
Date: Thursday, October 18, 2007 - 5:58 am

On Oct. 18, 2007, 14:15 +0200, Jens Axboe <jens.axboe@oracle.com> wrote:

Hmm, sg_next is not supposed to return a pointer to the chain entry
itself, but rather skip it.  I think that the fix needs only
check the "last" flag before incrementing sg.

+#ifdef CONFIG_DEBUG_SG
+	BUG_ON(sg->sg_magic != SG_MAGIC);
+#endif
+	if (sg_is_last(sg))
+		return NULL;
+
 	sg++
 
 	if (unlikely(sg_is_chain(sg)))


-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 6:56 am

That would make sense, definitely. I'll add that.

-- 
Jens Axboe

-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 7:05 am

BUG_ON(!sg_is_last(ret));

it should be, not sg. That's what I merged.

-- 
Jens Axboe

-

From: Benny Halevy
Date: Thursday, October 18, 2007 - 7:16 am

right. of course.
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 7:38 am

OK, that found something interesting - mapping a request may shrink it,
so we need to update the end marker to move it earlier in the list.
Basically just a

        if (sg)
                __sg_mark_end(sg);

at the bottom of blk_rq_map_sg(). Updated patch below, booted on other
machines now as well without incident.


diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
index e1c4707..094a95e 100644
--- a/arch/alpha/kernel/pci_iommu.c
+++ b/arch/alpha/kernel/pci_iommu.c
@@ -465,7 +465,7 @@ EXPORT_SYMBOL(pci_free_consistent);
    Write dma_length of each leader with the combined lengths of
    the mergable followers.  */
 
-#define SG_ENT_VIRT_ADDRESS(SG) (page_address((SG)->page) + (SG)->offset)
+#define SG_ENT_VIRT_ADDRESS(SG) (page_address(sg_page((SG)) + (SG)->offset))
 #define SG_ENT_PHYS_ADDRESS(SG) __pa(SG_ENT_VIRT_ADDRESS(SG))
 
 static void
diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index 44ab0da..9d371e4 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -442,7 +442,7 @@ dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	BUG_ON(dir == DMA_NONE);
 
 	for (i = 0; i < nents; i++, sg++) {
-		struct page *page = sg->page;
+		struct page *page = sg_page(sg);
 		unsigned int offset = sg->offset;
 		unsigned int length = sg->length;
 		void *ptr = page_address(page) + offset;
diff --git a/arch/blackfin/kernel/dma-mapping.c b/arch/blackfin/kernel/dma-mapping.c
index 94d7b11..dbf7222 100644
--- a/arch/blackfin/kernel/dma-mapping.c
+++ b/arch/blackfin/kernel/dma-mapping.c
@@ -160,7 +160,7 @@ dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	BUG_ON(direction == DMA_NONE);
 
 	for (i = 0; i < nents; i++, sg++) {
-		sg->dma_address = (dma_addr_t)(page_address(sg->page) +
+		sg->dma_address = (dma_addr_t)(page_address(sg_page(sg)) +
 					sg->offset);
 
 		invalidate_dcache_range(sg_dma_address(sg),
diff --git a/arch/ia64/hp/common/sba_iommu.c ...
From: Olof Johansson
Date: Thursday, October 18, 2007 - 7:58 am

PPC needs this. I'm guessing other arches might too. Otherwise seems to boot
and work well (with Jeff's sata patch).


Thanks,

-Olof

diff --git a/include/asm-powerpc/scatterlist.h b/include/asm-powerpc/scatterlist.h
index b9f1dbc..fcf7d55 100644
--- a/include/asm-powerpc/scatterlist.h
+++ b/include/asm-powerpc/scatterlist.h
@@ -14,6 +14,9 @@
 #include <asm/dma.h>
 
 struct scatterlist {
+#ifdef CONFIG_DEBUG_SG
+	unsigned long sg_magic;
+#endif
 	unsigned long page_link;
 	unsigned int offset;
 	unsigned int length;
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 8:25 am

Oh duh, indeed I didn't add that to the archs when converting. Thanks
for the patch, I'll update the rest as well.

And thanks a lot for boot testing it!

-- 
Jens Axboe

-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 5:58 am

Alright, small modifications and tested somewhat. It boots on my laptop
just fine (CONFIG_DEBUG_SG is set) without warnings. Also tested with
PAGEALLOC set, that also boots and works fine without issue.

So if this is what we want, the question is how to proceed. Right now
i386 and x86-64 works, sparc64 PROBABLY works but is not tested. Any
other arch will NOT compile, since it's not converted to use page_link.
I'll try and do ppc64 now.


diff --git a/arch/sparc64/kernel/iommu.c b/arch/sparc64/kernel/iommu.c
index 29af777..73852a2 100644
--- a/arch/sparc64/kernel/iommu.c
+++ b/arch/sparc64/kernel/iommu.c
@@ -473,7 +473,7 @@ static void dma_4u_unmap_single(struct device *dev, dma_addr_t bus_addr,
 }
 
 #define SG_ENT_PHYS_ADDRESS(SG)	\
-	(__pa(page_address((SG)->page)) + (SG)->offset)
+	(__pa(page_address(sg_page(SG))) + (SG)->offset)
 
 static void fill_sg(iopte_t *iopte, struct scatterlist *sg,
 		    int nused, int nelems,
@@ -566,7 +566,7 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
 	if (nelems == 1) {
 		sglist->dma_address =
 			dma_4u_map_single(dev,
-					  (page_address(sglist->page) +
+					  (page_address(sg_page(sglist)) +
 					   sglist->offset),
 					  sglist->length, direction);
 		if (unlikely(sglist->dma_address == DMA_ERROR_CODE))
diff --git a/arch/sparc64/kernel/iommu_common.c b/arch/sparc64/kernel/iommu_common.c
index d7ca900..ec863e0 100644
--- a/arch/sparc64/kernel/iommu_common.c
+++ b/arch/sparc64/kernel/iommu_common.c
@@ -73,7 +73,7 @@ static int verify_one_map(struct scatterlist *dma_sg, struct scatterlist **__sg,
 
 	daddr = dma_sg->dma_address;
 	sglen = sg->length;
-	sgaddr = (unsigned long) (page_address(sg->page) + sg->offset);
+	sgaddr = (unsigned long) (page_address(sg_page(sg)) + sg->offset);
 	while (dlen > 0) {
 		unsigned long paddr;
 
@@ -123,7 +123,7 @@ static int verify_one_map(struct scatterlist *dma_sg, struct scatterlist **__sg,
 		sg = sg_next(sg);
 		if (--nents <= 0)
 ...
From: Jens Axboe
Date: Thursday, October 18, 2007 - 6:32 am

OK, I think that covers every arch out there. I haven't been able to
compile any of them, but it's mostly search'n replace operations. I hope
nothing is missing linux/scatterlist.h includes...

 arch/alpha/kernel/pci_iommu.c               |    2 
 arch/arm/common/dmabounce.c                 |    2 
 arch/blackfin/kernel/dma-mapping.c          |    2 
 arch/ia64/hp/common/sba_iommu.c             |    2 
 arch/ia64/hp/sim/simscsi.c                  |    4 
 arch/ia64/sn/pci/pci_dma.c                  |    2 
 arch/m68k/kernel/dma.c                      |    2 
 arch/mips/mm/dma-default.c                  |    8 -
 arch/powerpc/kernel/dma_64.c                |    2 
 arch/powerpc/kernel/ibmebus.c               |    2 
 arch/powerpc/kernel/iommu.c                 |    2 
 arch/powerpc/platforms/ps3/system-bus.c     |    2 
 arch/sparc/kernel/ioport.c                  |   16 +-
 arch/sparc/mm/io-unit.c                     |    2 
 arch/sparc/mm/iommu.c                       |    8 -
 arch/sparc/mm/sun4c.c                       |    2 
 arch/sparc64/kernel/iommu.c                 |    4 
 arch/sparc64/kernel/iommu_common.c          |   12 -
 arch/sparc64/kernel/ldc.c                   |    2 
 arch/sparc64/kernel/pci_sun4v.c             |    4 
 arch/um/drivers/ubd_kern.c                  |    2 
 arch/x86/kernel/pci-calgary_64.c            |   12 +
 arch/x86/kernel/pci-gart_64.c               |    4 
 arch/x86/kernel/pci-nommu_64.c              |    4 
 block/ll_rw_blk.c                           |    5 
 crypto/digest.c                             |    2 
 crypto/hmac.c                               |    3 
 crypto/scatterwalk.c                        |    2 
 crypto/scatterwalk.h                        |    6 
 crypto/tcrypt.c                             |    4 
 crypto/xcbc.c                               |    2 
 drivers/ata/libata-core.c                   |   10 -
 drivers/ata/libata-scsi.c                   |    2 
 drivers/block/cciss.c                       |    4 
 ...
From: Benny Halevy
Date: Thursday, October 18, 2007 - 6:49 am

Jens, again, please correct me if I'm wrong, but when sg points at the
entry right before a chain entry this implementation of sg_next will

here's how I think sg_next should be implemented:

  */
 static inline struct scatterlist *sg_next(struct scatterlist *sg)
 {
+#ifdef CONFIG_DEBUG_SG
+	BUG_ON(sg->sg_magic != SG_MAGIC);
+#endif
+	if (sg_is_last(sg))
+		return NULL;
+
 	sg++;
 
 	if (unlikely(sg_is_chain(sg)))
 		sg = sg_chain_ptr(sg);
 
 	return sg;
 }
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 6:55 am

Yep, thanks for catching that!

-- 
Jens Axboe

-

From: Mark Lord
Date: Thursday, October 18, 2007 - 6:51 am

Patch fails on drivers/scsi/scsi_lib.c.

I replaced that part of the patch with this updated portion instead:



--- a/drivers/scsi/scsi_lib.c	2007-10-18 09:35:28.000000000 -0400
+++ b/drivers/scsi/scsi_lib.c	2007-10-18 09:46:47.000000000 -0400
@@ -295,7 +295,7 @@
 	int i, err, nr_vecs = 0;
 
 	for_each_sg(sgl, sg, nsegs, i) {
-		page = sg->page;
+		page = sg_page(sg);
 		off = sg->offset;
 		len = sg->length;
  		data_len += len;
@@ -764,6 +764,8 @@
 		if (unlikely(!sgl))
 			goto enomem;
 
+		sg_init_table(sgl, sgp->size);
+
 		/*
 		 * first loop through, set initial index and return value
 		 */
@@ -779,6 +781,13 @@
 			sg_chain(prev, SCSI_MAX_SG_SEGMENTS, sgl);
 
 		/*
+		 * if we have nothing left, mark the last segment as
+		 * end-of-list
+		 */
+		if (!left)
+			sg_mark_end(sgl, this);
+
+		/*
 		 * don't allow subsequent mempool allocs to sleep, it would
 		 * violate the mempool principle.
 		 */
@@ -2351,7 +2360,7 @@
 	*offset = *offset - len_complete + sg->offset;
 
 	/* Assumption: contiguous pages can be accessed as "page + i" */
-	page = nth_page(sg->page, (*offset >> PAGE_SHIFT));
+	page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
 	*offset &= ~PAGE_MASK;
 
 	/* Bytes in this sg-entry from *offset to the end of the page */
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 6:58 am

Hmm, what are you applying against? Must be a clean tree, throw away any
patches that you already applied in this thread.

Updated below.

diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
index e1c4707..094a95e 100644
--- a/arch/alpha/kernel/pci_iommu.c
+++ b/arch/alpha/kernel/pci_iommu.c
@@ -465,7 +465,7 @@ EXPORT_SYMBOL(pci_free_consistent);
    Write dma_length of each leader with the combined lengths of
    the mergable followers.  */
 
-#define SG_ENT_VIRT_ADDRESS(SG) (page_address((SG)->page) + (SG)->offset)
+#define SG_ENT_VIRT_ADDRESS(SG) (page_address(sg_page((SG)) + (SG)->offset))
 #define SG_ENT_PHYS_ADDRESS(SG) __pa(SG_ENT_VIRT_ADDRESS(SG))
 
 static void
diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index 44ab0da..9d371e4 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -442,7 +442,7 @@ dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	BUG_ON(dir == DMA_NONE);
 
 	for (i = 0; i < nents; i++, sg++) {
-		struct page *page = sg->page;
+		struct page *page = sg_page(sg);
 		unsigned int offset = sg->offset;
 		unsigned int length = sg->length;
 		void *ptr = page_address(page) + offset;
diff --git a/arch/blackfin/kernel/dma-mapping.c b/arch/blackfin/kernel/dma-mapping.c
index 94d7b11..dbf7222 100644
--- a/arch/blackfin/kernel/dma-mapping.c
+++ b/arch/blackfin/kernel/dma-mapping.c
@@ -160,7 +160,7 @@ dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 	BUG_ON(direction == DMA_NONE);
 
 	for (i = 0; i < nents; i++, sg++) {
-		sg->dma_address = (dma_addr_t)(page_address(sg->page) +
+		sg->dma_address = (dma_addr_t)(page_address(sg_page(sg)) +
 					sg->offset);
 
 		invalidate_dcache_range(sg_dma_address(sg),
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 3c95f41..0b74c63 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -246,7 +246,7 @@ static int reserve_sba_gart = ...
From: Mark Lord
Date: Thursday, October 18, 2007 - 7:03 am

Squeaky-clean linux-2.6.23 + patch-2.6.23-git13 + your patch.
Fails on scsi_lib.c.

-ml
-

From: Mark Lord
Date: Thursday, October 18, 2007 - 7:10 am

I'll re-pull everything fresh again from kernel.org and rebuild
with the "updated below" patch you posted.  Thanks.
-

From: Mark Lord
Date: Thursday, October 18, 2007 - 7:13 am

Okay, fresh pull of everything from kernel.org,
and now your latest patch does apply cleanly to -git13.

Something weird (at this end).

Thanks.
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 7:14 am

Thanks for confirming, I did double check that my HEAD was uptodate -
and it is.

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Thursday, October 18, 2007 - 9:55 am

Umm. May I suggest (I haven't read the whole thread yet, maybe somebody 
else already did) that

	static inline unsigned long sg_phys(struct scatterlist *sg)
	{
		return 	page_to_phys(sg_page(sg)) + sg->offset;
	}

would be a good thing to have?

Very few drivers should care so much about the *page* itself (or the 
offset). That's something that the generic allocation code etc cares 
about, but the driver is almost bound to care mostly about the actual DMA 
address, so adding that helper function that abstracts the sg access would 
be helpful in hiding some of the cruft?

		Linus
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 10:01 am

I'll add it to the mix.

-- 
Jens Axboe

-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 10:10 am

Actually, only 11 occurrences in the patch. But still a nice little
cleanup.

-- 
Jens Axboe

-

From: Arjan van de Ven
Date: Thursday, October 18, 2007 - 10:10 am

On Thu, 18 Oct 2007 09:55:04 -0700 (PDT)

.... but will that work for systems with IOMMU ? or is it fundamentally
the wrong interface
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 10:14 am

They use foo_to_bus() on the address. sg_phys() should of course only be
used where the user previously did page_to_phys() on the sg page.

-- 
Jens Axboe

-

From: FUJITA Tomonori
Date: Friday, October 19, 2007 - 1:59 am

On Thu, 18 Oct 2007 19:14:29 +0200

I can take care of IOMMU stuff when I'll send IOMMU merging fix
patchset:

http://marc.info/?l=linux-scsi&m=119079718126157&w=2
-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 12:20 pm

FWIW libata cares about both.  When doing DMA, it cares about the DMA 
address.  When doing PIO, it cares about the actual page, because it 
does a kmap before sending the data through a 16-bit/32-bit data FIFO.

	Jeff


-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 6:36 pm

Yes, that sounds sane.

Although I also wonder whether we want one global per-arch 
ARCH_HAS_SG_CHAIN, or perhaps have something more dynamic, which allows a 
per-SG-list choice (which in turn would require some kind of "head" entry 
that is passed into sg_next(), and in general descibes the SG list)

		Linus
-

From: David Miller
Date: Wednesday, October 17, 2007 - 6:49 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

It's there because the DMA mapping support code for a platform has to
be converted to handle these chains and audited to make sure the
conversion is right.  Some platforms, such as sparc64, took a lot of
work to get right :-)

Once that macro is set, the block device driver has to support it too,
and there are knobs all the way down to the scsi host driver to
indicate this.

The idea is that all of this mess gets deleted in the end, it was
meant to be a safe transition scheme.

-

From: Mark Lord
Date: Wednesday, October 17, 2007 - 8:44 pm

It would be good to have something soon-ish.
This "dead at boot time" issue is impacting the general ability to test
-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 9:01 pm

In the meantime, does the patch I sent out help people? I'd like to get 
feedback, but I'm a lazy bum, and don't use DEBUG_PAGEALLOC myself, so I 
was hoping that people who actually see this could comment on my untested 
suggestion.

		Linus
-

From: Mark Lord
Date: Wednesday, October 17, 2007 - 9:05 pm

Oh.. so this bug is supposed to only bite with DEBUG_PAGEALLOC=y ??

Then something else is broken, perhaps.  I just saw a long traceback
scroll off the top of the screen, with lots of bio_* functions in the list
and assumed it was the same bug.

Mmm.. I'll get out the camera and try it again now..
-

From: Jeff Garzik
Date: Wednesday, October 17, 2007 - 9:14 pm

I'm currently bisecting on two machines (sata_mv and sata_nv).  sata_mv 
suddenly started spewing errors in recent kernels, but kernels from ~48 
hours ago are just fine.  AMD64 sata_nv machine will simply lock up if I 
push it too hard.  Again, reproducible, and kernels from ~48 hours ago 
are fine.

AHCI machine so far seems fine (kernel ~24 hours ago), storage-wise, but 
the hda_intel audio decided to stop working :)

I'm quite happy to test patches, but I never run with DEBUG_PAGEALLOC so 
that shouldn't be causing the problems here.

	Jeff



-

From: Mark Lord
Date: Wednesday, October 17, 2007 - 9:18 pm

Okay, mine is dying with EIP at blk_rq_map_sg+0xcb/0x160.

Screen photo is at http://rtr.ca/recent/2.6.23-git12-crash.jpg,
but the top was cut off (isn't there a new config option or patch
to do double-columns or scrollback or something ???.

Cheers
-

From: Jeff Garzik
Date: Wednesday, October 17, 2007 - 9:31 pm

Is this a sata_mv box?  If so, could you try this patch?

	Jeff



From: Jens Axboe
Date: Thursday, October 18, 2007 - 12:05 am

If anything, that shrinks the size of the resulting request. Did this
patch make any difference to you?

Now sata_mv should not be doing this (already discussed), but as long as
it's only lowering the physical sg count then it should not cause any
bugs at least.

-- 
Jens Axboe

-

From: Mark Lord
Date: Thursday, October 18, 2007 - 6:13 am

Not a sata_mv box, so no point here.

ata_piix.
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 6:23 am

Can you try the big patch I just posted?

-- 
Jens Axboe

-

From: Mark Lord
Date: Thursday, October 18, 2007 - 6:32 am

I'll hunt for it and try it, but your earlier little patch already works fine.

Cheers

-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 6:34 am

I'll send it privately.

-- 
Jens Axboe

-

From: Mark Lord
Date: Thursday, October 18, 2007 - 6:59 am

I found the latest rev, and it failed to apply cleanly on -git12 or -git13
due to scsi_lib.c. After fixing that portion (replacement chunk below),
I'm now running with -git12, with the sg list debug option enabled (no messages).

Looks okay so far

 
--- a/drivers/scsi/scsi_lib.c	2007-10-18 09:35:28.000000000 -0400
+++ b/drivers/scsi/scsi_lib.c	2007-10-18 09:46:47.000000000 -0400
@@ -295,7 +295,7 @@
 	int i, err, nr_vecs = 0;
 
 	for_each_sg(sgl, sg, nsegs, i) {
-		page = sg->page;
+		page = sg_page(sg);
 		off = sg->offset;
 		len = sg->length;
  		data_len += len;
@@ -764,6 +764,8 @@
 		if (unlikely(!sgl))
 			goto enomem;
 
+		sg_init_table(sgl, sgp->size);
+
 		/*
 		 * first loop through, set initial index and return value
 		 */
@@ -779,6 +781,13 @@
 			sg_chain(prev, SCSI_MAX_SG_SEGMENTS, sgl);
 
 		/*
+		 * if we have nothing left, mark the last segment as
+		 * end-of-list
+		 */
+		if (!left)
+			sg_mark_end(sgl, this);
+
+		/*
 		 * don't allow subsequent mempool allocs to sleep, it would
 		 * violate the mempool principle.
 		 */
@@ -2351,7 +2360,7 @@
 	*offset = *offset - len_complete + sg->offset;
 
 	/* Assumption: contiguous pages can be accessed as "page + i" */
-	page = nth_page(sg->page, (*offset >> PAGE_SHIFT));
+	page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
 	*offset &= ~PAGE_MASK;
 
 	/* Bytes in this sg-entry from *offset to the end of the page */
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 7:04 am

OK, thanks a lot for testing!

-- 
Jens Axboe

-

From: Mark Lord
Date: Wednesday, October 17, 2007 - 9:41 pm

No.. it's a pretty ordinary ata_piix machine, with no serial ports either.

Cheers
-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 9:53 pm

That could explain it: if the SG allocation is simply too small, the 
scatter-gather code will run off the end of the SG list, and encounter 
random uninitialized entries, and if any of those have the low bit set, 
they will be considered to be "link" entries, and the SG list goes wild.

That SG code is really pretty fragile. I don't see it *ever* checking that 
the SG allocation is large enough when it fills it in. And these things 
take the sizes from different places (ie "cmd->use_sg" bs 
"req->nr_phys_segments" vs God knows what else..)

		Linus
-

From: Mark Lord
Date: Wednesday, October 17, 2007 - 9:54 pm

I tried the fancy "boot_delay=nnn" feature, but that doesn't slow down
the tracebacks at all.  So I hardcoded some mdelay(1000) lines into
the traceback code, and here's the top part of the oops now:

   http://rtr.ca/recent/2.6.23-git12-crash2.jpg

And here's the .config for the kernel, just in case it's of any use here:

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.23-git12
# Thu Oct 18 00:28:52 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=16
# CONFIG_CPUSETS is not set
# CONFIG_FAIR_GROUP_SCHED is not set
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_EMBEDDED=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
# CONFIG_SLUB_DEBUG is ...
From: Mark Lord
Date: Wednesday, October 17, 2007 - 10:09 pm

...

And, yes, I make that out as being this line from blk_rq_map_sg():

    next_sg = sg_next(sg);

"objdump -d" output from my actual kernel:

00003ce0 <blk_rq_map_sg>:
    3ce0:       55                      push   %ebp
    3ce1:       57                      push   %edi
    3ce2:       56                      push   %esi
    3ce3:       53                      push   %ebx
    3ce4:       83 ec 24                sub    $0x24,%esp
    3ce7:       89 44 24 04             mov    %eax,0x4(%esp)
    3ceb:       8b 98 44 01 00 00       mov    0x144(%eax),%ebx
    3cf1:       83 e3 01                and    $0x1,%ebx
    3cf4:       89 5c 24 14             mov    %ebx,0x14(%esp)
    3cf8:       8b 52 34                mov    0x34(%edx),%edx
    3cfb:       c7 44 24 10 00 00 00    movl   $0x0,0x10(%esp)
    3d02:       00
    3d03:       85 d2                   test   %edx,%edx
    3d05:       89 54 24 1c             mov    %edx,0x1c(%esp)
    3d09:       0f 84 f4 00 00 00       je     3e03 <blk_rq_map_sg+0x123>
    3d0f:       89 cb                   mov    %ecx,%ebx
    3d11:       31 ff                   xor    %edi,%edi
    3d13:       89 4c 24 0c             mov    %ecx,0xc(%esp)
    3d17:       8b 44 24 1c             mov    0x1c(%esp),%eax
    3d1b:       0f b7 48 16             movzwl 0x16(%eax),%ecx
    3d1f:       8b 50 2c                mov    0x2c(%eax),%edx
    3d22:       89 4c 24 18             mov    %ecx,0x18(%esp)
    3d26:       0f b7 40 14             movzwl 0x14(%eax),%eax
    3d2a:       39 c1                   cmp    %eax,%ecx
    3d2c:       0f 8d be 00 00 00       jge    3df0 <blk_rq_map_sg+0x110>
    3d32:       8d 04 49                lea    (%ecx,%ecx,2),%eax
    3d35:       8d 34 82                lea    (%edx,%eax,4),%esi
    3d38:       0f b6 44 24 14          movzbl 0x14(%esp),%eax
    3d3d:       88 44 24 23             mov    %al,0x23(%esp)
    3d41:       eb 05                   jmp    3d48 <blk_rq_map_sg+0x68>
    3d43:       89 f7           ...
From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 9:45 pm

Ok, I think your picture cut off the last hex digits on the right, but 
what I can make out of the disassembly, I have to admit that it looks 
very much like it might be exactly the same thing Ingo was seeing.

The disassembly is a bit scrambled, but the particular instruction that 
oopses for you is

	mov    0x10(%ebx),%eax

and the instructions around it do seem to bear some similarity to the 
whole "sg_next()" thing, ie I see a 

	test   $0x1,%al

and a

	lea    0x10(%ebx),%eax

around there too.

If you cut down the number of entries printed for the call trace (which 
involves editing the source code, no nice config options, I'm afraid), you 
could get the rest of it..

		Linus
-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 9:20 pm

Yeah, this particular one really should only bite you with 
DEBUG_PAGEALLOC.

The SG code potentially _derefences_ a field past the end of the SG array, 
but it should be a read, and the result should never be used (it's used to 
calculate the pointer to past the end of the queue, so if it's used, 

Please do. I wouldn't be surprised if it's also related to the SG changes, 
but I don't think it's the exact particular bug that Ingo saw.

		Linus
-

From: Mark Lord
Date: Wednesday, October 17, 2007 - 10:25 pm

Your patch from this posting http://lkml.org/lkml/2007/10/17/285
does not seem to make much difference here.

It still crashes at exactly the same place.
-

From: Mark Lord
Date: Wednesday, October 17, 2007 - 10:34 pm

However, Jens's patch from that same thread:

     http://lkml.org/lkml/2007/10/17/269

..allowed me to boot and post this followup message from -git12

Jeff: try that one.

Patch reproduced here for convenience:


diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 9e3f3cc..3935469 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1322,8 +1322,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 		  struct scatterlist *sglist)
 {
 	struct bio_vec *bvec, *bvprv;
-	struct scatterlist *next_sg, *sg;
 	struct req_iterator iter;
+	struct scatterlist *sg;
 	int nsegs, cluster;
 
 	nsegs = 0;
@@ -1333,7 +1333,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 	 * for each bio in rq
 	 */
 	bvprv = NULL;
-	sg = next_sg = &sglist[0];
+	sg = NULL;
 	rq_for_each_segment(bvec, rq, iter) {
 		int nbytes = bvec->bv_len;
 
@@ -1349,8 +1349,10 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 			sg->length += nbytes;
 		} else {
 new_segment:
-			sg = next_sg;
-			next_sg = sg_next(sg);
+			if (!sg)
+				sg = sglist;
+			else
+				sg = sg_next(sg);
 
 			memset(sg, 0, sizeof(*sg));
 			sg->page = bvec->bv_page;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0c86be7..aac8a02 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -764,6 +764,8 @@ struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		if (unlikely(!sgl))
 			goto enomem;
 
+		memset(sgl, 0, sizeof(*sgl) * sgp->size);
+
 		/*
 		 * first loop through, set initial index and return value
 		 */

-

From: Jeff Garzik
Date: Wednesday, October 17, 2007 - 10:45 pm

That's already in my upstream kernel, here.  commits
ba951841ceb7fa5b06ad48caa5270cc2ae17941e and 
a3bec5c5aea0da263111c4d8f8eabc1f8560d7bf.

sata_mv and sata_nv still reliably poop themselves here, whereas its 
rock solid with 2.6.23.1.  Sounds like different issues from yours, as I 
see a stream of SATA errors on the bad kernels, errors which are often a 
symptom of something whacked in the DMA engine (misprogramming causes 
the silicon to generate bogus FIS's, which the device then chokes on)

	Jeff


-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 12:09 am

Do you know if this poop involves the segment padding that sometimes
goes on in libata?

-- 
Jens Axboe

-

From: Jeff Garzik
Date: Thursday, October 18, 2007 - 12:30 am

Definitely not, in this case -- it's all ATA, nothing ATAPI.

It throws SError { Handshk } which then triggers the EH to reset the 
link, and so it goes, over and over :)  The same thing happens when I 
intentionally screw up the PRD tables.  Not much more data points than 
that, so far.

I'll try the SCSI_MAX_SG_SEGMENTS patch too, to see if that fixes things.

	Jeff



-

From: Jeff Garzik
Date: Wednesday, October 17, 2007 - 6:14 pm

On a related hardware note:

FWIW, most ATA controllers have scatter/gather tables that terminate 
themselves by a bit in the final s/g entry.  The 90% case needs to know 
the last scatterlist entry, at the end of the s/g walk.

So however this all gets worked out, please make sure not to unduly 
punish libata controllers with an expensive sg_last() inside a loop, or 
something like that.  :)

	Jeff




-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 12:42 pm

In fact, I think you'll hit it even if you use the "for_each_sg()" 
helper function. Exactly because it does the sg = sg_next(sg) at the end 
of the for-loop, so it will do it for the last entry too, even though that 
will access one past the end.

So it really *is* the case that every *single* use of that SG chain needs 
to be switched over to only do the sg_next() when required, or sg_next() 
needs to be fixed to look at the current-in-use entry (which we *can* 
access) when it decides what to do about the next one (which we can *not* 
access).

Moving the sg_is_chain() bit to the previous entry would work, but it 
requires that nobody who could have a chained scatterlist must *ever* 
access "sg->page" directly - you'd always need to use some helper function 
that masks off the bit, eg

 - rename "sg->page" into "sh->page_and_flag", and make it "unsigned long" 
   instead of a pointer.

 - change every single "sg->page" access to use "sg_page(sg)" instead:

	#define sg_pointer(sg)	((void *)((sg)->page_and_flag & ~1ul))

	static inline struct page *sg_page(struct scatterist *sg)
	{
		return sg_pointer(sg);
	}

 - change "sg_next()" to do

	static inline struct scatterlist *sg_next(struct scatterlist *sg)
	{
		if (sg->page_and_flag & 1)
			sg = sg_pointer(sg+1)-1;
		return ++sg;
	}

where the magic is exactly the fact that now "sg_next()" will *not* 
derefence the next SG entry unless the current one was marked 
appropriately.

And then *creating* the chain obviously needs to properly mark the 
next-to-last entry with the "next entry is a pointer" flag.

Big diff, it sounds like. But I don't see many alternatives. Jens?

		Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 12:55 pm

Auch yes, ok nevermind that last email. There really is no way around
allocating an extra 'pad' entry right now, so the SCSI_MAX_SG_SEGMENTS

Big diff is not a problem for me, my primary concern would be making
sure that the interface is solid. The above also has the nice side
effect of making all sg elements usable, so we don't waste any. IIRC
this was even debated months ago when I first proposed sg chaining, I
shied away from it because I thought it would seem huge and too
invasive. Ah well. I'll get digging on this.

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 11:08 am

Just to confirm - is this with the added memset() in 
drivers/scsi/scsi_lib.c?

Or did you get this oops before that patch?

		Linus
-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 11:13 am

yes, Jens's last patch was applied - apparently this ATA crash was 
masked by the other one. I got one successful bootup after applying 
Jens's patch and it has crashed on bootup twice since then.

	Ingo
-

From: Luca Tettamanti
Date: Wednesday, October 17, 2007 - 1:15 pm

Hi Jens,
I just hit what I believe is the same bug, though with a slightly
different OOPS (I see a NULL pointer dereference, see below). The
patches fixes the bug, and - as Linus suggested - the piece touching
scsi_lib.c is enough.

The OOPS:

scsi4 : ata_piix
scsi5 : ata_piix
ata5: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xffa0 irq 14
ata6: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xffa8 irq 15
 sda1 sda2 sda3 < sda5 sda6 >
sd 0:0:0:0: [sda] Attached SCSI disk
ata5.00: ATAPI: MATSHITADVD-RAM UJ-850S, 1.22, max UDMA/33
Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP: 
 [<ffffffff802d275e>] blk_rq_map_sg+0xf6/0x16a
PGD 4711067 PUD 508c067 PMD 0 
Oops: 0000 [1] PREEMPT SMP 
CPU 0 
Modules linked in: sd_mod ohci1394 ehci_hcd ata_piix ieee1394 ahci uhci_hcd libata scsi_mod usbcore thermal processor fan
Pid: 0, comm: swapper Not tainted 2.6.23-ge6d5a11d #3
RIP: 0010:[<ffffffff802d275e>]  [<ffffffff802d275e>] blk_rq_map_sg+0xf6/0x16a
RSP: 0018:ffffffff804eccd8  EFLAGS: 00010087
RAX: 0000000005331000 RBX: ffff810004740220 RCX: ffff810001000000
RDX: 0000000000000000 RSI: ffff8100052d4f50 RDI: 0000000005142c00
RBP: 0000000000001400 R08: 0000000000000000 R09: ffff8100052a3980
R10: 0000000005143000 R11: 0000000000000400 R12: ffff8100047a4000
R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff8049a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000020 CR3: 0000000005118000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff804ac000, task ffffffff8046c340)
Stack:  ffffffff88041834 000000010314ba00 ffff81000477fa80 ffff81000503dd80
 ffff81000503dd80 ffff81000517fc00 000000000000001e 0000000000001d4c
 ffffffff8804193c 0000000015925849 ffff81000473f800 ffff81000503dd80
Call Trace:
 <IRQ>  ...
From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 10:56 am

Ok, I think that one-liner fixes the real bug.

But I think the rest of your changes are simply bad.

The fix to block/ll_rw_block.c should likely be something like the 
appended instead:

 - remove the "memset()" you had added earlier. It's bogus. It cannot be 
   the right thing. If the sg list wasn't initialized correctly much 
   earlier, trying to initialize it late is pointless - it contains crap.

 - the old code was fine, but let's initialize "sg" to NULL to make it 
   clear that the initial value of sg is pointless, and only "next_sg" 
   matters (since sg had better be assigned from that).

Hmm?

		Linus

---
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 9e3f3cc..54d974e 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1333,7 +1333,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
 	 * for each bio in rq
 	 */
 	bvprv = NULL;
-	sg = next_sg = &sglist[0];
+	sg = NULL;
+	next_sg = &sglist[0];
 	rq_for_each_segment(bvec, rq, iter) {
 		int nbytes = bvec->bv_len;
 
@@ -1352,7 +1353,6 @@ new_segment:
 			sg = next_sg;
 			next_sg = sg_next(sg);
 
-			memset(sg, 0, sizeof(*sg));
 			sg->page = bvec->bv_page;
 			sg->length = nbytes;
 			sg->offset = bvec->bv_offset;
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 11:02 am

It's required to clear output members (like dma_len and so on), since

If you prefer the old next_sg approach to my alternative, that is fine
with me. But we do need the memset().

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 11:13 am

No it's NOT!

The whole point here is that the sg had *already* better be cleared.

If it wasn't cleared before, that's a bug regardless of anything else. So 
a memset() is guaranteed to be either a no-op or hiding another bug!

So yes, those members had better be zero. It's just that they had better 
be zero long before that memset! The memset should have been done when 

Really? Explain why. Entering that code with a non-initialized SG list is 
a bug to begin with.

		Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 11:20 am

For the chain elements - yes, definitely! But we also want to clear dma
mapping output values, at least sparc64 wants that. You could argue that
the IOMMU code should be fixed up, but I don't think we should mix the
two.

So we need the memset() in blk_rq_map_sg() AS WELL AS the initial sg

Not for the output members, they will be filled AFTER blk_rq_map_sg()
returns and someone use iommu_map_sg() to re-iterate the sglist and map
the pages appropriately.

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 11:58 am

I still don't see your argument.

The SG table is *already*zeroed*.

The fact that output values will be changed later is irrelevant. They 
haven't been changed *before* - or at least you haven't given a reasonable 
explanation for why they should have.

So explain to me why the memset() in blk_rq_map_sg() helps, considering 
that the memory must have been zeroed at allocation time anyway? Tell me 
what it is that fills any of the fields we don't already initialize, and 
that can happen *before* that memset?

		Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 12:03 pm

Ah ok, I see why you are confused. The SCSI case is one, it allocs and
frees the sg table each time. The entries are thus always initialized
when they end up in blk_rq_map_sg(). However, other drivers allocate one
at driver init time and use that one all the time. It needs to be
initially cleared, but it wont be cleared before each request mapping.
And that is fine, as long as we do clear entries in blk_rq_map_sg().

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 12:15 pm

Ahh, ok, thanks for clearing that up.

			Linus
-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 11:02 am

built and booted your patch (removed Jens's) but it still crashes in 
blk_rq_map_sg() - see below. Should i have left something from Jens's 
patch?

	Ingo

------------------->
[   34.698199] VFS: Mounted root (ext3 filesystem) readonly.
[   34.703917] Freeing unused kernel memory: 372k freed
[   34.746609] BUG: unable to handle kernel paging request at virtual address 6e616872
[   34.754106] printing eip: 7840503b *pde = 00000000 
[   34.758960] Oops: 0000 [#1] DEBUG_PAGEALLOC
[   34.763120] 
[   34.764594] Pid: 1, comm: swapper Not tainted (2.6.23 #7)
[   34.769965] EIP: 0060:[<7840503b>] EFLAGS: 00010006 CPU: 0
[   34.775429] EIP is at blk_rq_map_sg+0xbb/0x170
[   34.779845] EAX: 3ffcd000 EBX: 7c88a68c ECX: 3ffce000 EDX: 007ff9a0
[   34.786085] ESI: 6e616862 EDI: 798ea9a0 EBP: 00001000 ESP: 7b421bf4
[   34.792324]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[   34.797697] Process swapper (pid: 1, ti=7b420000 task=7b416000 task.ti=7b420000)
[   34.804890] Stack: 00000046 7b520000 00002000 3ffcf000 7c88ac80 00000001 00000001 00000001 
[   34.813209]        7c88a600 01000000 7c88ab00 7b521d1c 7c88a708 7b520000 784c7ac5 7b520000 
[   34.821529]        784c75ca 7c88a708 7b524ce4 7b521d1c 7c88a708 784e4570 7b416000 7813661b 
[   34.829847] Call Trace:
[   34.832448]  [<784c7ac5>] scsi_init_io+0x55/0xe0
[   34.837042]  [<784c75ca>] scsi_get_cmd_from_req+0x2a/0x40
[   34.842414]  [<784e4570>] sd_prep_fn+0x80/0x940
[   34.846920]  [<7813661b>] __lock_acquire+0x4ab/0xe20
[   34.851860]  [<781450a6>] add_to_page_cache+0x66/0xb0
[   34.856887]  [<78404633>] elv_dispatch_sort+0x23/0xe0
[   34.861912]  [<784041a0>] elv_next_request+0xa0/0x130
[   34.866938]  [<784c8c04>] scsi_request_fn+0x1e4/0x370
[   34.871965]  [<78120f02>] del_timer+0x62/0x70
[   34.876298]  [<78407445>] __generic_unplug_device+0x25/0x30
[   34.881844]  [<78407715>] generic_unplug_device+0x15/0x30
[   34.887217]  [<78404e00>] blk_backing_dev_unplug+0x0/0x10
[   34.892590]  [<78404e0c>] ...
From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 11:14 am

The *real* fix in Jens' patch (and the only thing that really matters) is 
the one to drivers/scsi/scsi_lib.c.

The rest is all fluff.

		Linus
-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 10:30 am

too late, crashed my box with it already :-)

	Ingo
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 10:31 am

Sorry about that, please try the next one that includes the scsi_lib.c
one liner to clear the sg table on alloc :-)

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 10:28 am

with that patch it not crashes on NULL dereference - see crashlog below. 
Compiler bug perhaps?

	Ingo

---------------->
[   34.605614] EXT3-fs: INFO: recovery required on readonly filesystem.
[   34.611842] EXT3-fs: write access will be enabled during recovery.
[   34.635861] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000008
[   34.644227] printing eip: 7840840e *pde = 00000000 
[   34.649081] Oops: 0002 [#1] DEBUG_PAGEALLOC
[   34.653239] 
[   34.654713] Pid: 1, comm: swapper Not tainted (2.6.23 #3)
[   34.660086] EIP: 0060:[<7840840e>] EFLAGS: 00010046 CPU: 0
[   34.665548] EIP is at blk_rq_map_sg+0x8e/0x190
[   34.669965] EAX: 00000000 EBX: 7c885180 ECX: 00000004 EDX: 033b6000
[   34.676205] ESI: 00001000 EDI: 00000008 EBP: 00000000 ESP: 7b4219b8
[   34.682444]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[   34.687818] Process swapper (pid: 1, ti=7b420000 task=7b416000 task.ti=7b420000)
[   34.695010] Stack: 7b521d38 00000008 7c884000 7b520000 00002000 033b6000 7c885080 00000001 
[   34.703329]        00000001 7c885880 01000002 7c886e00 7b521d1c 7c8857c4 7b520000 784c7ae5 
[   34.711649]        7b520000 784c75ea 7c8857c4 7b524ce4 7b521d1c 7c8857c4 784e4590 7b416000 
[   34.719968] Call Trace:
[   34.722568]  [<784c7ae5>] scsi_init_io+0x55/0xe0
[   34.727161]  [<784c75ea>] scsi_get_cmd_from_req+0x2a/0x40
[   34.732534]  [<784e4590>] sd_prep_fn+0x80/0x940
[   34.737041]  [<7813661b>] __lock_acquire+0x4ab/0xe20
[   34.741981]  [<78135c9c>] trace_hardirqs_on+0x9c/0xb0
[   34.747007]  [<78770e30>] _spin_unlock_irq+0x20/0x30
[   34.751946]  [<78404633>] elv_dispatch_sort+0x23/0xe0
[   34.756973]  [<784041a0>] elv_next_request+0xa0/0x130
[   34.761999]  [<784c8c24>] scsi_request_fn+0x1e4/0x370
[   34.767025]  [<78120f02>] del_timer+0x62/0x70
[   34.771358]  [<784072d5>] __generic_unplug_device+0x25/0x30
[   34.776905]  [<784075a5>] generic_unplug_device+0x15/0x30
[   34.782278]  [<78404e0c>] blk_backing_dev_unplug+0xc/0x10
[   34.787650]  ...
From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 10:52 am

I really don't think this should work.

Doing "sg_next()" on a valid sg is *always* ok. So if the old code didn't 
work, then "sg" wasn't valid to start with (and the code *after* the 
sg_next() would have oopsed even if you try to avoid using sg_next.

So avoiding the "sg_next()" on the last entry is pointless. 

Also, your patch makes the code almost totally unreadable, with that 
subtle issue of the "if (bvprv && cluster)" case not triggering on the 
first case, so the NULL initial sg is "safe".

So at a guess, I think the *real* problem is simply that the passed-in 
sglist was just too small. What guarantees that the sg list allocation 
(apparently done by scsi_alloc_sgtable()) is big enough? 

If I read things right, scsi_alloc_sgtable() will allocate "cmd->use_sg" 
SG enties, no? But I also notice that it does not seem to initialize the 
SG allocation, so those SG entries contain random crap - including, 
perhaps, a random - and bogus - chain pointer in sg->page..

Yes, we set sh->page *if* we create a chain, but if we don't chain, we 
leave the old random contents around which in turn may include old and 
stale chain pointers. Or am I missing something?

So when you added that "memset(sg, 0, sizeof(*sg))" into blk_rq_map_sg(), 
you did it way too late - it needs to be done when the sg chain is 
allocated, and for every entry (and then the "link" entry needs to be 
linked in separately)

I think.

		Linus

-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 11:00 am

Yeah, I didn't quite understand why if sg was valid, why dereferencing

Hmm I think it's quite readable, but perhaps that's just me :-). The
first is much cleaner, and the last part just reads 'if sg is not set

Right, we allocate an sgtable that will hold ->use_sg entries, which
contains request->nr_phys_segments. And that should definitely fit.

Regarding the init of the sglist, that was the revert I was talking
about. We do need that memset() in there, so all those sg entries will

Yep, and that is what Ingo did test as well and it worked. For that
case, now libata is crapping out elsewhere in sg_next().

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 11:18 am

Actually, I take that back. If 'sg' is the last entry in a *non*linked 
scatter-gather list (ie we don't use the last entry as a link, we actually 
use it as a real SG entry), then "sg_next(sg)" will indeed access past the 
end of the whole allocated array, and will access one past the end.

And with page-alloc debugging, that *will* blow up.

So I think your change to use "sg_next()" only when you actually need a 
next pointer is the correct one after all.

				Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 11:22 am

Thanks, so I'm not totally crazy :-)

Can you just pull:

  git://git.kernel.dk/data/git/linux-2.6-block.git for-linus

then so we get those two pieces correct? Then the remaining issue seems
to be a new one that is biting Ingo elsewhere, at least we'll all be on
the same page then.

-- 
Jens Axboe

-

From: Benny Halevy
Date: Thursday, October 18, 2007 - 3:52 am

Jens, for_each_sg still calls sg_next on the last entry which will
dereference a possibly bogus sg->page (for the sg_is_chain(sg)
condition in sg_next) if the last entry is the last one on the page
of unchained entry and sg+1 falls over into an uninitialized page.

How about the following?
(untested yet.
 sg.c included here as an example for usage out of scatterlist.h)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 2dc7464..3a27e03 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -30,7 +30,7 @@ static inline void sg_init_one(struct scatterlist *sg, const void *buf,
        ((struct scatterlist *) ((unsigned long) (sg)->page & ~0x01))

 /**
- * sg_next - return the next scatterlist entry in a list
+ * sg_next_unsafe - return the next scatterlist entry in a list
  * @sg:                The current sg entry
  *
  * Usually the next entry will be @sg@ + 1, but if this sg element is part
@@ -41,7 +41,7 @@ static inline void sg_init_one(struct scatterlist *sg, const void *buf,
  * the current entry, this function will NOT return NULL for an end-of-list.
  *
  */
-static inline struct scatterlist *sg_next(struct scatterlist *sg)
+static inline struct scatterlist *sg_next_unsafe(struct scatterlist *sg)
 {
        sg++;

@@ -51,11 +51,27 @@ static inline struct scatterlist *sg_next(struct scatterlist *sg)
        return sg;
 }

+/**
+ * sg_next - return the next scatterlist entry in a list
+ * @sg:                The current sg entry
+ * @next:      Index of next sg entry
+ * @nr:                Number of sg entries in the list
+ *
+ * Note that the caller must ensure that there are further entries after
+ * the current entry, this function will NOT return NULL for an end-of-list.
+ *
+ */
+static inline struct scatterlist *sg_next(struct scatterlist *sg,
+                                          int next, int nr)
+{
+       return next < nr ? sg_next_unsafe(sg) : NULL;
+}
+
 /*
  * Loop over each sg ...
From: Jens Axboe
Date: Thursday, October 18, 2007 - 3:55 am

Things have progressed a lot since, see my recent posting based on
Davem's proposal. Will post another patch soonish, that is also tested.

-- 
Jens Axboe

-

From: David Miller
Date: Thursday, October 18, 2007 - 5:03 am

From: Jens Axboe <jens.axboe@oracle.com>

One core issue here is that we need to decide whether this thing to be
iterated like an array or like a linked list.  It's trying to be both.

If we decide upon a looping construct for consumers and stick to it,
we'll be in much better shape than we are now and bugs will be eaiser
to spot.  It would be so much simpler to audit if all we saw in the
consumers were things like:

	while (sg) {
		do_stuff(sg);
		sg = sg_next(sg);
	}

I would suggest that we just get it over with and convert the whole
tree now rather than trying to do this kind of thing in stages.
Because then we can say that ever scatterlist creator has to set
the "end" bit and therefore you use well established patterns
for scatterlist iteration such as "traverse sg_next() until NULL"
as shown above.

I also noticed that there is the issue of on-stack and embedded
scatterlist users.  We'll need some sort of "DECLARE_SCATTERLIST"
and a "scatterlist_init()" thing so that we can keep DEBUG_SG
working even in those cases.  But for all I know Jens could be
working on that already :-)

The only other real option if we don't convert the whole tree now to
the "end" marker stuff, is to enforce that every scatterlist iterator
only traverse the number of entries there were told are in the one
given to them.
-

From: Jens Axboe
Date: Thursday, October 18, 2007 - 5:28 am

The above should work now, PROVIDED that the end has been marked
properly. I have added some helpers to init an sglist (or sg entry).
You can still use

        for_each_sg()

and pass in the number of entries, that'll work even with an end marker.

Heh, got some of it covered! The stack usage is generally just converted
to either use sg_init_one() if we can, or sg_init_table(). Adding some
soft of DECLARE_SCATTERLIST() sounds like a good idea, though. I'll take

Driver that do all their own sg management typically still use a manual
loop over the number of entries they know are there. I didn't change
those, and the original sg chaining didn't convert those either. Some
manual labor is required in utilizing chained sg lists, unless you are
part of a subsystem (like SCSI) that allocates and inits themf or you.
So I still think that baby steps are a good idea - only convert things
that matter. Leave the rest to janitors or others willing to dive in.

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, October 17, 2007 - 11:22 am

That still leaves the initialization issue. The link pointers need to all 
be initialized at SG allocation time (and not just the last one for the 
case where it's a linked entry).

And if you initialize them at allocation time, does an allocation ever get 
re-used without being free'd in between (if it does, you do indeed need to 
initialize the non-link pointers each time - but I don't see it being the 
case, so ..)

		Linus
-

From: Jens Axboe
Date: Wednesday, October 17, 2007 - 11:40 am

The links will not change, so reuse should actually be OK. The reuse
cannot be larger than the original would hold, but that would be a bug
without chaining as well of course.

WRT reuse, if we end up requeuing a request, than we will go through the
whole thing again. I don't think Ingo is hitting that either, though.

-- 
Jens Axboe

-

From: Ingo Molnar
Date: Wednesday, October 17, 2007 - 10:11 am

find below the disassembled code. Here's the faulting source line 
according to gdb:

(gdb) list *0x78406669
0x78406669 is in blk_rq_map_sg (include/linux/scatterlist.h:48).
43       */
44      static inline struct scatterlist *sg_next(struct scatterlist *sg)
45      {
46              sg++;
47
48              if (unlikely(sg_is_chain(sg)))
49                      sg = sg_chain_ptr(sg);
50
51              return sg;
52      }

(gdb) list *0x78406673
0x78406673 is in blk_rq_map_sg (block/ll_rw_blk.c:1355).
1350                    } else {
1351    new_segment:
1352                            sg = next_sg;
1353                            next_sg = sg_next(sg);
1354
1355                            sg->page = bvec->bv_page;
1356                            sg->length = nbytes;
1357                            sg->offset = bvec->bv_offset;
1358                            nsegs++;
1359                    }
(gdb)

the compiler is gcc-4.2.2. (vanilla, built from sources)

	Ingo

784065b0 <blk_rq_map_sg>:
784065b0:	55                   	push   %ebp
784065b1:	57                   	push   %edi
784065b2:	56                   	push   %esi
784065b3:	53                   	push   %ebx
784065b4:	83 ec 28             	sub    $0x28,%esp
784065b7:	89 44 24 04          	mov    %eax,0x4(%esp)
784065bb:	8b 98 08 01 00 00    	mov    0x108(%eax),%ebx
784065c1:	83 e3 01             	and    $0x1,%ebx
784065c4:	89 5c 24 18          	mov    %ebx,0x18(%esp)
784065c8:	8b 52 3c             	mov    0x3c(%edx),%edx
784065cb:	c7 44 24 14 00 00 00 	movl   $0x0,0x14(%esp)
784065d2:	00 
784065d3:	85 d2                	test   %edx,%edx
784065d5:	89 54 24 20          	mov    %edx,0x20(%esp)
784065d9:	0f 84 fc 00 00 00    	je     784066db <blk_rq_map_sg+0x12b>
784065df:	89 ce                	mov    %ecx,%esi
784065e1:	31 d2                	xor    %edx,%edx
784065e3:	89 4c 24 10          	mov    %ecx,0x10(%esp)
784065e7:	8b 44 24 20          	mov    0x20(%esp),%eax
784065eb:	0f b7 58 1a          ...
Previous thread: 2.6.23-git11 compile issues by Badari Pulavarty on Wednesday, October 17, 2007 - 8:33 am. (7 messages)

Next thread: Anyone else out there have an ExpressCard to try? by Mark Lord on Wednesday, October 17, 2007 - 9:04 am. (1 message)