Jens, just got this crash on a testbox: [ 37.701628] warning: process `kmodule' used the removed sysctl system call with 1.23. [ 39.409390] BUG: unable to handle kernel paging request at virtual address 7ca76000 [ 39.416892] printing eip: 78406669 *pde = 00dda027 *pte = 04a76000 [ 39.423132] Oops: 0000 [#1] DEBUG_PAGEALLOC [ 39.427292] [ 39.428766] Pid: 431, comm: fsck.ext3 Not tainted (2.6.23 #45) [ 39.434571] EIP: 0060:[<78406669>] EFLAGS: 00010006 CPU: 0 [ 39.440035] EIP is at blk_rq_map_sg+0xb9/0x170 [ 39.444450] EAX: 04aed000 EBX: 7ca1f380 ECX: 04aee000 EDX: 00095da0 [ 39.450689] ESI: 7ca75ff0 EDI: 79180da0 EBP: 00001000 ESP: 7caa1cac [ 39.456929] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 39.462303] Process fsck.ext3 (pid: 431, ti=7caa0000 task=7ca80000 task.ti=7caa0000) [ 39.469841] Stack: 00000020 7b520000 00002000 04aef000 7ca75fe0 0000001f 00000001 00000000 [ 39.478161] 7ca1f300 01000002 7ca44980 7b522b94 7ca0de00 7b520000 784c89b5 7b520000 [ 39.486480] 784c843a 7ca0de00 7b520f20 7b522b94 7ca0de00 784e54a0 784f394b 00000000 [ 39.494799] Call Trace: [ 39.497400] [<784c89b5>] scsi_init_io+0x55/0xe0 [ 39.501992] [<784c843a>] scsi_get_cmd_from_req+0x2a/0x40 [ 39.507366] [<784e54a0>] sd_prep_fn+0x80/0x940 [ 39.511872] [<784f394b>] ata_bmdma_start+0xb/0x20 [ 39.516638] [<784ef344>] ata_qc_issue_prot+0x164/0x1e0 [ 39.521839] [<78405c63>] elv_dispatch_sort+0x23/0xe0 [ 39.526865] [<784057d0>] elv_next_request+0xa0/0x130 [ 39.531891] [<787715b8>] _spin_lock_irq+0x38/0x50 [ 39.536657] [<784c9af4>] scsi_request_fn+0x1e4/0x370 [ 39.541684] [<78120e72>] del_timer+0x62/0x70 [ 39.546016] [<78408a45>] __generic_unplug_device+0x25/0x30 [ 39.551563] [<78408d15>] generic_unplug_device+0x15/0x30 [ 39.556936] [<78406430>] blk_backing_dev_unplug+0x0/0x10 [ 39.562309] [<7840643c>] blk_backing_dev_unplug+0xc/0x10 [ 39.567682] [<7818080d>] block_sync_page+0x2d/0x40 [ 39.572535] ...
managed to trigger it a second time, so it seems rather reproducible: [ 328.771333] kjournald starting. Commit interval 5 seconds [ 328.776963] EXT3 FS on sda5, internal journal [ 328.781172] EXT3-fs: mounted filesystem with ordered data mode. [ 329.689493] BUG: unable to handle kernel paging request at virtual address 7d516000 [ 329.696990] printing eip: 78406669 *pde = 00ddd027 *pte = 05516000 [ 329.703230] Oops: 0000 [#1] DEBUG_PAGEALLOC [ 329.707390] [ 329.708863] Pid: 0, comm: swapper Not tainted (2.6.23 #45) [ 329.714321] EIP: 0060:[<78406669>] EFLAGS: 00010006 CPU: 0 [ 329.719787] EIP is at blk_rq_map_sg+0xb9/0x170 [ 329.724202] EAX: 2f6df000 EBX: 7d6af880 ECX: 34d55000 EDX: 005edbe0 [ 329.730441] ESI: 7d515ff0 EDI: 796d8be0 EBP: 00001000 ESP: 78a13db0 [ 329.736680] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 329.742054] Process swapper (pid: 0, ti=78a12000 task=7893c400 task.ti=78a12000) [ 329.749246] Stack: 00000020 7b520000 00002000 34d56000 7d515fe0 00000007 00000001 00000000 [ 329.757565] 7d6af800 01000000 7d515e00 7b522b94 7d84f58c 7b520000 784c89b5 7b520000 [ 329.765885] 784c843a 7d84f58c 7b520f20 7b522b94 7d84f58c 784e54a0 784f394b 00000000 [ 329.774204] Call Trace: [ 329.776804] [<784c89b5>] scsi_init_io+0x55/0xe0 [ 329.781398] [<784c843a>] scsi_get_cmd_from_req+0x2a/0x40 [ 329.786770] [<784e54a0>] sd_prep_fn+0x80/0x940 [ 329.791277] [<784f394b>] ata_bmdma_start+0xb/0x20 [ 329.796043] [<784ef344>] ata_qc_issue_prot+0x164/0x1e0 [ 329.801243] [<78405c63>] elv_dispatch_sort+0x23/0xe0 [ 329.806268] [<784057d0>] elv_next_request+0xa0/0x130 [ 329.811295] [<787715b8>] _spin_lock_irq+0x38/0x50 [ 329.816062] [<784c9af4>] scsi_request_fn+0x1e4/0x370 [ 329.821088] [<784097f6>] blk_run_queue+0x36/0x80 [ 329.825768] [<784c8370>] scsi_next_command+0x30/0x50 [ 329.830794] [<784c84fb>] scsi_end_request+0xab/0xe0 [ 329.835733] [<784c9249>] scsi_io_completion+0xa9/0x3d0 [ 329.840933] [<78135de7>] ...
Can you pull git://git.kernel.dk/linux-2.6-block.git for-linus and see if it still reproduces? -- Jens Axboe -
The code in question is:
mov %edx,0xc(%esp)
mov (%ebx),%edi
mov %edi,%edx
sub %eax,%edx
mov %edx,%eax
sar $0x5,%eax
shl $0xc,%eax
add 0x8(%ebx),%eax
cmp %eax,0xc(%esp)
je +126
mov 0x10(%esi),%eax <----- Oops
lea 0x10(%esi),%edx
test $0x1,%al
jne +76
mov %edi,(%esi)
mov %ebp,0xc(%esi)
mov 0x8(%ebx),%eax
mov %eax,0x4(%esi)
and it looks like %esi is overflowing from one page to the next one, ie:
BUG: unable to handle kernel paging request at virtual address 7ca76000
ESI: 7ca75ff0
and you caught this thanks to page-alloc debugging again.
I think I can match that up with the source code: that's "sg_next()". It's
doing:
sg++;
if (unlikely(sg_is_chain(sg)))
sg = sg_chain_ptr(sg);
return sg;
and the oopsing instruction is that load of "sg->page" in the assembly
code:
mov 0x10(%esi),%eax # %eax = sg->page
lea 0x10(%esi),%edx # %edx = sg+1;
test $0x1,%al # if (unlikely(sg_is_chain()))
jne +76
Jens?
Linus
-
Yep, that's what I came up with as well - I asked Ingo for a dump in
private, but ended up just using ksymoops to decode the line.
The way blk_rq_map_sg() operates is that it ends up doing a
next_sg = sg_next(sg);
even though sg may be the last entry. Perhaps this is crapping out,
although if sg is a valid address, then sg + 1 should be as well.
next_sg may end up being crap, in fact it will, but we'll never use that
unless there are more entries to fill. And if there is, then both sg and
next_sg were valid.
So nothing in for-linus should fix it, I'll try and come up with an
alternate way to assign next_sg so it's always valid.
--
Jens Axboe
-
OK, the below should actually be safe, I don't know why I talked myself
into the next_sg stuff in the beginning. It's always safe to zero sg,
since it's a valid entry - nothing to save in ->page. Ingo, does this
work for you?
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 9e3f3cc..3935469 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1322,8 +1322,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
struct scatterlist *sglist)
{
struct bio_vec *bvec, *bvprv;
- struct scatterlist *next_sg, *sg;
struct req_iterator iter;
+ struct scatterlist *sg;
int nsegs, cluster;
nsegs = 0;
@@ -1333,7 +1333,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
* for each bio in rq
*/
bvprv = NULL;
- sg = next_sg = &sglist[0];
+ sg = NULL;
rq_for_each_segment(bvec, rq, iter) {
int nbytes = bvec->bv_len;
@@ -1349,8 +1349,10 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
sg->length += nbytes;
} else {
new_segment:
- sg = next_sg;
- next_sg = sg_next(sg);
+ if (!sg)
+ sg = sglist;
+ else
+ sg = sg_next(sg);
memset(sg, 0, sizeof(*sg));
sg->page = bvec->bv_page;
--
Jens Axboe
-
Scratch that, it cannot work... I'll think up a different approach. -- Jens Axboe -
OK, it is fine, as long as the sglist is cleared initially. And I don't
think there's anyway around that, clearly I didn't think long enough
before including the memset() removal from Tomo.
Ingo, please try this rolled up version.
Linus, this should work. It would probably be best if you first did a
git revert on f5c0dde4c66421a3a2d7d6fa604a712c9b0744e5 and then applied
the ll_rw_blk.c bit alone. Do you want me to stuff that (revert + patch)
into a branch for you to pull?
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 9e3f3cc..3935469 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1322,8 +1322,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
struct scatterlist *sglist)
{
struct bio_vec *bvec, *bvprv;
- struct scatterlist *next_sg, *sg;
struct req_iterator iter;
+ struct scatterlist *sg;
int nsegs, cluster;
nsegs = 0;
@@ -1333,7 +1333,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
* for each bio in rq
*/
bvprv = NULL;
- sg = next_sg = &sglist[0];
+ sg = NULL;
rq_for_each_segment(bvec, rq, iter) {
int nbytes = bvec->bv_len;
@@ -1349,8 +1349,10 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
sg->length += nbytes;
} else {
new_segment:
- sg = next_sg;
- next_sg = sg_next(sg);
+ if (!sg)
+ sg = sglist;
+ else
+ sg = sg_next(sg);
memset(sg, 0, sizeof(*sg));
sg->page = bvec->bv_page;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0c86be7..aac8a02 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -764,6 +764,8 @@ struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *cmd, gfp_t gfp_mask)
if (unlikely(!sgl))
goto enomem;
+ memset(sgl, 0, sizeof(*sgl) * sgp->size);
+
/*
* first loop through, set initial index and return value
*/
--
Jens Axboe
-
yep, this one did the trick, it booted up fine twice in a row already! Tested-by: Ingo Molnar <mingo@elte.hu> thanks! Ingo -
Great! Thanks a lot for reporting and testing... Linus, care to pull
git://git.kernel.dk/data/git/linux-2.6-block.git for-linus
Jens Axboe (2):
Revert "[SCSI] Remove full sg table memset()"
[BLOCK] blk_rq_map_sg() next_sg fixup
block/ll_rw_blk.c | 10 ++++++----
drivers/scsi/scsi_lib.c | 2 ++
2 files changed, 8 insertions(+), 4 deletions(-)
--
Jens Axboe
-
ok, here's a different but similar crash that triggers on the testbox: [ 233.438890] BUG: unable to handle kernel paging request at virtual address 7d93e000 [ 233.446390] printing eip: 784e9480 *pde = 01000067 *pte = 0593e000 [ 233.452630] Oops: 0000 [#1] DEBUG_PAGEALLOC [ 233.456790] [ 233.458264] Pid: 0, comm: swapper Not tainted (2.6.23 #5) [ 233.463637] EIP: 0060:[<784e9480>] EFLAGS: 00010087 CPU: 0 [ 233.469101] EIP is at ata_qc_issue+0x90/0x380 [ 233.473429] EAX: 7d93dff0 EBX: 0000001f ECX: 7d93dff0 EDX: 798daf80 [ 233.479668] ESI: 00000020 EDI: 7d93de00 EBP: 7b54007c ESP: 78a13e14 [ 233.485908] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 233.491282] Process swapper (pid: 0, ti=78a12000 task=789753e0 task.ti=78a12000) [ 233.498473] Stack: 7d93de00 7b540000 7b540000 00000000 7d93dfe0 7b54007c 7d93db00 7b5417a4 [ 233.506793] 784c2490 784ef69e 784f21f3 7b52de98 7d93db00 7b540000 7b5417a4 7d93db00 [ 233.515112] 7b540000 7b524004 784f22e0 784ef380 784c2490 7d93db00 00000202 7b524004 [ 233.523432] Call Trace: [ 233.526033] [<784c2490>] scsi_done+0x0/0x20 [ 233.530279] [<784ef69e>] ata_scsi_translate+0xbe/0x140 [ 233.535478] [<784f21f3>] ata_scsi_queuecmd+0x33/0x200 [ 233.540591] [<784f22e0>] ata_scsi_queuecmd+0x120/0x200 [ 233.545791] [<784ef380>] ata_scsi_rw_xlat+0x0/0x220 [ 233.550730] [<784c2490>] scsi_done+0x0/0x20 [ 233.554976] [<784c2d12>] scsi_dispatch_cmd+0x152/0x290 [ 233.560177] [<78135c67>] trace_hardirqs_on+0x67/0xb0 [ 233.565202] [<784c8c7e>] scsi_request_fn+0x1be/0x370 [ 233.570229] [<78408086>] blk_run_queue+0x36/0x80 [ 233.574909] [<784c7520>] scsi_next_command+0x30/0x50 [ 233.579935] [<784c76ab>] scsi_end_request+0xab/0xe0 [ 233.584875] [<784c83f9>] scsi_io_completion+0xa9/0x3d0 [ 233.590075] [<78135c67>] trace_hardirqs_on+0x67/0xb0 [ 233.595100] [<78405125>] blk_done_softirq+0x45/0x80 [ 233.600040] [<78405153>] blk_done_softirq+0x73/0x80 [ 233.604981] [<7811d4c3>] ...
You must have a magic test box :-) Will investigate... libata doesn't actually enable chaining, but since i386 supports it, it ends up using the chain helpers anyway. There seems to be some automatic inlining involved here, it must be dying inside ata_sg_setup(). -- Jens Axboe -
Also, can you send a dmesg from that system so I can see which libata drivers are involved? -- Jens Axboe -
same system for which i sent the earlier bootlog - but here's the second crashlog too. Crash is reproducible - crashed twice in a row already. (NOTE: the attached log is from this second crash.) Ingo
OK, something to try out - libata doesn't enable sg chaining, since the
support isn't complete yet. Here's a dumb check just to verify that
scsi_alloc_sgtable() will NEVER return a chain entry for a host that
doesn't have it enabled. If this triggers for you, then that could
explain your problem.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index aac8a02..cc89c86 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -777,8 +777,12 @@ struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *cmd, gfp_t gfp_mask)
* sglist must be the biggest one, or we would not have
* ended up doing another loop.
*/
- if (prev)
+ if (prev) {
+ struct Scsi_Host *shost = cmd->device->host;
+
+ BUG_ON(!shost->use_sg_chaining);
sg_chain(prev, SCSI_MAX_SG_SEGMENTS, sgl);
+ }
/*
* don't allow subsequent mempool allocs to sleep, it would
--
Jens Axboe
-
btw., i get the following build warning: drivers/scsi/scsi_lib.c: In function 'scsi_free_sgtable': drivers/scsi/scsi_lib.c:708: warning: 'index' may be used uninitialized in this function drivers/scsi/scsi_lib.c:708: note: 'index' was declared here drivers/scsi/scsi_lib.c:708: warning: 'index' may be used uninitialized in this function drivers/scsi/scsi_lib.c:708: note: 'index' was declared here drivers/scsi/scsi_lib.c: In function 'scsi_alloc_sgtable': drivers/scsi/scsi_lib.c:708: warning: 'index' may be used uninitialized in this function drivers/scsi/scsi_lib.c:708: note: 'index' was declared here not sure it matters. Ingo -
Hmm, I don't see them here (gcc 4.2.1). Must be the BUG(), does it go away if you add a index = -1 in the default: case? Just to be absolutely sure... -- Jens Axboe -
NOTE: i get the same warning even without the BUG_ON() patch, and without your other fix patch as well. I'm using gcc 4.2.2. (Do you get the warning if you pick up the config i sent you earlier today?) Ingo -
btw., i changed the initialization to -1 and the crash still occurs (as expected). Ingo -
Would think so, thanks for checking though. -- Jens Axboe -
Just making 100% sure that was the missing place, I try not to take Let me check... Yep, I do! -- Jens Axboe -
nope, this did not help. First bootup went fine, second bootup crashed again (see below), without hitting the BUG_ON(). Ingo [ 39.351198] BUG: unable to handle kernel paging request at virtual address 7ca58000 [ 39.358698] printing eip: 784e9480 *pde = 00dda027 *pte = 04a58000 [ 39.364939] Oops: 0000 [#1] DEBUG_PAGEALLOC [ 39.369099] [ 39.370573] Pid: 0, comm: swapper Not tainted (2.6.23 #9) [ 39.375944] EIP: 0060:[<784e9480>] EFLAGS: 00010087 CPU: 0 [ 39.381408] EIP is at ata_qc_issue+0x90/0x380 [ 39.385737] EAX: 7ca57ff0 EBX: 0000001f ECX: 7ca57ff0 EDX: 79181c20 [ 39.391977] ESI: 00000020 EDI: 7ca57e00 EBP: 7b54007c ESP: 78a13e10 [ 39.398217] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 39.403590] Process swapper (pid: 0, ti=78a12000 task=789753e0 task.ti=78a12000) [ 39.410781] Stack: 7ca57e00 7b540000 7b540000 00000000 7ca57fe0 7b54007c 7c991980 7b5417a4 [ 39.419101] 784c2490 784ef69e 784f21f3 7b52de98 7c991980 7b540000 7b5417a4 7c991980 [ 39.427421] 7b540000 7b524004 784f22e0 784ef380 784c2490 7c991980 00000206 7b524004 [ 39.435740] Call Trace: [ 39.438340] [<784c2490>] scsi_done+0x0/0x20 [ 39.442586] [<784ef69e>] ata_scsi_translate+0xbe/0x140 [ 39.447785] [<784f21f3>] ata_scsi_queuecmd+0x33/0x200 [ 39.452899] [<784f22e0>] ata_scsi_queuecmd+0x120/0x200 [ 39.458099] [<784ef380>] ata_scsi_rw_xlat+0x0/0x220 [ 39.463038] [<784c2490>] scsi_done+0x0/0x20 [ 39.467285] [<784c2d12>] scsi_dispatch_cmd+0x152/0x290 [ 39.472484] [<78135c67>] trace_hardirqs_on+0x67/0xb0 [ 39.477511] [<784c8c7e>] scsi_request_fn+0x1be/0x370 [ 39.482538] [<78408086>] blk_run_queue+0x36/0x80 [ 39.487217] [<784c7520>] scsi_next_command+0x30/0x50 [ 39.492243] [<784c76ab>] scsi_end_request+0xab/0xe0 [ 39.497183] [<784c83f9>] scsi_io_completion+0xa9/0x3d0 [ 39.502383] [<78135c67>] trace_hardirqs_on+0x67/0xb0 [ 39.507409] [<78405125>] blk_done_softirq+0x45/0x80 [ 39.512348] [<78405153>] ...
I think you'll always hit it if you have a scatter-gather list that is exactly filled up. Why? Because those things do "sg_next()" on the last entry, and as mentioned, that ends up actually accessing one past the end - even if the end result is not actually ever *used* (because we just effectively incremented it to past the last entry when the code was done with the SG list). So I think the sg_next() interface is fundamentally mis-designed. It should do the scatter-gather link following on *starting* to use the SG entry, not after finishing with it. Put another way: I suspect pretty much every single sg_next() out there is likely to hit this issue. The way that blk_rq_map_sg() fixed its problem was exactly to move the "sg_next()" to *before* the use of the SG (and even that one is somewhat bogus, in that it just blindly assumes that the first entry is not a link entry). I suspect the "the next entry is a link" bit should be in the *previous* entry, and then sg_next() could look like if (next_is_link(sg)) sg = sg_chain_ptr(sg+1); else sg++; return sg; and that would work. The alternative is to always make sure to allocate one more SG entry than required, so that the last entry is always either the link, or an unused entry! Linus -
OK, I think you have a very good point here. Ingo, can you verify it
goes away if apply the below? I have to tend to Other Life stuff but
will take this up again tomorrow morning and fix the sg_next() usage.
Linus, please still pull the branch I asked you to earlier. Thanks!
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index aac8a02..58ede7e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -39,7 +39,7 @@
* (unless chaining is used). Should ideally fit inside a single page, to
* avoid a higher order allocation.
*/
-#define SCSI_MAX_SG_SEGMENTS 128
+#define SCSI_MAX_SG_SEGMENTS 129
struct scsi_host_sg_pool {
size_t size;
--
Jens Axboe
-
Well, hang on - where does it end up doing sg_next() on the LAST sg
entry? I'd argue that this is a bug, like it was in ll_rw_blk.c. I still
agree that I should make the interface more robust, I just don't see
where libata ends up doing the sg_next() on the last entry.
I'm assuming that Ingo is using the previous patches, so that
blk_rq_map_sg() is using this construct:
new_segment:
if (!sg)
sg = sglist;
else
sg = sg_next(sg);
and the memset() in scsi_alloc_sgtable(), which I'm pretty sure he is.
I'm assuming we're not hitting pio paths, so there are no manual
sg_next() calls. Ingo, if you care, can you test this one as well?
No rush, as mentioned I'll be back tomorrow morning...
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index bbaa545..0246b61 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4664,7 +4664,7 @@ static int ata_sg_setup(struct ata_queued_cmd *qc)
{
struct ata_port *ap = qc->ap;
struct scatterlist *sg = qc->__sg;
- struct scatterlist *lsg = sg_last(qc->__sg, qc->n_elem);
+ struct scatterlist *lsg = &qc->__sg[qc->n_elem - 1];
int n_elem, pre_n_elem, dir, trim_sg = 0;
VPRINTK("ENTER, ata%u\n", ap->print_id);
--
Jens Axboe
-
hm, it still crashes - find the log below. Ingo [ 34.653682] EXT3-fs: mounted filesystem with ordered data mode. [ 34.659487] VFS: Mounted root (ext3 filesystem) readonly. [ 34.665208] Freeing unused kernel memory: 372k freed [ 35.809988] BUG: unable to handle kernel paging request at virtual address 7c8d3000 [ 35.817483] printing eip: 784e9480 *pde = 00dda027 *pte = 048d3000 [ 35.823723] Oops: 0000 [#1] DEBUG_PAGEALLOC [ 35.827883] [ 35.829357] Pid: 47, comm: rc.sysinit Not tainted (2.6.23 #14) [ 35.835162] EIP: 0060:[<784e9480>] EFLAGS: 00010006 CPU: 0 [ 35.840626] EIP is at ata_qc_issue+0xd0/0x340 [ 35.844954] EAX: 3fd79000 EBX: 7c8d3000 ECX: 00000020 EDX: 00000020 [ 35.851194] ESI: 7c8a1a80 EDI: 7c8d2e00 EBP: 7b54007c ESP: 7c8b3d5c [ 35.857434] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 [ 35.862807] Process rc.sysinit (pid: 47, ti=7c8b2000 task=7c86d000 task.ti=7c8b2000) [ 35.870346] Stack: 7c8d3000 7b540000 7b540000 00000000 7c8d2ff0 7b54007c 7c8a1a80 7b5417a4 [ 35.878665] 784c2490 784ef61e 784f2173 7b52de98 7c8a1a80 7b540000 7b5417a4 7c8a1a80 [ 35.886985] 7b540000 7b524004 784f2260 784ef300 784c2490 7c8a1a80 00000216 7b524004 [ 35.895304] Call Trace: [ 35.897905] [<784c2490>] scsi_done+0x0/0x20 [ 35.902151] [<784ef61e>] ata_scsi_translate+0xbe/0x140 [ 35.907350] [<784f2173>] ata_scsi_queuecmd+0x33/0x200 [ 35.912463] [<784f2260>] ata_scsi_queuecmd+0x120/0x200 [ 35.917663] [<784ef300>] ata_scsi_rw_xlat+0x0/0x220 [ 35.922602] [<784c2490>] scsi_done+0x0/0x20 [ 35.926849] [<784c2d12>] scsi_dispatch_cmd+0x152/0x290 [ 35.932049] [<78135c9c>] trace_hardirqs_on+0x9c/0xb0 [ 35.937075] [<784c8c3e>] scsi_request_fn+0x1be/0x370 [ 35.942102] [<78120f02>] del_timer+0x62/0x70 [ 35.946434] [<784072d5>] __generic_unplug_device+0x25/0x30 [ 35.951981] [<784075a5>] generic_unplug_device+0x15/0x30 [ 35.957354] [<78404e00>] blk_backing_dev_unplug+0x0/0x10 [ 35.962726] [<78404e0c>] ...
They pretty much *all* do, as far as I can tell.
For example, to look at the very first one:
#define for_each_sg(sglist, sg, nr, __i) \
for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))
let's say that "nr" is 1 (and that's also the allocation size), and look
at what that causes.
Right. It causes us to do "sg = sg_next(sg)" once. Which will do what? It
will increment sg (so that it now points past the single-entry array!) and
then it will dereference that invalid entry to see if it's a chain entry!
And no, "1" is not the special case. The special case is *any* time when
you iterate over as many sg entries as you allocated. You *always* have to
leave the last one unused in order to avoid this "access past the end"
problem.
The alternative is to rewrite the loop, but it's going to be ugly as hell,
and you need to do that for *every*single*user* of sg_next(). You can
re-write the above loop as something like
define for_each_sg(sglist, sg, nr, __i)
for (__i = 0, sg = NULL ;
__i < (nr) && sg = (sg ? sglist : sg_next(sg) ;
__i++))
but the important part here is that you must do that "sg_next()" *after*
you have broken out of the loop, and you must basically do it one less
time than you go through the loop.
IOW, any loop that leaves "sg" pointing to past the array is inevitably
buggy, because it will have accessed that last past-tne-end entry as part
See above. I think the exact sequence may be:
ata_qc_issue()
(implicitly inlined) ata_sg_setup()
(explicitly inlined) dma_map_sg()
(macro) for_each_sg()
but I didn't see if there are other possible chains that get you to one of
those invalid sg loops.
And no, it's *not* just for_each_sg(). Pretty much any "natural" loop over
the SG list will cause it, because that's how you write loops in C: you
almost always end up pointing to one past the last entry after the loop.
Linus
-
this one finally made the trick and it's booting fine now, without any crashes! Tested-by: Ingo Molnar <mingo@elte.hu> Ingo -
Super, that validates the theory the theory that Linus put forth. So the bug is clear now, this morning I'll work on proper sg looping. -- Jens Axboe -
hm, spoke too soon - i just got the ata_qc_issue() crash below. I've attached further below the current fixes/workarounds that i have applied at the moment. Any ideas? Ingo -------------> [ 155.259466] kjournald starting. Commit interval 5 seconds [ 155.265103] EXT3 FS on sda5, internal journal [ 155.269319] EXT3-fs: mounted filesystem with ordered data mode. [ 156.458225] BUG: unable to handle kernel paging request at virtual address 7d5ac000 [ 156.465723] printing eip: 784e9300 *pde = 00ddd027 *pte = 055ac000 [ 156.471964] Oops: 0000 [#1] DEBUG_PAGEALLOC [ 156.476123] [ 156.477597] Pid: 0, comm: swapper Not tainted (2.6.23 #40) [ 156.483055] EIP: 0060:[<784e9300>] EFLAGS: 00010006 CPU: 0 [ 156.488520] EIP is at ata_qc_issue+0xd0/0x340 [ 156.492848] EAX: 3d328000 EBX: 7d5ac000 ECX: 00000020 EDX: 00000020 [ 156.499087] ESI: 7d5ab480 EDI: 7d5abe00 EBP: 7b54007c ESP: 78a13e1c [ 156.505328] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 156.510700] Process swapper (pid: 0, ti=78a12000 task=789753e0 task.ti=78a12000) [ 156.517893] Stack: 7d5ac000 7b540000 7b540000 00000000 7d5abff0 7b54007c 7d5ab480 7b5417a4 [ 156.526213] 784c2330 784ef49e 784f1ff3 7b52de98 7d5ab480 7b540000 7b5417a4 7d5ab480 [ 156.534531] 7b540000 7b524004 784f20e0 784ef180 784c2330 7d5ab480 00000216 7b524004 [ 156.542851] Call Trace: [ 156.545452] [<784c2330>] scsi_done+0x0/0x20 [ 156.549698] [<784ef49e>] ata_scsi_translate+0xbe/0x140 [ 156.554897] [<784f1ff3>] ata_scsi_queuecmd+0x33/0x200 [ 156.560010] [<784f20e0>] ata_scsi_queuecmd+0x120/0x200 [ 156.565210] [<784ef180>] ata_scsi_rw_xlat+0x0/0x220 [ 156.570150] [<784c2330>] scsi_done+0x0/0x20 [ 156.574395] [<784c2bb2>] scsi_dispatch_cmd+0x152/0x290 [ 156.579596] [<78135aa7>] trace_hardirqs_on+0x67/0xb0 [ 156.584622] [<784c8abe>] scsi_request_fn+0x1be/0x370 [ 156.589649] [<78407ef6>] blk_run_queue+0x36/0x80 [ 156.594328] [<784c73c0>] scsi_next_command+0x30/0x50 [ 156.599354] [<784c754b>] ...
Hang on Ingo, will post an updated patch soonish! -- Jens Axboe -
Alas, this didn't help me here. I did manage to capture the error messages this time, and in the two machines I'm actively testing on, sata_mv is the driver that's dying in both cases. Machine A additionally has sata_nv, which is working. Machine B additionally has ata_piix, which is working. So in both cases, its sata_mv that is Still digging... this behavior showed up after libata changes went in, FWIW. Jeff -
Theory - ata_sg_is_last() isn't returning true for the last entry. Can
you double check that it correcly marks the last entry in mv_fill_sg()?
Alternatively, just try this patch.
diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c
index 4df8311..b858183 100644
--- a/drivers/ata/sata_mv.c
+++ b/drivers/ata/sata_mv.c
@@ -1139,6 +1139,7 @@ static void mv_fill_sg(struct ata_queued_cmd *qc)
struct mv_port_priv *pp = qc->ap->private_data;
struct scatterlist *sg;
struct mv_sg *mv_sg;
+ int end_marked = 0;
mv_sg = pp->sg_tbl;
ata_for_each_sg(sg, qc) {
@@ -1159,13 +1160,15 @@ static void mv_fill_sg(struct ata_queued_cmd *qc)
sg_len -= len;
addr += len;
- if (!sg_len && ata_sg_is_last(sg, qc))
+ if (!sg_len && ata_sg_is_last(sg, qc)) {
mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
+ end_marked++;
+ }
mv_sg++;
}
-
}
+ BUG_ON(end_marked != 1);
}
static inline void mv_crqb_pack_cmd(__le16 *cmdw, u8 data, u8 addr, unsigned last)
--
Jens Axboe
-
Will check in a few minutes, after my current test: I noticed that sg_tablesize in sata_mv is not LIBATA_MAX_PRD. This is expected behavior, but I wonder if that difference -- most notably being smaller than SCSI_MAX_SG_SEGMENTS -- would trigger any latent bugs. Anyway, we will both have answers momentarily... Jeff -
Another sata_mv difference from the rest: the chip does not support ATAPI, so we never care about qc->pad Jeff -
Your BUG_ON() does indeed trip, here. Its surprising that other folks don't explode, considering that mv_fill_sg() intentionally mirrors the logic in ata_fill_sg(). Jeff -
The sata_mv construct looks a bit odd. Does this work? That last
end_mv_sg test should always be true, just being paranoid...
diff --git a/drivers/ata/sata_mv.c b/drivers/ata/sata_mv.c
index 4df8311..5397eea 100644
--- a/drivers/ata/sata_mv.c
+++ b/drivers/ata/sata_mv.c
@@ -1138,8 +1138,9 @@ static void mv_fill_sg(struct ata_queued_cmd *qc)
{
struct mv_port_priv *pp = qc->ap->private_data;
struct scatterlist *sg;
- struct mv_sg *mv_sg;
+ struct mv_sg *mv_sg, *end_mv_sg;
+ end_mv_sg = NULL;
mv_sg = pp->sg_tbl;
ata_for_each_sg(sg, qc) {
dma_addr_t addr = sg_dma_address(sg);
@@ -1158,14 +1159,12 @@ static void mv_fill_sg(struct ata_queued_cmd *qc)
sg_len -= len;
addr += len;
-
- if (!sg_len && ata_sg_is_last(sg, qc))
- mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
-
+ end_mv_sg = mv_sg;
mv_sg++;
}
-
}
+ if (end_mv_sg)
+ end_mv_sg->flags_size |= cpu_to_le32(EPRD_FLAG_END_OF_TBL);
}
static inline void mv_crqb_pack_cmd(__le16 *cmdw, u8 data, u8 addr, unsigned last)
--
Jens Axboe
-
I'm testing a similar patch based on ata_fill_sg()'s method, which basically does something similar to what you've done here (see attached). I had noticed that ata_fill_sg() did not call ata_sg_is_last(). If this fixes the problem, I think the best solution would be to delete ata_sg_is_last(). In the few users that exist, we should be able to eliminate the test programmatically as you and ata_fill_sg() have done -- thereby eliminating a branch per loop in a hotpath. Off to test the attached... if that doesn't work I'll try your version, though there shouldn't be much difference. Jeff
Yes I know, but I'm trying to works towards getting rid of sg_last() and That should work as well. WRT ata_sg_is_last(), if we go ahead with my recent sg chaining updates, we can keep the test as it would be a single conditional and not require any looping. Let me know when you have tested this! -- Jens Axboe -
The patch I attached to the last email got both sata_mv test boxes working reliably (so far). I worked up a patch that kills ata_sg_is_last() (plus the max_phys_segments sata_mv fix), see attached. I'm thinking this is what I like to see in upstream. Of course, this doesn't explain why ata_sg_is_last() was broken, but since it's working _and_ slightly more efficient, I don't really care :) Jeff
Tomo and I agreed to kill sg_last() a few days ago anyways, so this is perfectly fine with me. -- Jens Axboe -
it would still be nice to figure out exactly why it was broken. Ingo -
It would always be nice. For this case I don't think it's very interesting, if we pursue the improved sg iteration setup. -- Jens Axboe -
BTW, I think it's pretty clear that ata_sg_is_last() is broken. It's likely a one-off in the n_iter test. -- Jens Axboe -
fixing that would be a -stable candidate, as a potential data corruptor - or is it more benign? Ingo -
I think it's safe to say that it was sg chaining introduced breakage, so it should work fine in 2.6.23.x. -- Jens Axboe -
It is confirmed working prior to the sg-chaining stuff, so 2.6.23.1 is OK... Jeff -
Yep, the [attached] patch that kills ata_sg_is_last() is working here on both machines that were previously croaking. It would be nice to get pdc_adma, sata_sil24 and ipr it-works test done, but IMO the patch is pretty straightforward and should be OK. Jeff
just a quick question: i have Jens's workarounds applied right now (see
patch below). Am i now crash/corruption-safe, or do i need your patch
too? And once your patch [and the other sg_*() patches] are upstream i
dont need the workaround anymore, correct?
Ingo
---
block/ll_rw_blk.c | 2 +-
drivers/ata/libata-core.c | 2 +-
drivers/scsi/scsi_lib.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
Index: linux/block/ll_rw_blk.c
===================================================================
--- linux.orig/block/ll_rw_blk.c
+++ linux/block/ll_rw_blk.c
@@ -631,7 +631,7 @@ void blk_queue_max_phys_segments(struct
printk("%s: set to minimum %d\n", __FUNCTION__, max_segments);
}
- q->max_phys_segments = max_segments;
+ q->max_phys_segments = max_segments - 1;
}
EXPORT_SYMBOL(blk_queue_max_phys_segments);
Index: linux/drivers/ata/libata-core.c
===================================================================
--- linux.orig/drivers/ata/libata-core.c
+++ linux/drivers/ata/libata-core.c
@@ -4664,7 +4664,7 @@ static int ata_sg_setup(struct ata_queue
{
struct ata_port *ap = qc->ap;
struct scatterlist *sg = qc->__sg;
- struct scatterlist *lsg = sg_last(qc->__sg, qc->n_elem);
+ struct scatterlist *lsg = &qc->__sg[qc->n_elem - 1];
int n_elem, pre_n_elem, dir, trim_sg = 0;
VPRINTK("ENTER, ata%u\n", ap->print_id);
Index: linux/drivers/scsi/scsi_lib.c
===================================================================
--- linux.orig/drivers/scsi/scsi_lib.c
+++ linux/drivers/scsi/scsi_lib.c
@@ -39,7 +39,7 @@
* (unless chaining is used). Should ideally fit inside a single page, to
* avoid a higher order allocation.
*/
-#define SCSI_MAX_SG_SEGMENTS 128
+#define SCSI_MAX_SG_SEGMENTS 129
struct scsi_host_sg_pool {
size_t size;
-
You need my patch if and only if you use one of the drivers touched by the patch. ata_sg_is_last() was a driver helper function, so my fix never really touched core code. I never had to apply the changes you included, to fix problems here. I wonder if libata should be doing blk_queue_max_phys_segments(q, q->max_phys_segments - 1) to account for the pad entry that libata owns. Jeff -
perhaps because you are not running a CONFIG_DEBUG_PAGEALLOC=y kernel? I recently fixed DEBUG_PAGEALLOC (it would crash upon bootup on x86 most of the time on any real hardware - so i doubt people were able to use it all that much). As long as you try Linus' latest -git tree (which has the latest arch/x86 merge) and use the 32-bit x86 kernel it should work fine for you too, and you will probably be able to trigger similar crashes too. Ingo -
Tested-by: Olof Johansson <olof@lixom.net> Looks ok on SATA_SIL24 and SATA_MV here on PPC (together with Jens' latest patch). Both barfed before. Thanks! -Olof -
[Just catching with reading lkml to this post] I "hate" to point this out, but I already reported that sata_sil24 fails on 1. Sep.: http://lkml.org/lkml/2007/9/1/95 In the thread "sata_sil24 broken since 2.6.23-rc4-mm1" I spent over a week to trace this back to ata_sg_is_last (finally found on 7. Oct): http://lkml.org/lkml/2007/10/7/43 Thanks for finally patching this now... Torsten -
This one will only fix the ones that use the SCSI-lib routines, so it will Already did. Linus -
So until that is fixed up, how about this? Should cover all block devices, and I don't think sg_next()/for_each_sg() has spread outside of that yet. diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c index 3935469..8708ad0 100644 --- a/block/ll_rw_blk.c +++ b/block/ll_rw_blk.c @@ -631,7 +631,7 @@ void blk_queue_max_phys_segments(struct request_queue *q, printk("%s: set to minimum %d\n", __FUNCTION__, max_segments); } - q->max_phys_segments = max_segments; + q->max_phys_segments = max_segments - 1; } EXPORT_SYMBOL(blk_queue_max_phys_segments); -- Jens Axboe -
this one crashes too, if added ontop of all the other fixes so far. Ingo [ 34.711185] EXT3-fs: mounted filesystem with ordered data mode. [ 34.716991] VFS: Mounted root (ext3 filesystem) readonly. [ 34.722712] Freeing unused kernel memory: 372k freed [ 36.112742] BUG: unable to handle kernel paging request at virtual address 7c8b1000 [ 36.120246] printing eip: 784e9490 *pde = 00dda027 *pte = 048b1000 [ 36.126486] Oops: 0000 [#1] DEBUG_PAGEALLOC [ 36.130645] [ 36.132119] Pid: 0, comm: swapper Not tainted (2.6.23 #15) [ 36.137579] EIP: 0060:[<784e9490>] EFLAGS: 00010006 CPU: 0 [ 36.143043] EIP is at ata_qc_issue+0xd0/0x340 [ 36.147371] EAX: 3f896000 EBX: 7c8b1000 ECX: 00000008 EDX: 00000008 [ 36.153610] ESI: 7c86de80 EDI: 7c8b0f80 EBP: 7b54007c ESP: 78a13e28 [ 36.159851] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 36.165223] Process swapper (pid: 0, ti=78a12000 task=789753e0 task.ti=78a12000) [ 36.172416] Stack: 7c8b1000 7b540000 7b540000 00000000 7c8b0ff0 7b54007c 7c86de80 7b5417a4 [ 36.180735] 784c24a0 784ef62e 784f2183 7b52de98 7c86de80 7b540000 7b5417a4 7c86de80 [ 36.189055] 7b540000 7b524004 784f2270 784ef310 784c24a0 7c86de80 00000206 7b524004 [ 36.197374] Call Trace: [ 36.199975] [<784c24a0>] scsi_done+0x0/0x20 [ 36.204221] [<784ef62e>] ata_scsi_translate+0xbe/0x140 [ 36.209420] [<784f2183>] ata_scsi_queuecmd+0x33/0x200 [ 36.214533] [<784f2270>] ata_scsi_queuecmd+0x120/0x200 [ 36.219733] [<784ef310>] ata_scsi_rw_xlat+0x0/0x220 [ 36.224672] [<784c24a0>] scsi_done+0x0/0x20 [ 36.228918] [<784c2d22>] scsi_dispatch_cmd+0x152/0x290 [ 36.234118] [<78135c67>] trace_hardirqs_on+0x67/0xb0 [ 36.239145] [<784c8c4e>] scsi_request_fn+0x1be/0x370 [ 36.244171] [<78408096>] blk_run_queue+0x36/0x80 [ 36.248850] [<784c7530>] scsi_next_command+0x30/0x50 [ 36.253877] [<784c76bb>] scsi_end_request+0xab/0xe0 [ 36.258817] [<784c83c9>] scsi_io_completion+0xa9/0x3d0 [ 36.264016] ...
Heh. Not good. There are drivers that set max phys segmsnts to 1. Either through some host-specific setting (MMC), or explicitly (eg a grep shows viodasd uses just a constant 1 there). Linus -
That would hurt... Care to commit your for_each_sg() uglification fixup for now then? Or disable the allocation debug config entry, so that the sg+1 deref wont crash? Whichever your prefer, just don't want to cause crash pains for the unfortunate souls that happen to run into this before the real fix is done. -- Jens Axboe -
Well, in practice, it will only crash with DEBUG_PAGEALLOC, so few enough
are going to be hit by it. In that sense I don't think we're in any deep
trouble yet.
That said, maybe this is an acceptable, if hacky, replacement for the
current "for_each_sg()" loop.
It does:
- starts at one *before* the sglist
- does the sg_next() at the *top* of the loop rather than the bottom of it
- has a "--count" before that sg_next, so that we don't do it for the
case when we break out and have used up all segments.
Totally untested, but it *may* work, and it doesn't look horribly ugly.
Ingo, does this actually make any difference?
Linus
---
include/linux/scatterlist.h | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 2dc7464..f5c8e11 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -51,11 +51,18 @@ static inline struct scatterlist *sg_next(struct scatterlist *sg)
return sg;
}
+static inline struct scatterlist *sg_safe_next(struct scatterlist *sg, int left)
+{
+ if (left < 0)
+ return NULL;
+ return sg_next(sg);
+}
+
/*
* Loop over each sg element, following the pointer to a new list if necessary
*/
#define for_each_sg(sglist, sg, nr, __i) \
- for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))
+ for (__i = (nr), sg = (sglist)-1; (sg = sg_safe_next(sg, --__i)) != NULL ; )
/**
* sg_last - return the last scatterlist entry in a list
-
On Wed, 17 Oct 2007 14:11:34 -0700 (PDT) Looks that (sglist) - 1 isn't initialized and we use sg_next for it? -
sg_next() - as it stands now - never actually looks at the SG that its argument points to: it explicitly *only* looks at the next one. That's the bug. If sg_next() looked at the actual *current* sg entry, we wouldn't have any issues to begin with, and that's what I'm arguing we should do in the longer run (where "longer run" is defined as "when Jens does it asap"). So the hacky solution as it stands right now is to actually use the behaviour of "sg_next()" to our advantage in for_each_sg(), and starting off by setting sg to (sglist)-1. That way we can do the "sg_next()" (which will *not* look at the uninitialized space before the array) before entering the loop, regardless of whether it's the first time through the loop or not. So no, it's not technically "strictly conforming ANSI C", because we use a pointer (not do not dereference it!) outside of its allocation, which is strictly speaking against the standard. However, the kernel does those kinds of things all the time, since the kernel does its own memory allocation, so that's not actually relevant. The *standard* may say that you cannot decrement a pointer to before the beginning of the object and then increment it back up again and be strictly conforming, but the fact is, we very much depend on contiguous and flat kernel pointer model, and always have (and probably always will) Linus -
From: Linus Torvalds <torvalds@linux-foundation.org> What the thing really wants is some kind of indication of state, without having to bloat up the scatterlist structure. I believe that we have enough of a limited set of accessors to sg->page that we can more aggressively encode things in the lower bits. I'm thinking of encoding the low two bits of sg->page as follows: 1) bits == 0 then the SG list is linear and sg_next() is sg++ 2) bits == 1 the nest SG is an indirect chunk, sg_next() is therefore something like: next = *((struct scatterlist **)(sg + 1)); 3) bits == 2 this is the last entry in the scatterlist, sg_next() is NULL So for the cases of ARCH_HAS_SG_CHAIN not being set (ie. back compatible), we can do no bit encoding in page->flags and just do sg_next() == sg++, as is done now. When doing SG chaining, in each non-linear chunk we have to allocate one more pointer past the end of the scatterlist array (instead of a full extra scatterlist entry for the indirect pointer encode). Next, all sg->page accesses have to be guarded to clear the state bits out first. I don't know, maybe it would work, and would make the loop termination issues easier to handle properly. -
I like it. Basically the only real change is using bit 2 as a
termination point, so we avoid going beyond the end of the sgtable.
Here's a starting point, it actually booted for me in the first go
(boggle). Only x86 so far, archs will need to be converted. And lots
more drivers I'm sure, I only fixed up the ones that botched my compile.
So just consider this a directional patch.
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index 5cdfab6..daaf636 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -302,7 +302,7 @@ static int dma_map_sg_nonforce(struct device *dev, struct scatterlist *sg,
#endif
for_each_sg(sg, s, nents, i) {
- unsigned long addr = page_to_phys(s->page) + s->offset;
+ unsigned long addr = page_to_phys(sg_page(s)) + s->offset;
if (nonforced_iommu(dev, addr, s->length)) {
addr = dma_map_area(dev, addr, s->length, dir);
if (addr == bad_dma_address) {
@@ -397,7 +397,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
start_sg = sgmap = sg;
ps = NULL; /* shut up gcc */
for_each_sg(sg, s, nents, i) {
- dma_addr_t addr = page_to_phys(s->page) + s->offset;
+ dma_addr_t addr = page_to_phys(sg_page(s)) + s->offset;
s->dma_address = addr;
BUG_ON(s->length == 0);
diff --git a/arch/x86/kernel/pci-nommu_64.c b/arch/x86/kernel/pci-nommu_64.c
index e85d436..d64a4a5 100644
--- a/arch/x86/kernel/pci-nommu_64.c
+++ b/arch/x86/kernel/pci-nommu_64.c
@@ -62,8 +62,8 @@ static int nommu_map_sg(struct device *hwdev, struct scatterlist *sg,
int i;
for_each_sg(sg, s, nents, i) {
- BUG_ON(!s->page);
- s->dma_address = virt_to_bus(page_address(s->page) +s->offset);
+ BUG_ON(!sg_page(s));
+ s->dma_address = virt_to_bus(page_address(sg_page(s)) +s->offset);
if (!check_addr("map_sg", hwdev, s->dma_address, s->length))
return 0;
s->dma_length = s->length;
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 3935469..d02783c ...From: Jens Axboe <jens.axboe@oracle.com>
Here are some sparc64 bits:
diff --git a/arch/sparc64/kernel/iommu.c b/arch/sparc64/kernel/iommu.c
index 29af777..73852a2 100644
--- a/arch/sparc64/kernel/iommu.c
+++ b/arch/sparc64/kernel/iommu.c
@@ -473,7 +473,7 @@ static void dma_4u_unmap_single(struct device *dev, dma_addr_t bus_addr,
}
#define SG_ENT_PHYS_ADDRESS(SG) \
- (__pa(page_address((SG)->page)) + (SG)->offset)
+ (__pa(page_address(sg_page(SG))) + (SG)->offset)
static void fill_sg(iopte_t *iopte, struct scatterlist *sg,
int nused, int nelems,
@@ -566,7 +566,7 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
if (nelems == 1) {
sglist->dma_address =
dma_4u_map_single(dev,
- (page_address(sglist->page) +
+ (page_address(sg_page(sglist)) +
sglist->offset),
sglist->length, direction);
if (unlikely(sglist->dma_address == DMA_ERROR_CODE))
diff --git a/arch/sparc64/kernel/iommu_common.c b/arch/sparc64/kernel/iommu_common.c
index d7ca900..ec863e0 100644
--- a/arch/sparc64/kernel/iommu_common.c
+++ b/arch/sparc64/kernel/iommu_common.c
@@ -73,7 +73,7 @@ static int verify_one_map(struct scatterlist *dma_sg, struct scatterlist **__sg,
daddr = dma_sg->dma_address;
sglen = sg->length;
- sgaddr = (unsigned long) (page_address(sg->page) + sg->offset);
+ sgaddr = (unsigned long) (page_address(sg_page(sg)) + sg->offset);
while (dlen > 0) {
unsigned long paddr;
@@ -123,7 +123,7 @@ static int verify_one_map(struct scatterlist *dma_sg, struct scatterlist **__sg,
sg = sg_next(sg);
if (--nents <= 0)
break;
- sgaddr = (unsigned long) (page_address(sg->page) + sg->offset);
+ sgaddr = (unsigned long) (page_address(sg_page(sg)) + sg->offset);
sglen = sg->length;
}
if (dlen < 0) {
@@ -191,7 +191,7 @@ void verify_sglist(struct scatterlist *sglist, int nents, iopte_t *iopte, int np
printk("sg(%d): page_addr(%p) off(%x) length(%x) "
"dma_address[%016x] ...Thanks a lot, Dave! The patch is a monster right now, I'll work on splitting it into a 3-step process. Any arch help is greatly appreciated. -- Jens Axboe -
From: Jens Axboe <jens.axboe@oracle.com>
I have some other bits that my compile hit, such as some things in the
crypto layer.
But I hesitate to send them to you because I think the on-stack cases
need some helpers such that DEBUG_SG works for them.
BTW, you missed a case in drivers/usb/core/message.c because of
the config used in your build. This thing below is a good
argument for trying to avoid HIGHMEM et al. ifdefs in drivers :-)
--- drivers/usb/core/message.c~ 2007-10-18 01:46:44.000000000 -0700
+++ drivers/usb/core/message.c 2007-10-18 03:15:20.000000000 -0700
@@ -438,7 +438,7 @@
io->urbs[i]->transfer_buffer = NULL;
#else
io->urbs[i]->transfer_buffer =
- page_address(sg[i].page) + sg[i].offset;
+ page_address(sg_page(&sg[i])) + sg[i].offset;
#endif
} else {
/* hc may use _only_ transfer_buffer */
-
Yeah, I have tons of that so far. I hope to have an allyesconfig Indeed. I convert where appropriate, but I'm not anal about it. If they don't use sg_next() and/or for_each_sg() on their list, it should be ok. Thanks :-) -- Jens Axboe -
OK here goes, this compiles with allyesconfig on x86-64. Not too bad, the scsi/ drivers were by far the worst. arch/sparc64/kernel/iommu.c | 4 - arch/sparc64/kernel/iommu_common.c | 12 ++-- arch/sparc64/kernel/pci_sun4v.c | 4 - arch/x86/kernel/pci-calgary_64.c | 12 ++-- arch/x86/kernel/pci-gart_64.c | 4 - arch/x86/kernel/pci-nommu_64.c | 4 - block/ll_rw_blk.c | 5 + crypto/digest.c | 2 crypto/hmac.c | 3 - crypto/scatterwalk.c | 2 crypto/scatterwalk.h | 6 +- crypto/tcrypt.c | 4 - crypto/xcbc.c | 2 drivers/ata/libata-core.c | 10 +-- drivers/ata/libata-scsi.c | 2 drivers/block/cciss.c | 4 - drivers/block/cpqarray.c | 3 - drivers/block/cryptoloop.c | 12 ++-- drivers/block/ub.c | 8 +- drivers/ide/ide-probe.c | 4 + drivers/ide/ide-taskfile.c | 2 drivers/ieee1394/dma.c | 2 drivers/ieee1394/sbp2.c | 2 drivers/infiniband/core/umem.c | 11 ++- drivers/infiniband/hw/ipath/ipath_dma.c | 4 - drivers/infiniband/hw/ipath/ipath_mr.c | 2 drivers/infiniband/hw/mthca/mthca_memfree.c | 24 +++++--- drivers/infiniband/ulp/iser/iser_memory.c | 6 +- drivers/md/dm-crypt.c | 21 +++---- drivers/media/common/saa7146_core.c | 3 - drivers/media/video/ivtv/ivtv-udma.c | 4 - drivers/media/video/videobuf-dma-sg.c | 8 +- drivers/mmc/card/queue.c | 15 ++--- drivers/mmc/host/mmc_spi.c | 8 +- drivers/mmc/host/sdhci.c ...
From: Jens Axboe <jens.axboe@oracle.com> It build cleanly here on sparc64 too. It's late and I'm about to hit bed so I'm too chicken to do a test boot it right now :-) -
Don't boot it Dave, odds are that something will break and you'll then be stuck debugging that since you can't relax and sleep until it's working :-) -- Jens Axboe -
On Oct. 18, 2007, 14:15 +0200, Jens Axboe <jens.axboe@oracle.com> wrote: Hmm, sg_next is not supposed to return a pointer to the chain entry itself, but rather skip it. I think that the fix needs only check the "last" flag before incrementing sg. +#ifdef CONFIG_DEBUG_SG + BUG_ON(sg->sg_magic != SG_MAGIC); +#endif + if (sg_is_last(sg)) + return NULL; + sg++ if (unlikely(sg_is_chain(sg))) -
That would make sense, definitely. I'll add that. -- Jens Axboe -
BUG_ON(!sg_is_last(ret)); it should be, not sg. That's what I merged. -- Jens Axboe -
OK, that found something interesting - mapping a request may shrink it,
so we need to update the end marker to move it earlier in the list.
Basically just a
if (sg)
__sg_mark_end(sg);
at the bottom of blk_rq_map_sg(). Updated patch below, booted on other
machines now as well without incident.
diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
index e1c4707..094a95e 100644
--- a/arch/alpha/kernel/pci_iommu.c
+++ b/arch/alpha/kernel/pci_iommu.c
@@ -465,7 +465,7 @@ EXPORT_SYMBOL(pci_free_consistent);
Write dma_length of each leader with the combined lengths of
the mergable followers. */
-#define SG_ENT_VIRT_ADDRESS(SG) (page_address((SG)->page) + (SG)->offset)
+#define SG_ENT_VIRT_ADDRESS(SG) (page_address(sg_page((SG)) + (SG)->offset))
#define SG_ENT_PHYS_ADDRESS(SG) __pa(SG_ENT_VIRT_ADDRESS(SG))
static void
diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index 44ab0da..9d371e4 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -442,7 +442,7 @@ dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
BUG_ON(dir == DMA_NONE);
for (i = 0; i < nents; i++, sg++) {
- struct page *page = sg->page;
+ struct page *page = sg_page(sg);
unsigned int offset = sg->offset;
unsigned int length = sg->length;
void *ptr = page_address(page) + offset;
diff --git a/arch/blackfin/kernel/dma-mapping.c b/arch/blackfin/kernel/dma-mapping.c
index 94d7b11..dbf7222 100644
--- a/arch/blackfin/kernel/dma-mapping.c
+++ b/arch/blackfin/kernel/dma-mapping.c
@@ -160,7 +160,7 @@ dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
BUG_ON(direction == DMA_NONE);
for (i = 0; i < nents; i++, sg++) {
- sg->dma_address = (dma_addr_t)(page_address(sg->page) +
+ sg->dma_address = (dma_addr_t)(page_address(sg_page(sg)) +
sg->offset);
invalidate_dcache_range(sg_dma_address(sg),
diff --git a/arch/ia64/hp/common/sba_iommu.c ...PPC needs this. I'm guessing other arches might too. Otherwise seems to boot
and work well (with Jeff's sata patch).
Thanks,
-Olof
diff --git a/include/asm-powerpc/scatterlist.h b/include/asm-powerpc/scatterlist.h
index b9f1dbc..fcf7d55 100644
--- a/include/asm-powerpc/scatterlist.h
+++ b/include/asm-powerpc/scatterlist.h
@@ -14,6 +14,9 @@
#include <asm/dma.h>
struct scatterlist {
+#ifdef CONFIG_DEBUG_SG
+ unsigned long sg_magic;
+#endif
unsigned long page_link;
unsigned int offset;
unsigned int length;
-
Oh duh, indeed I didn't add that to the archs when converting. Thanks for the patch, I'll update the rest as well. And thanks a lot for boot testing it! -- Jens Axboe -
Alright, small modifications and tested somewhat. It boots on my laptop
just fine (CONFIG_DEBUG_SG is set) without warnings. Also tested with
PAGEALLOC set, that also boots and works fine without issue.
So if this is what we want, the question is how to proceed. Right now
i386 and x86-64 works, sparc64 PROBABLY works but is not tested. Any
other arch will NOT compile, since it's not converted to use page_link.
I'll try and do ppc64 now.
diff --git a/arch/sparc64/kernel/iommu.c b/arch/sparc64/kernel/iommu.c
index 29af777..73852a2 100644
--- a/arch/sparc64/kernel/iommu.c
+++ b/arch/sparc64/kernel/iommu.c
@@ -473,7 +473,7 @@ static void dma_4u_unmap_single(struct device *dev, dma_addr_t bus_addr,
}
#define SG_ENT_PHYS_ADDRESS(SG) \
- (__pa(page_address((SG)->page)) + (SG)->offset)
+ (__pa(page_address(sg_page(SG))) + (SG)->offset)
static void fill_sg(iopte_t *iopte, struct scatterlist *sg,
int nused, int nelems,
@@ -566,7 +566,7 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist,
if (nelems == 1) {
sglist->dma_address =
dma_4u_map_single(dev,
- (page_address(sglist->page) +
+ (page_address(sg_page(sglist)) +
sglist->offset),
sglist->length, direction);
if (unlikely(sglist->dma_address == DMA_ERROR_CODE))
diff --git a/arch/sparc64/kernel/iommu_common.c b/arch/sparc64/kernel/iommu_common.c
index d7ca900..ec863e0 100644
--- a/arch/sparc64/kernel/iommu_common.c
+++ b/arch/sparc64/kernel/iommu_common.c
@@ -73,7 +73,7 @@ static int verify_one_map(struct scatterlist *dma_sg, struct scatterlist **__sg,
daddr = dma_sg->dma_address;
sglen = sg->length;
- sgaddr = (unsigned long) (page_address(sg->page) + sg->offset);
+ sgaddr = (unsigned long) (page_address(sg_page(sg)) + sg->offset);
while (dlen > 0) {
unsigned long paddr;
@@ -123,7 +123,7 @@ static int verify_one_map(struct scatterlist *dma_sg, struct scatterlist **__sg,
sg = sg_next(sg);
if (--nents <= 0)
...OK, I think that covers every arch out there. I haven't been able to compile any of them, but it's mostly search'n replace operations. I hope nothing is missing linux/scatterlist.h includes... arch/alpha/kernel/pci_iommu.c | 2 arch/arm/common/dmabounce.c | 2 arch/blackfin/kernel/dma-mapping.c | 2 arch/ia64/hp/common/sba_iommu.c | 2 arch/ia64/hp/sim/simscsi.c | 4 arch/ia64/sn/pci/pci_dma.c | 2 arch/m68k/kernel/dma.c | 2 arch/mips/mm/dma-default.c | 8 - arch/powerpc/kernel/dma_64.c | 2 arch/powerpc/kernel/ibmebus.c | 2 arch/powerpc/kernel/iommu.c | 2 arch/powerpc/platforms/ps3/system-bus.c | 2 arch/sparc/kernel/ioport.c | 16 +- arch/sparc/mm/io-unit.c | 2 arch/sparc/mm/iommu.c | 8 - arch/sparc/mm/sun4c.c | 2 arch/sparc64/kernel/iommu.c | 4 arch/sparc64/kernel/iommu_common.c | 12 - arch/sparc64/kernel/ldc.c | 2 arch/sparc64/kernel/pci_sun4v.c | 4 arch/um/drivers/ubd_kern.c | 2 arch/x86/kernel/pci-calgary_64.c | 12 + arch/x86/kernel/pci-gart_64.c | 4 arch/x86/kernel/pci-nommu_64.c | 4 block/ll_rw_blk.c | 5 crypto/digest.c | 2 crypto/hmac.c | 3 crypto/scatterwalk.c | 2 crypto/scatterwalk.h | 6 crypto/tcrypt.c | 4 crypto/xcbc.c | 2 drivers/ata/libata-core.c | 10 - drivers/ata/libata-scsi.c | 2 drivers/block/cciss.c | 4 ...
Jens, again, please correct me if I'm wrong, but when sg points at the
entry right before a chain entry this implementation of sg_next will
here's how I think sg_next should be implemented:
*/
static inline struct scatterlist *sg_next(struct scatterlist *sg)
{
+#ifdef CONFIG_DEBUG_SG
+ BUG_ON(sg->sg_magic != SG_MAGIC);
+#endif
+ if (sg_is_last(sg))
+ return NULL;
+
sg++;
if (unlikely(sg_is_chain(sg)))
sg = sg_chain_ptr(sg);
return sg;
}
-
Yep, thanks for catching that! -- Jens Axboe -
Patch fails on drivers/scsi/scsi_lib.c.
I replaced that part of the patch with this updated portion instead:
--- a/drivers/scsi/scsi_lib.c 2007-10-18 09:35:28.000000000 -0400
+++ b/drivers/scsi/scsi_lib.c 2007-10-18 09:46:47.000000000 -0400
@@ -295,7 +295,7 @@
int i, err, nr_vecs = 0;
for_each_sg(sgl, sg, nsegs, i) {
- page = sg->page;
+ page = sg_page(sg);
off = sg->offset;
len = sg->length;
data_len += len;
@@ -764,6 +764,8 @@
if (unlikely(!sgl))
goto enomem;
+ sg_init_table(sgl, sgp->size);
+
/*
* first loop through, set initial index and return value
*/
@@ -779,6 +781,13 @@
sg_chain(prev, SCSI_MAX_SG_SEGMENTS, sgl);
/*
+ * if we have nothing left, mark the last segment as
+ * end-of-list
+ */
+ if (!left)
+ sg_mark_end(sgl, this);
+
+ /*
* don't allow subsequent mempool allocs to sleep, it would
* violate the mempool principle.
*/
@@ -2351,7 +2360,7 @@
*offset = *offset - len_complete + sg->offset;
/* Assumption: contiguous pages can be accessed as "page + i" */
- page = nth_page(sg->page, (*offset >> PAGE_SHIFT));
+ page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
*offset &= ~PAGE_MASK;
/* Bytes in this sg-entry from *offset to the end of the page */
-
Hmm, what are you applying against? Must be a clean tree, throw away any
patches that you already applied in this thread.
Updated below.
diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
index e1c4707..094a95e 100644
--- a/arch/alpha/kernel/pci_iommu.c
+++ b/arch/alpha/kernel/pci_iommu.c
@@ -465,7 +465,7 @@ EXPORT_SYMBOL(pci_free_consistent);
Write dma_length of each leader with the combined lengths of
the mergable followers. */
-#define SG_ENT_VIRT_ADDRESS(SG) (page_address((SG)->page) + (SG)->offset)
+#define SG_ENT_VIRT_ADDRESS(SG) (page_address(sg_page((SG)) + (SG)->offset))
#define SG_ENT_PHYS_ADDRESS(SG) __pa(SG_ENT_VIRT_ADDRESS(SG))
static void
diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index 44ab0da..9d371e4 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -442,7 +442,7 @@ dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
BUG_ON(dir == DMA_NONE);
for (i = 0; i < nents; i++, sg++) {
- struct page *page = sg->page;
+ struct page *page = sg_page(sg);
unsigned int offset = sg->offset;
unsigned int length = sg->length;
void *ptr = page_address(page) + offset;
diff --git a/arch/blackfin/kernel/dma-mapping.c b/arch/blackfin/kernel/dma-mapping.c
index 94d7b11..dbf7222 100644
--- a/arch/blackfin/kernel/dma-mapping.c
+++ b/arch/blackfin/kernel/dma-mapping.c
@@ -160,7 +160,7 @@ dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
BUG_ON(direction == DMA_NONE);
for (i = 0; i < nents; i++, sg++) {
- sg->dma_address = (dma_addr_t)(page_address(sg->page) +
+ sg->dma_address = (dma_addr_t)(page_address(sg_page(sg)) +
sg->offset);
invalidate_dcache_range(sg_dma_address(sg),
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 3c95f41..0b74c63 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -246,7 +246,7 @@ static int reserve_sba_gart = ...Squeaky-clean linux-2.6.23 + patch-2.6.23-git13 + your patch. Fails on scsi_lib.c. -ml -
I'll re-pull everything fresh again from kernel.org and rebuild with the "updated below" patch you posted. Thanks. -
Okay, fresh pull of everything from kernel.org, and now your latest patch does apply cleanly to -git13. Something weird (at this end). Thanks. -
Thanks for confirming, I did double check that my HEAD was uptodate - and it is. -- Jens Axboe -
Umm. May I suggest (I haven't read the whole thread yet, maybe somebody
else already did) that
static inline unsigned long sg_phys(struct scatterlist *sg)
{
return page_to_phys(sg_page(sg)) + sg->offset;
}
would be a good thing to have?
Very few drivers should care so much about the *page* itself (or the
offset). That's something that the generic allocation code etc cares
about, but the driver is almost bound to care mostly about the actual DMA
address, so adding that helper function that abstracts the sg access would
be helpful in hiding some of the cruft?
Linus
-
I'll add it to the mix. -- Jens Axboe -
Actually, only 11 occurrences in the patch. But still a nice little cleanup. -- Jens Axboe -
On Thu, 18 Oct 2007 09:55:04 -0700 (PDT) .... but will that work for systems with IOMMU ? or is it fundamentally the wrong interface -
They use foo_to_bus() on the address. sg_phys() should of course only be used where the user previously did page_to_phys() on the sg page. -- Jens Axboe -
On Thu, 18 Oct 2007 19:14:29 +0200 I can take care of IOMMU stuff when I'll send IOMMU merging fix patchset: http://marc.info/?l=linux-scsi&m=119079718126157&w=2 -
FWIW libata cares about both. When doing DMA, it cares about the DMA address. When doing PIO, it cares about the actual page, because it does a kmap before sending the data through a 16-bit/32-bit data FIFO. Jeff -
Yes, that sounds sane. Although I also wonder whether we want one global per-arch ARCH_HAS_SG_CHAIN, or perhaps have something more dynamic, which allows a per-SG-list choice (which in turn would require some kind of "head" entry that is passed into sg_next(), and in general descibes the SG list) Linus -
From: Linus Torvalds <torvalds@linux-foundation.org> It's there because the DMA mapping support code for a platform has to be converted to handle these chains and audited to make sure the conversion is right. Some platforms, such as sparc64, took a lot of work to get right :-) Once that macro is set, the block device driver has to support it too, and there are knobs all the way down to the scsi host driver to indicate this. The idea is that all of this mess gets deleted in the end, it was meant to be a safe transition scheme. -
It would be good to have something soon-ish. This "dead at boot time" issue is impacting the general ability to test -
In the meantime, does the patch I sent out help people? I'd like to get feedback, but I'm a lazy bum, and don't use DEBUG_PAGEALLOC myself, so I was hoping that people who actually see this could comment on my untested suggestion. Linus -
Oh.. so this bug is supposed to only bite with DEBUG_PAGEALLOC=y ?? Then something else is broken, perhaps. I just saw a long traceback scroll off the top of the screen, with lots of bio_* functions in the list and assumed it was the same bug. Mmm.. I'll get out the camera and try it again now.. -
I'm currently bisecting on two machines (sata_mv and sata_nv). sata_mv suddenly started spewing errors in recent kernels, but kernels from ~48 hours ago are just fine. AMD64 sata_nv machine will simply lock up if I push it too hard. Again, reproducible, and kernels from ~48 hours ago are fine. AHCI machine so far seems fine (kernel ~24 hours ago), storage-wise, but the hda_intel audio decided to stop working :) I'm quite happy to test patches, but I never run with DEBUG_PAGEALLOC so that shouldn't be causing the problems here. Jeff -
Okay, mine is dying with EIP at blk_rq_map_sg+0xcb/0x160. Screen photo is at http://rtr.ca/recent/2.6.23-git12-crash.jpg, but the top was cut off (isn't there a new config option or patch to do double-columns or scrollback or something ???. Cheers -
Is this a sata_mv box? If so, could you try this patch? Jeff
If anything, that shrinks the size of the resulting request. Did this patch make any difference to you? Now sata_mv should not be doing this (already discussed), but as long as it's only lowering the physical sg count then it should not cause any bugs at least. -- Jens Axboe -
Not a sata_mv box, so no point here. ata_piix. -
Can you try the big patch I just posted? -- Jens Axboe -
I'll hunt for it and try it, but your earlier little patch already works fine. Cheers -
I found the latest rev, and it failed to apply cleanly on -git12 or -git13
due to scsi_lib.c. After fixing that portion (replacement chunk below),
I'm now running with -git12, with the sg list debug option enabled (no messages).
Looks okay so far
--- a/drivers/scsi/scsi_lib.c 2007-10-18 09:35:28.000000000 -0400
+++ b/drivers/scsi/scsi_lib.c 2007-10-18 09:46:47.000000000 -0400
@@ -295,7 +295,7 @@
int i, err, nr_vecs = 0;
for_each_sg(sgl, sg, nsegs, i) {
- page = sg->page;
+ page = sg_page(sg);
off = sg->offset;
len = sg->length;
data_len += len;
@@ -764,6 +764,8 @@
if (unlikely(!sgl))
goto enomem;
+ sg_init_table(sgl, sgp->size);
+
/*
* first loop through, set initial index and return value
*/
@@ -779,6 +781,13 @@
sg_chain(prev, SCSI_MAX_SG_SEGMENTS, sgl);
/*
+ * if we have nothing left, mark the last segment as
+ * end-of-list
+ */
+ if (!left)
+ sg_mark_end(sgl, this);
+
+ /*
* don't allow subsequent mempool allocs to sleep, it would
* violate the mempool principle.
*/
@@ -2351,7 +2360,7 @@
*offset = *offset - len_complete + sg->offset;
/* Assumption: contiguous pages can be accessed as "page + i" */
- page = nth_page(sg->page, (*offset >> PAGE_SHIFT));
+ page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
*offset &= ~PAGE_MASK;
/* Bytes in this sg-entry from *offset to the end of the page */
-
OK, thanks a lot for testing! -- Jens Axboe -
No.. it's a pretty ordinary ata_piix machine, with no serial ports either. Cheers -
That could explain it: if the SG allocation is simply too small, the scatter-gather code will run off the end of the SG list, and encounter random uninitialized entries, and if any of those have the low bit set, they will be considered to be "link" entries, and the SG list goes wild. That SG code is really pretty fragile. I don't see it *ever* checking that the SG allocation is large enough when it fills it in. And these things take the sizes from different places (ie "cmd->use_sg" bs "req->nr_phys_segments" vs God knows what else..) Linus -
I tried the fancy "boot_delay=nnn" feature, but that doesn't slow down the tracebacks at all. So I hardcoded some mdelay(1000) lines into the traceback code, and here's the top part of the oops now: http://rtr.ca/recent/2.6.23-git12-crash2.jpg And here's the .config for the kernel, just in case it's of any use here: # # Automatically generated make config: don't edit # Linux kernel version: 2.6.23-git12 # Thu Oct 18 00:28:52 2007 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_QUICKLIST=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_USER_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=16 # CONFIG_CPUSETS is not set # CONFIG_FAIR_GROUP_SCHED is not set CONFIG_SYSFS_DEPRECATED=y # CONFIG_RELAY is not set CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y CONFIG_EMBEDDED=y CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y # CONFIG_SLUB_DEBUG is ...
...
And, yes, I make that out as being this line from blk_rq_map_sg():
next_sg = sg_next(sg);
"objdump -d" output from my actual kernel:
00003ce0 <blk_rq_map_sg>:
3ce0: 55 push %ebp
3ce1: 57 push %edi
3ce2: 56 push %esi
3ce3: 53 push %ebx
3ce4: 83 ec 24 sub $0x24,%esp
3ce7: 89 44 24 04 mov %eax,0x4(%esp)
3ceb: 8b 98 44 01 00 00 mov 0x144(%eax),%ebx
3cf1: 83 e3 01 and $0x1,%ebx
3cf4: 89 5c 24 14 mov %ebx,0x14(%esp)
3cf8: 8b 52 34 mov 0x34(%edx),%edx
3cfb: c7 44 24 10 00 00 00 movl $0x0,0x10(%esp)
3d02: 00
3d03: 85 d2 test %edx,%edx
3d05: 89 54 24 1c mov %edx,0x1c(%esp)
3d09: 0f 84 f4 00 00 00 je 3e03 <blk_rq_map_sg+0x123>
3d0f: 89 cb mov %ecx,%ebx
3d11: 31 ff xor %edi,%edi
3d13: 89 4c 24 0c mov %ecx,0xc(%esp)
3d17: 8b 44 24 1c mov 0x1c(%esp),%eax
3d1b: 0f b7 48 16 movzwl 0x16(%eax),%ecx
3d1f: 8b 50 2c mov 0x2c(%eax),%edx
3d22: 89 4c 24 18 mov %ecx,0x18(%esp)
3d26: 0f b7 40 14 movzwl 0x14(%eax),%eax
3d2a: 39 c1 cmp %eax,%ecx
3d2c: 0f 8d be 00 00 00 jge 3df0 <blk_rq_map_sg+0x110>
3d32: 8d 04 49 lea (%ecx,%ecx,2),%eax
3d35: 8d 34 82 lea (%edx,%eax,4),%esi
3d38: 0f b6 44 24 14 movzbl 0x14(%esp),%eax
3d3d: 88 44 24 23 mov %al,0x23(%esp)
3d41: eb 05 jmp 3d48 <blk_rq_map_sg+0x68>
3d43: 89 f7 ...Ok, I think your picture cut off the last hex digits on the right, but what I can make out of the disassembly, I have to admit that it looks very much like it might be exactly the same thing Ingo was seeing. The disassembly is a bit scrambled, but the particular instruction that oopses for you is mov 0x10(%ebx),%eax and the instructions around it do seem to bear some similarity to the whole "sg_next()" thing, ie I see a test $0x1,%al and a lea 0x10(%ebx),%eax around there too. If you cut down the number of entries printed for the call trace (which involves editing the source code, no nice config options, I'm afraid), you could get the rest of it.. Linus -
Yeah, this particular one really should only bite you with DEBUG_PAGEALLOC. The SG code potentially _derefences_ a field past the end of the SG array, but it should be a read, and the result should never be used (it's used to calculate the pointer to past the end of the queue, so if it's used, Please do. I wouldn't be surprised if it's also related to the SG changes, but I don't think it's the exact particular bug that Ingo saw. Linus -
Your patch from this posting http://lkml.org/lkml/2007/10/17/285 does not seem to make much difference here. It still crashes at exactly the same place. -
However, Jens's patch from that same thread:
http://lkml.org/lkml/2007/10/17/269
..allowed me to boot and post this followup message from -git12
Jeff: try that one.
Patch reproduced here for convenience:
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 9e3f3cc..3935469 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1322,8 +1322,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
struct scatterlist *sglist)
{
struct bio_vec *bvec, *bvprv;
- struct scatterlist *next_sg, *sg;
struct req_iterator iter;
+ struct scatterlist *sg;
int nsegs, cluster;
nsegs = 0;
@@ -1333,7 +1333,7 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
* for each bio in rq
*/
bvprv = NULL;
- sg = next_sg = &sglist[0];
+ sg = NULL;
rq_for_each_segment(bvec, rq, iter) {
int nbytes = bvec->bv_len;
@@ -1349,8 +1349,10 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
sg->length += nbytes;
} else {
new_segment:
- sg = next_sg;
- next_sg = sg_next(sg);
+ if (!sg)
+ sg = sglist;
+ else
+ sg = sg_next(sg);
memset(sg, 0, sizeof(*sg));
sg->page = bvec->bv_page;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0c86be7..aac8a02 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -764,6 +764,8 @@ struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd *cmd, gfp_t gfp_mask)
if (unlikely(!sgl))
goto enomem;
+ memset(sgl, 0, sizeof(*sgl) * sgp->size);
+
/*
* first loop through, set initial index and return value
*/
-
That's already in my upstream kernel, here. commits ba951841ceb7fa5b06ad48caa5270cc2ae17941e and a3bec5c5aea0da263111c4d8f8eabc1f8560d7bf. sata_mv and sata_nv still reliably poop themselves here, whereas its rock solid with 2.6.23.1. Sounds like different issues from yours, as I see a stream of SATA errors on the bad kernels, errors which are often a symptom of something whacked in the DMA engine (misprogramming causes the silicon to generate bogus FIS's, which the device then chokes on) Jeff -
Do you know if this poop involves the segment padding that sometimes goes on in libata? -- Jens Axboe -
Definitely not, in this case -- it's all ATA, nothing ATAPI.
It throws SError { Handshk } which then triggers the EH to reset the
link, and so it goes, over and over :) The same thing happens when I
intentionally screw up the PRD tables. Not much more data points than
that, so far.
I'll try the SCSI_MAX_SG_SEGMENTS patch too, to see if that fixes things.
Jeff
-
On a related hardware note: FWIW, most ATA controllers have scatter/gather tables that terminate themselves by a bit in the final s/g entry. The 90% case needs to know the last scatterlist entry, at the end of the s/g walk. So however this all gets worked out, please make sure not to unduly punish libata controllers with an expensive sg_last() inside a loop, or something like that. :) Jeff -
In fact, I think you'll hit it even if you use the "for_each_sg()"
helper function. Exactly because it does the sg = sg_next(sg) at the end
of the for-loop, so it will do it for the last entry too, even though that
will access one past the end.
So it really *is* the case that every *single* use of that SG chain needs
to be switched over to only do the sg_next() when required, or sg_next()
needs to be fixed to look at the current-in-use entry (which we *can*
access) when it decides what to do about the next one (which we can *not*
access).
Moving the sg_is_chain() bit to the previous entry would work, but it
requires that nobody who could have a chained scatterlist must *ever*
access "sg->page" directly - you'd always need to use some helper function
that masks off the bit, eg
- rename "sg->page" into "sh->page_and_flag", and make it "unsigned long"
instead of a pointer.
- change every single "sg->page" access to use "sg_page(sg)" instead:
#define sg_pointer(sg) ((void *)((sg)->page_and_flag & ~1ul))
static inline struct page *sg_page(struct scatterist *sg)
{
return sg_pointer(sg);
}
- change "sg_next()" to do
static inline struct scatterlist *sg_next(struct scatterlist *sg)
{
if (sg->page_and_flag & 1)
sg = sg_pointer(sg+1)-1;
return ++sg;
}
where the magic is exactly the fact that now "sg_next()" will *not*
derefence the next SG entry unless the current one was marked
appropriately.
And then *creating* the chain obviously needs to properly mark the
next-to-last entry with the "next entry is a pointer" flag.
Big diff, it sounds like. But I don't see many alternatives. Jens?
Linus
-
Auch yes, ok nevermind that last email. There really is no way around allocating an extra 'pad' entry right now, so the SCSI_MAX_SG_SEGMENTS Big diff is not a problem for me, my primary concern would be making sure that the interface is solid. The above also has the nice side effect of making all sg elements usable, so we don't waste any. IIRC this was even debated months ago when I first proposed sg chaining, I shied away from it because I thought it would seem huge and too invasive. Ah well. I'll get digging on this. -- Jens Axboe -
Just to confirm - is this with the added memset() in drivers/scsi/scsi_lib.c? Or did you get this oops before that patch? Linus -
yes, Jens's last patch was applied - apparently this ATA crash was masked by the other one. I got one successful bootup after applying Jens's patch and it has crashed on bootup twice since then. Ingo -
Hi Jens, I just hit what I believe is the same bug, though with a slightly different OOPS (I see a NULL pointer dereference, see below). The patches fixes the bug, and - as Linus suggested - the piece touching scsi_lib.c is enough. The OOPS: scsi4 : ata_piix scsi5 : ata_piix ata5: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0xffa0 irq 14 ata6: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xffa8 irq 15 sda1 sda2 sda3 < sda5 sda6 > sd 0:0:0:0: [sda] Attached SCSI disk ata5.00: ATAPI: MATSHITADVD-RAM UJ-850S, 1.22, max UDMA/33 Unable to handle kernel NULL pointer dereference at 0000000000000020 RIP: [<ffffffff802d275e>] blk_rq_map_sg+0xf6/0x16a PGD 4711067 PUD 508c067 PMD 0 Oops: 0000 [1] PREEMPT SMP CPU 0 Modules linked in: sd_mod ohci1394 ehci_hcd ata_piix ieee1394 ahci uhci_hcd libata scsi_mod usbcore thermal processor fan Pid: 0, comm: swapper Not tainted 2.6.23-ge6d5a11d #3 RIP: 0010:[<ffffffff802d275e>] [<ffffffff802d275e>] blk_rq_map_sg+0xf6/0x16a RSP: 0018:ffffffff804eccd8 EFLAGS: 00010087 RAX: 0000000005331000 RBX: ffff810004740220 RCX: ffff810001000000 RDX: 0000000000000000 RSI: ffff8100052d4f50 RDI: 0000000005142c00 RBP: 0000000000001400 R08: 0000000000000000 R09: ffff8100052a3980 R10: 0000000005143000 R11: 0000000000000400 R12: ffff8100047a4000 R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffffffff8049a000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000020 CR3: 0000000005118000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff804ac000, task ffffffff8046c340) Stack: ffffffff88041834 000000010314ba00 ffff81000477fa80 ffff81000503dd80 ffff81000503dd80 ffff81000517fc00 000000000000001e 0000000000001d4c ffffffff8804193c 0000000015925849 ffff81000473f800 ffff81000503dd80 Call Trace: <IRQ> ...
Ok, I think that one-liner fixes the real bug.
But I think the rest of your changes are simply bad.
The fix to block/ll_rw_block.c should likely be something like the
appended instead:
- remove the "memset()" you had added earlier. It's bogus. It cannot be
the right thing. If the sg list wasn't initialized correctly much
earlier, trying to initialize it late is pointless - it contains crap.
- the old code was fine, but let's initialize "sg" to NULL to make it
clear that the initial value of sg is pointless, and only "next_sg"
matters (since sg had better be assigned from that).
Hmm?
Linus
---
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 9e3f3cc..54d974e 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1333,7 +1333,8 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
* for each bio in rq
*/
bvprv = NULL;
- sg = next_sg = &sglist[0];
+ sg = NULL;
+ next_sg = &sglist[0];
rq_for_each_segment(bvec, rq, iter) {
int nbytes = bvec->bv_len;
@@ -1352,7 +1353,6 @@ new_segment:
sg = next_sg;
next_sg = sg_next(sg);
- memset(sg, 0, sizeof(*sg));
sg->page = bvec->bv_page;
sg->length = nbytes;
sg->offset = bvec->bv_offset;
-
It's required to clear output members (like dma_len and so on), since If you prefer the old next_sg approach to my alternative, that is fine with me. But we do need the memset(). -- Jens Axboe -
No it's NOT! The whole point here is that the sg had *already* better be cleared. If it wasn't cleared before, that's a bug regardless of anything else. So a memset() is guaranteed to be either a no-op or hiding another bug! So yes, those members had better be zero. It's just that they had better be zero long before that memset! The memset should have been done when Really? Explain why. Entering that code with a non-initialized SG list is a bug to begin with. Linus -
For the chain elements - yes, definitely! But we also want to clear dma mapping output values, at least sparc64 wants that. You could argue that the IOMMU code should be fixed up, but I don't think we should mix the two. So we need the memset() in blk_rq_map_sg() AS WELL AS the initial sg Not for the output members, they will be filled AFTER blk_rq_map_sg() returns and someone use iommu_map_sg() to re-iterate the sglist and map the pages appropriately. -- Jens Axboe -
I still don't see your argument. The SG table is *already*zeroed*. The fact that output values will be changed later is irrelevant. They haven't been changed *before* - or at least you haven't given a reasonable explanation for why they should have. So explain to me why the memset() in blk_rq_map_sg() helps, considering that the memory must have been zeroed at allocation time anyway? Tell me what it is that fills any of the fields we don't already initialize, and that can happen *before* that memset? Linus -
Ah ok, I see why you are confused. The SCSI case is one, it allocs and frees the sg table each time. The entries are thus always initialized when they end up in blk_rq_map_sg(). However, other drivers allocate one at driver init time and use that one all the time. It needs to be initially cleared, but it wont be cleared before each request mapping. And that is fine, as long as we do clear entries in blk_rq_map_sg(). -- Jens Axboe -
Ahh, ok, thanks for clearing that up. Linus -
built and booted your patch (removed Jens's) but it still crashes in blk_rq_map_sg() - see below. Should i have left something from Jens's patch? Ingo -------------------> [ 34.698199] VFS: Mounted root (ext3 filesystem) readonly. [ 34.703917] Freeing unused kernel memory: 372k freed [ 34.746609] BUG: unable to handle kernel paging request at virtual address 6e616872 [ 34.754106] printing eip: 7840503b *pde = 00000000 [ 34.758960] Oops: 0000 [#1] DEBUG_PAGEALLOC [ 34.763120] [ 34.764594] Pid: 1, comm: swapper Not tainted (2.6.23 #7) [ 34.769965] EIP: 0060:[<7840503b>] EFLAGS: 00010006 CPU: 0 [ 34.775429] EIP is at blk_rq_map_sg+0xbb/0x170 [ 34.779845] EAX: 3ffcd000 EBX: 7c88a68c ECX: 3ffce000 EDX: 007ff9a0 [ 34.786085] ESI: 6e616862 EDI: 798ea9a0 EBP: 00001000 ESP: 7b421bf4 [ 34.792324] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 34.797697] Process swapper (pid: 1, ti=7b420000 task=7b416000 task.ti=7b420000) [ 34.804890] Stack: 00000046 7b520000 00002000 3ffcf000 7c88ac80 00000001 00000001 00000001 [ 34.813209] 7c88a600 01000000 7c88ab00 7b521d1c 7c88a708 7b520000 784c7ac5 7b520000 [ 34.821529] 784c75ca 7c88a708 7b524ce4 7b521d1c 7c88a708 784e4570 7b416000 7813661b [ 34.829847] Call Trace: [ 34.832448] [<784c7ac5>] scsi_init_io+0x55/0xe0 [ 34.837042] [<784c75ca>] scsi_get_cmd_from_req+0x2a/0x40 [ 34.842414] [<784e4570>] sd_prep_fn+0x80/0x940 [ 34.846920] [<7813661b>] __lock_acquire+0x4ab/0xe20 [ 34.851860] [<781450a6>] add_to_page_cache+0x66/0xb0 [ 34.856887] [<78404633>] elv_dispatch_sort+0x23/0xe0 [ 34.861912] [<784041a0>] elv_next_request+0xa0/0x130 [ 34.866938] [<784c8c04>] scsi_request_fn+0x1e4/0x370 [ 34.871965] [<78120f02>] del_timer+0x62/0x70 [ 34.876298] [<78407445>] __generic_unplug_device+0x25/0x30 [ 34.881844] [<78407715>] generic_unplug_device+0x15/0x30 [ 34.887217] [<78404e00>] blk_backing_dev_unplug+0x0/0x10 [ 34.892590] [<78404e0c>] ...
The *real* fix in Jens' patch (and the only thing that really matters) is the one to drivers/scsi/scsi_lib.c. The rest is all fluff. Linus -
too late, crashed my box with it already :-) Ingo -
Sorry about that, please try the next one that includes the scsi_lib.c one liner to clear the sg table on alloc :-) -- Jens Axboe -
with that patch it not crashes on NULL dereference - see crashlog below. Compiler bug perhaps? Ingo ----------------> [ 34.605614] EXT3-fs: INFO: recovery required on readonly filesystem. [ 34.611842] EXT3-fs: write access will be enabled during recovery. [ 34.635861] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000008 [ 34.644227] printing eip: 7840840e *pde = 00000000 [ 34.649081] Oops: 0002 [#1] DEBUG_PAGEALLOC [ 34.653239] [ 34.654713] Pid: 1, comm: swapper Not tainted (2.6.23 #3) [ 34.660086] EIP: 0060:[<7840840e>] EFLAGS: 00010046 CPU: 0 [ 34.665548] EIP is at blk_rq_map_sg+0x8e/0x190 [ 34.669965] EAX: 00000000 EBX: 7c885180 ECX: 00000004 EDX: 033b6000 [ 34.676205] ESI: 00001000 EDI: 00000008 EBP: 00000000 ESP: 7b4219b8 [ 34.682444] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 34.687818] Process swapper (pid: 1, ti=7b420000 task=7b416000 task.ti=7b420000) [ 34.695010] Stack: 7b521d38 00000008 7c884000 7b520000 00002000 033b6000 7c885080 00000001 [ 34.703329] 00000001 7c885880 01000002 7c886e00 7b521d1c 7c8857c4 7b520000 784c7ae5 [ 34.711649] 7b520000 784c75ea 7c8857c4 7b524ce4 7b521d1c 7c8857c4 784e4590 7b416000 [ 34.719968] Call Trace: [ 34.722568] [<784c7ae5>] scsi_init_io+0x55/0xe0 [ 34.727161] [<784c75ea>] scsi_get_cmd_from_req+0x2a/0x40 [ 34.732534] [<784e4590>] sd_prep_fn+0x80/0x940 [ 34.737041] [<7813661b>] __lock_acquire+0x4ab/0xe20 [ 34.741981] [<78135c9c>] trace_hardirqs_on+0x9c/0xb0 [ 34.747007] [<78770e30>] _spin_unlock_irq+0x20/0x30 [ 34.751946] [<78404633>] elv_dispatch_sort+0x23/0xe0 [ 34.756973] [<784041a0>] elv_next_request+0xa0/0x130 [ 34.761999] [<784c8c24>] scsi_request_fn+0x1e4/0x370 [ 34.767025] [<78120f02>] del_timer+0x62/0x70 [ 34.771358] [<784072d5>] __generic_unplug_device+0x25/0x30 [ 34.776905] [<784075a5>] generic_unplug_device+0x15/0x30 [ 34.782278] [<78404e0c>] blk_backing_dev_unplug+0xc/0x10 [ 34.787650] ...
I really don't think this should work. Doing "sg_next()" on a valid sg is *always* ok. So if the old code didn't work, then "sg" wasn't valid to start with (and the code *after* the sg_next() would have oopsed even if you try to avoid using sg_next. So avoiding the "sg_next()" on the last entry is pointless. Also, your patch makes the code almost totally unreadable, with that subtle issue of the "if (bvprv && cluster)" case not triggering on the first case, so the NULL initial sg is "safe". So at a guess, I think the *real* problem is simply that the passed-in sglist was just too small. What guarantees that the sg list allocation (apparently done by scsi_alloc_sgtable()) is big enough? If I read things right, scsi_alloc_sgtable() will allocate "cmd->use_sg" SG enties, no? But I also notice that it does not seem to initialize the SG allocation, so those SG entries contain random crap - including, perhaps, a random - and bogus - chain pointer in sg->page.. Yes, we set sh->page *if* we create a chain, but if we don't chain, we leave the old random contents around which in turn may include old and stale chain pointers. Or am I missing something? So when you added that "memset(sg, 0, sizeof(*sg))" into blk_rq_map_sg(), you did it way too late - it needs to be done when the sg chain is allocated, and for every entry (and then the "link" entry needs to be linked in separately) I think. Linus -
Yeah, I didn't quite understand why if sg was valid, why dereferencing Hmm I think it's quite readable, but perhaps that's just me :-). The first is much cleaner, and the last part just reads 'if sg is not set Right, we allocate an sgtable that will hold ->use_sg entries, which contains request->nr_phys_segments. And that should definitely fit. Regarding the init of the sglist, that was the revert I was talking about. We do need that memset() in there, so all those sg entries will Yep, and that is what Ingo did test as well and it worked. For that case, now libata is crapping out elsewhere in sg_next(). -- Jens Axboe -
Actually, I take that back. If 'sg' is the last entry in a *non*linked scatter-gather list (ie we don't use the last entry as a link, we actually use it as a real SG entry), then "sg_next(sg)" will indeed access past the end of the whole allocated array, and will access one past the end. And with page-alloc debugging, that *will* blow up. So I think your change to use "sg_next()" only when you actually need a next pointer is the correct one after all. Linus -
Thanks, so I'm not totally crazy :-) Can you just pull: git://git.kernel.dk/data/git/linux-2.6-block.git for-linus then so we get those two pieces correct? Then the remaining issue seems to be a new one that is biting Ingo elsewhere, at least we'll all be on the same page then. -- Jens Axboe -
Jens, for_each_sg still calls sg_next on the last entry which will
dereference a possibly bogus sg->page (for the sg_is_chain(sg)
condition in sg_next) if the last entry is the last one on the page
of unchained entry and sg+1 falls over into an uninitialized page.
How about the following?
(untested yet.
sg.c included here as an example for usage out of scatterlist.h)
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 2dc7464..3a27e03 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -30,7 +30,7 @@ static inline void sg_init_one(struct scatterlist *sg, const void *buf,
((struct scatterlist *) ((unsigned long) (sg)->page & ~0x01))
/**
- * sg_next - return the next scatterlist entry in a list
+ * sg_next_unsafe - return the next scatterlist entry in a list
* @sg: The current sg entry
*
* Usually the next entry will be @sg@ + 1, but if this sg element is part
@@ -41,7 +41,7 @@ static inline void sg_init_one(struct scatterlist *sg, const void *buf,
* the current entry, this function will NOT return NULL for an end-of-list.
*
*/
-static inline struct scatterlist *sg_next(struct scatterlist *sg)
+static inline struct scatterlist *sg_next_unsafe(struct scatterlist *sg)
{
sg++;
@@ -51,11 +51,27 @@ static inline struct scatterlist *sg_next(struct scatterlist *sg)
return sg;
}
+/**
+ * sg_next - return the next scatterlist entry in a list
+ * @sg: The current sg entry
+ * @next: Index of next sg entry
+ * @nr: Number of sg entries in the list
+ *
+ * Note that the caller must ensure that there are further entries after
+ * the current entry, this function will NOT return NULL for an end-of-list.
+ *
+ */
+static inline struct scatterlist *sg_next(struct scatterlist *sg,
+ int next, int nr)
+{
+ return next < nr ? sg_next_unsafe(sg) : NULL;
+}
+
/*
* Loop over each sg ...Things have progressed a lot since, see my recent posting based on Davem's proposal. Will post another patch soonish, that is also tested. -- Jens Axboe -
From: Jens Axboe <jens.axboe@oracle.com>
One core issue here is that we need to decide whether this thing to be
iterated like an array or like a linked list. It's trying to be both.
If we decide upon a looping construct for consumers and stick to it,
we'll be in much better shape than we are now and bugs will be eaiser
to spot. It would be so much simpler to audit if all we saw in the
consumers were things like:
while (sg) {
do_stuff(sg);
sg = sg_next(sg);
}
I would suggest that we just get it over with and convert the whole
tree now rather than trying to do this kind of thing in stages.
Because then we can say that ever scatterlist creator has to set
the "end" bit and therefore you use well established patterns
for scatterlist iteration such as "traverse sg_next() until NULL"
as shown above.
I also noticed that there is the issue of on-stack and embedded
scatterlist users. We'll need some sort of "DECLARE_SCATTERLIST"
and a "scatterlist_init()" thing so that we can keep DEBUG_SG
working even in those cases. But for all I know Jens could be
working on that already :-)
The only other real option if we don't convert the whole tree now to
the "end" marker stuff, is to enforce that every scatterlist iterator
only traverse the number of entries there were told are in the one
given to them.
-
The above should work now, PROVIDED that the end has been marked
properly. I have added some helpers to init an sglist (or sg entry).
You can still use
for_each_sg()
and pass in the number of entries, that'll work even with an end marker.
Heh, got some of it covered! The stack usage is generally just converted
to either use sg_init_one() if we can, or sg_init_table(). Adding some
soft of DECLARE_SCATTERLIST() sounds like a good idea, though. I'll take
Driver that do all their own sg management typically still use a manual
loop over the number of entries they know are there. I didn't change
those, and the original sg chaining didn't convert those either. Some
manual labor is required in utilizing chained sg lists, unless you are
part of a subsystem (like SCSI) that allocates and inits themf or you.
So I still think that baby steps are a good idea - only convert things
that matter. Leave the rest to janitors or others willing to dive in.
--
Jens Axboe
-
That still leaves the initialization issue. The link pointers need to all be initialized at SG allocation time (and not just the last one for the case where it's a linked entry). And if you initialize them at allocation time, does an allocation ever get re-used without being free'd in between (if it does, you do indeed need to initialize the non-link pointers each time - but I don't see it being the case, so ..) Linus -
The links will not change, so reuse should actually be OK. The reuse cannot be larger than the original would hold, but that would be a bug without chaining as well of course. WRT reuse, if we end up requeuing a request, than we will go through the whole thing again. I don't think Ingo is hitting that either, though. -- Jens Axboe -
find below the disassembled code. Here's the faulting source line
according to gdb:
(gdb) list *0x78406669
0x78406669 is in blk_rq_map_sg (include/linux/scatterlist.h:48).
43 */
44 static inline struct scatterlist *sg_next(struct scatterlist *sg)
45 {
46 sg++;
47
48 if (unlikely(sg_is_chain(sg)))
49 sg = sg_chain_ptr(sg);
50
51 return sg;
52 }
(gdb) list *0x78406673
0x78406673 is in blk_rq_map_sg (block/ll_rw_blk.c:1355).
1350 } else {
1351 new_segment:
1352 sg = next_sg;
1353 next_sg = sg_next(sg);
1354
1355 sg->page = bvec->bv_page;
1356 sg->length = nbytes;
1357 sg->offset = bvec->bv_offset;
1358 nsegs++;
1359 }
(gdb)
the compiler is gcc-4.2.2. (vanilla, built from sources)
Ingo
784065b0 <blk_rq_map_sg>:
784065b0: 55 push %ebp
784065b1: 57 push %edi
784065b2: 56 push %esi
784065b3: 53 push %ebx
784065b4: 83 ec 28 sub $0x28,%esp
784065b7: 89 44 24 04 mov %eax,0x4(%esp)
784065bb: 8b 98 08 01 00 00 mov 0x108(%eax),%ebx
784065c1: 83 e3 01 and $0x1,%ebx
784065c4: 89 5c 24 18 mov %ebx,0x18(%esp)
784065c8: 8b 52 3c mov 0x3c(%edx),%edx
784065cb: c7 44 24 14 00 00 00 movl $0x0,0x14(%esp)
784065d2: 00
784065d3: 85 d2 test %edx,%edx
784065d5: 89 54 24 20 mov %edx,0x20(%esp)
784065d9: 0f 84 fc 00 00 00 je 784066db <blk_rq_map_sg+0x12b>
784065df: 89 ce mov %ecx,%esi
784065e1: 31 d2 xor %edx,%edx
784065e3: 89 4c 24 10 mov %ecx,0x10(%esp)
784065e7: 8b 44 24 20 mov 0x20(%esp),%eax
784065eb: 0f b7 58 1a ...