Re: [PATCH] block: fix residual byte count handling

Previous thread: [PATCH 2/2] x86 : relocate uninitialized variable in init DATA section into init BSS section by Huang, Ying on Thursday, February 21, 2008 - 1:15 am. (8 messages)

Next thread: [PATCH] UDF: fix anchor point detection by Pavel Emelyanov on Thursday, February 21, 2008 - 1:55 am. (1 message)
From: Mike Galbraith
Date: Thursday, February 21, 2008 - 1:42 am

Greetings,

K3b recently (9a4c854..5d9c4a7 pull) began terminally griping about
buffer underrun upon every attempt to burn a CD.  I can't fully bisect
the problem  because intervening kernels hang soft during boot.  Using
git bisect visualize, and converting to postable text:

bisect/bad   block: add request->raw_data_len (6b00769fe1502b4ad97bb327ef7ac971b208bfb5)
bisect           block: update bio according to DMA alignment padding (40b01b9bbdf51ae543a04744283bf2d56c4a6afa)
libata: update ATAPI overflow draining
bisect/good-e164094964e6e20fe7fce418e06a9dce952bb7a4

Serial console log of hung kernel 40b01b9bbdf51ae543a04744283bf2d56c4a6afa below

[    0.000000] Linux version 2.6.25-rc2-smp (root@homer) (gcc version 4.2.1 (SUSE Linux)) #14 SMP PREEMPT Thu Feb 21 08:49:51 CET 2008
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[    0.000000]  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
[    0.000000]  BIOS-e820: 000000003fff0000 - 000000003fff3000 (ACPI NVS)
[    0.000000]  BIOS-e820: 000000003fff3000 - 0000000040000000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
[    0.000000] 0MB HIGHMEM available.
[    0.000000] 1023MB LOWMEM available.
[    0.000000] Scan SMP from b0000000 for 1024 bytes.
[    0.000000] Scan SMP from b009fc00 for 1024 bytes.
[    0.000000] Scan SMP from b00f0000 for 65536 bytes.
[    0.000000] found SMP MP-table at [b00f5320] 000f5320
[    0.000000] Zone PFN ranges:
[    0.000000]   DMA             0 ->     4096
[    0.000000]   Normal       4096 ->   262128
[    0.000000]   HighMem    262128 ->   262128
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[1] active PFN ranges
[    0.000000]     0:        0 ->   262128
[    0.000000] DMI 2.3 present.
[    ...
From: Jens Axboe
Date: Friday, February 22, 2008 - 12:32 am

-- 
Jens Axboe

--

From: Mike Galbraith
Date: Saturday, February 23, 2008 - 12:42 am

<crickets chirping>  He must be off having a life or something ;-)

 Meanwhile back at the ranch, reverting
6b00769fe1502b4ad97bb327ef7ac971b208bfb5
40b01b9bbdf51ae543a04744283bf2d56c4a6afa and the one entangled line from
dde2020754aeb14e17052d61784dcb37f252aac2 did restore my burner.

	-Mike



--

From: Mike Galbraith
Date: Sunday, February 24, 2008 - 12:54 am

It looks like the reason for boot failure with
40b01b9bbdf51ae543a04744283bf2d56c4a6afa may be that one hunk of
6b00769fe1502b4ad97bb327ef7ac971b208bfb5 was supposed to land in
40b01b9bbdf51ae543a04744283bf2d56c4a6afa (per comment);

diff --git a/block/blk-map.c b/block/blk-map.c
index a7cf63c..09f7fd0 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -154,6 +155,7 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
 
 		bio->bi_io_vec[bio->bi_vcnt - 1].bv_len += pad_len;
 		bio->bi_size += pad_len;
+		rq->data_len += pad_len;
 	}
 
 	rq->buffer = rq->data = NULL;


Something else looks funny with
6b00769fe1502b4ad97bb327ef7ac971b208bfb5, did something go missing?

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 135c1d0..ba21d97 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1014,10 +1014,6 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
 	}
 
 	req->buffer = NULL;
-	if (blk_pc_request(req))
-		sdb->length = req->data_len;
-	else
-		sdb->length = req->nr_sectors << 9;
 
 	/* 
 	 * Next, walk the list, and fill in the addresses and sizes of  <== here
@@ -1026,6 +1022,10 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
 	count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
 	BUG_ON(count > sdb->table.nents);
 	sdb->table.nents = count;
+	if (blk_pc_request(req))
+		sdb->length = req->data_len;
+	else
+		sdb->length = req->nr_sectors << 9;
 	return BLKPREP_OK;
 }
 




--

From: Mike Galbraith
Date: Tuesday, February 26, 2008 - 2:48 am

Greetings,

I straced both a good and a bad kernel (good being .git with attached
revert patch applied) and filtered/diffed/merged the output.  Scroll
down to "HERE" to see the problem (resid).

I'm poking around, but not having much luck.

--- good	2008-02-26 09:11:08.000000000 +0100
+++ bad	2008-02-26 09:03:44.000000000 +0100
@@ -1,48 +1,44 @@
 open("/dev/sr0", O_RDWR|O_NONBLOCK) = 3
 fcntl64(3, F_GETFL)         = 0x8802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE)
 fcntl64(3, F_SETFL, O_RDWR|O_LARGEFILE) = 0
-ioctl(3, CDROMAUDIOBUFSIZ or SCSI_IOCTL_GET_IDLUN, 0xaf8d9194) = 0
-ioctl(3, SCSI_IOCTL_GET_BUS_NUMBER, 0xaf8d9190) = 0
-ioctl(3, SG_GET_VERSION_NUM, 0xaf8d9198) = 0
+ioctl(3, CDROMAUDIOBUFSIZ or SCSI_IOCTL_GET_IDLUN, 0xafa1a2d4) = 0
+ioctl(3, SCSI_IOCTL_GET_BUS_NUMBER, 0xafa1a2d0) = 0
+ioctl(3, SG_GET_VERSION_NUM, 0xafa1a2d8) = 0
 write(2, "Linux sg driver version: 3.5.27\n", 32Linux sg driver version: 3.5.27
 ) = 32
-ioctl(3, CDROMAUDIOBUFSIZ or SCSI_IOCTL_GET_IDLUN, 0xaf8d9134) = 0
-ioctl(3, SCSI_IOCTL_GET_BUS_NUMBER, 0xaf8d9130) = 0
-ioctl(3, SG_SET_TIMEOUT, 0xaf8d9030) = 0
-fstat64(3, {st_dev=makedev(0, 13), st_ino=4758, st_mode=S_IFBLK|0640, st_nlink=1, st_uid=0, st_gid=6, st_blksize=4096, st_blocks=0, st_rdev=makedev(11, 0), st_atime=2008/02/26-08:45:17, st_mtime=2008/02/26-08:45:17, st_ctime=2008/02/26-08:45:17}) = 0
+ioctl(3, CDROMAUDIOBUFSIZ or SCSI_IOCTL_GET_IDLUN, 0xafa1a274) = 0
+ioctl(3, SCSI_IOCTL_GET_BUS_NUMBER, 0xafa1a270) = 0
+ioctl(3, SG_SET_TIMEOUT, 0xafa1a170) = 0
+fstat64(3, {st_dev=makedev(0, 13), st_ino=4572, st_mode=S_IFBLK|0640, st_nlink=1, st_uid=0, st_gid=6, st_blksize=4096, st_blocks=0, st_rdev=makedev(11, 0), st_atime=2008/02/26-09:36:43, st_mtime=2008/02/26-09:36:43, st_ctime=2008/02/26-09:36:43}) = 0
 geteuid32()                 = 0
 getuid32()                  = 0
 write(1, "Using libscg version \'schily-0.9"..., 35) = 35
 write(1, "Driveropts: \'burnfree\'\n", 23) = 23
-ioctl(3, SG_GET_RESERVED_SIZE, 0xaf8d93d4) = 0
-ioctl(3, ...
From: Mike Galbraith
Date: Tuesday, February 26, 2008 - 6:36 am

Seems the problem is data_len changes, but raw_data_len doesn't.  I've
not the foggiest IO-land clue, but k3b works again, so the below may
have some diagnostic value.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ba21d97..7a6f784 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -871,7 +871,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 			scsi_end_bidi_request(cmd);
 			return;
 		}
-		req->data_len = scsi_get_resid(cmd);
+		req->data_len = req->raw_data_len = scsi_get_resid(cmd);
 	}
 
 	BUG_ON(blk_bidi_rq(req)); /* bidi not support for !blk_pc_request yet */

	-Mike

--

From: Andrew Morton
Date: Tuesday, February 26, 2008 - 4:08 pm

cc's added.



Thanks.
--

From: Mike Galbraith
Date: Tuesday, February 26, 2008 - 7:24 pm

Yeah, it fixes the problem.  (wrt recap, if I could write it, it would
be a changelog;)

	-Mike

--

From: Mike Galbraith
Date: Tuesday, February 26, 2008 - 11:00 pm

Hm.  After rummaging around some more in both kernel and userland, I
think this patchlet is not only functional, but  (random accident)
technically correct.  What the heck, let's see if it flies...

snippet from userland:
/*
 * Return the residual DMA count for last command.
 * If this count is < 0, then a DMA overrun occured.
 */
EXPORT int
scg_getresid(scgp)
	SCSI	*scgp;
{
	return (scgp->scmd->resid);
}

This function is used all over the place in cdrecord to determine
transfer size.

(patchlet takes wing, and... goes splat?)
 
Fix CD burning regression introduced by
6b00769fe1502b4ad97bb327ef7ac971b208bfb5.  raw_data_len must be updated
to reflect residual data upon IO completion because it is used by
blk_complete_sghdr_rq() to set hdr->resid which eventually becomes
visible to userland.

Signed-off-by: Mike Galbraith <efault@gmx.de>

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ba21d97..7a6f784 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -871,7 +871,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd,
unsigned int good_bytes)
 			scsi_end_bidi_request(cmd);
 			return;
 		}
-		req->data_len = scsi_get_resid(cmd);
+		req->data_len = req->raw_data_len = scsi_get_resid(cmd);
 	}
 
 	BUG_ON(blk_bidi_rq(req)); /* bidi not support for !blk_pc_request yet
*/


	-Mike

--

From: Mike Galbraith
Date: Wednesday, February 27, 2008 - 12:07 am

Bugger, went splat... forgot preformat for patchlet insert.  <quiltuple
checks>
 
Fix CD burning regression introduced by
6b00769fe1502b4ad97bb327ef7ac971b208bfb5.  raw_data_len must be updated
to reflect residual data upon IO completion because it is used by
blk_complete_sghdr_rq() to set hdr->resid which eventually becomes
visible to userland.

Signed-off-by: Mike Galbraith <efault@gmx.de>

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ba21d97..7a6f784 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -871,7 +871,7 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 			scsi_end_bidi_request(cmd);
 			return;
 		}
-		req->data_len = scsi_get_resid(cmd);
+		req->data_len = req->raw_data_len = scsi_get_resid(cmd);
 	}
 
 	BUG_ON(blk_bidi_rq(req)); /* bidi not support for !blk_pc_request yet */


--

From: Tejun Heo
Date: Thursday, February 28, 2008 - 12:43 am

Hello, all.

Sorry about the delay.  Was buried under other stuff.  Mike, thanks a
lot for reporting and analyzing the problem; however, the patch is
slightly incorrect.  rq->data_len is rq->data_len + extra stuff for
alignment and padding, so the correct thing to do is...

  req->raw_data_len -= req->data_len - scsi_get_resid(cmd);
  req->data_len = scsi_get_resid(cmd);

which is ugly and error-prone.  In addition, this isn't the only place
where resid is set.  Other block drivers do this too.  This definitely
should be done in block layer.

With rq->data_len and rq->raw_data_len, it's impossible to translate
resid of rq->data_len to resid of rq->raw_data_len as block layer
doesn't know how much was extra data after rq->data_len is modified.
The attached patch substitutes rq->raw_data_len w/ rq->extra_len and
adds blk_rq_raw_data_len().  Things look cleaner this way and the resid
problem should be solved with this.

Can you please verify the attached patch fixes the problem?

Thanks.

-- 
tejun
From: Mike Galbraith
Date: Thursday, February 28, 2008 - 1:20 am

From: Tejun Heo
Date: Thursday, February 28, 2008 - 1:50 am

rq->raw_data_len introduced for block layer padding and draining
(commit 6b00769fe1502b4ad97bb327ef7ac971b208bfb5) broke residual byte
count handling.  Block drivers modify rq->data_len to notify residual
byte count to the block layer which blindly reported unmodified
rq->raw_data_len to userland.

To keep block drivers dealing only with rq->data_len, this should be
handled inside block layer.  However, how much extra buffer was
appened is lost after rq->data_len is modified.

This patch replaces rq->raw_data_len with rq->extra_len and add
blk_rq_raw_data_len() helper to calculate raw data size from
rq->data_len and rq->extra_len.  The helper returns correct raw
residual byte count when called on a rq whose data_len is modified to
carry residual byte count.

This problem was reported and diagnosed by Mike Galbraith.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
---
 block/blk-core.c          |    3 +--
 block/blk-map.c           |    2 +-
 block/blk-merge.c         |    1 +
 block/bsg.c               |    8 ++++----
 block/scsi_ioctl.c        |    4 ++--
 drivers/ata/libata-scsi.c |    3 ++-
 include/linux/blkdev.h    |    8 +++++++-
 7 files changed, 18 insertions(+), 11 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 775c851..929ab61 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -127,7 +127,7 @@ void rq_init(struct request_queue *q, struct request *rq)
 	rq->nr_hw_segments = 0;
 	rq->ioprio = 0;
 	rq->special = NULL;
-	rq->raw_data_len = 0;
+	rq->extra_len = 0;
 	rq->buffer = NULL;
 	rq->tag = -1;
 	rq->errors = 0;
@@ -2016,7 +2016,6 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 	rq->hard_cur_sectors = rq->current_nr_sectors;
 	rq->hard_nr_sectors = rq->nr_sectors = bio_sectors(bio);
 	rq->buffer = bio_data(bio);
-	rq->raw_data_len = bio->bi_size;
 	rq->data_len = bio->bi_size;
 
 	rq->bio = rq->biotail = bio;
diff --git a/block/blk-map.c b/block/blk-map.c
index ...
From: Jens Axboe
Date: Thursday, February 28, 2008 - 8:35 am

Tejun, this patch isn't much cleaner at all. It really shows the pain of
these two seperate, yet related, variables.

-- 
Jens Axboe

--

From: Tejun Heo
Date: Thursday, February 28, 2008 - 8:46 am

Not much cleaner compared to what?  I think padding stuff is bound to be
somewhat complex.  It's a nasty thing in nature.  I think ->extra_len is
better than ->raw_data_len because ->extra_len only needs to be updated
where the dirty jobs are done and extra buffer areas are added.  Any
better suggestions?

Thanks.

-- 
tejun
--

From: James Bottomley
Date: Friday, February 29, 2008 - 9:47 am

Well, I just investigated a bug report in the SCSI transport class.  Our
SMP handler is broken in exactly the same way.  We rely on the incoming
reported request lengths to size our request data, and they've blown up
from the true length to 512 bytes (the size of our alignment).

With the original patch, I have to run through the whole of libsas and
scsi_transport_sas doing

s/data_len/raw_data_len/

With your update it looks like I have to run through them all doing

s/data_len/data_len - extra_len/

which is even worse.  Can't we put things back to a point where data_len
means exactly that and extra_len means how much we have spare on the
end, so you know you can DMA up to data_len + extra_len if need be?

That way we don't have to sweep through every block driver altering the
way it uses data_len.

James


--

From: Jens Axboe
Date: Friday, February 29, 2008 - 1:11 pm

Fully agree. The reason why I think it's so ugly is that you have to
keep these two seperate variables in sync. The burning was just one bug,
there will be others...

-- 
Jens Axboe

--

From: Tejun Heo
Date: Friday, February 29, 2008 - 11:17 pm

Hello, Jens, James.



If SMP is broken because it needs start address alignment but not
padding to align the size, what should be done is to make that exact
requirement visible to the block layer.  Say,
blk_queue_dma_start_alignment() or maybe change
blk_queue_dma_alignment() such that it only indicates start address
alignment and add blk_queue_dma_size_alignment() for drivers which
require size to be aligned too.  I think those are few.

I think the decision which value rq->data_len represents comes down to
which size is used more in low level drivers because no matter which way
we choose we'll have to update some of the drivers which expects the
other thing from rq->data_len.

blk_rq_raw_data_len() is needed iff a driver needs dummy buffers
attached at the end and still needs to know the original request size

The posted modification isn't too bad as the maintenance of the two
variables is at places where the nasty things happen.  I think what
rq->data_len should represent when seen from LLDs is more important and
please note that if SMP is broken because it simply doesn't require
512byte size alignment, it's a different issue.

As long as both raw_data_len and data_len are accessible, I'm okay
either way.  My biggest reluctance is against breaking sum(sg) ==
rq->data_len.  I think this can lead to much more subtle problems such
as programming the controller w/ wrong bytes count and wrapped-around
resid calculation.

Thanks.

-- 
tejun
--

From: James Bottomley
Date: Saturday, March 1, 2008 - 8:19 am

I know we *could* sweep through all the block drivers altering them; my
point is that I don't think we *should*.  Fundamentally, every driver
that cares is assuming req->data_len is the length of the request that
came down.  The fact that it got padded is irrelevant (and actually
detrimental) to most of them as the SMP driver illustrates.

We use a high dma_alignment not because we care about padding, but
because we want to avoid scatter gather.  So we care about alignment of
the start of the buffer (to avoid sg), but fundamentally, we need to
know what its true length (not its padded length) is.  The true length
feeds into the smp frame size and is checked by the interfaces, which is
why the changes caused an SMP failure.

Just for the principle of least surprise, can we not keep req->data_len
what it has always been, namely the true data length of the request and
express the fact that we've padded it by req->extra_len or something, so

But this is true of *every* current user of the block layer apart from
IDE ... we all care about alignment not padding.  Any current user that
actually cares about padding will be doing their own adjustments, so
they need changing anyway.

We can frame that with a different API, but blk_queue_dma_alignment()

Right, and currently, apart from IDE, they all want it to mean the true


But we still have to find all the bugs this causes in all the block

OK, so can we go back to data_len being the true value and add an
extra_len for drivers who want to know where the padding lies?

James


--

From: FUJITA Tomonori
Date: Sunday, March 2, 2008 - 7:52 am

On Sat, 01 Mar 2008 15:17:32 +0900

sum(sg) == rq->data_len is already broken; sg sends such requests
(though it would be nice if it doesn't).

I've not followed the earlier discussion (because I thought the drain
buffer stuff affected only libata but seems it doesn't ...). Why did
we need to change the meaning of rq->data_len?

rq->data_len meant the true data length and the patch to change it
doesn't look to make anything simple. Can we revert the meaning of
rq->data_len? I'm not sure that we need to add rq->extra_len but it's
fine as long as it's only for drivers that want to use it.

This is only compile tested.

=
diff --git a/block/blk-core.c b/block/blk-core.c
index 775c851..bfec406 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -127,7 +127,6 @@ void rq_init(struct request_queue *q, struct request *rq)
 	rq->nr_hw_segments = 0;
 	rq->ioprio = 0;
 	rq->special = NULL;
-	rq->raw_data_len = 0;
 	rq->buffer = NULL;
 	rq->tag = -1;
 	rq->errors = 0;
@@ -135,6 +134,7 @@ void rq_init(struct request_queue *q, struct request *rq)
 	rq->cmd_len = 0;
 	memset(rq->cmd, 0, sizeof(rq->cmd));
 	rq->data_len = 0;
+	rq->extra_len = 0;
 	rq->sense_len = 0;
 	rq->data = NULL;
 	rq->sense = NULL;
@@ -2016,7 +2016,6 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 	rq->hard_cur_sectors = rq->current_nr_sectors;
 	rq->hard_nr_sectors = rq->nr_sectors = bio_sectors(bio);
 	rq->buffer = bio_data(bio);
-	rq->raw_data_len = bio->bi_size;
 	rq->data_len = bio->bi_size;
 
 	rq->bio = rq->biotail = bio;
diff --git a/block/blk-map.c b/block/blk-map.c
index 09f7fd0..3287637 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -19,7 +19,6 @@ int blk_rq_append_bio(struct request_queue *q, struct request *rq,
 		rq->biotail->bi_next = bio;
 		rq->biotail = bio;
 
-		rq->raw_data_len += bio->bi_size;
 		rq->data_len += bio->bi_size;
 	}
 	return 0;
@@ -155,7 +154,7 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
 
 ...
From: Mike Christie
Date: Sunday, March 2, 2008 - 11:46 am

Actually, I think I was half wrong on that when you asked about 
scsi_debug. The scatterlist that sg.c uses is never seen by the block 
layer or scsi layer. It is just used as a container to hold segments. 
sg.c and st.c use their scatterlist to manage their preallocated 
pages/segments. When they pass it to scsi_execute_async, that function 
will create a request struct and add bios for the pages.

In 2.6.24 and below, sg.c will send a scatterlist length that does not 
match the IO length, and scsi_execute_async will goof up and send a 
rq->data_len that does not match the sum of the bios. That is what I was 
trying to fix in 2.6.24, but the patch got messed up. In 2.6.25-rc2 and 
above that is fixed and scsi_execute_async will catch sg.c doing this 
and set rq->data_len and the bio lengths correctly.

So hopefully that helps any fixes you might have planned.
--

From: Mike Galbraith
Date: Sunday, March 2, 2008 - 8:27 pm

Is it possible to teach Thunderturd to NOT munge the cc line?  It
stripped names from cc addresses, and here when that happens the message
lands (intentionally) in my spam grinder.  I just happened to see this
one before flushing, but now, thanks to Thunderturd, every follow-up
will also land there.  (well, not any more since I restored it)

	-Mike

--

From: Tejun Heo
Date: Sunday, March 2, 2008 - 7:40 pm

At this point, it's not clear what the original meaning of rq->data_len
is because before moving alignment and padding to block layer,
rq->data_len equaled both the requested data length and the length of sg
list.  AFAIK, it's SCSI midlayer which makes sg list and data length
mismatch not block layer.

From the POV of the block layer, as now it might extend the sg list, it
has to decide what rq->data_len means in this case - the requested
transfer length from userland or the length of mapped sg list.

I think that currently the biggest problem is that drivers which don't
require request size adjustment are getting it because alignment setting
doesn't distinguish between start address alignment and size alignment.
For drivers which don't require data size adjustment from block layer,
data_len or raw_data_len doesn't matter.  They're equal anyway.  I'm
prepping a patch for this now.

For drivers which do require request size adjustments, I think it's
better to keep rq->data_len in line with the size of mapped sg list.
The rationales are...

- Those are dumb controllers which want to see requests which meet
certain size requirements and they're likely to care more about actual
data buffer size than user requested buffer size.  IOW, they wanna see
sizes which meet certain requirements, so give them those values.

- I think bugs caused by using raw_data_len instead of data_len are more
subtle than the other way around.  Using data_len instead of
raw_data_len usually affects the application layer while using

If we're gonna go this way, we'll need blk_rq_total_data_len() and use
it for drivers which requires request size adjustments.

Thanks.

-- 
tejun
--

From: FUJITA Tomonori
Date: Sunday, March 2, 2008 - 8:59 pm

On Mon, 03 Mar 2008 11:40:08 +0900

Yeah. It meant the requested transfer length from userland in the past
and I think that chaning to the length of mapped sg list doesn't make

The drivers care about the actual data buffer size.

The dumb controllers drivers can get what they want to use

If we add extra_len, we can get what raw_data_len and data_len
provide.


I can't see what changing the meaning of rq->data_len (and

No problem. It would be much better to add blk_rq_total_data_len
rather than chainging the meaning of rq->data_len and all the block
drivers.

Here's an updated patch (I forgot to remove the bi_size adjustment in
blk_rq_map_user in the previous patch). Can we agree on it if we add
blk_rq_total_data_len()?


diff --git a/block/blk-core.c b/block/blk-core.c
index 775c851..bfec406 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -127,7 +127,6 @@ void rq_init(struct request_queue *q, struct request *rq)
 	rq->nr_hw_segments = 0;
 	rq->ioprio = 0;
 	rq->special = NULL;
-	rq->raw_data_len = 0;
 	rq->buffer = NULL;
 	rq->tag = -1;
 	rq->errors = 0;
@@ -135,6 +134,7 @@ void rq_init(struct request_queue *q, struct request *rq)
 	rq->cmd_len = 0;
 	memset(rq->cmd, 0, sizeof(rq->cmd));
 	rq->data_len = 0;
+	rq->extra_len = 0;
 	rq->sense_len = 0;
 	rq->data = NULL;
 	rq->sense = NULL;
@@ -2016,7 +2016,6 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 	rq->hard_cur_sectors = rq->current_nr_sectors;
 	rq->hard_nr_sectors = rq->nr_sectors = bio_sectors(bio);
 	rq->buffer = bio_data(bio);
-	rq->raw_data_len = bio->bi_size;
 	rq->data_len = bio->bi_size;
 
 	rq->bio = rq->biotail = bio;
diff --git a/block/blk-map.c b/block/blk-map.c
index 09f7fd0..f559832 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -19,7 +19,6 @@ int blk_rq_append_bio(struct request_queue *q, struct request *rq,
 		rq->biotail->bi_next = bio;
 		rq->biotail = bio;
 
-		rq->raw_data_len += bio->bi_size;
 		rq->data_len += bio->bi_size;
 	}
 ...
From: Tejun Heo
Date: Sunday, March 2, 2008 - 9:09 pm

No matter which way you go, you change the meaning of rq->data_len and
you MUST inspect rq->data_len usage whichever way you go.  Apply your
patch and try to do sg IO on IDE cdrom w/ various transfer lengths.

-- 
tejun
--

From: FUJITA Tomonori
Date: Monday, March 3, 2008 - 1:26 am

On Mon, 03 Mar 2008 13:09:13 +0900

The patch doens't change that rq->data_len means the true data
length. But yeah, it breaks rq->data_len == sum(sg). So it might break

I've just tried the patch with both ata and libata and it seems to
work.

For anyone hitting this problem, please try the following patch:

http://lkml.org/lkml/2008/3/2/218


Thanks,
--

From: Tejun Heo
Date: Monday, March 3, 2008 - 2:21 am

Yeah, that's what I was saying.  You end up breaking one of the two
assumptions.  As sglist is getting modified for any driver if it has DMA
alignment set, whether rq->data_len is adjusted together or not, sglist

Right, I missed you added extra_len in libata and IDE isn't using block

Whether rq->data_len stays with requested data buffer size or sum(sg), I
think we need to separate out padding from address alignment; otherwise,
we'll have to audit every block driver to make sure they can deal with
extended sglist no matter which value rq->data_len ends up indicating.

If padding is applied iff explicitly requested, rq->data_len indicates
matters only to the drivers which want to see the data length adjusted,
so most of the problems go away.

-- 
tejun
--

From: FUJITA Tomonori
Date: Monday, March 3, 2008 - 5:17 am

On Mon, 03 Mar 2008 18:21:13 +0900

My patch (well, James' original approach) doesn't affect drivers that
don't use drain buffer. rq->data_len still means the true data length
and rq->data_len is equal to sum(sg) for them. So right now we need to
audit only libata.

But your patch changes the meaning of rq->data_len. It affects all the
drivers. So it breaks non libata stuff, like the SMP handler. We need
to audit all the drivers.
--

From: Tejun Heo
Date: Monday, March 3, 2008 - 6:38 am

Your patch does change sglist for any driver which sets DMA alignment.

With both patches applied, sglist and data_len are adjusted only for
libata, so only drivers which explicitly requested buffer size
manipulation (currently only libata) need to be audited / updated.

-- 
tejun
--

From: FUJITA Tomonori
Date: Monday, March 3, 2008 - 6:50 am

On Mon, 03 Mar 2008 22:38:55 +0900

I overlook it. Where does it changes sglist?
--

From: Tejun Heo
Date: Monday, March 3, 2008 - 6:55 am

At the end of blk_rq_map_user() together with data_len / extra_len
mangling or were you talking about James' original patch?

-- 
tejun
--

From: FUJITA Tomonori
Date: Monday, March 3, 2008 - 7:01 am

On Mon, 03 Mar 2008 22:55:56 +0900

With my patch, at the end of blk_rq_map_user, we have:

	if (len & queue_dma_alignment(q)) {
		unsigned int pad_len = (queue_dma_alignment(q) & ~len) + 1;

		rq->extra_len += pad_len;
	}


So no change as compared with 2.6.24?
--

From: Tejun Heo
Date: Monday, March 3, 2008 - 7:22 am

Oh.. you killed sg list manipulation.  Many controllers do allow odd
bytes as the last sg entry but not all.  Also, if you append drain
buffer after it, it ends up with unaligned sg entry in the middle and
rq->data_len + rq->extra_len will overrun the sg entry after the drain
page which is really dangerous.

-- 
tejun
--

From: FUJITA Tomonori
Date: Monday, March 3, 2008 - 7:52 am

On Mon, 03 Mar 2008 23:22:46 +0900

Until 2.6.24, these drivers have taken care about the issue by

The drivers know that they use drain buffer. They can take care about
themselves on this too. If we want to do explicitly, we could have
rq->pad_len and rq->drain_len instead of rq->extra_len, though I think
that we are fine without these values because these drivers already
tell the block layer what they want and know that the block layer
gives it.


Jens, want's your verdict on this?
--

From: Tejun Heo
Date: Monday, March 3, 2008 - 3:44 pm

Yeah, libata did its own padding and needed to add draining.  Private
implementation was complex as hell and James suggested moving them to

So, if a driver has requested aligning and draining, the driver should
extend the sg entry before the last one by the alignment if draining was
used for the request and extent the last sg if the draining wasn't used.
 I'd rather just implement them in the drivers.

-- 
tejun
--

From: FUJITA Tomonori
Date: Monday, March 3, 2008 - 7:11 pm

On Tue, 04 Mar 2008 07:44:13 +0900

No, I'm not. I've been working on the IOMMUs to remove such
workarounds in LLDs.

What drivers need to do on this is just adding a padding length, that
is, drivers don't need to change the structure of the sg list (like
splitting a sg entry), right? And it doesn't break the SAS drivers
that support SATAPI, does it?

But I agree that drivers want to get a complete sglist so I'm fine
with adjusting sglist entries in the block layer with your secode
patch (separate out padding from alignment). As we discussed, I'm fine
with breaking sum(sg) == rq->data_len as long as rq->data_len means

The block layer extends the sg entry? The drivers just adjust
sg->length?
--

From: Tejun Heo
Date: Monday, March 3, 2008 - 7:32 pm

As long as the second patch is in, what value rq->data_len indicates
doesn't matter to drivers which don't use explicit padding or draining,
so the situation is much more controlled.  I don't care which value
rq->data_len would indicate.  I'd prefer it equal sum(sg) as that value
is what IDE and libata which will be the major users of padding and/or
draining expect in rq->data_len but fixing up that shouldn't be too
difficult.  I guess this can be determined by Jens.  If Jens likes

Still, do you really wanna force such things into low level drivers?
That will be one extremely fragile API and will be really difficult to
tell when things go wrong.

-- 
tejun
--

From: FUJITA Tomonori
Date: Tuesday, March 4, 2008 - 1:53 am

On Tue, 04 Mar 2008 11:32:56 +0900

OK, I prefer rq->data_len means the true data length though you prefer
rq->data_len means the allocated buffer length (the true data length
plus padding and drain). We agree on other things. We can live with
either way.


No, I don't, as I explained above. As long as rq->data_len means the
true data length, I'm fine. I knew that James' drain buffer patch
breaks rq->data_len == sum(sg). I don't care about it. I can
understand that drivers wants to a perfect sglist.
--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 1:59 am

I completely agree with you, ->data_len meaning true data length is way
cleaner imho. Only the driver should care for the padded length, all
other parts of the kernel only need to know what they actually got.

-- 
Jens Axboe

--

From: FUJITA Tomonori
Date: Tuesday, March 4, 2008 - 2:06 am

On Tue, 4 Mar 2008 09:59:46 +0100

OK, now we can fix the whole SG_IO (and bsg handler) mess.

Here's my patch with a proper description. which several people have
already tested (thanks!). Then we need an updated version of Tejun's
separate out padding from alignment patch.

=
From: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Subject: [PATCH] block: restore the meaning of rq->data_len to the true data length

The meaning of rq->data_len was changed to the length of an allocated
buffer from the true data length. It breaks SG_IO friends and
bsg. This patch restores the meaning of rq->data_len to the true data
length and adds rq->extra_len to store an extended length (due to
drain buffer and padding).

This patch also removes the code to update bio in blk_rq_map_user
introduced by the commit 40b01b9bbdf51ae543a04744283bf2d56c4a6afa.
The commit adjusts bio according to memory alignment
(queue_dma_alignment). However, memory alignment is NOT padding
alignment. This adjustment also breaks SG_IO friends and bsg. Padding
alignment needs to be fixed in a proper way (by a separate patch).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 block/blk-core.c          |    3 +--
 block/blk-map.c           |    6 +-----
 block/blk-merge.c         |    2 +-
 block/bsg.c               |    8 ++++----
 block/scsi_ioctl.c        |    4 ++--
 drivers/ata/libata-scsi.c |    6 +++---
 include/linux/blkdev.h    |    2 +-
 7 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 775c851..bfec406 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -127,7 +127,6 @@ void rq_init(struct request_queue *q, struct request *rq)
 	rq->nr_hw_segments = 0;
 	rq->ioprio = 0;
 	rq->special = NULL;
-	rq->raw_data_len = 0;
 	rq->buffer = NULL;
 	rq->tag = -1;
 	rq->errors = 0;
@@ -135,6 +134,7 @@ void rq_init(struct request_queue *q, struct request *rq)
 	rq->cmd_len = 0;
 	memset(rq->cmd, 0, sizeof(rq->cmd));
 ...
From: FUJITA Tomonori
Date: Tuesday, March 4, 2008 - 2:22 am

On Tue, 04 Mar 2008 18:06:48 +0900

OK, I've updated his patch. Tejun, can you audit this?

Thanks,

=
From: Tejun Heo <htejun@gmail.com>
Subject: [PATCH] block: separate out padding from alignment

Block layer alignment was used for two different purposes - memory
alignment and padding.  This causes problems in lower layers because
drivers which only require memory alignment ends up with adjusted
rq->data_len.  Separate out padding such that padding occurs iff
driver explicitly requests it.

Tomo: restorethe code to update bio in blk_rq_map_user
      introduced by the commit 40b01b9bbdf51ae543a04744283bf2d56c4a6afa
      according to padding alignment.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 block/blk-map.c           |   20 +++++++++++++-------
 block/blk-settings.c      |   17 +++++++++++++++++
 drivers/ata/libata-scsi.c |    3 ++-
 include/linux/blkdev.h    |    2 ++
 4 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index f559832..4e17dfd 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -43,6 +43,7 @@ static int __blk_rq_map_user(struct request_queue *q, struct request *rq,
 			     void __user *ubuf, unsigned int len)
 {
 	unsigned long uaddr;
+	unsigned int alignment;
 	struct bio *bio, *orig_bio;
 	int reading, ret;
 
@@ -53,8 +54,8 @@ static int __blk_rq_map_user(struct request_queue *q, struct request *rq,
 	 * direct dma. else, set up kernel bounce buffers
 	 */
 	uaddr = (unsigned long) ubuf;
-	if (!(uaddr & queue_dma_alignment(q)) &&
-	    !(len & queue_dma_alignment(q)))
+	alignment = queue_dma_alignment(q) | q->dma_pad_mask;
+	if (!(uaddr & alignment) && !(len & alignment))
 		bio = bio_map_user(q, NULL, uaddr, len, reading);
 	else
 		bio = bio_copy_user(q, uaddr, len, reading);
@@ -141,15 +142,20 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
 
 	/*
 	 * __blk_rq_map_user() copies the buffers ...
From: Jens Axboe
Date: Tuesday, March 4, 2008 - 2:35 am

Looks excellent to me, has a variant of this been tested as OK by the
users reporting the regression?

-- 
Jens Axboe

--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 5:37 am

K3b burning seems to be a nogo here.  This is git pulled this morning
though, so it's a somewhat different tree than previously tested fwtw.

[  136.440021] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[  136.440043] ata1.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
[  136.440045]          cdb 51 00 00 00 00 00 00 00  02 00 00 00 00 00 00 00
[  136.440047]          res 58/00:02:00:02:00/00:00:00:00:00/b0 Emask 0x2 (HSM violation)
[  136.440053] ata1.01: status: { DRDY DRQ }
[  136.440086] ata1: soft resetting link
[  165.327627] ata1.01: qc timeout (cmd 0xa1)
[  165.327627] ata1.01: failed to IDENTIFY (I/O error, err_mask=0x4)
[  165.327627] ata1.01: revalidation failed (errno=-5)
[  165.327627] ata1: failed to recover some devices, retrying in 5 secs
[  177.272373] ata1: port is slow to respond, please be patient (Status 0x80)
[  180.388879] ata1: device not ready (errno=-16), forcing hardreset
[  180.388879] ata1: soft resetting link
[  210.832471] ata1.01: qc timeout (cmd 0xa1)
[  210.832471] ata1.01: failed to IDENTIFY (I/O error, err_mask=0x4)
[  210.832471] ata1.01: revalidation failed (errno=-5)
[  210.832471] ata1: failed to recover some devices, retrying in 5 secs
[  223.392899] ata1: port is slow to respond, please be patient (Status 0x80)
[  225.920376] ata1: device not ready (errno=-16), forcing hardreset
[  225.920376] ata1: soft resetting link
[  256.542565] ata1.01: qc timeout (cmd 0xa1)
[  256.542565] ata1.01: failed to IDENTIFY (I/O error, err_mask=0x4)
[  256.542565] ata1.01: revalidation failed (errno=-5)
[  256.542565] ata1.01: disabled
[  259.995199] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[  259.995214] ata1.00: revalidation failed (errno=-5)
[  259.995219] ata1: failed to recover some devices, retrying in 5 secs
[  265.047502] ata1: soft resetting link
[  262.397570] ata1.00: limited to UDMA/33 due to 40-wire cable
[  262.420039] ata1.00: configured for UDMA/33
[  262.420039] sr 0:0:1:0: [sr0] Result: ...
From: Jens Axboe
Date: Tuesday, March 4, 2008 - 5:39 am

can you please try git as of this morning without any patches applied,
and then pull

git://git.kernel.dk/linux-2.6-block.git for-linus

into that and see if that works?

-- 
Jens Axboe

--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 5:43 am

I'll give it a shot in a bit.

	-Mike

--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 5:58 am

Aw poo, so many choices.
I did:
	git add remote block-for-linus git://git.kernel.dk/linux-2.6-block.git
	git remote update
Now, which one do I check out?  block-for-linus/master maybe, or
block-for-linus/for-linus?

homer:..git/linux-2.6 # git checkout block-for-linus
error: pathspec 'block-for-linus' did not match any file(s) known to git.
Did you forget to 'git add'?
homer:..git/linux-2.6 # git branch -a
* master
  x86/master
  x86/mm
  block-for-linus/blktrace
  block-for-linus/cmdfilter
  block-for-linus/dynpipe
  block-for-linus/fcache
  block-for-linus/for-akpm
  block-for-linus/for-linus
  block-for-linus/io-cpu-affinity
  block-for-linus/io-cpu-affinity-kthread
  block-for-linus/loop-extent_map
  block-for-linus/loop-fastfs
  block-for-linus/master
  block-for-linus/plug
  block-for-linus/splice
  block-for-linus/syslet
  block-for-linus/syslet-share
  block-for-linus/timeout
  linux-next/master
  linux-next/stable
  origin/HEAD
  origin/master
  x86/base
  x86/for-akpm
  x86/for-linus
  x86/latest
  x86/master
  x86/mm
  x86/origin
  x86/testing


--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 6:03 am

Re-read my original mail! It states that you should just pull:

git://git.kernel.dk/linux-2.6-block.git for-linus

into your linus branch, or just create a test branch off linus' master
and pull into that. IOW, it's the for-linus branch that you should pull,
nothing else.

-- 
Jens Axboe

--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 7:25 am

Well, I had a good reason.  You know how to un-pull, I know how to
un-remote to get back to pristine after I'm done testing... guaranteed
without whimpering pathetically on the git list ;-)

Anyway, I checked out the one with the big-fat-hint in it's name
(block-for-linus/for-linus).
Same error.  Git this morning with patches...
	restore_meaning_of_data_len.diff
	seperate_out_padding_from_alignment.diff
...reverted restored me to the originally reported k3b error, nothing
new noted.

If I tested the wrong branch, whack me upside the head, and I'll follow
your pull destructions, and figure out how to un-pull later.

	-Mike

--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 11:17 am

OK, if you're on master, it's pretty easy:

$ git branch test-branch
$ git checkout test-branch
$ git pull git://git.kernel.dk/linux-2.6-block.git for-linus

[build, boot, test]
$ git checkout master

for-linus is the right branch, but I'm just a little worried that you
didn't test what you think you tested. What does cat .git/HEAD say? If
that is a ref to a file (eg refs/heads/master), what does that file
contain?

-- 
Jens Axboe

--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 11:35 am

Hm, that's simple enough.  I'll do this for the edification.  Thanks.
Maybe some day, I'll cease to be so paranoid that my test setup may

That wouldn't surprise me one bit.  (ergo...)

It says cc66b4512cae8df4ed1635483210aabf7690ec27... kewpie doll?

	-Mike

--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 11:45 am

That looks right, then perhaps there's still an issue there :/
Logs?

-- 
Jens Axboe

--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 11:49 am

Tejuns patchlet (below) fixed it here.

Date:	Wed, 05 Mar 2008 01:42:45 +0900
From:	Tejun Heo <htejun@gmail.com>
To:	FUJITA Tomonori <tomof@acm.org>
CC:	efault@gmx.de, jens.axboe@oracle.com, fujita.tomonori@lab.ntt.co.jp, James.Bottomley@HansenPartnership.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org, linux-scsi@vger.kernel.org, jgarzik@pobox.com, bzolnier@gmail.com
Subject: Re: [PATCH] block: fix residual byte count handling


Okay, I got it.  Heh, it turns out SCSI and/or block layer is not
ready for rq->data_len != sum(sg).  When adjusted command completes,
SCSI midlayer completes the command with rq->data_len for PC commands
which eventually ends up in __end_that_request_first().  As there are
extra sg area left after completing rq->data_len, blk layer says so to
SCSI layer and SCSI layer retries the command only with the appended
area.

The following patch gets the writing going.  I really think it's a
serious mistake to break rq->data_len == sum(sg).  If we break
rq->data_len == requested size, the worst bugs are giving wrong size
when issuing commands to application layer of devices which is
relatively easy to spot and not all that command anyway.  Breaking
rq->data_len == sum(sg), bugs will be in internal mechanics, DMA
engine programming and transport layer.  Oh well...

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index fecba05..32439ac 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -757,7 +757,7 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
 				"Notifying upper driver of completion "
 				"(result %x)\n", cmd->result));
 
-	good_bytes = scsi_bufflen(cmd);
+	good_bytes = scsi_bufflen(cmd) + cmd->request->data_len;
         if (cmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
 		drv = scsi_cmd_to_driver(cmd);
 		if (drv->done)




--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 11:54 am

OK, can you try changing that to

        good_bytes = scsi_bufflen(cmd) + cmd->request->extra_len;

and retest?

-- 
Jens Axboe

--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 12:26 pm

Yup, disk #42 is happily burning away.

	-Mike

--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 12:28 pm

Super, patch heading to Linus now. Thanks for all your testing, Mike!

-- 
Jens Axboe

--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 11:29 am

Or just re-pull Linus' tree, the stuff is in now.

-- 
Jens Axboe

--

From: James Bottomley
Date: Tuesday, March 4, 2008 - 9:04 am

Works for me with the SAS SMP handler.  Both input request and output
response frame sizes are picked up and returned with the correct
residues.

James


--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 11:46 am

Goodie, now we just need to figure out why it doesn't work for Mike
yet...

-- 
Jens Axboe

--

From: walt
Date: Tuesday, March 4, 2008 - 10:34 am

Unfortunately this doesn't fix a problem I've discussed off-list with
Kiyoshi Ueda, who suggested that I should follow this thread and try
any patches posted here.

Here is what happens when I try to mount a CD (before and after I
pull 'for-linus'):

hdc: ide_cd_check_ireason: wrong transfer direction!
cdrom: failed setting lba address space
hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdc: drive not ready for command
hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdc: drive not ready for command
hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdc: drive not ready for command
hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdc: drive not ready for command
hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdc: drive not ready for command
hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdc: drive not ready for command
hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdc: drive not ready for command
hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
ide: failed opcode was: unknown
hdc: drive not ready for command
hdc: status timeout: status=0xd0 { Busy }
ide: failed opcode was: unknown
hdc: DMA disabled
hdc: drive not ready for command
hdc: ATAPI reset complete
ISO 9660 Extensions: Microsoft Joliet Level 3
ISOFS: changing to secondary root
VFS: busy inodes on changed media.

The mount can take from 5 seconds on up to a minute or so before the
CD can be accessed.

--

From: Kiyoshi Ueda
Date: Tuesday, March 4, 2008 - 12:42 pm

Hi,

 
I think there was misunderstanding between us.
On off-list, I meant:
  o Your original problem was CD burning, and it looked same problem

    So I suggested you to watch this thread and try patches of this
    thread for CD burning problem.

  o The problem of ide_cd_check_ireason looked different from
    CD burning one.
    So I suggested you to report it as a different problem.

Thanks,
Kiyoshi Ueda
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 10:59 am

Which version did you try?  There was a recent IDE bug fix which
affected CD recording.  Commit bcd88ac3b2ff2eae3d0fa57a6b02d4fce5392f32
which is included in 2.6.25-rc3.  Also, did 2.6.24 work?

-- 
tejun
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 5:40 am

Aiee... device going down after timing out on READ_DISC_INFO.  That's
gruesome.  Can you please try the other patches?

Thanks.

-- 
tejun
--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 5:45 am

I tried your last yesterday, and k3b worked fine.

	-Mike

--

From: FUJITA Tomonori
Date: Tuesday, March 4, 2008 - 6:30 am

On Tue, 04 Mar 2008 21:40:53 +0900

Tejun, I thought that libata needs a fix for sum(sg) != rq->data_len. No?

Now Jens' git tree should work with all the non libata stuff, ide,
firewire, bsg, etc. But I'm not sure about libata.
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 6:50 am

The extra_len you added to qc->nbytes should be it.  The only other
place to pay attention is the ATAPI transfer chunk size and your patch

With the second patch, all others should be fine no matter what.  I'll
go check libata part again.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 9:17 am

I can reproduce the problem here and it's very weird.  I'll report back
when I know more.

-- 
tejun
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 9:42 am

Okay, I got it.  Heh, it turns out SCSI and/or block layer is not
ready for rq->data_len != sum(sg).  When adjusted command completes,
SCSI midlayer completes the command with rq->data_len for PC commands
which eventually ends up in __end_that_request_first().  As there are
extra sg area left after completing rq->data_len, blk layer says so to
SCSI layer and SCSI layer retries the command only with the appended
area.

The following patch gets the writing going.  I really think it's a
serious mistake to break rq->data_len == sum(sg).  If we break
rq->data_len == requested size, the worst bugs are giving wrong size
when issuing commands to application layer of devices which is
relatively easy to spot and not all that command anyway.  Breaking
rq->data_len == sum(sg), bugs will be in internal mechanics, DMA
engine programming and transport layer.  Oh well...

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index fecba05..32439ac 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -757,7 +757,7 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
 				"Notifying upper driver of completion "
 				"(result %x)\n", cmd->result));
 
-	good_bytes = scsi_bufflen(cmd);
+	good_bytes = scsi_bufflen(cmd) + cmd->request->data_len;
         if (cmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
 		drv = scsi_cmd_to_driver(cmd);
 		if (drv->done)


-- 
tejun
--

From: James Bottomley
Date: Tuesday, March 4, 2008 - 11:27 am

This doesn't look right.  scsi_bufflen(cmd) is req->data_len for PC
commands ... did you mean to add extra_len here?

James


--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 11:33 am

Yeap, sorry about the confusion.  Adding two times data_len accidentally
worked tho.  :-)

-- 
tejun
--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 11:45 am

Bingo.

	-Mike

--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 12:25 pm

Pretty please test this on top of current -git?

I'll merge this up, it should do the trick. Would just be nice if you
could verify! :-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index fecba05..e5c6f6a 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -757,7 +757,7 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
 				"Notifying upper driver of completion "
 				"(result %x)\n", cmd->result));
 
-	good_bytes = scsi_bufflen(cmd);
+	good_bytes = scsi_bufflen(cmd) + cmd->request->extra_len;
         if (cmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
 		drv = scsi_cmd_to_driver(cmd);
 		if (drv->done)

-- 
Jens Axboe

--

From: Mike Galbraith
Date: Tuesday, March 4, 2008 - 12:33 pm

?  That's the patch I just tested, and the tree.  Oh.. 976dde0..87baa2b
just a sec....

	-Mike

--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 12:34 pm

Yeah it is, mid-air collision of emails. So just disregard this one!

-- 
Jens Axboe

--

From: FUJITA Tomonori
Date: Tuesday, March 4, 2008 - 12:19 pm

On Wed, 05 Mar 2008 01:42:45 +0900


Hmm, does SCSI mid-layer need to care about how many bytes the block
layer allocates? I don't think that extra_len is NOT good_bytes.

I think that the block layer had better take care about it (fix
__end_that_request_first?).
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 4:33 pm

Yeah, probably calling completion functions w/o bytes count is the right
thing to do but what I was talking about was what could break when the
semantics of rq->data_len changed.  If we keep rq->data_len() ==
sum(sg), we keep it business as usual for all the rest except for the
device application layer if we don't we do the reverse and SCSI midlayer
completion was a good example, I think.

Things going the other way is fine with me but I at least want to hear a
valid rationale.  Till now all I got is "because that's the true size"
which doesn't really make much sense to me.

Thanks.

-- 
tejun
--

From: FUJITA Tomonori
Date: Tuesday, March 4, 2008 - 5:26 pm

On Wed, 05 Mar 2008 08:33:05 +0900

sglist is a low-level I/O representation for device drivers. SCSI
midlayer should not care about sglist. We should not fix SCSI midlayer
for rq->data_len != sum(sg) change (so I can't agree with your
diagrams in another mail).

When if we change a rule, we need to fix something.

If we keep rq->data_len == sum(sg), we need to fix the device
application layer. If we keep rq->data_len == the true data length, we
need to fix the low-level drivers.

Now I'm fine with the commit e97a294ef6938512b655b1abf17656cf2b26f709
since we are in -rc stages. But I plan to send a patch to revert it
and fix this issue in the block layer. I'd like to test it in -mm for
a while.

Only sglist stuff in SCSI midlayer is scsi_req_map_sg now. As you

Most of users of request structure care about only the real data
length, don't care about padding and drain length. Why do they bother
to use a helper function to get the real data length?
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 5:44 pm

Hello,




If the way forward is to make anything but the low level drivers not
care about sglist, in the long term, the current scheme is fine but I
still don't think this way of doing things is safe one.  We're affecting
large portion of code based on what things should be in future not what

I think this is where the difference comes from.  To me it seems
internal usage seems more wide-spread and more delicate and not too many
care about the true size and when they do only in well defined places.
Maybe it comes from the difference between your most and my most.

Thanks.

-- 
tejun
--

From: FUJITA Tomonori
Date: Wednesday, March 5, 2008 - 9:56 pm

On Wed, 05 Mar 2008 09:44:01 +0900

I don't think that they only in well defined places.

If you see scsi mid-layer (and LLDs), you can find several places that
use rq->data_len as the true data length.

Breaking rq->data_len == the true data length theoretically
wrong. Even if it affects only libata now, it will hurt us, I think.
--

From: Tejun Heo
Date: Wednesday, March 5, 2008 - 10:02 pm

Hello, FUJITA.


Yeap, I fully agree it's much better not to break any of the two
assumptions except when it's actually needed.  Both padding and draining
are requirements from low level driver which usually stems from hardware
kinkiness, so adjusting sg and length there and let the rest of system
not care about it sounds like a good idea to me.  Maybe something good
can come out of this long thread.  :-)

Thanks.

-- 
tejun
--

From: Boaz Harrosh
Date: Wednesday, March 5, 2008 - 3:16 am

No this commit is a serious bug, and the only fix is like you suggested
in __end_that_request_first. This is because it breaks that scsi-ml loop
where scsi_bufflen() can be less then blk_rq_bytes(). In that case this 

Submitted is the right fix to this problem, as pointed out by TOMO.
Please test it solves the CD burning problem.
(The patch includes the revert of commit e97a294e)
---
From: Boaz Harrosh <bharrosh@panasas.com>
Date: Wed, 5 Mar 2008 12:07:12 +0200
Subject: [PATCH] blk: missing add of padded bytes to io completion byte count

the commit e97a294ef6938512b655b1abf17656cf2b26f709 was very wrong. This is
because scsi-ml supports the ability to split a request into smaller chunks,
in which case scsi_bufflen() is smaller then request length. Then at completion
time the remainder can be issued as a new scsi command. In that case the above
commit is a data corruption.

Also in this fix all users of block layer are taken care of, and not only
scsi devices.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 block/blk-core.c    |    4 ++++
 drivers/scsi/scsi.c |    2 +-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2a438a9..37fcccc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1549,6 +1549,9 @@ static int __end_that_request_first(struct request *req, int error,
 			     nr_bytes >> 9, req->sector);
 	}
 
+	if (nr_bytes >= blk_rq_bytes(req))
+		nr_bytes += req->extra_len;
+
 	total_bytes = bio_nbytes = 0;
 	while ((bio = req->bio) != NULL) {
 		int nbytes;
@@ -1616,6 +1619,7 @@ static int __end_that_request_first(struct request *req, int error,
 	if (!req->bio)
 		return 0;
 
+	BUG_ON(total_bytes >= blk_rq_bytes(req));
 	/*
 	 * if the request wasn't completed, update state
 	 */
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index e5c6f6a..fecba05 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -757,7 ...
From: Mike Galbraith
Date: Wednesday, March 5, 2008 - 5:28 am

Works for me.


--

From: Jens Axboe
Date: Wednesday, March 5, 2008 - 5:33 am

Make that a WARN_ON() first please. It's indeed a bug, but it wont be
critical and it's not fair killing everything since this padding stuff
is so fresh and may still need a tweak or two.

I'd be fine with making it a BUG_ON() post 2.6.25.

-- 
Jens Axboe

--

From: Boaz Harrosh
Date: Wednesday, March 5, 2008 - 5:46 am

Updated, you are absolutely right, thanks.

Will you commit below patch for 2.6.25? I know that, at the time, I have 
seen this scsi-ml-loop in action on a sata drive here in the lab, on an 
x86_64 machine. The current solution will silently corrupt data, which 
is very hard to find.

Boaz
---
From: Boaz Harrosh <bharrosh@panasas.com>
Date: Wed, 5 Mar 2008 12:07:12 +0200
Subject: [PATCH] blk: missing add of padded bytes to io completion byte count

the commit e97a294ef6938512b655b1abf17656cf2b26f709 was very wrong. This is
because scsi-ml supports the ability to split a request into smaller chunks,
in which case scsi_bufflen() is smaller then request length. Then at completion
time the remainder can be issued as a new scsi command. In that case the above
commit is a data corruption.

Also in this fix all users of block layer are taken care of, and not only
scsi devices.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
---
 block/blk-core.c    |    4 ++++
 drivers/scsi/scsi.c |    2 +-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2a438a9..c82e68a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1549,6 +1549,9 @@ static int __end_that_request_first(struct request *req, int error,
 			     nr_bytes >> 9, req->sector);
 	}
 
+	if (nr_bytes >= blk_rq_bytes(req))
+		nr_bytes += req->extra_len;
+
 	total_bytes = bio_nbytes = 0;
 	while ((bio = req->bio) != NULL) {
 		int nbytes;
@@ -1616,6 +1619,7 @@ static int __end_that_request_first(struct request *req, int error,
 	if (!req->bio)
 		return 0;
 
+	WARN_ON(total_bytes >= blk_rq_bytes(req));
 	/*
 	 * if the request wasn't completed, update state
 	 */
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index e5c6f6a..fecba05 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -757,7 +757,7 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
 				"Notifying upper driver of ...
From: Jens Axboe
Date: Wednesday, March 5, 2008 - 5:48 am

Yes, was just hoping you'd resend with the above corrected, so thanks!
I'll add it to the pending queue for 2.6.25.

-- 
Jens Axboe

--

From: Tejun Heo
Date: Wednesday, March 5, 2008 - 6:45 am

Hello, Jens, Boaz.


Thanks for catching the stupidity.  Did it actually happen?  PC commands
are not completed in pieces and padding / draining should only happen
for those.  qc->extra_len should be zero where commands can be splitted

This is getting insanely subtle.  Let's say there's PIO driver which
transfer certain sized chunks at a time and completes request partially
after completing each chunk and the driver uses draining to eat up
whatever excess data, which seems like a legit use case to me.  But it
won't work because __end_that_request_first() will terminate when it
reaches reaches the 'true' transfer size.  That's just broken API.  FWIW,

Nacked-by: Tejun Heo <htejun@gmail.com>

-- 
tejun
--

From: Jens Axboe
Date: Wednesday, March 5, 2008 - 6:51 am

Yeah, I think I may have gone a bit overboard in applying this so
quickly. It's just not a good interface, silently adding the extra
length if asked to complete more. It may even happen right now, for a
driver that does no padding (it probably wont do any harm here either,
but still).

I'll try and see if I can come up with something cleaner.

My basic design paradigm for this is that the _driver_ (or mid layer, if
SCSI wants to handle it) should care about the padding. So make it easy
for them to pad, but have it 'unrolled' by completion time. We should
NOT need any extra_len checks or additions in the block/ directory,
period.

-- 
Jens Axboe

--

From: Tejun Heo
Date: Wednesday, March 5, 2008 - 7:08 am

Hello, Jens.


Unless it explicitly requests padding, it shouldn't be a problem
extra_len will always be zero and currently the only driver which uses

Maybe I'm from Mars but I don't really understand all this fuss.  The
two patches I posted way back work perfectly fine and don't have any of
these problems and as I have said again and again that's because it
doesn't break the assumption which our internal mechanics depend on.

Can you please put the "true" size aside for a while and consider those
patches?  There's nothing fundamentally wrong with letting the
rq->data_len be sum(sg) which can differ from user requested data length
if and only if low level driver requests so.

If you can come up with something nicer, that will be great too but I
really don't think the current scheme will work.

Thanks.

-- 
tejun
--

From: James Bottomley
Date: Wednesday, March 5, 2008 - 8:21 am

Right, that's why my original proposal was to do nothing for padding
(other than ensure the driver could adjust the length if it wanted to)
and to add an extra element always for draining, which the driver could
ignore.  It basically pushed the use paradigm onto the driver.

If we want the use paradigm shared between block and driver, then I
think the best approach is to keep all the bios the same (so not adjust
for padding), but do adjust in the blk_rq_map_sg().  That way we have
the padding and draining unwind information by comparing with the bio.

For passing on to the driver:  req->data_len still needs to be the input
(bio) lenght.  req->extra_len can record how much padding and draining
was added.  The completion length also needs to be in terms of the true
(bio) length.  Now, here's the subtlety.  Because of the way transfers
work, we expect the padded length not to contribute to overrun (because
it represents transfers that were successfully completed at the correct
length), but we *do* expect drain usage to be recorded as overrun.
However, if we keep the bios intact, we have all the information to make
this determination in the block layer at completion time, with the
expectation that the lower layers report the exact amount they
transferred.

James


--

From: FUJITA Tomonori
Date: Wednesday, March 5, 2008 - 9:41 pm

On Wed, 05 Mar 2008 09:21:24 -0600

Adjusting only sg in blk_rq_map_sg (like drain) looks much
better. This works with libata for me.

diff --git a/block/blk-map.c b/block/blk-map.c
index c07d9c8..e949969 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -140,26 +140,6 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
 		ubuf += ret;
 	}
 
-	/*
-	 * __blk_rq_map_user() copies the buffers if starting address
-	 * or length isn't aligned to dma_pad_mask.  As the copied
-	 * buffer is always page aligned, we know that there's enough
-	 * room for padding.  Extend the last bio and update
-	 * rq->data_len accordingly.
-	 *
-	 * On unmap, bio_uncopy_user() will use unmodified
-	 * bio_map_data pointed to by bio->bi_private.
-	 */
-	if (len & q->dma_pad_mask) {
-		unsigned int pad_len = (q->dma_pad_mask & ~len) + 1;
-		struct bio *tail = rq->biotail;
-
-		tail->bi_io_vec[tail->bi_vcnt - 1].bv_len += pad_len;
-		tail->bi_size += pad_len;
-
-		rq->extra_len += pad_len;
-	}
-
 	rq->buffer = rq->data = NULL;
 	return 0;
 unmap_rq:
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 0f58616..2a81c87 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -220,6 +220,13 @@ new_segment:
 		bvprv = bvec;
 	} /* segments in rq */
 
+	if (sg && (q->dma_pad_mask & rq->data_len)) {
+		unsigned int pad_len = (q->dma_pad_mask & ~rq->data_len) + 1;
+
+		sg->length += pad_len;
+		rq->extra_len += pad_len;
+	}
+
 	if (q->dma_drain_size && q->dma_drain_needed(rq)) {
 		if (rq->cmd_flags & REQ_RW)
 			memset(q->dma_drain_buffer, 0, q->dma_drain_size);
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index e5c6f6a..fecba05 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -757,7 +757,7 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
 				"Notifying upper driver of completion "
 				"(result %x)\n", cmd->result));
 
-	good_bytes = scsi_bufflen(cmd) + cmd->request->extra_len;
+	good_bytes = scsi_bufflen(cmd);
         if ...
From: Jens Axboe
Date: Thursday, March 6, 2008 - 6:41 am

Looks like a much better solution to me. Anyone have any valid
objections against moving the padding to the sg map time?

-- 
Jens Axboe

--

From: Tejun Heo
Date: Thursday, March 6, 2008 - 5:07 pm

Not necessarily objections but some concerns.

* As completion is done in bio terms, it makes completion from LLDs a
bit cumbersome, but this is unavoidable if we break sum(bio) == sum(sg).

* I've been wondering why we are not using sg chain / table or whatever
directly in bios and maybe rq_map_sg can go away in future.

How about separating out the padding / draining adjustment into a
separate interface?  Say, blk_rq_apply_extra() and blk_rq_undo_extra()
and make it the responsibility of the LLD which requested
padding/draining to apply and undo the adjustments?  It can undo the
adjustments when it returns the the request to its upper layer.  If rq
completion is handled by upper layer, it will do the right thing.  If rq
completion is handled by LLD, it can see the bio it wants to see.

Thanks.

-- 
tejun
--

From: FUJITA Tomonori
Date: Friday, March 7, 2008 - 8:07 am

On Fri, 07 Mar 2008 09:07:23 +0900


You mean that LLDs use bios directly? For me, sg and bio have very

If possible, I'd like to avoid creating APIs for them. I think that
the current approach is much better than such APIs.
--

From: Tejun Heo
Date: Friday, March 7, 2008 - 6:06 pm

LLDs which loop over sg's trying to complete rq incrementally will see

Actually the other way, block layer use sg instead of bio_vec in bio.
Layer separation doesn't necessarily require copying about the same
information to differently formatted data structure.  I'm not sure it
will be a clean win tho.  Requests hang longer in scheduler queue and
and bio_vec is smaller and scatterlist.

The thing is that, to me, blk_rq_map_sg() doesn't really look necessary,
it can be done just as well when the request is fetched from the queue

And, so, I'm not too sure whether putting more mechanisms into it is a
good idea.

Thanks.

-- 
tejun
--

From: FUJITA Tomonori
Date: Thursday, March 20, 2008 - 5:54 am

From: Jens Axboe <jens.axboe@oracle.com>
Subject: Re: [PATCH] blk: missing add of padded bytes to io completion byte   count

What's the situation with this fix?
--

From: Boaz Harrosh
Date: Wednesday, March 5, 2008 - 7:46 am

I don't understand? Drivers can still do that. Do you mean That it wants
to also complete the draining portion in smaller chunks? I thought the draining
is always done at once, at most. Is that theoretical or is it so in any of the 
drivers.

Any way Nack from my side on the scsi_finish_command(), it makes too many 
assumptions that are unchecked anywhere. And it's a terrible layering violation.
scsi is a pass-threw block device, the fix should be in block or in using device
drivers (eg libata), that know what is going on.

Any way you are always saying req->data_len == sum(sg) but that  was certainly
never true for scsi_bufflen() == sum(sg) so leave that alone please.

Any other block layer fixes are welcome. But for now this is the best fix we have
that only breaks theoretical, yet to be submitted drivers.

Boaz
--

From: Tejun Heo
Date: Wednesday, March 5, 2008 - 8:11 am

Hello, Boaz.


Ah... I wasn't really Nacking your patch specifically.  I was trying to
say "this scheme isn't gonna work".  Your patch does make good sense
given the situation (and I think I did acknowledge that above).  Sorry


I don't really care about scsi_bufflen() and I'm not willing to change
any of that.  If SCSI LLDs are happy with scsi_bufflen() != sum(sg), no
problem at all.  What I'm against is pushing that into block layer,
which until now had "true" size == rq->data_len == sum(sg).

We're about to break one of the two equals if we're gonna do sg
manipulation in block layer (Jens seems to be planning something
different) and all I'm saying is we're far better off breaking the
former one.

First, I don't really think SCSI LLDs will make much use of explicit
padding or draining.  Secondly, even when such need arises, keeping
scsi_bufflen() at the "true" size is easy no matter which way we go with
rq->data_len.


Yeap, given the current code, I agree.

Thanks.

-- 
tejun
--

From: FUJITA Tomonori
Date: Wednesday, March 5, 2008 - 10:02 pm

On Wed, 05 Mar 2008 12:16:15 +0200

Ah, I knew that the patch doesn't work with partial completion but I
thought that it doesn't happen with PC commands... And touching
__end_that_request_first looked really hacky so I didn't send such
patch.

Moving the padding adjustment to blk_rq_map_sg (James' proposal) looks
fine. Maybe Jens will come up with something better.
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 4:54 pm

I'm giving it another shot.  When the padding / draining thing was in
libata (or IDE) in that matter.  The whole thing looked like this.

  user - blk - SCSI - libata - LLD - controller - device
  <---------------------><----------------------><----->
            a                       b               c

a: Uses the 'true' request size and matching sg
b: Requires adjusted request size and matching sg
c: Don't really care about sg, but sometimes needs the true size.  For
   anything which gets attached behind ATA and which may require
   padding, transfer size is also sent in the CDB as well, which not
   all devices honor and that's one of the reasons why size adjustment
   is necessary.

If we move the adjustment to block layer and keep data_len == sum(sg),
it looks like.

  user - blk - SCSI - libata - LLD - controller - device
  <------><-------------------------------------><----->
      a                      b                      c

And a, b and c stay the same.  If we keep the requested size in
data_len.  Whole b gets inconsistent values in the middle while c gets
the value it wants in data_len, so we're risking much more to keep the
true size in rq->data_len when we could simply make it mean sum(sg).

Before the only thing which need updating was to correctly determine
the transfer size to feed to device.  Now we need to audit whole b.
In addition, such adjustments are made only when the driver explicitly
requested it, so for all others it doesn't really matter.

Thanks.

-- 
tejun
--

From: Boaz Harrosh
Date: Tuesday, March 4, 2008 - 11:26 am

Are you sure? is it not:

I hate this patch. I wish you could maybe take the extra_len into 
account inside blk_end_request. The padding should be transparent
to all concerned but the requesting LLD and the internals of the
block layer. If block layer added padding it should take that into
account on completion. My $0.2.

Boaz
 
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 11:35 am

Yeah, I hate it too.  As I've been saying all along, I think it just
should be rq->data_len.

-- 
tejun
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 2:40 am

Yeah, the other version which added extra_len to data_len has been
verified to work.  The only difference is now libata is adding
extra_len, so this one should be safe.

-- 
tejun
--

From: Jens Axboe
Date: Tuesday, March 4, 2008 - 2:46 am

Great, since we all agree, I'll merge it up and pass it on.

-- 
Jens Axboe

--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 2:30 am

Looks good to me.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Tuesday, March 4, 2008 - 2:29 am

Hello, Jens.


Oh well, I guess I'm the one with strange taste he
re.  My logic is that the only thing below the block layer is the driver
which requested size adjustment.  This means residual bytes calculation
is pushed to low level drivers which isn't anything major but still.
Anyways, I'll review FUJITA's modified patch.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Sunday, March 2, 2008 - 11:08 pm

rq->raw_data_len introduced for block layer padding and draining
(commit 6b00769fe1502b4ad97bb327ef7ac971b208bfb5) broke residual byte
count handling.  Block drivers modify rq->data_len to notify residual
byte count to the block layer which blindly reported unmodified
rq->raw_data_len to userland.

To keep block drivers dealing only with rq->data_len, this should be
handled inside block layer.  However, how much extra buffer was
appened is lost after rq->data_len is modified.

This patch replaces rq->raw_data_len with rq->extra_len and add
blk_rq_raw_data_len() helper to calculate raw data size from
rq->data_len and rq->extra_len.  The helper returns correct raw
residual byte count when called on a rq whose data_len is modified to
carry residual byte count.

This problem was reported and diagnosed by Mike Galbraith.

Signed-off-by: Tejun Heo <htejun@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
---
Comments updated compared to the previous version.

 block/blk-core.c          |    3 +--
 block/blk-map.c           |    2 +-
 block/blk-merge.c         |    1 +
 block/blk-settings.c      |    4 ++++
 block/bsg.c               |    8 ++++----
 block/scsi_ioctl.c        |    4 ++--
 drivers/ata/libata-scsi.c |    3 ++-
 include/linux/blkdev.h    |    8 +++++++-
 8 files changed, 22 insertions(+), 11 deletions(-)

Index: work/block/blk-core.c
===================================================================
--- work.orig/block/blk-core.c
+++ work/block/blk-core.c
@@ -127,7 +127,7 @@ void rq_init(struct request_queue *q, st
 	rq->nr_hw_segments = 0;
 	rq->ioprio = 0;
 	rq->special = NULL;
-	rq->raw_data_len = 0;
+	rq->extra_len = 0;
 	rq->buffer = NULL;
 	rq->tag = -1;
 	rq->errors = 0;
@@ -2016,7 +2016,6 @@ void blk_rq_bio_prep(struct request_queu
 	rq->hard_cur_sectors = rq->current_nr_sectors;
 	rq->hard_nr_sectors = rq->nr_sectors = bio_sectors(bio);
 	rq->buffer = bio_data(bio);
-	rq->raw_data_len = bio->bi_size;
 	rq->data_len = bio->bi_size;
 
 	rq->bio = ...
From: Tejun Heo
Date: Sunday, March 2, 2008 - 11:10 pm

Block layer alignment was used for two different purposes - memory
alignment and padding.  This causes problems in lower layers because
drivers which only require memory alignment ends up with adjusted
rq->data_len.  Separate out padding such that padding occurs iff
driver explicitly requests it.

Signed-off-by: Tejun Heo <htejun@gmail.com>
---
As wrote before, the major problem was that drivers which don't want
size adjustment got it acciedentally by mixing up aligning and padding
which are two conceptually separate things.  Let padding occur iff the
driver explicitly requested it.  This makes both parties happy.

 block/blk-map.c           |   16 +++++++++-------
 block/blk-settings.c      |   17 +++++++++++++++++
 drivers/ata/libata-scsi.c |    3 ++-
 include/linux/blkdev.h    |    2 ++
 4 files changed, 30 insertions(+), 8 deletions(-)

Index: work/block/blk-settings.c
===================================================================
--- work.orig/block/blk-settings.c
+++ work/block/blk-settings.c
@@ -293,6 +293,23 @@ void blk_queue_stack_limits(struct reque
 EXPORT_SYMBOL(blk_queue_stack_limits);
 
 /**
+ * blk_queue_dma_pad - set pad mask
+ * @q:     the request queue for the device
+ * @mask:  pad mask
+ *
+ * Set pad mask.  Direct IO requests are padded to the mask specified.
+ *
+ * Appending pad buffer to a request modifies ->data_len such that it
+ * includes the pad buffer.  The original requested data length can be
+ * obtained using blk_rq_raw_data_len().
+ **/
+void blk_queue_dma_pad(struct request_queue *q, unsigned int mask)
+{
+	q->dma_pad_mask = mask;
+}
+EXPORT_SYMBOL(blk_queue_dma_pad);
+
+/**
  * blk_queue_dma_drain - Set up a drain buffer for excess dma.
  *
  * @q:  the request queue for the device
Index: work/block/blk-map.c
===================================================================
--- work.orig/block/blk-map.c
+++ work/block/blk-map.c
@@ -43,6 +43,7 @@ static int __blk_rq_map_user(struct requ
 			     void __user *ubuf, ...
From: James Bottomley
Date: Monday, March 3, 2008 - 11:27 am

This puts the libsas SMP handler back into a working state again.

Thanks,

James


--

From: Jeff Garzik
Date: Tuesday, February 26, 2008 - 5:46 pm

I would love to get an answer as to what data_len (and of course 
raw_data_len) should be set to AFTER the command completes, which is 
what is going on here.

I can see the above being correct -- scsi_get_resid() returns the length 
of the left-over data after the command is processed -- but I am mainly 
curious why setting [raw_]data_len matters after I/O completion.

	Jeff


--

From: Mike Galbraith
Date: Tuesday, February 26, 2008 - 7:58 pm

Yeah, blk_complete_sghdr_rq() used to do hdr->resid = irq->data_len,
which is modified down lower.  How/where that hdr->resid percolates back
up, and turns into a retry/nogo, I don't know.

	-Mike

--

Previous thread: [PATCH 2/2] x86 : relocate uninitialized variable in init DATA section into init BSS section by Huang, Ying on Thursday, February 21, 2008 - 1:15 am. (8 messages)

Next thread: [PATCH] UDF: fix anchor point detection by Pavel Emelyanov on Thursday, February 21, 2008 - 1:55 am. (1 message)