Re: [patch 0/1] Apply segment size and segment boundary to integrity data

Previous thread: [PATCH] x86, cpu: Add Intel CPUID flags: ECMD, PLN, PTM and TSC_DEADLINE by John Villalovos on Thursday, July 15, 2010 - 8:27 am. (1 message)

Next thread: Re: Workaround hardware bug addressing physical memory by Unai Uribarri on Thursday, July 15, 2010 - 9:46 am. (2 messages)
From: Christof Schmitt
Date: Thursday, July 15, 2010 - 8:34 am

While experimenting with the data integrity support in the Linux
kernel, i found that the block layer integrity code can send integrity
data segments for a request that do not adhere to the queue limits. 
The integrity data segment can be larger than queue_max_segment_size
and the segment does not adhere to the queue_segment_boundary.

It appears to me that the right way would be to apply the same
restrictions that are in place for data segments also to integrity
data segments. The patch works for my experiments and applies on top
of the current Linux tree (2.6.35-rc5).

Christof
--

From: Christof Schmitt
Date: Thursday, July 15, 2010 - 8:34 am

From: Christof Schmitt <christof.schmitt@de.ibm.com>

Apply the conditions used in __blk_recalc_rq_segments also to
integrity data: Adhere to the maximum segment size and the segment
boundary set by the driver. Without this change, a driver would
receive integrity data blocks that do not adhere to the limits set for
the request queue.

Signed-off-by: Christof Schmitt <christof.schmitt@de.ibm.com>
---
 block/blk-integrity.c |   18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -41,15 +41,22 @@ int blk_rq_count_integrity_sg(struct req
 {
 	struct bio_vec *iv, *ivprv;
 	struct req_iterator iter;
-	unsigned int segments;
+	unsigned int segments, seg_size;
 
 	ivprv = NULL;
 	segments = 0;
+	seg_size = 0;
 
 	rq_for_each_integrity_segment(iv, rq, iter) {
 
-		if (!ivprv || !BIOVEC_PHYS_MERGEABLE(ivprv, iv))
+		if (!ivprv ||
+		    !BIOVEC_PHYS_MERGEABLE(ivprv, iv) ||
+		    seg_size + iv->bv_len > queue_max_segment_size(rq->q) ||
+		    !BIOVEC_SEG_BOUNDARY(rq->q, ivprv, iv)) {
 			segments++;
+			seg_size = iv->bv_len;
+		} else
+			seg_size += iv->bv_len;
 
 		ivprv = iv;
 	}
@@ -81,9 +88,16 @@ int blk_rq_map_integrity_sg(struct reque
 	rq_for_each_integrity_segment(iv, rq, iter) {
 
 		if (ivprv) {
+			if (sg->length + iv->bv_len
+				    > queue_max_segment_size(rq->q))
+				goto new_segment;
+
 			if (!BIOVEC_PHYS_MERGEABLE(ivprv, iv))
 				goto new_segment;
 
+			if (!BIOVEC_SEG_BOUNDARY(rq->q, ivprv, iv))
+				goto new_segment;
+
 			sg->length += iv->bv_len;
 		} else {
 new_segment:

--

From: Jens Axboe
Date: Thursday, July 15, 2010 - 8:53 am

Thanks. The level of duplication there is nasty, particularly for such
intricate code as this that has had bugs in the past in the core.

-- 
Jens Axboe

--

From: Martin K. Petersen
Date: Thursday, July 15, 2010 - 9:03 am

>>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:

Christof> While experimenting with the data integrity support in the
Christof> Linux kernel, i found that the block layer integrity code can
Christof> send integrity data segments for a request that do not adhere
Christof> to the queue limits.  The integrity data segment can be larger
Christof> than queue_max_segment_size and the segment does not adhere to
Christof> the queue_segment_boundary.

Correct.  That was a deliberate design decision.

Modern HBAs allow essentially indefinite chaining and our block layer
segmentation controls are to some extent legacy baggage.  I did not want
to put in a set of constraints on the DI scatterlist because I was
afraid it would encourage vendors to actually them.


Christof> It appears to me that the right way would be to apply the same
Christof> restrictions that are in place for data segments also to
Christof> integrity data segments. The patch works for my experiments
Christof> and applies on top of the current Linux tree (2.6.35-rc5).

Who says constraints on the integrity scatterlist are the same as on the
data ditto?  In my experience they are not.  If you must do this, then
the DI constraints should be separate from the data segmentation ones.
But I'm interested in what motivated this change to begin with.

Your change also has repercussions when merging requests and bios.  We'd
need to honor the DI segmentation constraints when merging.  Otherwise
we may end up going beyond the controller limits when mapping the sgl.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Jens Axboe
Date: Thursday, July 15, 2010 - 9:14 am

That sounds like a very batch design decision. Either the limits are
explicitly given and different, or if not we have to assume that they
are the same as the data limits at least.

So do they have limits or not? Essentially indefinite, what does that
mean? If they are limited, then we must have settings to cope and map
around those. If these limits are not the same as the regular data

Yes me too, what bug triggered this patch?

-- 
Jens Axboe

--

From: Martin K. Petersen
Date: Thursday, July 15, 2010 - 9:35 am

>>>>> "Jens" == Jens Axboe <axboe@kernel.dk> writes:

Jens> That sounds like a very batch design decision. Either the limits
Jens> are explicitly given and different, or if not we have to assume
Jens> that they are the same as the data limits at least.

Imagine a controller that has a 4KB segment, 1 entry limit.  If we
capped the DI sgl the same way as the data we'd only be able to issue
512-byte requests unless the DI entries happened to be contiguous in
memory.

For several types of I/O the DI sgl is much longer than the data sgl.
Especially if the submitter is using buffer_heads to map 512-byte
blocks.

And consequently we require vendors to be able to handle the
pathological case in which any data scatterlist honoring the
segmentation constraints given by the driver can be matched with an
integrity scatterlist in which there is a separate entry for each
logical block.  No vendor has had any problems with this.  Therefore
there are no block layer data integrity queue limits.

If a device appears that does in fact have constraints I have no
problems intruducing a set of suitable knobs.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Jens Axboe
Date: Thursday, July 15, 2010 - 9:40 am

OK, lets wait and hear what problem that Christof ran into here.
Is it ensuring that a segment doesn't corss eg the 4GB boundary?

-- 
Jens Axboe

--

From: Christof Schmitt
Date: Friday, July 16, 2010 - 1:30 am

The motivation stems from research how the integrity data can be
mapped to the hardware interface used by the zfcp driver. When passing
data segments to the zfcp hardware controller, there is the constraint
that each data segment has a maximum size of 4k and a segment must not
cross a 4k boundary. Right now, this is done by reporting the
maximum segment size and segment boundary accordingly from zfcp. When
issuing a request, zfcp simply walks the sg list and passes the
segments to the hardware controller, no mapping or readjustment is
necessary in the driver.

Adding integrity data to this interface implies that the integrity
data segments are passed the same way and with the same restrictions.
integrity data segments.  In fact, i am planning to post an
experimental patch for zfcp for upstream inclusion. While this is
still research, it does not affect non-integrity I/O and it will ease
future work on the integrity data mapping for zfcp.

Maybe my thinking is too much with the zfcp hardware interface where
it is obvious to have the same constraints for everything passed along
to the hardware. Another motivation is that i do not want to have the
need in the driver to re-map or copy data segments, when the block
layer already has a generic method of doing this.

What would be the right approach for hardware that has specific
constraints for integrity data? Add an interface for reporting
integrity data constraints independently of constraints for regular

Meaning the integrity data sg list would have more entries than
max_segments? I have not seen this during my experiments, but then i
likely have not hit every case of a possible request layout.

Christof
--

From: Martin K. Petersen
Date: Monday, July 19, 2010 - 9:45 pm

>>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:

Christof,

Christof> The motivation stems from research how the integrity data can
Christof> be mapped to the hardware interface used by the zfcp
Christof> driver. When passing data segments to the zfcp hardware
Christof> controller, there is the constraint that each data segment has
Christof> a maximum size of 4k and a segment must not cross a 4k
Christof> boundary.

Ok.


Christof> Right now, this is done by reporting the maximum segment size
Christof> and segment boundary accordingly from zfcp. When issuing a
Christof> request, zfcp simply walks the sg list and passes the segments
Christof> to the hardware controller, no mapping or readjustment is
Christof> necessary in the driver.

In that case I don't really have a problem with adhering to the queue
segment size and segment boundary for the integrity metadata.  As long
as we avoid using the max number of data segments to cap the integrity
scatterlist because that'll definitely break a lot of stuff.


Christof> Meaning the integrity data sg list would have more entries
Christof> than max_segments? I have not seen this during my experiments,
Christof> but then i likely have not hit every case of a possible
Christof> request layout.

It happens all the time.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Christof Schmitt
Date: Tuesday, July 20, 2010 - 2:28 am

Yes. There is a hardware limit for the maximum number of scatterlist
entries the driver can send to the hardware at the same time. For
adding integrity data, we have to come up with a way to map both,
integrity data and user data in the same hardware request.

This is the experimental zfcp integrity data patch i sent for upstream
inclusion: http://marc.info/?l=linux-scsi&m=127928781200392&w=2 The
approach is that the driver has to adhere to the hardware constraint
of sum of all data segments (user plus integrity data). To have a
simple approach that covers the case with one integrity data segment
per user data segment, we only report half the size for the
scatterlist length when running DIX. This guarantees that the other

Ok, i have to look into that as well. It would be an issue with the
approach we are looking at now: If there are max_segments data
segments, and more than max_segments integrity data segments, we will
overrun the hardware constraint.

Christof
--

From: Martin K. Petersen
Date: Tuesday, July 20, 2010 - 9:20 pm

>>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:

Christof> To have a simple approach that covers the case with one
Christof> integrity data segment per user data segment, we only report
Christof> half the size for the scatterlist length when running
Christof> DIX. This guarantees that the other half can be used for
Christof> integrity data.

Yup, a few of our partners did something similar.

My concern is the scenario where we submit lots of 512-byte writes that
get merged into (in your case) 4 KB segments.  Each of those 512-byte
writes could come with an 8-byte integrity metadata tuple.  And so you'd
need 8 DI scatterlist elements per data element.


Christof> Meaning the integrity data sg list would have more entries
Christof> than max_segments? I have not seen this during my experiments,
Christof> but then i likely have not hit every case of a possible
Christof> request layout.

dd to the block device is usually a good way to issue long scatterlists.


Christof> Ok, i have to look into that as well. It would be an issue
Christof> with the approach we are looking at now: If there are
Christof> max_segments data segments, and more than max_segments
Christof> integrity data segments, we will overrun the hardware
Christof> constraint.

Ok.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Christof Schmitt
Date: Monday, August 2, 2010 - 4:05 am

After looking at the given facts, this seems to be the missing part:
The zfcp hardware interface has an maximum number of data segments
that can be part of one request. In the experimental zfcp DIF/DIX
patch (now in scsi-misc), zfcp reserves half of the data segments
for integrity data. But if small bios have been merged until hitting
queue_max_segments, there may be more integrity data segments.

To summarize the limits i see in the zfcp hardware:
- Maximum size of 4k per segment
- The segments must not cross page boundaries
- The number of segments per request is limited

My preferred approach would be to set the limits on the queue, so that
the request adheres to the hardware limitations and can be passed on
directly to the hardware. I would like to avoid splitting large
segments in the driver code, and i especially want to avoid having to
copy the integrity data to new buffers to adhere to the hardware
constraints.

Looking at the block layer, the number of integrity data segments
could be limited with an additional check in ll_new_hw_segment.

What would be the preferred approach for handling the integrity data
limits in the block layer? Introduce new queue limits for integrity
data, or assume that the limits for integrity data are the same as for
user data? I can update my patch accordingly and include a check for
the maximum number of segments.

Christof
--

From: Martin K. Petersen
Date: Monday, August 2, 2010 - 9:44 pm

>>>>> "Christof" == Christof Schmitt <christof.schmitt@de.ibm.com> writes:

Christof> To summarize the limits i see in the zfcp hardware:
Christof> - Maximum size of 4k per segment
Christof> - The segments must not cross page boundaries
Christof> - The number of segments per request is limited

The interesting thing here is that your hw has a limit for the total
number of segments whereas other DIX HBAs have separate limits for data
and integrity scatterlists.


Christof> What would be the preferred approach for handling the
Christof> integrity data limits in the block layer? Introduce new queue
Christof> limits for integrity data, or assume that the limits for
Christof> integrity data are the same as for user data? I can update my
Christof> patch accordingly and include a check for the maximum number
Christof> of segments.

I've been messing with this tonight.  It's not entirely trivial because
of the housekeeping involved, having to accomodate different types of
hardware, having to avoid breaking existing setups, and having to work
with integrity compiled and without.

My first attempt at this got quite messy.  I think I have found a way
but it's bedtime here.  Give me a day or two to get back to you with
something that'll hopefully work for everyone.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Christof Schmitt
Date: Wednesday, August 11, 2010 - 1:07 am

Yes, the hw interface only has a limit for the total number. The best
solution would be an interface that allows reporting this limit to the
block layer. If this is not possible, or deemed too exotic, reporting
seperate limits for integrity segments and data segments would also be

Ok, when you have something, i can have a look at it and see if it
matches the requirements here.

Thanks,

Christof
--

Previous thread: [PATCH] x86, cpu: Add Intel CPUID flags: ECMD, PLN, PTM and TSC_DEADLINE by John Villalovos on Thursday, July 15, 2010 - 8:27 am. (1 message)

Next thread: Re: Workaround hardware bug addressing physical memory by Unai Uribarri on Thursday, July 15, 2010 - 9:46 am. (2 messages)