Another post of my block I/O data integrity patches. This kit goes on
top of the scsi_data_buffer and sd.h cleanups I posted earlier today.
There has been no changes to the block layer code since my last
submission.
Within SCSI, the changes are cleanups based on comments from Christoph
as well as working support for Type 3 and 4KB sectors.
What's This All About?
----------------------
These patches allow data integrity information (checksum and more) to
be attached to I/Os at the block/filesystem layers and transferred
through the entire I/O stack all the way to the physical storage
device.
The integrity metadata can be generated in close proximity to the
original data. Capable host adapters, RAID arrays and physical disks
can verify the data integrity and abort I/Os in case of a mismatch.
Right now this is SCSI disk only but similar efforts are in progress
for SATA and SCSI tape. With a few minor nits due to protocol
limitations the proposed SATA format is identical to the SCSI ditto
for easy interoperability.
T10 DIF
-------
SCSI drives can usually be reformatted to 520-byte sectors, yielding 8
extra bytes per sector. These 8 bytes have traditionally been used by
RAID controllers to store internal protection information.
DIF (Data Integrity Field) is an extension to the SCSI Block Commands
that standardizes the format of the 8 extra bytes and defines ways to
interact with the contents at the protocol level.
Each 8-byte DIF tuple is split into three chunks:
- a 16-bit guard tag containing a CRC of the data portion of
the sector.
- a 16-bit application tag which is up for grabs.
- a 32-bit reference tag which contains an incrementing
counter for each sector. For DIF Type 1 it also needs to
match the physical LBA on the drive.
There are three types of DIF defined: Type 1, Type 2, and Type 3.
These patches support Type 1 and Type 3. Type 2 depends on 32-byte
CDBs and is work in progress.
Since the DIF ...2 files changed, 26 insertions(+)
fs/bio.c | 24 ++++++++++++++++++++++++
include/linux/bio.h | 2 ++
Helper function to find the sector offset in a bio given bvec index
and page offset.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
diff -r 550d61001baa -r 318fa71e735d fs/bio.c
--- a/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
+++ b/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
@@ -1232,6 +1232,30 @@
return bp;
}
+sector_t bio_sector_offset(struct bio *bio, unsigned short index, unsigned int offset)
+{
+ struct bio_vec *bv;
+ unsigned int sector_sz = bio->bi_bdev->bd_disk->queue->hardsect_size;
+ sector_t sectors;
+ int i;
+
+ sectors = 0;
+
+ BUG_ON(index >= bio->bi_vcnt);
+
+ bio_for_each_segment(bv, bio, i) {
+ if (i == index) {
+ if (offset > bv->bv_offset)
+ sectors += (offset - bv->bv_offset) / sector_sz;
+ return sectors;
+ }
+
+ sectors += bv->bv_len / sector_sz;
+ }
+
+ BUG();
+}
+EXPORT_SYMBOL(bio_sector_offset);
/*
* create memory pools for biovec's in a bio_set.
diff -r 550d61001baa -r 318fa71e735d include/linux/bio.h
--- a/include/linux/bio.h Sat Jun 07 00:45:14 2008 -0400
+++ b/include/linux/bio.h Sat Jun 07 00:45:14 2008 -0400
@@ -315,6 +315,8 @@
extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
unsigned int, unsigned int);
extern int bio_get_nr_vecs(struct block_device *);
+extern sector_t bio_sector_offset(struct bio *, unsigned short, unsigned int);
+
extern struct bio *bio_map_user(struct request_queue *, struct block_device *,
unsigned long, unsigned int, int);
struct sg_iovec;
--
Unless I've missed something, this helper isn't used anywhere in your patchset. Cheers, Jeff --
>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes: Jeff> Unless I've missed something, this helper isn't used anywhere in Jeff> your patchset. You are right. bio_sector_offset() is used in the patch that adds integrity support to DM. I didn't post the MD/DM patches this time around in an attempt to limit the number of patches and focus on the core code. You can see the DM patch here: http://oss.oracle.com/mercurial/mkp/linux-2.6-di/rev/6b79173c501a -- Martin K. Petersen Oracle Linux Engineering --
Reviewed-by: Jeff Moyer <jmoyer@redhat.com> --
2 files changed, 38 insertions(+), 28 deletions(-)
fs/bio.c | 36 ++++++++----------------------------
include/linux/bio.h | 30 ++++++++++++++++++++++++++++++
Move struct bio_set and biovec_slab definitions to bio.h so they can
be used outside of bio.c.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
diff -r fb6f53e13ff3 -r 550d61001baa fs/bio.c
--- a/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
+++ b/fs/bio.c Sat Jun 07 00:45:14 2008 -0400
@@ -28,24 +28,9 @@
#include <linux/blktrace_api.h>
#include <scsi/sg.h> /* for struct sg_iovec */
-#define BIO_POOL_SIZE 2
-
static struct kmem_cache *bio_slab __read_mostly;
-#define BIOVEC_NR_POOLS 6
-
-/*
- * a small number of entries is fine, not going to be performance critical.
- * basically we just need to survive
- */
-#define BIO_SPLIT_ENTRIES 2
mempool_t *bio_split_pool __read_mostly;
-
-struct biovec_slab {
- int nr_vecs;
- char *name;
- struct kmem_cache *slab;
-};
/*
* if you change this list, also change bvec_alloc or things will
@@ -60,23 +45,18 @@
#undef BV
/*
- * bio_set is used to allow other portions of the IO system to
- * allocate their own private memory pools for bio and iovec structures.
- * These memory pools in turn all allocate from the bio_slab
- * and the bvec_slabs[].
- */
-struct bio_set {
- mempool_t *bio_pool;
- mempool_t *bvec_pools[BIOVEC_NR_POOLS];
-};
-
-/*
* fs_bio_set is the bio_set containing bio and iovec memory pools used by
* IO code that does not need private memory pools.
*/
-static struct bio_set *fs_bio_set;
+struct bio_set *fs_bio_set;
-static inline struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
+inline int bvec_nr_vecs(int idx)
+{
+ return bvec_slabs[idx].nr_vecs;
+}
+EXPORT_SYMBOL(bvec_nr_vecs);
+
+struct bio_vec *bvec_alloc_bs(gfp_t gfp_mask, int nr, unsigned long *idx, struct bio_set *bs)
{
struct bio_vec *bvl;
diff -r fb6f53e13ff3 -r ...Don't sell yourself short; you also implemented bvec_nr_vecs. Looks okay to me. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> --
4 files changed, 83 insertions(+) include/linux/crc-t10dif.h | 8 +++++ lib/Kconfig | 7 ++++ lib/Makefile | 1 lib/crc-t10dif.c | 67 ++++++++++++++++++++++++++++++++++++++++++++ Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> --- diff -r b76a6f493f0c -r fb6f53e13ff3 include/linux/crc-t10dif.h --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/include/linux/crc-t10dif.h Sat Jun 07 00:45:14 2008 -0400 @@ -0,0 +1,8 @@ +#ifndef _LINUX_CRC_T10DIF_H +#define _LINUX_CRC_T10DIF_H + +#include <linux/types.h> + +__u16 crc_t10dif(unsigned char const *, size_t); + +#endif diff -r b76a6f493f0c -r fb6f53e13ff3 lib/Kconfig --- a/lib/Kconfig Sat Jun 07 00:45:14 2008 -0400 +++ b/lib/Kconfig Sat Jun 07 00:45:14 2008 -0400 @@ -28,6 +28,13 @@ modules require CRC16 functions, but a module built outside the kernel tree does. Such modules that use library CRC16 functions require M here. + +config CRC_T10DIF + tristate "CRC calculation for the T10 Data Integrity Field" + help + This option is only needed if a module that's not in the + kernel tree needs to calculate CRC checks for use with the + SCSI data integrity subsystem. config CRC_ITU_T tristate "CRC ITU-T V.41 functions" diff -r b76a6f493f0c -r fb6f53e13ff3 lib/Makefile --- a/lib/Makefile Sat Jun 07 00:45:14 2008 -0400 +++ b/lib/Makefile Sat Jun 07 00:45:14 2008 -0400 @@ -45,6 +45,7 @@ obj-$(CONFIG_BITREVERSE) += bitrev.o obj-$(CONFIG_CRC_CCITT) += crc-ccitt.o obj-$(CONFIG_CRC16) += crc16.o +obj-$(CONFIG_CRC_T10DIF)+= crc-t10dif.o obj-$(CONFIG_CRC_ITU_T) += crc-itu-t.o obj-$(CONFIG_CRC32) += crc32.o obj-$(CONFIG_CRC7) += crc7.o diff -r b76a6f493f0c -r fb6f53e13ff3 lib/crc-t10dif.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/lib/crc-t10dif.c Sat Jun 07 00:45:14 2008 -0400 @@ -0,0 +1,67 @@ +/* + * T10 Data Integrity Field CRC16 calculation + * + * Copyright (c) 2007 Oracle Corporation. All rights reserved. + * ...
5 files changed, 90 insertions(+)
drivers/scsi/Kconfig | 15 +++++++++++++++
drivers/scsi/scsi_lib.c | 42 ++++++++++++++++++++++++++++++++++++++++++
drivers/scsi/scsi_scan.c | 3 +++
include/scsi/scsi_cmnd.h | 29 +++++++++++++++++++++++++++++
include/scsi/scsi_device.h | 1 +
- Add support for an extra scatter-gather list containing protection
information.
- Remember devices with protection information turned on in INQUIRY.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
---
diff -r 5be7c534c954 -r ad65bfde4e05 drivers/scsi/Kconfig
--- a/drivers/scsi/Kconfig Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/Kconfig Sat Jun 07 00:45:15 2008 -0400
@@ -260,6 +260,21 @@
default m
depends on SCSI
depends on MODULES
+
+config SCSI_PROTECTION
+ bool "SCSI Data Integrity Protection"
+ depends on SCSI
+ depends on BLK_DEV_INTEGRITY
+ help
+ Some SCSI devices support data protection features above and
+ beyond those implemented in the transport. Select this
+ option to enable protection information to be transferred to
+ and from a device. Specifically, this option will enable DIF
+ (Data Integrity Field) for SCSI disks.
+
+ The SCSI protection features depend on the block layer data
+ integrity infrastructure so the latter must be enabled for
+ this option to work.
menu "SCSI Transports"
depends on SCSI
diff -r 5be7c534c954 -r ad65bfde4e05 drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c Sat Jun 07 00:45:15 2008 -0400
+++ b/drivers/scsi/scsi_lib.c Sat Jun 07 00:45:15 2008 -0400
@@ -778,6 +778,13 @@
kmem_cache_free(scsi_sdb_cache, bidi_sdb);
cmd->request->next_rq->special = NULL;
}
+
+#if defined(CONFIG_SCSI_PROTECTION)
+ if (scsi_prot_sg_count(cmd)) {
+ scsi_free_sgtable(cmd->prot_sdb);
+ kmem_cache_free(scsi_sdb_cache, cmd->prot_sdb);
+ }
+#endif
}
EXPORT_SYMBOL(scsi_release_buffers);
@@ -1031,6 +1038,32 @@
return BLKPREP_OK;
}
+#if ...4 files changed, 825 insertions(+), 3 deletions(-) fs/Makefile | 1 fs/bio-integrity.c | 715 +++++++++++++++++++++++++++++++++++++++++++++++++++ fs/bio.c | 27 + include/linux/bio.h | 85 ++++++ Allows integrity metadata to be attached to a bio. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> --- diff -r 318fa71e735d -r f2ae9d5bce4c fs/Makefile --- a/fs/Makefile Sat Jun 07 00:45:14 2008 -0400 +++ b/fs/Makefile Sat Jun 07 00:45:15 2008 -0400 @@ -19,6 +19,7 @@ obj-y += no-block.o endif +obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o obj-$(CONFIG_INOTIFY) += inotify.o obj-$(CONFIG_INOTIFY_USER) += inotify_user.o obj-$(CONFIG_EPOLL) += eventpoll.o diff -r 318fa71e735d -r f2ae9d5bce4c fs/bio-integrity.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/fs/bio-integrity.c Sat Jun 07 00:45:15 2008 -0400 @@ -0,0 +1,715 @@ +/* + * bio-integrity.c - bio data integrity extensions + * + * Copyright (C) 2007, 2008 Oracle Corporation + * Written by: Martin K. Petersen <martin.petersen@oracle.com> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; see the file COPYING. If not, write to + * the Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, + * USA. + * + */ + +#include <linux/blkdev.h> +#include <linux/mempool.h> +#include <linux/bio.h> +#include <linux/workqueue.h> + +static struct kmem_cache *bio_integrity_slab __read_mostly; +static struct workqueue_struct *kintegrityd_wq; + +/** + * ...
Seems i've missing something. What is purpose of this black magic? do you want just express following? Why here and later sector_size == 4096 is so special, what about 1k and --
DM> Seems i've missing something. What is purpose of this black magic? DM> do you want just express following? nr_sectors = (len + DM> bi->tag_size - 1) / bi->tag_size; Yep. In the original DIF spec, only 2-byte tags were supported so the check for an odd length was a fast and elegant solution. But now that I implemented Type 3 which has 6 bytes of tag space that's an invalid DM> Why here and later sector_size == 4096 is so special, what about DM> 1k and 2k sect_sz? Do you want just transform value from 512 to DM> bi->sectors_size? Well, so only 512-byte DIF storage devices are currently available. The whole industry is in the process of transitioning to 4KB sectors. There will be no DIF devices with 1KB or 2KB sectors. And even as it stands it's unclear that 4KB sectors are going to look like they do in the current version of the spec. It's going to be an interoperability nightmare as it is now as the tag is attached to the hardware sector size. This means that it's still only 8 bytes of DIF for a device with 4KB sectors (IOW, 4104 bytes and not 4160). That means that *two* protection buffers would have to be generated for - say - a mirror with heterogeneous sector sizes. And tagging won't work as there's not the same space available for both drives of the mirror. The tag space problem also causes issues with RAID arrays exporting 512 byte sectors to the host but using drives with 4KB sectors in the back. Where is the array going to store the tags for the last 7 512 byte sectors? So 4KB vs. DIF is up in the air at this point. The current checks are there because I've been messing with 4KB sector devices for other reasons. And technically they are in accordance with the current spec. -- Martin K. Petersen Oracle Linux Engineering --
"Martin K. Petersen" <martin.petersen@oracle.com> writes:
I see you've changed this to:
nr_sectors = (len + bi->tag_size - 1) / bi->tag_size;
set_tag and get_tag are almost identical. Any chance you want to factor
^^^^^^^^^^
Hmm, up until this point you use bi to mean bio_integrity, but now
struct blk_integrity_exchg is not yet defined in your patch set, so this
The above two loops look pretty much the same to me. Can you factor
that out to a helper?
Cheers,
Jeff
--
Jeff> That assignment isn't necessary. Zap! Jeff> nr_sectors = (len + bi->tag_size - 1) / bi->tag_size; Jeff> why not simply use DIV_ROUND_UP? Fixed. Jeff> set_tag and get_tag are almost identical. Any chance you want Jeff> to factor out that code? Done. Jeff> Hmm, up until this point you use bi to mean bio_integrity, but Jeff> now it means blk_integrity. Confusion will ensue. ;) Err, uhm. There is no bio_integrity. There's the bio integrity payload which I always refer to as struct bip *bip. And struct blk_integrity which is always bi. I'm also anal about using bv for the data bio_vec and iv for the integrity bio_vec. I can't see any place where I'm inconsistent. Jeff> struct blk_integrity_exchg is not yet defined in your patch set, Jeff> so this will likely break git bisect. bio-integrity.patch and blk-integrity.patch are artificially split up to ease the review process. They are not meant to be separate Jeff> Does this actually need to be zeroed? Jeff> The above two loops look pretty much the same to me. Can you Jeff> factor that out to a helper? I've created helpers for marking head and tail of the ivec. -- Martin K. Petersen Oracle Linux Engineering --
OK, just wanted to make sure you were aware of it. Cheers, Jeff --
10 files changed, 849 insertions(+) Documentation/block/data-integrity.txt | 327 +++++++++++++++++++++++++++ block/Kconfig | 12 block/Makefile | 1 block/blk-core.c | 7 block/blk-integrity.c | 385 ++++++++++++++++++++++++++++++++ block/blk-merge.c | 3 block/blk.h | 8 block/elevator.c | 6 include/linux/blkdev.h | 97 ++++++++ include/linux/genhd.h | 3 Support for merging and mapping bio integrity metadata. Block device integrity type registration. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> --- diff -r f2ae9d5bce4c -r 5be7c534c954 Documentation/block/data-integrity.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/Documentation/block/data-integrity.txt Sat Jun 07 00:45:15 2008 -0400 @@ -0,0 +1,327 @@ +---------------------------------------------------------------------- +1. INTRODUCTION + +Modern filesystems feature checksumming of data and metadata to +protect against data corruption. However, the detection of the +corruption is done at read time which could potentially be months +after the data was written. At that point the original data that the +application tried to write is most likely lost. + +The solution is to ensure that the disk is actually storing what the +application meant it to. Recent additions to both the SCSI family +protocols (SBC Data Integrity Field, SCC protection proposal) as well +as SATA/T13 (External Path Protection) try to remedy this by adding +support for appending integrity metadata to an I/O. The integrity +metadata (or protection information in SCSI terminology) includes a +checksum for each sector as well as an incrementing counter that +ensures the individual sectors are written in the right order. And +for some protection schemes also that the I/O is written to the right +place on ...
All of the sysfs interfaces should be documented in Documentation/ABI/ not here, otherwise they will be tough to find and figure out how to use :) thanks, greg k-h --
Greg> All of the sysfs interfaces should be documented in Greg> Documentation/ABI/ not here, otherwise they will be tough to Greg> find and figure out how to use :) Will do. -- Martin K. Petersen Oracle Linux Engineering --
12 files changed, 935 insertions(+), 7 deletions(-) Documentation/scsi/data-integrity.txt | 57 ++ drivers/scsi/Kconfig | 1 drivers/scsi/Makefile | 2 drivers/scsi/scsi_error.c | 3 drivers/scsi/scsi_lib.c | 4 drivers/scsi/scsi_sysfs.c | 4 drivers/scsi/sd.c | 58 ++ drivers/scsi/sd.h | 22 + drivers/scsi/sd_dif.c | 644 +++++++++++++++++++++++++++++++++ include/scsi/scsi_cmnd.h | 3 include/scsi/scsi_dif.h | 140 +++++++ include/scsi/scsi_host.h | 4 Configure DMA of protection information and issue READ/WRITE commands with RDPROTECT/WRPROTECT set accordingly. Force READ CAPACITY(16) if the target has the PROTECT bit set and grab an extra byte of response (P_TYPE and PROT_EN are in byte 12). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> --- diff -r ad65bfde4e05 -r 8bc1728dc75a Documentation/scsi/data-integrity.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/Documentation/scsi/data-integrity.txt Sat Jun 07 00:45:15 2008 -0400 @@ -0,0 +1,57 @@ +---------------------------------------------------------------------- +1.0 INTRODUCTION + +For a general overview of the data integrity framework please consult +Documentation/block/data-integrity.txt. + +---------------------------------------------------------------------- +2.0 SCSI LAYER IMPLEMENTATION DETAILS + +The scsi_command has been extended with a scatterlist for the +integrity metadata. Note that all SCSI mid layer changes refer to +this using the term "protection information" which is what it is +called in the T10 spec. + +The term DIF (Data Integrity Field) is specific to SCSI disks (SBC). +The SCSI midlayer doesn't know, or care, about the contents of the +protection scatterlist, except it calls blk_rq_map_integrity_sg() +during command initialization. + + +2.1 SCSI DEVICE ...
Pointers to archives would have been appreciated. I can't, for the life Thanks for all of the great documentation. It would be good to include some instructions on how one would test this, and what testing you performed. Also, please use the '-p' switch to diff, as it makes reviewing patches much easier. I set out to try your changes, but ran into some problems. First, this patch set didn't apply cleanly to a git checkout. So, I grabbed your mercurial repository, but got a build failure: block/blk-core.c: In function 'generic_make_request': include/linux/bio.h:469: sorry, unimplemented: inlining failed in call to 'bio_i ntegrity_enabled': function body not available block/blk-core.c:1388: sorry, unimplemented: called from here make[1]: *** [block/blk-core.o] Error 1 make: *** [block] Error 2 Cheers, Jeff --
>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes: Jeff> Pointers to archives would have been appreciated. I can't, for Jeff> the life of me, find these. http://marc.info/?l=linux-scsi&m=121272302931588&w=2 http://marc.info/?l=linux-scsi&m=121278031605941&w=2 http://marc.info/?l=linux-scsi&m=121302438515260&w=2 http://marc.info/?l=linux-scsi&m=121278067906564&w=2 Jeff> Thanks for all of the great documentation. It would be good to Jeff> include some instructions on how one would test this, and what Jeff> testing you performed. modprobe scsi_debug dix=199 dif=1 guard=1 dev_size_mb=1024 num_parts=1 I'm testing with XFS and btrfs. Generally doing kernel builds, etc. ext2/3 are still problematic because they modify pages in flight. Jeff> I set out to try your changes, but ran into some problems. Jeff> First, this patch set didn't apply cleanly to a git checkout. I generally track Linus closely so it must be because of the patches you were missing. You can grab my patch stack here. It's always in sync with the hg repo: http://oss.oracle.com/~mkp/patches/ Jeff> block/blk-core.c: In function 'generic_make_request': Jeff> include/linux/bio.h:469: sorry, unimplemented: inlining failed Jeff> in call to 'bio_i ntegrity_enabled': function body not available Jeff> block/blk-core.c:1388: sorry, unimplemented: called from here Jeff> make[1]: *** [block/blk-core.o] Error 1 make: *** [block] Error Jeff> 2 Odd. Which compiler are you using? Compiles just fine for me on both EL5 and FC9. Judging from the error I'm guessing it's objecting to the inlining. Tried to work around it. Please pull, update and let me know whether that did the trick. -- Martin K. Petersen Oracle Linux Engineering --
So, is it safe to say that the library routines for integrity-aware file systems have not been tested at all? Specifically, I'm talking about: bio_integrity_tag_size bio_integrity_set_tag I did a new clone (just to be sure I got your change) and I get the same problem. I also can't see the changeset in the log, so are you sure you pushed it? I got rid of the inline in the definition in bio.h. The .c file didn't define the function as inline, so I didn't have to change it. It seems to be building now. Cheers, Jeff --
>>>>> "Jeff" == Jeff Moyer <jmoyer@redhat.com> writes: Jeff> So, is it safe to say that the library routines for Jeff> integrity-aware file systems have not been tested at all? Jeff> Specifically, I'm talking about: bio_integrity_tag_size Jeff> bio_integrity_set_tag bio_integrity_get_tag I have not tried using them from within a filesystem, if that's what you mean. But I have attached random strings to bios and read them back later. Jeff> gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-41) Ok, I'm trying to chase down a 5.2 box to figure out what the problem is. Maybe I'll just move that function to the header file. Jeff> I did a new clone (just to be sure I got your change) and I get Jeff> the same problem. I also can't see the changeset in the log, so Jeff> are you sure you pushed it? Yup, it's there. Jeff> I got rid of the inline in the definition in bio.h. The .c file Jeff> didn't define the function as inline, so I didn't have to change Jeff> it. It seems to be building now. The problem is that your gcc is unhappy about the fact that the inlined function is defined elsewhere. The gcc info page said only declare it inline in the header and not the declaration. The change I pushed removed inline from the .c file. But that didn't help. -- Martin K. Petersen Oracle Linux Engineering --
I was checking to see if you had exercised the code at all, and you have. Great! Cheers, Jeff --
On Tue, Jun 10, 2008 at 11:28 AM, Martin K. Petersen Have you made the ext2/3/4 developers aware of this? Could you elaborate on the interaction between the data integrity support in the block layer and a given filesystem? Shouldn't _any_ filesystem "just work" given that the block layer is what is generating the checksums and then verifying them on read? regards, Mike --
Mike> Have you made the ext2/3/4 developers aware of this? Yep. Mike> Shouldn't _any_ filesystem "just work" given that the block Mike> layer is what is generating the checksums and then verifying Mike> them on read? Yep. There are a couple of issues. One problem is that pages are no longer locked down during I/O. Instead the writeback bit is being set to indicate that I/O is in progress. Not all corners of ext* have been adapted to that properly. Especially ext2 suffers and often modifies pages containing metadata while they are in flight. If I remember correctly, ext2/dir.c hasn't been made aware of writeback at all and assumes the page lock still works like it used to. That is normally not a huge problem because the page is being scheduled for write again shortly thereafter. So the inconsistent block on disk gets overwritten pretty much instantly. But that kind of sloppy behavior is a no-go with integrity checking turned on. There also appears to be some quirks in the page cache in general. There's something not quite right in clear_page_dirty() / page_mkwrite() territory. If I sync excessively I can make any fs keel over. peterz said that an mmapped page is supposed to be read-only during writeback but that appears to be racy when a forced sync is involved. That's my recollection, anyway. I've been busy with the innards of the integrity code stuff for a couple of months and haven't poked at the fs/vm issues for a while. -- Martin K. Petersen Oracle Linux Engineering --
