Re: tools support for non-512 byte sector sizes

Previous thread: Re: [patch 2/4] Configure out file locking features by Matthew Wilcox on Tuesday, July 29, 2008 - 2:17 pm. (6 messages)

Next thread: Kernel oops in 2.6.27-rc1 qdisc. by Steven Jan Springl on Tuesday, July 29, 2008 - 2:23 pm. (3 messages)
To: Ric Wheeler <rwheeler@...>
Cc: <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Martin Petersen <mkp@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 2:26 pm

Matt Domsch spoke with me about this at OLS. I took that opportunity,
and I'll take this one, to pimp my ata-ram driver which allows you to
alter the sector size to whatever you want:

http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=at...

I'll admit to having not tested it with anything other than 512, but it
ought to support 4096 byte sectors just fine. I haven't looked at what
would be required to support 520-byte sectors.

Jeff, any interest in merging ata-ram soon? I've got some users inside
Intel, and Zab persuaded me to add the multiple port support last night,
so it's not just useful for me. I think it's also a nice template to

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Martin Petersen <mkp@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 5:54 pm

I'm happy to include ata_ram whenever it is working and you think it's
ready for inclusion.

Jeff

--

To: Matthew Wilcox <matthew@...>
Cc: Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 2:41 pm

>>>>> "Matthew" == Matthew Wilcox <matthew@wil.cx> writes:

Matthew> I'll admit to having not tested it with anything other than
Matthew> 512, but it ought to support 4096 byte sectors just fine. I
Matthew> haven't looked at what would be required to support 520-byte
Matthew> sectors.

I recently added support for multiple sector sizes to scsi_debug. On a
recent kernel you can modprobe scsi_debug sector_size=4096.

I have only tested 4KB but it also supports 1 and 2KB.

--
Martin K. Petersen Oracle Linux Engineering

--

To: Matthew Wilcox <matthew@...>
Cc: Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Martin Petersen <mkp@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 2:37 pm

scsi_debug does exactly the same thing, so it reports anything you tell
it (Martin Petersen actually added this so he could test with 4k
sectors).

The problem, which ata_ram also suffers, is that the tools we most need
to test are the ones for manipulating non volatile characteristics (like
partition tables). We'd really like the disk contents to survive reboot
for this ...

James

--

To: James Bottomley <James.Bottomley@...>
Cc: Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Martin Petersen <mkp@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Wednesday, July 30, 2008 - 1:51 am

SCST (http://scst.sf.net) fully supports non-512-byte sectors up to
4096. Available target drivers for transports: software iSCSI, FC,
InfiniBand SRP, parallel SCSI, SAS (not much tested, because of lack of
hardware). With the VDISK dev handler you can use files as backing storage.

I have personally been working with 4K sectors for a long time, because
it's better for performance, but so far found the only tool, which
--

To: James Bottomley <James.Bottomley@...>
Cc: Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 2:48 pm

>>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes:

James> The problem, which ata_ram also suffers, is that the tools we
James> most need to test are the ones for manipulating non volatile
James> characteristics (like partition tables). We'd really like the
James> disk contents to survive reboot for this ...

Yeah, I should add that I wanted persistence too. I went through a
whole stack (well, 5-6 or so) of Fibre Channel drives from various
vendors and attempted to low-level format them to 4KB sectors. Most
of them laughed in my face. One of them tried to comply and
irreparably confused its firmware in the process.

Just yesterday I received a couple of prototype drives in the mail.
I'll ask the vendor whether they support 4KB and if so I'll give them
a whirl.

--
Martin K. Petersen Oracle Linux Engineering

--

To: Martin K. Petersen <martin.petersen@...>
Cc: James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Wednesday, July 30, 2008 - 9:51 am

I have access to disks with native 4KB sectors now too. Would
interested parties be willing to share test plans, so we could be sure
we have coverage wrt correctness (kernel internals, userspace tools like
parted, fdisk, kpartx, apps using O_DIRECT)? Benchmarking winds up being an
NDA activity this early in the game so I don't want the focus of any
joint work to be benchmarks yet.

--
Matt Domsch
Linux Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux
--

To: Matt Domsch <Matt_Domsch@...>
Cc: Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Friday, August 1, 2008 - 12:11 pm

Are they SCSI? I just got round to trying 4k sector sizes in ata_ram
(after adding file backed capability) and found that libata currently
silently ignores the identify bits that report sector size. I'll work
on fixing that this afternoon if nobody beats me to it.
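
For readers unfamiliar with the identify bits in question, here is a minimal sketch of how a logical sector size can be decoded from IDENTIFY DEVICE data. It is not the patch discussed in this thread. The word offsets (106, 117, 118) and bit meanings are taken from the ATA-8 layout as best recalled, and the helper name id_logical_sector_bytes is hypothetical.

```c
#include <stdint.h>

/* Word offsets in the 256-word IDENTIFY DEVICE data (assumed ATA-8 layout). */
#define ID_WORD_SECTOR_INFO      106
#define ID_WORD_LOGICAL_SIZE_LO  117
#define ID_WORD_LOGICAL_SIZE_HI  118

#define DEFAULT_SECTOR_BYTES     512u

/*
 * Hypothetical helper: return the logical sector size in bytes.
 * Word 106 is treated as valid when bit 14 is set and bit 15 is clear.
 * Bit 12 then indicates that words 117-118 hold the logical sector
 * size, expressed in 16-bit words.
 */
uint32_t id_logical_sector_bytes(const uint16_t *id)
{
    uint16_t info = id[ID_WORD_SECTOR_INFO];

    if ((info & 0xC000) != 0x4000)   /* word 106 carries no valid data */
        return DEFAULT_SECTOR_BYTES;
    if (!(info & 0x1000))            /* logical sector is the default 256 words */
        return DEFAULT_SECTOR_BYTES;

    uint32_t words = ((uint32_t)id[ID_WORD_LOGICAL_SIZE_HI] << 16)
                   | id[ID_WORD_LOGICAL_SIZE_LO];
    return words * 2u;               /* convert 16-bit words to bytes */
}
```

A device that leaves word 106 at zero falls back to 512 bytes, which matches the behaviour expected from older drives.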

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Tuesday, August 5, 2008 - 12:57 pm

yes (SAS).

--
Matt Domsch
Linux Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux
--

To: Matt Domsch <Matt_Domsch@...>
Cc: Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Tuesday, August 5, 2008 - 12:54 pm

OK, I have patches. I'll send them to linux-ide. If anyone wants to
try them, I pushed out two trees; one for libata:

http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=at...

and one for ata_ram supporting:
- large sectors
- file backing
http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=at...

I hope that will help some more people do testing.

Here's the dmesg from running:

$ sudo modprobe ata_ram sector_size=4096 capacity=262144 nr_ports=2

(note that you'll need at least 2.5GB of ram in your machine to try this,
or Linux gets really unhappy. You can, of course, reduce the capacity.
Would there be interest in a lazily allocated option for ata_ram?)

[ 1134.017240] scsi7 : ata_ram
[ 1134.017420] scsi8 : ata_ram
[ 1134.017489] ata8: SATA max UDMA/133 ata_ram_0
[ 1134.017495] ata9: SATA max UDMA/133 ata_ram_1
[ 1134.017557] ata8.00: ATA-8: Linux RAM Drive, 0.01, max UDMA7
[ 1134.017563] ata8.00: 262144 sectors, multi 0: LBA
[ 1134.017602] ata8.00: configured for UDMA/133
[ 1134.017631] ata9.00: ATA-8: Linux RAM Drive, 0.01, max UDMA7
[ 1134.017636] ata9.00: 262144 sectors, multi 0: LBA
[ 1134.017668] ata9.00: configured for UDMA/133
[ 1134.035741] scsi 7:0:0:0: Direct-Access ATA Linux RAM Drive 0.01 PQ: 0 ANSI: 5
[ 1134.035904] sd 7:0:0:0: [sdb] 262144 4096-byte hardware sectors (1074 MB)
[ 1134.035926] sd 7:0:0:0: [sdb] Write Protect is off
[ 1134.035932] sd 7:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 1134.035961] sd 7:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 1134.036039] sd 7:0:0:0: [sdb] 262144 4096-byte hardware sectors (1074 MB)
[ 1134.036061] sd 7:0:0:0: [sdb] Write Protect is off
[ 1134.036066] sd 7:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[ 1134.036095] sd 7:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 1134.036119] sdb: unknown partition table
[ 1134.036276] sd 7:0:0:0: [sdb] Attached SCSI disk
[ 1134.036463] sd 7...
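
The memory requirement mentioned above follows from the module parameters. A small worked sketch, assuming that capacity is counted in sectors (as the dmesg output suggests) and that each port gets its own full-size buffer (an assumption, not confirmed by the driver source); the function name ram_disk_bytes is hypothetical:

```c
#include <stdint.h>

/*
 * Bytes of RAM needed to back the RAM disks, assuming one full-size
 * buffer per port (an assumption about ata_ram, not taken from its code).
 */
uint64_t ram_disk_bytes(uint64_t sectors, uint32_t sector_size, uint32_t ports)
{
    return sectors * sector_size * ports;
}
```

With sector_size=4096 and capacity=262144, one port needs 1073741824 bytes, which is the 1074 MB that sd reports, and nr_ports=2 doubles that to 2 GiB. That is consistent with the advice to have at least 2.5GB of RAM.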

To: Matt Domsch <Matt_Domsch@...>
Cc: Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Tuesday, August 5, 2008 - 12:57 pm

I forgot to mention ... I didn't add support for 520-byte (or 4104-byte
or 4160-byte) sectors. Martin helpfully pointed me to
http://www.t13.org/Documents/UploadedDocuments/docs2008/e07162r2-Externa...
but it seems like T13 haven't allocated some words for this yet. If
anyone wants to work on this, please contact me; I have some thoughts.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matt Domsch <Matt_Domsch@...>
Cc: Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Wednesday, July 30, 2008 - 1:16 pm

Do they expose that sector size?
I.e., does ioctl(fd,BLKSSZGET,&ss) set ss to 4096?
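
A minimal sketch of that check as a standalone helper. BLKSSZGET and its int argument come from linux/fs.h; the function name logical_sector_size and the choice to return -1 on any failure are illustrative assumptions.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* BLKSSZGET */

/*
 * Return the logical sector size reported for the block device at
 * path, or -1 if the device cannot be opened or the ioctl fails.
 */
int logical_sector_size(const char *path)
{
    int ss = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, BLKSSZGET, &ss) < 0)
        ss = -1;
    close(fd);
    return ss;
}
```

Calling logical_sector_size("/dev/sdb") on one of the disks in question would answer the question directly, given read permission on the device node.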

I'm interested because I'm preparing GNU Parted's partition table
manipulation code (not its FS code) for just that.
In particular, now I've heard two stories:

- disk makers will eventually sell drives with >512-byte sectors

- some disk makers have sort of agreed not to do that, and
expect forever to hide the larger underlying sector size
behind a virtual 512 (of course, this imposes alignment
restrictions, but that's a smaller problem)

Even if the latter is the case, we still have to deal with

Speaking of O_DIRECT, both dd and shred (both in coreutils), use
O_DIRECT, so you could get _some_ coverage just by running shred
and experimenting with dd's oflag=direct and iflag=direct options.
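
O_DIRECT is a useful probe here because, on Linux, direct I/O generally requires the buffer address, file offset and transfer length to be multiples of the device's logical sector size. A tool that hard-codes 512 can therefore issue requests that a 4096-byte-sector device rejects. The check can be sketched as follows; direct_io_aligned is a hypothetical helper, not part of coreutils.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if a direct I/O request satisfies the usual alignment rule:
 * buffer address, offset and length all multiples of sector_size.
 * Return 0 otherwise, including for a zero sector_size.
 */
int direct_io_aligned(const void *buf, uint64_t offset, size_t len,
                      uint32_t sector_size)
{
    if (sector_size == 0)
        return 0;
    return (uintptr_t)buf % sector_size == 0
        && offset % sector_size == 0
        && len % sector_size == 0;
}
```

A 512-byte write at offset 512 passes with sector_size 512 but fails with sector_size 4096, which is the kind of breakage such testing would be expected to expose.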
--

To: Jim Meyering <jim@...>
Cc: Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Wednesday, July 30, 2008 - 1:29 pm

yes, this is happening also.

There will be 3 types of disks eventually:
1) those that report a 512-byte sector size, and are really a 512-byte
size. This is nearly all disks today.

2) those that report a 512-byte sector size, but are really a
4096-byte size, and the drive does the conversions and
read/modify/write. T10 and T13 are looking to add commands to
expose this different underlying physical sector size so the OS
could be aware of it. This is primarily being driven to mitigate
any problems that may happen with "legacy" OSs that are not aware
of the difference.

3) those that report a 4096-byte sector size, and are really a
4096-byte size. This seems ideal for aware OSs.

Which of 2) or 3) hits the market en masse remains to be seen. I want
Linux to be able to handle either painlessly.
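
The three cases above can be sketched as a small classifier. It assumes that both the reported (logical) and underlying (physical) sector sizes are available to the caller, which depends on the T10/T13 additions mentioned in case 2. The names disk_kind and classify_disk are hypothetical.

```c
/* The three cases from the list above, plus a catch-all. */
enum disk_kind {
    DISK_OTHER = 0,
    DISK_NATIVE_512 = 1,    /* case 1: reports 512, really 512 */
    DISK_EMULATED_512 = 2,  /* case 2: reports 512, really 4096 */
    DISK_NATIVE_4K = 3      /* case 3: reports 4096, really 4096 */
};

/* Classify a disk from its reported and underlying sector sizes in bytes. */
enum disk_kind classify_disk(unsigned int logical, unsigned int physical)
{
    if (logical == 512 && physical == 512)
        return DISK_NATIVE_512;
    if (logical == 512 && physical == 4096)
        return DISK_EMULATED_512;
    if (logical == 4096 && physical == 4096)
        return DISK_NATIVE_4K;
    return DISK_OTHER;
}
```

Without the new commands, cases 1 and 2 look identical to the OS, since both report 512.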

--
Matt Domsch
Linux Technology Strategist, Dell Office of the CTO
linux.dell.com & www.dell.com/linux
--

To: Matt Domsch <Matt_Domsch@...>
Cc: Jim Meyering <jim@...>, Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Saturday, August 9, 2008 - 9:21 am

How is this going to work with journaling? This has the nasty property
that if you are writing to sector n during a power failure, the disk may also
kill sectors n-3, n-2 and n-1... and that's bad, right?
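
To make the concern concrete, here is a sketch that computes which logical sectors share a physical sector with sector n, and so could be caught up in an interrupted read/modify/write. It assumes physical sectors start at LBA 0 and that the physical size is a whole multiple of the logical size. The names lba_range and rmw_range are hypothetical.

```c
#include <stdint.h>

/* Inclusive range of logical sectors sharing one physical sector. */
struct lba_range {
    uint64_t first;
    uint64_t last;
};

/*
 * Return the logical sectors that live in the same physical sector as
 * lba. If the sizes make no sense, return just the sector itself.
 */
struct lba_range rmw_range(uint64_t lba, uint32_t logical, uint32_t physical)
{
    struct lba_range r = { lba, lba };
    uint64_t per_phys;

    if (logical == 0 || physical < logical)
        return r;

    per_phys = physical / logical;      /* e.g. 4096 / 512 = 8 */
    r.first = lba - (lba % per_phys);   /* round down to the boundary */
    r.last = r.first + per_phys - 1;
    return r;
}
```

With 512-byte logical and 4096-byte physical sectors, a write to sector 11 touches the physical sector holding sectors 8 through 15, so under these assumptions up to seven neighbours are exposed rather than three.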

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Matt Domsch <Matt_Domsch@...>
Cc: Jim Meyering <jim@...>, Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Wednesday, July 30, 2008 - 2:13 pm

As usual, the biggest problem will be "legacy" userspace. For
example, most partition tools are still generating legacy partition
tables that look like this:

Disk /dev/sda: 255 heads, 63 sectors, 38913 cylinders

Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
1 80 1 1 0 254 63 121 63 1959867 83
2 00 0 1 122 254 63 619 1959930 8000370 82
3 00 0 1 620 254 63 1023 9960300 615177045 05
4 00 0 0 0 0 0 0 0 0 00
5 00 1 1 620 254 63 1023 63 615176982 8e

Note the starting sector# for the first partition.....
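
The problem with a start of 63 can be checked with a little arithmetic, assuming 512-byte logical sectors on a disk with 4096-byte physical sectors. The helper name start_misalignment is hypothetical.

```c
#include <stdint.h>

/*
 * Return how many bytes past a physical sector boundary a partition
 * starts. Zero means the partition is aligned.
 */
uint32_t start_misalignment(uint64_t start_lba, uint32_t logical,
                            uint32_t physical)
{
    if (physical == 0)
        return 0;
    return (uint32_t)((start_lba * logical) % physical);
}
```

Sector 63 is byte offset 32256, which is 3584 bytes past a 4096-byte boundary, so every 4096-byte filesystem block in that partition would straddle two physical sectors. A start of 64 or 2048 gives zero.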

- Ted
--

To: Theodore Tso <tytso@...>, Matt Domsch <Matt_Domsch@...>, Jim Meyering <jim@...>, Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Wednesday, July 30, 2008 - 2:28 pm

If I remember correctly, the MS Vista new alignment for data partitions
is on a 0 offset, 1MB aligned boundary. The support for 4096 byte
sectors is only for data partitions (not boot).

Array vendors, who consume a fair number of drives, are most likely more
friendly to native 4k drives. The big fear from disk vendors is getting
a wave of returns from Best Buy, etc when people go and plug in a new,
native 4k drive into an old box....

ric

--

To: Ric Wheeler <rwheeler@...>
Cc: Matt Domsch <Matt_Domsch@...>, Jim Meyering <jim@...>, Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, <linux-scsi@...>, <linux-ide@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Wednesday, July 30, 2008 - 2:45 pm

Or a new box running XP, either via the Dell "upgrade to XP" program,
or from a corporate I/T load[1]. :-)

[1] http://www.theinquirer.net/gb/inquirer/news/2008/06/23/intel-dumps-vista

More to the point for Linux, are *our* partition table programs (i.e.,
fdisk, cfdisk, et al.) fixed with better defaults in upstream, and
what are the upcoming enterprise distributions going to ship with?
Since that's what a large number of Linux customers will end up using
for the next 3-5 years....

- Ted
--

To: Matt Domsch <Matt_Domsch@...>
Cc: Jim Meyering <jim@...>, Martin K. Petersen <martin.petersen@...>, James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>
Date: Wednesday, July 30, 2008 - 1:24 pm

I am expecting 3 to turn up some _minor_ problem cases. Many older ATA
controllers magically know the sector size of media and the internal
state machines and FIFO they use for performance are potentially going to
go gaga in this case when we do a PIO transfer.

Alan
--

To: Martin K. Petersen <martin.petersen@...>
Cc: James Bottomley <James.Bottomley@...>, Matthew Wilcox <matthew@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 2:54 pm

Isn't this a great use case for a SCSI target device where our target
can be a software disk on a remote host? What is missing for us to put
something like that together?

ric

--

To: <rwheeler@...>
Cc: Martin K. Petersen <martin.petersen@...>, Matthew Wilcox <matthew@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>, FUJITA Tomonori <fujita.tomonori@...>
Date: Tuesday, July 29, 2008 - 2:56 pm

Technically nothing. Tomo should already have one for the STGT test
infrastructure (I've cc'd him).

James

--

To: <James.Bottomley@...>
Cc: <rwheeler@...>, <martin.petersen@...>, <matthew@...>, <linux-scsi@...>, <linux-ide@...>, <jim@...>, <linux-kernel@...>, <jeff@...>, <Matt_Domsch@...>, <fujita.tomonori@...>
Date: Tuesday, July 29, 2008 - 7:41 pm

On Tue, 29 Jul 2008 13:56:14 -0500

Yeah, stgt also enables you to use a software media changer and a
software DVD drive (and we are working on VTL).

http://stgt.berlios.de/

You can connect to a remote host with iSCSI. FCoE might work since
Mike Christie has used stgt to work on the FCoE initiator driver.

stgt doesn't support non-512 byte sector sizes now but I'll add the
support shortly. I want to try DIF with iSCSI and FCoE.
--

To: James Bottomley <James.Bottomley@...>
Cc: Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Martin Petersen <mkp@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 2:42 pm

Ummm... _reboot_, or _module unload/reload_? I could certainly include
an option to populate the ramdisc from a file. Is the ioctl to re-read
the partition table not enough?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Martin Petersen <mkp@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 2:44 pm

reboot ... we'd like to take the tools through shutdown restart testing
to make sure they're all working ... of course, then there's the
bios ...

James

--

To: James Bottomley <James.Bottomley@...>
Cc: Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Martin Petersen <mkp@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 2:50 pm

It's not up to us to fix the BIOS.

Since the vast majority of users use a distro, and the vast majority of
distros use a fully modular kernel, wouldn't initialising the contents
of ata-ram from the initrd/initramfs solve the problem?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--

To: Matthew Wilcox <matthew@...>
Cc: Ric Wheeler <rwheeler@...>, <linux-scsi@...>, <linux-ide@...>, Jim Meyering <jim@...>, <linux-kernel@...>, Martin Petersen <mkp@...>, Jeff Garzik <jeff@...>, Matt Domsch <Matt_Domsch@...>
Date: Tuesday, July 29, 2008 - 3:00 pm

Well ... we'd really like it file backed to truly verify ... sort of
like scsi_debug on a loopback. But saving on shutdown and
reinitialising from the saved image on boot would likely be perfect.

James

--
