Re: ATA 4 KiB sector issues.

From: Tejun Heo
Date: Sunday, March 7, 2010 - 8:48 pm

Hello, guys.

It looks like transition to ATA 4k drives will be quite painful and we
aren't really ready although these drives are already selling widely.
I've written up a summary document on the issue to clarify stuff as
it's getting more and more confusing and develop some consensus.  It's
also on the linux ata wiki.

  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

I've cc'd people whom I can think of off the top of my head but I
surely have missed some people who would have been interested.  Please
feel free to add cc's or forward the message to other MLs.
Especially, I don't know much about partitioners so the details there
are pretty shallow and could be plain wrong.  It would be great if
someone who knows more about this stuff can chime in.

Thanks.

=== Document follows ===

ATA 4 KiB sector issues

Background
==========

Up until recently, all ATA hard drives have been organized in 512 byte
sectors.  For example, my 500 GB or 466 GiB hard drive is organized into
976773168 512 byte sectors numbered from 0 to 976773167.  This is how
a drive communicates with the driver.  When the operating system wants
to read 32 KiB of data at 1 MiB position, the driver asks the drive to
read 64 sectors from LBA (Logical block address, sector number) 2048.
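
The byte-to-LBA translation described above can be sketched in a few
lines of Python.  This is only an illustration of the arithmetic; the
function name byte_range_to_lba is made up for this example and is not
taken from any driver.

```python
SECTOR_SIZE = 512  # bytes per logical sector on a traditional ATA drive
KiB = 1024
MiB = 1024 * KiB


def byte_range_to_lba(offset_bytes, length_bytes, sector_size=SECTOR_SIZE):
    """Return (first_lba, sector_count) covering the given byte range."""
    first_lba = offset_bytes // sector_size
    last_lba = (offset_bytes + length_bytes - 1) // sector_size
    return first_lba, last_lba - first_lba + 1


# Read 32 KiB of data at the 1 MiB position, as in the example above.
lba, count = byte_range_to_lba(1 * MiB, 32 * KiB)
print(lba, count)  # 2048 64

# A drive with 976773168 sectors is numbered from 0 to 976773167.
total_sectors = 976773168
print(total_sectors - 1)  # 976773167
```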

Because each sector should be addressable, readable and writable
individually, the physical medium also is organized in the same sized
sectors.  In addition to the area to store the actual data, each
sector requires extra space for book keeping - inter-sector space to
enable locating and addressing each sector and ECC data to detect and
correct inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger
ECC becomes necessary to guarantee acceptable level of data integrity
increasing the space overhead.  In addition, in most applications,
hard drives are now accessed in units of at least 8 sectors or 4096
bytes and maintaining 512 byte granularity has become ...
From: Greg Freemyer
Date: Sunday, March 7, 2010 - 10:38 pm

cc'ing Martin Petersen since I believe he is one of the most
knowledgeable kernel hackers on this topic and has been working the
issue for the last year.




-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--

From: James Bottomley
Date: Monday, March 8, 2010 - 12:00 am

Just a quick note:

The 2TB size for msdos partitions is a problem independent of the 4k
sector issue.  Traditional 512 byte sector drives are now available in
those sizes.  It looks like we're going to have to move to a new
partitioning label to solve this.

There's actually another barrier at 8 or 16TB, which is where a 4k
logical sector filesystem tops out using 32 bit block offsets (it's 8TB
if the fs hasn't been proof checked against sign extension problems).
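
For reference, the 2TB and 8/16TB barriers fall straight out of the
address width times the sector size.  A minimal sketch of that
arithmetic (the helper name is hypothetical, and the 31-bit case simply
models a sign-extension bug that loses the top bit):

```python
TiB = 1 << 40


def max_addressable_bytes(sector_size, address_bits):
    """Largest capacity reachable with address_bits-wide sector numbers."""
    return (1 << address_bits) * sector_size


# msdos partition table: 32-bit sector numbers, 512-byte sectors
print(max_addressable_bytes(512, 32) // TiB)   # 2

# 4k blocks with 32-bit block offsets
print(max_addressable_bytes(4096, 32) // TiB)  # 16

# the same, if only 31 bits are usable because of sign extension
print(max_addressable_bytes(4096, 31) // TiB)  # 8
```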

However, for 4k sectors, the main issues which have shown up in testing
by others (mostly Martin) are

     1. In native 4k mode, we work perfectly fine.  *however*, most
        BIOSs can't boot native 4k drives.
     2. Even if the BIOS can boot native 4k, our own boot loaders seem
        to be hard coded for 512 byte sectors in several places.
     3. If we run in the 512 byte sector emulation mode, we end up with
        the partition alignment problems you allude to.
     4. The alignment problem is made more complex by drives that make
        use of the offset exponent feature (what you refer to as offset
        by one) ... fortunately very few of these have been seen in the
        wild and we're hopeful they can be shot before they breed.
     5. I'm really, really sorry to have to mention it, but it looks
        like uefi is going to be the only way we can boot non-msdos
        partitioned devices with native 4k sectors.

so the bottom line seems to be that if you want the device as a non boot
disk, use native 4k sectors and a non-msdos partition label.  If you
want to boot from the drive and your bios won't boot 4k natively,
partition everything using the 512 emulation and try to align the
partitions correctly.  If your bios/uefi will boot 4k natively, just use
it and whatever partition label the bios/uefi supports.

Martin can fill in the pieces I've left out.

James


--

From: H. Peter Anvin
Date: Monday, March 8, 2010 - 12:53 am

I would very much like a reference for a platform which has firmware 
which can successfully boot from 4K-logical media.  It would be very 
useful for bootloader testing.

Aligning partitions is something we should have done long ago.  It 
affects RAID and many flash drives just as much or more than 4K-sectored 
disks.

Legacy BIOS doesn't care at all how the disk is partitioned, so as long 
as the BIOS can read the disk at all the rest is up to the bootloader. 
Of course, since there hasn't been the opportunity to test, bootloaders 
generally don't handle it correctly (early versions of Syslinux 
supported any sector size, but that bitrotted, and for the lack of 
testing I eventually ended up hard-coding the number.  Now I'd like to 
get it working properly.)

As far as partitioning... I believe we should be using GPT partition 
tables where possible.  Even on non-EFI systems, it's simply a much 
better partition table format.

	-hpa
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:34 am

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> I would very much like a reference for a platform which has
hpa> firmware which can successfully boot from 4K-logical media.  It
hpa> would be very useful for bootloader testing.

I have yet to find one.


hpa> Aligning partitions is something we should have done long ago.  It
hpa> affects RAID and many flash drives just as much or more than
hpa> 4K-sectored disks.

Yup.


hpa> As far as partitioning... I believe we should be using GPT
hpa> partition tables where possible.  Even on non-EFI systems, it's
hpa> simply a much better partition table format.

Agreed.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Daniel Taylor
Date: Tuesday, March 9, 2010 - 3:36 pm

hpa> I would very much like a reference for a platform which has 
hpa> firmware which can successfully boot from 4K-logical media.  It 
hpa> would be very useful for bootloader testing.


I am told that the Mac UEFI platform will boot from 4K logical/physical
drives.

Now I have to scrounge one of the old drives to test it.
--

From: Greg Freemyer
Date: Tuesday, March 9, 2010 - 3:46 pm

GPT can not be used for boot disks in non-EFI systems, right?

Greg
--

From: Tejun Heo
Date: Tuesday, March 9, 2010 - 5:05 pm

Hello,


IIUC, I think any BIOS should be able to do so as it only cares about
the code part of MBR not the partitions and even with GPT the MBR
remains the same with the partition part describing the rest of the
whole disk as a single chunk containing GPT managed area.  The only
problem is the older operating systems (like XP) which don't
understand GPT wouldn't be able to access those partitions.

Thanks.

-- 
tejun
--

From: Daniel Taylor
Date: Tuesday, March 9, 2010 - 5:14 pm

The MBR in a GPT installation doesn't map the first GPT partition, it maps
the entire drive after the first sector, as well as marking it type 0xEE.
The start LBA of the file system is not correctly located in the MBR.

I will run some experiments to see if any of the systems on my desk can boot
Linux from a GPT.
--

From: Tejun Heo
Date: Tuesday, March 9, 2010 - 5:26 pm

Hello,


Yeah, yeah, that was exactly what I was saying by "describing the rest
of the whole disk as a single chunk containing GPT managed area" with

Sure it's not but MBR belongs to the boot loader not the BIOS.  BIOS
just needs to load MBR and hands control to it.  If the MBR or more
likely later stages of the bootloader loaded by MBR knows how to boot

I'm not sure about grub although I strongly suspect recent version of
it should work but AFAICS lilo should definitely work as it doesn't
care how the disk is logically organized at all.

Thanks.

-- 
tejun
--

From: H. Peter Anvin
Date: Tuesday, March 9, 2010 - 5:36 pm

In the case of Syslinux, you have to install gptmbr.bin, but otherwise 
it works unmodified (Syslinux itself doesn't care about the partition 
table at all.)

Note: the official standard for GPT booting on BIOS is still evolving, 
so I might change gptmbr to match the new standard.

	-hpa
--

From: H. Peter Anvin
Date: Tuesday, March 9, 2010 - 10:17 pm

There is something called a "hybrid MBR", which is basically a GPT disk 
with a single partition (the current bootable partition) mapped as an 
MBR partition, instead of marking the whole disk 0xEE.

	-hpa
--

From: Gabor Gombas
Date: Wednesday, March 10, 2010 - 12:09 am

My desktop with a BIOS from 2005 has no problems with GPT. AFAIK a
recent Debian installer automatically chooses GPT if the disk is 2 TB or
larger.

Gabor
--

From: H. Peter Anvin
Date: Tuesday, March 9, 2010 - 5:32 pm

It can.  The BIOS doesn't care about the partition table at all -- all 
it does is load the MBR.

	-hpa
--

From: Johannes Stezenbach
Date: Wednesday, March 10, 2010 - 3:46 am

A little story for your entertainment pleasure:

I have a Gigabyte GA-MA78GM-S2H board, and during install
turned off the power after partitioning but before formatting
any partition because I got distracted by something else.

Result: System could not boot anymore, BIOS hung before
I could get to the "select boot device" screen. This also
happened when I removed the hdd from the boot device
list in BIOS. The last BIOS message was "Verifying DMI Pool Data"
and you can find numerous similar reports by searching for
'gigabyte bios hang "Verifying DMI Pool Data"'.

In my case it worked to switch the SATA mode from AHCI to
something else, then wipe the partition table and switch
back to AHCI.  But I read on the net that some people had
to format the drive in another PC, or hotplug it after the BIOS
got past "Verifying DMI Pool Data".


Johannes
--

From: H. Peter Anvin
Date: Wednesday, March 10, 2010 - 4:22 am

Well, yes, there are buggy BIOSes of a gazillion varieties.  A fair 
number of them read the partition table to try to guess what C/H/S 
geometry the user intended.  However, the GPT spec specifically uses a 
"Protective MBR" to guard against this and other issues like it; it 
makes the entire disk look to MBR-reading software like a single fully 
partitioned disk with one large partition on it.
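
To make the layout concrete, here is a small Python sketch that builds
and parses such a protective MBR.  It assumes the usual MBR layout
(four 16-byte entries at offset 446, 0x55 0xAA signature at offset 510)
and the protective entry as I understand it from the UEFI spec: type
0xEE, start LBA 1, size capped at 0xFFFFFFFF.  The function names are
made up for illustration.

```python
import struct

SECTOR = 512
PART_TABLE_OFFSET = 446
ENTRY_FORMAT = "<B3sB3sII"  # status, CHS first, type, CHS last, LBA, count
GPT_PROTECTIVE_TYPE = 0xEE


def build_protective_mbr(total_sectors):
    """Return a 512-byte MBR with one 0xEE entry covering the disk."""
    size = min(total_sectors - 1, 0xFFFFFFFF)
    entry = struct.pack(ENTRY_FORMAT, 0x00, b"\x00\x02\x00",
                        GPT_PROTECTIVE_TYPE, b"\xff\xff\xff", 1, size)
    mbr = bytearray(SECTOR)
    mbr[PART_TABLE_OFFSET:PART_TABLE_OFFSET + 16] = entry
    mbr[510:512] = b"\x55\xaa"
    return bytes(mbr)


def parse_partition_entries(mbr):
    """Return the non-empty entries of an MBR partition table."""
    entries = []
    for i in range(4):
        fields = struct.unpack_from(ENTRY_FORMAT, mbr,
                                    PART_TABLE_OFFSET + 16 * i)
        _status, _chs_first, ptype, _chs_last, start, count = fields
        if ptype != 0:
            entries.append({"type": ptype, "start_lba": start,
                            "sectors": count})
    return entries


mbr = build_protective_mbr(976773168)
print(parse_partition_entries(mbr))
```

MBR-reading software sees a single partition starting right after the
MBR and spanning the rest of the disk, which is the "fully partitioned"
appearance described above.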

	-hpa
--

From: H. Peter Anvin
Date: Monday, March 8, 2010 - 12:56 am

The limit for the MS-DOS partition tables is 2^32 sectors.  The patch 
that Daniel posted was for a Linux kernel internal limit that set the 
limit to 2 TB.

	-hpa
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:33 am

>>>>> "James" == James Bottomley <James.Bottomley@suse.de> writes:

James> However, for 4k sectors, the main issues which have shown up in
James> testing by others (mostly Martin) are

James>      1. In native 4k mode, we work perfectly fine.  *however*,
James>         most BIOSs can't boot native 4k drives.

Correct.  I have engaged with pretty much all the big OEMs in the
industry and so far the interest has been near zero.


James>      4. The alignment problem is made more complex by drives that
James>         make use of the offset exponent feature (what you refer
James>         to as offset by one) ... fortunately very few of these
James>         have been seen in the wild and we're hopeful they can be
James>         shot before they breed.

This topic is constantly up for debate in IDEMA.  However, it looks like
we might win because of the impending demise of XP.


James> so the bottom line seems to be that if you want the device as a
James> non boot disk, use native 4k sectors and a non-msdos partition
James> label.  If you want to boot from the drive and your bios won't
James> boot 4k natively, partition everything using the 512 emulation
James> and try to align the partitions correctly.  If your bios/uefi
James> will boot 4k natively, just use it and whatever partition label
James> the bios/uefi supports.

James> Martin can fill in the pieces I've left out.

Here's my latest take given what I hear on the grapevine:

1. 512-byte logical block size drives will be around forever for legacy
   deployments because nobody is willing to do the required BIOS int13
   work.  It's not just a BIOS thing, this requires heavy changes to HBA
   boot ROMs as well.

2. Some vendors are working on EFI firmware and will support booting off
   of 4KB LBS drives there.  This is mostly aimed at the server space.

3. 4 KB logical block size drives will mainly be targeted for use inside
   arrays.  Off the shelf enterprise drive models will most likely
   continue to ship with a ...
From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:38 am

>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:

Martin> There are 4 KB LBS SSDs out there but in general the industry is
Martin> sticking to ATA for local boot.

Thus implying that ATA doesn't support 4 KB LBS, just that people stick
to the tried-and-true 512.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:41 am

Martin> There are 4 KB LBS SSDs out there but in general the industry is
Martin> sticking to ATA for local boot.

Martin> Thus implying that ATA doesn't support 4 KB LBS, just that
Martin> people stick to the tried-and-true 512.

*sigh* I haven't had my breakfast tea yet...

What I meant to say was that I know ATA supports 4 KB LBS and that
nobody appears to care about it.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: H. Peter Anvin
Date: Monday, March 8, 2010 - 11:50 am

Well, apparently Western Digital are looking at it for USB drives due to
XP compatibility requirements -- those presumably are ATA internally and
use a USB-ATA bridge.

On the flipside, though, there really is very little net benefit to 4K
as opposed to 512 byte logical sectors: the additional protocol overhead
is relatively minimal, and as long as writes are aligned full blocks,
there shouldn't be any additional overhead on either the OS or the drive
side.  On the plus side, you get full compatibility with the existing
software stack.  The equation really seems rather simple.

	-hpa
--

From: James Bottomley
Date: Monday, March 8, 2010 - 11:58 am

There's another problem that afflicts 4k drives emulating 512b: they
have to do a read modify write for any isolated 512b write ... that
leads to potential corruption of adjacent 512b blocks if power is lost
at the moment the write is being done.  Since most Linux filesystems are
4k sectors, misalignment really hammers this, plus most journal writes
seem to be done in 512 byte increments.  I suppose for USB this could be
regarded as flakey as usual, though.
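
A rough model of the read-modify-write problem, as a Python sketch.  It
assumes 512 byte logical and 4k physical sectors with no alignment
offset, and the helper names are invented for this example.  It lists
the logical sectors that the drive has to rewrite even though they were
not part of the request, which are the ones at risk if power is lost
mid-write.

```python
LOGICAL = 512
PHYSICAL = 4096
PER_PHYSICAL = PHYSICAL // LOGICAL  # 8 logical sectors per physical one


def needs_read_modify_write(lba, count):
    """True if the write does not cover whole physical sectors."""
    start = lba * LOGICAL
    length = count * LOGICAL
    return start % PHYSICAL != 0 or length % PHYSICAL != 0


def collateral_lbas(lba, count):
    """Logical sectors rewritten by the drive but not part of the I/O."""
    first_phys = lba // PER_PHYSICAL
    last_phys = (lba + count - 1) // PER_PHYSICAL
    covered = range(first_phys * PER_PHYSICAL,
                    (last_phys + 1) * PER_PHYSICAL)
    requested = range(lba, lba + count)
    return [n for n in covered if n not in requested]


# Aligned 4k write: no read-modify-write, no bystanders.
print(needs_read_modify_write(8, 8), collateral_lbas(8, 8))

# Isolated 512b write: seven neighbouring sectors get rewritten.
print(needs_read_modify_write(9, 1), collateral_lbas(9, 1))

# 4k write at a misaligned LBA: two physical sectors are involved.
print(needs_read_modify_write(63, 8), collateral_lbas(63, 8))
```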

James


--

From: H. Peter Anvin
Date: Monday, March 8, 2010 - 12:11 pm

Misalignment sucks in general.  This is nothing new - the RAID and flash
people have had these problems for a long time now.  It's clear we need
to align our filesystems, period.

As to the read-modify-write issue: to some degree there is very little
you can do about it other than a big enough capacitor.  If you can't
write a sector atomically and have it stick, you're screwed no matter what.

	-hpa
--

From: Cláudio Martins
Date: Monday, March 8, 2010 - 1:02 pm

Most users assume that a single 512B sector write is atomic as far as
power failure is concerned. Hasn't this requirement been carried over
to the new 4k physical sector?

 It seems reasonable that if a 512B sector write is atomic in the older
drives, a 4k sector write would also be atomic on the newer drives,
since the time required to write it is negligible when compared to
capacitor voltage decay and inertia of the disk platters.

 Anyway, I suppose most of the energy/time required for a sector write
operation, is being expended on head assembly positioning and the wait
for the correct sector passing under the write head. That is, the write
operation itself takes so little time that it should make no difference
whether you write 512B or 4k.

 So the question is: what are hard drive makers guaranteeing (if
anything at all)? Was a 512B sector write really atomic? Is a 4k one?
Or was it completely manufacturer-dependent to start?

 Regards

Cláudio

--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 2:07 pm

>>>>> "Cláudio" == Cláudio Martins <ctpm@ist.utl.pt> writes:

Cláudio> So the question is: what are hard drive makers guaranteeing (if
Cláudio> anything at all)?

No guarantees.  Nothing that you can get in writing, anyway.


Cláudio> Was a 512B sector write really atomic?

Sometimes.


Cláudio> Is a 4k one?  

Sometimes, maybe.

The problem with 4KB physical blocks is that if you do a partial or
misaligned write you'll end up having to do read-modify-write.  And that
introduces a scenario where a subsequent write error will affect
logical blocks that were not part of the I/O request.

However, you also have that with regular drives because they often write
more than the actual block undergoing I/O.  For instance to reduce
hotspot bleed to adjacent sectors.

There have been several unsuccessful attempts at nudging the drive
vendors into giving us real guarantees (supercapacitors, NVRAM or
flash-backed write cache).  No luck so far.  So people that care use
arrays with non-volatile caches.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 1:19 pm

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

hpa> On the flipside, though, there really is very little net benefit to
hpa> 4K as opposed to 512 byte logical sectors: the additional protocol
hpa> overhead is relatively minimal, and as long as writes are aligned
hpa> full blocks, there shouldn't be any additional overhead on either
hpa> the OS or the drive side.  On the plus side, you get full
hpa> compatibility with the existing software stack.  The equation
hpa> really seems rather simple.

4KB sectors are not a win for anybody except the drive vendors.

There is a push in the industry right now to keep the 512-byte logical
blocks forever.  The first step would be to report misaligned accesses
or accesses that are not a multiple of the physical block size.  Second
step would be to eventually reject any write that's not a properly
aligned multiple of the physical block size.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: H. Peter Anvin
Date: Monday, March 8, 2010 - 2:16 pm

Obviously.  However, larger physical storage unit sizes -- 4K for
spinning media, but frequently much larger for flash, for example -- is
already in wide use, and having a huge mishmash of logical block sizes

I personally suspect that that is the way it is going to go, rather than
trying to change the software ecosystem to a different logical block
size.  It has been tried in the past and failed, with the sole exception
of CD-ROMs, pretty much.

	-hpa
--

From: Tejun Heo
Date: Tuesday, March 9, 2010 - 5:34 pm

Hello,


This should work right now as long as the bridge chip doesn't screw
up, which we can't do much about anyway.  USB is used as SCSI
transport and SCSI layer has been working fine with devices with

Yeap, for addressing, whether 9 bit is shifted or 12 doesn't really
matter.  That's only 8 times difference which may be breached in
probably under three years.  If the current 48 bit addressing limit is
reached, we would be far better off introducing 64 or 128 bit
addressing.  That was the reason why I thought that I would never see
an ATA disk w/ 4KiB logical sector and got pretty surprised that it
was being considered for XP compatibility where 3 year offset could be
pretty meaningful.
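
The addressing argument can be checked with a few lines of Python.
This is only a sketch: the one-year capacity doubling period is an
assumption chosen to reproduce the rough "three years" estimate above,
not a figure taken from anywhere.

```python
import math

LBA_BITS = 48
PiB = 1 << 50


def capacity_limit(sector_shift, lba_bits=LBA_BITS):
    """Bytes addressable with lba_bits-wide LBAs and 2**shift byte sectors."""
    return 1 << (lba_bits + sector_shift)


limit_512 = capacity_limit(9)   # 512 byte logical sectors
limit_4k = capacity_limit(12)   # 4 KiB logical sectors
ratio = limit_4k // limit_512

print(limit_512 // PiB, "PiB")  # 128 PiB
print(limit_4k // PiB, "PiB")   # 1024 PiB
print(ratio)                    # 8

# ASSUMPTION: drive capacities double roughly once a year.
ASSUMED_DOUBLING_YEARS = 1.0
print(math.log2(ratio) * ASSUMED_DOUBLING_YEARS)  # 3.0
```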

Thanks.

-- 
tejun
--

From: Matthew Wilcox
Date: Wednesday, March 10, 2010 - 12:53 am

I sent patches to add support ... they were ignored.

Part of the problem is that ATA is heinously broken wrt non-512 byte
sector sizes.  You have to know which commands work in multiples of
the block size, and which commands work in multiples of 512-bytes.
There's no easy way to figure it out; you need a table.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Jeff Garzik
Date: Wednesday, March 10, 2010 - 6:47 am

Not true, read the rest of the thread.

	Jeff



--

From: Damian Lukowski
Date: Wednesday, March 10, 2010 - 9:19 am

Hello,
I have practically no knowledge of Linux' block device drivers,
but is this really a partitioning issue? I think the problem is
with the filesystems when clustering multiple blocks without
knowledge of the sector alignment and sector size of the underlying
block device. Maybe it is a better solution to adapt the filesystem
buffer routine which reads/writes data from/to the block device?

Best regards
 Damian
--

From: Theodore Tso
Date: Thursday, March 11, 2010 - 6:04 am

No, it's really a partitioning issue.   If the paging subsystem wants a 4k block to fill a particular page, we need to read that 4k block into memory.  If we need to swap out that 4k block, we need to write that 4k block to swap space, or to the memory segment's backing store.   If the partition is misaligned by 512 bytes, this is simply not possible.   The file system has to do what is requested of it by its users, and the reality is that we need to do 4k aligned reads and writes with respect to the beginning of the partition, far more often than not.

Hence, if we want the best performance on 4k sector drives, the partition needs to be aligned with respect to what is most desirable for the device in question.
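
To illustrate, here is a small Python sketch (helper name made up,
512 byte logical and 4k physical sectors assumed, no alignment offset)
that maps a 4k block inside a partition onto the physical sectors it
lands on.  With the traditional start at sector 63 every block
straddles two physical sectors; with a 1 MiB aligned start each block
maps to exactly one.

```python
LOGICAL = 512
PHYSICAL = 4096


def physical_sectors_for_block(partition_start_lba, block_index,
                               block_size=4096):
    """Physical sector numbers touched by one partition-relative block."""
    start = partition_start_lba * LOGICAL + block_index * block_size
    end = start + block_size
    return list(range(start // PHYSICAL, (end - 1) // PHYSICAL + 1))


# Traditional DOS layout: partition starts at sector 63.
print(physical_sectors_for_block(63, 0))    # [7, 8]
print(physical_sectors_for_block(63, 1))    # [8, 9]

# 1 MiB aligned partition: starts at sector 2048.
print(physical_sectors_for_block(2048, 0))  # [256]
print(physical_sectors_for_block(2048, 1))  # [257]
```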

Best regards,

-- Ted

--

From: Nikanth Karthikesan
Date: Thursday, March 11, 2010 - 6:57 am

I guess, what he meant was, to keep filesystem blocks aligned, even if the 
partition is not. Say if the partition is mis-aligned by 512-bytes, let the 
filesystem waste 4k-512bytes and keep its blocks aligned. But it might be a 
case of over-engineering, possibly requiring disk format change.

Thanks
Nikanth
--

From: Theodore Tso
Date: Thursday, March 11, 2010 - 7:28 am

Ah, yes, I agree with you; that's probably what he meant.

Sure, that's theoretically possible, but it would mean changing every single filesystem, and it would require a file system format change --- or at least a file system format extension.

It would seem to be way easier to simply fix the partitioning tools to do the right thing, though.

-- Ted

--

From: James Bottomley
Date: Thursday, March 11, 2010 - 7:39 am

Actually, it's a layering violation.  The filesystem shouldn't need to
probe the device layout ... particularly when there are complexities
like is it logical 512 or physical, and if logical 512 on 4k does it
have an offset exponent or not.

We can transmit certain abstractions of information up the stack (like
stripe width for RAID arrays which should be the fs optimal write size),
but for this type of alignment, which can be completely solved at the
partition layer, the information should really stay there and the
filesystem should "just work".

James


--

From: Nikanth Karthikesan
Date: Thursday, March 11, 2010 - 8:05 am

Right. It would be layering violation and we have LVM to solve it already.

The real problem, here is just that partitioning-tools should create 
partitions that can work with both XP as well as Windows7. Maybe distro 
installers should ask the user which compatibility he needs.

Thanks
Nikanth
--

From: tytso
Date: Thursday, March 11, 2010 - 8:25 am

4k aligned sectors will *work* with Windows XP, will it not?  It's
just simply a matter of Windows XP, being really ancient, doesn't
create properly aligned partitions by default.   

And how often are we going to see Windows XP systems with these new 4k
physical sector drives anyway, where the first OS to touch the
partition is Windows XP?  And in the case where this does happen, the
resulting partition will result in terrible performance for Windows
XP as well as Linux.

What's the specific scenario which you are trying to solve, and how
likely is it to occur in real life?

					- Ted
--

From: Gene Heskett
Date: Thursday, March 11, 2010 - 9:26 am

And potentially one more question from a list lurker, Ted.  Where are the 
tools that allow us to check and/or adjust that?  I ask since I have 3 of 
these terabyte drives in this box now and have no clue how to either check 
to see if we're off, or how to fix it if it is.  And I have called my self 
following this discussion without noting if the tools have been specifically 
named.

Thanks

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Authors are easy to get on with -- if you're fond of children.
		-- Michael Joseph, "Observer"
--

From: Greg Freemyer
Date: Thursday, March 11, 2010 - 9:34 am

Ted,

Apparently the real issue is Win2K, not XP.

It seems to require the boot partition and possibly all partitions
start on a cylinder boundary.  And may have 255/63 hard-coded in to
define what a cylinder is.  I agree with the apparent consensus that a
2010 era linux partitioner does not need to be Win2K compatible.  If
someone wants to install Win2K they will need to either use an older
generation partitioner to create the partitions or use specific
command-line args to force a non-optimal alignment.
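
As a sketch of why cylinder alignment and 4k alignment rarely coincide,
here is some Python that assumes the 255 heads / 63 sectors geometry
mentioned above.  The helper names are hypothetical, and whether Win2K
really hard-codes this geometry is only what has been reported.

```python
from math import gcd

HEADS = 255
SECTORS_PER_TRACK = 63
SECTORS_PER_CYLINDER = HEADS * SECTORS_PER_TRACK  # 16065
LOGICAL = 512
PHYSICAL = 4096


def lba_to_chs(lba):
    """Convert an LBA to (cylinder, head, sector); sectors are 1-based."""
    cylinder, rest = divmod(lba, SECTORS_PER_CYLINDER)
    head, sector0 = divmod(rest, SECTORS_PER_TRACK)
    return cylinder, head, sector0 + 1


def on_cylinder_boundary(lba):
    return lba % SECTORS_PER_CYLINDER == 0


def on_4k_boundary(lba):
    return (lba * LOGICAL) % PHYSICAL == 0


print(lba_to_chs(63))     # (0, 1, 1)
print(lba_to_chs(16065))  # (1, 0, 1)
print(on_cylinder_boundary(16065), on_4k_boundary(16065))  # True False

# A cylinder holds an odd number of sectors, so both conditions are
# only met every lcm(16065, 8) sectors, i.e. every 8th cylinder.
per_4k = PHYSICAL // LOGICAL
both = SECTORS_PER_CYLINDER * per_4k // gcd(SECTORS_PER_CYLINDER, per_4k)
print(both, lba_to_chs(both))  # 128520 (8, 0, 1)
```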

I do think the linux partitioners should provide a way to force a
cylinder alignment.  Tejun, I would like to see your doc describe how
to force a win2k compatible partition layout.

fyi: The same issue apparently also exists for users still running OS/2.

Greg
--

From: Tejun Heo
Date: Thursday, March 11, 2010 - 6:09 pm

Hello,


I suppose I can play with fdisk and list it as an example but if
anyone knows better/proper way to force certain partitions to legacy
alignment while leaving others properly aligned, I'll be happy to
include it.

Thanks.

-- 
tejun
--

From: Mike Snitzer
Date: Thursday, March 11, 2010 - 7:48 am

Yes, the current supported approach is to rely on partitions (parted,
fdisk) or LVM to account for 'alignment_offset'.

This avoids having a filesystem add its own padding (format change).
But e2fsprogs at least warns if a device, that it is to format, has an
alignment_offset != 0.
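
For anyone who wants to look at what the kernel reports, the values are
exported through sysfs.  Below is a minimal Python sketch that reads
them.  It assumes the usual attribute layout under /sys/block
(queue/logical_block_size, queue/physical_block_size, alignment_offset
and the per-partition start and alignment_offset files); the function
names and the sysfs_root parameter are made up for this example.

```python
from pathlib import Path


def read_int(path):
    """Read a sysfs attribute that holds a single integer."""
    return int(Path(path).read_text().strip())


def disk_topology(disk, sysfs_root="/sys/block"):
    """Collect block sizes and alignment offsets for a disk."""
    base = Path(sysfs_root) / disk
    info = {
        "logical_block_size": read_int(base / "queue" / "logical_block_size"),
        "physical_block_size": read_int(base / "queue" / "physical_block_size"),
        "alignment_offset": read_int(base / "alignment_offset"),
        "partitions": {},
    }
    # Partition directories are named after the disk, e.g. sda1.
    for part in sorted(base.glob(disk + "*")):
        if not (part / "start").is_file():
            continue
        info["partitions"][part.name] = {
            "start": read_int(part / "start"),  # in 512 byte units
            "alignment_offset": read_int(part / "alignment_offset"),
        }
    return info


def misaligned_partitions(info):
    """Names of partitions whose alignment_offset is not zero."""
    return [name for name, part in info["partitions"].items()
            if part["alignment_offset"] != 0]
```

Calling disk_topology("sda") on a machine with such a disk and passing
the result to misaligned_partitions() gives the partitions that a tool
like e2fsprogs would warn about.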

Mike
--

From: Nikanth Karthikesan
Date: Thursday, March 11, 2010 - 8:00 am

Yes. May be, just a simple but transparent device-mapper like mapping on top 
of the mis-aligned partition, to do the alignment. Then the file-system code 
need not change much.

But Linux already has device-mapper and Linux will not be affected with mis-
aligned partitions, when we use LVM.

But the actual problem here is that partitioning tools might create partitions 
that won't allow other operating-systems to boot. So it might be enough, if the 
partitioning tools just create partitions with (mis-)alignment requirement for 
Windows.

Thanks
Nikanth
--

From: Tejun Heo
Date: Thursday, March 11, 2010 - 8:10 am

Hello,


Turns out XP is generally OK.  The reported problem was only on
specific configurations (some BIOS stuff).  Windows 2000 reportedly
would be hurt but I really think we don't have to care about that too
much.  So, it seems like we wouldn't have to worry too much about it
and just go ahead with new alignment schemes.  I'll update the doc
this weekend with new information from this now rather large thread.

Thanks.

-- 
tejun
--

From: Mike Snitzer
Date: Thursday, March 11, 2010 - 9:01 am

Well, device-mapper and LVM needed to be updated to make them "just

I'm not following...

Anyway, 4K drives that are 512b logical and 4K physical may or may not
also have "DOS partition compensation" that use LBA -1 as the first
naturally (4K) aligned start.  This means that the partition tools
need to shift the start of the first primary partition to be offset by
3584 bytes (7 512b sectors) for use with Linux.  But for windows,
AFAIK windows XP and windows 7 create all partitions aligned on 1MB
boundaries.  Linux's parted and fdisk create 1MB aligned partitions
now too.
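
The offset arithmetic can be sketched as follows.  This is a Python
illustration only, assuming 512b logical sectors and an alignment
offset given in bytes (3584 for the "LBA -1" drives described above);
the function names are invented for the example.

```python
LOGICAL = 512


def is_aligned(lba, alignment_offset=0, physical=4096):
    """True if the LBA sits on a physical sector boundary."""
    return (lba * LOGICAL - alignment_offset) % physical == 0


def next_aligned_lba(min_lba, alignment_offset=0, granularity=4096):
    """Smallest LBA >= min_lba that is aligned to granularity bytes."""
    pos = min_lba * LOGICAL
    remainder = (pos - alignment_offset) % granularity
    if remainder:
        pos += granularity - remainder
    return pos // LOGICAL


OFFSET_BY_ONE = 3584  # 7 512b sectors

# First naturally aligned sector on an offset-by-one drive.
print(next_aligned_lba(0, OFFSET_BY_ONE))             # 7

# The legacy start at sector 63 is aligned on such a drive ...
print(is_aligned(63, OFFSET_BY_ONE))                  # True
# ... but not on a drive without the offset.
print(is_aligned(63))                                 # False

# A 1MB aligned start has to be shifted by 7 sectors.
print(next_aligned_lba(2048, OFFSET_BY_ONE, 1 << 20)) # 2055
print(next_aligned_lba(2048, 0, 1 << 20))             # 2048
```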

So the only outlier is older versions of windows (< XP) and Linux (old
fdisk and parted, etc also use DOS partitioning) that don't use
naturally aligned (e.g. 1MB) partition boundaries.  In those versions
of Windows and Linux there are ways to change the default start of
sector 63.   That said, there is an opportunity to improve
documentation for how to workaround DOS partitioning on these
operating systems.

One other piece worth mentioning on this "IO Topology" support in the
entire Linux I/O Stack is the virt layers.  hch has already extended
the virt-io protocol and qemu is in the finishing stages of being
updated to properly consume the "IO Topology" information.  So we
really don't have any gaps in the Linux I/O stack.

mkp in particular, Jens, James, myself, and others implemented and
refined the SCSI and block changes.  kzak, jim meyering, hans de
goede, hch, eric sandeen, bob peterson, myself and others updated all
other I/O stack layers ranging from DM to LVM, libblkid, fdisk, parted
to anaconda to mkfs.ext[234], mkfs.xfs, mkfs.gfs2 to virt-io and qemu.
 FYI, all of these advances will be in Fedora 13 (quite a few are
already in Fedora 12).

There are obviously other Linux systems and userland tools (likely
Xen, other mkfs.* and more) that should be updated.  Hopefully
maintainers and/or contributors of these projects will follow-up to
address those that need updating.

Again please ...
From: Christoph Hellwig
Date: Thursday, March 11, 2010 - 11:26 am

I also have some older patches for btrfs that I need to get back out
to the list.  There was some talk of major changes to the organization
of the tools so I held it back for a while longer.

--

From: H. Peter Anvin
Date: Thursday, March 11, 2010 - 9:33 am

That's basically what you end up having to do for FAT filesystems to be 
aligned.

	-hpa
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:18 am

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> The [Windows Vista/7] partitioner seems to be using 1M as the
Tejun> basic alignment unit and offsetting from there if explicitly
Tejun> requested by the drive

Yep.


Tejun> Please note that hdparm is misreporting the alignment offset.  It
Tejun> should be reporting 512 instead of 256 for offset-by-one drives.

Already fixed.  Your hdparm must be old.



Tejun> Partitioners maybe should only align partitions which will be
Tejun> used by Linux and default to the traditional layout for others
Tejun> while allowing explicit override.

I don't think we take the partition type into account.  Karel?


Tejun> Reportedly, commonly used partitioners aren't ready to handle
Tejun> drives larger than 2 TiB in any configuration and alignment isn't
Tejun> done properly for drives with 4 KiB physical sectors.  4 KiB
Tejun> logical sector support is broken in both the kernel 

Huh, what?  My homedir is on a 4KiB LBS/PBS drive and has been for ~2
years.


Tejun> (need more details and probably a whole section on partitioner
Tejun> behaviors)

I'm Cc:'ing Karel Zak and Jim Meyering who have been doing all the
alignment work for fdisk and parted respectively.  Karel, Jim: The full
writeup is here:

	http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

It'd be great if you guys could share what you have been doing to the
tooling.


Tejun> Unfortunately, the transition to 4 KiB sector size, physical only
Tejun> or logical too, is looking fairly ugly.  Hopefully, a reasonable
Tejun> solution can be reached in not too distant future but even with
Tejun> all the software side updated, it looks like it's gonna cause
Tejun> significant amount of confusion and frustration.

With regards to XP compatibility I don't think we should go too much out
of our way to accommodate it.  XP has been disowned by its master and I
think virtualization will take care of the rest.

FWIW, recent fdisk has a command line flag that ...
From: H. Peter Anvin
Date: Monday, March 8, 2010 - 11:29 am

We should not take the partition type into account.  The other aspect is
that FAT partitions need to be formatted differently to maintain the
alignment once set; I have recently contributed patches (which were
accepted) into mkdosfs to do the right thing there.

Looking at the Windows XP article, it looks like it is limited to
certain BIOSes; unfortunately it doesn't say what the particular BIOS
issue is.  If we can find a system which actually exhibits the bug it

For > 2 TiB drives with 4 KiB logical sectors and MS-DOS partition

I think that's wildly optimistic, but I do observe there is a fix

Yes, unfortunately it is still on by default.

	-hpa
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 1:01 pm

hpa> For > 2 TiB drives with 4 KiB logical sectors and MS-DOS partition
hpa> tables, it is.


hpa> I think that's wildly optimistic, 

I don't expect XP to go away any time soon.  But I do think that the
number of fresh XP installs in combination with Linux will be fairly
limited.  And general lack of hardware enablement will eventually kill
off XP on raw metal.

I think it's ok that we have stop-gap solutions in place for
interoperability.  But I wouldn't want to waste all our resources on
designing for the past.  I'm much more interested in making sure that

hpa> Yes, unfortunately it is still on by default.

I agree that this is a don't-be-broken option and I would prefer it the
other way around (I know that's the plan for the next release.  I just
hope the distributions get things right).

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Mike Snitzer
Date: Monday, March 8, 2010 - 12:34 pm

On Mon, Mar 8, 2010 at 10:18 AM, Martin K. Petersen

I've been keeping track of all the pieces in play, have coordinated
with kzak and jim, and have a summary that offers some amount of macro
detail (at the end I touch on parted and fdisk):

http://people.redhat.com/msnitzer/docs/io-limits.txt
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 7:53 pm

Hello,


Ah... this is great.  I'll link the doc and shamelessly steal parts of
it if that's okay with you.

Thanks.

-- 
tejun
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:20 pm

Tejun> Ah... this is great.  I'll link the doc and shamelessly steal
Tejun> parts of it if that's okay with you.

There's also this one:

    http://oss.oracle.com/~mkp/docs/linux-advanced-storage.pdf

It is more aimed at storage vendors than end users, though.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Michael Tokarev
Date: Monday, March 8, 2010 - 11:53 pm

Mike Snitzer wrote:

What I don't see in this thread and in this document is - any mention
of linux md layer.  I think it is the first candidate to test the whole
thing, the easiest and most important one.  I mean the alignment and
"recommended I/O size" and all this similar stuff.

Think of a raid5 array - with all the mentioned good stuff in place
fdisk should figure out to align partitions on the array stripe
boundary, and should do that automatically.  And this should be
the easiest to debug/test, since the whole thing is controllable
by the kernel.

But apparently it does not implement anything of this sort.
Adding Neilb to the Cc list.......

Thanks!

/mjt
--

From: Karel Zak
Date: Tuesday, March 9, 2010 - 3:01 am

Yes. For userspace there is no difference between RAID and non-RAID
devices -- the topology support in the kernel provides a unified API to all
devices. It means we don't need any extra support for RAIDs in
fdisk/parted. The userspace tools follow topology data from the kernel.

The good thing with 1MiB default alignment is that it is usable for
usual stripe sizes (for sizes greater than 1MiB we use optimal I/O

I did almost all my tests with scsi_debug or MD RAID0 on scsi_debug.
It works as expected. (Note that kernel 2.6.31 has a problem with
alignment_offset calculation on stacked devices, so use the latest
kernel where the bug is already fixed.)

But I haven't tried to use unpartitioned (whole) 4K disks for RAIDs,
because scsi_debug does not allow creating more devices (and I don't
have real HW).

Some tests are available in util-linux-ng sources:
http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=tree;f=tests/ts/fdisk

    Karel


 # modprobe scsi_debug dev_size_mb=2500 sector_size=512 physblk_exp=3

    [..create partitions...]

 # fdisk -lcu /dev/sdb 

 Disk /dev/sdb: 2621 MB, 2621440000 bytes
 255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 4096 bytes
 I/O size (minimum/optimal): 4096 bytes / 32768 bytes
 Disk identifier: 0xb585b0be

 Device Boot         Start         End      Blocks   Id  System
 /dev/sdb1            2048     1026047      512000   83  Linux
 /dev/sdb2         1026048     2050047      512000   83  Linux
 /dev/sdb3         2050048     3074047      512000   83  Linux
 /dev/sdb4         3074048     4098047      512000   83  Linux


 # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}

     [...create partitions on the raid...]

 # fdisk -lcu /dev/md8

 Disk /dev/md8: 1572 MB, 1572667392 bytes
 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size ...
From: Michael Tokarev
Date: Tuesday, March 9, 2010 - 3:16 am

No, it's not that simple.  For raid5 (and I especially mentioned raid5
above), raid4 and raid6, 1MiB is only good when the number of devices
is 2^N+1 (for raid[45]) or 2^N+2 (for raid6).  For raid5 that means
3, 5, 9, 17, .. disks.  In all other cases the alignment (which should
match stripe size) will not be power of two.  For example, for a 4-disk
raid5 array with 1MiB chunk size the partitions should be aligned at
3MiB boundaries.  For 6-disk raid5 with 256KiB chunk size it is
5x256=1280 KiB.  And so on.
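
The arithmetic above can be sketched as follows. This is only an illustration of the full-stripe calculation, assuming one parity chunk per stripe for raid4/5 and two for raid6; the function names are made up for the example and are not taken from md or fdisk.

```python
KIB = 1024
MIB = 1024 * KIB

# Parity chunks per stripe for the md levels discussed in the thread.
PARITY_DISKS = {0: 0, 4: 1, 5: 1, 6: 2}


def data_disks(level, n_disks):
    """Number of data-bearing chunks in one stripe."""
    return n_disks - PARITY_DISKS[level]


def full_stripe_bytes(level, n_disks, chunk_bytes):
    """Size of one full stripe, i.e. the alignment unit argued for above."""
    return data_disks(level, n_disks) * chunk_bytes


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


for level, disks, chunk in [(5, 4, MIB), (5, 6, 256 * KIB), (5, 5, 64 * KIB)]:
    stripe = full_stripe_bytes(level, disks, chunk)
    print(f"raid{level}, {disks} disks, {chunk // KIB} KiB chunk:",
          f"{stripe // KIB} KiB stripe,",
          "power of two" if is_power_of_two(stripe) else "not a power of two")
```

Only the configurations whose data-disk count is a power of two produce a stripe size that a power-of-two alignment such as 1MiB can be a multiple (or divisor) of.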

Yes it has little to do with the $subject (4KiB sectors), but it is

Actually, for raid0, the alignment is questionable.  Should it be a
multiple of chunk size or whole stripe size?  I'm not sure, both ways
have bad and good sides..  But if it is the latter, the same issue
pops up again: do a 3-disk raid0 and you'll have to align to 3*2^N.



That's 3-disk stripe size with default 64Kb chunk size, which makes

And here we go: fdisk does not see the right number: nothing
is divisible by 3.


And that's where the issue is.  md does not {sup,re}port all
this stuff yet.

This is what I'm talking about.

Thanks!

/mjt
--

From: Dave Chinner
Date: Tuesday, March 9, 2010 - 4:15 am

Yes, alignment is still needed, especially for filesystems that can
do stripe unit aligned allocation like XFS. If you don't align the
filesystem properly, all the data IO will be mis-aligned to the
underlying disks and stripe unit sized IO will hit multiple disks
rather than just one....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Michael Tokarev
Date: Tuesday, March 9, 2010 - 4:38 am

I understand alignment is needed, the question is if the alignment
should be to chunk size or full-stripe size.  In neither case it
will be bad for underlying disks.

/mjt
--

From: Dave Chinner
Date: Tuesday, March 9, 2010 - 5:20 am

Depends on the RAID implementation. High end RAID arrays often have
cache bypass features that are triggered by stripe width aligned and
sized IOs. When receiving well formed IO they can more than double
write performance because they are not limited by internal cache
mirroring bandwidth (e.g. the controller magically switches to
write-through for those well formed IOs instead of writeback).

So from that perspective, alignment needs to be to stripe width,
not stripe unit. Similarly for RAID5/6 alignment needs to be to
stripe width, so that a well formed IO issued by the filesystem
only hits one RAID5/6 stripe.
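
A small sketch of what "well formed" means here, assuming a simple layout where stripe N covers bytes [N * width, (N + 1) * width) of the array and width = chunk size * number of data disks. The helper name is hypothetical; it only splits one I/O at stripe boundaries and says which pieces cover a whole stripe.

```python
def stripe_spans(offset, length, chunk_bytes, data_disks):
    """Yield (stripe_index, bytes_in_stripe, is_full_stripe) for one I/O."""
    width = chunk_bytes * data_disks
    end = offset + length
    pos = offset
    while pos < end:
        stripe = pos // width
        stripe_end = (stripe + 1) * width
        n = min(end, stripe_end) - pos
        # A piece is a full-stripe write only if it covers the whole width.
        yield stripe, n, n == width
        pos += n


CHUNK = 64 * 1024
WIDTH = 3 * CHUNK  # e.g. a 4-disk raid5: three data chunks per stripe

print(list(stripe_spans(0, WIDTH, CHUNK, 3)))      # aligned, one full stripe
print(list(stripe_spans(CHUNK, WIDTH, CHUNK, 3)))  # same size, shifted by one chunk
```

The second I/O has the same size as the first but touches two stripes, and neither piece is a full-stripe write.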

FWIW, XFS takes great care to ensure that it doesn't place all its
allocation group headers on the same stripe unit.  Failing to
distribute the AG headers across all the stripe units evenly loads
the disks/luns in the stripe unevenly. As soon as you have uneven
load on a stripe the performance tanks as a stripe is only as fast as
its slowest member.

Also, while XFS prefers to align to stripe unit, there are mount
options to change the default allocation alignment to be stripe
width based. Hence if you have large files and applications that are
doing well formed IO, stripe width alignment of the filesystem to
the underlying block device is critical to achieving deterministic
throughput close to the maximum the hardware can support.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Karel Zak
Date: Tuesday, March 9, 2010 - 4:50 am

Note that I have 2.6.31.12-174.2.22.fc12.x86_64 kernel on my laptop.
It would be better for serious tests to use 2.6.33.

    Karel
 
-- 
 Karel Zak  <kzak@redhat.com>
--

From: Karel Zak
Date: Tuesday, March 9, 2010 - 5:18 am

Well, the same setup with 2.6.34-0.9.rc0.git13.fc14.x86_64:

 # fdisk -luc /dev/sdb

 Disk /dev/sdb: 2621 MB, 2621440000 bytes
 255 heads, 63 sectors/track, 318 cylinders, total 5120000 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 4096 bytes
 I/O size (minimum/optimal): 4096 bytes / 32768 bytes
 Disk identifier: 0x77fbab55

 Device Boot         Start         End      Blocks   Id  System
 /dev/sdb1            2048     1026047      512000   83  Linux
 /dev/sdb2         1026048     2050047      512000   83  Linux
 /dev/sdb3         2050048     3074047      512000   83  Linux
 /dev/sdb4         3074048     4098047      512000   83  Linux


 # mdadm --create /dev/md8 --level=5 --raid-devices=4 /dev/sdb{1,2,3,4}


 # fdisk -luc /dev/md8

 Disk /dev/md8: 1572 MB, 1572667392 bytes
 2 heads, 4 sectors/track, 383952 cylinders, total 3071616 sectors
 Units = sectors of 1 * 512 = 512 bytes
 Sector size (logical/physical): 512 bytes / 4096 bytes
 I/O size (minimum/optimal): 65536 bytes / 65536 bytes


 # cat /sys/block/md8/queue/{minimum,optimal}_io_size 
 65536

 Hmm...

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
--

From: Martin K. Petersen
Date: Tuesday, March 9, 2010 - 10:06 pm

>>>>> "Karel" == Karel Zak <kzak@redhat.com> writes:

[Cleaned up the CC: list from hell]

Karel>  # cat /sys/block/md8/queue/{minimum,optimal}_io_size
Karel>  65536 65536

This one had me puzzled.  We set min_io and opt_io correctly in raid5.c
depending on number of non-parity disks.  And yet it turns into
something nonsensical after.

Turns out we overrun unsigned int calculating the lowest common multiple
in the stacking function.  That's why we ended up with the wrong value.

I never noticed this because my userland topology regression test tool
uses unsigned long.
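
To illustrate the class of bug (this is not the kernel's stacking code, and the inputs are just plausible values for a 64 KiB chunk with three data disks), here is a sketch that emulates a 32-bit unsigned product in an lcm computed as (a * b) / gcd(a, b), next to a variant that divides before multiplying. How zero inputs are treated is an assumption of the example.

```python
from math import gcd

U32_MASK = 0xFFFFFFFF


def lcm_u32_naive(a, b):
    """lcm as (a * b) / gcd, with the product truncated to 32 bits."""
    if a == 0 or b == 0:
        return max(a, b)  # assumption: a zero limit means "not set"
    product = (a * b) & U32_MASK  # emulates unsigned int wraparound
    return product // gcd(a, b)


def lcm_wide(a, b):
    """Same result mathematically, but divides first so nothing overflows here."""
    if a == 0 or b == 0:
        return max(a, b)
    return a // gcd(a, b) * b


min_io = 64 * 1024       # one chunk
opt_io = 3 * 64 * 1024   # three data disks' worth

print(lcm_u32_naive(min_io, opt_io))  # wrapped product gives a nonsensical value
print(lcm_wide(min_io, opt_io))       # 196608
```

The exact wrong value the kernel produced depends on details of the stacking function that are not shown in this thread; the sketch only shows how the intermediate product can exceed 32 bits even when both inputs and the true result are small.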

I'll get a patch off to Jens right away.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Henrique de Moraes Holschuh
Date: Wednesday, March 10, 2010 - 1:50 pm

And please get the whole fixed deal in -stable eventually, for 2.6.32.y's
benefit :-)

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
--

From: Martin K. Petersen
Date: Tuesday, March 9, 2010 - 9:57 pm

>>>>> "Michael" == Michael Tokarev <mjt@tls.msk.ru> writes:

[MD I/O topology support]

Michael> But apparently it does not implement anything of this sort.
Michael> Adding Neilb to the Cc list.......

git show 8f6c2e4b

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Karel Zak
Date: Monday, March 8, 2010 - 12:58 pm

Yes, you're right. 

(IMHO our goal should be to minimize number of places where anything

The limit is specific for DOS partition table (with 512-byte log.
sectors), but for example GPT uses 64-bit LBA. I believe that our

 small summary:

 - libblkid provides unified API to topology information, it supports:
    - ioctls (kernel >= 2.6.32)
    - sysfs (kernel >= 2.6.31)
    - stripe chunk size and stripe width for DM, MD, LVM and evms on
      old kernels
 - libparted and fdisk are linked against libblkid

 - fdisk supports 4KiB logical sector size (util-linux-ng >= 2.15)
 - fdisk supports 4KiB physical sector size (util-linux-ng >= 2.17)
 - fdisk uses 1MiB alignment (or more if optimal I/O size is bigger)
   and alignment_offset for all partitions in non-DOS mode
   (util-linux-ng >= 2.17.1)

 - parted supports 4KiB physical sector size
 - parted uses 1MiB alignment for disks with unknown topology, disks
   with topology information are aligned to optimal (or minimum) I/O
   size (parted >= 2.1)
 
 - EFI GPT code in the kernel has been updated to work properly with 
   4KiB sectors (kernel >= 2.6.33)

 - mkfs.{ext,xfs,gfs2,ocfs2} have been updated to work properly with
   topology information, mkfs.{ext,xfs} are linked against libblkid
   for compatibility with old kernel (for stripe chunk size / width)

 - Fedora-13/RHEL6 installer uses libparted with 4KiB support
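
 For anyone who wants to look at the same topology data without going through libblkid, here is a minimal sketch that reads the sysfs attributes directly. It assumes the usual layout (/sys/block/<dev>/queue/... and /sys/block/<dev>/alignment_offset); the function name and the sysfs_root parameter are only for illustration and testing, they are not part of any tool mentioned above.

```python
from pathlib import Path

# Attributes exported under /sys/block/<dev>/queue/ by kernels with topology support.
QUEUE_ATTRS = (
    "logical_block_size",
    "physical_block_size",
    "minimum_io_size",
    "optimal_io_size",
)


def read_topology(dev, sysfs_root="/sys"):
    """Return the I/O topology of a whole block device as a dict of ints."""
    base = Path(sysfs_root) / "block" / dev
    topo = {name: int((base / "queue" / name).read_text()) for name in QUEUE_ATTRS}
    topo["alignment_offset"] = int((base / "alignment_offset").read_text())
    return topo


# Example (needs a real device name on the running system):
#   print(read_topology("md8"))
```

 An optimal_io_size of 0 means the device did not report one.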


 yes, util-linux-ng 2.17.1, fdisk -c
 
 Note that non-DOS mode will be default in the next major
 util-linux-ng release.

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 7:34 pm

Hello,


Hmmm... the 'reportedly' was from Daniel Taylor or maybe I just


That's great.  Daniel, maybe you were testing older versions?  Or
maybe those failures were manifested from libata mishandling 4KiB r/w

This will result in incorrect alignment for drives which lie about the
physical sector size to work around BIOS/drivers issues (C-1).  It


Yeah, good point.  I'm just a bit worried that it might generate a lot
of frustrated bug reports.  Well, maybe we should just advise users to

I'll try to merge this information into the ata-4k doc.

Thank you very much.

-- 
tejun
--

From: Jeff Garzik
Date: Monday, March 8, 2010 - 7:42 pm

Does libata-dev.git#sectsize miss any details?

	Jeff



--

From: Tejun Heo
Date: Monday, March 8, 2010 - 7:49 pm

Hello,


I haven't looked at it yet.  I'll review it soon but the thing is
without actual hardware it would be a bit difficult to tell.  It's not
only the drivers.  I have this mighty unhappy feeling that some
controllers (especially some of the SATA ones with internal state
machine to emulate SFF) would be sniffing the commands and making the
wrong assumption if 4KiB logical sector size is used, so we'll need to
test various controllers.  Some PATA-SATA bridge chips will definitely
be having problems too.  Then there are the USB and other bridges too
but well those aren't libata's problem at least.  :-)

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 7:42 pm

Hello, again.


I misread it.  C-1 would be disks w/o alignment information which will
be aligned to optimal_io_size which again would be 0 and thus 1MiB
alignment.  So, this should work, right?

Thanks.

-- 
tejun
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:11 pm

Tejun> I misread it.  C-1 would be disks w/o alignment information which
Tejun> will be aligned to optimal_io_size which again would be 0 and
Tejun> thus 1MiB alignment.  So, this should work, right?

Correct.  ATA only provides physical block size whereas SCSI has the
extra knobs in the block limits VPD.  And consequently ATA block devices
have min_io = physical block size and optimal_io = 0.

So we'll align to 1 MB by default.
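
As a rough sketch of that rule (a simplification of what was described for fdisk earlier in the thread, not the actual fdisk or parted code; the function names are hypothetical): use 1 MiB unless the device reports a larger optimal I/O size, then express the result in logical sectors.

```python
MIB = 1024 * 1024


def alignment_grain(optimal_io_size, default=MIB):
    """Use the default grain unless the reported optimal I/O size is larger."""
    return max(default, optimal_io_size)


def first_partition_start(grain_bytes, logical_sector_size=512):
    """First non-zero sector that sits on a grain boundary."""
    return grain_bytes // logical_sector_size


# ATA drive: min_io = physical block size, optimal_io = 0.
print(first_partition_start(alignment_grain(0)))            # 512-byte logical sectors
print(first_partition_start(alignment_grain(0), 4096))      # 4 KiB logical sectors
print(first_partition_start(alignment_grain(4 * MIB)))      # large optimal I/O size
```

With 512-byte logical sectors and no optimal I/O size this gives sector 2048, which matches the first partition in the fdisk listings posted in this thread.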

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:09 pm

Tejun> By default, they aren't aligned properly, are they?

Single partition.  I did the alignment manually.


Tejun> libata is broken for logical 4KiB ATA devices tho.  I'll fix it
Tejun> up.

Matthew implemented support for this a while back...


Tejun> I'm just a bit worried that it might generate a lot of frustrated
Tejun> bug reports.  Well, maybe we should just advise users to install
Tejun> windows first and then install Linux.

Unfortunately there is no simple solution given that we can't go back in
time and fix legacy DOS/XP behavior.

The 1-alignment jumper (that some drives have) fixes things for the
first partition but will mess up our alignment for subsequent ones
unless the firmware actually reports the shift.  So no matter what we do
the user will have to have a bare minimum of knowledge about 512-byte
LBS/4 KB PBS drives.  That sucks.  But even Windows users are presented
with extra documentation and alignment utilities during the transition.

Having a 1 MB alignment by default and hoping that devices that lie will
be 0-aligned is the best we can do, I think.
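
To make the jumper problem concrete, here is a small sketch. It assumes a simple model in which a 1-aligned drive internally shifts every logical sector by one, so that LBA 63 lands on a 4 KiB boundary; real firmware may differ, and the names are made up for the example.

```python
LOGICAL = 512
PHYSICAL = 4096
RATIO = PHYSICAL // LOGICAL  # logical sectors per physical sector


def physically_aligned(lba, shift=0):
    """True if the LBA starts a physical sector; shift models the jumper (assumed)."""
    return (lba + shift) % RATIO == 0


def next_aligned_lba(lba, shift=0):
    """Smallest LBA >= lba that starts a physical sector for the given shift."""
    while not physically_aligned(lba, shift):
        lba += 1
    return lba


for shift in (0, 1):
    print("shift", shift,
          "LBA 63:", physically_aligned(63, shift),
          "LBA 2048:", physically_aligned(2048, shift),
          "next aligned after 2048:", next_aligned_lba(2048, shift))
```

With the shift in place the traditional sector 63 start is fine, but a 1 MiB boundary is not, unless the partitioner learns about the shift and compensates.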

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Daniel Taylor
Date: Monday, March 8, 2010 - 8:38 pm

-----Original Message-----
From: Tejun Heo [mailto:tj@kernel.org] 
Sent: Monday, March 08, 2010 6:34 PM
To: Karel Zak
Cc: Martin K. Petersen; linux-ide@vger.kernel.org; lkml; Daniel Taylor; Jeff
Garzik; Mark Lord; tytso@mit.edu; H. Peter Anvin;
hirofumi@mail.parknet.co.jp; Andrew Morton; Alan Cox; irtiger@gmail.com;
Matthew Wilcox; aschnell@suse.de; knikanth@suse.de; jdelvare@suse.de; Jim
Meyering
Subject: Re: ATA 4 KiB sector issues.

Hello,


Hmmm... the 'reportedly' was from Daniel Taylor or maybe I just
misinterpreted the conversation.  Daniel, can you please fill in?

DLT> The problem that I see is that the installers and upper level
applications do not make good choices for partition layout.
DLT> "parted", itself, seems to work OK in the latest version.  One of the
things I've heard since I started this process is that
DLT> there are some libraries associated with the process of
partitioning/formatting.  Perhaps the upper layers and those


That's great.  Daniel, maybe you were testing older versions?  Or maybe
those failures were manifested from libata mishandling 4KiB r/w requests.

DLT> As I said, above, it could be libraries.  I was not aware that so much

This will result in incorrect alignment for drives which lie about the
physical sector size to work around BIOS/drivers issues (C-1).  It would
probably be best to align to at least 1MiB.



Yeah, good point.  I'm just a bit worried that it might generate a lot of
frustrated bug reports.  Well, maybe we should just advise users to install
windows first and then install Linux.

DLT> Simple reality is that XP is "forever".  Drives >2TiB, which may be
USB-attached, used with XP will be MBR-partitioned
DLT> and use 4096-byte sectors.  We need to be able to read/write those

I'll try to merge this information into the ata-4k doc.

Thank you very much.

DLT> One last comment: I just tried to partition and format a >2TiB drive on
fully updated Ubuntu 9.10 with GParted.
DLT> I selected not to cylinder ...
From: Martin K. Petersen
Date: Monday, March 8, 2010 - 9:54 pm

>>>>> "DLT" == Daniel Taylor <Daniel.Taylor@wdc.com> writes:

DLT> Simple reality is that XP is "forever".  Drives >2TiB, which may be
DLT> USB-attached, used with XP will be MBR-partitioned and use
DLT> 4096-byte sectors.  We need to be able to read/write those disks on
DLT> Linux systems.

Shouldn't be a problem as long as the DOS partition table vs. 4 KiB
sectors thing is fixed.


DLT> One last comment: I just tried to partition and format a >2TiB
DLT> drive on fully updated Ubuntu 9.10 with GParted.  I selected not to
DLT> cylinder align, use GPT and ext3, and to put 1 MiB preceding and
DLT> following.  libparted failed with "unable to satisfy all
DLT> constraints of the partition".  Using "parted", I created the
DLT> partition, and then GParted was able to apply the ext3 file system.

I don't think ubuntu has adopted any of the relevant updates yet.

I believe the Fedora 13 Alpha is due to be released this week.  That
would be the best test platform because several of the people who have
been actively engaged in the 4 KiB sector enablement process are Fedora
developers.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Jim Meyering
Date: Tuesday, March 9, 2010 - 12:27 am

Thanks for the summary, Karel.
In case anyone wants more high-level detail on the parted front,
here's its NEWS file:

    http://git.debian.org/?p=parted/parted.git;a=blob;f=NEWS

Currently, I'm not planning much for Parted, other than clean-up.
For example, I want to remove all of the FS-related code (it's
horribly bit-rotted) from the package, with the exception of
HFS/HFS+ and FAT resizing capabilities, since AFAIK, Parted
has the only free implementations.  If any of you know of other
implementations or work in progress, please let me know.


Related information, prompted by my recent encounter with a
tool that refused to let me use a GPT partition table.

Partition table formats: prefer GUID/GPT:

  Having spent more than my share of time looking at partition table
  formats recently, I am now strongly biased against DOS partition
  tables, and for GUID/GPT ones.  In addition to allowing for >2TiB
  partition offsets and lengths, GPT tables provide for better
  protection in case of corruption (checksums, backup table at end
  of disk) and don't have the anachronistic distinction of primary
  and extended/logical partitions (all partitions are "primary").
  You can even give each partition a name.  The only reason to use a
  DOS partition table on a new installation is if you're stuck with
  a requirement of using an OS like XP on bare metal.

Please consider encouraging the use of GPT partition tables...
or at least do not *dis*courage their use.
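
For reference, the DOS partition table limit follows directly from its 32-bit sector fields. A quick sketch of the arithmetic (the function name is made up; the limit shown is the largest start offset or length a single entry can express, ignoring the last sector):

```python
TIB = 2 ** 40


def dos_table_limit_bytes(logical_sector_size):
    """Upper bound for a 32-bit start or length field, in bytes."""
    return (2 ** 32) * logical_sector_size


for size in (512, 4096):
    print(f"{size}-byte logical sectors: {dos_table_limit_bytes(size) // TIB} TiB")
```

This is why a DOS table on a drive with 4096-byte logical sectors can describe much more than 2 TiB, while GPT with its 64-bit LBA fields does not depend on the sector size for this.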
--

From: Tejun Heo
Date: Tuesday, March 9, 2010 - 4:56 pm

Hello,


I'll surely include it.

Thanks.

-- 
tejun
--

From: H. Peter Anvin
Date: Monday, March 8, 2010 - 1:12 pm

Please correct the following bit in C-3:

"A different partition format - GPT[6] - should be used beyond 2^32
sectors, which could harm compatibility with older BIOSs or other
operating systems which don't recognize the new format."

BIOS does not care about the partition table format.  There might be
issues with > 2^32 sectors for BIOSes (e.g. truncating sector counts),
but that would be unrelated.

	-hpa
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 7:22 pm

Hello,


Updated to,

  This might also be beneficial for operating systems which don't
  suffer from this limitation.  A different partition format - GPT[6]
  - should be used beyond 2^32 sectors, which could harm compatibility
  with other operating systems which don't recognize the new format.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Monday, March 8, 2010 - 7:44 pm

Hello,


Yeah, I know Mark fixed it but couldn't find where the tree was.  SF
only had old releases, so...

(other stuff replied further down the thread)

Thanks.

-- 
tejun
--

From: Martin K. Petersen
Date: Monday, March 8, 2010 - 8:18 pm

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> Yeah, I know Mark fixed it but couldn't find where the tree was.
Tejun> SF only had old releases, so...

Tejun> (other stuff replied further down the thread)

Looks like Mark hasn't made an hdparm release since I posted the patch.
It's here:

http://marc.info/?l=linux-ide&m=126427438620651&w=2

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Mark Lord
Date: Tuesday, March 9, 2010 - 7:32 am

..

Holy crap.  I thought I'd put that out months ago!

Anyway, it's there now:  https://sourceforge.net/projects/hdparm/

Thanks!
--

From: Mikael Abrahamsson
Date: Monday, March 8, 2010 - 11:34 pm

Is this really true? WD ships their EARS drives with an alignment tool 
that as far as I can understand, moves the partition so
it's aligned to 4KiB:

http://www.wdc.com/en/products/advancedformat/

So an XP fresh install (including letting XP partition the drive) will be 
misaligned, but if you clone xp onto a properly aligned partition (or run 
the tool and let it move the partition), it'll be ok. So saying that XP 
"depends" on traditional partition layout might be a bit of a stretch?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se
--

From: Michal Soltys
Date: Tuesday, March 9, 2010 - 3:06 am

XP SP2 (or later) can boot from any place, including logical partitions 
(tested that recently). Most important thing is "hidden sectors" (recent 
chain.c32 can set that automatically through ntldr and/or sethidden 
options). No idea about pre-SP2 ; Win 2000 will not boot from "misaligned" 
(with reference to cylinder boundary) partition.

--

From: Tejun Heo
Date: Tuesday, March 9, 2010 - 5:11 pm

Hello,


Hmmm... I based that claim on the MS KB page and as you pointed out
the problem there could probably be issues with specific BIOS

I was thinking about testing XP booting this weekend but really want
to avoid it, so thanks a lot for the info.  I'll update the doc
accordingly but can you please enlighten me on how it works and what's
broken in detail?  So, XP should be fine with any alignment?

Thanks.

-- 
tejun
--

From: Michal Soltys
Date: Sunday, March 14, 2010 - 2:09 pm

Sorry for late reply.

s/sp2/sp3 - although it shouldn't make a difference from sp2 onwards.

Anyway - the tests I did were because of a weird laptop, where I shrank 
the whole win7 stuff and, having no primary partitions left to use, tested 
my usual windows xp installation that I deploy with ntfsclone. Originally 
that XP was installed from an installation disk merged with sp3 (or, as 
it's usually called in the windows world, slipstreamed). Of course, 
windows xp itself will not present any option to install itself into 
a logical partition in the usual way - but during later deployment it's not 
a problem to put it wherever one wants.

It's possible that this wouldn't work, if windows were installed first 
from pre-sp2 media, and then service pack was installed (in such case, 
ntldr in C:\ is not updated afaik). It's also possible, that "brute-force" 
copied pre-sp2 or win2k to a partition made with either - a) xp sp2+'s disk 
manager or b) mkfs.ntfs and with updated most recent ntldr -  would boot as 
well (the partition requirement is due to potential differences between the code 
in bootsector, or more precisely - $Boot - first 8KiB of ntfs partition).

Obvious requirements besides the above (ntldr, perhaps $Boot as well) are:

- mentioned "hidden sectors" (must be manually adjusted, recent syslinux's 
chain.c32 has option to do it automatically)
- adjusted boot.ini (to point to new partition, eventually other windowish 
stuff as necessary)

As you can see, there're many "if"s and combinations here that I didn't test.

On a related note - ironically, while I had 0 problems making it work 
through syslinux (both regular chaining and through direct ntldr loading) - 
I couldn't make win7's bootmgr (bcd, bcdedit ....) do it properly. Oh well.

--

From: s ponnusa
Date: Sunday, March 14, 2010 - 3:56 pm

I have been following this thread and I might possibly be testing with
Windows XP soon. Will update the results.
-
SP

--

From: Mark Lord
Date: Tuesday, March 9, 2010 - 6:55 am

On 03/07/10 22:48, Tejun Heo wrote:
..

That issue was fixed quite a while ago.
--

From: Tejun Heo
Date: Tuesday, March 9, 2010 - 5:00 pm

Heh heh, *you* were keeping it from me!  Anyways, is there hdparm
devel tree published somewhere?  I wandered the SF page for quite a
bit (which for some reason is very difficult to find things in) but I
couldn't find one.  If it's not, it might be a good idea to put it on
SF or git.kernel.org?

Thanks.

-- 
tejun
--

From: Mark Lord
Date: Tuesday, March 9, 2010 - 11:08 pm

..

No tree.  There's just my working copy (private),
and the published versions at SF.

But yes, SF has gotten incredibly more cryptic to use of late,
and I might have to move it somewhere more accessible soon.

Cheers!
--

From: Arnd Bergmann
Date: Tuesday, March 9, 2010 - 4:46 pm

Any idea if XP can cope with partition tables that use a 32-sector, 128-head
geometry rather than the default 63-sector, 255-head one? That seems to
be what some flash memory cards are using and it would make any cylinder
aligned partition also 4096-byte aligned, at the cost of moving the
1024-cylinder boundary from about 7.84 GiB to 2 GiB.
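
A quick way to check both claims, assuming 512-byte logical sectors and the classic 1024-cylinder CHS limit (the function names are just for the example):

```python
GIB = 2 ** 30


def cylinder_bytes(heads, sectors, sector_size=512):
    """Size of one cylinder for a given heads/sectors-per-track geometry."""
    return heads * sectors * sector_size


def chs_boundary_bytes(heads, sectors, sector_size=512, cylinders=1024):
    """Offset of the 1024-cylinder boundary for that geometry."""
    return cylinders * cylinder_bytes(heads, sectors, sector_size)


for heads, sectors in [(255, 63), (128, 32)]:
    cyl = cylinder_bytes(heads, sectors)
    print(f"{heads}h/{sectors}s: cylinder is",
          "4 KiB aligned," if cyl % 4096 == 0 else "not 4 KiB aligned,",
          f"1024-cylinder boundary at {chs_boundary_bytes(heads, sectors) / GIB:.2f} GiB")
```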

Do we know of anything that requires 63s/255h?

	Arnd
--

From: Tejun Heo
Date: Tuesday, March 9, 2010 - 5:20 pm

Michal Soltys pointed out that XP doesn't really depend on the legacy
layout although 2000 does (can't boot), so I guess it shouldn't be
much of a problem.

Regarding the geometry, IIUC changing it isn't meaningful for
compatibility.  Geometry information is obtained using a BIOS call
(the int Xh thing) and the hard disk itself doesn't carry that
information, so unless you go into the BIOS set up and enter those
values manually (and I don't think you can do that on many BIOSs these
days), there's no way for anyone else to know custom geometry other
than solving equations using the CHS and LBA information in the
partition table.

So, feeding custom geometry to a partitioner which uses CHS to
determine the layout is useful to make it create partitions aligned in
certain way but as the information regarding the geometry is not
recorded anywhere, others will just keep using whatever they were
using (255*63) and figure that CHS and LBA in the partition tables
just don't match.
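
The "solving equations" part can be sketched as a brute-force search. This is only an illustration, assuming the standard mapping LBA = (C * heads + H) * sectors + (S - 1); the function name is made up, and only entries below the 1024-cylinder boundary are useful because the CHS fields saturate above it.

```python
def infer_geometry(entries, max_heads=255, max_sectors=63):
    """Return every (heads, sectors) pair consistent with the given entries.

    entries is a list of (cylinder, head, sector, lba) tuples taken from
    the CHS and LBA fields of a partition table.
    """
    matches = []
    for heads in range(1, max_heads + 1):
        for sectors in range(1, max_sectors + 1):
            if all(h < heads and 1 <= s <= sectors
                   and (c * heads + h) * sectors + (s - 1) == lba
                   for c, h, s, lba in entries):
                matches.append((heads, sectors))
    return matches


# A partition starting at CHS 0/1/1 = LBA 63 and another at CHS 1/0/1 = LBA 16065.
print(infer_geometry([(0, 1, 1, 63), (1, 0, 1, 16065)]))
```

With a single entry at cylinder 0 the head count stays ambiguous, which is why at least one entry on a later cylinder is needed.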

Thanks.

-- 
tejun
--

From: Denys Vlasenko
Date: Wednesday, March 10, 2010 - 2:14 am

63s/255h is more or less "standard" now.

Alignment issues can be solved by picking a good multiple of
_heads_ or _cylinders_:

For first partition, pick the start at 8th head:

cyl 0 head 1 sector 1 (LBA sector 63) - bad
cyl 0 head 8 sector 1 (LBA sector 8*63) - good (4k aligned)

For any other partition, pick start cylinder which is a multiple of 8:

cyl 8*x head 0 sector 1 (LBA sector 8*x*255*63) - good (4k aligned)

This will actually work well for *any* geometry, not only for 63s/255h.
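
A small check of the rule above, assuming 512-byte logical sectors and the usual CHS to LBA mapping; the helper names are made up for the example.

```python
def chs_to_lba(cyl, head, sector, heads=255, sectors=63):
    """Standard CHS to LBA conversion (sector numbers start at 1)."""
    return (cyl * heads + head) * sectors + (sector - 1)


def aligned_4k(lba, logical_sector_size=512):
    return (lba * logical_sector_size) % 4096 == 0


print(chs_to_lba(0, 1, 1), aligned_4k(chs_to_lba(0, 1, 1)))   # traditional start
print(chs_to_lba(0, 8, 1), aligned_4k(chs_to_lba(0, 8, 1)))   # start at 8th head
print(all(aligned_4k(chs_to_lba(8 * x, 0, 1)) for x in range(1, 100)))
# Same rule with a different geometry (16 heads, 17 sectors per track):
print(aligned_4k(chs_to_lba(8, 0, 1, heads=16, sectors=17)))
```

It works for any geometry because a start at head 8 or at a cylinder that is a multiple of 8 always puts a factor of 8 into the LBA, and 8 sectors of 512 bytes are 4 KiB.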
-- 
vda
--

From: H. Peter Anvin
Date: Sunday, March 14, 2010 - 6:21 pm

Yes, but it does squat for a flash disk that wants, say, 256K alignment.

	-hpa
--

From: Denys Vlasenko
Date: Sunday, March 14, 2010 - 7:26 pm

4K makes sense. 256K not so much.

256K alignment is hard to swallow for a lot of reasons anyway.
Unless the filesystem packs small files into blocks a-la reiserfs,
256K block filesystems will be very inefficient for typical
storage scenarios.

It looks like flash storage manufacturers just have to bite
the bullet and develop smarter algorithms that combine wear
leveling, block remapping and such and make their internal
preference for huge continuous aligned writes nearly invisible
from the outside - just like hard disks which do not expose
their zoned recording, variable sector counts etc.

Such algorithms aren't trivial, but they are possible.
Whoever incorporates them in their products
will deliver a significantly better user experience.

I just played with ubuntu installation on an usb stick.
Yes, it works. Sort of. Write performance is abysmal.
I would pay x2 or x3 for the same sized stick if it
would perform better.

-- 
vda
--

From: Greg Freemyer
Date: Sunday, March 14, 2010 - 7:56 pm

In general USB sticks don't offer the same performance as SSDs.

You can find sticks with both USB and eSata.  I'd hope those offer
better performance.

You should read some performance reviews.  I'm sure you can find a few
sticks that are much better than what you get from a vanilla usb
stick.

Greg
--

From: H. Peter Anvin
Date: Sunday, March 14, 2010 - 9:00 pm

No one has talked about using 256K filesystem blocks.  The fact of the
matter, though, is that both flash and RAID have much larger alignment
requirements than a mere 4K for optimal performance.

You might not like it, but that's the way it is.

	-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Arnd Bergmann
Date: Monday, March 15, 2010 - 5:30 am

Well, logfs has just been merged and works with block sizes in that
range, but obviously only if the partition is correctly aligned.

	Arnd
--

From: david
Date: Sunday, March 14, 2010 - 10:20 pm

the thing is, if the OS can learn that it's more efficient to write in 
256K aligned chunks, then it can batch up things so that the drive doesn't 
have to do a read-modify-write cycle and can instead just replace the 
entire chunk.

raid arrays can benefit from this as well as SSDs.

the OS can do this when writing things to swap, flushing dirty buffers, 
mmaped files, etc (in fact, if the OS knows the full contents of the 
chunk, it may be more efficient for the OS to write the entire thing than 
to write part of it and have the drive/array do the read-modify-write 
cycle)
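
As a rough sketch of the batching idea, the following splits a write into
pieces on chunk boundaries and marks which pieces cover a whole chunk (and
can simply replace it) versus which would force a read-modify-write.  The
256K chunk size and the function name are assumptions for illustration.

```python
CHUNK = 256 * 1024  # assumed preferred write unit, in bytes


def split_write(offset, length, chunk=CHUNK):
    """Split a byte range into (start, end, full) pieces on chunk boundaries.

    full is True when the piece covers one entire chunk; pieces with
    full == False only touch part of a chunk and would need RMW.
    """
    pieces = []
    end = offset + length
    pos = offset
    while pos < end:
        chunk_start = pos - pos % chunk
        piece_end = min(end, chunk_start + chunk)
        full = pos == chunk_start and piece_end == chunk_start + chunk
        pieces.append((pos, piece_end, full))
        pos = piece_end
    return pieces


# An aligned two-chunk write needs no RMW; the same amount of data
# shifted by 4 KiB touches three chunks, two of them only partially.
print(split_write(0, 2 * CHUNK))
print(split_write(4096, 2 * CHUNK))
```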

--

From: Denys Vlasenko
Date: Monday, March 15, 2010 - 2:56 am

I think Linux already is doing this. The problem is, in many cases
OS can't possibly do this, short of using a specially designed
filesystem.

If you untar a Linux kernel source tarball on a seriously
fragmented ext2 filesystem, there will be a lot of discontiguous
and/or misaligned writes smaller than 256K.
Only smart firmware can help in this case.
-- 
vda
--

From: H. Peter Anvin
Date: Monday, March 15, 2010 - 7:47 am

Yes, but guess what... there is a lot of stupid firmware out there, and
there are lots of RAID arrays, and so on.

"Seriously fragmented" means you have already lost in the first place.

This doesn't change the fact that this is a real issue and that that is
the major reason why aligning to 63*4K is a bad idea.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Tejun Heo
Date: Monday, March 15, 2010 - 7:30 pm

Hello,


I've got a couple of comments stating that picking good geometry
parameters can resolve the whole issue but I simply fail to see how it
could.  We can pick any parameter we wish, but there is no reliable
way to communicate the custom geometry parameters to others.

Geometry is determined by two parameters sec/trk and heads/cyl.  You
can punch in those numbers if the BIOS has a menu for it (many don't
these days).  Or hope that BIOS can somehow figure it out from the
partition table which some BIOSs actually try to do.  The problem is
that to determine the two parameters you need to equations matching
CHSs and LBAs and that's available iff the first partition ends before
CHS addressing limit according to the custom geometry, which usually
is not the case.

So, custom geometry is only useful to trick partitioners which align
using cylinders into using better alignments but doesn't help anything
for compatibility as no one can determine the used geometry reliably
after the partitioning is complete.  With compatibility benefit gone,
there simply is no reason to stick to the cylinder abstraction at all.

Am I missing something?

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Monday, March 15, 2010 - 7:32 pm

Aieee... critical typo.

    s/you need to equations/you need two equations/

-- 
tejun
--

From: James Bottomley
Date: Monday, March 15, 2010 - 11:14 pm

Sort of.  As you say, C/H/S doesn't exist for any modern disk.  However,
the msdos label, for reasons lost in the mists of time, uses cylinders
as the units of partition boundaries, so we have to invent a bogus C/H/S
geometry for that partition label.  Because of the problems with picking
C/H/S, most boot loaders take care to ensure that BIOS never cares about
it either (by using the block offset I/O routines), so for most linux
bootloaders, the BIOS problems with C/H/S are a red herring.

So, it is true to say that picking a certain H/S geometry (which is
entirely within the gift of the partitioner) will align msdos label
partitions, but will be a don't-care for all other labels: all other
partition labels (like gpt) use block as offset and don't have any truck
with the fictitious C/H/S stuff.

The big problem is that 99% of the x86 systems out there still use the
ancient msdos label for their boot disks, so aligning H/S going forwards
will give us a nice "just works" for x86 boxes.

James


--

From: Tejun Heo
Date: Monday, March 15, 2010 - 11:22 pm

Hello, James.


For any modern Linux and Windows, CHS simply doesn't matter.  They

What I don't get is how picking a custom geometry can make
things work when there is *no* reliable way to determine which
geometry was used during partitioning once the partitioning is
complete.  Most BIOSs these days will simply report the geometry as
being 255/63 regardless of the geometry used during partitioning.  So,
how can using a custom geometry give that nice "just works" for x86
boxes when nobody knows what geometry is in use?

Thanks.

-- 
tejun
--

From: James Bottomley
Date: Tuesday, March 16, 2010 - 6:24 am

For msdos labels, it's embedded in the label ... for all other labels,

Because the msdos label can only partition in units of cylinders.  If
you're using an msdos label, picking the right H/S gets you alignment.

James


--

From: Tejun Heo
Date: Tuesday, March 16, 2010 - 6:56 am

Hello, James.


Where in the label?

Thanks.

-- 
tejun
--

From: James Bottomley
Date: Tuesday, March 16, 2010 - 7:21 am

No idea ... I only know you can use fdisk expert mode to change the
C/H/S layout and the change is preserved across reboots.

James


--

From: Arnd Bergmann
Date: Tuesday, March 16, 2010 - 7:25 am

IIRC, the layout is guessed from the partition end locations, in the
assumption that each partition is aligned to full cylinders. That
gives you the heads/sectors number, while the cylinder number can be
calculated from the total number of sectors using these numbers.
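
A minimal sketch of that guess, assuming a partition really does end on
the last head and last sector of a cylinder.  The function name is
hypothetical; the sample sector count is the 500 GB drive from the summary
document.

```python
def guess_geometry(end_head, end_sector, total_sectors):
    """Guess (cylinders, heads, sectors/track) from a partition's end CHS.

    Assumes the partition ends on a cylinder boundary, so its end head
    and end sector are the last head and last sector of a cylinder.
    """
    if end_sector < 1:
        raise ValueError("end sector must be at least 1 (empty entry?)")
    heads = end_head + 1          # heads are numbered from 0
    sectors_per_track = end_sector  # sectors are numbered from 1
    cylinders = total_sectors // (heads * sectors_per_track)
    return cylinders, heads, sectors_per_track


# A partition ending at head 254, sector 63 on a 976773168-sector drive.
print(guess_geometry(254, 63, 976773168))
```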

	Arnd
--

From: Tejun Heo
Date: Tuesday, March 16, 2010 - 7:50 am

Hello,


The CHS addresses are stored alongside with the LBA addresses.  The
problem is that the geometry parameters (sectors/track and heads/cyl)
are not stored anywhere and CHS addresses don't make any sense without
the two parameters.  The only way to figure out the geometry
parameters is to solve two equations involving CHS addresses and LBA
addresses.

e.g.  If the first partition begins at CHS 0/32/33 and ends at
12/223/19 and the corresponding LBA addresses are 2048 and 206847, you
can solve the equations and determine that the parameters gotta be 63
secs/trk and 255 heads/cyl to make those two pairs of addresses match
each other.  In fact, some BIOSs try to do this depending on
configuration (and sometimes fall into an infinite loop or cause other
boot related problems if the parameters are too uncommon).

This method can't work reliably even at theoretical level because it
requires at least two pairs of CHS/LBA addresses to match (two unknown
parameters to solve for) and there is only a single pair available if
the first partition goes over the CHS limit which at maximum is 8GiB.

So, CHS *values* are preserved if they fall below the CHS limit of the
geometry used during partitioning but the geometry information is lost
making the CHS values completely meaningless, so the only sane thing
to do is to stick to whatever geometry parameters are provided by the
BIOS, which usually is 255/63 these days.  Otherwise, the results are...

* If the first partition ends before the CHS limit and BIOS is
  configured to calculate back the parameters, BIOS may be able to
  report the geometry correctly.

* If the first partition goes over the CHS limit,

  * BIOS can use 255/63 or whatever default parameters and CHS and LBA
    addresses won't match each other which won't be a problem for
    modern OSes as they don't look at the CHS addresses at all but
    older operating systems which consider both CHS and LBA addresses
    may get confused.

  * BIOS can set up arbitrary ...
--

From: James Bottomley
Date: Tuesday, March 16, 2010 - 8:02 am

for an msdos label, this is illegal, that was Arnd's point.  The
partitions have to begin and end on cylinder boundaries*.  Knowing that,
you can deduce the geometry from the last sector entry.

James

* at least if you want to preserve windows compatibility, which is what
most of our partitioning tools seem to do.


--

From: Tejun Heo
Date: Tuesday, March 16, 2010 - 8:20 am

Well, the thing is that

* Anything remotely modern (>= XP) doesn't give a hoot about cylinder
  alignment.

* Anything older (<= 2000) is very likely to get confused with custom
  geometry starting from the BIOS itself.  For those cases, the only
  thing we can do is aligning partitions to cylinders abiding BIOS
  supplied geometry parameters which will usually be 255/63.

So, using custom geometry doesn't help compatibility at all.

Thanks.

-- 
tejun
--

From: James Bottomley
Date: Tuesday, March 16, 2010 - 8:23 am

Our partitioning tools still obey the integral cylinder rule ... we can
argue about whether they should, but what we need is a strategy for
fixing what is rather than what should be.

James

--

From: Tejun Heo
Date: Tuesday, March 16, 2010 - 8:37 am

Hello,


The updated ones don't anymore.  They just align to 1MiB + whatever
the drive requests for offset (the offset-by-one thing).  They will
basically behave the same as windows vista/7 ones, so it's already
fixed.  What we can argue about is whether to add CHS tricks on top to
make those larger alignments somewhat meaningful w/ CHS interpretation
too, which I'm objecting to on the grounds that it doesn't help
compatibility at all.
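
For illustration, the "1MiB + offset" placement could look roughly like
the sketch below.  It assumes 512-byte logical sectors; offset_sectors is
a hypothetical parameter standing in for whatever offset the drive
requests, and the value 7 in the usage lines is an assumed example for an
offset-by-one drive, not something taken from a real partitioner.

```python
SECTORS_PER_MIB = 2048  # 1 MiB in 512-byte logical sectors (assumed)


def aligned_start(min_lba, granularity=SECTORS_PER_MIB, offset_sectors=0):
    """Smallest LBA >= min_lba of the form k * granularity + offset_sectors."""
    # Ceiling division without floats.
    k = -(-(min_lba - offset_sectors) // granularity)
    return max(k, 0) * granularity + offset_sectors


print(aligned_start(63))                    # traditional start, moved up
print(aligned_start(63, offset_sectors=7))  # same, with a drive offset
```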

Thanks.

-- 
tejun
--

From: Ric Wheeler
Date: Tuesday, March 16, 2010 - 1:42 pm

Dropping any mention of CHS seems to be the only sensible thing. Why 
waste any time to continue some myth about drives that no modern 
hardware supports (and then have the joy of explaining that to users)?

Talking about it only confuses people and in the worst case, could cause 
them to misalign their partitions by clinging to these pretend borders :-)

ric

--

From: Tejun Heo
Date: Tuesday, March 16, 2010 - 7:04 pm

Hello, Ric.


I don't think not mentioning it would clear up the myth.  It would
probably be a good idea to beef up the document to clear
misconceptions around disk geometry.  I'll give a shot at it.

Thanks.

-- 
tejun
--

From: Martin K. Petersen
Date: Tuesday, March 16, 2010 - 8:22 am

>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> * Anything remotely modern (>= XP) doesn't give a hoot about
Tejun>   cylinder alignment.

Tejun> * Anything older (<= 2000) is very likely to get confused with
Tejun>   custom geometry starting from the BIOS itself.  For those
Tejun>   cases, the only thing we can do is aligning partitions to
Tejun>   cylinders abiding BIOS supplied geometry parameters which will
Tejun>   usually be 255/63.

Tejun> So, using custom geometry doesn't help compatibility at all.

Great reads on this topic.  Might be worth linking to:

	http://www.win.tue.nl/~aeb/partitions/partition_types.html

	http://www.win.tue.nl/~aeb/linux/largedisk.html

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Tejun Heo
Date: Tuesday, March 16, 2010 - 7:07 pm

Hello,


Thanks for the links.  I'll read and link them.  BTW, if you can spot
something wrong regarding this in the doc, please let me know.  I'm
still learning how all this legacy stuff is supposed to work so there
likely are some points that I got wrong.

-- 
tejun
--

From: Bill Davidsen
Date: Wednesday, March 17, 2010 - 10:04 am

I think you hit on the real culprit and ignored it, it seems that even modern 
BIOS implementations, at least some of them, do not want to cross a cylinder 
boundary doing boot. Or maybe that's dumb MBR code, which at least has the 
excuse of being size limited.

I did try using 48 sector geometry on a virtual drive, and it seems as though 
both Linux and XP will install. Then I tried on a USB stick and the BIOS in 
several old Asus laptops will boot that.

I cautiously suggest that since nothing past boot uses CHS, and using 48 spt 
seems to work and gives correct alignment, perhaps there is value in custom 
geometry.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

--

From: Denys Vlasenko
Date: Tuesday, March 16, 2010 - 7:38 am

The "end of partition" is expected to be at the last head and sector.
Of course this heuristic fails if there is more than one primary
partition and they have differing last head and sector.

But on most "sanely" partitioned disks they are the same:

Disk /dev/sda: 255 heads, 63 sectors, 36481 cylinders

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl      Start       Size ID
 1 00   1   1    0 254  63  850         63   13671252 0b
 2 80   0   1  851 254  63 1023   13671315  572395950 05
 3 00   0   0    0   0   0    0          0          0 00
 4 00   0   0    0   0   0    0          0          0 00
 5 00   1   1  851 254  63  972         63    1959867 83
 6 00   1   1  973 254  63 1023         63   31246362 83
 7 00 254  63 1023 254  63 1023         63  195318207 83
 8 00 254  63 1023 254  63 1023         63  343871262 83
                   ^^^  ^^

Which suggests another idea how to align a partition:
since there is no requirement on the partition *start*,
we don't have to start at head1,sector1 or head0,sector1

In the example above, 1st partition might be modified to start
at head1,sector2, IOW, LBA 64, thus making it 32k aligned.

As long as partition *ends* adhere to the convention
of being exactly at last_head,last_sector, nothing should break.
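
A small sketch for checking what alignment a given start LBA actually
provides, assuming 512-byte logical sectors (the function name is
illustrative):

```python
def alignment_bytes(lba, sector_size=512):
    """Largest power-of-two byte boundary that the given LBA sits on."""
    if lba == 0:
        raise ValueError("LBA 0 is aligned to every boundary")
    # lba & -lba isolates the lowest set bit of the LBA.
    return (lba & -lba) * sector_size


for lba in (63, 64, 8 * 63, 2048):
    print(lba, alignment_bytes(lba))
```

LBA 63 is only sector-aligned, LBA 64 is 32k aligned as stated above, a
start at head 8 (LBA 8*63) is exactly 4k aligned and no better, and LBA
2048 is aligned to 1 MiB.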
-- 
vda
--

From: Tejun Heo
Date: Tuesday, March 16, 2010 - 8:12 am

Hello,


C/H/S of 1023/254/63 is a special marker indicating the value there is
out-of-range.  It doesn't actually carry any information regarding the
geometry parameters other than that the matching LBA can't be
expressed within its range.  The end marker doesn't change according


That has almost nothing to do with compatibility.  Just let the
cylinder alignment go.  Anything remotely modern doesn't care about it
at all and anything older will puke way easier with custom geometry
massaging.  For those, we'll just have to stick with cylinder aligning
according to the BIOS supplied parameters.

Thanks.

-- 
tejun
--

From: Denys Vlasenko
Date: Tuesday, March 16, 2010 - 8:25 am

You misunderstood my ^^^ markers. I was trying to highlight
the whole columns of "end head" and "end sector", not the
last partition's 1023/254/63 values.

In the partition table like shown above it is obvious

If neither the start nor the end is aligned to cylinder's end
and disk has just one partition and it's bigger than 8G,
there is no way to determine geometry.

If everybody adopts the convention of ending the partitions
at the cylinder end, geometry can be trivially determined by
looking at partition end values. Sans "no of cylinders" value,

Then (some) bootloaders will stop working.

-- 
vda
--

From: Tejun Heo
Date: Tuesday, March 16, 2010 - 8:47 am

Hello,


Oh, if you have at least one partition contained under the CHS limit,
you can definitely determine the parameters.  You need to know two
params and there are two equations.  You don't even have to consider

But this is irrelevant because we don't and can't control everybody.
Actually, nobody can.  Code dealing with partition tables has
already been out there for a very long time and there's no way to
retroactively make them agree on anything.  The only reason why we
care about CHS values at all is backward compatibility.  Going
forward, we don't need them at all.

Thanks.

-- 
tejun
--

From: H. Peter Anvin
Date: Tuesday, March 16, 2010 - 11:48 pm

This is doubly false.

An MS-DOS partition table can partition at any boundary.  Some OSes 
(like some versions of MS-DOS) needed track alignment because their boot 
loaders did not support crossing track boundaries.

Second, the primary field in the (modern) MS-DOS partition table is an 
LBA field.  The CHS fields are largely historic and useless because of 
the 1024-cylinder limitation, and by only being 24 bits total.
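
To make the field layout concrete, here is a sketch that decodes one
16-byte MS-DOS partition table entry.  The function names are
illustrative; the sample bytes are built to match the first entry of the
partition listing quoted elsewhere in this thread (start 0/1/1, end
850/254/63, LBA start 63, size 13671252, type 0b).

```python
import struct


def unpack_chs(b):
    """Decode the 3-byte packed CHS field into (cyl, head, sector)."""
    head = b[0]
    sector = b[1] & 0x3F                 # low 6 bits
    cyl = ((b[1] & 0xC0) << 2) | b[2]    # 2 high bits + 8 low bits = 10 bits
    return cyl, head, sector


def parse_entry(raw):
    """Decode one 16-byte MS-DOS partition table entry into a dict."""
    if len(raw) != 16:
        raise ValueError("a partition entry is exactly 16 bytes")
    start_lba, size = struct.unpack_from("<II", raw, 8)  # 32-bit little endian
    return {
        "active": raw[0],
        "start_chs": unpack_chs(raw[1:4]),
        "type": raw[4],
        "end_chs": unpack_chs(raw[5:8]),
        "start_lba": start_lba,
        "size": size,
    }


# Largest byte offset reachable through CHS with 255 heads, 63 sectors.
CHS_LIMIT_BYTES = 1024 * 255 * 63 * 512

entry = bytes([0x00, 1, 1, 0, 0x0B, 254, 0xFF, 0x52]) + struct.pack("<II", 63, 13671252)
print(parse_entry(entry))
print(CHS_LIMIT_BYTES)
```

The cylinder number gets only 10 bits, which is where the 1024-cylinder
ceiling comes from, while the LBA start and size each get a full 32 bits.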

	-hpa
--

From: Thomas Chou
Date: Monday, March 15, 2010 - 11:27 pm

The key issue is not "just work", but "performance". When unaligned, the 
write performance can be lower than 50% of the expected rate.

- Thomas
--
