Linux: Reaching Maximum Speed With SATA

Submitted by Jeremy
on March 28, 2004 - 9:24pm

Jeff Garzik offered a patch to libata that increases the maximum size of the requests sent to Serial ATA hardware. He explains, "With this simple patch, the max request size goes from 128K to 32MB... so you can imagine this will definitely help performance. Throughput goes up. Interrupts go down. Fun for the whole family." This led to a lengthy and interesting discussion about whether this was the best course of action.

On one side of the debate, Nick Piggin [interview], Jens Axboe and Andrea Arcangeli argued that increasing the request size so significantly would actually have a negative impact by increasing latency. Nick explained, "I think 32MB is too much. You incur latency and lose scheduling grainularity. I bet returns start diminishing pretty quickly after 1MB or so." Jens Axboe agreed, and went on to argue that the driver is the component that knows the limitations of a device. He explained, "Take floppy.c for instance, I really don't want 1MB requests there, since that would take a minute to complete. And I might not want 1MB requests on my Super-ZXY storage, because that beast completes io easily at an iorate of 200MB/sec."

Jeff Garzik disagreed, arguing that the driver should only export the hardware maximums and that any lower limit is a policy decision to be made outside the driver, ultimately tunable by administrators. He explains:

"People shouldn't be tuning max_sectors at the source code level: that just embeds the policy decisions in the source code, and leads to constant fiddling with the driver to get things 'just right'. Over time, disks get faster and latency falls naturally. Thus the definition of 'just right' must be constantly tuned in the driver source code as time passes. I also wouldn't want to lock out any users who wanted to use SATA at full speed ;-)"


From: Jeff Garzik [email blocked]
To:  linux-ide
Subject: [PATCH] speed up SATA
Date: Sat, 27 Mar 2004 17:37:14 -0500


The "lba48" feature in ATA allows for addressing of sectors > 137GB, and 
also allows for transfers of up to 64K sector, instead of the 
traditional 256 sectors in older ATA.

libata simply limited all transfers to a 200 sectors (just under the 256 
sector limit).  This was mainly being careful, and making sure I had a 
solution that worked everywhere.  I also wanted to see how the iommu S/G 
stuff would shake out.

Things seem to be looking pretty good, so it's now time to turn on 
lba48-sized transfers.  Most SATA disks will be lba48 anyway, even the 
ones smaller than 137GB, for this and other reasons.

With this simple patch, the max request size goes from 128K to 32MB... 
so you can imagine this will definitely help performance.  Throughput 
goes up.  Interrupts go down.  Fun for the whole family.

The attached patch is for 2.6.x kernels only.  It should apply to 
2.6.5-rc2 or later, including my latest 2.6-libata patch on kernel.org. 
This patch should be pretty harmless, but you never know what could 
happen when you throw the throttle wide open.  Testing in -mm would be a 
good thing, for example :)

Volunteers are welcome to post a 2.4 backport of this patch to 
linux-ide@vger.kernel.org, and I'll merge it into my 2.4 libata queue.

Here's what dmesg looks like on my workstation.  Look for the "max 
request 32MB" message just after SCSI prints out the disk information.

libata version 1.02 loaded.
ata_piix version 1.02
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ata1: SATA max UDMA/133 cmd 0x24F0 ctl 0x280A bmdma 0x24D0 irq 169
ata2: SATA max UDMA/133 cmd 0x24F8 ctl 0x280E bmdma 0x24D8 irq 169
ata1: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4003 85:7c69 86:3e01 87:4003 
88:207f
ata1: dev 0 ATA, max UDMA/133, 488281250 sectors (lba48)
ata1: dev 0 configured for UDMA/133
scsi0 : ata_piix
ata2: SATA port has no device.
ata2: thread exiting
scsi1 : ata_piix
   Vendor: ATA       Model: Maxtor 6Y250M0    Rev: 1.02
   Type:   Direct-Access                      ANSI SCSI revision: 05
ata1: dev 0 max request 32MB (lba48)
SCSI device sda: 488281250 512-byte hdwr sectors (250000 MB)
SCSI device sda: drive cache: write through
  sda: sda1
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi generic sg0 at scsi0, channel 0, id 0, lun 0,  type 0



[patch  text/plain (990 bytes)]
===== drivers/scsi/libata-scsi.c 1.18 vs edited =====
--- 1.18/drivers/scsi/libata-scsi.c	Sat Mar 27 00:21:29 2004
+++ edited/drivers/scsi/libata-scsi.c	Sat Mar 27 16:04:39 2004
@@ -168,6 +168,23 @@
 	sdev->use_10_for_ms = 1;
 	blk_queue_max_phys_segments(sdev->request_queue, LIBATA_MAX_PRD);
 
+	if (sdev->id < ATA_MAX_DEVICES) {
+		struct ata_port *ap;
+		struct ata_device *dev;
+
+		ap = (struct ata_port *) &sdev->host->hostdata[0];
+		dev = &ap->device[sdev->id];
+
+		if (dev->flags & ATA_DFLAG_LBA48) {
+			sdev->host->max_sectors = 65534;
+			blk_queue_max_sectors(sdev->request_queue, 65534);
+			printk(KERN_INFO "ata%u: dev %u max request 32MB (lba48)\n",
+			       ap->id, sdev->id);
+		} else
+			printk(KERN_INFO "ata%u: dev %u max request 128K\n",
+			       ap->id, sdev->id);
+	}
+
 	return 0;	/* scsi layer doesn't check return value, sigh */
 }
 


From: Stefan Smietanowski [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 00:04:23 +0100

Hi Jeff.

What will happen when a PATA disk lies behind a Marvel(ous) bridge, as
in most SATA disks today?

Is large transfers mandatory in the LBA48 spec and is LBA48 really
mandatory in SATA?

And yes, I saw that the dmesg showed a Maxtor drive, but I'm uncertain
if that disk of yours has a Marvel chip on or not, since newer Maxtors
might (have) come out (already) without a Marvel chip, I just don't
know.

// Stefan

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sat, 27 Mar 2004 18:11:57 -0500

Stefan Smietanowski wrote:
> What will happen when a PATA disk lies behind a Marvel(ous) bridge, as
> in most SATA disks today?

Larger transfers work fine in PATA, too.

WRT bridges, it is generally the best idea to limit to UDMA/100 (udma),
but larger transfers are OK.

> Is large transfers mandatory in the LBA48 spec and is LBA48 really
> mandatory in SATA?

Yes and no, in that order :)

SATA doesn't mandate lba48, but it is highly unlikely that you will see
SATA disk without lba48.

Regardless, libata supports what the drive supports.  Older disks still
work just fine.

	Jeff

From: Bartlomiej Zolnierkiewicz [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 00:32:22 +0100

What about latency?

What about recently discussed PRD table "limit" of 256 entries?

AFAIR these are the reasons why IDE driver is currently
limiting max request size to 1024K on LBA48 disks.

> What will happen when a PATA disk lies behind a Marvel(ous) bridge, as
> in most SATA disks today?

Most modern PATA disks support LBA48 and IDE driver
has been using large transfers for some time. :-)

> Is large transfers mandatory in the LBA48 spec and is LBA48 really
> mandatory in SATA?

large transfers are part of LBA48 spec

> And yes, I saw that the dmesg showed a Maxtor drive, but I'm uncertain
> if that disk of yours has a Marvel chip on or not, since newer Maxtors
> might (have) come out (already) without a Marvel chip, I just don't
> know.

Regards,
Bartlomiej

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sat, 27 Mar 2004 18:36:12 -0500

Bartlomiej Zolnierkiewicz wrote:
>
> What about latency?
>
> What about recently discussed PRD table "limit" of 256 entries?
>
> AFAIR these are the reasons why IDE driver is currently
> limiting max request size to 1024K on LBA48 disks.

That's the main limitation on request size right now...  libata limits
S/G table entries to 128[1], so a perfectly aligned, fully merged
transfer will top out at 8MB.  You don't see that unless you're on a
totally quiet machine with tons of free, contiguous pages.  So in
practice it winds up being much smaller, the more loaded the system gets
(and pagecache gets fragmented).

Latency definitely changes for the default case, but remember that a lot
of that is writeback, or streaming writes.  Latency-sensitive
applications already know how to send small or no-wait I/Os, because
standard pagecache writeback latency is highly variable at best :)

	Jeff

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sat, 27 Mar 2004 18:40:26 -0500

Jeff Garzik wrote:
> That's the main limitation on request size right now...  libata limits
> S/G table entries to 128[1], so a perfectly aligned, fully merged
...

[1] because even though the block layer properly splits on segment
boundaries, pci_map_sg() may violate those boundaries (James B and
others are working on fixing this).  So... for right now the driver
must check the s/g entry boundaries after DMA mapping, and split them
(again) if necessary.  IDE does this in ide_build_dmatable().

From: Bartlomiej Zolnierkiewicz [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 01:13:58 +0100

You are right but small clarification is needed: code in ide_build_dmatable()
predates segment boundary support in block layer (IDE never relied on it).

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sat, 27 Mar 2004 19:08:44 -0500

Bartlomiej Zolnierkiewicz wrote:
>
> You are right but small clarification is needed: code in ide_build_dmatable()
> predates segment boundary support in block layer (IDE never relied on it).

Agreed...  I'm saying it's still needed.  When the iommu layer knows
about the boundaries it should respect, that code can be removed from
libata and drivers/ide, IMO...

But also double-check and make sure IDE driver supports the worst case,
by limiting to 128 PRD entries, not 256.

	Jeff

From: Nick Piggin [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 09:37:45 +1000

Jeff Garzik wrote:
>
> The "lba48" feature in ATA allows for addressing of sectors > 137GB, and
> also allows for transfers of up to 64K sector, instead of the
> traditional 256 sectors in older ATA.
>
> libata simply limited all transfers to a 200 sectors (just under the 256
> sector limit).  This was mainly being careful, and making sure I had a
> solution that worked everywhere.  I also wanted to see how the iommu S/G
> stuff would shake out.
>
> Things seem to be looking pretty good, so it's now time to turn on
> lba48-sized transfers.  Most SATA disks will be lba48 anyway, even the
> ones smaller than 137GB, for this and other reasons.
>
> With this simple patch, the max request size goes from 128K to 32MB...
> so you can imagine this will definitely help performance.  Throughput
> goes up.  Interrupts go down.  Fun for the whole family.
>

Hi Jeff,
I think 32MB is too much. You incur latency and lose
scheduling grainularity. I bet returns start diminishing
pretty quickly after 1MB or so.

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sat, 27 Mar 2004 18:44:10 -0500

Nick Piggin wrote:
> I think 32MB is too much. You incur latency and lose
> scheduling grainularity. I bet returns start diminishing
> pretty quickly after 1MB or so.

See my reply to Bart.

Also, it is not the driver's responsibility to do anything but export
the hardware maximums.

It's up to the sysadmin to choose a disk scheduling policy they like,
which implies that a _scheduler_, not each individual driver, should
place policy limitations on max_sectors.

	Jeff

From: Nick Piggin [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 09:47:54 +1000

Jeff Garzik wrote:
> Also, it is not the driver's responsibility to do anything but export
> the hardware maximums.
>
> It's up to the sysadmin to choose a disk scheduling policy they like,
> which implies that a _scheduler_, not each individual driver, should
> place policy limitations on max_sectors.
>

Yeah I suppose you're right there. In practice it doesn't
work that way though, does it?

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sat, 27 Mar 2004 18:59:43 -0500

Nick Piggin wrote:
>
> Yeah I suppose you're right there. In practice it doesn't
> work that way though, does it?

Not my problem <grin>

People shouldn't be tuning max_sectors at the source code level: that
just embeds the policy decisions in the source code, and leads to
constant fiddling with the driver to get things "just right".

Over time, disks get faster and latency falls naturally.  Thus the
definition of "just right" must be constantly tuned in the driver source
code as time passes.

I also wouldn't want to lock out any users who wanted to use SATA at
full speed ;-)

	Jeff

From: Jens Axboe [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 16:10:14 +0200

On Sat, Mar 27 2004, Jeff Garzik wrote:
> I also wouldn't want to lock out any users who wanted to use SATA at
> full speed ;-)

And full speed requires 32MB requests?

-- 
Jens Axboe

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 12:31:05 -0500

Jens Axboe wrote:
> On Sat, Mar 27 2004, Jeff Garzik wrote:
>
>>I also wouldn't want to lock out any users who wanted to use SATA at
>>full speed ;-)
>
>
> And full speed requires 32MB requests?

Full speed is the SATA driver supporting the hardware maximum.  The
block layer and general fragmentation limit things further from there.

	Jeff

From: Jens Axboe [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 19:35:08 +0200

On Sun, Mar 28 2004, Jeff Garzik wrote:
>
> Full speed is the SATA driver supporting the hardware maximum.  The

Come on Jeff, don't be such a slave to the hardware specifications. Just
because it's possible to send down 32MB requests doesn't necessarily
mean it's a super thing to do, nor that it automagically makes 'things
go faster'. The claim is that back-to-back 1MB requests are every bit as
fast as a 32MB request (especially if you have a small queue depth, in
that case there truly should be zero benefit to doing the bigger ones).
The cut-off point is likely even lower than 1MB, I'm just using that
figure as a value that is 'pretty big' yet doesn't incur too large
latencies just because of its size.

-- 
Jens Axboe

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 12:48:11 -0500

Jens Axboe wrote:
>
> Come on Jeff, don't be such a slave to the hardware specifications. Just
> because it's possible to send down 32MB requests doesn't necessarily
> mean it's a super thing to do, nor that it automagically makes 'things
> go faster'. The claim is that back-to-back 1MB requests are every bit as
> fast as a 32MB request (especially if you have a small queue depth, in
> that case there truly should be zero benefit to doing the bigger ones).
> The cut-off point is likely even lower than 1MB, I'm just using that
> figure as a value that is 'pretty big' yet doesn't incur too large
> latencies just because of its size.

For me this is a policy issue.

I agree that huge requst hurt latency.  I just disagree that the
_driver_ should artificially lower its maximums to fit a guess about
what the best request size should be.

If there needs to be an overall limit on per-size size, do it at the
block layer.  It's not scalable to hardcode that limit into every
driver.  That's not the driver's job.  The driver just exports the
hardware limits, nothing more.

A limit is fine.  I support that.  An artificial limit in the driver
is not.

	Jeff

From: Jens Axboe [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 19:54:36 +0200

On Sun, Mar 28 2004, Jeff Garzik wrote:
>
> For me this is a policy issue.
>
> I agree that huge requst hurt latency.  I just disagree that the
> _driver_ should artificially lower its maximums to fit a guess about
> what the best request size should be.
>
> If there needs to be an overall limit on per-size size, do it at the
> block layer.  It's not scalable to hardcode that limit into every
> driver.  That's not the driver's job.  The driver just exports the
> hardware limits, nothing more.
>
> A limit is fine.  I support that.  An artificial limit in the driver
> is not.

Sorry, but I cannot disagree more. You think an artificial limit at the
block layer is better than one imposed at the driver end, which actually
has a lot more of an understanding of what hardware it is driving? This
makes zero sense to me. Take floppy.c for instance, I really don't want
1MB requests there, since that would take a minute to complete. And I
might not want 1MB requests on my Super-ZXY storage, because that beast
completes io easily at an iorate of 200MB/sec.

So you want to put this _policy_ in the block layer, instead of in the
driver. That's an even worse decision if your reasoning is policy. The
only such limits I would want to put in, are those of the bio where
simply is best to keep that small and contained within a single page to
avoid higher order allocations to do io. Limits based on general sound
principles, not something that caters to some particular piece of
hardware. I absolutely refuse to put a global block layer 'optimal io
size' restriction in, since that is the ugliest of policies and without
having _any_ knowledge of what the hardware can do.

-- 
Jens Axboe

From: Jamie Lokier [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 19:08:09 +0100

Jens Axboe wrote:
> Sorry, but I cannot disagree more. You think an artificial limit at
> the block layer is better than one imposed at the driver end, which
> actually has a lot more of an understanding of what hardware it is
> driving? This makes zero sense to me. Take floppy.c for instance, I
> really don't want 1MB requests there, since that would take a minute
> to complete. And I might not want 1MB requests on my Super-ZXY
> storage, because that beast completes io easily at an iorate of
> 200MB/sec.

The driver doesn't know how fast the drive is either.

Without timing the drive, interface, and for different request sizes,
neither the block layer _nor_ the driver know a suitable size.

> I absolutely refuse to put a global block layer 'optimal io
> size' restriction in, since that is the ugliest of policies and
> without having _any_ knowledge of what the hardware can do.

But the driver doesn't have _any_ knowledge of what the I/O scheduler
wants.  1MB requests may be a cut-off above which there is negligable
throughput gain for SATA, but those requests may be _far_ too large
for a low-latency I/O scheduling requirement.

If we have a high-level latency scheduling constraint that userspace
should be able to issue a read and get the result within 50ms, or that
the average latency for reads should be <500ms, how does the SATA
driver limiting requests to 1MB help?  It depends on the attached drive.

The fundamental problem here is that neither the driver nor the block
layer have all the information needed to select optimal or maximum
request sizes.  That can only be found by timing the device, perhaps
every time a request is made, and adjusting the I/O scheduling and
request splitting parameters according to that timing and high-level
latency requirements.

From that point of view, the generic block layer is exactly the right
place to determine those parameters, because the calculation is not
device-specific.

-- Jamie

From: Jens Axboe [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 20:15:03 +0200

On Sun, Mar 28 2004, Jamie Lokier wrote:
>
> The driver doesn't know how fast the drive is either.
>
> Without timing the drive, interface, and for different request sizes,
> neither the block layer _nor_ the driver know a suitable size.

The driver may not know exactly, but it does know a ball park figure.
You know if you are driving floppy (sucky transfer and latency), hard
drive, cdrom (decent transfer, sucky seeks), etc.

> > I absolutely refuse to put a global block layer 'optimal io
> > size' restriction in, since that is the ugliest of policies and
> > without having _any_ knowledge of what the hardware can do.
>
> But the driver doesn't have _any_ knowledge of what the I/O scheduler
> wants.  1MB requests may be a cut-off above which there is negligable

It's not what the io scheduler wants, it's what you can provide at a
reasonable latency. You cannot preempt that unit of io.

> throughput gain for SATA, but those requests may be _far_ too large
> for a low-latency I/O scheduling requirement.
>
> If we have a high-level latency scheduling constraint that userspace
> should be able to issue a read and get the result within 50ms, or that
> the average latency for reads should be <500ms, how does the SATA
> driver limiting requests to 1MB help?  It depends on the attached drive.

Yep it sure does, but try and find a drive attached to a SATA controller
that cannot do 40MiB/sec (or something like that). Storage doesn't move
_that_ fast, you can keep up.

> The fundamental problem here is that neither the driver nor the block
> layer have all the information needed to select optimal or maximum
> request sizes.  That can only be found by timing the device, perhaps
> every time a request is made, and adjusting the I/O scheduling and
> request splitting parameters according to that timing and high-level
> latency requirements.

I agree with that, completely.

And I still maintain that putting the restriction blindly into the hands
of the block layer is not a good idea. The driver may not know
completely what storage is attached to it, but it can peek and poke and
get a general idea. As it stands right now, the block layer has _zero_
knowledge. Unless you start adding timing and imposing max request size
based on the latencies. If you do that, then I would be quite happy with
changing ->max_sectors to be the hardware limit.

> From that point of view, the generic block layer is exactly the right
> place to determine those parameters, because the calculation is not
> device-specific.

If you start adding that type of code. That's a different discussion
than this one, though, and it would raise a new set of problems (AS io
scheduler already does some of this privately).

-- 
Jens Axboe

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 13:55:43 -0500

Jens Axboe wrote:
> On Sun, Mar 28 2004, Jamie Lokier wrote:
>
>>Jens Axboe wrote:
>>
>>The driver doesn't know how fast the drive is either.
>>
>>Without timing the drive, interface, and for different request sizes,
>>neither the block layer _nor_ the driver know a suitable size.

Nod, this is pretty much my objection to hardcoding an artificial limit
in the driver...

> The driver may not know exactly, but it does know a ball park figure.
> You know if you are driving floppy (sucky transfer and latency), hard
> drive, cdrom (decent transfer, sucky seeks), etc.

Agreed.  Really we have two types of information:
* the device's hard limit
* the default limit that should be applied to that class of devices

I would much rather do something like

	blk_queue_set_class(q, CLASS_DISK)

and have that default per-class policy

	switch (class) {
	case CLASS_DISK:
		q->max_sectors = min(q->max_sectors, CLASS_DISK_MAX_SECTORS);
	...

than hardcode the limit in the driver.  That's easy and quick.  That's a
minimal solution that gives me what I want -- don't hardcode generic
limits in the driver -- while IMO giving you what you want, a sane limit
in an easy way.

Right now we are hardcoding the same per-class limits into each floppy
driver, each disk driver, etc.  At the very least devices that act the
same way should all be using the same tunable, whether it's a
compile-time tunable (CLASS_xxx_MAX_SECTORS) or a runtime tunable.

Long term, the IO scheduler and the VM should really be figuring out the
best request size, from zero to <hardware limit>.

	Jeff

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 14:06:34 -0500

Jens Axboe wrote:
> Yep it sure does, but try and find a drive attached to a SATA controller
> that cannot do 40MiB/sec (or something like that). Storage doesn't move
> _that_ fast, you can keep up.

Nanosecond latencies and disturbingly high throughput are already
possibly today :)  Consider the battery-backed RAM gadgets that present
themselves as ATA devices, or nbd over 10gige network.

In fact I'm about to strip down drivers/scsi/sata_promise.c to a driver
that just talks to the DIMM, and another else: drivers/block/pdc_mem.c.
At that point you're really just looking at the PCI bus and RAM
limits...

	Jeff

From: William Lee Irwin III [email blocked]
To: Jens Axboe [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 10:12:23 -0800

On Sun, Mar 28, 2004 at 07:54:36PM +0200, Jens Axboe wrote:
> Sorry, but I cannot disagree more. You think an artificial limit at the
> block layer is better than one imposed at the driver end, which actually
> has a lot more of an understanding of what hardware it is driving? This
> makes zero sense to me. Take floppy.c for instance, I really don't want
> 1MB requests there, since that would take a minute to complete. And I
> might not want 1MB requests on my Super-ZXY storage, because that beast
> completes io easily at an iorate of 200MB/sec.
> So you want to put this _policy_ in the block layer, instead of in the
> driver. That's an even worse decision if your reasoning is policy. The
> only such limits I would want to put in, are those of the bio where
> simply is best to keep that small and contained within a single page to
> avoid higher order allocations to do io. Limits based on general sound
> principles, not something that caters to some particular piece of
> hardware. I absolutely refuse to put a global block layer 'optimal io
> size' restriction in, since that is the ugliest of policies and without
> having _any_ knowledge of what the hardware can do.

How about per-device policies and driver hints wrt. optimal io?

-- wli

From: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 20:30:11 +0200

On Sunday 28 of March 2004 20:12, William Lee Irwin III wrote:
>
> How about per-device policies and driver hints wrt. optimal io?

Yep, user-tunable per-device policies with sane driver defaults.

From: Jens Axboe [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 20:30:11 +0200

On Sun, Mar 28 2004, Bartlomiej Zolnierkiewicz wrote:
>
> Yep, user-tunable per-device policies with sane driver defaults.

BTW, these are trivial to expose through sysfs as their as inside the
queue already.

Making something user tunable is usually not the best idea, if you can
deduct these things automagically instead. So whether this is the best
idea, depends on which way you want to go.

-- 
Jens Axboe

From: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 20:45:07 +0200

On Sunday 28 of March 2004 20:30, Jens Axboe wrote:
>
> BTW, these are trivial to expose through sysfs as their as inside the
> queue already.

Yep, yep.

> Making something user tunable is usually not the best idea, if you can
> deduct these things automagically instead. So whether this is the best
> idea, depends on which way you want to go.

I think it's the best idea for now, long-term we are better with automagic.

Bartlomiej

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 13:59:51 -0500

Bartlomiej Zolnierkiewicz wrote:
> On Sunday 28 of March 2004 20:30, Jens Axboe wrote:
>>Making something user tunable is usually not the best idea, if you can
>>deduct these things automagically instead. So whether this is the best
>>idea, depends on which way you want to go.
>
>
> I think it's the best idea for now, long-term we are better with automagic.

Mostly agreed:

Like I mentioned in the last message, the IO scheduler and the VM should
really just figure out request size and queue depth and such based on
observation of device throughput and latency.  So I agree w/ automagic.

But the sysadmin should also be allowed to say "I don't care about
latency" if he has gobs and gobs of memory and knows his configuration
well.

I like generic tunables such as "laptop mode" or "low latency" or "high
throughput".  These sorts of tunables should affect the "automagic"
calculations.

	Jeff

From: Andrew Morton [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 12:32:40 -0800

Jeff Garzik [email blocked] wrote:
>
> I like generic tunables such as "laptop mode" or "low latency" or "high
> throughput".  These sorts of tunables should affect the "automagic"
> calculations.

Not sure.  Things like "low latency" and "high throughput" may need other
things, such as "seek latency" and "bandwidth" as _inputs_, not as outputs.

Such device parameters should have reasonable defaults, and use a userspace
app which runs a quick seek latency and bandwidth test at mount-time,
poking the results into sysfs.

From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 15:45:04 -0500

Andrew Morton wrote:
> Jeff Garzik [email blocked] wrote:
>
>> I like generic tunables such as "laptop mode" or "low latency" or "high
>> throughput".  These sorts of tunables should affect the "automagic"
>> calculations.
>
>
> Not sure.  Things like "low latency" and "high throughput" may need other
> things, such as "seek latency" and "bandwidth" as _inputs_, not as outputs.

I should probably better define the hypotheticals :)  I think of "laptop
mode" or "low latency versus high throughput" more as high level binary
flags, influencing widely varying things like from an ATA disk's "low
noise versus high performance" tunable to the IO scheduler's deadlines.

> Such device parameters should have reasonable defaults, and use a userspace
> app which runs a quick seek latency and bandwidth test at mount-time,
> poking the results into sysfs.

Certainly...

	Jeff

From: Andrea Arcangeli [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Mon, 29 Mar 2004 02:55:02 +0200

On Sun, Mar 28, 2004 at 01:59:51PM -0500, Jeff Garzik wrote:
> Bartlomiej Zolnierkiewicz wrote:
> >I think it's the best idea for now, long-term we are better with automagic.
>
>
> Mostly agreed:
>
> Like I mentioned in the last message, the IO scheduler and the VM should

this is not an I/O scheduler or VM issue.

the max size of a request is something that should be set internally to
the blkdev layer (at a lower level than the I/O scheduler or the VM
layer).

The point is that if you run read contigously from disk with a 1M or 32M
request size, the wall time speed difference will be maybe 0.01% or so.
Running 100 irqs per second or 3 irq per second doesn't make any
measurable difference. Same goes for keeping the I/O pipeline full, 1M
is more than enough to go at the speed of the storage with minimal cpu
overhead. we waste 900 irqs per second just in the timer irq and another
900 irqs per second per-cpu in the per-cpu local interrupts in smp.

In 2.4 reaching 512k DMA units that helped a lot, but going past 512k
didn't help in my measurements. 1M maybe these days is needed (as Jens
suggested) but >1M still sounds overkill and I completely agree with
Jens about that.

If one day things will change and the harddisk will require 32M large
DMA transactions to keep up with the speed of the disk, the thing should
be still solved during disk discovery inside the blkdev layer. The
"automagic" suggestions discussed by Jamie and Jens should be just
benchmarks internal to the blkdev layer, trying to read contigously
first with 1M then 2M then 4M etc.. until the speed difference goes
below 1% or whatever similar "autotune" algorithm.

But definitely this is not an I/O scheduler or VM issue, it's all about
discovering the minimal DMA transaction size that provides peak bulk I/O
performance for a certain device.

The smaller the size, the better the latencies and the less ram will be
pinned at the same time (i.e. think a 64M machine writing at 32M chunks
at time).

Of course if we'll ever deal with hardware where 32M requests makes a
difference, then we may have to add overrides to the I/O scheduler to
lower the max_requests (i.e. like my obsolete max_bomb_segments did).
But I expect that by default the contigous I/O will use the max_sector
choosen by the blkdev layer (not choosen by VM or I/O scheduler) to
guarantee the best bulk I/O performance as usual (the I/O scheduler
option would be just an optional override).

the max_sectors is just about using a sane DMA transaction size, good
enough to run at disk-speed without measurable cpu overhead, but without
being too big so that it provides sane latencies. Overkill huge DMA
transactions might even stall the cpu when accessing the mem bus (though
I'm not an hardware guru so this is just a guess). So far there was no
need to autotune it, and settings like 512k were optimal.

Don't take me wrong, I find extremely great that you now can raise the
IDE request size to a value like 512k, the 128k limit was the ugliest
thing of IDE ever, but you provided zero evidence that going past 512k
is beneficial at all, and your bootup log showing 32M is all but
exciting, I'd be a lot more excited to see 512k there.

I expect that the boost from 128k to 512k is very significant, but I
expect that from 512k to 32M there will be just a total waste of latency
with zero performance gain in throughput. So unless you measure any
speed difference from 512k to 32M I recommend to set it to 512k for the
short term like most other driver does for the same reasons.


From: Jeff Garzik [email blocked]
Subject: Re: [PATCH] speed up SATA
Date: Sun, 28 Mar 2004 23:02:43 -0500

Andrea Arcangeli wrote:
>
> this is not an I/O scheduler or VM issue.

This involves the interaction of three: blkdev layer, IO scheduler, and VM.

VM: initiates most of the writeback, and is often the main initiator of large requests. The VM thresholds also serve to keep request size manageable. See e.g. http://marc.theaimsgroup.com/?l=linux-kernel&m=108043321326801&w=2

IO scheduler: the place to make the decision about whether the request latency is meeting expectations, etc. It should be straightforward to use a windowing algorithm to slowly increase the request size until (a) latency limits are reached, (b) hardware limits are reached, or (c) VM thresholds are reached.

Ultimately there must be some -global- management of I/O, otherwise VM cannot survive, e.g. 128k requests on 1000 disks :)

> the max size of a request is something that should be set internally to
> the blkdev layer (at a lower level than the I/O scheduler or the VM
> layer).

Yes, I agree. My point is there are two maximums:
1) the hardware limit
2) the limit that "makes sense", e.g. 512k or 1M for most

The driver should only care about #1, and should be "told" #2.

A very, very, very minimal implementation could be this:

--- 1.138/include/linux/blkdev.h Fri Mar 12 04:33:07 2004
+++ edited/include/linux/blkdev.h Sun Mar 28 22:44:15 2004
@@ -607,6 +607,24 @@
 extern void drive_stat_acct(struct request *, int, int);
+#define BLK_DISK_MAX_SECTORS 2048
+#define BLK_FLOPPY_MAX_SECTORS 64

Hardcoding such a maximum in the driver is inflexible and IMO incorrect.

> If one day things will change and the harddisk will require 32M large
> DMA transactions to keep up with the speed of the disk, the thing should
> be still solved during disk discovery inside the blkdev layer.

The 32M is probably too large, but 1M is probably too small for:
a RAID array with 33 disks, that presents itself as a single SATA disk.
solid-state storage: battery-backed RAM.

These things like bigger requests, and were designed to solve a lot of the latency problems in hardware.

> "automagic" suggestions discussed by Jamie and Jens should be just
> benchmarks internal to the blkdev layer, trying to read contiguously
> first with 1M then 2M then 4M etc.. until the speed difference goes
> below 1% or whatever similar "autotune" algorithm.

Yes, agreed.

My main goal is to -not- worry about this in the low-level driver. If you and Jens think 1M requests are maximum for disks, then put that in the _blkdev_ layer not my driver :)

Long term, I would like to see something like

--- 1.138/include/linux/blkdev.h Fri Mar 12 04:33:07 2004
+++ edited/include/linux/blkdev.h Sun Mar 28 23:01:42 2004
@@ -337,7 +337,8 @@
  */
  unsigned long nr_requests; /* Max # of requests */
- unsigned short max_sectors;
+ unsigned short max_sectors; /* blk layer-chosen */
+ unsigned short max_hw_sectors; /* hardware limit */
  unsigned short max_phys_segments;
  unsigned short max_hw_segments;
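
Jeff's split between the two maximums can be illustrated with a small C sketch. It is a simplified stand-in rather than the actual block-layer code: `struct queue_limits_sketch` and `sketch_set_limits` are hypothetical names, and 512-byte sectors are assumed (so 2048 sectors is 1M). Only the two field names and the two `#define` values are taken from the patches quoted in the email.

```c
#include <assert.h>

/* Values from the first quoted patch; with 512-byte sectors (assumed)
 * these correspond to 1M for disks and 32k for floppies. */
#define BLK_DISK_MAX_SECTORS   2048
#define BLK_FLOPPY_MAX_SECTORS 64

/* Hypothetical, cut-down stand-in for the request queue fields. */
struct queue_limits_sketch {
    unsigned short max_sectors;    /* blk layer-chosen */
    unsigned short max_hw_sectors; /* hardware limit */
};

/* The driver reports only what the hardware can do (hw_limit). The
 * block layer supplies the limit that "makes sense" (policy_limit).
 * The effective request size is the smaller of the two. */
void sketch_set_limits(struct queue_limits_sketch *q,
                       unsigned short hw_limit,
                       unsigned short policy_limit)
{
    assert(q != 0);
    q->max_hw_sectors = hw_limit;
    q->max_sectors = hw_limit < policy_limit ? hw_limit : policy_limit;
}
```

With this shape, a driver never hardcodes a policy value: raising or lowering the default later only touches the block-layer side, while `max_hw_sectors` keeps recording what the device could accept.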

RAID

Anonymous
on March 29, 2004 - 1:47pm

Well, I'm quite happy to see potential performance increases with SATA. Particularly since I'm using SATA on my own system.

Currently, with the ICH5R RAID controller that I finally got working under 2.4, benchmarks give me 70-80 MB per second for all operations (I think I used bonnie++). This is with 120GB SATA Western Digital drives with 8 megs of cache. To have even more performance is certainly attractive.

However, I'm still worried about SATA RAID support with 2.6. My controller is still not supported by 2.6 (I did some patching to get it working with 2.4). Some things are quite jerky under 2.4 that shouldn't be, such as playing DVDs, which should not be a problem with 512MB of RAM and a 3.2 GHz P4. This was not the case with 2.6 (which I used on the same system before I got my RAID working). So I really want to go back to 2.6. Apparently my controller is supposed to eventually be supported through "device mapper" (I think this is because my controller is really just software RAID), though I don't really know any more than that. If someone could provide more details on that, I'd be grateful.

(S)ATA-RAID under 2.6

farnz
on March 30, 2004 - 8:43am

> Apparently my controller is supposed to eventually be supported through "device mapper" (I think this is because my controller is really just software RAID), though I don't really know any more than that. If someone could provide more details on that, I'd be grateful.

It sounds like the ICH5R works in much the same way as the Highpoint series of RAID controllers: there is some on-disk data that tells the OS how the data is laid out on the disks, and the OS is expected to stripe/mirror as needed. Under 2.4, there was the ATA-RAID driver to do this; under 2.6, device mapper is supposed to handle it.
Wilfried Weissmann has written an initial HPT-RAID driver for EVMS, which has some problems; once it's fixed and working, it can be extended to other systems.

Alternatively, if you are lucky enough to be running a pure Linux system, it should be possible to run the drives as single hard disks; Linux's software RAID will then handle RAID for you.

unfortunately

Hiryu
on March 30, 2004 - 2:16pm

Unfortunately, I'm not running a purely Linux system. I run Windows for gaming, and WineX hasn't exactly erased my need for Windows yet.

So hopefully things will work out soon and I'll be able to run 2.6 on my raid volume.
