Some NCQ numbers...

To: Kernel Mailing List <linux-kernel@...>
Cc: <linux-ide@...>, <linux-scsi@...>
Date: Thursday, June 28, 2007 - 6:51 am

[Off-topic notice: I first posted some speed-test results on
the linux-ide mailing list, as a demonstration of how
[NT]CQ can help. Later, someone became curious and
forwarded that email to lkml, asking for more details.
Since then I have become more curious as well, and
decided to look at it more closely.
Here it goes...]

The test drive is a Seagate Barracuda ST3250620AS "desktop" drive:
250GB, 16MB cache, 7200RPM.

The Seagate Barracuda ES drive, ST3250620NS, shows the same results.

I guess the results will be pretty similar for larger Barracudas from
Seagate. The only difference between the 250GB ones and the larger ones
is the number of platters and heads.

The test machine was using the mptsas driver for the following card:
SCSI storage controller: LSI Logic / Symbios Logic SAS1064E PCI-Express Fusion-MPT SAS (rev 02)

Pretty similar results were obtained on an AHCI controller:
SATA controller: Intel Corporation 82801GR/GH (ICH7 Family) Serial ATA Storage Controller AHCI (rev 01)
on another machine.

The following tables show read/write speed in megabytes/sec
for different parameters.

All I/O was performed directly on the whole drive, i.e.
open("/dev/sda", O_RDWR|O_DIRECT).
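
For readers who want to try something similar, here is a minimal sketch of
random O_DIRECT reads from a block device. It is not the test program used
for the numbers in this thread. The function names are hypothetical, and the
device is opened read-only (the original tests used O_RDWR) so the sketch
cannot destroy data. O_DIRECT needs an aligned buffer and aligned offsets,
so an anonymous mmap is used as the buffer and every offset is a multiple of
the block size.

```python
import mmap
import os
import random


def aligned_offsets(device_size, block_size, count, seed=0):
    """Return `count` random byte offsets, each a multiple of block_size."""
    blocks = device_size // block_size
    rng = random.Random(seed)
    return [rng.randrange(blocks) * block_size for _ in range(count)]


def direct_random_read(path, block_size, count):
    """Read `count` random blocks from `path` with O_DIRECT; return bytes read."""
    # Read-only on purpose: this is a sketch, not the original O_RDWR test.
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # An anonymous mapping is page-aligned, which O_DIRECT requires.
        with mmap.mmap(-1, block_size) as buf:
            size = os.lseek(fd, 0, os.SEEK_END)
            total = 0
            for offset in aligned_offsets(size, block_size, count):
                total += os.preadv(fd, [buf], offset)
            return total
    finally:
        os.close(fd)
```

Something like direct_random_read("/dev/sda", 4096, 1000), run as root and
timed, would correspond to one thread of the 4k rndRd test.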

Five kinds of tests were performed: linear read (linRd),
random read (rndRd), linear write (linWr), random write (rndWr),
and a combination of random read and write (rndR/W).

Each test was run with 1 (2 in the case of r/w), 4 and 32 threads
doing I/O in parallel (Trd column). Linear reads and writes were
performed with a single thread only.

Each test was run in two modes: with command queuing enabled (qena) and
disabled (qdis), by setting /sys/block/sda/device/queue_depth to
31 (the default) and 1 respectively.
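
Switching between the two modes can be scripted. The following is a small
sketch, assuming the sysfs path quoted above; the helper names and the
sysfs_root parameter are hypothetical (the latter only makes the helpers
testable without a real device). Writing the file needs root.

```python
from pathlib import Path


def queue_depth_path(disk, sysfs_root="/sys"):
    """Path of the queue_depth attribute for a disk such as 'sda'."""
    return Path(sysfs_root) / "block" / disk / "device" / "queue_depth"


def get_queue_depth(disk, sysfs_root="/sys"):
    """Read the current queue depth as an integer."""
    return int(queue_depth_path(disk, sysfs_root).read_text().strip())


def set_queue_depth(disk, depth, sysfs_root="/sys"):
    """Write a new queue depth: 1 disables queuing, 31 restores the default."""
    queue_depth_path(disk, sysfs_root).write_text(f"{depth}\n")
```

With these, set_queue_depth("sda", 1) gives the qdis mode and
set_queue_depth("sda", 31) the qena mode.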

Finally, each set of tests was performed for different block sizes:
4, 8, 16, 32, 128 and 1024 kb (1 kb = 1024 bytes).

First, tests with write cache enabled (factory default settings for the
drives in question):

BlkSz Trd linRd rndR...

To: Michael Tokarev <mjt@...>
Cc: Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Tuesday, July 3, 2007 - 4:19 am

And which elevator?

--
tejun
-

To: Tejun Heo <htejun@...>
Cc: Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Tuesday, July 3, 2007 - 4:29 pm

Well, it looks like the results do not depend on the
elevator. Originally I tried with deadline, and I just
re-ran the test with noop (hence the long delay in
answering). Changing the Linux elevator changes almost
nothing in the results, modulo some random "fluctuations".
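
To double-check which elevator is active during a run, the scheduler
attribute in sysfs can be read. This is a sketch, assuming the usual
/sys/block/<disk>/queue/scheduler file, where the active elevator is shown
in square brackets; the function name is hypothetical.

```python
def parse_scheduler(text):
    """Split the scheduler attribute into (active, available)."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # the bracketed entry is the active elevator
            active = name
        available.append(name)
    return active, available


if __name__ == "__main__":
    # The disk name is an assumption; adjust it to the device under test.
    with open("/sys/block/sda/queue/scheduler") as f:
        print(parse_scheduler(f.read()))
```

Writing an elevator name (for example noop) to the same file switches the
scheduler for that disk.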

In any case, NCQ - at least in this drive - just does
not work. Linux with its I/O elevator may help to
speed things up a bit, but the disk does nothing in
this area. NCQ doesn't slow things down either - it
just does not work.

The same goes for the ST3250620NS "enterprise" drives.

By the way, Seagate announced the Barracuda ES 2 series
(in the 500..1200GB range, if memory serves) - maybe with
those, NCQ will work better?

Or maybe it's libata which does not implement NCQ
"properly"? (As I showed before, with almost all
good old SCSI drives TCQ helps a lot with multiple
I/O threads: a 2x difference and more.)

/mjt
-

To: Michael Tokarev <mjt@...>
Cc: Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Tuesday, July 3, 2007 - 9:19 pm

Hello,

Well, what the driver does is minimal. It just passes all the
commands through to the hard drive. After all, NCQ/TCQ gives the hard
drive more responsibility regarding request scheduling.

--
tejun
-

To: Tejun Heo <htejun@...>
Cc: Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Wednesday, July 4, 2007 - 5:43 am

Here are actual results - the tests were still running when
I replied yesterday.

Again, this is the Seagate ST3250620AS "desktop" drive, 7200RPM,
16MB cache, 250GB capacity. The tests were performed with
queue depth = 64 (on mptsas), with the drive write cache turned
off.

noop scheduler:

BlkSz  Trd  linRd  rndRd  linWr  rndWr     rndR/W
4k       1   12.8    0.3    0.4    0.3   0.1/ 0.1
         4           0.3           0.3   0.1/ 0.1
        32           0.3           0.3   0.1/ 0.1
8k       1   24.6    0.6    0.9    0.6   0.3/ 0.3
         4           0.6           0.6   0.3/ 0.3
        32           0.6           0.6   0.3/ 0.3
16k      1   41.3    1.2    1.8    1.1   0.6/ 0.6
         4           1.2           1.1   0.6/ 0.6
        32           1.2           1.1   0.6/ 0.6
32k      1   58.4    2.2    3.5    2.1   1.1/ 1.1
         4           2.3           2.1   1.1/ 1.1
        32           2.3           2.1   1.1/ 1.1
128k     1   80.4    8.1   12.5    7.2   3.8/ 3.8
         4           8.1           7.2   3.8/ 3.8
        32           8.1           7.2   3.8/ 3.8
1024k    1   80.5   33.9   33.8   24.5  14.3/14.3
         4          34.1          24.6  14.3/14.2
        32          34.2          24.6  14.4/14.2

deadline scheduler:

BlkSz  Trd  linRd  rndRd  linWr  rndWr     rndR/W
4k       1   12.8    0.3    0.4    0.3   0.1/ 0.1
         4           0.3           0.3   0.1/ 0.1
        32           0.3           0.3   0.1/ 0.1
8k       1   24.5    0.6    0.9    0.6   0.3/ 0.3
         4           0.6           0.6   0.3/ 0.3
        32           0.6           0.6   0.3/ 0.3
16k      1   41.3    1.2    1.8    1.1   0.6/ 0.6
         4           1.2           1.1   0.6/ 0.6
        32           1.2           1.1   0.6/ 0.6
32k      1   57.7    2.3    3.4    2.1   1.1/ 1.1
         4           2.3           2.1   1.1/ 1.1
        32           2.3           2.1   1.1/ 1.1
128k     1   79.4    8.1   12.5    7.2   3.8/ 3.8
         4           8.1           7.3   3.8/ 3.8
        32           8.2           7.3   3.9/ 3.8
1024k    1   79.4   33.7   33.8   24.5  14.2/14.2
         4          33.9          24.6  14.3/14.2
        32          33.4          24....
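
The MB/s figures are easier to compare across block sizes when converted to
requests per second. The sketch below does that for the rndRd column of the
noop table (32 threads). It assumes 1 MB = 1024 kb, which the post does not
state explicitly, and the function name is hypothetical.

```python
def iops(mb_per_sec, block_kb):
    """Convert throughput to requests per second (assumes 1 MB = 1024 kb)."""
    return mb_per_sec * 1024 / block_kb


# rndRd, noop scheduler, 32 threads: block size in kb -> MB/s (from the table)
rnd_rd_32_threads = {16: 1.2, 32: 2.3, 128: 8.1, 1024: 34.2}

for block_kb, mb in rnd_rd_32_threads.items():
    print(f"{block_kb:>5}k  {mb:5.1f} MB/s  ~{iops(mb, block_kb):.0f} requests/s")
```

Under that assumption, the blocks up to 128k all come out at roughly 65 to
77 requests per second, and the 1-thread rows give nearly the same figures,
which is another way of seeing that extra outstanding requests buy nothing
on this drive.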

To: Michael Tokarev <mjt@...>
Cc: Tejun Heo <htejun@...>, Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Wednesday, July 4, 2007 - 6:22 am

I found the AS scheduler to be the best for single-user performance.

You want speed? Use AS.

http://home.comcast.net/~jpiszcz/sched/cfq_vs_as_vs_deadline_vs_noop.html

-

To: Justin Piszcz <jpiszcz@...>
Cc: Michael Tokarev <mjt@...>, Tejun Heo <htejun@...>, Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Monday, July 9, 2007 - 8:26 am

Hmm, I find your data very weak for such a conclusion. The value of the
test itself notwithstanding, AS seems to be a lot faster for sequential
output for some reason, yet slower for everything else. Which is odd:
deadline should always run at the same speed for writeout as AS. The
only real difference should be in sequential and random reads.

So allow me to call your results questionable. It also looks like bonnie
(some version) output, I never found bonnie to provide good and
repeatable numbers. tiotest is much better, or (of course) fio.

--
Jens Axboe

-

To: Michael Tokarev <mjt@...>
Cc: Tejun Heo <htejun@...>, Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Wednesday, July 4, 2007 - 6:33 am

Does not include noop-- tested the main three though, renamed :)

http://home.comcast.net/~jpiszcz/sched/cfq_vs_as_vs_deadline.html

And for the archives:

p34-cfq,15696M,77114.3,99,311683,55.3333,184947,38.6667,79842.7,99,524065,41.3333,634.033,0.333333,16:100000:16/64,1043.33,8.33333,4419.33,11.6667,2942,17.3333,1178,10.3333,4192.67,12.3333,2619.33,19
p34-as,15696M,76202.3,99,443103,85,189716,34.6667,79552,99,507271,39.6667,607.067,0,16:100000:16/64,1153,10,13434,36,2769.67,16.3333,1201.67,10.6667,3951.33,12,2665.67,19
p34-deadline,15696M,76933.3,98.6667,386852,72,183016,29.6667,79530.7,99,512082,39.6667,678.567,0,16:100000:16/64,1230.33,10.3333,12349,32.3333,2945,17.3333,1258,11,8183,22.3333,2867,20.3333
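
The three lines above are hard to read by eye. Here is a sketch that pulls
out the sequential block write and read rates, assuming the lines are
bonnie++ 1.03-style CSV (this is a guess from the field count; the post does
not say which tool or version produced them). The column names are that
assumed layout, not something stated in this thread.

```python
import csv

# Assumed meaning of the first 14 fields (bonnie++ 1.03 CSV layout).
COLUMNS = ["name", "size", "putc", "putc_cpu", "put_block", "put_block_cpu",
           "rewrite", "rewrite_cpu", "getc", "getc_cpu", "get_block",
           "get_block_cpu", "seeks", "seeks_cpu"]

LINES = [
    "p34-cfq,15696M,77114.3,99,311683,55.3333,184947,38.6667,79842.7,99,524065,41.3333,634.033,0.333333,16:100000:16/64,1043.33,8.33333,4419.33,11.6667,2942,17.3333,1178,10.3333,4192.67,12.3333,2619.33,19",
    "p34-as,15696M,76202.3,99,443103,85,189716,34.6667,79552,99,507271,39.6667,607.067,0,16:100000:16/64,1153,10,13434,36,2769.67,16.3333,1201.67,10.6667,3951.33,12,2665.67,19",
    "p34-deadline,15696M,76933.3,98.6667,386852,72,183016,29.6667,79530.7,99,512082,39.6667,678.567,0,16:100000:16/64,1230.33,10.3333,12349,32.3333,2945,17.3333,1258,11,8183,22.3333,2867,20.3333",
]


def parse(lines):
    """Map each run name to its numeric fields (first 14 columns only)."""
    rows = {}
    for fields in csv.reader(lines):
        record = dict(zip(COLUMNS, fields))  # zip drops the trailing fields
        rows[record["name"]] = {k: float(v) for k, v in record.items()
                                if k not in ("name", "size")}
    return rows


results = parse(LINES)
for name, r in results.items():
    print(f"{name:13} write {r['put_block']:>8.0f}  read {r['get_block']:>8.0f}")
```

If the assumed layout is right, the fifth field is the sequential block
output rate, which is where AS stands out from the other two.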

Justin.
-

To: Justin Piszcz <jpiszcz@...>
Cc: Michael Tokarev <mjt@...>, Tejun Heo <htejun@...>, Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Thursday, July 5, 2007 - 3:00 pm

I looked at these before. Did you really run with a chunk size of just
under 16GB, or does "15696M" have some non-obvious meaning?

--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
-

To: Bill Davidsen <davidsen@...>
Cc: Michael Tokarev <mjt@...>, Tejun Heo <htejun@...>, Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Monday, July 9, 2007 - 7:07 am

It says to use double your RAM, your RAM is 7848, so that is why I use
15696M :)

I did some tests recently, and it appears JFS is 20-60MB/s faster for
sequential reads/writes/re-writes. However, it does not have a defrag
tool other than defragfs, which is not included in Debian, and people
(on Google) say not to use it, so I am not sure I want to go there.

Justin.
-

To: Michael Tokarev <mjt@...>
Cc: Tejun Heo <htejun@...>, Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Thursday, July 5, 2007 - 3:22 pm

But... with write cache off you don't let the drive do some things which
might show a lot of improvement with one scheduler or another. So your
data are only part of the story, aren't they?

--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
-

To: Tejun Heo <htejun@...>
Cc: Michael Tokarev <mjt@...>, Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Wednesday, July 4, 2007 - 10:40 am

Actually, in many ways the results support a theory of SCSI TCQ Jens used
when designing the block layer. The original TCQ theory held that the
drive could make much better head scheduling decisions than the
Operating System, so you just used TCQ to pass all the outstanding I/O
unfiltered down to the drive to let it schedule. However, the I/O
results always seemed to indicate that the effect of TCQ was negligible
at around 4 outstanding commands, leading to the second theory that all
TCQ was good for was saturating the transport, and making scheduling
decisions was, indeed, better left to the OS (hence all our I/O
schedulers).

The key difference between NCQ and TCQ is that NCQ allows a non
interlock setup and completion, but there can't be overlapping (or
interrupted) data transfers. TCQ and Disconnect (for SPI although there
are equivalents for most other transports) allow any style of overlap
you can construct, so NCQ was really designed more to allow the drive to
make the head scheduling decisions.

Where SCSI TCQ seems to win is that most devices pull the incoming TCQ
commands into a (usually quite large) pre-execute cache, which gives
them streaming command execution (usually they're executing command n-2
or 3 while accepting the data for command n), so they're using the cache
actually to smooth out internal latencies.

One final question: have you tried SAS devices for comparison? The
figures that give TCQ a 2x performance boost were with SPI and FC ...
I'm not aware that anyone has actually done a SAS test.

James

-

To: James Bottomley <James.Bottomley@...>
Cc: Tejun Heo <htejun@...>, Michael Tokarev <mjt@...>, Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Monday, July 9, 2007 - 8:26 am

Indeed, the above I still find to be true. The only real case where
larger depths make a real difference is a pure random read (or write,
with write caching off) workload. And those situations are largely
synthetic, hence benchmarks tend to show NCQ being a lot more beneficial
since they construct workloads that consist 100% of random IO. Real life
is rarely so black and white.

Additionally, there are cases where drive queue depths hurt a lot. The
drive has no knowledge of fairness, or process-to-io mappings. So AS/CFQ
has to artificially limit queue depths when competing IO processes are
doing semi (or fully) sequential workloads, or throughput plummets.

So while NCQ has some benefits, I typically tend to prefer managing the
IO queue largely in software instead of punting to (often) buggy
firmware.

--
Jens Axboe

-

To: Michael Tokarev <mjt@...>
Cc: Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Thursday, June 28, 2007 - 7:01 am

Michael Tokarev wrote:

A quick followup, to demonstrate the "interesting" part.

Seagate SCSI ST3146854LC drive, 140GB, 15K RPM, write cache disabled,
queue depth = 32:

BlkSz  Trd  linRd  rndRd  linWr  rndWr     rndR/W
4k       1   37.9    0.6    0.9    0.6   0.4/ 0.3
         4           0.9           0.7   0.6/ 0.4
        32           1.5           1.1   0.9/ 0.4
8k       1   75.2    1.2    1.9    1.1   0.8/ 0.6
         4           1.7           1.5   1.1/ 0.7
        32           2.9           2.2   1.7/ 0.9
16k      1   82.3    2.4    3.6    2.3   1.5/ 1.2
         4           3.3           2.9   2.2/ 1.4
        32           5.5           4.3   3.3/ 1.7
32k      1   86.3    4.7    6.9    4.4   2.9/ 2.3
         4           6.4           5.6   4.2/ 2.7
        32          10.2           8.0   6.2/ 3.1
128k     1   86.5   15.8   26.6   14.9   9.5/ 7.7
         4          20.6          18.2  13.5/ 8.5
        32          29.2          24.8  18.3/ 9.1
1024k    1   88.6   46.7   63.1   48.2  25.3/25.3
         4          51.7          51.3  33.5/21.8
        32          55.9          57.3  37.6/19.0

Fujitsu SCSI MAX3147NC drive, same parameters:

BlkSz  Trd  linRd  rndRd  linWr  rndWr     rndR/W
4k       1   37.4    0.7    1.0    0.6   0.4/ 0.3
         4           0.9           0.8   0.6/ 0.4
        32           1.5           1.2   0.9/ 0.4
8k       1   32.9    1.3    1.9    1.2   0.7/ 0.7
         4           1.8           1.5   1.2/ 0.7
        32           2.8           2.3   1.7/ 0.9
16k      1   89.6    2.6    3.7    2.4   1.4/ 1.3
         4           3.5           3.0   2.4/ 1.4
        32           5.4           4.4   3.3/ 1.7
32k      1   87.9    4.8    7.0    4.4   2.6/ 2.6
         4           6.8           5.6   4.6/ 2.7
        32           9.9           8.3   6.2/ 3.1
128k     1   90.7   16.2   22.5   15.3   8.6/ 8.6
         4          21.8          18.6  15.0/ 8.1
        32          28.6          25.0  18.2/ 9.1
1024k    1   90.6   48.9   60.0   47.4  25.3/25.9
         4          55.6          51.7  34.4/22.5
        32          59.8          56.2  38.6/19.7
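
To put a number on how much TCQ helps here, the sketch below divides the
32-thread rndRd figure by the 1-thread figure for two block sizes. The
values are copied from the two SCSI tables in this message; the dictionary
and function names are hypothetical.

```python
# rndRd in MB/s by thread count, copied from the two tables above.
seagate_scsi = {"4k": {1: 0.6, 4: 0.9, 32: 1.5},
                "128k": {1: 15.8, 4: 20.6, 32: 29.2}}
fujitsu_scsi = {"4k": {1: 0.7, 4: 0.9, 32: 1.5},
                "128k": {1: 16.2, 4: 21.8, 32: 28.6}}


def speedup(rows, threads=32):
    """Throughput with `threads` threads relative to one thread."""
    return {blk: round(r[threads] / r[1], 2) for blk, r in rows.items()}


for name, rows in (("Seagate", seagate_scsi), ("Fujitsu", fujitsu_scsi)):
    print(name, speedup(rows))
```

The same calculation on a table where the rows do not change with the
thread count gives 1.0 everywhere.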

/mjt

-

To: Michael Tokarev <mjt@...>
Cc: Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Wednesday, July 4, 2007 - 11:44 am

Are you sure that NCQ was enabled between the controller and drive?
Did you verify this? I know about some versions that disable NCQ
support internally in their firmware (something to do with bugs in
error handling).

--
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il
-

To: Dan Aloni <da-x@...>
Cc: Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Wednesday, July 4, 2007 - 12:17 pm

The next obvious question is: how to check/verify this?

/mjt
-

To: Michael Tokarev <mjt@...>
Cc: Kernel Mailing List <linux-kernel@...>, <linux-ide@...>, <linux-scsi@...>
Date: Wednesday, July 4, 2007 - 12:44 pm

On the lowest level, it's possible using a protocol analyzer. If you
don't have one, you need to be familiar with the controller's driver
or its firmware. If the driver is based on libata, I think it's
possible to get this information more easily. Otherwise, such as in
the case of mptsas, it can be completely hidden by the firmware.
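
For the libata case, one possible check is to look at the kernel log line
printed when the drive is probed. The sketch below assumes that line
contains "NCQ (depth N/M)" when queuing is in use and "NCQ (not used)" when
it is not; the sample lines are illustrative only and were not captured on
the machines discussed here. The function name is hypothetical.

```python
import re

# Assumed libata probe message fragment: "NCQ (depth 31/32)" or "NCQ (not used)".
NCQ_RE = re.compile(r"NCQ \((?:depth (\d+)(?:/(\d+))?|not used)\)")


def ncq_depth(line):
    """Return (depth_in_use, drive_depth), or None if NCQ is absent or unused."""
    m = NCQ_RE.search(line)
    if m is None or m.group(1) is None:
        return None
    used = int(m.group(1))
    drive = int(m.group(2)) if m.group(2) else used
    return used, drive


# Illustrative input lines (made up for this example).
samples = [
    "ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32)",
    "ata2.00: 488397168 sectors, multi 16: LBA48 NCQ (not used)",
    "ata3.00: 156301488 sectors, multi 16: LBA48",
]
for line in samples:
    print(ncq_depth(line), "<-", line)
```

This only shows what the driver negotiated. It says nothing about what the
drive firmware does with the queued commands, and it does not apply to
mptsas, where the controller firmware handles the drive.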

--
Dan Aloni
XIV LTD, http://www.xivstorage.com
da-x (at) monatomic.org, dan (at) xiv.co.il
-
