Re: Performance Characteristics of All Linux RAIDs (mdadm/bonnie++)

From: Justin Piszcz
Date: Wednesday, May 28, 2008 - 1:53 am

Hardware:

1. Utilized (6) 400 gigabyte sata hard drives.
2. Everything is on PCI-e (965 chipset & a 2port sata card)

Used the following 'optimizations' for all tests.

# Set read-ahead.
echo "Setting read-ahead to 64 MiB for /dev/md3"
blockdev --setra 65536 /dev/md3

# Set stripe_cache_size for RAID5.
echo "Setting stripe_cache_size to 16 MiB for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size

# Disable NCQ on all disks.
echo "Disabling NCQ on all disks..."
for i in $DISKS
do
   echo "Disabling NCQ on $i"
   echo 1 > /sys/block/"$i"/device/queue_depth
done
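
For reference, the units behind the two tunables above can be worked out with a short sketch. It assumes 512-byte read-ahead sectors (the unit `blockdev --setra` uses), a 4 KiB page size, and the six member disks listed under Hardware; the variable names are illustrative and not part of the original script. It also assumes that each stripe cache entry holds one page per member disk, as described in the kernel md documentation.

```shell
#!/bin/sh
# Sketch: convert the tunables used above into MiB.
# Assumptions: 512-byte read-ahead sectors, 4 KiB pages, 6 member disks.

SETRA_SECTORS=65536   # value passed to blockdev --setra
STRIPE_CACHE=16384    # value written to stripe_cache_size
PAGE_SIZE=4096        # assumed page size in bytes
NR_DISKS=6            # member disks in the array

# blockdev --setra counts 512-byte sectors.
readahead_mib=$(( SETRA_SECTORS * 512 / 1024 / 1024 ))

# stripe_cache_size counts cache entries; each entry holds
# one page per member disk.
stripe_cache_mib=$(( STRIPE_CACHE * PAGE_SIZE * NR_DISKS / 1024 / 1024 ))

echo "read-ahead:   ${readahead_mib} MiB"
echo "stripe cache: ${stripe_cache_mib} MiB"
```

Under these assumptions the read-ahead works out to 32 MiB and the stripe cache to 384 MiB of memory, so the values echoed by the script describe the raw numbers rather than sizes in MiB.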

Software:

Kernel: 2.6.23.1 x86_64
Filesystem: XFS
Mount options: defaults,noatime

Results:

http://home.comcast.net/~jpiszcz/raid/20080528/raid-levels.html
http://home.comcast.net/~jpiszcz/raid/20080528/raid-levels.txt

Note: 'deg' means degraded, and the number after it is the number of
failed disks. I did not test degraded raid10 because there are many ways
to degrade a raid10; however, all three raid10 layouts (f2, n2, o2) were
benchmarked.

Each test was run 3 times and averaged--FYI.

Justin.
--

From: Peter Rabbitson
Date: Wednesday, May 28, 2008 - 3:54 am

Results are meaningless without a crucial detail - what was the chunk size 
used during array creation time? Otherwise interesting test :)

Cheers

Peter

--

From: Justin Piszcz
Date: Wednesday, May 28, 2008 - 4:05 am

Indeed, the chunk size used was 256 KiB for all tests.

Justin.

--

From: Chris Snook
Date: Wednesday, May 28, 2008 - 8:40 am

Given that one of the greatest benefits of NCQ/TCQ is with parity RAID, 
I'd be fascinated to see how enabling NCQ changes your results.  Of 
course, you'd want to use a single SATA controller with a known good NCQ 
implementation, and hard drives known to not do stupid things like 
disable readahead when NCQ is enabled.

-- Chris
--

From: Justin Piszcz
Date: Wednesday, May 28, 2008 - 10:32 am

Only/usually on multi-threaded jobs/tasks, yes?

Also, I turn off NCQ on all of my hosts that have it enabled by default
because many bugs occur when NCQ is on. They are working on it in the
libata layer, but IMO it is not safe at all to run SATA disks w/NCQ: with
it on I have seen drives drop out of the array (with it off, no problems).

--

From: Justin Piszcz
Date: Wednesday, May 28, 2008 - 10:53 am

I have done NCQ measurements in the past. For single-threaded apps, NCQ
off is the way to go; check this out from earlier (10 raptors raid5):

http://home.comcast.net/~jpiszcz/ncq_vs_noncq/


--

From: Chris Snook
Date: Wednesday, May 28, 2008 - 12:22 pm

Generally, yes, but there's caching and readahead at various layers in 
software that can expose the benefit on certain single-threaded 

Are you using SATA drives with RAID-optimized firmware?  Most SATA 
manufacturers have variants of their drives for a few dollars more that 
have firmware that provides bounded latency for error recovery 
operations, for precisely this reason.

-- Chris
--

From: Justin Piszcz
Date: Wednesday, May 28, 2008 - 12:27 pm

I see--however, as I understood it there were bugs utilizing NCQ in libata?

But FYI--
In this test, they were regular SATA drives, not special raid-ones (RE2,etc).

Thanks for the info!

Justin.

--

From: Kasper Sandberg
Date: Thursday, May 29, 2008 - 2:57 am

You wouldn't happen to have some more information about this? I haven't
personally had problems yet, but I haven't used it for very long - but
since it comes activated by DEFAULT, I would assume it to be relatively

--

From: Justin Piszcz
Date: Thursday, May 29, 2008 - 2:08 pm

Not off-hand, check LKML and my email address from early this year or last 
year and/or the ide-list.

Justin.

--

From: Jens Bäckman
Date: Wednesday, May 28, 2008 - 9:34 am

Either the RAID 1 read speed must be wrong, or something is odd in the
Linux implementation. There's six drives that can be used for reading
at the same time, as they contain the very same data. 63MB/s
sequential looks like what you would get from a single drive.
--

From: Chris Snook
Date: Wednesday, May 28, 2008 - 9:40 am

The test is a single thread reading one block at a time, so this is not 
surprising.  If you get this doing multi-megabyte readahead, or with 
several threads, something is very wrong.

-- Chris
--

From: Bryan Mesich
Date: Wednesday, May 28, 2008 - 9:46 am

The RAID 1 read speed metrics do not depict multithreaded
processes reading from the array simultaneously. I would suspect
that the read performance metrics would look better if 2 bonnie
simulations were run together (for RAID 1 that is).

Bryan 
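
A rough way to try the multi-reader case without bonnie++ is sketched below. It is only an illustration: `parallel_read` is a hypothetical helper, not something used in the benchmark, and it assumes GNU `dd`. Each background reader streams its own non-overlapping region of the target so the md layer has independent requests to spread across mirrors.

```shell
#!/bin/bash
# Sketch: start several sequential readers against one target at once.
# parallel_read TARGET READERS MIB_EACH   (hypothetical helper)
parallel_read() {
    target=$1
    readers=$2
    mib_each=$3
    i=0
    while [ "$i" -lt "$readers" ]; do
        # Each reader starts at its own offset so the streams do not overlap.
        dd if="$target" of=/dev/null bs=1M \
           skip=$(( i * mib_each )) count="$mib_each" 2>/dev/null &
        i=$(( i + 1 ))
    done
    wait    # block until every background reader has finished
}
```

In bash, something like `time parallel_read /dev/md3 2 4096` could then be compared against `time parallel_read /dev/md3 1 8192`. Dropping the page cache first (`echo 3 > /proc/sys/vm/drop_caches`) keeps cached data from skewing the comparison.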
--

From: Justin Piszcz
Date: Wednesday, May 28, 2008 - 10:33 am

The RAID1 is correct.  As has been discussed on this list before, you will
only see raid speed > 1 disk if you run 2(?, or 3 minimal) threads
from the same device (raid1).

Justin.
--

From: Alan Cox
Date: Wednesday, May 28, 2008 - 11:57 am

On Wed, 28 May 2008 18:34:00 +0200

Which is fairly typical of a cheap desktop PC where the limitation is the
memory and PCI bridge as much as the drive.

Alan
--

From: Bill Davidsen
Date: Wednesday, May 28, 2008 - 4:00 pm

I really don't think that's any part of the issue, the same memory and 
bridge went 4-5x faster in other read cases. The truth is that the 
raid-1 performance is really bad, and it's the code causing it AFAIK. If 
you track the actual io it seems to read one drive at a time, in order, 
without overlap.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

--

From: Alan Cox
Date: Thursday, May 29, 2008 - 4:22 am

Make sure the readahead is set to be a fair bit over the stripe size if
you are doing bulk data tests for a single file. (Or indeed in the real
world for that specific case ;))
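
As a sanity check on that advice, the sketch below compares the read-ahead setting with the size of one full stripe. It assumes the figures given earlier in the thread (256 KiB chunk, six disks, `--setra 65536`) and a RAID5 layout with one parity chunk per stripe; the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: how many full stripes does the read-ahead window cover?
# Assumptions: RAID5 (one parity chunk per stripe), 512-byte sectors.

CHUNK_KIB=256         # chunk size reported for all tests
NR_DISKS=6            # member disks
PARITY_DISKS=1        # 1 for RAID5, 2 for RAID6
SETRA_SECTORS=65536   # value passed to blockdev --setra

# Data held by one full stripe (parity chunks excluded).
full_stripe_kib=$(( CHUNK_KIB * (NR_DISKS - PARITY_DISKS) ))

# Read-ahead window in KiB.
readahead_kib=$(( SETRA_SECTORS * 512 / 1024 ))

stripes_per_readahead=$(( readahead_kib / full_stripe_kib ))

echo "full stripe: ${full_stripe_kib} KiB"
echo "read-ahead:  ${readahead_kib} KiB"
echo "read-ahead covers ${stripes_per_readahead} full stripes"
```

With these inputs the read-ahead window spans 25 full stripes, which is well over the stripe size.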

--

From: Bill Davidsen
Date: Friday, May 30, 2008 - 5:22 am

IIRC Justin has readahead at 16MB and chunk at 256k. I would think that 
if multiple devices were used at all by the md code, that the chunk 
rather than stripe size would be the issue. In this case the RA seems 
large enough to trigger good behavior, were there are available.

Note: this testing was done with an old(er) kernel, as were all of mine. 
Since my one large raid array has become more mission critical I'm not 
comfortable playing with new kernels. The fate of big, fast, and stable 
machines is to slide into production use. :-(
I suppose that's not a bad way to do it, I now have faith in what I'm 
running.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismarck


--

From: Keld
Date: Wednesday, May 28, 2008 - 12:02 pm

I added this in the wiki performance section.
I think it would have been informative if also a test with one drive in
a non-raid setup was described.

Are there any particular findings you want to highlight?

Is there some way to estimate random read and writes from this test?

Are the XFS file systems completely new when running the tests?

Best regards
keld

--

From: Justin Piszcz
Date: Wednesday, May 28, 2008 - 12:05 pm

Since the performance of bonnie++ deals with single threads/a raid1 would p

Not in particular, just I could never find this information provided anywhere
that showed all of the raid variation/types in one location that was easy to

Yes, after the creation of each array, mkfs.xfs -f /dev/md3 was run to ensure

--

From: Bill Davidsen
Date: Wednesday, May 28, 2008 - 4:09 pm

I have two tiny nits to pick with this information. One is the 
readahead, which as someone else mentioned is in sectors. The other is 
the unaligned display of the numbers, leading the eye to believe that 
values with a similar number of digits can be compared. In truth there's 
a decimal, but only sometimes. I imported the csv file, formatted all 
the numbers to an equal number of places after the decimal, and it is 
far easier to read.
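
One way to get that kind of aligned view is sketched below with awk. The helper name `align_csv`, the column width, and the two-decimal precision are arbitrary choices for illustration; the layout of the real results file is not shown in this thread, so the filter simply treats any purely numeric field as a number.

```shell
#!/bin/sh
# Sketch: print CSV fields in fixed-width columns with uniform decimals.
# align_csv is a hypothetical helper; it reads CSV on stdin.
align_csv() {
    awk -F, '{
        for (i = 1; i <= NF; i++) {
            # Numeric fields get two decimals, right-aligned;
            # everything else is treated as a label and left-aligned.
            if ($i ~ /^[0-9]+(\.[0-9]+)?$/) printf "%12.2f", $i
            else printf "%-12s", $i
        }
        printf "\n"
    }'
}
```

For example, `printf 'raid5,86797,702.1\n' | align_csv` prints the label followed by `86797.00` and `702.10` in equal-width columns (the row itself is a made-up sample).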

Okay, and a half-nit, there were some patches to improve raid-1 
performance, I think by running io on multiple drives when you can, and 
by doing reads from the outer tracks if there are two idle drives. 
That's not in the stable version you used, I assume, it may not be in 
2.6.26 either, I'm doing other things at the moment.

A very nice bit of work. My only request: if you ever feel motivated
to repeat this test, it would be fun to do it with ext3 (or ext4) using
the stride= parameter. I did limited testing and it really seemed to
help, but nothing remotely as formal as your test.
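
For the 256 KiB chunk used in these tests, the stride value can be derived as in the sketch below. It assumes a 4 KiB filesystem block size and only prints a candidate `mke2fs` command rather than running it; the device name is taken from the script earlier in the thread.

```shell
#!/bin/sh
# Sketch: derive the ext3 stride from the RAID chunk size.
# Assumption: 4 KiB filesystem blocks.

CHUNK_KIB=256   # md chunk size reported for all tests
BLOCK_KIB=4     # assumed filesystem block size

# stride = number of filesystem blocks per RAID chunk
stride=$(( CHUNK_KIB / BLOCK_KIB ))

# Print the command instead of executing it; review before use.
echo "mke2fs -j -b 4096 -E stride=${stride} /dev/md3"
```

With these inputs the stride is 64 blocks.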

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
--

From: Michal Soltys
Date: Wednesday, May 28, 2008 - 11:37 pm

Speaking about which, it would probably be good to adjust a little how 
the filesystem is created and mounted (both in xfs and ext3/4 cases). 
E.g. lazy-count=1 is still not the default last time I checked mkfs.xfs. 
And even ext4.txt from kernel documentation recommends mounting it with 
data=writeback,nobh when doing comparison with metadata journaling 
filesystems (the same would go for ext3).

Along with different journal sizes, keeping an eye on stripe & 
stripe-width, and other settings that might be of interest.
--

From: Holger Kiehl
Date: Wednesday, May 28, 2008 - 11:44 pm

Why is the Sequential Output (Block) for raid6 165719 and for raid5 only
86797? I would have thought that raid6 was always a bit slower in writing
due to having to write double the amount of parity data.

Holger

--

From: Justin Piszcz
Date: Thursday, May 29, 2008 - 5:06 am

I will re-run the RAID5 test and also run the test on a single disk and 
update the results later.

Justin.
--

From: Justin Piszcz
Date: Thursday, May 29, 2008 - 10:02 am

RAID5 (2nd test of 3 averaged runs) & Single disk added:
http://home.comcast.net/~jpiszcz/raid/20080528/raid-levels.html
--

From: Bill Davidsen
Date: Friday, May 30, 2008 - 5:55 am

Other than repeating my (possibly lost) comment that this would be 
vastly easier to read if the numbers were aligned and all had the same 
number of decimal places in a single column, good stuff. For sequential 
i/o the winners and losers are clear, and you can set cost and 
performance to pick the winners. Seems obvious that raid-1 is the loser 
for single threaded load, I suspect that it would be poor against other 
levels in multithread loads, but not so much for read.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismarck


--

From: Keld
Date: Friday, May 30, 2008 - 7:23 am

On my wishlist to Justin is also what is the performance of the raid10's
in degraded mode.

And then I note that raid1 performs well on random seeks 702/s
while the raid10,f2 (my pet) only performs 520/s - but this is on a
2.6.23 kernel without the seek performance patch for raid10,f2.

I wonder if the random seeks are related to random read (and write) - it
probably is, but there seems to be a difference between the results
found with bonnie++ and my tests as reported on the
http://linux-raid.osdl.org/index.php/Performance page.

Best regards
keld
--
