Re: New raid level suggestion.

From: Rogier Wolff
Date: Thursday, December 30, 2010 - 1:23 am

Hi,

A friend has a webserver. He has 4 drive bays and due to previous
problems he's not content to have 3 or 4 drives in a raid5
configuration, but he wants a "hot spare" so that when it takes him a
week to find a new drive and some time to drive to the hosting
company, he isn't susceptible to a second drive crashing in the
meantime.

So in principle he'll build a 3-drive RAID5 with a hot spare.... 

Now we've been told that raid5 performs badly for the workload that is
expected. It would be much better to run the system in RAID10. However
if he'd switch to RAID10, after a single drive failure he has a window
of about a week where he has a 33% chance of a second drive failure
being "fatal".
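The 33% figure can be checked by simple enumeration. The following is a minimal sketch, assuming the four drives form two mirrored pairs (0,1) and (2,3); the function name is made up for illustration.

```shell
#!/bin/sh
# Count which second failures are fatal in a 4-drive RAID10 after one
# drive has already died.
# Assumption: mirror pairs are (0,1) and (2,3).
fatal_second_failures() (
    failed=$1
    partner=$(( failed ^ 1 ))      # mirror partner under the pairing above
    fatal=0
    survivors=0
    for d in 0 1 2 3; do
        [ "$d" -eq "$failed" ] && continue
        survivors=$(( survivors + 1 ))
        # Only the loss of the mirror partner destroys data.
        [ "$d" -eq "$partner" ] && fatal=$(( fatal + 1 ))
    done
    echo "$fatal of $survivors second failures are fatal ($(( 100 * fatal / survivors ))%)"
)

fatal_second_failures 0
```

With drive 0 failed, only the loss of drive 1 is fatal, so one of the three possible second failures takes the array down.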

So I was thinking.... He's resigned himself to a configuration where
he pays for 4x the disk space and only gets 2x the available space.

So he could run his array in RAID10 mode, however when a drive fails, 
a fallback to raid5 would be in order. In this case, after the resync 
a single-drive-failure tolerance is again obtained. 

In practice scaling down to raid5 is not easy/possible. RAID4 however
should be doable.

In fact this can almost be implemented entirely in userspace. Just
remove the mirror drive from the underlying raid0, and reinitialize as
raid4. If you do this correctly the data will still be there....

Although doing this with an active filesystem running on these drives
is probably impossible due to "device is in use" error messages.... 

So: Has anybody tried this before?
Can this be implemented without kernel support?
Anybody feel like implementing this?

	Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
**    Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233    **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement. 
Does it sit on the couch all day? Is it unemployed? Please be specific! 
Define 'it' and ...
From: Steven Haigh
Date: Thursday, December 30, 2010 - 1:47 am

Maybe I'm not quite understanding right, however you can easily do RAID6 
with 4 drives. That will give you two redundant drives, effectively give 
you RAID5 if a drive fails, and save buttloads of messing around...
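
To put the three layouts discussed for a 4-bay box side by side, here is a small sketch of the usual capacity and redundancy arithmetic. The function names are hypothetical, and RAID10 is assumed to keep two copies of each block.

```shell
#!/bin/sh
# usable_drives LEVEL N -> how many drives' worth of space is usable
usable_drives() (
    level=$1; n=$2
    case $level in
        raid5)  echo $(( n - 1 )) ;;
        raid6)  echo $(( n - 2 )) ;;
        raid10) echo $(( n / 2 )) ;;   # assumes 2 copies of each block
        *)      echo "unknown level: $level" >&2; exit 1 ;;
    esac
)

# guaranteed_failures LEVEL -> drive failures survived in the worst case
guaranteed_failures() (
    case $1 in
        raid5|raid10) echo 1 ;;
        raid6)        echo 2 ;;
        *)            echo "unknown level: $1" >&2; exit 1 ;;
    esac
)

# "raid5 3" is the 3-drive RAID5 with the fourth bay used as a hot spare.
for cfg in "raid5 3" "raid10 4" "raid6 4"; do
    set -- $cfg
    echo "$1 on $2 drives: $(usable_drives "$1" "$2") drives usable, survives any $(guaranteed_failures "$1") failure(s)"
done
```

All three give two drives of usable space; only RAID6 survives any two simultaneous failures without a rebuild in between.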


-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
--

From: Rogier Wolff
Date: Thursday, December 30, 2010 - 2:42 am

Steven, My friend has a server where the drives take up to a third of
a second to respond. When asking for help, everybody pounced on us:
- NEVER use raid5 for a server doing small-file-io like a mailserver.
  (always use RAID10). 

So apparently RAID5 (and by extension RAID6) is not an option for some
systems.

I'm willing to tolerate the RAID4 situation during the time that it
takes me to replace the drive.

	Roger. 

--

From: Stan Hoeppner
Date: Thursday, December 30, 2010 - 3:39 am

Any RAID scheme that uses parity is less than optimal, and up to
horrible, for heavy random IO loads.  As always, this depends on "how
heavy" the load is.  For up to a few hundred constant IOPS you can get
away with parity RAID schemes.  If you need a few thousand or many
thousand IOPS, better stay away from parity RAID.
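
As a rough illustration of why parity hurts random writes, the usual rule of thumb is that one small write costs 2 disk I/Os on RAID10, 4 on RAID5 (read data, read parity, write data, write parity) and 6 on RAID6. The sketch below applies that rule; the 100 IOPS per drive is an assumed round number, not a measurement, and caching and full-stripe writes are ignored.

```shell
#!/bin/sh
# write_penalty LEVEL -> disk I/Os per small random write (rule of thumb)
write_penalty() (
    case $1 in
        raid10) echo 2 ;;
        raid5)  echo 4 ;;
        raid6)  echo 6 ;;
        *)      echo "unknown level: $1" >&2; exit 1 ;;
    esac
)

# random_write_iops LEVEL DRIVES PER_DRIVE_IOPS -> estimated array write IOPS
random_write_iops() (
    level=$1; drives=$2; per_drive=$3
    penalty=$(write_penalty "$level") || exit 1
    echo $(( drives * per_drive / penalty ))
)

# Assumption: 4 drives at 100 random IOPS each.
for level in raid10 raid5 raid6; do
    echo "$level: about $(random_write_iops "$level" 4 100) random write IOPS"
done
```

Random reads do not pay this penalty, so the estimate only matters for write-heavy loads.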

This includes RAID 3 and 4.  Both of these are now defunct because using
a dedicated disk for storing parity information for an array yields the
same or very slightly higher reliability than using a single disk (I
don't have the equation in front of me to give exact probability of
failure).  Regardless, if the RAID 3/4 parity disk fails you lose the array.

If your friend's web server isn't going to see a ton of traffic, why
does he need anything beyond a 2 way mirror with a spare?  Paranoid?  Do
a 3 way mirror.  A mirrored pair of 10k RPM SATA drives should be more
than sufficient for most webservers, which typically gain their
performance from lots of buffer cache, not from fast disks.

It would help if we knew more about the specific web app he's hosting,
its IO patterns, and anticipated load once in production.  Unless he's
got a super complex (read inefficient) cgi/database back end my
recommendation of a pair of mirrored drives stands.  7.2k would
probably be fine, 10k gives a little wiggle room if you underestimate
your load target, or the app turns out to be even less efficient than
anticipated.

-- 
Stan
--

From: John Robinson
Date: Thursday, December 30, 2010 - 4:58 am

On 30/12/2010 10:39, Stan Hoeppner wrote:

Sorry, I have to disagree with this, in this situation. RAID-6 over 4 
discs will be just as fast for reading multiple small files as RAID-10 
over 4 discs, and a web server is a read-mostly environment, while at 
the same time I can't imagine any RAID schema ever giving thousands of 
IOPS over 4 discs, parity or no.

Cheers,

John.

--

From: Stan Hoeppner
Date: Thursday, December 30, 2010 - 6:11 am

That's because you apparently didn't learn about paragraphs in English
class:  http://en.wikipedia.org/wiki/Paragraph  Do you Brits use
paragraphs differently than we do here in the states?

My first paragraph dealt with general performance of parity vs non
parity RAID WRT high IO loads.  My second paragraph covered the downside
of the redundancy methods of RAID 3/4.  My third paragraph dealt
specifically with Roger's web server.

Note that nothing in my first paragraph mentioned a web server workload.
Also note that nowhere did I mention a count of 4 drives, nor comment
regarding the suitability of any RAID level with 4 drives.

Also note there were two "situations" mentioned by Roger.  The first
referenced a previous thread which dealt with a high transaction load
server similar to a mail server, IIRC.  My first paragraph related to
that.  The second "situation", to which you refer, dealt with Roger's
web server.

-- 
Stan
--

From: John Robinson
Date: Thursday, December 30, 2010 - 11:10 am

Yes, and I suppose that I should have pointed out that the OP's friend 
had been given slightly inappropriate advice, since a web server doesn't 
do small file I/O like a mailserver. You expanded on a general situation 
which didn't apply, and the statement you made was wrong, or at least 

You were wrong again there: if you lose the parity disc in RAID 3/4 you 
don't lose the array, as the data discs are all still there. It is true 
that with modern huge (1TB+) drives where the error rate per bit read is 
still much the same as when drives were tiny (1GB+) that a recovery is 
much more risky than it used to be due to the dramatically increased 


No indeed, but that was the context of the question; why give entirely 

I see no such reference, apart from noting that "when asking for help, 
everybody pounced on us: - NEVER use raid5 for a server doing 
small-file-io like a mailserver. (always use RAID10)" which as I say is 
in my opinion inappropriate advice, since they're not trying to run a 

I had surmised from the original question about using RAID-10, RAID-4 
etc that there was a desire to have more storage than a single drive 
mirrored twice, so I didn't think plain mirroring would suit, but 
perhaps that wasn't the intention and your solution would work.

Cheers,

John.

--

From: Stan Hoeppner
Date: Friday, December 31, 2010 - 3:23 am

Sorry I was a bit prickly in my reply John.  For some reason I became
defensive, and shouldn't have.  Chalk it up to mood I guess.

It's entirely possible that I misunderstood Roger's requirements.  I
believe he was talking about two different systems, one a transaction
type server in his first thread, the other just a web server in this
thread.  That's why I recommended the possibility of simple RAID 1 for
the web server.

It's difficult for me to imagine a web server scenario that would need
anywhere close to 1TB of disk, or one that would need more IOPS than a
single disk could provide, or more fault tolerance than mirroring.  The
assumption today being that one satisfies web capacity needs with many
cheap nodes instead of one, or few, big ones.  I concede anything is
possible, and there are myriad requirements out there.  I've just never
seen/heard of a web server req for anything more than simple disk mirroring.

For instance, I've been using the following for a web node with good
success.  It's a "low power" node from both an all out performance and
heat dissipation perspective but can handle more than sufficient numbers
of simultaneous requests (it is noisy though, as all 1U units are).
Current cost of the components is less than $360 USD for a 1U 14" deep
single core 2.8GHz 45w AMD server, 4GB RAM, onboard single GigE, and 2 x
mirrored Seagate 160GB 7.2k 2.5" SATA II drives, and a single 260w PSU.
These boxen don't have hot swap drive cages.  Using a box with hot swap
would increase total price by 35% to almost $500 per node.  Drive
failures are rare enough here that it's not a burden to de-rack the
server and replace the drive, as this is a cluster web node.  For most
other server applications I use hot swap chassis (and redundant PSUs).
I run Debian Lenny on these w/lighttpd, etc.

These shallow boxen allow dog ear mounting without making me nervous, so
I save about $25-$40 per unit on slide rails.  I published the NewEgg
wish list of the parts for this build.  It ...
From: Jim Schatzman
Date: Thursday, December 30, 2010 - 4:20 pm

When I rebooted my server yesterday, not all the RAIDs came up. There were no errors in the system log. All devices appear to be working correctly. There is no evidence of hardware errors or data corruption.

To prevent mdadm from failing RAID drives, I removed the RAID entries from /etc/mdadm.conf, and I have a cron script that does things like

mdadm -A --no-degraded /dev/md5 --uuid 291655c3:b6c334ff:8dfe69a4:447f777b
mdadm: /dev/md5 assembled from 2 drives (out of 4), but not started.

The question is, why did mdadm assemble only 2 drives, when all 4 drives appear to be fine?  The same problem occurred for 4 RAIDs, each with similar geometry, and using the same 4 physical drives.
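
One thing worth ruling out in a case like this is a mismatch in the per-member event counters, since mdadm treats members with an older count as stale. The sketch below pulls the Events field out of `mdadm -E` output; the helper name is hypothetical, and the device list is the four partitions named in this thread.

```shell
#!/bin/sh
# events_of: read `mdadm -E` output on stdin, print the Events counter.
events_of() {
    awk -F' : ' '$1 ~ /^ *Events$/ { print $2 }'
}

# Only query real devices, and only if mdadm is installed.
if command -v mdadm >/dev/null 2>&1; then
    for dev in /dev/sda5 /dev/sdi5 /dev/sdj5 /dev/sdk5; do
        [ -b "$dev" ] || continue
        printf '%s: events %s\n' "$dev" "$(mdadm -E "$dev" | events_of)"
    done
fi
```

If all members report the same number, stale superblocks are not the reason for the partial assembly.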

Here is the status of all 4 partitions that should have been assembled into /dev/md5:

[root@l1 ~]# mdadm -E /dev/sda5
/dev/sda5:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 291655c3:b6c334ff:8dfe69a4:447f777b
           Name : l1.fu-lab.com:5  (local to host l1.fu-lab.com)
  Creation Time : Thu Sep 23 13:41:31 2010
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 957214849 (456.44 GiB 490.09 GB)
     Array Size : 2871641088 (1369.31 GiB 1470.28 GB)
  Used Dev Size : 957213696 (456.44 GiB 490.09 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 4088b63f:68d66426:a2abd280:28476493

    Update Time : Wed Dec 22 08:27:57 2010
       Checksum : 48e371ac - correct
         Events : 339

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing)
[root@l1 ~]# mdadm -E /dev/sdi5
/dev/sdi5:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 291655c3:b6c334ff:8dfe69a4:447f777b
           Name : l1.fu-lab.com:5  (local to host l1.fu-lab.com)
  Creation Time : Thu Sep 23 13:41:31 2010
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 957214849 ...
From: Neil Brown
Date: Thursday, December 30, 2010 - 6:08 pm

On Thu, 30 Dec 2010 16:20:58 -0700 Jim Schatzman

Add a '--verbose' to the '-A' command.  Hopefully it will reveal something
interesting.


--

From: Jim Schatzman
Date: Thursday, December 30, 2010 - 8:38 pm

After explicitly stopping the RAID (mdadm -S /dev/md5), and executing the mdadm -A --verbose command as suggested, more info is forthcoming. Mdadm appears to think that /dev/sdi5 and /dev/sdj5 are "busy". Since only /dev/sda5 and /dev/sdk5 can be added to the RAID, we are hosed.

I am puzzled as to why it thinks /dev/sdi5 and /dev/sdj5 are "busy". Fdisk reports normal partition data, dmesg and the system log report no problems, and I have no trouble copying data from /dev/sdi5 and /dev/sdj5 with dd. The system log does contain messages like 

"kernel: [   47.946357] dracut: Scanning devices md0 md1 md2 md3 sdi sdi1 sdi2 sdj sdj1 sdj2 sdk sdk1 sdk2 sdl  for LVM volume groups"

but there appears to be no essential difference in the logging messages for sdi,j versus sda,k. Also, the physical devices (a,i,j,k) are the same brand and model number disk drive. The system log reports

"3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)"

for all of them. Also, "fuser -m /dev/sdi5" and the same for sdj5 reports no users of the filesystems.

Stopping /dev/md5,6,7,8 and rerunning the mdadm -A command shows that mdadm thinks /dev/sdi6,7,8 and /dev/sdj6,7,8 are busy as well, even though they are not being used in any other RAID.

What does this mean?

I am grasping at straws here. It seems that mdadm thinks that two of the devices are "busy", even though Linux apparently disagrees, and no I/O errors are being reported. I am mystified.
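
One way to see what is claiming a partition that mdadm reports as busy is the kernel's "holders" directory in sysfs, which lists the md or device-mapper devices stacked on top of a block device. This is only a sketch: the function name is hypothetical, and the optional second argument exists so the function can be pointed at a different directory tree.

```shell
#!/bin/sh
# list_holders PARTITION [SYSFS_ROOT]
# Print every device that has claimed PARTITION, one per line.
list_holders() (
    part=$1
    root=${2:-/sys/class/block}
    dir="$root/$part/holders"
    [ -d "$dir" ] || { echo "no holders directory for $part" >&2; exit 1; }
    for h in "$dir"/*; do
        [ -e "$h" ] || continue        # empty directory: the glob did not match
        echo "$part is held by ${h##*/}"
    done
)

# The two partitions reported as busy in this thread.
for p in sdi5 sdj5; do
    list_holders "$p" || true
done
```

A dm-N name in the output would mean device-mapper has claimed the partition; `dmsetup ls` then shows which mapping it belongs to.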

After googling, I did find one suggestion - remove dmraid from the system. I will try that and report the result. I am not sure why dmraid is installed - I never explicitly installed it. I see that I have

dmraid-events-1.0.0.rc16-12.fc13.i686
dmraid-1.0.0.rc16-12.fc13.i686

Thanks!

Jim


1st try
-----------------

mdadm -A --verbose --no-degraded /dev/md5 --uuid 291655c3:b6c334ff:8dfe69a4:447f777b
mdadm: looking for devices for /dev/md5
mdadm: no recogniseable superblock on /dev/dm-6
mdadm: /dev/dm-6 has wrong uuid.
mdadm: no recogniseable ...
From: Jim Schatzman
Date: Thursday, December 30, 2010 - 8:51 pm

All-

Using yum/rpm to remove dmraid from the system and rebooting fixed the problem.

Why is dmraid doing anything at all when my motherboard doesn't support FakeRAID?  It seems that it is arbitrarily picking some set of drives to tie up, effectively disabling mdadm for some arrays. Nice.

Jim

--

From: Neil Brown
Date: Thursday, December 30, 2010 - 3:01 am

On Thu, 30 Dec 2010 09:23:56 +0100 Rogier Wolff <R.E.Wolff@BitWizard.nl>

The kernel already supports this, though only with very recent kernels.
I'm not 100% sure about mdadm support, but if it isn't there yet, it probably
will be soon.

You can convert a RAID10 to a RAID0.  You probably have to remove two devices
first, so there are just two working devices - no redundancy.

  mdadm --grow /dev/md0 --level=0

Then you can convert the RAID0 to RAID4

  mdadm --grow /dev/md0 --level=4

Then add the good device back in

  mdadm /dev/md0 --add /dev/sdXX

This should all work, though you should certainly test it before you depend
on it at all.
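
The steps above could be strung together roughly as follows. This is an untested sketch, not a verified procedure: the function and variable names are hypothetical, the device names are placeholders, and it assumes the failed drive is already marked faulty. By default it only prints the commands; setting MDADM_RUN=1 would execute them.

```shell
#!/bin/sh
# run CMD...: print the command, or execute it when MDADM_RUN=1.
run() {
    if [ "${MDADM_RUN:-0}" = 1 ]; then "$@"; else echo "$*"; fi
}

# raid10_to_raid4 ARRAY FAILED GOOD
#   FAILED: the drive that died (assumed already marked faulty)
#   GOOD:   a healthy drive from the other mirror pair, taken out
#           temporarily and re-added at the end as the RAID4 parity device
raid10_to_raid4() (
    md=$1; failed=$2; good=$3
    # Drop to two working devices, one from each mirror pair: no redundancy.
    run mdadm "$md" --remove "$failed" &&
    run mdadm "$md" --fail "$good" --remove "$good" &&
    # Change levels as described above.
    run mdadm --grow "$md" --level=0 &&
    run mdadm --grow "$md" --level=4 &&
    # Add the good device back in.
    run mdadm "$md" --add "$good"
)

# Placeholder devices: /dev/sdb1 failed, /dev/sdd1 is from the other pair.
raid10_to_raid4 /dev/md0 /dev/sdb1 /dev/sdd1
```

The array has no redundancy between the second command and the end of the final resync, so this is worth trying on scratch devices first.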

NeilBrown

--

From: Ryan Wagoner
Date: Thursday, December 30, 2010 - 7:24 am

Paying for 4x the disk space and only getting 2x is about performance.
You can't just view disks by the raw space they provide.

Even if the scenario of converting RAID10 to RAID4 was possible you
now have no redundancy during the conversion. The chance of a disk
failing is greater than just running the RAID10 until you can replace
the faulty disk. Not to mention the performance penalty of the resync
parity calculation. A degraded RAID10 takes only a minor performance hit
compared to a RAID level with parity.

All you need to do is purchase 5 disks if you want 4 in RAID10. Have
the cold spare ready when one fails. This reduces the replacement time
as you don't have to wait for a drive to be ordered.

Ryan
--
