Hi,

A friend has a webserver. He has 4 drive bays and, due to previous problems, he's not content to have 3 or 4 drives in a RAID5 configuration: he wants a "hot spare" so that when it takes him a week to find a new drive and some time to drive to the hosting company, he isn't susceptible to a second drive crashing in the meantime. So in principle he'll build a 3-drive RAID5 with a hot spare....

Now we've been told that RAID5 performs badly for the workload that is expected. It would be much better to run the system in RAID10. However, if he switched to RAID10, after a single drive failure he would have a window of about a week in which there is a 33% chance of a second drive failure being "fatal".

So I was thinking.... He's resigned himself to a configuration where he pays for 4x the disk space and only gets 2x the available space. So he could run his array in RAID10 mode; however, when a drive fails, a fallback to RAID5 would be in order. In this case, after the resync, single-drive-failure tolerance is again obtained.

In practice, scaling down to RAID5 is not easy/possible. RAID4, however, should be doable. In fact this can almost be implemented entirely in userspace: just remove the mirror drive from the underlying RAID0, and reinitialize as RAID4. If you do this correctly the data will still be there.... Although doing this with an active filesystem running on these drives is probably impossible due to "device is in use" error messages....

So: Has anybody tried this before? Can this be implemented without kernel support? Anybody feel like implementing this?

Roger.

--
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and ...
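The 33% figure quoted above can be checked by enumeration. The sketch below is not from the original post: it assumes a 4-drive RAID10 in which drives 0-1 and 2-3 form the two mirror pairs, and that a second failure is equally likely to hit any of the three surviving drives. The function name is made up for illustration.

```shell
# Count which second failures are fatal in a 4-drive RAID10.
# Assumption: drives 0-1 and 2-3 are the two mirror pairs.
# $1: number (0-3) of the drive that has already failed
fatal_second_failures() {
    failed=$1
    fatal=0
    total=0
    for d in 0 1 2 3; do
        if [ "$d" -eq "$failed" ]; then
            continue            # this drive is already gone
        fi
        total=$((total + 1))
        # The mirror partner of drive d is d with its lowest bit flipped.
        if [ $((d ^ 1)) -eq "$failed" ]; then
            fatal=$((fatal + 1))
        fi
    done
    echo "$fatal/$total"
}

fatal_second_failures 0   # prints 1/3
```

Only the mirror partner of the dead drive is critical, so one of the three possible second failures loses the array. This is a conditional figure: it says nothing about how likely a second failure is within the week.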
Maybe I'm not quite understanding right, however you can easily do RAID6 with 4 drives. That will give you two redundant drives, effectively give you RAID5 if one drive fails, and save buttloads of messing around...

--
Steven Haigh
Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299
--
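For comparison, all three 4-bay layouts mentioned so far end up with two drives' worth of usable space. The sketch below just does that arithmetic; the function name is made up for illustration, and RAID10 is assumed to mean two-way mirroring.

```shell
# Usable capacity, in whole drives, for the layouts discussed in this thread.
# $1: RAID level (5, 6 or 10)
# $2: number of active drives (hot spares not counted)
usable_drives() {
    case $1 in
        5)  echo $(( $2 - 1 )) ;;   # one drive's worth of parity
        6)  echo $(( $2 - 2 )) ;;   # two drives' worth of parity
        10) echo $(( $2 / 2 )) ;;   # assumes two-way mirroring
        *)  echo "unsupported level: $1" >&2; return 1 ;;
    esac
}

usable_drives 5 3    # 3-drive RAID5 plus a hot spare: prints 2
usable_drives 6 4    # 4-drive RAID6: prints 2
usable_drives 10 4   # 4-drive RAID10: prints 2
```

So the choice between them is about performance and failure behaviour, not about space.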
Steven,

My friend has a server where the drives take up to a third of a second to respond. When asking for help, everybody pounced on us:
- NEVER use raid5 for a server doing small-file-io like a mailserver. (always use RAID10).

So apparently RAID5 (and by extension RAID6) is not an option for some systems. I'm willing to tolerate the RAID4 situation during the time that it takes me to replace the drive.

Roger.

--
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
--
Any RAID scheme that uses parity is less than optimal, and up to horrible, for heavy random IO loads. As always, this depends on "how heavy" the load is. For up to a few hundred constant IOPS you can get away with parity RAID schemes. If you need a few thousand or many thousand IOPS, better stay away from parity RAID.

This includes RAID 3 and 4. Both of these are now defunct because using a dedicated disk for storing parity information for an array yields the same or very slightly higher reliability than using a single disk (I don't have the equation in front of me to give the exact probability of failure). Regardless, if the RAID 3/4 parity disk fails you lose the array.

If your friend's web server isn't going to see a ton of traffic, why does he need anything beyond a 2 way mirror with a spare? Paranoid? Do a 3 way mirror. A mirrored pair of 10k RPM SATA drives should be more than sufficient for most webservers, which typically gain their performance from lots of buffer cache, not from fast disks. It would help if we knew more about the specific web app he's hosting, its IO patterns, and anticipated load once in production. Unless he's got a super complex (read: inefficient) cgi/database back end, my recommendation of a pair of mirrored drives stands. 7.2k would probably be fine; 10k gives a little wiggle room if you underestimate your load target, or the app turns out to be even less efficient than anticipated.

--
Stan
--
On 30/12/2010 10:39, Stan Hoeppner wrote: Sorry, I have to disagree with this, in this situation. RAID-6 over 4 discs will be just as fast for reading multiple small files as RAID-10 over 4 discs, and a web server is a read-mostly environment, while at the same time I can't imagine any RAID schema ever giving thousands of IOPS over 4 discs, parity or no. Cheers, John. --
That's because you apparently didn't learn about paragraphs in English class: http://en.wikipedia.org/wiki/Paragraph

Do you Brits use paragraphs differently than we do here in the states? My first paragraph dealt with general performance of parity vs non parity RAID WRT high IO loads. My second paragraph covered the downside of the redundancy methods of RAID 3/4. My third paragraph dealt specifically with Roger's web server.

Note that nothing in my first paragraph mentioned a web server workload. Also note that nowhere did I mention a count of 4 drives, nor did I comment on the suitability of any RAID level with 4 drives.

Also note there were two "situations" mentioned by Roger. The first referenced a previous thread which dealt with a high transaction load server similar to a mail server, IIRC. My first paragraph related to that. The second "situation", to which you refer, dealt with Roger's web server.

--
Stan
--
Yes, and I suppose that I should have pointed out that the OP's friend had been given slightly inappropriate advice, since a web server doesn't do small file I/O like a mailserver.

You expanded on a general situation which didn't apply, and the statement you made was wrong, or at least

You were wrong again there: if you lose the parity disc in RAID 3/4 you don't lose the array, as the data discs are all still there. It is true that with modern huge (1TB+) drives where the error rate per bit read is still much the same as when drives were tiny (1GB+) that a recovery is much more risky than it used to be due to the dramatically increased

No indeed, but that was the context of the question; why give entirely

I see no such reference, apart from noting that "when asking for help, everybody pounced on us: - NEVER use raid5 for a server doing small-file-io like a mailserver. (always use RAID10)" which as I say is in my opinion inappropriate advice, since they're not trying to run a

I had surmised from the original question about using RAID-10, RAID-4 etc that there was a desire to have more storage than a single drive mirrored twice, so I didn't think plain mirroring would suit, but perhaps that wasn't the intention and your solution would work.

Cheers,

John.
--
Sorry I was a bit prickly in my reply John. For some reason I became defensive, and shouldn't have. Chalk it up to mood I guess.

It's entirely possible that I misunderstood Roger's requirements. I believe he was talking about two different systems, one a transaction type server in his first thread, the other just a web server in this thread. That's why I recommended the possibility of simple RAID 1 for the web server.

It's difficult for me to imagine a web server scenario that would need anywhere close to 1TB of disk, or one that would need more IOPS than a single disk could provide, or more fault tolerance than mirroring. The assumption today being that one satisfies web capacity needs with many cheap nodes instead of one, or few, big ones. I concede anything is possible, and there are myriad requirements out there. I've just never seen/heard of a web server req for anything more than simple disk mirroring.

For instance, I've been using the following for a web node with good success. It's a "low power" node from both an all out performance and heat dissipation perspective but can handle more than sufficient numbers of simultaneous requests (it is noisy though, as all 1U units are). Current cost of the components is less than $360 USD for a 1U 14" deep single core 2.8GHz 45w AMD server, 4GB RAM, onboard single GigE, 2 x mirrored Seagate 160GB 7.2k 2.5" SATA II drives, and a single 260w PSU.

These boxen don't have hot swap drive cages. Using a box with hot swap would increase total price by 35% to almost $500 per node. Drive failures are rare enough here that it's not a burden to de rack the server and replace the drive, as this is a cluster web node. For most other server applications I use hot swap chassis (and redundant PSUs).

I run Debian Lenny on these w/lighttpd, etc. These shallow boxen allow dog ear mounting without making me nervous, so I save about $25-$40 per unit on slide rails. I published the NewEgg wish list of the parts for this build. It ...
When I rebooted my server yesterday, not all the RAIDs came up. There were no errors in the system log. All devices appear to be working correctly. There is no evidence of hardware errors or data corruption.
To prevent mdadm from failing RAID drives, I removed the RAID entries from /etc/mdadm.conf, and I have a cron script that does things like
mdadm -A --no-degraded /dev/md5 --uuid 291655c3:b6c334ff:8dfe69a4:447f777b
mdadm: /dev/md5 assembled from 2 drives (out of 4), but not started.
The question is, why did mdadm assemble only 2 drives, when all 4 drives appear to be fine? The same problem occurred for 4 RAIDs, each with similar geometry, and using the same 4 physical drives.
Here is the status of all 4 partitions that should have been assembled into /dev/md5:
[root@l1 ~]# mdadm -E /dev/sda5
/dev/sda5:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 291655c3:b6c334ff:8dfe69a4:447f777b
Name : l1.fu-lab.com:5 (local to host l1.fu-lab.com)
Creation Time : Thu Sep 23 13:41:31 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 957214849 (456.44 GiB 490.09 GB)
Array Size : 2871641088 (1369.31 GiB 1470.28 GB)
Used Dev Size : 957213696 (456.44 GiB 490.09 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 4088b63f:68d66426:a2abd280:28476493
Update Time : Wed Dec 22 08:27:57 2010
Checksum : 48e371ac - correct
Events : 339
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAA ('A' == active, '.' == missing)
[root@l1 ~]# mdadm -E /dev/sdi5
/dev/sdi5:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 291655c3:b6c334ff:8dfe69a4:447f777b
Name : l1.fu-lab.com:5 (local to host l1.fu-lab.com)
Creation Time : Thu Sep 23 13:41:31 2010
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 957214849 ...

On Thu, 30 Dec 2010 16:20:58 -0700 Jim Schatzman

Add a '--verbose' to the '-A' command. Hopefully it will reveal something interesting.
--
After explicitly stopping the RAID (mdadm -S /dev/md5), and executing the mdadm -A --verbose command as suggested, more info is forthcoming. Mdadm appears to think that /dev/sdi5 and /dev/sdj5 are "busy". Since only /dev/sda5 and /dev/sdk5 can be added to the RAID, we are hosed.

I am puzzled as to why it thinks /dev/sdi5 and /dev/sdj5 are "busy". Fdisk reports normal partition data, dmesg and the system log report no problems, and I have no trouble copying data from /dev/sdi5 and /dev/sdj5 with dd.

The system log does contain messages like

"kernel: [ 47.946357] dracut: Scanning devices md0 md1 md2 md3 sdi sdi1 sdi2 sdj sdj1 sdj2 sdk sdk1 sdk2 sdl for LVM volume groups"

but there appears to be no essential difference in the logging messages for sdi,j versus sda,k. Also, the physical devices (a,i,j,k) are the same brand and model number disk drive. The system log reports "3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)" for all of them. Also, "fuser -m /dev/sdi5" and the same for sdj5 reports no users of the filesystems.

Stopping /dev/md5,6,7,8 and rerunning the mdadm -A command shows that mdadm also thinks that /dev/sdi6,7,8 and /dev/sdj6,7,8 are busy as well, even though they are not being used in any other RAID. What does this mean?

I am grasping at straws here. It seems that mdadm thinks that two of the devices are "busy", even though Linux apparently disagrees, and no I/O errors are being reported. I am mystified.

After googling, I did find one suggestion - remove dmraid from the system. I will try that and report the result. I am not sure why dmraid is installed - I never explicitly installed it. I see that I have

dmraid-events-1.0.0.rc16-12.fc13.i686
dmraid-1.0.0.rc16-12.fc13.i686

Thanks!

Jim

1st try
-----------------
mdadm -A --verbose --no-degraded /dev/md5 --uuid 291655c3:b6c334ff:8dfe69a4:447f777b
mdadm: looking for devices for /dev/md5
mdadm: no recogniseable superblock on /dev/dm-6
mdadm: /dev/dm-6 has wrong uuid.
mdadm: no recogniseable ...
All- Using yum/rpm to remove dmraid from the system and rebooting fixed the problem. Why is dmraid doing anything at all when my motherboard doesn't support FakeRAID? It seems that it is arbitrarily picking some set of drives to tie up, effectively disabling mdadm for some arrays. Nice. Jim --
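A "busy" device that fuser knows nothing about is often claimed by another kernel layer, such as a device-mapper target, rather than by a process. One way to look for that is the holders directory that sysfs keeps for each block device. The sketch below is not from the thread: the function name is made up, and it is only a guess that a check like this would have pointed at the dm-N devices dmraid created here.

```shell
# List the kernel devices (e.g. dm-6, md5) currently holding a block device.
# $1: device name without the /dev/ prefix, e.g. sdi or sdi5
# $2: sysfs mount point (optional, defaults to /sys)
holders_of() {
    dir="${2:-/sys}/class/block/$1/holders"
    if [ ! -d "$dir" ]; then
        return 0                # no such device, or nothing to report
    fi
    for h in "$dir"/*; do
        # An unmatched glob stays literal, so check the entry exists.
        if [ -e "$h" ] || [ -L "$h" ]; then
            basename "$h"
        fi
    done
    return 0
}

# Check both the whole disk and the partition: a mapping set up on the
# whole disk shows up as a holder of sdi, not of sdi5.
holders_of sdi
holders_of sdi5
```

Empty output means nothing in the kernel has claimed the device. A dm-N entry means device-mapper has, and "dmsetup ls" will show which mapping it belongs to.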
On Thu, 30 Dec 2010 09:23:56 +0100 Rogier Wolff <R.E.Wolff@BitWizard.nl>

The kernel already supports this, though only with very recent kernels. I'm not 100% sure about mdadm support, but if it isn't there yet, it probably will be soon.

You can convert a RAID10 to a RAID0. You probably have to remove two devices first, so there are just two working devices - no redundancy.

   mdadm --grow /dev/md0 --level=0

Then you can convert the RAID0 to RAID4

   mdadm --grow /dev/md0 --level=4

Then add the good device back in

   mdadm /dev/md0 --add /dev/sdXX

This should all work, though you should certainly test it before you depend on it at all.

NeilBrown
--
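The sequence above can be written down as a dry run that only prints the commands, so the order can be reviewed before anything touches an array. This is a sketch and has not been run against real hardware. The function name and the device arguments are placeholders, and it assumes the kernel has already marked the dead drive as faulty. Which healthy drive is safe to pull depends on the RAID10 layout: it must come from the mirror pair that is still complete.

```shell
# Print (do not execute) the RAID10 -> RAID0 -> RAID4 fallback steps.
# $1: md array, e.g. /dev/md0
# $2: member that has already failed
# $3: healthy member taken from the still-complete mirror pair; it is
#     pulled out for the RAID0 step and re-added afterwards
raid10_to_raid4_plan() {
    md=$1
    failed=$2
    extra=$3
    # Get down to two working devices, one per mirror pair.
    echo "mdadm $md --remove $failed"
    echo "mdadm $md --fail $extra"
    echo "mdadm $md --remove $extra"
    # Change level twice, as described in the message above.
    echo "mdadm --grow $md --level=0"
    echo "mdadm --grow $md --level=4"
    # Give the array its redundancy back.
    echo "mdadm $md --add $extra"
}

# Placeholder device names; substitute real ones after checking /proc/mdstat.
raid10_to_raid4_plan /dev/md0 /dev/sdX1 /dev/sdY1
```

Between the first --remove and the end of the resync that follows the final --add, the array has no redundancy at all, so a trial on scratch disks is worth doing first.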
Paying for 4x the disk space and only getting 2x is about performance. You can't just view disks by the raw space they provide.

Even if converting RAID10 to RAID4 were possible, you would have no redundancy during the conversion. The risk of losing the array to a disk failure in that window is greater than if you simply kept running the degraded RAID10 until you can replace the faulty disk, not to mention the performance penalty of the parity calculation during the resync. A degraded RAID10 takes only a minor performance hit compared to a RAID level with parity.

All you need to do is purchase 5 disks if you want 4 in RAID10. Have the cold spare ready when one fails. This reduces the replacement time as you don't have to wait for a drive to be ordered.

Ryan
--
