First post. I've never used RAID but am thinking about it and looking for newbie-level info. Thanks in advance.

I'm thinking about building a machine for long term number crunching of stock market data. Highest end processor I can get, 16GB and at least reasonably fast drives. I've not done RAID before and don't know how to choose one RAID type over another for this sort of workload. All I know is I want the machine to run 24/7 computing 100% of the time and be reliable at least in the sense of not losing data if 1 drive or possibly 2 go down.

If a drive does go down I'm not overly worried about down time. I'll stock a couple of spares when I build the machine and power the box back up within an hour or two.

What RAID type do I choose and why?

Do I need a 5 physical drive RAID array to meet these requirements? Assume 1TB+ drives all around.

How critical is it going forward with Linux RAID solutions to be able to get exactly the same drives in the future? 1TB today is 4TB a year from now, etc.

With an 8 core processor (high-end Intel Core i7 probably) do I need to worry much about CPU usage doing RAID? I suspect not and I don't really want to get into hardware RAID controllers unless critically necessary which I suspect it isn't.

Anyway, if there's a document around somewhere that helps a newbie like me I'd sure appreciate finding out about it.

Thanks,
Mark
--
I'm not sure about a newbie doc, but here's some basics: You haven't said what kind of i/o rates you expect, nor how much storage you need. At a minimum I would build a 3-disk raid 6. raid 6 does a lot of i/o which may be a problem. Raid-5 is out of favor for me due to issues people are seeing with discrete bad sectors with the remaining drives after you have a drive failure. raid-6 tolerates those much better. Even raid 10 is not as robust as raid 6 and with the current generation drives robustness in the raid solution is more important than ever. But raid 6 uses 2 parity drives, so you'll only get 1TB of useable space from a 3-disk raid 6 made from 1TB drives. mdraid just requires replacement disks be bigger than the old disk you're replacing. You might consider layering LVM on top of mdraid to help you manage the array as it grows. Greg -- Greg Freemyer Head of EDD Tape Extraction and Processing team Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer Preservation and Forensic processing of Exchange Repositories White Paper - <http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html> The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com --
Good points. I guess I was assuming I'd want 1TB storage and I'd buy 3/5/6 1TB drives to get it. Honestly I probably don't need anything close to that. My weekly backups of stock data run about 1GB, so 1TB should hold me for quite a while I think. As for i/o rates I think it's pretty low. Real-time or historic stock data arrives here over the net so that's not fast. Crunching numbers *typically* amounts to loading a single data set from disk into memory and then operating from there so I suspect that even in backtesting it's pretty low but I'll see if I can get some data. Nonetheless I'm not sure there's much overlap between when the disk is heavily used and when it gets CPU limited. Again, I'll have to give that some I've been looking at this page so far for the most basic info: http://en.wikipedia.org/wiki/RAID#Organization They show RAID 6 with 5 drives so I'll need to learn how to do this with fewer drives. I think your point about more than 1 drive having problems around the same time is good input. While money is always important buying 1 or 2 more drives (say $200) isn't the biggest issue here. It's a new machine with a $500 processor so if more drives make a big difference in terms of reliability then I Two subjects I haven't even thought of! Thanks for the info! Lots to study! Cheers, Mark --
There is a newbie setup howto at http://raid.wiki.kernel.org/index.php/Preventing_against_a_failing_disk This is for 2 disks, but you can add more disks for added redundancy and speed. best regards keld --
Where are you sourcing the stock data? 1GB/week seems awful low. You must be getting top of book only? We get real-time full depth stock data. NYSE and Nasdaq data are each about 100 GB/month, compressed. Just something to keep in mind if you ever start working with full-depth feeds. Also: for backups, you might want to consider par2. You can use it to create a specified amount of parity data for each file or group of files. Useful in the case of media errors. --
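To put the two data rates in this exchange side by side, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1TB = 1000GB), a steady ingest rate, and that only the NYSE and Nasdaq feeds mentioned above are stored. The function name is made up for illustration.

```python
def months_until_full(capacity_gb: float, gb_per_month: float) -> float:
    """Months of data that fit in capacity_gb at a steady ingest rate."""
    if gb_per_month <= 0:
        raise ValueError("ingest rate must be positive")
    return capacity_gb / gb_per_month


CAPACITY_GB = 1000.0        # assumption: 1TB of usable space, decimal units
WEEKS_PER_MONTH = 52 / 12   # average number of weeks in a month

# Index-only data at roughly 1GB per week.
index_only = months_until_full(CAPACITY_GB, 1.0 * WEEKS_PER_MONTH)

# Full-depth NYSE plus Nasdaq at roughly 100GB per month each, compressed.
full_depth = months_until_full(CAPACITY_GB, 2 * 100.0)

print(f"index data only:  about {index_only / 12:.0f} years")
print(f"full-depth feeds: about {full_depth:.0f} months")
```

Under these assumptions the same 1TB that lasts many years for index data would fill in a matter of months with full-depth feeds, which is worth keeping in mind when sizing the array.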
Yeah, my description was a bit lacking. Sorry. Currently I only auto-trade index futures so I get only the indexes (tick & 1 minute) and then a few other general things (VIX, A/D, Gold, Oil, 3 month T bill, etc.) for correlation purposes. I don't mess much with individual stocks at all although I have it in my mind to start creating some private out of individual stock data so if I do that storage requirements will certainly go up. Thanks, Mark --
} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Greg Freemyer
} Sent: Saturday, March 06, 2010 5:33 PM
} To: Mark Knecht
} Cc: Linux-RAID
} Subject: Re: What RAID type and why?
} ...
On Sat, Mar 6, 2010 at 3:17 PM, Guy Watkins <linux-raid@watkins-home.com> wrote: OK - good points. The 'data' is very important to me. Full disclosure (and it well could make a difference I suppose) but the stock data is really just part of a medium sized VMWare image on the order of 10GB. VMWare is running on Gentoo and hosting Windows XP currently, Windows 7 later possibly. Windows is the only platform that has programs that currently do what I need. (And my trading partner is completely Windows based so until I convert him to Linux Windows must be part of the recipe.) Anyway, I try to keep my VMWare images below 10GB so that I can tar them up once a week and write them to a dual-layer DVD for backup. Once a month I move one to the safe deposit box at the bank. Otherwise I keep three in the house at the far end in a fireproof box. (Although high temp might damage them if the whole house goes up. Who knows...) So, Gentoo hosts VMWare. VMWare maps a logical 10GB Windows C: drive into a bunch of 2GB files and the stock data is typically a single file for each stock where the file size approaches 100MB. I have little or no control in terms of how that 100MB file is placed on the drive. I haven't a clue what that really means in terms of disk access from a Linux point of view. Daily backups are just my programming. They are incremental and just sent to another computer on the network, and then also covered in the VMWare backup at the end of the week. Replying to Asdo - I'm hoping to use a more or less standard desktop or server motherboard with 6-8 SATA ports dedicating 3-6 ports to this RAID interface and then using Linux software RAID and no hardware RAID controllers. Is this unreasonable in your opinion for the work flow I'm describing using VMWare? I'm concerned about any RAID hardware that leaves me stranded if the controller dies. Thanks all! - Mark --
Several thoughts about that, if you encrypt the tar you can burn another copy and stick it in with the CDs in your car. Which hopefully is not parked in an attached garage. And look at dvdisaster. It's a software ECC which gives you a much improved chance to recover data if you have issues with it. Of course, if you have a trusted friend you can just do incremental over the net, unless you change all 10GB somehow. Makes for a slightly ugly recovery. I finally went to Blu-Ray for full backup, DVD for incremental. I did do In that case you start to look at server grade hardware, big UPS, or multiple systems doing realtime mirroring. If you want data center reliability you need data center paranoia and complexity. Decide what you need, and people will help you get there. When I ran the usenet servers for SBC we had a backup data center on another tectonic plate, think about how safe you want to be. Other thought: have a backup off-site which captures a few days worth of new data, and do incrementals often. Worst case you have to recompute starting from a backup and whatever new data came in since. -- Bill Davidsen <davidsen@tmr.com> "We can't solve today's problems by using the same thinking we used in creating them." - Einstein --
More importantly, it sounds like his workload will be mostly /database/ driven. As far as I'm aware, databases tend to produce many small operations; which unfortunately pushes favor to the simple mirroring operations. If two drives going bad is a concern then using 2 backup copies per raid 1 mirror set would work. Most modern consumer systems come with 6 SATA ports or more, so it should be possible to get 6 hard drives installed and shared among two raid 1 sets of 3 drives each. LVM with striping could be used over the raid 1 sets. On the other hand, he says that the system will have 16 GB of memory; I'm not sure what size his working set is, but it sounds entirely plausible that a well constructed database could live entirely within the ram. If that's the case it doesn't really matter what the precise performance of the storage solution is. Raid 6 would offer more efficient drive use at similar rates of error tolerance at a cost savings of 2 drives in the six drive case. Update for new email: Just go with the raid 1 version; it sounds like you aren't trying to store terabytes of data so the raid 1 solution with even just 3 drives should be sufficient. Put the saved resources into more, faster, or better memory. --
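The drive-count comparison in the message above can be worked through with a small Python sketch. It assumes equal-size 1TB drives and counts only the number of drive failures that are guaranteed to be survivable (the worst case, where all failures hit the same mirror set). `Layout`, `mirror_sets` and `raid6` are illustrative names, not part of mdadm or any library; the 4-drive minimum for md raid6 is the one stated elsewhere in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    name: str
    drives: int              # total physical drives
    usable_tb: float         # usable capacity in TB
    failures_survived: int   # drive failures survivable in the worst case


def mirror_sets(sets: int, copies: int, drive_tb: float) -> Layout:
    """`sets` RAID 1 mirrors of `copies` drives each, combined (e.g. by LVM)."""
    return Layout(f"{sets} x {copies}-way raid1", sets * copies,
                  sets * drive_tb, copies - 1)


def raid6(drives: int, drive_tb: float) -> Layout:
    """RAID 6 spends two drives' worth of space on parity."""
    if drives < 4:
        raise ValueError("md raid6 needs at least 4 drives")
    return Layout(f"{drives}-drive raid6", drives, (drives - 2) * drive_tb, 2)


six_mirrored = mirror_sets(2, 3, 1.0)    # two raid 1 sets of 3 drives each
four_raid6 = raid6(4, 1.0)               # same usable space, two fewer drives
three_mirrored = mirror_sets(1, 3, 1.0)  # the "just 3 drives" suggestion

for layout in (six_mirrored, four_raid6, three_mirrored):
    print(f"{layout.name:>16}: {layout.drives} drives, "
          f"{layout.usable_tb:.0f}TB usable, "
          f"survives {layout.failures_survived} failures")
```

Under these assumptions both the six-drive mirrored layout and a four-drive raid 6 give 2TB usable and tolerate two failures, which is where the two-drive saving comes from. The sketch says nothing about performance, which is the other half of the argument above.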
On Sat, 06 Mar 2010 18:17:44 -0500 and as md/raid6 requires at least 4 drives, RAID1 is not just the best solution to survive two failures on a 3-device array, it is the only solution. NeilBrown --
Raid10 can also do it. raid1 is in many ways obsolete and you should rather use raid10, which in my eyes is just another way of doing the same conceptual thing as raid1. Best regards keld --
} -----Original Message----- } From: Keld Simonsen [mailto:keld@keldix.com] } Sent: Sunday, March 07, 2010 3:07 AM } To: Neil Brown } Cc: Guy Watkins; 'Greg Freemyer'; 'Mark Knecht'; 'Linux-RAID' } Subject: Re: What RAID type and why? } } On Sun, Mar 07, 2010 at 01:21:13PM +1100, Neil Brown wrote: } > On Sat, 06 Mar 2010 18:17:44 -0500 } > "Guy Watkins" <linux-raid@watkins-home.com> wrote: } > } > > } } > > } At a minimum I would build a 3-disk raid 6. raid 6 does a lot of } i/o } > > } which may be a problem. } > > } > > If he only needs 3 drives I would recommend RAID1. Can still loose 2 } drives } > > and you don't have the RAID6 I/O overhead. } > > } > } > and as md/raid6 requires at least 4 drives, RAID1 is not just the best } > solution to survive two failures on a 3-device array, it is the only } solution. } } Raid10 can also do it. } } raid1 is in many ways obsolete and you should rather use raid10, } which in my eyeys is just another way of doing the same conceptual thing } as raid1. } } Best regards } keld Are you sure RAID10 can lose 2 of 3 drives? I did not think it worked that way. I thought RAID10 maintained 2 copies, not 3. But I have never used RAID10. Guy --
If you ask mdadm to do it, yes. Example:

    mdadm --create /dev/md3 --chunk=256 -R -l 10 -n 3 -p f3 /dev/sd[abc]1

the "-p f3" is the one that asks to have 3 copies.

best regards
keld
--
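Why three "far" copies on three devices can survive the loss of any two drives can be illustrated with a simplified Python model. This is only a sketch, not md's actual placement code: it assumes that copy k of logical chunk c lives on device (c + k) mod n, so each further copy is shifted by one device, and it ignores where on the device each copy sits. The function names are made up for illustration.

```python
from itertools import combinations


def far_layout_devices(chunk: int, n_devices: int, far_copies: int) -> list[int]:
    """Devices holding the copies of one logical chunk (simplified model).

    Assumption: copy k of chunk c is placed on device (c + k) % n_devices.
    """
    if far_copies > n_devices:
        raise ValueError("cannot place more copies than there are devices")
    return [(chunk + k) % n_devices for k in range(far_copies)]


def survives(failed: set[int], n_devices: int, far_copies: int) -> bool:
    """True if every chunk still has at least one copy on a working device."""
    # The placement pattern repeats every n_devices chunks, so one period
    # is enough to cover every distinct combination of devices.
    return all(
        any(dev not in failed
            for dev in far_layout_devices(chunk, n_devices, far_copies))
        for chunk in range(n_devices)
    )


pairs = [set(pair) for pair in combinations(range(3), 2)]

# 3 devices, 3 far copies (the "-n 3 -p f3" case): any two drives may fail.
print(all(survives(pair, 3, 3) for pair in pairs))

# 3 devices, 2 far copies: some two-drive failures lose data.
print(all(survives(pair, 3, 2) for pair in pairs))
```

In this model the f3 array on three devices keeps a copy of every chunk on every device, so its usable space is that of a single drive, the same as a 3-way raid1.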
Yes, that way would work, except in that case it would use more complicated methods to split up the stripes among the drives. Since your application seems to be read heavy, I agree with using 'far' for the stripe method. However the disadvantage of mdadm raid10 has been two-fold compared to raid1 (until kernel 2.6.33+). 1) Fixed in 2.6.33: Striped storage did not previously support write-barriers (required for atomic write mechanisms/journals). 2) Still unsupported? : Reshape of raid10 arrays. --
Except that there also is raid10 with 3 mirrors. :)
MfG
Goswin
PS: Why does raid6 still not allow 3 drives for the special case of
converting raid1 -> raid6?
--
That should be obvious:

Possible stripes:

Start:
1, 1, 1;
2, 2, 2;

'raid6' overtake...
1, q, Q;
2, q, Q;

'raid6' overtake with missing;
1, (missing 2), q, Q;
3, (missing 4), q, Q;

In the first overtake case you have the requirement of generating 200% parity, which probably won't work for the algorithm and is a silly idea in general since it's computationally far less expensive to store another copy of either form of data instead.

In the second you're gaining the space of a second disk at the cost of being already degraded; why not just go for raid 5 instead? You can overtake raid5 later with raid6 if you add more devices.
--
Start:
1, 1, 1;
2, 2, 2;
3, 3, 3;
Middle:
1, P, Q;
P, Q, 2;
Q, 3, P;
...
End:
1, 2, P, Q;
4, P, Q, 3;
P, Q, 5, 6;
The sick 3 disk raid6 case should have both the P and Q identical to the
data block. It is indeed computationally a waste to go through the
expensive P/Q parity algorithm for the same result as mirroring but this
Because then you are going from 2 mirror disks to 1 parity disk even if
only temporarily. You are reducing the number of disk failures you can
survive from 2 to 1 and the high load during a reshape makes a failure
more likely than in normal operation.
Or can you go from 3 way raid1 to 4 disk raid6 in a single step?
MfG
Goswin
--
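The remark above that a 3-disk raid6 would have P and Q identical to the data block can be checked with a short Python sketch. It assumes the usual RAID 6 definition, P as the XOR of the data blocks and Q as the XOR of 2^i times block i in GF(2^8) with the reducing polynomial 0x11d, and that all data blocks have equal length. The function names are illustrative, and this is a byte-at-a-time sketch rather than the optimized kernel code.

```python
def gf_mul2(x: int) -> int:
    """Multiply a byte by 2 in GF(2^8), reducing by the polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def gf_pow2_mul(value: int, exponent: int) -> int:
    """Return (2 ** exponent) * value in GF(2^8)."""
    for _ in range(exponent):
        value = gf_mul2(value)
    return value


def pq_syndromes(data_blocks: list[bytes]) -> tuple[bytes, bytes]:
    """P = XOR of all data blocks; Q = XOR of (2**i * D_i) over GF(2^8)."""
    length = len(data_blocks[0])
    p = bytearray(length)
    q = bytearray(length)
    for i, block in enumerate(data_blocks):
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_pow2_mul(byte, i)
    return bytes(p), bytes(q)


# One data disk plus P and Q: the 3-disk case discussed above.
block = bytes(range(256))
p, q = pq_syndromes([block])
print(p == block and q == block)

# Two data disks: P and Q are no longer plain copies of the data.
p2, q2 = pq_syndromes([b"\x01\x02", b"\x03\x04"])
print(p2.hex(), q2.hex())
```

With a single data block the multiplier for Q is 2^0 = 1, so both syndromes degenerate into copies of the data, which is why the 3-disk case amounts to mirroring with extra computation.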
You are not planning on staying with 3 devices though. Just stick with 2 redundancy raid 1 until you have four devices. Then overtaking from raid 1 + hotspares at 4 devices total to raid 6 with 2 data devices and 2 parity devices per stripe makes sense. --
Hi Mark

I'll reply to just a few points.

You don't need the same drives, only a few requirements:

1 - The drives need to play well with the controller. Do some tests. There were rare cases of certain drives being dropped by certain controllers, e.g. on high I/O. Just do some tests before putting valuable data in. Maybe look at the HCL for your controller before buying.

2 - The new drives need to be at least as large as the old ones. Extra space will be wasted (ok not exactly, you can use the extra space for other purposes). Speed is not relevant for bare functionality.

3 - It's better if new drives are not slower than the older drives. Everything moves at the speed of the slowest drive.

4 - Better to take raid-edition or enterprise-grade drives, especially because they usually have RTL "recovery time limit" also called TLER (time limited error recovery). Go google for it to understand why it is

I think for a 5-disk raid array the CPU power is still not the limiting factor. Especially not in the fast raids like raid10. Note that only 1 core will be used for RAID parity computation for raid456. All cores will be used for handling interrupts from the drives though.

Difficult to suggest what RAID you should use. We don't know access patterns, speed requirements, value of your data. Minimum for redundancy is raid-1 on 2 drives.
--
