Re: RAID5

From: Kaushal Shriyan
Subject: RAID5
Date: Sunday, April 18, 2010 - 8:46 pm

Hi

I am a newbie to RAID. Are strip size and block size the same? How is it
calculated? Is it 64 KB by default? What should the strip size be?

I have referred to
http://en.wikipedia.org/wiki/Raid5#RAID_5_parity_handling. How is
parity handled in RAID 5?

Please explain it to me with an example.

Thanks and Regards,

Kaushal
--

From: Michael Evans
Subject: Re: RAID5
Date: Sunday, April 18, 2010 - 9:21 pm

On Sun, Apr 18, 2010 at 8:46 PM, Kaushal Shriyan

You already have one good resource.

I wrote this a while ago, and the preface may answer some questions
you have about the terminology used.

http://wiki.tldp.org/LVM-on-RAID

However the question you're asking is more or less borderline
off-topic for this mailing list.  If the linked information is
insufficient I suggest using the Wikipedia article's links to learn
more.
--

From: Bill Davidsen
Subject: Re: RAID5
Date: Wednesday, April 21, 2010 - 6:32 am

I have some recent experience with this gained the hard way, by looking
for a problem rather than out of curiosity. My experience with LVM on RAID is
that, at least for RAID-5, write performance sucks. I created two
partitions on each of three drives, and two raid-5 arrays using those
partitions. Same block size, same tuning for stripe-cache, etc. I
dropped an ext4 on one array, and LVM on the other, put ext4 on the LVM
drive, and copied 500GB to each. LVM had a 50% performance penalty, took
twice as long. Repeated with four drives (all I could spare) and found
that the write speed on an array was roughly 3x slower with LVM.

I did not look into it further. I know why the performance is bad, but I
don't have the hardware to change things right now, so I live with it.
When I get back from a trip I will change that.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein

--

From: Michael Evans
Subject: Re: RAID5
Date: Wednesday, April 21, 2010 - 12:43 pm

This issue sounds very likely to be write barrier related.  Were you
using an external journal on a write-barrier honoring device?
--

From: Michael Tokarev
Subject: Re: RAID5
Date: Friday, April 23, 2010 - 7:26 am

This is most likely due to the read-modify-write cycle, which is present on
lvm-on-raid[456] if the number of data drives is not a power of two.
LVM requires the block size to be a power of two, so if you can't fit
a whole number of LVM blocks into a full raid stripe, your write speed
is expected to be ~3 times worse...
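
A minimal sketch of the arithmetic behind this claim, not a measurement.
The 64 KiB chunk and 4 MiB LVM block size are assumed values for
illustration only, and the function names are hypothetical. It checks
whether a power-of-two LVM block is an exact multiple of the full-stripe
data size for a few RAID-5 drive counts:

```python
KIB = 1024


def full_stripe_bytes(chunk_bytes, data_disks):
    """Bytes of user data in one full stripe (the parity chunk is excluded)."""
    return chunk_bytes * data_disks


def block_covers_whole_stripes(block_bytes, chunk_bytes, data_disks):
    """True when the block size is an exact multiple of the full-stripe size."""
    return block_bytes % full_stripe_bytes(chunk_bytes, data_disks) == 0


CHUNK = 64 * KIB           # assumed chunk size
LVM_BLOCK = 4 * 1024 * KIB  # assumed 4 MiB power-of-two LVM block

for drives in (3, 4, 5):        # total drives in the RAID-5 array
    data_disks = drives - 1     # RAID-5 uses one chunk per stripe for parity
    stripe_kib = full_stripe_bytes(CHUNK, data_disks) // KIB
    aligned = block_covers_whole_stripes(LVM_BLOCK, CHUNK, data_disks)
    print(drives, "drives:", stripe_kib, "KiB stripe, whole stripes per block:", aligned)
```

With these assumed sizes, only the 4-drive array (3 data disks, 192 KiB
stripe) fails the check; 3 and 5 drives give 2 and 4 data disks, which
divide the block evenly.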

Even creating partitions on such a raid array is difficult.

'Hwell.

Unfortunately very few people understand this.

As for write barriers, it looks like they either already work
(in 2.6.33) or will (in 2.6.34) for the whole raid5-lvm stack.

/mjt
--

From: MRK
Subject: Re: RAID5
Date: Friday, April 23, 2010 - 7:57 am

Seriously?
Requiring the number of data drives to be a power of 2 would be an
immense limitation.
Why should that be? I understand that LVM blocks would not be aligned to
raid stripes, and this can worsen the problem for random writes, but if
the write is sequential, the raid stripe will still be filled at the
next block-output by LVM.
Maybe the very first stripe you write will get an RMW but the next ones
will be filled while waiting, and also consider you have the
preread_bypass_threshold feature by MD which helps in this.
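
To make the sequential-write argument concrete, here is a rough sketch
that counts how many stripes a single contiguous write covers completely
and how many only partially. The sizes (64 KiB chunk, 3 data disks, a
100 MiB write starting one chunk into a stripe) are assumptions chosen
for illustration, and `classify_stripes` is a hypothetical helper, not
anything MD provides:

```python
KIB = 1024


def classify_stripes(offset, length, stripe_bytes):
    """Return (full, partial) counts of stripes touched by one contiguous write."""
    if length <= 0:
        return (0, 0)
    end = offset + length
    first = offset // stripe_bytes          # index of first stripe touched
    last = (end - 1) // stripe_bytes        # index of last stripe touched
    touched = last - first + 1
    # Start of the first stripe that lies entirely inside the write (ceiling).
    first_full_start = -(-offset // stripe_bytes) * stripe_bytes
    # End of the last stripe that lies entirely inside the write (floor).
    last_full_end = (end // stripe_bytes) * stripe_bytes
    full = max(0, (last_full_end - first_full_start) // stripe_bytes)
    return (full, touched - full)


STRIPE = 3 * 64 * KIB                       # assumed: 3 data disks, 64 KiB chunk
full, partial = classify_stripes(64 * KIB, 100 * 1024 * KIB, STRIPE)
print("full stripes:", full, "partial stripes:", partial)
```

Under these assumptions only the two stripes at the ends of the write are
partial; whether MD actually sees the middle ones as full-stripe writes
depends on how the requests arrive, which this sketch does not model.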

Also if you really need to put an integer number of LVM blocks in an MD 
stripe (which I doubt, as I wrote above), this still does not mean that 
the number of drives needs to be a power of 2: e.g. you could put 10 LVM 
blocks in 5 data disks, couldn't you?
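
The "10 blocks in 5 data disks" case works out if the block is half a
chunk. A small sketch of that arithmetic, assuming a 64 KiB chunk and a
32 KiB block (both values are assumptions; `blocks_per_stripe` is a
hypothetical helper):

```python
KIB = 1024


def blocks_per_stripe(chunk_bytes, data_disks, block_bytes):
    """Blocks that fit exactly in one full stripe, or None if they do not fit evenly."""
    stripe = chunk_bytes * data_disks
    if stripe % block_bytes != 0:
        return None
    return stripe // block_bytes


# 5 data disks x 64 KiB chunk = 320 KiB of data per stripe.
print(blocks_per_stripe(64 * KIB, 5, 32 * KIB))   # block smaller than a chunk
print(blocks_per_stripe(64 * KIB, 5, 128 * KIB))  # block larger than a chunk
```

The first call yields a whole number of blocks per stripe even though 5
is not a power of two; the second does not fit evenly.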


I would suspect a barriers thing instead... I'd try to repeat the test
with ext4 mounted with nobarrier and see.
But Bill says that he "knows" what the problem is, so maybe he will tell
us sooner or later :-)
--

From: Michael Evans
Subject: Re: RAID5
Date: Friday, April 23, 2010 - 1:57 pm

Even when write barriers are supported what will a typical transaction
look like?

Journal Flush
Data Flush
Journal Flush (maybe)

If the operations are small (which the journal ops should be) then
you're forced to wait for a read, and then make a write barrier after
it.

J.read(2 drives)
J.write(2 drives) -- Barrier
D.read(2 drives)
D.write(2 drives) -- Barrier
Then maybe
J.read(2 drives) (Hopefully cached, but could cross in to a new stripe...)
J.write(2 drives) -- Barrier

This is why an external journal on another device is a great idea.
Unfortunately what I really want is something like 512 MB of battery
backed ram (at any vaguely modern speed) to split up as journal
devices, but now everyone is selling SSDs which are broken for such
needs.  Any ram drive units still being sold seem to be more along
data-center grade sizes.
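
The "read(2 drives) / write(2 drives)" steps in the sequence above are
presumably the RAID-5 small-write path: read the old data chunk and the
old parity chunk, then write the new data chunk and the new parity chunk.
A minimal sketch of that parity update with XOR, using tiny made-up
4-byte chunks (the helper names are hypothetical, and this is not MD's
actual code):

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_parity(chunks):
    """Parity of a stripe: XOR of all the chunks passed in."""
    parity = bytes(len(chunks[0]))
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return parity


def small_write_parity(old_data, old_parity, new_data):
    """Read-modify-write: new parity from the two chunks read plus the new data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


data = [b"AAAA", b"BBBB", b"CCCC"]       # three data chunks (made-up contents)
parity = stripe_parity(data)             # full-stripe write: no reads needed

new_chunk = b"bbbb"                      # overwrite only the second chunk
parity = small_write_parity(data[1], parity, new_chunk)
data[1] = new_chunk

# The shortcut agrees with recomputing parity over the whole stripe.
assert parity == stripe_parity(data)
# A lost chunk is the XOR of the surviving chunks and the parity.
assert stripe_parity([data[0], data[2], parity]) == data[1]
```

The small write touches two drives for reads and two for writes no matter
how wide the array is, which is why a run of small barrier-separated
journal writes is expensive on RAID-5.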
--

From: Mikael Abrahamsson
Subject: Re: RAID5
Date: Friday, April 23, 2010 - 6:47 pm

http://benchmarkreviews.com/index.php?option=com_content&task=view&id=308&...

Basically it's DRAM with a battery backup and a CF slot where the data 
goes in case of power failure. It's a bit big and so on, but it should be 
perfect for journals... Or is this the kind of device you were referring 
to as "data center grade size"?

Some of the SSDs sold today have a capacitor for power failure as well, so 
all writes will complete, but they're not so common.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se
--

From: Michael Evans
Subject: Re: RAID5
Date: Friday, April 23, 2010 - 8:34 pm

Yeah, that's in the range I call 'data center grade' since the least
expensive model I can find using search tools is about 236 USD.  For
that price I could /buy/ two to three hard drives and get nearly the
same effect by reusing old drives (but wasting more power).

I should be able to find something with a cheap plastic shell for
mounting and a very simple PCB that has slots for older ram of my
selection, and a minimal onboard CPU for less than 50USD; I seriously
doubt the components cost that much.
--

From: Bill Davidsen
Subject: Re: RAID5
Date: Sunday, May 2, 2010 - 3:51 pm

Since I tried 3 and 4 drive setups, with several chunk sizes, I would 
hope that no matter how lvm counts data drives (why does it care?) it 


-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein

--

From: Luca Berra
Subject: Re: RAID5
Date: Sunday, May 2, 2010 - 10:51 pm

uh?
PE size != block size.
PE size is not used for io, it is only used for laying out data.
It will influence data alignment, but I believe the issue may be
bypassed if we make PE size == chunk_size and do all creation/extension
of LVs in multiples of data_disks; the resulting device-mapper tables
should then be aligned.
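
A rough sketch of the sizing rule suggested here: with PE size equal to
the chunk size, round every LV size up to a multiple of data_disks
extents so each LV ends on a full-stripe boundary. The 64 KiB chunk, 3
data disks and 10 GiB target are assumed values, the helper names are
hypothetical, and the sketch also assumes the first extent itself starts
on a stripe boundary:

```python
KIB = 1024
GIB = 1024 * 1024 * KIB


def round_up_extents(wanted_extents, data_disks):
    """Smallest multiple of data_disks that is >= wanted_extents."""
    return -(-wanted_extents // data_disks) * data_disks


def extents_for_size(size_bytes, pe_bytes, data_disks):
    """Extent count for an LV of at least size_bytes, padded to whole stripes."""
    wanted = -(-size_bytes // pe_bytes)     # ceiling division
    return round_up_extents(wanted, data_disks)


PE = 64 * KIB          # assumed: PE size == chunk size
DATA_DISKS = 3         # assumed: 4-drive RAID-5

extents = extents_for_size(10 * GIB, PE, DATA_DISKS)
print("extents to request:", extents)
print("whole stripes:", extents // DATA_DISKS)
```

The resulting count is the kind of number one would pass as an extent
count when creating or extending the LV, rather than a size in gigabytes.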

L.

-- 
Luca Berra -- bluca@comedia.it
         Communication Media & Services S.r.l.
  /"\
  \ /     ASCII RIBBON CAMPAIGN
   X        AGAINST HTML MAIL
  / \
--

From: Bill Davidsen
Subject: Re: RAID5
Date: Sunday, May 2, 2010 - 3:45 pm

Not at all, just taking 60G of free space off the drives, creating two
partitions (on 64-sector boundaries) and using them for raid-5. Tried
various chunk sizes, better for some things, not so much for others.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein

--
