Re: Raid/5 optimization for linear writes

From: Doug Dumitru
Date: Tuesday, December 28, 2010 - 8:38 pm

Hello all,

I have been using an in-house mod to the raid5.c driver to optimize
for linear writes.  The optimization is probably too specific for
general kernel inclusion, but I wanted to throw out what I have been
doing in case anyone is interested.

The application involves a kernel module that can produce precisely
aligned, long, linear writes.  In the case of raid-5, the obvious plan
is to issue writes that are complete raid stripes of
'optimal_io_length'.

Unfortunately, optimal_io_length is often less than the advertised max
io_buf size value and sometimes less than the system max io_buf size
value.  Thus just pumping up the max value inside of raid5 is dubious.
Even though it is dubious, just punching up
mddev->queue->limits.max_hw_sectors does seem to work, does not break
anything obvious, and does help performance a little.

Looking at long linear writes with the stock raid5 driver, I am
seeing a small number of reads to individual devices.  The test
application code calling the raid layer has > 100MB of locked kernel
buffer slamming the raid5 driver, so exactly why raid5 needs to
back-fill some reads is not very clear to me.  Looking at the raid5
code, it does not look like there is a real "scheduler" for deciding
when to back-fill the stripe cache, but instead it just relies on
thread round trips.  In my case, I am testing on server-class systems
with 8 or 16 3GHz threads, so availability of CPU cycles for the raid5
code is very high.

My patch ended up special-casing a single inbound bio that contains a
write for exactly one full raid stripe.  For an 8-drive raid-5 with
64K chunks, this is 7 * 64K, or an IO 448KB long.  With 4K pages this
is a bi_io_vec array of 112 pages.  Big for kernel memory generally,
but easily handled by server systems.  With more drives, you can be
talking well over 1MB in a single bio call.
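
The arithmetic above can be checked with a short sketch.  This is only
an illustration in Python, not part of the patch; the function names
and the PAGE_SIZE constant are made up for the example, and the 64K
chunk size and 4K page size are the values quoted in the text.

```python
PAGE_SIZE = 4096  # assumed 4K pages, as in the text


def full_stripe_bytes(raid_disks, chunk_bytes):
    """Data bytes in one full raid-5 stripe (one chunk per data disk)."""
    data_disks = raid_disks - 1  # one chunk per stripe holds parity
    return data_disks * chunk_bytes


def pages_per_stripe(raid_disks, chunk_bytes, page_size=PAGE_SIZE):
    """Page-sized bio vector entries needed to carry one full stripe."""
    return full_stripe_bytes(raid_disks, chunk_bytes) // page_size


if __name__ == "__main__":
    chunk = 64 * 1024
    for disks in (8, 16, 24):
        size_kb = full_stripe_bytes(disks, chunk) // 1024
        print(disks, "drives:", size_kb, "KB,",
              pages_per_stripe(disks, chunk), "pages")
```

For 8 drives this gives the 448KB and 112 pages mentioned above; at 24
drives with the same chunk size the stripe is already past 1MB.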

The patch takes this special-case write and makes sure the array is
raid-5 with layout 2, is not degraded, and is not migrating.  If all of
these are true, the code allocates ...
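
As a rough illustration of the checks listed above, here is a sketch
in Python.  It is not the patch itself: ArrayState and
is_full_stripe_write are hypothetical names invented for this example,
and the alignment test is an assumption about what "a write for a
single full raid stripe" means.  Layout 2 is md's numeric value for
the left-symmetric layout.

```python
from dataclasses import dataclass

ALGORITHM_LEFT_SYMMETRIC = 2  # md's number for the left-symmetric layout


@dataclass
class ArrayState:
    """Hypothetical stand-in for the array configuration (not a kernel struct)."""
    level: int
    layout: int
    raid_disks: int
    chunk_sectors: int
    degraded: int      # number of missing member devices
    reshaping: bool    # True while the array is migrating


def is_full_stripe_write(arr, start_sector, nr_sectors):
    """True if a write covers exactly one aligned stripe on a healthy array."""
    stripe_sectors = (arr.raid_disks - 1) * arr.chunk_sectors
    return (arr.level == 5
            and arr.layout == ALGORITHM_LEFT_SYMMETRIC
            and arr.degraded == 0
            and not arr.reshaping
            and nr_sectors == stripe_sectors          # exactly one stripe long
            and start_sector % stripe_sectors == 0)   # starts on a stripe boundary
```

Anything that fails the test would simply fall through to the normal
raid5 write path.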
From: Roberto Spadim
Date: Thursday, December 30, 2010 - 7:36 am

Could we have a separate
write algorithm
read algorithm

for each raid type? We don't need to change the default md algorithm, just
add an option to select the algorithm. That would be good, since new
developers could "plug in" new read/write algorithms.
Thanks




-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--

From: Doug Dumitru
Date: Thursday, December 30, 2010 - 11:47 am

What I have been working on does not change the raid algorithm.  The
issue is scheduling.

When raid/456 gets a write, it needs to write not only the new blocks,
but also the parity blocks that are associated.  In order to calculate
the parity blocks, it needs data from other blocks in the same stripe
set.  The question is: a) should the raid code issue read requests for
the needed blocks, or b) should the raid code wait for more write
requests, hoping that these requests will contain data for the needed
blocks?  Both of these approaches are wrong some of the time.  To make
things worse, with some drives, guessing wrong just a fraction of a
percent of the time can hurt performance dramatically.
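
To put numbers on the read cost, here is a simplified model in Python.
It is a sketch only, assuming one block per data disk at the same
stripe offset; the function names are invented for the example and
this is not how raid5.c is structured.  It counts the reads needed to
update parity either by read-modify-write (old data plus old parity)
or by reconstruct-write (the blocks not being written).

```python
def reads_for_rmw(data_disks, blocks_written):
    """Read-modify-write: read the old copy of each written block plus old parity."""
    return blocks_written + 1


def reads_for_rcw(data_disks, blocks_written):
    """Reconstruct-write: read every data block that is NOT being written."""
    return data_disks - blocks_written


def min_reads(data_disks, blocks_written):
    """Fewest reads needed before parity for this stripe can be computed."""
    return min(reads_for_rmw(data_disks, blocks_written),
               reads_for_rcw(data_disks, blocks_written))
```

With 7 data disks, a single-block write costs 2 reads, a 6-block write
costs 1 read, and only a write covering all 7 blocks costs none.
Waiting for the missing blocks to arrive as writes is what avoids
those reads, at the price of added latency when they never show up.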

In my case, if the raid code can get an entire stripe in a single
write request, then it can bypass most of the raid logic and just
"compute and go".  Unfortunately, such big requests break a lot of
conventions about how big requests can be, especially for large drive
count arrays.
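
A minimal sketch of "compute and go" for one full stripe, in Python
for illustration only.  The helper names are invented here, the chunks
are assumed to be of equal length, and the chunk placement follows my
reading of the left-symmetric layout (layout 2); none of this is the
actual patch code.

```python
def xor_parity(chunks):
    """XOR equal-length chunks together into a single parity chunk."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)


def place_left_symmetric(stripe_nr, data_chunks):
    """Return the per-disk chunks for one full stripe, parity included.

    Assumes the left-symmetric layout: parity rotates backwards one disk
    per stripe and data starts on the disk after parity, wrapping around.
    """
    data_disks = len(data_chunks)
    raid_disks = data_disks + 1
    pd_idx = data_disks - (stripe_nr % raid_disks)   # disk that holds parity
    per_disk = [None] * raid_disks
    per_disk[pd_idx] = xor_parity(data_chunks)       # no reads needed at all
    for d, chunk in enumerate(data_chunks):
        per_disk[(pd_idx + 1 + d) % raid_disks] = chunk
    return per_disk
```

Because every data chunk of the stripe is already in hand, each
per-disk buffer can be submitted right away with no reads and no
waiting on other requests.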

Doug Dumitru
EasyCo LLC




