Hello all,
I have been using an in-house mod to the raid5.c driver to optimize
for linear writes. The optimization is probably too specific for
general kernel inclusion, but I wanted to throw out what I have been
doing in case anyone is interested.
The application involves a kernel module that can produce precisely
aligned, long, linear writes. In the case of raid-5, the obvious plan
is to issue writes that are complete raid stripes of
'optimal_io_length'.
Unfortunately, optimal_io_length is often less than the advertised max
io_buf size value and sometimes less than the system max io_buf size
value. Thus just pumping up the max value inside of raid5 is dubious.
Dubious or not, simply raising
mddev->queue->limits.max_hw_sectors does seem to work, does not break
anything obvious, and does help performance a little.
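
For anyone who wants to compare these limits on their own array, here is
a rough user-space sketch. It assumes that 'optimal_io_length' above
corresponds to the sysfs attribute queue/optimal_io_size (in bytes) and
that the hardware limit shows up as queue/max_hw_sectors_kb (in KB). The
helper names read_queue_attr and stripe_fits are made up for illustration.

```c
#include <stdio.h>

/* Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_queue_attr(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Nonzero if a single IO of optimal_io_bytes fits under a limit that is
 * expressed in KB, as max_hw_sectors_kb is. */
static int stripe_fits(unsigned long long optimal_io_bytes,
                       unsigned long long max_hw_kb)
{
    return optimal_io_bytes <= max_hw_kb * 1024ULL;
}
```

To use it, read /sys/block/md0/queue/optimal_io_size and
/sys/block/md0/queue/max_hw_sectors_kb with read_queue_attr (md0 is only
an example device name) and pass both values to stripe_fits.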
In looking at long linear writes with the stock raid5 driver, I am
seeing a small number of reads to individual devices. The test
application code calling the raid layer has > 100MB of locked kernel
buffer slamming the raid5 driver, so exactly why raid5 needs to
back-fill some reads is not very clear to me. Looking at the raid5
code, it does not look like there is a real "scheduler" for deciding
when to back-fill the stripe cache, but instead it just relies on
thread round trips. In my case, I am testing on server-class systems
with 8 or 16 3GHz threads, so availability of CPU cycles for the raid5
code is very high.
My patch ended up special-casing a single inbound bio that contains a
write for a single full raid stripe. For an 8-drive raid-5 with 64K
chunks, this is 7 * 64K = 448KB. With 4K pages this is a bi_io_vec array
of 112 pages. Big for kernel memory generally, but easily handled by
server systems. With more drives, you can be talking well over 1MB in
a single bio call.
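
The arithmetic above can be sketched in plain user-space C. This is only
an illustration, not code from the patch; the function names are
hypothetical, and it assumes one parity chunk per stripe (raid-5) and a
chunk size that is a multiple of the page size.

```c
#include <assert.h>
#include <stddef.h>

/* Data bytes in one full raid-5 stripe: every drive but one carries a
 * data chunk, the remaining one carries parity. */
static size_t full_stripe_bytes(unsigned int ndrives, size_t chunk_bytes)
{
    assert(ndrives >= 3);            /* raid-5 needs at least 3 drives */
    return (size_t)(ndrives - 1) * chunk_bytes;
}

/* Number of page-sized segments needed to carry one full stripe in a
 * single bio (one bi_io_vec entry per page). */
static size_t full_stripe_pages(unsigned int ndrives, size_t chunk_bytes,
                                size_t page_bytes)
{
    assert(page_bytes > 0 && chunk_bytes % page_bytes == 0);
    return full_stripe_bytes(ndrives, chunk_bytes) / page_bytes;
}
```

With 8 drives, 64K chunks and 4K pages this gives 448KB and 112 pages,
matching the numbers above. With 24 drives and the same chunk size it
gives 1472KB, which is the "well over 1MB" case.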
The patch takes this special-case write and checks that the array is
raid-5 with layout 2 (left-symmetric), is not degraded, and is not
migrating. If all of these are true, the code allocates ...