dd uses unaligned, stack-allocated buffers and defaults to block-sized I/O
(512 bytes unless you pass bs=). To call this inefficient is a gross
understatement. Modern applications that care about streaming I/O performance
use large, aligned buffers, which let the kernel handle the transfers
efficiently, or they use direct I/O and manage caching themselves, or they use
system calls like fadvise, madvise, splice, etc. that either tell the kernel
how they intend to use the data or hand the work off to the kernel completely.
dd is designed to be incredibly lightweight, so it works very well on a box
with a 16 MHz CPU. It was *not* designed to take advantage of the resources
modern systems have available to enable scalability.
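
To make the contrast concrete, here is a rough sketch of the "large, aligned
buffer plus a hint to the kernel" approach. It is only an illustration, not
how any particular application does it: the function name stream_copy and the
4 MiB buffer size are my own assumptions, and the fadvise call is skipped on
platforms that don't provide it.

```python
import mmap
import os

# Illustrative size only (an assumption, not a tuned value).
BUF_SIZE = 4 * 1024 * 1024


def stream_copy(src_path, dst_path, buf_size=BUF_SIZE):
    """Copy src_path to dst_path through one large, page-aligned buffer.

    Hypothetical helper for illustration. Returns the number of bytes copied.
    """
    total = 0
    # An anonymous mmap is page-aligned, unlike a small stack buffer.
    with mmap.mmap(-1, buf_size) as buf, memoryview(buf) as view:
        # buffering=0 gives raw file objects, so reads land directly in buf.
        with open(src_path, "rb", buffering=0) as src, \
                open(dst_path, "wb", buffering=0) as dst:
            # Tell the kernel we will read sequentially, where supported.
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(src.fileno(), 0, 0,
                                 os.POSIX_FADV_SEQUENTIAL)
            while True:
                n = src.readinto(view)
                if not n:
                    break
                # Raw writes may be short, so loop until the chunk is out.
                off = 0
                while off < n:
                    off += dst.write(view[off:n])
                total += n
    return total
```

Something like stream_copy("/path/to/input", "/path/to/output") issues far
fewer, far larger reads and writes than dd does with its default block size.
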
I suggest benchmarking with a workload that resembles the application you'll
actually be running.
-- Chris