I'm posting here because I couldn't find any information about this
elsewhere: on the same hardware (Athlon X2 3500+, 512 MB RAM, 2x400 GB
Hitachi SATA2 hard drives), Linux software RAID-1 on 2.4 (tested 2.4.32
and 2.4.36.2, slightly patched to recognize the hardware :p) is way
faster than on 2.6 (tested 2.6.17.13, 2.6.18.8, 2.6.22.16, 2.6.24.3),
especially for writes. I ran the test on several different
machines (same hard drives though) and the result was consistent across
the board, with /mountpoint a software RAID-1.

Checking disk activity with iostat or vmstat clearly shows a
cache effect much more pronounced on 2.4 (i.e. writing goes on much
longer in the background), but it doesn't really account for the
difference. I've also tested it through NFS from another machine (Gigabit
Ethernet network):

dd if=/dev/zero of=/mountpoint/testfile bs=1M count=1024

kernel    2.4        2.6        2.4 thru NFS    2.6 thru NFS
write     90 MB/s    65 MB/s    70 MB/s         45 MB/s
read      90 MB/s    80 MB/s    75 MB/s         65 MB/s

Duh. That's terrible. Does it mean I should stick to (heavily
patched...) 2.4 for my file servers or... ? :)
--
--------------------------------------------------
Emmanuel Florac www.intellique.com
--------------------------------------------------
--
Keep in mind that the above test tests two subsystems at the same
time: RAID-1 + the filesystem on top of it. If you want to test RAID-1
performance you should specify a raw device to of=... instead of a
file (and the direct I/O flags). There are considerable performance
differences between filesystems for large files. E.g. XFS is a lot
faster than ext3 for large files (gigabytes).

Bart.
--
On Wed, 26 Mar 2008 09:42:19 +0100
I usually use XFS, and I've also checked against the raw devices,
and it looks the same (2.4 still faster). I must add that the difference
is somewhat reduced when using a single disk drive vs. RAID-1,
obviously due to a different buffering policy in the RAID subsystem.
--
----------------------------------------
Emmanuel Florac | Intellique
------------------------------------------
On Wed, Mar 26, 2008 at 12:07 PM, Emmanuel Florac
You are welcome to post the numbers you obtained with dd for direct
I/O on a RAID-1 setup for the 2.4 versus the 2.6 kernel.

Bart.
--
On Wed, 26 Mar 2008 12:15:57 +0100
Here we go (tested on slightly slower hardware: Athlon64 3000+,
nVidia chipset). Actually, the direct I/O result is identical. However, the
significant number for the end user in this case is the NFS throughput.

2.4 kernel (2.4.32), async write
--------------------------------
root@0[root]# ./dd if=/dev/zero of=/mnt/raid/testdd01 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 13.407 seconds, 80.1 MB/s

2.4 kernel (2.4.32), async write thru NFS mount
--------------------------------
emmanuel[/mnt/temp]$ dd if=/dev/zero of=./testdd01 bs=1M count=1024
1024+0 enregistrements lus
1024+0 enregistrements écrits
1073741824 bytes (1,1 GB) copied, 15,5176 s, 69,2 MB/s

2.4 kernel (2.4.32), async read
--------------------------------
root@0[root]# ./dd if=/mnt/raid/testdd01 of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 15.752 seconds, 68.2 MB/s

2.4 kernel (2.4.32), sync write
--------------------------------
root@0[root]# ./dd if=/dev/zero of=/mnt/raid/testdd01 bs=1M count=1024 \
oflag=direct,dsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 21.7874 seconds, 49.3 MB/s

2.6 kernel (2.6.22.18), async write
--------------------------------
root@0[root]# ./dd if=/dev/zero of=/mnt/raid/testdd02 bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 17.1347 seconds, 62.7 MB/s

2.6 kernel (2.6.22.18), async write thru NFS mount
--------------------------------
emmanuel[/mnt/temp]$ dd if=/dev/zero of=./testdd02 bs=1M count=1024
1024+0 enregistrements lus
1024+0 enregistrements écrits
1073741824 bytes (1,1 GB) copied, 21,3618 s, 50,3 MB/s

2.6 kernel (2.6.22.18), async read
--------------------------------
root@0[root]# ./dd if=/mnt/raid/testdd02 of=/dev/null bs=1M
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 15.7599 seconds, 68.1 MB/s

2.6 kernel (2.6.22.18), sync write
...
The time you usually want to measure is time to get all data to another
drive. In that case fdatasync allows typical buffering while waiting at
the end of the copy until all bytes are on the destination platter. That
doesn't change the speed, just makes the numbers more stable. That's the
one I use, since most simple applications just use write() to send data.
This may or may not provide numbers more representative of your application.
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
--
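A minimal sketch of the fdatasync approach described above, assuming GNU dd
(conv=fdatasync is a GNU coreutils option). The helper name write_flushed and
the path in the usage line are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: write COUNT_MIB MiB of zeros to FILE using ordinary
# buffered write() calls, then wait for the data to reach the device.
write_flushed() {
    file=$1
    count_mib=$2
    # conv=fdatasync makes dd call fdatasync() once, after the last write,
    # so the elapsed time and MB/s it reports include the final flush.
    dd if=/dev/zero of="$file" bs=1M count="$count_mib" conv=fdatasync
}

# Usage (path is an example only):
#   write_flushed /mnt/raid/testdd01 1024
```

Buffering during the copy stays as it is for a plain dd run; only the end of
the measurement changes.
--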
It's good to see that the synchronous write throughput is identical
for the 2.4.32 and 2.6.22.18 kernels.

Regarding NFS: there are many parameters that influence NFS
performance. Are you using the userspace NFS daemon or the NFS daemon
in the kernel? Telling NFS to use TCP instead of UDP
usually increases performance, as does increasing the read and
write block size. And if only a single client accesses the
NFS filesystem, you can increase the attribute cache timeout in order
to decrease the number of NFS getattr calls. You could e.g. try the
following command on the client:

mount -o remount,actimeo=86400,rsize=1048576,wsize=1048576,nfsvers=3,tcp,nolock /mnt/temp

Please read the nfs(5) man page before using these parameters in a
production environment. Note: the output of the nfsstat command can be
helpful when optimizing NFS performance.

Bart.
--
Unfortunately this shows the same trend as kernel compile, small
database operations, etc. If you are using a journaling filesystem on
2.6 and not on 2.4, be sure you have the filesystem mounted "noatime" or
retest with a non-journaled f/s. If you are running LVM in the test, all
bets are off, as there are alignment issues (see linux-raid archives) to
consider.

But the trend has unfortunately been toward slower performance, and the
usual responses demand that you use another benchmark, say that kernel
compile is not a benchmark, suggest using Postgres or Oracle instead of
MySQL, etc.

I wish it were not so; there seems to be more effort going into explaining
results than improving them. That said, tuning the location of the f/s,
the stride, chunk size, etc., can improve things, and there are patches
available for test (linux-raid again) which will address some of this
fairly soon.
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
--
It means you shouldn't use dd as a benchmark.
-- Chris
--
If you want to benchmark write speed, you should add
oflag=direct,dsync to the dd command line. For benchmarking read speed
you should specify iflag=direct. Or, even better, you can use xdd with
the flags -dio -processlock.

Bart.
--
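A sketch of the direct I/O invocations suggested above, assuming GNU dd. The
helper names direct_write and direct_read are hypothetical, and the fallback
to conv=fdatasync is an addition for filesystems that reject O_DIRECT, not
something proposed in the thread:

```shell
#!/bin/sh
# Hypothetical helper: sequential write that bypasses the page cache
# (oflag=direct) and waits for each block to be stable (dsync).
direct_write() {
    file=$1
    count_mib=$2
    if dd if=/dev/zero of="$file" bs=1M count="$count_mib" oflag=direct,dsync; then
        return 0
    fi
    # Some filesystems do not support O_DIRECT; fall back to a buffered
    # write that is flushed once at the end, so the run still completes.
    echo "direct I/O failed on $file, falling back to conv=fdatasync" >&2
    dd if=/dev/zero of="$file" bs=1M count="$count_mib" conv=fdatasync
}

# Hypothetical helper: sequential read that bypasses the page cache.
direct_read() {
    dd if="$1" of=/dev/null bs=1M iflag=direct
}

# Usage (path is an example only):
#   direct_write /mnt/raid/testdd01 1024 && direct_read /mnt/raid/testdd01
```

Numbers from the fallback path are not comparable with true direct I/O
results, so the message on stderr is worth keeping with the output.
--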
No, you want your benchmark to measure performance doing what the
application does. So unless you have an application which has been
heavily Linux-ized, you don't want to measure something unrelated to the
application requirements.
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
--
A basic fact I learned in science classes: if you measure something,
know very well what you measure and make sure your measurement is
repeatable. But it was some time ago I learned this. Maybe the whole
world changed since I learned that?

Bart.
--
Sounds like we're saying the same thing. For naive applications dd is
probably a closer model without direct or fconv, while if you want to
see what you could gain using additional features those are useful
options. I believe Chris was talking about the max speed possible, which
is a good thing to know but not similar to simple programming or shell
scripts using sed, grep, etc.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
--
Thanks for the tip, I'll try these.
--
--------------------------------------------------
Emmanuel Florac www.intellique.com
--------------------------------------------------
--
What do you use as a benchmark for writing large sequential files or
reading them, and why is it better than dd at modeling programs which
read or write in a similar fashion?

Media programs often do data access in just this fashion: multi-channel
video capture, streaming video servers, and similar.
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
--
dd uses unaligned stack-allocated buffers, and defaults to block sized I/O. To
call this inefficient is a gross understatement. Modern applications which care
about streaming I/O performance use large, aligned buffers which allow the
kernel to efficiently optimize things, or they use direct I/O to do it
themselves, or they make use of system calls like fadvise, madvise, splice, etc.
that inform the kernel how they intend to use the data or pass the work off to
the kernel completely. dd is designed to be incredibly lightweight, so it works
very well on a box with a 16 MHz CPU. It was *not* designed to take advantage
of the resources modern systems have available to enable scalability.

I suggest an application-oriented benchmark that resembles the application
you'll actually be using.

-- Chris
--
I was trying to speed up an app I wrote which streams parts of a large file
to separate files, and tested your advice above (on ext3 on 2.6.24.5-85.fc8).

I tested reading blocks of 4096, both to stack and page-aligned buffers,
but there was a negligible difference in CPU usage between the
aligned and non-aligned buffer cases.
I guess the kernel could be clever and only copy the page to userspace
on modification in the page-aligned case, but the benchmarks at least
don't suggest this is what's happening?

What difference exactly should be expected from using page-aligned buffers?
Note I also tested using mmap to stream the data, and there is a significant
decrease in CPU usage in user and kernel space, as expected, due to the
data not being copied from the page cache.

thanks,
Pádraig.
--
Page alignment, by itself, doesn't do much, but it implies a couple of
things:

1) cache line alignment, which matters more with some architectures than
   others
2) block alignment, which is necessary for direct I/O

You're on the right track with mmap, but you may want to use madvise()
to tune the readahead on the pagecache.

-- Chris
--
dd has been capable of doing direct I/O for years, so I assume it can
emulate that behavior if it is appropriate to do so, and the buffer size
can be set as needed. I'm less sure that large buffers are allocated on
the stack, but often the behavior of the application models is the small

And this is what I was saying earlier: there is a trend to blame the
benchmark when in fact the same benchmark runs well on 2.4. Rather than
replacing the application or benchmark, perhaps the *regression* could
be fixed in the kernel. With all the mods and queued I/O and everything,
the performance is still going down.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
--
2.6 has been designed to scale, and scale it does. The cost is added
overhead for naively designed applications, which dd quite
intentionally is. Simply enabling direct I/O in dd accomplishes nothing if
the I/O patterns you're instructing it to perform are not optimized. If
I/O performance is important to you, you really need to optimize your
application or tune your kernel for I/O performance.

If you have a performance-critical application that is designed in a
manner such that a naive dd invocation is an accurate benchmark for it,
you should file a bug with the developer of that application.

I've long since lost count of the number of times that I've seen
optimizing for dd absolutely kill real application performance.

-- Chris
--
As I mentioned, it looks like 2.4 actually buffers write data on RAID-1,
which is inherently bad (after all, if I do RAID-1 it's for the sake of
data integrity, and write caching just counters that).
However bad dd may be, it broadly reflects my problem: on small
systems using software RAID, I/O is overall way better with 2.4 than
2.6, especially NFS throughput.
Though I can substantially enhance 2.6 performance through tweaking
(playing with read-ahead, disk queue length, etc.), it still lags behind
2.4 with default settings by a clear margin (10% or more).
Fortunately, this isn't true of larger systems with 12, 24 or 48 disk
drives, hardware RAID, Fibre Channel et al.
--
--------------------------------------------------
Emmanuel Florac www.intellique.com
--------------------------------------------------
--
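For scale, the gap implied by the dd figures posted earlier in this thread
(80.1 vs 62.7 MB/s for local async writes, 69.2 vs 50.3 MB/s over NFS, all
with default settings) can be worked out with a few lines of shell. The
helper name pct_slower is hypothetical:

```shell
#!/bin/sh
# pct_slower BASE OTHER: how much slower OTHER is than BASE, in percent.
pct_slower() {
    # LC_ALL=C keeps the decimal point a "." regardless of the user's locale.
    LC_ALL=C awk -v base="$1" -v other="$2" \
        'BEGIN { printf "%.1f\n", (base - other) / base * 100 }'
}

pct_slower 80.1 62.7   # local async write, 2.4.32 vs 2.6.22.18: prints 21.7
pct_slower 69.2 50.3   # async write over NFS:                   prints 27.3
```

These are single runs on one machine, so the percentages describe those runs
and nothing more general.
--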
This sounds more like a VM issue than a RAID issue. I suspect the
interesting difference between your small systems and your large systems
is the amount of RAM, not the storage. On small systems, the penalty
for sizing caches incorrectly is much greater, so small systems tend to
suffer more if the default tunings are a little off.

If you do some VM tuning (particularly in /proc/sys/vm) and find that it
makes a large difference, please do report it. Most of the exciting VM
work is targeted to the high end, not the low end, so it's quite
possible that the heuristics which choose default VM parameters at boot
time are no longer as good for small systems as they once were.

-- Chris
--
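As a starting point for the /proc/sys/vm tuning suggested above, a read-only
sketch that records the write-back settings alongside a benchmark run. The
choice of these four knobs is an assumption about what is relevant here; the
names are those of 2.6 kernels (2.4 exposes a different set), and the helper
name show_vm_writeback is made up:

```shell
#!/bin/sh
# Print the current dirty-page write-back tunables. Nothing is modified.
show_vm_writeback() {
    for knob in dirty_ratio dirty_background_ratio \
                dirty_expire_centisecs dirty_writeback_centisecs; do
        path=/proc/sys/vm/$knob
        if [ -r "$path" ]; then
            printf '%s = %s\n' "$knob" "$(cat "$path")"
        else
            # Knob missing or unreadable on this kernel
            printf '%s = n/a\n' "$knob"
        fi
    done
}

show_vm_writeback
```

Changing a value is done by writing to the same file as root; doing that one
knob at a time and re-running the same dd command keeps the comparison fair.
--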
Not bad, if buffer flushing is secure. You just have a 'one buffer size'
delay. If your system crashes, think it just crashed 'one buffer size'

But you shouldn't have to tweak anything.
Let's forget for a moment calling dd a 'benchmark'. The fact is that a
certain program (in its default behaviour, dd if=xxx of=yyy) is waaay
slower in 2.6 than in 2.4. So something has gone nuts.
The typical question is 'who cares about dd?'. And the answer: all normal
applications that just read and write, that do not use any *advise()
because they try to be portable, that are not rewritten and fancily
optimized to take advantage of the latest kernel knobs; in short, any normal
app that just fopen()s and fread()s...
Seriously, are people telling me that I have to tweak my app to get the
same performance as in 2.4? The basic performance should be the same,
and all those knobs should let you get _better_ throughput, not just the
same. To say anything else is to bury your head in the sand...
--
J.A. Magallon <jamagallon()ono!com>     \               Software is like sex:
                                         \         It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
--
Ah, agreement from another direction. Yes, portable solutions often use
tools like awk, grep, and sed, which do just the very thing which makes
2.6 unhappy. And to claim that this is a VM problem (hopefully not) and
that Linux is no longer good at complex tasks like copying a large file
a line at a time, at least unless you have GBs of memory so it can

Major company seen in the news which has custom QA hardware which writes
one very long line of ASCII every 100ms for 17hrs, and now they have to

I just copied an 8GB DVD image from one drive to another. I checked, it
doesn't go through a dial-up modem, it's just painfully *slow*.

I think the problem is that many developers *do* use big machines, with
fast disks and lots of memory, and don't spend much time using (or making
useful) more typical desktop configurations. And yes, these are "real"
applications, and they run better on 2.4.
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
--
Sure, however my application may behave similarly to dd, or worse, some
external entity I don't have control over (read: an annoying
customer) may use dd as a benchmark and draw fallacious assumptions I
need to sort out :)
--
--------------------------------------------------
Emmanuel Florac www.intellique.com
--------------------------------------------------
--
