Hello,

Is it possible to 'optimize' ext4 so it is as fast as XFS for writes? I see about half the performance of XFS for sequential writes. I have checked the docs and tried several options, a few of which are shown below (I have also tried the commit/journal_async/etc. options, but none of them get the write speeds anywhere near XFS).

Sure, 'dd' is not a real benchmark, etc., but with 10Gbps between 2 hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s on writes. When it was XFS I used to get 400-600MiB/s for writes on the same RAID volume.

How do I 'speed up' ext4? Is it possible?

raid0_11 disks: (XFS)
# /dev/md0 /r1 xfs noatime 0 1
p63:/r1# dd if=/dev/zero of=bigfile1 bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.1021 s, 593 MB/s
p63:/r1#

raid0_11 disks: (EXT4)
# /dev/md0 /r1 ext4 noatime 0 1
# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.3741 s, 304 MB/s
p63:/r1#

Other tests (ext4)

p63:~# mount /dev/md0 /r1 -o data=writeback
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.8746 s, 269 MB/s
p63:/r1#

p63:~# mount /dev/md0 /r1 -o data=writeback,nobarrier
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 40.0656 s, 268 MB/s

Justin.
--
I don't know how to speed up ext4, but I do know how to slow down XFS :) It seems that you forgot to call fsync at the end of the file write. In that case some data may still reside in the memory cache. Please add "conv=fsync" or "conv=fdatasync" to the dd command. --
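The point about fsync can be sketched as a small wrapper: with conv=fdatasync, dd flushes the file data to disk before it exits, so the time and rate it reports include the flush rather than just the copy into the page cache. This is only a minimal sketch, assuming GNU dd; the function name write_synced and the 4 MiB test size are illustrative and not from the thread.

```shell
#!/bin/sh
# write_synced FILE COUNT_MIB (hypothetical helper name)
# Write COUNT_MIB MiB of zeros to FILE. conv=fdatasync makes dd flush the
# file data to disk before exiting, so the rate dd prints on stderr
# includes the flush. Assumes GNU dd (bs=1M and conv=fdatasync syntax).
write_synced() {
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=fdatasync
}

# Small demonstration in a scratch directory (size is illustrative only).
tmpdir=$(mktemp -d)
write_synced "$tmpdir/testfile" 4
rm -rf "$tmpdir"
```

For a real comparison the file would be written on the filesystem under test, with a size well above the machine's RAM-backed cache, as in the 10 GiB runs above.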
Hi,

First with a sync added in the total time (still 2x as fast)

EXT3:
p63:~# mount /dev/md0 -o nobarrier,data=writeback /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M count=10240; sync'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.4163 s, 303 MB/s
0.02user 19.85system 0:36.97elapsed 53%CPU (0avgtext+0avgdata 7296maxresident)k
0inputs+0outputs (5major+1145minor)pagefaults 0swaps

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M count=10240; sync'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.08 s, 594 MB/s
0.03user 16.15system 0:18.67elapsed 86%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (5major+1147minor)pagefaults 0swaps
p63:/r1#

Per your request: conv=fsync & conv=fdatasync

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fsync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.2142 s, 590 MB/s
0.03user 16.05system 0:18.21elapsed 88%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (0major+832minor)pagefaults 0swaps
p63:/r1#

EXT3:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fdatasync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.5562 s, 271 MB/s

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fdatasync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.513 s, 580 MB/s
0.03user 16.25system 0:18.51elapsed 87%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (5major+828minor)pagefaults 0swaps
p63:/r1#

p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fsync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.7859 s, 270 MB/s
0.02user 24.20system 0:39.79elapsed 60%CPU (0avgtext+0avgdata 7328maxresident)k
0inputs+0outputs (5major+829minor)pagefaults ...
Aside from Dmitry's suggestion to time sync as well (although for 10G, you are likely not leaving much in cache) I'd ask: What kernel version? what xfsprogs/e2fsprogs version? Were the filesystems created to align with raid geometry? mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically) with recent kernel+e2fsprogs. --
2.6.33/x86_64
ii xfsprogs 3.1.1 Utilities for managing the XFS filesystem
Only default options were used except the mount options. If that is the
How recent?
--
Justin Piszcz wrote:
You're recent enough. :)
mkfs.ext4 output should include the stripe info if it was found.
    printf(_("Block size=%u (log=%u)\n"), fs->blocksize,
           s->s_log_block_size);
    printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
           s->s_log_frag_size);
    printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
           s->s_raid_stride, s->s_raid_stripe_width);
    printf(_("%u inodes, %llu blocks\n"), s->s_inodes_count,
           ext2fs_blocks_count(s));
etc.
-Eric
--
Oh, you need very recent util-linux-ng as well, and use libblkid from there with: [e2fsprogs] # ./configure --disable-libblkid Otherwise you can just feed mkfs.ext4 stripe & stride manually. -Eric --
Hi,

Even when set, there is still poor performance:

http://busybox.net/~aldot/mkfs_stride.html
Raid Level: 0
Number of Physical Disks: 11
RAID chunk size (in KiB): 1024
number of filesystem blocks (in KiB)
mkfs.ext4 -b 4096 -E stride=256,stripe-width=2816

p63:~# /usr/bin/time mkfs.ext4 -b 4096 -E stride=256,stripe-width=2816 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=256 blocks, Stripe width=2816 blocks
335765504 inodes, 1343055824 blocks
67152791 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
40987 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544

Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 38 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
p63:~#

p63:~# mount /dev/md0 /r1 -o nobarrier,data=writeback
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.3674 s, 273 MB/s
p63:/r1#

Still very slow? Let's try with some optimizations:

p63:/r1# mount /dev/md0 /r1 -o noatime,barrier=0,data=writeback,nobh,commit=100,nouser_xattr,nodelalloc,max_batch_time=0^C

Still not anywhere near 500-600MiB/s of XFS:

p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 30.4824 s, 352 MB/s
p63:/r1#

Am I doing something wrong/is there a flag I am missing that will speed it up? Or is this performance for sequential writes on EXT4?

Justin.
--
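The stride and stripe-width values passed to mkfs.ext4 above follow from the RAID geometry: stride is the RAID chunk size divided by the filesystem block size, and stripe-width is stride multiplied by the number of data-bearing disks. A minimal sketch of that arithmetic, assuming RAID-0 (so all 11 disks hold data); the function names are illustrative and not part of e2fsprogs, and the script only prints the resulting command rather than running it:

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not part of e2fsprogs).

# ext4_stride CHUNK_KIB BLOCK_BYTES
# stride = RAID chunk size / filesystem block size, in filesystem blocks.
ext4_stride() {
    echo $(( $1 * 1024 / $2 ))
}

# ext4_stripe_width STRIDE DATA_DISKS
# stripe-width = stride * number of data-bearing disks
# (for RAID-0 that is every disk in the array).
ext4_stripe_width() {
    echo $(( $1 * $2 ))
}

stride=$(ext4_stride 1024 4096)
width=$(ext4_stripe_width "$stride" 11)

# Print (do not execute) the mkfs command these values correspond to.
echo "mkfs.ext4 -b 4096 -E stride=$stride,stripe-width=$width /dev/md0"
```

With a 64KiB chunk instead of 1024KiB, the same arithmetic gives stride=16 and stripe-width=176 for 11 disks.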
I also tried with the default chunk size (64KiB) in case ext4 had a problem with chunk sizes > 64KiB; the results were the same for ext4. I also tried ext2 & ext3 just to see what their performance would be:

p63:~# mkfs.ext2 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 19.9434 s, 538 MB/s
p63:/r1#

p63:~# mkfs.ext3 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 31.0195 s, 346 MB/s

p63:~# mkfs.ext4 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 35.3866 s, 303 MB/s

And, for comparison, XFS:

p63:~# mkfs.xfs -f /dev/md0 > /dev/null 2>&1
p63:~# mount /dev/md0 /r1
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.1527 s, 592 MB/s
p63:/r1#
--
Hi, I have found the same results on 2 different systems: It seems to peak at ~350MiB/s performance on mdadm raid, whether a RAID-5 or RAID-0 (two separate machines): The only option I found that allows it to go from: 10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s to 10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s Is the -o nodelalloc option. How come it is not breaking the 350MiB/s barrier is the question? Justin. --
Besides large sequential I/O, ext4 seems to be MUCH faster than XFS when working with many small files.

EXT4

p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
0.18user 2.43system 0:02.86elapsed 91%CPU (0avgtext+0avgdata 5216maxresident)k
0inputs+0outputs (0major+971minor)pagefaults 0swaps
linux-2.6.33 linux-2.6.33.tar
p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
0.02user 0.98system 0:01.03elapsed 97%CPU (0avgtext+0avgdata 5216maxresident)k
0inputs+0outputs (0major+865minor)pagefaults 0swaps

XFS

p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
0.20user 2.62system 1:03.90elapsed 4%CPU (0avgtext+0avgdata 5200maxresident)k
0inputs+0outputs (0major+970minor)pagefaults 0swaps
p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
0.03user 2.02system 0:29.04elapsed 7%CPU (0avgtext+0avgdata 5200maxresident)k
0inputs+0outputs (0major+864minor)pagefaults 0swaps

So I guess that's the tradeoff, for massive I/O you should use XFS, else, use EXT4?

I still would like to know however, why 350MiB/s seems to be the maximum performance I can get from two different md raids (that easily do 600MiB/s with XFS).

Is this a performance issue within ext4 and md-raid? The problem does not exist with xfs and md-raid.

Justin.
--
Can you run "filefrag -v <filename>" on the large file you created using dd? Part of the problem may be the block allocator simply not being well optimized for super large writes. To be honest, that's not something we've tried (at all) to optimize, mainly because most users of ext4 are more interested in much more reasonably sized files, and we only have so many hours in a day to hack on ext4. :-)

XFS in contrast has in the past had plenty of paying customers interested in writing really large scientific data sets, so this is something Irix *has* spent time optimizing. As far as I know none of the ext4 developers' day jobs are currently focused on really large files using ext4. Some of us do use ext4 to support really large files, but it's via some kind of cluster or parallel file system layered on top of ext4 (i.e., Sun/Clusterfs Lustre File Systems, or Google's GFS) --- and so what gets actually stored in ext4 isn't a single 10-20 gigabyte file.

I'm saying this not as an excuse; but it's an explanation for why no one has really noticed this performance problem until you brought it up. I'd like to see ext4 be a good general purpose file system, which includes handling the really big files stored in a single system. But it's just not something we've tried optimizing yet.

So if you can gather some data, such as the filefrag information, that would be a great first step. Something else that would be useful is gathering blktrace information, so we can see how we are scheduling the writes and whether we have something bad going on there. I wouldn't be surprised if there is some stupidity going on in the generic FS/MM writeback code which is throttling us, and which XFS has worked around. Ext4 has worked around some writeback brain-damage already, but I've been focused on much smaller files (files in the tens or hundreds of megabytes) since that's what I tend to use much more frequently.

It's great to see that you're really interested in this; if you're willing to ...
Yes, this is shown at the bottom of the e-mail both with -o data=ordered
and data=writeback.
This is more dramatic on the software raid (mdadm) RAID-5 configuration.
Without -o nodelalloc, I see roughly 200MiB/s. With -o nodelalloc, I hit
the same barrier as the RAID-0, 350MiB/s, but its effect on RAID-0 is less
dramatic. The full tests and output appear at the bottom of this e-mail;
however, for brevity, the example below shows 55MiB/s and 132MiB/s
performance increases with RAID-0 and RAID-5 respectively:
For the RAID-0:
-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
An increase of 55MiB/s.
For the RAID-5 (from earlier testing):
-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s
=== CREATE RAID-0 WITH 11 DISKS
p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-l]1 --level=0 -n 11 -c 64
mdadm: /dev/sdb1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdj1 appears to be ...

Have you tried testing with "nice" numbers of disks in your RAID set (e.g. 8 disks for RAID-0, 9 for RAID-5, 10 for RAID-6)? The mballoc code is really much better tuned for power-of-two sized allocations.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--
Hi, Yes, the second system (RAID-5) has 8 disks and it shows the same performance problems with ext4 and not XFS (as shown from previous e-mail), where XFS usually got 500-600MiB/s for writes. http://groups.google.com/group/linux.kernel/browse_thread/thread/e7b189bcaa2c1cb4/ad6c... For the RAID-5 (from earlier testing): <- This one has 8 disks. -o data=writeback,nobarrier: 10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s -o data=writeback,nobarrier,nodelalloc: 10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s An increase of 132MiB/s. Justin. --
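The quoted gains can be re-derived from the raw dd figures. Note that dd reports decimal MB/s (10^6 bytes per second), even though the thread writes the same numbers as MiB/s. A minimal sketch, assuming awk is available; the helper name dd_rate is illustrative only:

```shell
#!/bin/sh
# dd_rate BYTES SECONDS (hypothetical helper name)
# Throughput in decimal MB/s (10^6 bytes per second), rounded to an
# integer, which is the unit dd itself prints.
dd_rate() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1000000 }'
}

# RAID-5 figures from the message above.
before=$(dd_rate 10737418240 48.7335)   # -o data=writeback,nobarrier
after=$(dd_rate 10737418240 30.5425)    # same options plus nodelalloc

echo "$before MB/s -> $after MB/s (gain: $(( after - before )) MB/s)"
```

The same helper applied to the RAID-0 runs (34.755 s and 29.5299 s) gives the 55 MB/s difference quoted for that array.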
Note that for RAID-5, the "nice" number of disks is 9 as Andreas said, not 8 as in your example. /mjt --
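The "nice" disk counts are the ones that leave a power-of-two number of data disks per stripe once parity is subtracted: RAID-5 spends one disk's worth of each stripe on parity and RAID-6 spends two. A minimal sketch of that check; the helper names are illustrative, and only RAID levels 0, 5 and 6 are handled:

```shell
#!/bin/sh
# data_disks LEVEL TOTAL (hypothetical helper name)
# Number of disks' worth of data per stripe for RAID levels 0, 5 and 6.
data_disks() {
    case $1 in
        0) echo "$2" ;;
        5) echo $(( $2 - 1 )) ;;   # one disk's worth of parity
        6) echo $(( $2 - 2 )) ;;   # two disks' worth of parity
        *) echo "unsupported RAID level: $1" >&2; return 1 ;;
    esac
}

# is_pow2 N: succeeds when N is a positive power of two.
is_pow2() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Configurations mentioned in the thread: "LEVEL TOTAL_DISKS".
for cfg in "0 11" "0 12" "5 8" "5 9" "6 10"; do
    level=${cfg% *}
    total=${cfg#* }
    n=$(data_disks "$level" "$total")
    if is_pow2 "$n"; then
        verdict="power of two"
    else
        verdict="not a power of two"
    fi
    echo "RAID-$level with $total disks: $n data disks ($verdict)"
done
```

By this check the 8-disk RAID-5 has 7 data disks, which is why 9 rather than 8 is the suggested count for that level.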
Hi, thanks for this.
RAID-0 with 12 disks:
p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-m]1 --level=0 -n 12 -c 64
mdadm: /dev/sdb1 appears to contain an ext2fs file system
size=1077256000K mtime=Sun Feb 28 08:35:47 2010
mdadm: /dev/sdb1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdj1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdk1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdl1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdm1 appears to be part of a raid array:
level=raid6 devices=11 ctime=Sat Feb 27 06:57:29 2010
Continue creating array? y
mdadm: array /dev/md0 started.
p63:~# mkfs.ext4 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
366288896 inodes, 1465151808 blocks
73257590 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
44713 block groups
32768 blocks per ......

That looks pretty good. I think Dave's suggestion of seeing what CPU usage looks like is a good one. Running blktrace on xfs vs. ext4 could possibly also shed some light.

-Eric
--
Mount XFS with "-o logbsize=262144". Metadata-intensive workloads on XFS are log IO bound, so a larger log buffer size makes a big difference. On 2.6.33 kernels on a single 15krpm SCSI drive I've been getting ~21s for the untar, and 8s for the rm -rf with that

I wouldn't consider writing an 11GB file "massive IO", nor would I consider 600MB/s massive, either, since you can get that out of a

Check whether the dd process on ext4 is CPU bound....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
FWIW I'm seeing similar things on fast storage (Fusion IO), though this is under 2.6.31. 500MB/s+ for xfs, 300 for ext4. Overwriting an existing file is no faster. I don't think this driver is blktraceable, but I'll try a newer driver that should be, I think. (xfs's overwrite went from 534 to 597 MB/s; ext4 sat at 320-ish.) Direct IO was good for both xfs & ext4 at around 530MB/s. I'll see if I can get this running on a more recent kernel to do further investigation. -Eric --
FWIW, blktrace (I'm still on 2.6.31) is enlightening:

Total (xfs):
 Reads Queued:           4,       16KiB    Writes Queued:     122,567,   10,485MiB
 Read Dispatches:        4,       16KiB    Write Dispatches:   83,219,   10,485MiB
 Reads Requeued:         0                 Writes Requeued:          0
 Reads Completed:        4,       16KiB    Writes Completed:   83,219,   10,485MiB
 Read Merges:            0,        0KiB    Write Merges:       39,348,  314,804KiB
 IO unplugs:           344                 Timer unplugs:          338

Total (ext4):
 Reads Queued:          14,       56KiB    Writes Queued:      2,621K,   10,486MiB
 Read Dispatches:       14,       56KiB    Write Dispatches:  107,944,   10,486MiB
 Reads Requeued:         0                 Writes Requeued:          0
 Reads Completed:       14,       56KiB    Writes Completed:  107,944,   10,486MiB
 Read Merges:            0,        0KiB    Write Merges:       2,513K,   10,054MiB
 IO unplugs:         2,461                 Timer unplugs:        2,020

See "Writes Queued". See also submit_bio() calls in xfs.

ext4 doing things a block at a time is certainly giving the elevator a workout... I'd tend to chalk it up to that at first glance.

-Eric
--
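The "block at a time" observation can be read straight off the totals by dividing the amount written by the number of write requests. A minimal sketch using the figures above; the helper name avg_kib is illustrative, the division is integer-only, and "2,621K" is assumed to mean roughly 2,621,000 requests:

```shell
#!/bin/sh
# avg_kib TOTAL_MIB REQUESTS (hypothetical helper name)
# Average request size in KiB, using integer division.
avg_kib() {
    echo $(( $1 * 1024 / $2 ))
}

# Queued writes: what the filesystem submits to the block layer.
echo "xfs  queued:     $(avg_kib 10485 122567) KiB per write"
# Assumption: "2,621K" in the blktrace summary is about 2,621,000.
echo "ext4 queued:     $(avg_kib 10486 2621000) KiB per write"

# Dispatched writes: what reaches the device after the elevator merges.
echo "xfs  dispatched: $(avg_kib 10485 83219) KiB per write"
echo "ext4 dispatched: $(avg_kib 10486 107944) KiB per write"
```

On these numbers ext4 queues writes of about 4 KiB each, a single filesystem block, and relies on the elevator to merge them, while xfs queues requests of roughly 87 KiB on average and needs far fewer merges.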
