Re: EXT4 is ~2X as slow as XFS (593MB/s vs 304MB/s) for writes?

Previous thread: [PATCH] ext3: explicitly remove inode from orphan list after failed direct_io by Dmitry Monakhov on Friday, February 26, 2010 - 6:05 am. (1 message)

Next thread: none
From: Justin Piszcz
Date: Friday, February 26, 2010 - 5:31 pm

Hello,

Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
I see about half the performance of XFS for sequential writes.

I have checked the documentation and tried several options, a few of which are
shown below. I have also tried the commit/journal_async/etc options, but none of
them get the write speeds anywhere near XFS.

Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2 
hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.

When it was XFS I used to get 400-600MiB/s for writes for the same RAID 
volume.

How do I 'speed' up ext4?  Is it possible?

raid0_11 disks: (XFS)
# /dev/md0        /r1             xfs     noatime         0       1
p63:/r1# dd if=/dev/zero of=bigfile1 bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.1021 s, 593 MB/s
p63:/r1#

raid0_11 disks: (EXT4)
# /dev/md0        /r1             ext4     noatime         0       1
# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.3741 s, 304 MB/s
p63:/r1#
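
A note on units when reading these figures: dd reports decimal MB/s (1 MB = 1000000 bytes), while several numbers in this thread are quoted as MiB/s. A minimal sketch of the conversion, assuming awk is available; mb_per_s and mib_per_s are made-up helper names, not dd features:

```shell
#!/bin/sh
# Throughput from a dd summary line: bytes copied and elapsed seconds.
# mb_per_s and mib_per_s are illustrative helper names only.

# Decimal megabytes per second (1 MB = 1000000 bytes), the unit dd prints.
mb_per_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1000000 }'
}

# Binary mebibytes per second (1 MiB = 1048576 bytes).
mib_per_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.0f\n", b / s / 1048576 }'
}

# The XFS and ext4 runs above: 10737418240 bytes in 18.1021 s and 35.3741 s.
echo "xfs:  $(mb_per_s 10737418240 18.1021) MB/s = $(mib_per_s 10737418240 18.1021) MiB/s"
echo "ext4: $(mb_per_s 10737418240 35.3741) MB/s = $(mib_per_s 10737418240 35.3741) MiB/s"
```

The ratio between the two filesystems is the same in either unit; only the absolute numbers shift by about 5%.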

Other tests (ext4)
p63:~# mount /dev/md0 /r1 -o data=writeback
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.8746 s, 269 MB/s
p63:/r1#

p63:~# mount /dev/md0 /r1 -o data=writeback,nobarrier
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 40.0656 s, 268 MB/s

Justin.
--

From: Dmitry Monakhov
Date: Friday, February 26, 2010 - 5:46 pm

I don't know how to speed up ext4, but I do know how to slow down XFS :)
It seems you forgot to call fsync at the end of the file write.
In that case some data may still reside in the memory cache.
Please add "conv=fsync" or "conv=fdatasync" to the dd command.
--

From: Justin Piszcz
Date: Friday, February 26, 2010 - 6:05 pm

Hi,

First, with a sync included in the total time (XFS is still 2x as fast):

EXT4:
p63:~# mount /dev/md0 -o nobarrier,data=writeback /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M count=10240; sync'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.4163 s, 303 MB/s
0.02user 19.85system 0:36.97elapsed 53%CPU (0avgtext+0avgdata 7296maxresident)k
0inputs+0outputs (5major+1145minor)pagefaults 0swaps

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M count=10240; sync'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.08 s, 594 MB/s
0.03user 16.15system 0:18.67elapsed 86%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (5major+1147minor)pagefaults 0swaps
p63:/r1#

Per your request: conv=fsync & conv=fdatasync


XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fsync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.2142 s, 590 MB/s
0.03user 16.05system 0:18.21elapsed 88%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (0major+832minor)pagefaults 0swaps
p63:/r1#

EXT4:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fdatasync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.5562 s, 271 MB/s

XFS:
p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fdatasync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.513 s, 580 MB/s
0.03user 16.25system 0:18.51elapsed 87%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (5major+828minor)pagefaults 0swaps
p63:/r1#

p63:/r1# /usr/bin/time bash -c 'dd if=/dev/zero of=file bs=1M conv=fsync count=10240'
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.7859 s, 270 MB/s
0.02user 24.20system 0:39.79elapsed 60%CPU (0avgtext+0avgdata 7328maxresident)k
0inputs+0outputs (5major+829minor)pagefaults ...
--

From: Eric Sandeen
Date: Friday, February 26, 2010 - 5:51 pm

Aside from Dmitry's suggestion to time sync as well (although for 10G, you are
likely not leaving much in cache) I'd ask:

What kernel version?  what xfsprogs/e2fsprogs version?

Were the filesystems created to align with raid geometry?

mkfs.xfs has done that forever; mkfs.ext4 only will do so (automatically)
with recent kernel+e2fsprogs.


--

From: Justin Piszcz
Date: Friday, February 26, 2010 - 6:08 pm

2.6.33/x86_64

ii  xfsprogs                                3.1.1                    Utilities for managing the XFS filesystem

Only default options were used except the mount options.  If that is the

How recent?


--

From: Eric Sandeen
Date: Friday, February 26, 2010 - 6:12 pm

Justin Piszcz wrote:

You're recent enough.  :)

mkfs.ext4 output should include the stripe info if it was found.

        printf(_("Block size=%u (log=%u)\n"), fs->blocksize,
                s->s_log_block_size);
        printf(_("Fragment size=%u (log=%u)\n"), fs->fragsize,
                s->s_log_frag_size);
        printf(_("Stride=%u blocks, Stripe width=%u blocks\n"),
               s->s_raid_stride, s->s_raid_stripe_width);
        printf(_("%u inodes, %llu blocks\n"), s->s_inodes_count,
               ext2fs_blocks_count(s));

etc.

-Eric
--

From: Eric Sandeen
Date: Friday, February 26, 2010 - 6:28 pm

Oh, you need very recent util-linux-ng as well, and use libblkid from there
with:

[e2fsprogs] # ./configure --disable-libblkid

Otherwise you can just feed mkfs.ext4 stripe & stride manually.

-Eric
--

From: Justin Piszcz
Date: Saturday, February 27, 2010 - 3:14 am

Hi,

Even when set, there is still poor performance:

http://busybox.net/~aldot/mkfs_stride.html
Raid Level: 0
Number of Physical Disks: 11
RAID chunk size (in KiB): 1024
number of filesystem blocks (in KiB): 4
mkfs.ext4 -b 4096 -E stride=256,stripe-width=2816
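
For reference, the arithmetic behind those two values can be sketched as follows. stride is the number of filesystem blocks per RAID chunk, and stripe-width is stride times the number of data-bearing disks. ext4_raid_opts is a hypothetical helper name, not part of e2fsprogs, and the parity-disk argument is an assumption added so the same sketch covers RAID-5/6:

```shell
#!/bin/sh
# ext4_raid_opts is a hypothetical helper, not part of e2fsprogs.
# Arguments: chunk size in KiB, filesystem block size in bytes,
# total number of disks, number of parity disks (0 for RAID-0,
# 1 for RAID-5, 2 for RAID-6).
ext4_raid_opts() {
    chunk_kib=$1; block_bytes=$2; disks=$3; parity=$4
    # stride: filesystem blocks per RAID chunk
    stride=$(( chunk_kib * 1024 / block_bytes ))
    # stripe-width: filesystem blocks per full stripe across the data disks
    width=$(( stride * (disks - parity) ))
    echo "stride=${stride},stripe-width=${width}"
}

# The 11-disk RAID-0 with 1024 KiB chunks and 4096-byte blocks
ext4_raid_opts 1024 4096 11 0
```

The printed string can be passed to mkfs.ext4 after -E, as in the command above.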

p63:~# /usr/bin/time mkfs.ext4 -b 4096 -E stride=256,stripe-width=2816 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=256 blocks, Stripe width=2816 blocks
335765504 inodes, 1343055824 blocks
67152791 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
40987 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544

Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 38 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
p63:~#

p63:~# mount /dev/md0 /r1 -o nobarrier,data=writeback
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 39.3674 s, 273 MB/s
p63:/r1#

Still very slow?

Let's try with some optimizations:
p63:/r1#  mount /dev/md0 /r1 -o noatime,barrier=0,data=writeback,nobh,commit=100,nouser_xattr,nodelalloc,max_batch_time=0^C

Still not anywhere near 500-600MiB/s of XFS:
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 30.4824 s, 352 MB/s
p63:/r1#

Am I doing something wrong, or is there a flag I am missing that will speed it
up?  Or is this the expected performance for sequential writes on EXT4?

Justin.

--

From: Justin Piszcz
Date: Saturday, February 27, 2010 - 3:51 am

I also tried the default chunk size (64KiB) in case ext4 had a problem
with chunk sizes > 64KiB; the results were the same for ext4. I also tried
ext2 & ext3 just to see what their performance would be:

p63:~# mkfs.ext2 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 19.9434 s, 538 MB/s
p63:/r1#

p63:~# mkfs.ext3 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 31.0195 s, 346 MB/s

p63:~# mkfs.ext4 -b 4096 -E stride=16,stripe-width=176 /dev/md0
p63:~# mount /dev/md0 /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10737418240 bytes (11 GB) copied, 35.3866 s, 303 MB/s

And, for comparison, XFS:
p63:~# mkfs.xfs -f /dev/md0 > /dev/null 2>&1
p63:~# mount /dev/md0 /r1
p63:~# cd /r1
p63:/r1# dd if=/dev/zero of=file bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 18.1527 s, 592 MB/s
p63:/r1#

--

From: Justin Piszcz
Date: Saturday, February 27, 2010 - 4:09 am

Hi,

I have found the same results on 2 different systems:

It seems to peak at ~350MiB/s performance on mdadm raid, whether
a RAID-5 or RAID-0 (two separate machines):

The only option I found that allows it to go from:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
to
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s

Is the -o nodelalloc option.

How come it is not breaking the 350MiB/s barrier is the question?

Justin.

--

From: Justin Piszcz
Date: Saturday, February 27, 2010 - 4:36 am

Besides large sequential I/O, ext4 seems to be MUCH faster than XFS when
working with many small files.

EXT4

p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
0.18user 2.43system 0:02.86elapsed 91%CPU (0avgtext+0avgdata 5216maxresident)k
0inputs+0outputs (0major+971minor)pagefaults 0swaps
linux-2.6.33  linux-2.6.33.tar
p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
0.02user 0.98system 0:01.03elapsed 97%CPU (0avgtext+0avgdata 5216maxresident)k
0inputs+0outputs (0major+865minor)pagefaults 0swaps

XFS

p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
0.20user 2.62system 1:03.90elapsed 4%CPU (0avgtext+0avgdata 5200maxresident)k
0inputs+0outputs (0major+970minor)pagefaults 0swaps
p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
0.03user 2.02system 0:29.04elapsed 7%CPU (0avgtext+0avgdata 5200maxresident)k
0inputs+0outputs (0major+864minor)pagefaults 0swaps

So I guess that's the tradeoff, for massive I/O you should use XFS, else,
use EXT4?

I still would like to know however, why 350MiB/s seems to be the maximum
performance I can get from two different md raids (that easily do 600MiB/s
with XFS).

Is this a performance issue within ext4 and md-raid?
The problem does not exist with xfs and md-raid.

Justin.


--

From: tytso
Date: Saturday, February 27, 2010 - 10:42 pm

Can you run "filefrag -v <filename>" on the large file you created
using dd?  Part of the problem may be the block allocator simply not
being well optimized for super large writes.  To be honest, that's not
something we've tried (at all) to optimize, mainly because for most
users of ext4 they're more interested in much more reasonable sized
files, and we only have so many hours in a day to hack on ext4.  :-)
XFS in contrast has in the past had plenty of paying customers
interested in writing really large scientific data sets, so this is
something Irix *has* spent time optimizing.

As far as I know none of the ext4 developers' day jobs are currently
focused on really large files using ext4.  Some of us do use ext4 to
support really large files, but it's via some kind of cluster or
parallel file system layered on top of ext4 (i.e., Sun/Clusterfs
Lustre File Systems, or Google's GFS) --- and so what gets actually
stored in ext4 isn't a single 10-20 gigabyte file.

I'm saying this not as an excuse; but it's an explanation for why no
one has really noticed this performance problem until you brought it
up.  I'd like to see ext4 be a good general purpose file system, which
includes handling the really big files stored in a single system.  But
it's just not something we've tried optimizing yet.

So if you can gather some data, such as the filefrag information, that
would be a great first step.  Something else that would be useful is
gathering blktrace information, so we can see how we are scheduling
the writes and whether we have something bad going on there.  I
wouldn't be surprised if there is some stupidity going on in the
generic FS/MM writeback code which is throttling us, and which XFS has
worked around.  Ext4 has worked around some writeback brain-damage
already, but I've been focused on much smaller files (files in the
tens or hundreds of megabytes) since that's what I tend to use much more
frequently.

It's great to see that you're really interested in this; if you're
willing to ...
--

From: Justin Piszcz
Date: Sunday, February 28, 2010 - 7:55 am

Yes, this is shown at the bottom of the e-mail both with -o data=ordered
and data=writeback.


This is more dramatic on the software raid (mdadm) RAID-5 configuration. 
Without -o nodelalloc, I see roughly 200MiB/s.  With -o nodelalloc, I hit 
the same barrier as the RAID-0, 350MiB/s, but its effect on RAID-0 is less 
dramatic.  The full tests and output appear at the bottom of this e-mail; 
however, for brevity, the example below shows 55MiB/s and 132MiB/s
performance increases with RAID-0 and RAID-5 respectively:

For the RAID-0:

-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
An increase of 55MiB/s.

For the RAID-5 (from earlier testing):

-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s

=== CREATE RAID-0 WITH 11 DISKS

p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-l]1 --level=0 -n 11 -c 64
mdadm: /dev/sdb1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sde1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdj1 appears to be ...
--

From: Andreas Dilger
Date: Monday, March 1, 2010 - 1:39 am

Have you tried testing with "nice" numbers of disks in your RAID set  
(e.g. 8 disks for RAID-0, 9 for RAID-5, 10 for RAID-6)?  The mballoc  
code is really much better tuned for power-of-two sized allocations.
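
Reading the three examples together, a "nice" array seems to be one whose count of data disks (total disks minus parity disks) is a power of two. A small sketch of that check, under that reading; is_nice_disk_count is a made-up helper name:

```shell
#!/bin/sh
# is_nice_disk_count is a hypothetical helper name.
# Arguments: total disks, parity disks (0 RAID-0, 1 RAID-5, 2 RAID-6).
# Succeeds when the number of data disks is a power of two.
is_nice_disk_count() {
    data=$(( $1 - $2 ))
    [ "$data" -gt 0 ] && [ $(( data & (data - 1) )) -eq 0 ]
}

# Each entry is "total-disks parity-disks"
for cfg in "8 0" "9 1" "10 2" "11 0" "8 1"; do
    set -- $cfg
    if is_nice_disk_count "$1" "$2"; then
        echo "$1 disks, $2 parity: power-of-two data disks"
    else
        echo "$1 disks, $2 parity: not a power of two"
    fi
done
```

By this check an 11-disk RAID-0 and an 8-disk RAID-5 (7 data disks) both fall outside the "nice" set.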

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--

From: Justin Piszcz
Date: Monday, March 1, 2010 - 2:21 am

Hi,

Yes, the second system (RAID-5) has 8 disks and it shows the same 
performance problems with ext4 and not XFS (as shown from previous 
e-mail), where XFS usually got 500-600MiB/s for writes.

http://groups.google.com/group/linux.kernel/browse_thread/thread/e7b189bcaa2c1cb4/ad6c...

For the RAID-5 (from earlier testing):  <- This one has 8 disks.
-o data=writeback,nobarrier: 
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s 
-o data=writeback,nobarrier,nodelalloc: 
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s 
An increase of 132MiB/s.

Justin.

--

From: Michael Tokarev
Date: Monday, March 1, 2010 - 7:48 am

Note that for RAID-5, the "nice" number of disks is 9 as Andreas
said, not 8 as in your example.

/mjt
--

From: Justin Piszcz
Date: Monday, March 1, 2010 - 8:07 am

Hi, thanks for this.

RAID-0 with 12 disks:

p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-m]1 --level=0 -n 12 -c 64
mdadm: /dev/sdb1 appears to contain an ext2fs file system
     size=1077256000K  mtime=Sun Feb 28 08:35:47 2010
mdadm: /dev/sdb1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sde1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdj1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdk1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdl1 appears to be part of a raid array:
     level=raid0 devices=11 ctime=Sun Feb 28 08:11:10 2010
mdadm: /dev/sdm1 appears to be part of a raid array:
     level=raid6 devices=11 ctime=Sat Feb 27 06:57:29 2010
Continue creating array? y
mdadm: array /dev/md0 started.
p63:~# mkfs.ext4 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
366288896 inodes, 1465151808 blocks
73257590 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
44713 block groups
32768 blocks per ...
--

From: Eric Sandeen
Date: Monday, March 1, 2010 - 9:15 am

...

That looks pretty good.

I think Dave's suggestion of seeing what cpu usage looks like is a good one.

Running blktrace on xfs vs. ext4 could possibly also shed some light.

-Eric
--

From: Dave Chinner
Date: Sunday, February 28, 2010 - 4:50 pm

Mount XFS with "-o logbsize=262144". Metadata intensive workloads on
XFS are log IO bound, so larger log buffer size makes a big
difference. On 2.6.33 kernels on a single 15krpm SCSI drive I've
been getting ~21s for the untar, and 8s for the rm -rf with that

I wouldn't consider writing an 11GB file "massive IO", nor would I
consider 600MB/s massive, either, since you can get that out of a

Check whether the dd process on ext4 is CPU bound....
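
A rough first check is possible from the /usr/bin/time lines already posted in this thread: (user + system) / elapsed approximates the share of one CPU the timed command used. A sketch of that calculation, assuming awk is available; cpu_percent is a made-up helper name:

```shell
#!/bin/sh
# cpu_percent is an illustrative helper name.
# Arguments: user seconds, system seconds, elapsed seconds.
# Prints the truncated percentage of one CPU used by the timed command.
cpu_percent() {
    awk -v u="$1" -v s="$2" -v e="$3" 'BEGIN { printf "%d\n", (u + s) / e * 100 }'
}

# ext4, dd followed by sync: 0.02user 19.85system 0:36.97elapsed
echo "ext4: $(cpu_percent 0.02 19.85 36.97)%"
# XFS, dd followed by sync: 0.03user 16.15system 0:18.67elapsed
echo "xfs:  $(cpu_percent 0.03 16.15 18.67)%"
```

A value close to 100 would point at a CPU-bound writer. Note that this only counts CPU time charged to the timed command itself, so work done in kernel writeback threads is not included.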

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Eric Sandeen
Date: Monday, March 1, 2010 - 5:08 pm

FWIW I'm seeing similar things on fast storage (Fusion IO),
though this is under 2.6.31.  500MB/s+ for xfs, 300 for ext4.

Overwriting an existing file is no faster.   I don't think this
driver is blktraceable but I'll try a newer driver that should be I think.

(xfs's overwrite went from 534 to 597 mb/s; ext4 sat at 320-ish)

direct IO was good for both xfs & ext4 at around 530mb/s

I'll see if I can get this running on a more recent kernel to
do further investigation.

-Eric
--

From: Eric Sandeen
Date: Monday, March 1, 2010 - 5:37 pm

FWIW, blktrace (I'm still on 2.6.31) is enlightening:

Total (xfs):
 Reads Queued:           4,       16KiB	 Writes Queued:     122,567,   10,485MiB
 Read Dispatches:        4,       16KiB	 Write Dispatches:   83,219,   10,485MiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        4,       16KiB	 Writes Completed:   83,219,   10,485MiB
 Read Merges:            0,        0KiB	 Write Merges:       39,348,  314,804KiB
 IO unplugs:           344        	 Timer unplugs:         338

Total (ext4):
 Reads Queued:          14,       56KiB	 Writes Queued:       2,621K,   10,486MiB
 Read Dispatches:       14,       56KiB	 Write Dispatches:  107,944,   10,486MiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:       14,       56KiB	 Writes Completed:  107,944,   10,486MiB
 Read Merges:            0,        0KiB	 Write Merges:        2,513K,   10,054MiB
 IO unplugs:         2,461        	 Timer unplugs:       2,020


See "Writes Queued": 2,621K for ext4 versus 122,567 for xfs.  See also submit_bio() calls in xfs.

ext4 doing things a block at a time is certainly giving the elevator a workout...
I'd tend to chalk it up to that at first glance.
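
To put a number on "a block at a time", the average size of a queued write can be worked out from the totals above: total MiB written divided by the number of writes queued. A sketch, assuming awk is available; avg_write_kib is a made-up helper name, and the ext4 count is only as precise as the rounded "2,621K" in the summary:

```shell
#!/bin/sh
# avg_write_kib is an illustrative helper name.
# Arguments: total MiB written, number of writes queued.
# Prints the average size of one queued write in KiB.
avg_write_kib() {
    awk -v mib="$1" -v n="$2" 'BEGIN { printf "%.1f\n", mib * 1024 / n }'
}

# xfs: 122,567 writes queued for 10,485MiB
echo "xfs:  $(avg_write_kib 10485 122567) KiB per queued write"
# ext4: about 2,621K writes queued for 10,486MiB
echo "ext4: $(avg_write_kib 10486 2621000) KiB per queued write"
```

The ext4 figure comes out at roughly one 4096-byte filesystem block per queued write, which fits the large number of write merges the elevator has to perform.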

-Eric
--
