Re: How to track down abysmal performance ata - raid1 - crypto - vg/lv - xfs

Previous thread: [ANNOUNCE] TCM/LIO: v4.0.0-rc2 for v2.6.35 by Nicholas A. Bellinger on Wednesday, August 4, 2010 - 12:28 am. (1 message)

Next thread: Re: [PATCH 2/2] MEMSTICK: Add driver for Ricoh R5C592 Card reader. by Alex Dubov on Wednesday, August 4, 2010 - 12:57 am. (3 messages)
From: Dominik Brodowski
Date: Wednesday, August 4, 2010 - 12:35 am

Hey,

on a production system I run kernel 2.6.35 and

	XFS	(rw,relatime,nobarrier)

on a

	lvdevice of a vgroup 

consisting of five

	dm-crypt devices (cryptsetup -c aes-lrw-benbi -s 384 create)

, each of which runs on a

	md-raid1 device (mdadm --create --level=raid1 --raid-devices=2)

on two

	750 GB ATA devices.


The read performance is abysmal. The ata devices can be ruled out, as hdparm

How can I best track down the cause of the performance problem, 
a) without rebooting too often, and
b) without breaking up the setup specified above (production system)?

Any ideas? perf(1)? iostat(1)?

Thanks & best,

	Dominik
--

From: Christoph Hellwig
Date: Wednesday, August 4, 2010 - 1:50 am

So did you just upgrade the system from an earlier kernel that did not
show these problems?  Or did no one notice them before?

--

From: Dominik Brodowski
Date: Wednesday, August 4, 2010 - 2:13 am

Christoph,



Well, there are some reports relating to XFS on MD or RAID, though I couldn't
find a resolution to the issues reported, e.g.

- http://kerneltrap.org/mailarchive/linux-raid/2009/10/12/6490333
- http://lkml.indiana.edu/hypermail/linux/kernel/1006.1/00099.html

However, I think we can rule out barriers, as XFS is mounted "nobarrier"
here.

Best,
	Dominik
--

From: Christoph Hellwig
Date: Wednesday, August 4, 2010 - 2:21 am

Ok, so it's been around for a while.  Can you test the read speed of
each individual device layer by doing a large read from it, using:

	dd if=<device> of=/dev/null bs=8k iflag=direct

where device starts with the /dev/sda* device, and goes up to the MD
device, the dm-crypt device and the LV.  And yes, it's safe to read
from the device while it's otherwise mounted/used.
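
The per-layer test suggested above could be scripted roughly as follows. This is only a sketch: the helper name `read_layer` and the device names in the commented-out loop are assumptions for illustration, not taken from the system discussed here. It only reads, so it leaves the devices untouched.

```shell
#!/bin/sh
# Read a fixed amount from one block device (or file) and print the last
# line of dd's statistics.
#   $1 = device path
#   $2 = number of 8k blocks (default 131072, i.e. 1 GiB)
#   $3 = optional dd iflag value, e.g. "direct"
read_layer() {
    dev=$1
    count=${2:-131072}
    flag=${3:+iflag=$3}
    # dd writes its statistics to stderr; keep only the final line
    dd if="$dev" of=/dev/null bs=8k count="$count" $flag 2>&1 | tail -n 1
}

# Hypothetical device names, bottom of the stack first; adjust to the real setup.
# for dev in /dev/sda1 /dev/md0 /dev/mapper/md0_crypt /dev/mapper/vg1-data; do
#     printf '%-28s ' "$dev"
#     read_layer "$dev" 131072 direct
# done
```

Running the same block size and count against every layer keeps the numbers comparable, so the first layer where throughput drops stands out.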

--

From: Michael Monnerie
Date: Wednesday, August 4, 2010 - 2:16 am

Has that system been running acceptably before? If yes, what has been 
changed that performance is down now?

Or is it a new setup? Then why is it in production already?

Can you run bonnie on that system?
What does "dd if=<your device> of=/dev/null bs=1M count=1024" say?
What does "dd if=/dev/zero of=<your device> bs=1M count=1024" say?


-- 
Kind regards,
Michael Monnerie, Ing. BSc

it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31

--

From: Dominik Brodowski
Date: Wednesday, August 4, 2010 - 3:25 am

Hey,

many thanks for your feedback. It seems the crypto step is the culprit:

Reading 1.1 GB with dd, iflag=direct, bs=8k:

/dev/sd*                35.3 MB/s       ( 90 %)
/dev/md*                39.1 MB/s       (100 %)
/dev/mapper/md*_crypt    3.9 MB/s       ( 10 %)
/dev/mapper/vg1-*        3.9 MB/s       ( 10 %)
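
The percentages in the table are each layer's throughput relative to the fastest layer (here /dev/md*). A minimal sketch of that arithmetic, with `pct` being a hypothetical helper name:

```shell
#!/bin/sh
# Print a throughput as a rounded percentage of a reference throughput.
#   $1 = measured MB/s, $2 = reference MB/s
pct() {
    awk -v x="$1" -v ref="$2" 'BEGIN { printf "%.0f\n", x / ref * 100 }'
}

pct 35.3 39.1   # /dev/sd* relative to /dev/md*            -> 90
pct 3.9 39.1    # /dev/mapper/md*_crypt relative to md*    -> 10
```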

The "good" news: it also happens on my notebook, even though it has a
different setup (no raid, disk -> lv/vg -> crypt). On my notebook, I'm
more than happy to test out different kernel versions, patches etc.

/dev/sd*                17.7 MB/s       (100 %)
/dev/mapper/vg1-*       16.2 MB/s       ( 92 %)
/dev/mapper/*_crypt      3.1 MB/s       ( 18 %)

On a different system, a friend of mine reported (with 2.6.33):

/dev/sd*		51.9 MB/s	(100 %)
dm-crypt		32.9 MB/s	( 64 %)

This shows that the speed drop when using dmcrypt is not always a factor of
5 to 10... Btw, it occurs both with aes-lrw-benbi and aes-cbc-essiv:sha256 ,
and (on my notebook) the CPU is mostly idling or waiting. 

Best,
	Dominik

PS: Bonnie output:

Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ABCABCABCABCABC 16G           60186   4 24796   4           53386   5 281.1   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  3176  16 +++++ +++  4641  28  ...
--

From: Christoph Hellwig
Date: Wednesday, August 4, 2010 - 4:18 am

The good news is that you have it tracked down, the bad news is that
I know very little about dm-crypt.  Maybe the issue is the single
threaded decryption in dm-crypt?  Can you check how much CPU time
the dm crypt kernel thread uses?

--

From: Dominik Brodowski
Date: Wednesday, August 4, 2010 - 4:24 am

2 CPUs overall:
Cpu(s):  1.0%us,  5.7%sy,  0.0%ni, 44.8%id, 47.0%wa,  0.0%hi,  1.5%si, 0.0%st
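
For reference, the I/O-wait share can be pulled out of a summary line like the one above with a small helper. This is a sketch: `iowait_of` is a hypothetical name, and it assumes the `NN.N%wa` layout shown here (other top versions format the line differently).

```shell
#!/bin/sh
# Extract the I/O-wait percentage from a top(1) "Cpu(s):" summary line
# that uses the "47.0%wa" style of field.
iowait_of() {
    printf '%s\n' "$1" | sed -n 's/.* \([0-9.]*\)%wa.*/\1/p'
}

iowait_of "Cpu(s):  1.0%us,  5.7%sy,  0.0%ni, 44.8%id, 47.0%wa,  0.0%hi,  1.5%si, 0.0%st"
# prints 47.0
```

To look at the dm-crypt kernel threads themselves, something like `ps -eo pid,comm,time | grep kcryptd` lists their accumulated CPU time, assuming the threads are named kcryptd as on kernels of this era.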

Thanks & best,
	Dominik
--

From: Mikael Abrahamsson
Date: Wednesday, August 4, 2010 - 4:53 am

I'm not sure it's that. I have a Core i5 with AES-NI and that didn't 
significantly increase my overall performance, as that's not where the 
bottleneck is (at least in my system).

I earlier sent out an email wondering if someone could shed some light on 
how scheduling, block caching and read-ahead work together when one does 
disks->md->crypto->lvm->fs, because that's a lot of layers and potentially 
a lot of unneeded buffering, readahead and scheduling magic?

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se
--

From: Mike Snitzer
Date: Wednesday, August 4, 2010 - 5:56 am

On Wed, Aug 04 2010 at  7:53am -0400,

You could try applying both of these patches that are pending review for
hopeful inclusion in 2.6.36:

https://patchwork.kernel.org/patch/103404/
https://patchwork.kernel.org/patch/112657/
--

From: Neil Brown
Date: Wednesday, August 4, 2010 - 3:24 pm

On Wed, 4 Aug 2010 13:53:03 +0200 (CEST)

Both page-cache and read-ahead work at the filesystem level, so only the
device in the stack that the filesystem mounts from is relevant for these.
Any read-ahead settings on other devices are ignored.
Other levels only have a cache if they explicitly need one.  e.g. raid5 has a
stripe-cache to allow parity calculations across all blocks in a stripe.

Scheduling can potentially happen at every layer, but it takes very different
forms.  Crypto, lvm, raid0 etc don't do any scheduling - it is just
first-in-first-out.
RAID5 does some scheduling for writes (but not reads) to try to gather full
stripes.  If you write 2 of 3 blocks in a stripe, then 3 of 3 in another
stripe, the 3 of 3 will be processed immediately while the 2 of 3 might be
delayed a little in the hope that the third will arrive.

The /sys/block/XXX/queue/scheduler setting only applies at the bottom of the
stack (though when you have dm-multipath it is actually one step above the
bottom).
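
To inspect the two settings discussed above, a sketch along these lines may help. The helper name `active_scheduler` and the device names in the comments are assumptions; the scheduler file marks the active elevator with square brackets.

```shell
#!/bin/sh
# Print the active I/O scheduler from the contents of a queue/scheduler
# file, where the active one is bracketed, e.g. "noop deadline [cfq]".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Hypothetical device names; adjust to the real stack.
# Scheduler of the bottom-level disk:
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
# Read-ahead (in 512-byte sectors) of the device the filesystem mounts from:
#   blockdev --getra /dev/mapper/vg1-data
```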

Hope that helps,
NeilBrown
--

From: Valdis.Kletnieks
Date: Wednesday, August 4, 2010 - 1:33 pm

Unfortunately, on my laptop with a similar config, I'm seeing this:

# dd if=/dev/sda bs=8k count=1000000 of=/dev/null
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB) copied, 108.352 s, 75.6 MB/s
# dd if=/dev/sda2 bs=8k count=1000000 of=/dev/null
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB) copied, 105.105 s, 77.9 MB/s
# dd if=/dev/mapper/vg_blackice-root bs=8k count=100000 of=/dev/null
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 11.6469 s, 70.3 MB/s

The raw disk, the LUKS-encrypted partition that's got a LVM on it, and a
crypted LVM partition. The last run spikes both CPUs up to about 50%CPU each.
So whatever it is, it's somehow more subtle than that.  Maybe the fact that
in my case, it's disk, crypto, and LVM on the crypted partition, rather than
crypted filesystems on an LVM volume?

From: Dominik Brodowski
Date: Thursday, August 5, 2010 - 2:31 am

Hey,

when attempting to track down insufficient I/O performance, I found the
following regression relating to direct-io on my notebook, where an
ata device, which consists of several partitions, is combined to a lvm
volume, and one logical volume is then encrypted using dm-crypt. Test case
was the following command:

$ dd if=/dev/mapper/vg0-root_crypt of=/dev/zero iflag=direct bs=8k count=131072

2.6.34 results in ~16 MB/s,
2.6.35 results in ~ 3.1 MB/s

The regression was bisected down to the following commit:

commit c2c6ca417e2db7a519e6e92c82f4a933d940d076
Author: Josef Bacik <josef@redhat.com>
Date:   Sun May 23 11:00:55 2010 -0400

    direct-io: do not merge logically non-contiguous requests
    
...

How to fix this? I do not use btrfs, but ext3 (and the access was down on
the block level, not on the fs level, so this btrfs-related commit should not
cause such a regression).
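
A bisection like the one described above could be driven roughly as follows. This is a sketch under assumptions: `verdict` is a hypothetical helper, and the 8 MB/s cut-off is simply a value between the two measured results, not something from the original test.

```shell
#!/bin/sh
# Classify a measured throughput (MB/s) as "good" or "bad" for git bisect.
#   $1 = measured MB/s
#   $2 = minimum acceptable MB/s (default 8, an assumed cut-off)
verdict() {
    awk -v mbps="$1" -v min="${2:-8}" \
        'BEGIN { if (mbps + 0 >= min + 0) print "good"; else print "bad" }'
}

# In a kernel git tree:
#   git bisect start
#   git bisect bad v2.6.35
#   git bisect good v2.6.34
# Then, after building and booting each candidate and running the dd test case:
#   git bisect "$(verdict <measured MB/s>)"
```

Since each step needs a reboot into the candidate kernel, the loop cannot be handed to `git bisect run`; the helper only removes the judgment call at each step.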

Best,

	Dominik
--

From: Chris Mason
Date: Thursday, August 5, 2010 - 4:32 am

Well, you've already bisected down to an offending if statement, that's
a huge help.  I'll try to reproduce this and fix it up today.

But, I'm surprised your drive is doing 8K dio reads at 16MB/s, that
seems a little high.  

-chris

--

From: Josef Bacik
Date: Thursday, August 5, 2010 - 5:36 am

Hrm, I made sure there were no perf regressions when I was testing this stuff,
though I think I only tested xfs and ext4.  Originally I had a test so that this
only applied if we provided our own submit_io, so maybe as a workaround just make

if (dio->final_block_in_bio != dio->cur_page_block ||
    cur_offset != bio_next_offset)

look like this

if (dio->final_block_in_bio != dio->cur_page_block ||
    (dio->submit_io && cur_offset != bio_next_offset))

and that should limit my change to only btrfs.  I know why it could cause a
problem, but this change shouldn't be causing a 400% regression.  I suspect
something else is afoot here.  Thanks,

Josef
--

From: Jeff Moyer
Date: Thursday, August 5, 2010 - 11:58 am

I'm not sure why you think that.  We're talking about a plain old SATA
disk, right?  I can get 40-50MB/s on my systems for 8KB O_DIRECT reads.
What am I missing?

Cheers,
Jeff
--

From: Chris Mason
Date: Thursday, August 5, 2010 - 12:01 pm

Clearly I'm wrong, his drive is going much faster ;)  I expect the
smaller reads to be slower but the drive's internal cache is doing well.

-chris

--
