Hey,
on a production system I run kernel 2.6.35 and XFS (rw,relatime,nobarrier) on a
lvdevice of a vgroup consisting of five dm-crypt devices
(cryptsetup -c aes-lrw-benbi -s 384 create), each of which runs on a md-raid1
device (mdadm --create --level=raid1 --raid-devices=2) on two 750 GB ATA devices.
The read performance is abysmal. The ATA devices can be ruled out, as hdparm
How can I best track down the cause of the performance problem,
a) without rebooting too often, and
b) without breaking up the setup specified above (production system)?
Any ideas? perf(1)? iostat(1)?
Thanks & best,
Dominik
--
So did you just upgrade the system from an earlier kernel that did not show these problems? Or did no one notice them before? --
Christoph,
well, there are some reports relating to XFS on MD or RAID, though I couldn't
find a resolution to the issues reported, e.g.
- http://kerneltrap.org/mailarchive/linux-raid/2009/10/12/6490333
- http://lkml.indiana.edu/hypermail/linux/kernel/1006.1/00099.html
However, I think we can rule out barriers, as XFS is mounted "nobarrier" here.
Best,
Dominik
--
Ok, so it's been around for a while. Can you test the read speed of each
individual device layer by doing a large read from it, using:
    dd if=<device> of=/dev/null bs=8k iflag=direct
where <device> starts with the /dev/sda* device, and goes up to the MD device,
the dm-crypt device and the LV. And yes, it's safe to read from the device while
it's otherwise mounted/used.
--
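A minimal sketch of the per-layer read test suggested above. The function name
read_layers and the DD_FLAGS override are made up for illustration, and the
device paths in the usage comment are placeholders, not the names used on the
system in this thread. Reading is non-destructive, but opening the devices
normally requires root.

```shell
# Read up to 1 GiB (131072 x 8 KiB) from each device given as an argument and
# print the last line dd writes to stderr, which carries the throughput.
# DD_FLAGS is a hypothetical override; when it is unset, reads use O_DIRECT so
# the page cache does not hide the cost of the lower layers.
read_layers() {
    for dev in "$@"; do
        printf '%s: ' "$dev"
        dd if="$dev" of=/dev/null bs=8k count=131072 ${DD_FLAGS-iflag=direct} 2>&1 | tail -n 1
    done
}

# Example (placeholder device names, bottom of the stack first):
# read_layers /dev/sda /dev/md0 /dev/mapper/md0_crypt /dev/mapper/vg1-data
```

Going from the bottom of the stack upwards makes the first layer at which the
throughput collapses stand out.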
Has that system been running acceptably before? If yes, what has changed so
that performance is down now? Or is it a new setup? Then why is it in production
already?
Can you run bonnie on that system?
What does "dd if=<your device> of=/dev/null bs=1M count=1024" say?
What does "dd if=/dev/zero of=<your device> bs=1M count=1024" say?
--
Best regards,
Michael Monnerie, Ing. BSc
it-management Internet Services
http://proteger.at
--
Hey,
many thanks for your feedback. It seems the crypto step is the culprit:
Reading 1.1 GB with dd, iflag=direct, bs=8k:
/dev/sd*               35.3 MB/s ( 90 %)
/dev/md*               39.1 MB/s (100 %)
/dev/mapper/md*_crypt   3.9 MB/s ( 10 %)
/dev/mapper/vg1-*       3.9 MB/s ( 10 %)
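For reference, the percentages in the table above are each layer's throughput
relative to the md device. A small sketch of that arithmetic; the helper name
pct is made up for illustration.

```shell
# Print a throughput value as a rounded percentage of a baseline value.
pct() {
    awk -v v="$1" -v base="$2" 'BEGIN { printf "%.0f\n", 100 * v / base }'
}

pct 35.3 39.1   # /dev/sd* relative to /dev/md*
pct 3.9 39.1    # dm-crypt device relative to /dev/md*
```

The two calls print 90 and 10, matching the table: the dm-crypt layer is slower
than the md device underneath it by a factor of about 10.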
The "good" news: it also happens on my notebook, even though it has a
different setup (no raid, disk -> lv/vg -> crypt). On my notebook, I'm
more than happy to test out different kernel versions, patches etc.
/dev/sd*               17.7 MB/s (100 %)
/dev/mapper/vg1-*      16.2 MB/s ( 92 %)
/dev/mapper/*_crypt     3.1 MB/s ( 18 %)
On a different system, a friend of mine reported (with 2.6.33):
/dev/sd*               51.9 MB/s (100 %)
dm-crypt               32.9 MB/s ( 64 %)
This shows that the speed drop when using dm-crypt is not always a factor of
5 to 10. Btw, it occurs both with aes-lrw-benbi and aes-cbc-essiv:sha256,
and (on my notebook) the CPU is mostly idling or waiting.
Best,
Dominik
PS: Bonnie output:
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03d       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ABCABCABCABCABC 16G           60186   4 24796   4           53386   5 281.1   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  3176  16 +++++ +++  4641  28 ...
--
The good news is that you have it tracked down, the bad news is that I know very
little about dm-crypt. Maybe the issue is the single threaded decryption in
dm-crypt? Can you check how much CPU time the dm crypt kernel thread uses?
--
2 CPUs overall: Cpu(s): 1.0%us, 5.7%sy, 0.0%ni, 44.8%id, 47.0%wa, 0.0%hi, 1.5%si, 0.0%st Thanks & best, Dominik --
I'm not sure it's that. I have a Core i5 with AES-NI and that didn't
significantly increase my overall performance, since that's not where the
bottleneck is (at least in my system).
I earlier sent out an email wondering if someone could shed some light on how
scheduling, block caching and read-ahead work together when one does
disks->md->crypto->lvm->fs, because that's a lot of layers and potentially a lot
of unneeded buffering, readahead and scheduling magic?
--
Mikael Abrahamsson email: swmike@swm.pp.se
--
You could try applying both of these patches that are pending review for
hopeful inclusion in 2.6.36:
https://patchwork.kernel.org/patch/103404/
https://patchwork.kernel.org/patch/112657/
--
Both page-cache and read-ahead work at the filesystem level, so only the device
in the stack that the filesystem mounts from is relevant for these. Any
read-ahead settings on other devices are ignored.
Other levels only have a cache if they explicitly need one, e.g. raid5 has a
stripe-cache to allow parity calculations across all blocks in a stripe.
Scheduling can potentially happen at every layer, but it takes very different
forms. Crypto, lvm, raid0 etc don't do any scheduling - it is just
first-in-first-out. RAID5 does some scheduling for writes (but not reads) to try
to gather full stripes. If you write 2 of 3 blocks in a stripe, then 3 of 3 in
another stripe, the 3 of 3 will be processed immediately while the 2 of 3 might
be delayed a little in the hope that the third will arrive.
The /sys/block/XXX/queue/scheduler setting only applies at the bottom of the
stack (though when you have dm-multipath it is actually one step above the
bottom).
Hope that helps,
NeilBrown
--
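To see which settings are in play on a given stack, the per-device queue
attributes can be read from sysfs. A minimal sketch; the function name
show_queue_settings and the SYSFS_ROOT override are made up for illustration,
and the device names in the usage comment are placeholders. Following the
explanation above, the read-ahead value only matters on the device the
filesystem is mounted from, and the scheduler only at the bottom of the stack.

```shell
# Print the I/O scheduler and read-ahead setting of each named block device.
# SYSFS_ROOT is a hypothetical override of the sysfs mount point (default /sys).
show_queue_settings() {
    root="${SYSFS_ROOT:-/sys}"
    for name in "$@"; do
        q="$root/block/$name/queue"
        printf '%s: scheduler=%s read_ahead_kb=%s\n' \
            "$name" "$(cat "$q/scheduler")" "$(cat "$q/read_ahead_kb")"
    done
}

# Example (placeholder names for a disk, an md array and a dm device):
# show_queue_settings sda md0 dm-0
```

The active scheduler is the entry shown in square brackets in the scheduler
attribute.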
Unfortunately, on my laptop with a similar config, I'm seeing this:
# dd if=/dev/sda bs=8k count=1000000 of=/dev/null
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB) copied, 108.352 s, 75.6 MB/s
# dd if=/dev/sda2 bs=8k count=1000000 of=/dev/null
1000000+0 records in
1000000+0 records out
8192000000 bytes (8.2 GB) copied, 105.105 s, 77.9 MB/s
# dd if=/dev/mapper/vg_blackice-root bs=8k count=100000 of=/dev/null
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 11.6469 s, 70.3 MB/s
The raw disk, the LUKS-encrypted partition that's got a LVM on it, and a crypted
LVM partition. The last run spikes both CPUs up to about 50%CPU each.
So whatever it is, it's somehow more subtle than that. Maybe the fact that in my
case, it's disk, crypto, and LVM on the crypted partition, rather than crypted
filesystems on an LVM volume?
--
Hey,
when attempting to track down insufficient I/O performance, I found the
following regression relating to direct-io on my notebook, where an
ATA device, which consists of several partitions, is combined into an LVM
volume, and one logical volume is then encrypted using dm-crypt. The test case
was the following command:
$ dd if=/dev/mapper/vg0-root_crypt of=/dev/zero iflag=direct bs=8k count=131072
2.6.34 results in ~16 MB/s,
2.6.35 results in ~ 3.1 MB/s
The regression was bisected down to the following commit:
commit c2c6ca417e2db7a519e6e92c82f4a933d940d076
Author: Josef Bacik <josef@redhat.com>
Date: Sun May 23 11:00:55 2010 -0400
direct-io: do not merge logically non-contiguous requests
...
How to fix this? I do not use btrfs, but ext3 (and the access was done at
the block level, not at the fs level, so this btrfs-related commit should not
cause such a regression).
Best,
Dominik
--
Well, you've already bisected down to an offending if statement, that's a huge help. I'll try to reproduce this and fix it up today. But, I'm surprised your drive is doing 8K dio reads at 16MB/s, that seems a little high. -chris --
Hrm, I made sure there were no perf regressions when I was testing this stuff,
though I think I only tested xfs and ext4. Originally I had a test for whether
we provided our own submit_io, so maybe as a workaround just make
    if (dio->final_block_in_bio != dio->cur_page_block ||
        cur_offset != bio_next_offset)
look like this
    if (dio->final_block_in_bio != dio->cur_page_block ||
        (dio->submit_io && cur_offset != bio_next_offset))
and that should limit my change to only btrfs. I know why it could cause a
problem, but this change shouldn't be causing a 400% regression. I suspect
something else is afoot here. Thanks,
Josef
--
I'm not sure why you think that. We're talking about a plain old SATA disk, right? I can get 40-50MB/s on my systems for 8KB O_DIRECT reads. What am I missing? Cheers, Jeff --
Clearly I'm wrong, his drive is going much faster ;) I expect the smaller reads to be slower but the drive's internal cache is doing well. -chris --
