Hello,
I have 2 WD 150GB Raptor drives in RAID 0 on an NVIDIA 680i motherboard using NVIDIA's fake RAID.
dmraid -ay correctly detects and activates the devices, placing the corresponding entries under /dev/mapper/nvidia_dbggicbd for the logical drive and /dev/mapper/nvidia_dbggicbd1 for the first partition.
dmraid -s nvidia_dbggicbd reveals:
*** Active Set
name   : nvidia_dbggicbd
size   : 586093056
stride : 256
type   : stripe
status : ok
subsets: 0
devs   : 2
spares : 0
When performing a time dd if=/dev/mapper/nvidia_dbggicbd of=/dev/null bs=1024k count=10000, the following results are obtained:
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 59.3748 seconds, 177 MB/s

real 0m59.392s
user 0m0.008s
sys  0m12.529s
mpstat -P ALL 1 reveals:
10:59:14 AM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle   intr/s
10:59:15 AM  all   0.00   0.00  10.05    33.17   2.51   4.02    0.00   50.25  3940.59
10:59:15 AM    0   0.00   0.00  19.80    66.34   5.94   7.92    0.00    0.00  3821.78
10:59:15 AM    1   0.00   0.00   0.00     0.00   0.00   0.00    0.00   99.01    18.81
10:59:15 AM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle   intr/s
10:59:16 AM  all   0.00   0.00   9.95    32.84   2.49   4.98    0.00   49.75  4007.07
10:59:16 AM    0   0.00   0.00  19.19    66.67   4.04  10.10    0.00    0.00  3902.02
10:59:16 AM    1   0.00   0.00   0.00     0.00   1.01   0.00    0.00  101.01     3.03
10:59:16 AM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle   intr/s
10:59:17 AM  all   0.00   0.00  10.05    32.66   2.51   5.03    0.00   49.75  3978.79
10:59:17 AM    0   0.00   0.00  21.21    64.65   5.05   9.09    0.00    0.00  3851.52
10:59:17 AM    1   0.00   0.00   0.00     0.00   0.00   0.00    0.00  101.01    11.11
10:59:17 AM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle   intr/s
10:59:18 AM  all   0.00   0.00  10.45    32.84   2.99   3.48    0.00   50.25  3908.91
10:59:18 AM    0   0.00   0.00  20.79    65.35   5.94   7.92    0.00    0.00  3807.92
10:59:18 AM    1   0.00   0.00   0.00     0.00   0.00   0.00    0.00   99.01     0.99
with the "dd" process consuming 20% CPU according to "top".
Performance drops drastically once I start using the logical partition instead of the entire logical drive, however:
time dd if=/dev/mapper/nvidia_dbggicbd1 of=/dev/null bs=1024k count=10000 results in:
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 93.6217 seconds, 112 MB/s

real 1m33.637s
user 0m0.014s
sys  1m29.637s
mpstat -P ALL 1 reveals:
11:05:42 AM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle   intr/s
11:05:43 AM  all   0.00   0.00  26.13     0.00   3.02  21.11    0.00   49.75  7714.00
11:05:43 AM    0   0.00   0.00  52.00     0.00   6.00  42.00    0.00    0.00  7605.00
11:05:43 AM    1   0.00   0.00   0.00     0.00   0.00   0.00    0.00   99.00     3.00
11:05:43 AM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle   intr/s
11:05:44 AM  all   0.00   0.00  26.00     0.00   3.50  20.50    0.00   50.00  7738.00
11:05:44 AM    0   0.00   0.00  52.00     0.00   7.00  41.00    0.00    0.00  7614.00
11:05:44 AM    1   0.00   0.00   0.00     0.00   0.00   0.00    0.00  100.00    21.00
11:05:44 AM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle   intr/s
11:05:45 AM  all   0.00   0.00  26.00     0.00   4.00  20.00    0.00   50.00  7763.64
11:05:45 AM    0   0.00   0.00  52.53     0.00   8.08  39.39    0.00    0.00  7658.59
11:05:45 AM    1   0.00   0.00   1.01     0.00   0.00   0.00    0.00  101.01     3.03
11:05:45 AM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal   %idle   intr/s
11:05:46 AM  all   0.50   0.00  25.25     0.50   2.48  21.78    0.00   49.50  7741.00
11:05:46 AM    0   0.00   0.00  50.00     0.00   6.00  44.00    0.00    0.00  7614.00
11:05:46 AM    1   0.00   0.00   0.00     0.00   0.00   0.00    0.00  100.00    12.00
with the "dd" process consuming 95% CPU according to "top".
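As a quick cross-check of the two runs, the dd and time figures above can be reduced to a throughput and a kernel-time share. This is only a minimal sketch that redoes the arithmetic on the posted numbers; the names `runs` and `summarize` are made up for illustration.

```python
# Figures copied from the two dd runs above.
BYTES_COPIED = 10_485_760_000  # 10000 blocks of 1024 KiB each

runs = {
    "whole device": {"elapsed": 59.3748, "real": 59.392, "sys": 12.529},
    "partition": {"elapsed": 93.6217, "real": 93.637, "sys": 89.637},
}


def summarize(run):
    """Return (throughput in MB/s, share of wall-clock time spent in kernel mode)."""
    throughput_mb_s = BYTES_COPIED / run["elapsed"] / 1e6
    sys_share = run["sys"] / run["real"]
    return throughput_mb_s, sys_share


for name, run in runs.items():
    mb_s, sys_share = summarize(run)
    print(f"{name}: {mb_s:.1f} MB/s, {sys_share:.0%} of wall time in sys")
```

The kernel-time share comes out at roughly a fifth of the wall time for the whole device and nearly all of it for the partition, which lines up with the 20% and 95% that "top" reports for dd.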
fdisk -l /dev/mapper/nvidia_dbggicbd shows:
Disk /dev/mapper/nvidia_dbggicbd: 300.0 GB, 300079644672 bytes
255 heads, 63 sectors/track, 36482 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
                      Device Boot      Start         End      Blocks   Id  System
/dev/mapper/nvidia_dbggicbd1               1       36482   293041633+  83  Linux
The above performance/behavior difference does not occur when performing similar tests from actual physical disks, for example, /dev/sda and /dev/sda1.
So I guess my question is: why is there such a drastic difference in performance between using the entire logical device and using a partition on that device? In my dense understanding, the only difference should be starting on sector 0 in the case of the logical device and sector 63 for the logical partition. Writes to both devices are similarly impacted, though the numbers are artificially higher due to having 4GB of RAM installed. I'm obviously missing something stupidly simple. Somebody please clue me in?
Not enough information
At a guess, the kernel's doing something silly when dealing with the partition instead of the entire device. Can you install dmsetup, and post the output of dmsetup -r table and dmsetup -r status?
dmsetup info
Here are the results of the commands that you requested:
[root@miya ~]# dmsetup -r table
nvidia_dbggicbd1: 0 586083267 linear 253:0 63
nvidia_dbggicbd: 0 586093056 striped 2 256 8:0 0 8:16 0
[root@miya ~]# dmsetup -r status
nvidia_dbggicbd1: 0 586083267 linear
nvidia_dbggicbd: 0 586093056 striped
Interesting that the device reads "striped" while the partition reads "linear". I can't say if that is normal behavior or not as this is the first time I've used dmraid versus using good old md. Regardless of whether it thinks the partition is linear or striped, I've mounted and read my WinXP NTFS partition from it before and have built ext2 filesystems on it so I'm going to guess that it is normal behavior and that the linear device is actually sitting on top of the striped device. "iostat -xk 1" supports that assumption when using both the logical device and the partition.
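For anyone else reading the table output, here is a minimal sketch of how the two lines break down into fields. It assumes the usual device-mapper table layout (start, length, target, then target arguments) and handles only the linear and striped targets seen here; the function name `parse_dm_table_line` is hypothetical.

```python
def parse_dm_table_line(line):
    """Split one line of `dmsetup table` output into named fields.

    Only the 'linear' and 'striped' targets are handled; all sizes and
    offsets are in 512-byte sectors.
    """
    name, rest = line.split(":", 1)
    fields = rest.split()
    entry = {
        "name": name.strip(),
        "start": int(fields[0]),
        "length": int(fields[1]),
        "target": fields[2],
    }
    args = fields[3:]
    if entry["target"] == "linear":
        # linear <underlying device major:minor> <offset on that device>
        entry["device"] = args[0]
        entry["offset"] = int(args[1])
    elif entry["target"] == "striped":
        # striped <number of stripes> <chunk size> then <device> <offset> pairs
        count = int(args[0])
        entry["chunk_sectors"] = int(args[1])
        pairs = args[2:]
        entry["stripes"] = [(pairs[2 * i], int(pairs[2 * i + 1])) for i in range(count)]
    return entry


partition = parse_dm_table_line("nvidia_dbggicbd1: 0 586083267 linear 253:0 63")
array = parse_dm_table_line("nvidia_dbggicbd: 0 586093056 striped 2 256 8:0 0 8:16 0")
print(partition)
print(array)
```

Read this way, the partition is a linear target on device 253:0 starting at sector 63, and the array is a two-way stripe with 256-sector chunks over 8:0 and 8:16. The partition length of 586083267 sectors also agrees with the 293041633+ 1K blocks that fdisk reports.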
Probably a kernel bug in the dm layer
Looks like the kernel's being a bit slow in the dm layer (and I meant dmsetup -r ls when I said dmsetup -r status - sorry, although there's enough here to spot what's going on).
As you've correctly surmised, nvidia_dbggicbd is being built as a striped device (RAID-0) from two physical disks. nvidia_dbggicbd1 is then being built as a linear mapping against that striped device. When you run dd against the partition, it issues I/O requests against nvidia_dbggicbd1. dm-linear reissues these I/O requests against nvidia_dbggicbd, but with a 63-sector offset. dm-stripe then splits each request across the two disks according to the stripe pattern.
The extra CPU load you see when accessing the partition is therefore occurring when dm-linear reworks the request and passes it on to dm-stripe. The code for this is in drivers/md/dm-*.
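To make the two remapping steps concrete, here is a minimal sketch of the address arithmetic only, using the 63-sector offset and the 2 x 256-sector stripe layout from the table output. It does not model how the kernel actually builds or splits requests, and the function names are made up for illustration.

```python
CHUNK_SECTORS = 256     # stripe chunk size from the dmsetup table (512-byte sectors)
NUM_STRIPES = 2         # two disks in the set
LINEAR_OFFSET = 63      # where the partition starts on the striped device
REQUEST_SECTORS = 2048  # 1 MiB, matching dd bs=1024k


def linear_map(sector, offset=LINEAR_OFFSET):
    """Partition sector -> sector on the striped device."""
    return sector + offset


def stripe_map(sector, chunk=CHUNK_SECTORS, stripes=NUM_STRIPES):
    """Sector on the striped device -> (disk index, sector on that disk)."""
    chunk_index, within_chunk = divmod(sector, chunk)
    disk = chunk_index % stripes
    disk_sector = (chunk_index // stripes) * chunk + within_chunk
    return disk, disk_sector


def chunks_touched(start, length, chunk=CHUNK_SECTORS):
    """Number of stripe chunks that a request of `length` sectors overlaps."""
    first = start // chunk
    last = (start + length - 1) // chunk
    return last - first + 1


print(stripe_map(linear_map(0)))
print(chunks_touched(0, REQUEST_SECTORS))
print(chunks_touched(linear_map(0), REQUEST_SECTORS))
```

A 1 MiB read that starts at sector 0 of the whole device lines up with chunk boundaries, while the same read from the start of the partition begins 63 sectors into a chunk and so overlaps one more chunk. Whether that misalignment accounts for any of the extra CPU time is not something this sketch can show.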
Thanks!
Thank you for the excellent information! It looks like there's no compelling reason to attempt to use dmraid over md unless compatibility for NTFS partitions installed on native/fake software RAID is required. I was scratching my head over this for a while, running and rerunning benchmarks. Makes perfect sense about the reissuance of I/O requests given the read from the logical device was drive-performance limited whereas the partition was bottlenecking on CPU. Thanks for helping me put 2 and 2 together!
Avoidance
Can you post some comments on generalizing the technique to avoid this bottleneck?
How should we arrange the logical partitions?
Will all LVM devices suffer from this bottleneck?
Does this apply to md raid also?