I'm in the process of evaluating various storage options for a large array (12TB) I'm creating. First, I will preface all of this by saying that I understand the note in the kernel docs about comparing file systems under various workloads, and I acknowledge that my exact methodology isn't perfect. But it works for what I'm doing. :)

This array will be used for storage of large media files (up to 20-30GB per file). I'm testing using iozone with various file sizes ranging from 4GB to 32GB. I'm pretty much settled on a RAID50 (128KB stripe size) running ext4 on top of LVM (for snapshots, future expansion, etc.). I'm running kernel 2.6.33.2, e2fsprogs 1.41.11 and util-linux-ng 2.16.

The file system in question was created with the following options:

    mkfs -t ext4 -T large -i 524288 -b 4096 -I 256 -E stride=32,stripe-width=192 /dev/vg/lv

Currently, I'm testing the effect of various mount options on an ext4 file system, and my results are not what I would have expected based on the docs I have read. I wanted to bounce some of them off the list to find out if I'm completely missing something, or if my expectations were off.

I'll start with the craziest one: noatime. Everything I have read says that the noatime option should increase both read and write performance. My results show that write speeds are comparable with or without this option, but read speeds are significantly faster *without* the noatime option. For example, a 16GB file reads at about 210MB/s with noatime but closer to 250MB/s without it.

Next is the write barrier. I'm in a fully battery-backed environment, so I'm not worried about disabling it. From my testing, setting barrier=0 improves write performance on large files (>10GB), but hurts performance on smaller files (<10GB). Read performance is affected similarly. Is this to be expected with files of this size?

Next is the data option. I am seeing a significant increase in read performance when using ...
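For readers checking the -E values in the mkfs line above, here is a minimal sketch of the usual arithmetic: stride is the per-disk chunk size expressed in file system blocks, and stripe-width is stride multiplied by the number of data-bearing disks. The function name is made up for illustration, the "128kb stripe size" is assumed to be the per-disk chunk size, and the figure of 6 data disks is only inferred from 192 / 32; the thread does not state the actual RAID50 layout.

```python
from typing import Tuple


def ext4_raid_geometry(chunk_kib: int, block_bytes: int, data_disks: int) -> Tuple[int, int]:
    """Return (stride, stripe_width), both in file system blocks.

    Hypothetical helper for illustration only.
    """
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes % block_bytes != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    # stride: blocks written to one disk before moving on to the next disk
    stride = chunk_bytes // block_bytes
    # stripe-width: blocks in one full stripe across all data-bearing disks
    stripe_width = stride * data_disks
    return stride, stripe_width


# 128 KiB chunk and 4096-byte blocks come from the post;
# data_disks=6 is an assumption inferred from stripe-width / stride.
print(ext4_raid_geometry(128, 4096, 6))
```

If the array actually has a different number of data-bearing disks, the same function shows what stripe-width would correspond to that layout.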
Steve Brown wrote:

the kernel uses "relatime" now by default, which gives you most of the

not expected by me; barriers == drive write cache flushes, which I

data=writeback is not safe for data integrity; unless you can handle

not sure offhand what to make of decreased write performance with a longer commit time...

-Eric
--
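As background to the relatime remark, the following is a rough sketch of the decision relatime makes on each read, as I understand it for kernels of this era: atime is written only if it is not newer than mtime or ctime, or if it is at least a day old. The function and constant names are illustrative, and timestamps are plain seconds rather than the kernel's timespec values.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_needs_update(atime: float, mtime: float, ctime: float, now: float) -> bool:
    """Approximate the relatime rule; all arguments are seconds since the epoch."""
    if mtime >= atime:
        return True   # file was modified since it was last read
    if ctime >= atime:
        return True   # inode was changed since it was last read
    if now - atime >= DAY_SECONDS:
        return True   # recorded atime is at least 24 hours old
    return False      # otherwise skip the atime write


# A file written at t=100 and first read at t=200 gets its atime updated;
# a second read shortly afterwards does not trigger another update.
print(relatime_needs_update(atime=50, mtime=100, ctime=100, now=200))
print(relatime_needs_update(atime=200, mtime=100, ctime=100, now=260))
```

Under this rule, re-reading the same large file repeatedly in a benchmark causes at most one atime write per day, which is why relatime and noatime are normally expected to perform about the same for reads.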
hmmm... this would seem to conflict with the docs in the kernel, especially:

"Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed in one way or another,

I'm not worried about power loss.

The kernel docs seem to imply that data=[journaled,ordered] come with a performance hit. My results would indicate otherwise. Should I be seeing this kind of

Steve
--
they are not exactly the same thing, so noatime may be -slightly-

what you saw is in conflict with what is expected, yes; I don't know why barriers would ever increase performance.

(my description of barriers as drive write caches isn't in conflict

Sorry, I misread... I also don't know why reading would be much affected at all by the journalling mode, which journals -writes- (reading can update metadata, but not much, esp. if you have noatime/relatime).

--
Barriers, when working, should never make things faster; at best, we should see parity.

It is also important to note that barriers should be disabled automatically if your hardware RAID card exports itself as a "write through" cache, even if you enable barriers on the command line.

What controller are you using, and what kind of drives do you have in the back end?

--
That's good to know about the write barriers with WT cache. I'm still setting everything manually in /etc/fstab because, well... I don't always trust software. ;)

The controller is an LSI 9280-8e (megaraid_sas kernel module). Drives are 1TB Seagate ES.2s, 16 of them in the chassis.

Steve
--
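When options are set by hand in /etc/fstab, it can help to confirm what a given entry actually requests. Below is a small sketch that splits the options field of an fstab-style (or /proc/mounts-style) line into a dictionary. The helper name, the sample line and the /mnt/media mount point are hypothetical and not taken from the thread; only the device path and option names are.

```python
def parse_mount_options(mount_line: str) -> dict:
    """Parse the 4th field (options) of an fstab or /proc/mounts style line."""
    fields = mount_line.split()
    if len(fields) < 4:
        raise ValueError("expected: device mountpoint fstype options [dump pass]")
    options = {}
    for item in fields[3].split(","):
        key, sep, value = item.partition("=")
        # Flag options such as "noatime" map to True; "key=value" keeps the value
        options[key] = value if sep else True
    return options


# Hypothetical entry using options discussed in this thread.
SAMPLE_LINE = "/dev/vg/lv /mnt/media ext4 rw,noatime,barrier=0,data=writeback 0 0"

opts = parse_mount_options(SAMPLE_LINE)
print(opts.get("barrier"), opts.get("data"), "noatime" in opts)
```

Running the same function over the matching line in /proc/mounts shows which options the kernel reports as active, which may differ from what was requested.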
If you have the boot time log messages for the disks you use, you can see how the cache is advertised to the kernel.

Also note that having battery-backed RAID cards does not mean that your drive's write cache will survive a power outage. You usually need to use vendor-specific tools to poke at the drives and make sure that the write cache on the S-ATA disks is properly disabled (unless the LSI firmware does something to manage the write cache on the drives).

Thanks!

Ric
--
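One way to read the boot log for this is to look for the "Write cache:" lines that the SCSI disk driver prints for each device. The sketch below scans log text for those lines and reports the advertised write cache state per device. The two sample lines are illustrative examples of the driver's message format, not output from the system in this thread, and the function name is made up. Behind a hardware RAID controller this reflects what the controller advertises for the logical volume, not the cache setting of each physical drive.

```python
import re

# Matches lines such as:
#   sd 0:2:0:0: [sda] Write cache: enabled, read cache: enabled, ...
WRITE_CACHE_RE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] Write cache: (?P<write>enabled|disabled), "
    r"read cache: (?P<read>enabled|disabled)"
)


def write_cache_states(log_text: str) -> dict:
    """Map device name -> True if the log advertises its write cache as enabled."""
    states = {}
    for line in log_text.splitlines():
        match = WRITE_CACHE_RE.search(line)
        if match:
            states[match.group("dev")] = match.group("write") == "enabled"
    return states


# Illustrative sample only; real input would come from dmesg or the boot log.
SAMPLE_LOG = """\
sd 0:2:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 0:2:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
"""

print(write_cache_states(SAMPLE_LOG))
```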
The server is fully battery backed for up to 45 minutes. Also, LSI does provide tools to disable the cache when the BBU fails. It's one of the array config parameters.

--
