Hi Jens,

I'm chasing a performance bottleneck identified by tiobench that seems
to be caused by CFQ. On a SLES10-SP3 kernel (2.6.16, with some patches
moving cfq closer to 2.6.17) tiobench with 8 threads gets about 260MB/s
sequential read throughput. On recent kernels (including vanilla
2.6.34-rc) it gets about 145MB/s, a regression of about 45%. The queue
and readahead parameters are the same.

This goes back some time: 2.6.27 already seems to perform badly.

Changing the scheduler to noop brings the throughput back into the
260MB/s range, so this is not a driver issue.

Increasing quantum *and* readahead also increases the throughput, but
not by as much. Both noop and these tweaks decrease the write
throughput somewhat, however...

Apparently on recent kernels the number of dispatched requests stays
mostly at or below 4 and the dispatched sector count at or below 2000,
which is not enough to fill the bandwidth on this setup.

On 2.6.16 the number of dispatched requests hovers around 22 and the
sector count around 16000.

I uploaded blktraces for the read part of the tiobench runs for both
2.6.16 and 2.6.32:

http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/

Do you have any idea about the cause of this regression?

Thanks,
Miklos
--
I've also just noticed this using the most recent Redhat kernels.
Writes don't seem to be affected at all. If the latest Redhat kernels
mean anything here, I might as well show you what I've got, in case
there is something in common.

./disktest -B 4k -h 1 -I BD -K 32 -p l -P T -T 300 -r /dev/sdf

With cfq we get this:

STAT | 17260 | v1.4.2 | /dev/sdf | Heartbeat read throughput: 15032320.0B/s (14.34MB/s), IOPS 3670.0/s

And with noop we get this:

STAT | 17260 | v1.4.2 | /dev/sdf | Heartbeat read throughput: 111759360.0B/s (106.58MB/s), IOPS 27285.0/s

Setting some very large and busy web servers to noop, just out of
curiosity, also reduced the average io time and dropped the load.

Chris
--
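As a sanity check on the disktest figures above, the reported throughput should equal the IOPS figure times the 4 KiB transfer size given by `-B 4k`. The following is a minimal Python sketch of that arithmetic; the function name is illustrative and not part of disktest.

```python
# Transfer size in bytes, taken from the "-B 4k" option on the command line.
BLOCK_SIZE = 4 * 1024

def throughput_bytes_per_sec(iops, block_size=BLOCK_SIZE):
    """Throughput implied by an IOPS figure at a fixed transfer size."""
    return iops * block_size

cfq = throughput_bytes_per_sec(3670.0)    # IOPS reported with cfq
noop = throughput_bytes_per_sec(27285.0)  # IOPS reported with noop

print(f"cfq:  {cfq:.1f} B/s ({cfq / 2**20:.2f} MB/s)")
print(f"noop: {noop:.1f} B/s ({noop / 2**20:.2f} MB/s)")
print(f"noop/cfq ratio: {noop / cfq:.2f}")
```

Both products match the B/s values in the two STAT lines, so the gap between the schedulers is entirely a difference in completed requests per second.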
Hi Miklos,

I don't think this is related to CFQ. I've made a graph of the accessed
(read) sectors (see attached). You can see that the green cloud (2.6.16)
is much more concentrated, while the red one (2.6.32) is split in two,
and you can better recognize the different lines. This means that the
FS put more distance between the blocks of the files written by the tio
threads, and the read time is therefore impacted, since the disk head
has to perform longer seeks. On the other hand, if you read those files
sequentially with a single thread, the performance may be better with
the new layout, so YMMV.

When testing 2.6.32 and up, you should also consider testing with the
low_latency setting disabled, since tuning for latency can negatively
affect throughput.

Thanks,
Corrado

--
__________________________________________________________________________
dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda
Hi Corrado,

low_latency is set to zero in all tests.

The layout difference doesn't explain why setting the scheduler to
"noop" consistently almost doubles read throughput in 8-thread
tiobench. This fact alone pretty clearly indicates that the I/O
scheduler is the culprit.

There are other indications, see the attached btt output for both
traces. From there it appears that 2.6.16 does more and longer seeks,
yet it gets better overall performance.

I've also tested with plain "dd" instead of tiobench, where the
filesystem layout stayed exactly the same between tests. The speed
difference is still there.

Thanks,
Miklos
************************************************************************
btt output for 2.6.16:
==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------
Q2Q               0.000000047   0.000854417   1.003550405       67465
Q2G               0.000000458   0.000001211   0.000123527       46062
G2I               0.000000123   0.000001815   0.000494517       46074
Q2M               0.000000186   0.000001798   0.000010296       21404
I2D               0.000000162   0.000158001   0.040794333       46062
M2D               0.000000878   0.000133130   0.040585566       21404
D2C               0.000053870   0.023778266   0.234154543       67466
Q2C               0.000056746   0.023931014   0.234176000       67466

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8, 64) |   0.0035%   0.0052%   0.0024%   0.4508%  99.3617%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0035%   0.0052%   0.0024%   0.4508%  99.3617%

==================== Device Merge Information ====================
DEV | #Q #D Ratio | BLKmin ...

Hi Miklos,

can you give more information about the setup? How much memory do you
have, what is the disk configuration (is this a ...

From the attached btt output, I see that a lot of time is spent ...

Since noop doesn't attach fancy data to each request, it can save those
allocations, thus resulting in no sleeps. The delays in allocation,
though, may not be completely attributable to the I/O scheduler, and
working in constrained memory conditions will ...

I see fewer seeks for 2.6.16, but longer on average. It seems that
2.6.16 allows more requests from the same process to be streamed to
disk before switching to another process. Since the timeslice is the
same, it might be that we are limiting the ...

Does dropping caches before the read test change the situation?

Thanks,
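The "Device Overhead" percentages in the btt output for 2.6.16 can be reproduced from the "All Devices" table. The sketch below assumes each percentage is the phase's total time (average times sample count) divided by the total Q2C time. That weighting is inferred from the numbers in this thread, not taken from the btt documentation, and the function name is illustrative.

```python
# (average seconds, sample count) per phase, copied from the "All Devices" table.
phases = {
    "Q2G": (0.000001211, 46062),
    "G2I": (0.000001815, 46074),
    "Q2M": (0.000001798, 21404),
    "I2D": (0.000158001, 46062),
    "D2C": (0.023778266, 67466),
}
q2c_avg, q2c_n = 0.023931014, 67466  # total queue-to-completion time

def overhead_percent(avg, n, total_avg=q2c_avg, total_n=q2c_n):
    """Share of total Q2C time spent in one phase (assumed weighting)."""
    return 100.0 * (avg * n) / (total_avg * total_n)

overhead = {name: overhead_percent(avg, n) for name, (avg, n) in phases.items()}
for name, pct in overhead.items():
    print(f"{name}: {pct:.4f}%")
```

Under that assumption nearly all of the per-request time is D2C, that is, time spent in the driver and device after dispatch, which is why the discussion focuses on how many requests are in flight at once.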
Corrado,

8G of memory, 8-way Xeon CPU, fiber channel attached storage array (HP ...

I verified with the simple dd test that this happens even if there's no
memory pressure from the cache by dd-ing only 5G of files, after
clearing the cache. This way ~2G of memory are completely free ...

In all my tests I drop caches before running them.

Please let me know if you need more information.

Thanks,
Miklos
--
Jens, Corrado,

Here's a graph showing the number of issued but not yet completed
requests versus time for CFQ and NOOP schedulers running the tiobench
benchmark with 8 threads:

http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg

It shows pretty clearly that the performance problem is because CFQ is
not issuing enough requests to fill the bandwidth.

Is this the correct behavior of CFQ or is this a bug?

This is on a vanilla 2.6.34-rc4 kernel with two tunables modified:

read_ahead_kb=512
low_latency=0 (for CFQ)

Thanks,
Miklos
--
Hi Miklos,

This is the expected behavior from CFQ, even if it is not optimal,
since we aren't able to identify multi-spindle disks yet. Can you post
the result of "grep -r . ." in your /sys/block/*/queue directory, to
see if we can find some parameter that can help identifying your ...

You should get much better throughput by setting
/sys/block/_your_disk_/queue/iosched/slice_idle to 0, or
/sys/block/_your_disk_/queue/rotational to 0.

Thanks,
--
./iosched/quantum:8
./iosched/fifo_expire_sync:124
./iosched/fifo_expire_async:248
./iosched/back_seek_max:16384
./iosched/back_seek_penalty:2
./iosched/slice_sync:100
./iosched/slice_async:40
./iosched/slice_async_rq:2
./iosched/slice_idle:8
./iosched/low_latency:0
./iosched/group_isolation:0
./nr_requests:128
./read_ahead_kb:512
./max_hw_sectors_kb:32767
./max_sectors_kb:512
./max_segments:64
./max_segment_size:65536
./scheduler:noop deadline [cfq]
./hw_sector_size:512
./logical_block_size:512
./physical_block_size:512
./minimum_io_size:512
./optimal_io_size:0
./discard_granularity:0
./discard_max_bytes:0
./discard_zeroes_data:0
./rotational:1
./nomerges:0

slice_idle=0 definitely helps. rotational=0 seems to help on 2.6.34-rc
but not on 2.6.32.

As far as I understand, setting slice_idle to zero is just a workaround
to make cfq look at all the other queues instead of serving one
exclusively for a long time.

I have very little understanding of I/O scheduling, but my idea of
what's really needed here is to realize that one queue is not able to
saturate the device and there's a large backlog of requests on other
queues that are waiting to be served. Is something like that
implementable?

Thanks,
Miklos
--
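The slice_idle workaround discussed here amounts to writing a value into a file under /sys/block/<disk>/queue. A minimal Python sketch follows, assuming root privileges and a kernel that exposes these files. The helper names and the `sysfs_root` parameter are illustrative (the latter exists only so the helpers can be tried against a scratch directory), and "sdf" in the usage note is just the device name from an earlier message.

```python
from pathlib import Path

def queue_tunable_path(disk, name, sysfs_root="/sys"):
    """Path of a block queue tunable, e.g. name='iosched/slice_idle'."""
    return Path(sysfs_root) / "block" / disk / "queue" / name

def get_queue_tunable(disk, name, sysfs_root="/sys"):
    """Read the current value of a tunable as a stripped string."""
    return queue_tunable_path(disk, name, sysfs_root).read_text().strip()

def set_queue_tunable(disk, name, value, sysfs_root="/sys"):
    """Write a new value and return the previous one so it can be restored."""
    previous = get_queue_tunable(disk, name, sysfs_root)
    queue_tunable_path(disk, name, sysfs_root).write_text(f"{value}\n")
    return previous
```

For example, `set_queue_tunable("sdf", "iosched/slice_idle", 0)` would disable idling on that device and return the old value. The setting does not persist across reboots.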
Yes, basically it disables idling (i.e., waiting whether a thread sends ...

I see a problem with defining "saturate the device" - but maybe we
could measure something like "completed requests / sec" and try
autotuning slice_idle to maximize this value (hopefully the utility
function should be concave so we can just use "local optimization").

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
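The autotuning idea can be illustrated with a simple hill climb: try the neighbouring slice_idle values and move while the measured rate improves. This is only a sketch of "local optimization" under the concavity assumption mentioned above, not anything implemented in CFQ. The `measure` callback (returning completed requests per second for a given slice_idle) and all parameter names are hypothetical.

```python
def autotune_slice_idle(measure, start=8, lo=0, hi=16, step=1):
    """Hill-climb slice_idle to maximise measure(value).

    measure: hypothetical callback returning completed requests/sec
             observed while running with the given slice_idle.
    Stops at the first value whose neighbours are not strictly better,
    which is the global optimum only if measure is concave.
    """
    current = start
    best = measure(current)
    while True:
        neighbours = [v for v in (current - step, current + step) if lo <= v <= hi]
        scored = [(measure(v), v) for v in neighbours]
        if not scored:
            return current
        top_score, top_value = max(scored)
        if top_score <= best:
            return current
        current, best = top_value, top_score
```

In practice each call to `measure` would have to run long enough to average out noise, and as noted in the thread the result depends on seek times and the workload, so a real implementation would need to re-tune over time.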
Yeah, detecting saturation may be difficult.

I guess that function depends on a lot of other things as well,
including seek times, etc. Not easy to optimize.

I'm still wondering what makes such a difference between CFQ on 2.6.16
and CFQ on 2.6.27-34. Why is the one in older kernels performing so
much better in this situation?

What should we tell our customers? The default settings should at
least handle these systems a bit better.

Thanks,
Miklos
--
In the past we were of the opinion that for sequential workloads
multi-spindle disks would not matter much, as readahead logic (in the
OS and possibly in hardware also) will help. For random workloads we
don't idle on the single cfqq anyway, so it is fine. But my tests now
seem to be telling a different story.

I also have one FC link to one of the HP EVA and I am running an
increasing number of sequential readers to see if throughput goes up as
the number of readers goes up. The results are with noop and cfq. I do
flush OS caches across the runs but I have no control over caching on
the HP EVA.

Kernel=2.6.34-rc5
DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe
Workload=bsr      iosched=cfq     Filesz=2G   bs=4K
=========================================================================
job   Set  NR  ReadBW(KB/s)  MaxClat(us)  WriteBW(KB/s)  MaxClat(us)
---   ---  --  ------------  -----------  -------------  -----------
bsr   1    1   135366        59024        0              0
bsr   1    2   124256        126808       0              0
bsr   1    4   132921        341436       0              0
bsr   1    8   129807        392904       0              0
bsr   1    16  129988        773991       0              0

Kernel=2.6.34-rc5
DIR=/mnt/iostestmnt/fio        DEV=/dev/mapper/mpathe
Workload=bsr      iosched=noop    Filesz=2G   bs=4K
=========================================================================
job   Set  NR  ReadBW(KB/s)  MaxClat(us)  WriteBW(KB/s)  MaxClat(us)
---   ---  --  ------------  -----------  -------------  -----------
bsr   1    1   126187        95272        0              0
bsr   1    2   185154        72908        0              0
bsr   1    4   224622        88037        0              0
bsr   1    8   285416        115592 ...
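To make the scaling difference in the two tables easier to see, the read bandwidth under noop can be divided by the bandwidth under cfq for each reader count. A small Python sketch using only the rows that survive in this message (the 16-reader noop row is cut off, so it is left out); the variable and function names are illustrative.

```python
# ReadBW in KB/s keyed by number of sequential readers (NR), from the tables.
cfq_bw = {1: 135366, 2: 124256, 4: 132921, 8: 129807, 16: 129988}
noop_bw = {1: 126187, 2: 185154, 4: 224622, 8: 285416}  # 16-reader row truncated

def noop_over_cfq(cfq, noop):
    """Ratio of noop to cfq read bandwidth for reader counts present in both."""
    return {nr: noop[nr] / cfq[nr] for nr in sorted(noop) if nr in cfq}

ratios = noop_over_cfq(cfq_bw, noop_bw)
for nr, ratio in ratios.items():
    print(f"{nr:2d} readers: noop/cfq = {ratio:.2f}")
```

With one reader cfq is slightly ahead, while the ratio grows with every doubling of readers, which is the "different story" the message refers to: cfq stays flat while noop keeps scaling.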
Have you tested on older kernels? Around 2.6.16 it seemed to allow
more parallel reads, but that might have been just accidental (due to
I/O being submitted in a different pattern).

Thanks,
Miklos
--
Hi Vivek,

I tried to implement exactly what you are proposing, see the attached
patches. I leverage the queue merging features to let multiple cfqqs
share the disk in the same timeslice. I changed the queue split code to
trigger on throughput drop instead of on seeky pattern, so diverging
queues can remain merged if they have good throughput. Moreover, I
measure the max bandwidth reached by single queues and merged queues
(you can see the values in the bandwidth sysfs file). If merged queues
can outperform non-merged ones, the queue merging code will try to
opportunistically merge together queues that cannot submit enough
requests to fill half of the NCQ slots.

I'd like to know if you can see any improvements out of this on your
hardware. There are some magic numbers in the code, you may want to try
tuning them. Note that, since the opportunistic queue merging will
start happening only after merged queues have been shown to reach
higher bandwidth than non-merged queues, you should use the disk for a
while before trying ...

Is the BW for 1 single reader also better on 2.6.16, or is the
improvement only seen with more concurrent readers?

Thanks,
On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:

Thanks Corrado. Using split queue sounds like the right place to do it.
I will also test 2.6.16. I am curious anyway how come 2.6.16 performed
better, and how we were dispatching requests from multiple cfqq and
driving deeper queue depths. To me it is fundamental to the cfq design
that at any one time one queue gets to use the disk (at least for the
sync-idle tree). So something must have been different in 2.6.16.

Thanks
Vivek
--
On Sat, Apr 24, 2010 at 10:36:48PM +0200, Corrado Zoccolo wrote:

Hi Corrado,

I ran these patches and I did not see any improvement. I think the
reason is that no cooperative queue merging took place and we did not
have any data for throughput with the coop flag on.

# cat /sys/block/dm-3/queue/iosched/bandwidth
230 753 0

I think we need to implement something similar to the hw_tag detection
logic, where we allow dispatches from multiple sync-idle queues at a
time and try to observe the BW. Once we have observed it over a certain
window, we set the system behavior accordingly.

Kernel=2.6.34-rc5-corrado-multicfq
DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
Workload=bsr      iosched=cfq     Filesz=2G   bs=4K
==========================================================================
job   Set  NR  ReadBW(KB/s)  MaxClat(us)  WriteBW(KB/s)  MaxClat(us)
---   ---  --  ------------  -----------  -------------  -----------
bsr   1    1   126590        61448        0              0
bsr   1    2   127849        242843       0              0
bsr   1    4   131886        508021       0              0
bsr   1    8   131890        398241       0              0
bsr   1    16  129167        454244       0              0

Thanks
Vivek
--
Hi Vivek,

thanks for testing. Can you try changing the condition to enable the
queue merging to also consider the case in which max_bw[1] == 0 &&
max_bw[0] > 100MB/s (notice that max_bw is measured in sectors/jiffy)?
This should rule out low end disks, and enable merging where it can be
beneficial.

If the results are good, but we find this enabling condition
unreliable, then we can think of a better way, but I'm curious to see
if the results are promising before thinking about the details.

Thanks,
--
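Since max_bw is kept in sectors per jiffy, the 100MB/s threshold has to be converted before it can be compared. A minimal Python sketch of the conversion, assuming 512-byte sectors, 1 MB = 2^20 bytes and a kernel built with HZ=1000; the function name is illustrative, and other HZ values give a different constant.

```python
SECTOR_SIZE = 512  # bytes per sector (assumed)

def mb_per_sec_to_sectors_per_jiffy(mb_per_sec, hz=1000, sector_size=SECTOR_SIZE):
    """Convert a bandwidth in MB/s (MB = 2**20 bytes) to sectors per jiffy."""
    bytes_per_sec = mb_per_sec * 1024 * 1024
    sectors_per_sec = bytes_per_sec / sector_size
    return sectors_per_sec / hz

threshold = mb_per_sec_to_sectors_per_jiffy(100)
print(threshold, int(threshold))
print(mb_per_sec_to_sectors_per_jiffy(100, hz=250))
```

Truncated to an integer, the HZ=1000 result is the 204 sectors per jiffy that appears in a comment in the follow-up patch. On a HZ=250 kernel the same bandwidth corresponds to four times as many sectors per jiffy, so a hard-coded constant is tied to the HZ setting.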
Ok, I made some changes and now some queue merging seems to be happening
and I am getting a little better BW. This will require more debugging. I
will try to spare some time later.

Kernel=2.6.34-rc5-corrado-multicfq
DIR= /mnt/iostmnt/fio          DEV= /dev/mapper/mpathe
Workload=bsr      iosched=cfq     Filesz=1G   bs=16K
==========================================================================
job   Set  NR  ReadBW(KB/s)  MaxClat(us)  WriteBW(KB/s)  MaxClat(us)
---   ---  --  ------------  -----------  -------------  -----------
bsr   1    1   136457        67353        0              0
bsr   1    2   148008        192415       0              0
bsr   1    4   180223        535205       0              0
bsr   1    8   166983        542326       0              0
bsr   1    16  176617        832188       0              0

Output of iosched/bandwidth

0 546 740

I made the following changes on top of your patch.

Vivek
---
block/cfq-iosched.c | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4e9e015..7589c44 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -243,6 +243,7 @@ struct cfq_data {
*/
int hw_tag_est_depth;
unsigned int hw_tag_samples;
+ unsigned int cfqq_merged_samples;
/*
* performance measurements
* max_bw is indexed by coop flag.
@@ -1736,10 +1737,14 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
// Opportunistic queue merging could be beneficial even on far queues
// We enable it only on NCQ disks, if we observed that merged queues
// can reach higher bandwidth than single queues.
+ // 204 sectors per jiffy is equivalent to 100MB/s on 1000 HZ conf.
+ // Allow merge if we don't have sufficient merged cfqq samples.
rs = cur_cfqq->allocated[READ] + cur_cfqq->allocated[WRITE];
- if (cfqd->hw_tag && ...

This is becoming interesting. I think a major limitation of the current
approach is that it is too easy for a merged queue to be separated
again. My code:

	if (cfq_cfqq_coop(cfqq) && bw <= cfqd->max_bw[1] * 9/10)
		cfq_mark_cfqq_split_coop(cfqq);

will immediately split any merged queue as soon as max_bw[1] grows too
much, so it should be based on max_bw[0].

Moreover, this code will likely split off all cics from the merged
queue, while it would be much better to split off only the cics that
are receiving less than their fair share of the BW (this will also
improve the fairness of the scheduler when queues are merged).

--
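The difference between the two baselines for the split check can be shown with a toy model. This Python sketch only mirrors the comparison quoted in the message; it is not the kernel code, the function name is hypothetical, and the bandwidth numbers are made up for illustration. As in the patch, max_bw is indexed by the coop flag (0 = single queues, 1 = merged queues).

```python
def should_split(bw, max_bw, coop, baseline):
    """Toy version of the split check.

    bw:       bandwidth currently achieved by the queue
    max_bw:   [best single-queue bw, best merged-queue bw]
    coop:     whether the queue is a merged (cooperating) queue
    baseline: 1 = compare against max_bw[1] (quoted code),
              0 = compare against max_bw[0] (suggested change)
    """
    # Same integer arithmetic as "max_bw[...] * 9/10" in C for non-negative values.
    return bool(coop) and bw <= max_bw[baseline] * 9 // 10

# Made-up numbers: merged queues once peaked at 400, single queues at 200.
max_bw = [200, 400]
merged_queue_bw = 300

print(should_split(merged_queue_bw, max_bw, True, baseline=1))
print(should_split(merged_queue_bw, max_bw, True, baseline=0))
```

In this example the merged queue still outperforms any single queue, yet the check against max_bw[1] splits it because it fell below 90% of the merged peak. The check against max_bw[0] keeps it merged as long as it stays clearly ahead of what single queues achieve.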
Is there any update on the status of this issue?

Thanks,
--
How about running cfq with slice_idle=0 on high end storage? This
should make it behave very close to deadline.

There has not been any further progress on my end on merging more
sequential queues to achieve better throughput.

Thanks
Vivek
--
