Sorry for replying that late, but after digging through another pile of tasks I'm happy to come back to this issue and I'll try to answer all open questions.
Fortunately I'm also able to add a few new insights that might resurrect this discussion^^
For the requested CFQ scheduler tuning, its deadline what is here :-)
So I can't apply all that. But in the past I was already able to show that all the "slowdown" occurs above the block device layer (read back through our threads if interessted about details). But eventually that leaves all lower layer tuning out of the critical zone.
Corrado also asked for iostat data, due to the reason explained above (issue above BDL) it doesn't contain anything much useful as expected.
So I'll just add a one liner of good/bad case to show that things like req-sz etc are the same, but just slower.
This "being slower" is caused by the request arriving in the BDL at a lower rate - caused by our beloved full timeouts in congestion_wait.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
bad sdb 0.00 0.00 154.50 0.00 70144.00 0.00 908.01 0.62 4.05 2.72 42.00
good sdb 0.00 0.00 270.50 0.00 122624.00 0.00 906.65 1.32 4.94 2.92 79.00
So now coming to the probably most critical part - the evict once discussion in this thread.
I'll try to explain what I found in the meanwhile - let me know whats unclear and I'll add data etc.
In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
Therefore I ran all tests and verifications with that drops.
In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
Without it seems to be bad all the time. So don't cast the patch away due to that discovery :-)
On the good side I was also able to analyze a few more things due to that insight - and it might give us new data to debug the root cause.
Like Mel I also had identified "56e49d21 vmscan: evict use-once pages first" to be related in the past. But without the watermark wait fix, unapplying it 56e49d21 didn't change much for my case so I left this analysis path.
But now after I found dropping caches is the key to "get back good performance" and "subsequent writes for bad performance" even with watermark wait applied I checked what else changes:
- first write/read load after reboot or dropping caches -> read TP good
- second write/read load after reboot or dropping caches -> read TP bad
=> so what changed.
I went through all kind of logs and found something in the system activity report which very probably is related to 56e49d21.
When issuing subsequent writes after I dropped caches to get a clean start I get this in Buffers/Caches from Meminfo:
pre write 1
Buffers: 484 kB
Cached: 5664 kB
pre write 2
Buffers: 33500 kB
Cached: 149856 kB
pre write 3
Buffers: 65564 kB
Cached: 115888 kB
pre write 4
Buffers: 85556 kB
Cached: 97184 kB
It stays at ~85M with more writes which is approx 50% of my free 160M memory.
It can be said that once Buffers reached the 65M level all (no matter how much read load I throw at the system) following read loads will have the bad throughput.
Dropping caches - and by that removing these buffers - gives back the good performance.
So far I found no alternative to a manual drop_caches, but recommending a 30 second cron job dropping caches to get good read performance for customers is not that good anyway.
I checked if the buffers get cleaned some when, but neither a lot of subsequent read loads pushing the pressure towards read page cache (I hoped the buffers would age or something to eventually get thrown out) nor waiting a long time helped.
The system seems to be totally unable to get rid of these buffers without my manual help via drop_caches.
I imagine a huge customer DB running wirtes&reads fine at day, with a nightly large backup that losses 50% read throughput because the kernel keeps 50% buffers all the night - and by that doesn't fit in their night slot - just to draw one realistic scenario.
Is there anything to avoid that behavior to "never free these buffers", but still get all/some of the intended benefits of 56e49d21?
Ideas welcome
P.S. This is still a .32 stable kernel + Mels watermark wait patch based analysis - I plan to check current kernels as well once I find the time, but let me know if there are known obvious fixes related to this issue I should test asap.
Mel Gorman wrote:
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
--