Hi again, So, here's another stab at fixing this. This patch is very much an RFC, so do not pull it into anything bound for Linus. ;-) For those new to this topic, here is the original posting: http://lkml.org/lkml/2010/4/1/344 The basic problem is that, when running iozone on smallish files (up to 8MB in size) and including fsync in the timings, deadline outperforms CFQ by a factor of about 5 for 64KB files, and by about 10% for 8MB files. From examining the blktrace data, it appears that iozone will issue an fsync() call, and will have to wait until it's CFQ timeslice has expired before the journal thread can run to actually commit data to disk. The approach below puts an explicit call into the filesystem-specific fsync code to yield the disk so that the jbd[2] process has a chance to issue I/O. This bring performance of CFQ in line with deadline. There is one outstanding issue with the patch that Vivek pointed out. Basically, this could starve out the sync-noidle workload if there is a lot of fsync-ing going on. I'll address that in a follow-on patch. For now, I wanted to get the idea out there for others to comment on. Thanks a ton to Vivek for spotting the problem with the initial approach, and for his continued review. Cheers, Jeff # git diff --stat block/blk-core.c | 15 +++++++++++++++ block/cfq-iosched.c | 38 +++++++++++++++++++++++++++++++++++++- block/elevator.c | 8 ++++++++ fs/ext3/fsync.c | 5 ++++- fs/ext4/fsync.c | 6 +++++- include/linux/backing-dev.h | 13 +++++++++++++ include/linux/blkdev.h | 1 + include/linux/elevator.h | 3 +++ 8 files changed, 86 insertions(+), 3 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 9fe174d..1be9413 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -323,6 +323,19 @@ void blk_unplug(struct request_queue *q) } EXPORT_SYMBOL(blk_unplug); +void blk_backing_dev_yield(struct ...
Thanks Jeff. Conceptually this appraoch makes lot of sense to me. Higher layers explicitly telling CFQ not to idle/yield the slice. My firefox timing test is perfoming much better now. real 0m15.957s user 0m0.608s sys 0m0.165s real 0m12.984s user 0m0.602s sys 0m0.148s real 0m13.057s user 0m0.624s sys 0m0.145s So we got to take care of two issues now. - Make it work with dm/md devices also. Somehow shall have to propogate this yield semantic down the stack. - The issue of making sure we don't yield if we are servicing sync-noidle tree and there is other IO going on which relies on sync-noidle tree idling (as you have already mentioned). I think it would be good if we also check that cfqq is empty or not. If cfqq is not empty we don't want to give up slice? But this is a minor point. At least in case of fsync, we seem to be waiting for all the iozone request to finish and then calling yield. So cfqq should be empty at this point of time. Vivek --
The way that Jeff set it up, it's completely parallel to eg congestion That's a bit of a problem. Say the queue has one request left, it'll service that and the start idling. But we already got this hint that we should yield, but as it happened before we reached the idling state, we'll not yield. Perhaps it would be better to mark the cfqq as yielding if it isn't ready to yield just yet. -- Jens Axboe --
Ok, so various dm targets now need to define "yield_fn" and propogate the That makes sense. So another cfqq flag in place. As soon as all the request from cfqq are dispatched, yield the slice instead of idling. And clear the flag when slice is expiring. Vivek --
Precisely. The next question would be how to control the yielding. In this particular case, you want to be yielding to a specific cfqq. IOW, you essentially want to pass your slide on to that queue. The way the above is implemented, you could easily just switch to another unrelated queue. And if that is done, fairness is skewed without helping the yielding process at all (which was the intention). -- Jens Axboe --
On Thu, Apr 08, 2010 at 04:09:44PM +0200, Jens Axboe wrote: True. I guess this is relatively simple yield implementation where we are telling IO scheduler that there is no more IO coming on this context so don't waste time idling. That's a different thing that after giving up slice, cfq might schedule in another sequential reader instead of journalling thread. Ideally it would be better to be also able to specify who to transfer remaining slice to. I am not sure if there is an easy way to pass this info CFQ. So this implementation might be good first step. Vivek --
Well, that's true in part. Prior to this patch, the process would idle, keeping all other cfq_queues on the system from making progress. With this patch, at least *somebody* else makes progress, getting you closer to running the journal thread that you're blocked on. Ideally, you'd want the thread you're waiting on to get disk time next, sure. You would have to pass the process information down to the I/O scheduler for that, and I'm not sure that the file system code knows which process to hand off to. Does it? Do we really want to go down this road at all? I'm not convinced. Cheers, Jeff --
Don't get me wrong, neither am I. I'm just thinking out loud and pondering. As a general mechanism, yield to a specific cfqq is going to be tricky and doing a generic yield to signal that _someone_ else must make progress before we can is better than nothing. Continuing that train of thought, I don't think we'll ever need full 'yield to X' functionality where 'X' is a really specific target. But for this fsync case, we know at least what we are yielding to and it seems like a shame to throw away that information. So you could include a hint of what to yield to, which cfq could factor in. Dunno, I need to think a bit about the cleanliness of such an approach. We can definitely use your patch as a starting point. -- Jens Axboe --
To do so doesn't DM (and MD) need a blk_queue_yield() setter to establish its own yield_fn? The established dm_yield_fn would call blk_yield() for all real devices in a given DM target. Something like how blk_queue_merge_bvec() or blk_queue_make_request() allow DM to provide functional extensions. I'm not seeing such a yield_fn hook for stacking drivers to use. And as is, jbd and jbd2 just call blk_yield() directly and there is no way for the block layer to call into DM. What am I missing? Thanks, Mike --
Nothing, it is I who am missing something (extra code). When I send out the next version, I'll add the setter function and ensure that queue->yield_fn is called from blk_yield. Hopefully that's not viewed as upside down. We'll see. Thanks for the review, Mike! -Jeff --
I like the concept, it's definitely useful (and your results amply demonstrate that). I was thinking if there was a way in through the ioc itself, rather than bdi -> queue and like you are doing. But I can't spin_lock_irq() is sufficient here. -- Jens Axboe --
I think, one issue with ioc based approach will be that it will then call yield operation on all the devices in the system where this context has ever done any IO. With bdi based approach this call will remain limited to a smaller set of devices. Thanks Vivek --
Oh, you'd want the bdi as well. And as I said, I don't think it was workable, just trying to think it over and consider potentially other ways to accomplish this. At one point I had a patch that did the equivalant of this yield on being scheduled out on the CPU side, which is probably why I was in the ioc mindset. -- Jens Axboe --
Which actually brings up the question of whether this needs some knowledge of whether the journal is on the same device as the file system! In such a case, we need not yield. I think I'll stick my head in the sand for this one. ;-) Cheers, Jeff --
Yes, that is true. But that should be relatively easy to check. -- Jens Axboe --
Jeff even if journal is not on same device, what harm yielding could do? Anyway there is no IO on that queue and we are idling. Only side affect is that yielding process could lose a bit if after fsync it immediately submits more IO. Because this process has yielded it slice, it is back in the queue instead of doing more IO in the current slice immediately. Vivek --
What happens if the journal is on a super fast device, and finishes up very quickly allowing our process to initiate more I/O within the idle window? Cheers, Jeff --
You lose. :-) But at the same time if journalling devices is not fast enough, and you if can't submit next IO in idling window, then you are unnecessarily keeping the disk idle and preventing others from making progress. Vivek --
Probably not needed for this incarnation of the patch, though previous iterations would Oops on boot w/o this. If you look through all of the code in this code path, cfqg == NULL seems to be handled, so it's OK, thanks! Jeff --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build with CONFIG_SSB_PCIHOST=n |
| Linux Kernel Mailing List | 9p: block-based virtio client |
| Michael |
