Due to the recent improvements in 2.5's VM [related story], the I/O scheduler is having a hard time keeping up with some workloads. Jens Axboe sent a patch for a deadline I/O scheduler (an I/O scheduler that tries to start each request within a given time limit).
Jens writes: "The Andrew Morton Interactive Workload (AMIW) [1] rates the current kernel poorly, on my test machine it completes in 1-2 minutes depending on your luck. 2.5.38-BK does a lot better, but mainly because it's being extremely unfair. This deadline io scheduler finishes the AMIW in anywhere from ~0.5 seconds to ~3-4 seconds, depending on the io load."

Update: The deadline scheduler is now in both Linus's current BK kernels and Andrew Morton's -mm kernels.
From: Jens Axboe
To: linux-kernel
Subject: [PATCH] deadline io scheduler
Date: 2002-09-25 17:20:24

Hi,

Due to recent "problems" (well the vm being just too damn good at keeping
disks busy these days), it's become even more apparent that our current
io scheduler just cannot cope with some work loads. Repeated starvation
of reads is the most important one. The Andrew Morton Interactive
Workload (AMIW) [1] rates the current kernel poorly, on my test machine
it completes in 1-2 minutes depending on your luck. 2.5.38-BK does a lot
better, but mainly because it's being extremely unfair. This deadline io
scheduler finishes the AMIW in anywhere from ~0.5 seconds to ~3-4
seconds, depending on the io load.

I'd like folks to give it a test spin. Make two kernels, a 2.5.38
pristine and a 2.5.38 with this patch applied. Now beat on each of them,
while listening to mp3's. Or read mails and change folders. Or anything
else that gives you a feel for the interactiveness of the machine. Then
report your findings. I'm interested in _anything_.

There are a few tunables, but I'd suggest trying the defaults first. Then
experiment with these two:

    static int read_expire = HZ / 2;

This defines the read expire time, current default is 500ms.

    static int writes_starved = 2;

This defines how many times reads can starve writes. 2 means that we can
do two rounds of reads for 1 write.

If you are curious how deadline-iosched works, search lkml archives for
previous announcements. I might make a new one if there's any interest
in a big detailed analysis, since there have been some changes since last
release.

[1] Flush lots of stuff to disk (I start a dbench xxx, or do a
    dd if=/dev/zero of=test_file bs=64k), and then time a cat dir/*.c
    where dir/ holds lots of source files.

-- 
Jens Axboe
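As a quick check on the 500ms figure: HZ is the kernel's timer tick rate, so a value of HZ / 2 ticks is half a second whatever HZ happens to be. A minimal sketch of that arithmetic follows. The helper ticks_to_ms() is hypothetical and not part of the patch, and the tick rate is passed in as a plain parameter rather than taken from the kernel.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only (not from the patch):
 * convert a duration in timer ticks to milliseconds for a tick rate
 * of hz ticks per second.
 */
static long ticks_to_ms(long ticks, long hz)
{
    assert(hz > 0);
    return ticks * 1000 / hz;
}
```

Calling ticks_to_ms(hz / 2, hz) gives 500 for tick rates such as 100 or 1000, which is why the default can be written as HZ / 2 without caring about the configured tick rate.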
[patch omitted for brevity, get it here]
From: Andrew Morton
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Wed, 25 Sep 2002 23:15:58 -0700

This is looking good. With a little more tuning and tweaking
this problem is solved.

The horror test was:

    cd /usr/src/linux
    dd if=/dev/zero of=foo bs=1M count=4000
    sleep 5
    time cat kernel/*.c > /dev/null

Testing on IDE (this matters - SCSI is very different)

- On 2.5.38 + souped-up VM it was taking 25 seconds.

- My read-latency patch took 1 second-odd.

- Linus' rework yesterday was taking 0.3 seconds.

- With Linus' current tree (with the deadline scheduler) it now takes
  5 seconds.

Let's see what happens as we vary read_expire:

    read_expire (ms)    time cat kernel/*.c (secs)
    500                 5.2
    400                 3.8
    300                 4.5
    200                 3.9
    100                 5.1
    50                  5.0

well that was a bit of a placebo ;)

Let's leave read_expire at 500ms and diddle writes_starved:

    writes_starved (units)    time cat kernel/*.c (secs)
    1                         4.8
    2                         4.4
    4                         4.0
    8                         4.9
    16                        4.9

Now alter fifo_batch, everything else default:

    fifo_batch (units)    time cat kernel/*.c (secs)
    64                    5.0
    32                    2.0
    16                    0.2
    8                     0.17

OK, that's a winner.

Here's something really nice with the deadline scheduler. I was
madly catting five separate kernel trees (five reading processes)
and then started a big `dd', tunables at default:

   procs                 memory      swap          io      system         cpu
 r  b  w  swpd   free  buff  cache   si   so    bi    bo    in    cs  us  sy  id
 0  9  0  6008   2460  8304 324716    0    0  2048     0  1102   254  13  88   0
 0  7  0  6008   2600  8288 324480    0    0  1800     0  1114   266   0 100   0
 0  6  0  6008   2452  8292 324520    0    0  2432     0  1126   287  29  71   0
 0  6  0  6008   3160  8292 323952    0    0  3568     0  1132   312   0 100   0
 0  6  0  6008   2860  8296 324148  128    0  2984     0  1119   281  17  83   0
 1  6  0  5984   2856  8264 323816  352    0  5240     0  1162   479   0 100   0
 0  7  1  5984   4152  7876 324068    0    0  1648 28192  1215  1572   1  99   0
 0  9  2  6016   3136  7300 328568    0  180  1232 37248  1324  1201   3  97   0
 0  9  2  6020   5260  5628 329212    0    4  1112 29488  1296   560   0 100   0
 0  9  3  6020   3548  5596 330944    0    0  1064 35240  1302   629   6  94   0
 0  9  3  6020   3412  5572 331352    0    0   744 31744  1298   452   6  94   0
 0  9  2  6020   1516  5576 333352    0    0   888 31488  1283   467   0 100   0
 0  9  2  6020   3528  5580 331396    0    0  1312 20768  1251   385   0 100   0

Note how the read rate maybe halved, and we sustained a high
volume of writeback. This is excellent.

Let's try it again with fifo_batch at 16:

 0  5  0    80 303936  3960  49288    0    0  2520     0  1092   174   0 100   0
 0  5  0    80 302400  3996  50776    0    0  3040     0  1094   172  20  80   0
 0  5  0    80 301164  4032  51988    0    0  2504     0  1082   150   0 100   0
 0  5  0    80 299708  4060  53412    0    0  2904     0  1084   149   0 100   0
 1  5  1    80 164640  4060 186784    0    0  1344 26720  1104   891   1  99   0
 0  6  2    80 138900  4060 212088    0    0   280  7928  1039   226   0 100   0
 0  6  2    80 134992  4064 215928    0    0  1512  7704  1100   226   0 100   0
 0  6  2    80 130880  4068 219976    0    0  1928  9688  1124   245  17  83   0
 0  6  2    80 123316  4084 227432    0    0  2664  8200  1125   283  11  89   0

That looks acceptable. Writes took quite a bit of punishment, but
the VM should cope with that OK.

It'd be interesting to know why read_expire and writes_starved have
no effect, while fifo_batch has a huge effect.

I'd like to gain a solid understanding of what these three knobs do.
Could you explain that a little more?

During development I'd suggest the below patch, to add
/proc/sys/vm/read_expire, fifo_batch and writes_starved - it beats
recompiling each time.

I'll test scsi now.
[akpm's patch here]
From: David S. Miller
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Wed, 25 Sep 2002 23:27:41 -0700 (PDT)

   From: Andrew Morton
   Date: Wed, 25 Sep 2002 23:15:58 -0700

   I'd like to gain a solid understanding of what these three knobs do.
   Could you explain that a little more?

My basic understanding of fifo_batch is:

1) fifo_batch is how many contiguous requests can be in a "set"

2) we send out one write "set" for every two read "sets"

3) a seek works out to "seek_cost" contiguous requests, cost wise,
   this gets subtracted from how many requests the current "set"
   has left that are allowed to be used
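Point 2 above, one write "set" for every two read "sets", can be pictured as a small counter. The following is an illustration only and not code from the patch: the names sched_state and pick_direction are hypothetical, and it assumes the scheduler counts how many read batches have gone ahead of waiting writes and compares that count against writes_starved (default 2).

```c
#include <assert.h>
#include <stdbool.h>

enum dir { DIR_READ, DIR_WRITE, DIR_NONE };

/* Hypothetical state, named for illustration only. */
struct sched_state {
    int writes_starved; /* tunable: read batches allowed per write batch */
    int starved;        /* read batches served while writes were waiting */
};

/* Pick the direction of the next batch under the assumptions above. */
static enum dir pick_direction(struct sched_state *s,
                               bool reads_pending, bool writes_pending)
{
    assert(s->writes_starved >= 0);

    /* Reads go first unless waiting writes have been passed over
     * writes_starved times already. */
    if (reads_pending &&
        !(writes_pending && s->starved >= s->writes_starved)) {
        if (writes_pending)
            s->starved++;
        return DIR_READ;
    }

    if (writes_pending) {
        s->starved = 0; /* writes got their turn, restart the count */
        return DIR_WRITE;
    }

    return DIR_NONE;
}
```

With both directions always pending and writes_starved at 2, repeated calls cycle read, read, write. At 1 the pattern alternates, and at 0 the write branch is taken every time, so reads never run while writes are queued.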
From: Jens Axboe
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 08:44:55 +0200

On Wed, Sep 25 2002, Andrew Morton wrote:
>
> This is looking good. With a little more tuning and tweaking
> this problem is solved.
>
> The horror test was:
>
>     cd /usr/src/linux
>     dd if=/dev/zero of=foo bs=1M count=4000
>     sleep 5
>     time cat kernel/*.c > /dev/null
>
> Testing on IDE (this matters - SCSI is very different)

Yes, SCSI specific stuff comes next.

> - On 2.5.38 + souped-up VM it was taking 25 seconds.
>
> - My read-latency patch took 1 second-odd.
>
> - Linus' rework yesterday was taking 0.3 seconds.
>
> - With Linus' current tree (with the deadline scheduler) it now takes
>   5 seconds.
>
> Let's see what happens as we vary read_expire:
>
>     read_expire (ms)    time cat kernel/*.c (secs)
>     500                 5.2
>     400                 3.8
>     300                 4.5
>     200                 3.9
>     100                 5.1
>     50                  5.0
>
> well that was a bit of a placebo ;)

For this work load, more on that later.

> Let's leave read_expire at 500ms and diddle writes_starved:
>
>     writes_starved (units)    time cat kernel/*.c (secs)
>     1                         4.8
>     2                         4.4
>     4                         4.0
>     8                         4.9
>     16                        4.9

Interesting

> Now alter fifo_batch, everything else default:
>
>     fifo_batch (units)    time cat kernel/*.c (secs)
>     64                    5.0
>     32                    2.0
>     16                    0.2
>     8                     0.17
>
> OK, that's a winner.

Cool, I'm resting benchmarks with 16 as the default now. I fear this
might be too aggressive, and that 32 will be a decent value.

> Here's something really nice with the deadline scheduler. I was
> madly catting five separate kernel trees (five reading processes)
> and then started a big `dd', tunables at default:
>
>    procs                 memory      swap          io      system         cpu
>  r  b  w  swpd   free  buff  cache   si   so    bi    bo    in    cs  us  sy  id
>  0  9  0  6008   2460  8304 324716    0    0  2048     0  1102   254  13  88   0
>  0  7  0  6008   2600  8288 324480    0    0  1800     0  1114   266   0 100   0
>  0  6  0  6008   2452  8292 324520    0    0  2432     0  1126   287  29  71   0
>  0  6  0  6008   3160  8292 323952    0    0  3568     0  1132   312   0 100   0
>  0  6  0  6008   2860  8296 324148  128    0  2984     0  1119   281  17  83   0
>  1  6  0  5984   2856  8264 323816  352    0  5240     0  1162   479   0 100   0
>  0  7  1  5984   4152  7876 324068    0    0  1648 28192  1215  1572   1  99   0
>  0  9  2  6016   3136  7300 328568    0  180  1232 37248  1324  1201   3  97   0
>  0  9  2  6020   5260  5628 329212    0    4  1112 29488  1296   560   0 100   0
>  0  9  3  6020   3548  5596 330944    0    0  1064 35240  1302   629   6  94   0
>  0  9  3  6020   3412  5572 331352    0    0   744 31744  1298   452   6  94   0
>  0  9  2  6020   1516  5576 333352    0    0   888 31488  1283   467   0 100   0
>  0  9  2  6020   3528  5580 331396    0    0  1312 20768  1251   385   0 100   0
>
> Note how the read rate maybe halved, and we sustained a high
> volume of writeback. This is excellent.

Yep

> Let's try it again with fifo_batch at 16:
>
>  0  5  0    80 303936  3960  49288    0    0  2520     0  1092   174   0 100   0
>  0  5  0    80 302400  3996  50776    0    0  3040     0  1094   172  20  80   0
>  0  5  0    80 301164  4032  51988    0    0  2504     0  1082   150   0 100   0
>  0  5  0    80 299708  4060  53412    0    0  2904     0  1084   149   0 100   0
>  1  5  1    80 164640  4060 186784    0    0  1344 26720  1104   891   1  99   0
>  0  6  2    80 138900  4060 212088    0    0   280  7928  1039   226   0 100   0
>  0  6  2    80 134992  4064 215928    0    0  1512  7704  1100   226   0 100   0
>  0  6  2    80 130880  4068 219976    0    0  1928  9688  1124   245  17  83   0
>  0  6  2    80 123316  4084 227432    0    0  2664  8200  1125   283  11  89   0
>
> That looks acceptable. Writes took quite a bit of punishment, but
> the VM should cope with that OK.
>
> It'd be interesting to know why read_expire and writes_starved have
> no effect, while fifo_batch has a huge effect.
>
> I'd like to gain a solid understanding of what these three knobs do.
> Could you explain that a little more?

Sure. The reason you are not seeing a big change with read expire, is
that you basically only have one thread issuing reads. Once you start
flooding the queue with more threads doing reads, then read expire just
puts a lid on the max latency that will incur. So you are probably not
hitting the read expire logic at all, or just slightly.

The three tunables are:

read_expire. This one controls how old a request can be, before we
attempt to move it to the dispatch queue. This is the starvation logic
for the read list. When a read expires, the other knobs control what the
behaviour is.

fifo_batch. This one controls how big a batch of requests we move from
the sort lists to the dispatch queue. The idea was that we don't want to
move single requests, since that might cause seek storms. Instead we
move a batch of requests, starting at the expire head for reads if
necessary, along the sorted list to the dispatch queue. fifo_batch is
the total cost that can be endured, a total of seeks and non-seeky
requests. With your fifo_batch at 16, we can only move one seeky request
to the dispatch queue. Or we can move 16 non-seeky requests. Or a few
non-seeky requests, and a seeky one. You get the idea.

writes_starved. This controls how many times reads get preferred over
writes. The default is 2, which means that we can serve two batches of
reads over one write batch. A value of 4 would mean that reads could
skip ahead of writes 4 times. A value of 1 would give you 1:1
read:write, ie no read preference. A silly value of 0 would give you
write preference, always.

Hope this helps?

> During development I'd suggest the below patch, to add
> /proc/sys/vm/read_expire, fifo_batch and writes_starved - it beats
> recompiling each time.

It sure does, I either want to talk Al into making the ioschedfs (better
name will be selected :-) or try and do it myself so we can do this
properly.

> I'll test scsi now.

Cool. I found a buglet that causes incorrect accounting when moving
requests if the dispatch queue is not empty. Attached.

===== drivers/block/deadline-iosched.c 1.1 vs edited =====
--- 1.1/drivers/block/deadline-iosched.c	Wed Sep 25 21:16:26 2002
+++ edited/drivers/block/deadline-iosched.c	Thu Sep 26 08:33:35 2002
@@ -254,6 +254,15 @@
 	struct list_head *sort_head = &dd->sort_list[rq_data_dir(rq)];
 	sector_t last_sec = dd->last_sector;
 	int batch_count = dd->fifo_batch;
+
+	/*
+	 * if dispatch is non-empty, disregard last_sector and check last one
+	 */
+	if (!list_empty(dd->dispatch)) {
+		struct request *__rq = list_entry_rq(dd->dispatch->prev);
+
+		last_sec = __rq->sector + __rq->nr_sectors;
+	}
 
 	do {
 		struct list_head *nxt = rq->queuelist.next;

-- 
Jens Axboe
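Jens's description of fifo_batch as a cost budget, together with the last_sec fix in the patch above, can be sketched roughly as follows. This is a simplified userspace model and not the kernel code: struct req, batch_start_sector() and fill_batch() are hypothetical names, requests sit in plain arrays instead of kernel lists, a contiguous request is assumed to cost 1, and a request is assumed to be moved before its cost is charged. The seek cost of 16 used in the examples is an inference from the remark that fifo_batch=16 admits only one seeky request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified request: start sector and length in sectors. */
struct req {
    unsigned long sector;
    unsigned long nr_sectors;
};

/*
 * Where a contiguous request would have to start. Mirrors the idea of
 * the last_sec fix: if the dispatch queue is non-empty, use the end of
 * its last request instead of the remembered last_sector.
 */
static unsigned long batch_start_sector(const struct req *dispatch,
                                        size_t n_dispatch,
                                        unsigned long last_sector)
{
    if (n_dispatch > 0) {
        const struct req *tail = &dispatch[n_dispatch - 1];
        return tail->sector + tail->nr_sectors;
    }
    return last_sector;
}

/*
 * Count how many requests from a sector-sorted array fit in one batch.
 * A request that continues where the previous one ended costs 1, any
 * other request costs seek_cost. Moving stops once the budget is spent.
 */
static size_t fill_batch(const struct req *sorted, size_t n,
                         unsigned long last_sec,
                         int fifo_batch, int seek_cost)
{
    int budget = fifo_batch;
    size_t moved = 0;

    assert(seek_cost > 0);

    while (moved < n && budget > 0) {
        const struct req *rq = &sorted[moved];

        /* Charge the request, then remember where it ends. */
        budget -= (rq->sector == last_sec) ? 1 : seek_cost;
        last_sec = rq->sector + rq->nr_sectors;
        moved++;
    }
    return moved;
}
```

Under these assumptions a budget of 16 moves sixteen back-to-back requests, or a single request that needs a seek, or a few contiguous requests followed by one seek. If last_sec were stale (the bug the patch addresses), a request that really is contiguous with the tail of the dispatch queue would be charged as a seek and the batch would end early.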
From: Jens Axboe
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 08:59:51 +0200

On Thu, Sep 26 2002, Jens Axboe wrote:
> > Now alter fifo_batch, everything else default:
> >
> >     fifo_batch (units)    time cat kernel/*.c (secs)
> >     64                    5.0
> >     32                    2.0
> >     16                    0.2
> >     8                     0.17
> >
> > OK, that's a winner.
>
> Cool, I'm resting benchmarks with 16 as the default now. I fear this
> might be too aggressive, and that 32 will be a decent value.

fifo_batch=16 drops throughput slightly on tiobench, however it also
gives really really good interactive behaviour here. Using 32 doesn't
change that a whole lot, the throughput that is. This might just be
normal deviation between runs, more are needed to be sure.

Note that I'm testing with the last_sec patch I posted, you should too.

BTW, for SCSI, it would be nice to first convert more drivers to use the
block level queued tagging. That would provide us with a much better
means to control starvation properly on SCSI as well.

-- 
Jens Axboe
From: Patrick Mansfield
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 08:54:45 -0700

On Thu, Sep 26, 2002 at 08:59:51AM +0200, Jens Axboe wrote:
> On Thu, Sep 26 2002, Jens Axboe wrote:

> BTW, for SCSI, it would be nice to first convert more drivers to use the
> block level queued tagging. That would provide us with a much better
> means to control starvation properly on SCSI as well.
>
> --
> Jens Axboe

I haven't looked closely at the block tagging, but for the FCP protocol,
there are no tags, just the type of queueing to use (task attributes) -
like ordered, head of queue, untagged, and some others. The tagging is
normally done on the adapter itself (FCP2 protocol AFAIK). Does this mean
block level queued tagging can't help FCP?

Maybe the same for iSCSI, other protocols, and pseudo adapters - usb, ide,
and raid adapters.

-- Patrick Mansfield
From: Daniel Pittman
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 18:28:01 +1000

On Thu, 26 Sep 2002, Jens Axboe wrote:
> On Wed, Sep 25 2002, Andrew Morton wrote:

[...]

> writes_starved. This controls how many times reads get preferred over
> writes. The default is 2, which means that we can serve two batches of
> reads over one write batch. A value of 4 would mean that reads could
> skip ahead of writes 4 times. A value of 1 would give you 1:1
> read:write, ie no read preference. A silly value of 0 would give you
> write preference, always.

Actually, a value of zero doesn't sound completely silly to me, right
now, since I have been doing a lot of thinking about video capture
recently.

How much is it going to hurt a filesystem like ext[23] if that value is
set to zero while doing large streaming writes -- something like
(almost) uncompressed video at ten to twenty meg a second, for
gigabytes?

This is a situation where, for a dedicated machine, delaying reads
almost forever is actually a valuable thing. At least, valuable until it
stops the writes from being able to proceed.

        Daniel
From: Jens Axboe
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 10:29:25 +0200

On Thu, Sep 26 2002, Daniel Pittman wrote:
> On Thu, 26 Sep 2002, Jens Axboe wrote:
> > On Wed, Sep 25 2002, Andrew Morton wrote:
>
> [...]
>
> > writes_starved. This controls how many times reads get preferred over
> > writes. The default is 2, which means that we can serve two batches of
> > reads over one write batch. A value of 4 would mean that reads could
> > skip ahead of writes 4 times. A value of 1 would give you 1:1
> > read:write, ie no read preference. A silly value of 0 would give you
> > write preference, always.
>
> Actually, a value of zero doesn't sound completely silly to me, right
> now, since I have been doing a lot of thinking about video capture
> recently.
>
> How much is it going to hurt a filesystem like ext[23] if that value is
> set to zero while doing large streaming writes -- something like
> (almost) uncompressed video at ten to twenty meg a second, for
> gigabytes?

You are going to stall all reads indefinitely :-)

> This is a situation where, for a dedicated machine, delaying reads
> almost forever is actually a valuable thing. At least, valuable until it
> stops the writes from being able to proceed.

Well 0 should achieve that quite fine

-- 
Jens Axboe
From: Daniel Pittman
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Fri, 27 Sep 2002 09:23:39 +1000

On Thu, 26 Sep 2002, Jens Axboe wrote:
> On Thu, Sep 26 2002, Daniel Pittman wrote:
>> On Thu, 26 Sep 2002, Jens Axboe wrote:
>> > On Wed, Sep 25 2002, Andrew Morton wrote:
>>
>> [...]
>>
>> > writes_starved. This controls how many times reads get preferred
>> > over writes. The default is 2, which means that we can serve two
>> > batches of reads over one write batch. A value of 4 would mean that
>> > reads could skip ahead of writes 4 times. A value of 1 would give
>> > you 1:1 read:write, ie no read preference. A silly value of 0 would
>> > give you write preference, always.
>>
>> Actually, a value of zero doesn't sound completely silly to me, right
>> now, since I have been doing a lot of thinking about video capture
>> recently.
>>
>> How much is it going to hurt a filesystem like ext[23] if that value
>> is set to zero while doing large streaming writes -- something like
>> (almost) uncompressed video at ten to twenty meg a second, for
>> gigabytes?
>
> You are going to stall all reads indefinitely :-)

Which has some potentially fatal consequences, really, if any of the
capture code gets paged out before the streaming write starts, or if the
filesystem needs to read a bitmap block or so, as Rik points out.

>> This is a situation where, for a dedicated machine, delaying reads
>> almost forever is actually a valuable thing. At least, valuable until
>> it stops the writes from being able to proceed.
>
> Well 0 should achieve that quite fine

Would you consider allowing something akin to 'writes_starved = -4' to
allow writes to bypass reads only 4 times -- a preference for writes,
but not forever?

That's going to express the bias I (think I) want for this case, but
it's not going to be able to stall a read forever...

        Daniel
From: Rik van Riel
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 12:09:44 -0300 (BRT)

On Thu, 26 Sep 2002, Daniel Pittman wrote:

> > read:write, ie no read preference. A silly value of 0 would give you
> > write preference, always.

> How much is it going to hurt a filesystem like ext[23] if that value is
> set to zero while doing large streaming writes -- something like
> (almost) uncompressed video at ten to twenty meg a second, for
> gigabytes?

It depends, if you've got 2 video streams to the same filesystem
and one needs to read a block bitmap in order to allocate more
disk blocks you lose...

regards,

Rik
-- 
A: No.
Q: Should I include quotations after my reply?
From: Andrew Morton
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 00:12:41 -0700

Andrew Morton wrote:
>
> I'll test scsi now.
>

aic7xxx, Fujitsu "MAF3364L SUN36G" (36G SCA-2)

Maximum number of TCQ tags=253

    fifo_batch    time cat kernel/*.c (seconds)
    64            58
    32            54
    16            20
    8             58
    4             1:15
    2             53

Maximum number of TCQ tags=4

    fifo_batch    time cat kernel/*.c (seconds)
    64            53
    32            39
    16            33
    8             21
    4             22
    2             36
    1             22

Maximum number of TCQ tags = 0:

    fifo_batch    time cat kernel/*.c (seconds)
    64            22
    32            10.3
    16            10.5
    8             5.5
    4             3.2
    2             1.9

I selected fifo_batch=16 and altered writes_starved and read_expire
again. They made no appreciable difference.

From this I can only conclude that my poor little read was stuck
in the disk for ages while TCQ busily allowed new incoming writes
to bypass already-sent reads.

A dreadful misdesign. Unless we can control this with barriers,
and if Fujitsu is typical, TCQ is just uncontrollable. I, for
one, would not turn it on in a pink fit.
From: Jens Axboe
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 09:17:26 +0200

On Thu, Sep 26 2002, Andrew Morton wrote:
> Andrew Morton wrote:
> >
> > I'll test scsi now.
> >
>
> aic7xxx, Fujitsu "MAF3364L SUN36G" (36G SCA-2)
>
> Maximum number of TCQ tags=253
>
>     fifo_batch    time cat kernel/*.c (seconds)
>     64            58
>     32            54
>     16            20
>     8             58
>     4             1:15
>     2             53
>
> Maximum number of TCQ tags=4
>
>     fifo_batch    time cat kernel/*.c (seconds)
>     64            53
>     32            39
>     16            33
>     8             21
>     4             22
>     2             36
>     1             22
>
> Maximum number of TCQ tags = 0:
>
>     fifo_batch    time cat kernel/*.c (seconds)
>     64            22
>     32            10.3
>     16            10.5
>     8             5.5
>     4             3.2
>     2             1.9
>
> I selected fifo_batch=16 and altered writes_starved and read_expire
> again. They made no appreciable difference.

Abysmal. BTW, fifo_batch value less than seek cost doesn't make too much
sense, unless the drive has really slow streaming io performance.

> From this I can only conclude that my poor little read was stuck
> in the disk for ages while TCQ busily allowed new incoming writes
> to bypass already-sent reads.
>
> A dreadful misdesign. Unless we can control this with barriers,
> and if Fujitsu is typical, TCQ is just uncontrollable. I, for
> one, would not turn it on in a pink fit.

I have this dream that we might be able to control this if we get our
hands on the queueing at the block level. The above looks really really
bad though, in the past I've had quite good experience with a tag depth
of 4. I should try ide tcq again, to see how that goes.

-- 
Jens Axboe
From: Jens Axboe
To: linux-kernel
Subject: Re: [PATCH] deadline io scheduler
Date: Thu, 26 Sep 2002 09:34:40 +0200

Hi,

I found a small problem where hash would not contain the right request
state. Basically we updated the hash too soon, this bug was introduced
when the merge_cleanup stuff was removed.

It's not a big deal, it just means that the hash didn't catch as many
merges as it should. However for efficiency it needs to be correct, of
course :-)

Current deadline against 2.5.38-BK attached.
[patch here]
Permissions?
I couldn't read this message if I were not logged in?
re: Permissions?
I've been doing some work to the backend of this site, adding a file-based caching mechanism and a pager utility for browsing the archives. Any errors you've seen in the past week or two are likely due to my efforts. Both are now in place and seem to be fully working, so hopefully there'll be no further problems. :)