From: Michael Rubin <mrubin@google.com> For those of you who have waited so long. This is the third submission of the first attempt at this patch. It is a trilogy. Two changes are in this patch. They are dependant on each other. In addition we get an unintended performance improvement. Syncing 1,000,000 inodes each with 4KB of dirty data takes the original kernel 83 seconds and with the change it take 77 seconds. 1) Adding a datastructure to guarantee fairness when writing inodes to disk and simplify the code base. When inodes are parked for writeback they are parked in the flush_tree. The flush tree is a data structure based on an rb tree. Duplicate keys are handled by making a list in the tree for each key value. The order of how we choose the next inode to flush is decided by two fields. First the earliest dirtied_when value. If there are duplicate dirtied_when values then the earliest i_flush_gen value determines who gets flushed next. The flush tree organizes the dirtied_when keys with the rb_tree. Any inodes with a duplicate dirtied_when value are link listed together. This link list is sorted by the inode's i_flush_gen. When both the dirtied_when and the i_flush_gen are identical the order in the linked list determines the order we flush the inodes. 2) Added an inode flag to allow inodes to be marked so that they are never written back to disk. The motivation behind this change is several fold. The first is to insure fairness in the writeback algorithm. The second is to deal with a bug where the writing to large files concurrently to smaller ones creates a situation where writeback cannot keep up with traffic and memory baloons until the we hit the threshold watermark. This can result in surprising long latency with respect to disk traffic. This latency can take minutes. The flush tree fixes this issue and fixes several other minor issues with fairness also. Signed-off-by: Michael ...
Just a quick question, how does this interact/depend-uppon etc.. with Fengguangs patches I still have in my mailbox? (Those from Dec 28th) --
They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25. This work was done before Fengguang's patches. I am trying to test Fengguang's for comparison but am having problems with getting mm1 to boot on my systems. mrubin --
Yeah, they are independent ones. The initial motivation is to fix the bug "sluggish writeback on small+large files". Michael introduced a new rbtree, and me introduced a new list(s_more_io_wait). Basically I think rbtree is an overkill to do time based ordering. Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io provides fair queuing between small/large files, and s_more_io_wait provides waiting mechanism for blocked inodes. The time ordered rbtree may delay io for a blocked inode simply by modifying its dirtied_when and reinsert it. But it would no longer be that easy if it is to be ordered by location. If we are going to do location based ordering in the future, the lists will continue to be useful. It would simply be a matter of switching from the s_dirty(order by time) to some rbtree or radix tree(order by location). We can even provide both ordering at the same time to different fs/inodes which is configurable by the user. Because the s_dirty and/or rbtree would provide _only_ ordering(not faireness or waiting) and hence is interchangeable. This patchset could be a good reference. It does location based ordering with radix tree: [RFC][PATCH] clustered writeback <http://lkml.org/lkml/2007/8/27/45> Thank you, Fengguang --
What does the term "ordered by location" mean? Attemting to sort inodes by physical disk address? By using their i_ino as a key? list_heads are just the wrong data structure for this function. Especially list_heads which are protected by a non-sleeping lock. --
list_heads are OK if we use them for one and only function. We have been trying to jam too much into s_dirty in the past. Grabbing a refcount could be better than locking - anyway if we split the functions today, it would be easy to replace the list_heads one by one in the future. --
Not really. They're inappropriate when you wish to remember your position in the list while you dropped the lock (as we must do in writeback). A data structure which permits us to interate across the search key rather than across the actual storage locations is more appropriate. --
I totally agree with you. What I mean is to first do the split of functions - into three: ordering, starvation prevention, and blockade waiting. Then to do better ordering by adopting radix tree(or rbtree if radix tree is not enough), and lastly get rid of the list_heads to avoid locking. Does it sound like a good path? --
Does "ordering" here refer to ordering bt time-of-first-dirty? I'd have thaought that replacing list_heads with another data structure would be a simgle commit. --
Some inodes/pages cannot be synced now for some reason and should be That would be easy. s_more_io and s_more_io_wait can all be converted to radix trees. --
Given the way LBAs are located on disk and the fact that rotational latency is a large factor in changing locations of a drive head, any attempts to do a C-SCAN pass are pretty much useless. Further complicating this is any volume management that sits between the fs and the actual storage. A nice feature to have longer term is to have the write_inodes paths for background flushing understand storage congestion _through_ any volume management. This would allow us to back off background flushing on a per spindle basis (when using drives of course) and avoid write congestion in both the io scheduler and in the drive's writecaches, which I believe, but don't have hard evidence, get congested today, knocking the drive into a fifo fashion in firmware. A data structure that allows us to keep a dirtied_when values consistent across back-offs and blocking allows us to further develop the background writeout paths to get to this point (though exposing this --
In XFS, inode number is an encoding of it's location on disk, so ordering inode writeback by inode number *does* make sense. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
This code is mainly concerned with writing pagecache data, not inodes. --
Now I tend to agree with you. It is not easy to sort writeback pages by disk location. There are three main obstacles: - there can be multiple filesystems on the same disk - ext3/reiserfs/... make use of blockdev inode, which spans the whole partition; - xfs may store inode metadata/data separately; --
I think the flush_tree (which is a little more than just an rbtree) provides the same queuing mechanisms that the three or four lists heads do and manages to do it in one structure. The i_flushed_when provides the ability to have blocked inodes wait their turn so to speak. Another motivation behind the rbtree patch is to unify the data structure that handles the priority and mechanism of how we write out the pages of the inodes. There are some ideas about introducing priority schemes for QOS and such in the future. I am not saying this patch is about making that happen, but the idea is to if possible unify the four stages of lists into a single structure to facilitate efforts like that. mrubin --
Yeah, rbtree is better than list_heads after all. Let's make it happen. --
The main benefit of rbtree is possibly better support of future policies.
Can you demonstrate an example?
(grumble:
Apart from the benefit of flexibility, I don't think it makes things
simpler, nor does the introduction of rbtree automatically fixes bugs.
Bugs can only be avoided by good understanding of all possible cases.)
The most tricky writeback issues could be starvation prevention
between
- small/large files
- new/old files
- superblocks
Some kind of limit should be applied for each. They used to be:
- requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
this preempts big files
- refill s_io iif it is drained
this prevents promotion of big/old files
- return from sync_sb_inodes() after one go of s_io
(todo: don't restart from the first superblock in writeback_inodes())
this prevents busy superblock from starving others
and ensures fairness between superblocks
Introduce i_flush_gen to help restarting from the last inode?
What do you mean by fairness? Why cannot I_WRITEBACK_NEVER be in a
More details about the fixings, please?
--
These are ill-formed thoughts as of now on my end but the idea was that keeping one tree sorted via a scheme might be simpler than So I have written tests and believe I have covered these issues. If you are concerned in specific on any and have a test case please let Once a big file gets its first do_writepages it is moved behind the other smaller files via i_flushed_when. And the same in reverse for I am not sure how this limit helps things out. Is this for superblock The basic idea behind the writeback algorithm to handle starvation. The over arching idea is that we want to preserve order of writeback based on when an inode was dirtied and also preserve the dirtied_when contents until the inode has been written back (partially or fully) Every sync_sb_inodes we find the least recent inodes dirtied. To deal with large or small starvation we have a s_flush_gen for each iteration of sync_sb_inodes every time we issue a writeback we mark that the inode cannot be processed until the next s_flush_gen. This way we don't process one get to the rest since we keep pushing them into subsequent s_fush_gen's. So originally this comment was written when I was trying to fix a bug in 2.6.23. The one where we were starving large files from being flushed. There was a fairness issue where small files were being The WRITEBACK_NEVER could be in a previous patch to the rbtree. But not a subsequent patch to the rbtree. The rbtree depends on the WRITEBACK_NEVER patch otherwise we run in to problems in generic_delete_inode. Now that you point it out I think I can split this patch into two patches and make the WRITEBACK_NEVER in the first So again this comment was written against 2.6.23. The biggest fix is the starving of big files. I remember there were other smaller issues, but there have been so many changes in the patch sets that I need to go back to quantify. --
Suppose we want to grant longer expiration window for temp files, adding a new list named s_dirty_tmpfile would be a handy solution. You mean i_flush_gen? No, sync_sb_inodes() will abort on every MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly. We should have a way to go to next superblock even if new dirty inodes or pages are emerging fast in this superblock. Fill and drain s_io only once and then abort helps. Basically you make one list_head in each rbtree node. That list_head is recycled cyclic, and is an analog to the old fashioned s_dirty. We need to know 'where we are' and 'where it ends'. So an extra indicator must be introduced - i_flush_gen. It's awkward. In fact the bug is turned-around rather than fixed - now the small OK. --
[just a random idea; i have not worked out all the implications] Would it be possible to derive a writeback apriority from the ionice level of the process originating the IO? e.g. we have long standing problems that background jobs even when niced and can cause significant slow downs to foreground processes by starving IO and pushing out pages. ionice was supposed to help with that but in practice it does not seem to have helped too much and I suspect it needs more prioritization higher up the VM food chain. Adding such priorities to writeback would seem like a step in the right direction, although it would of course not solve the problem completely. -Andi --
We have some code internally that does just this, though it slightly abuses struct page by tagging pages with the highest priority that dirties them. I'm not sure what a better solution is, though there has been talk about rewritting it to use a per mapping radix tree to keep mem_map small. The idea is to mark current->ioprio into each page as they are dirtied, higher priority overrides lower priority. Buffer_heads are done in the same way. Come IO submission, bio's get the highest IO priority associated with the submitted pages/buffer_heads and these are passed into the the struct request's into the scheduler. Similar changes are underway for making sure all AIO get the right ioprios. It will take some cleaning to get it ready for lkml submission, though I'm still a bit unsure of what we should do to avoid struct page growth. Mike Waychison --
No idea - but it makes a good example ;-) But for those making different filesystems for /tmp, /var, /data etc, Good idea. Michael may well be considering similar interfaces :-) --
I'm not interested in WRITEBACK_NEVER or location based writeback Yes, exactly. And then it will continue to sync the big one again. It will never be able to move forward to the next dirtied_when before We need a limit and continuing scheme at each level. It was so hard to Hehe, I guess the bug is still there in 2.6.24-rc3. But should be gone in the latest patchset. --
This seems suboptimal for large files. If you keep feeding in new least recently dirtied files, the large files will never get an unimpeded go at the disk and hence we'll struggle to get decent bandwidth under anything but pure large file write loads. Fairness is a tradeoff between seeks and bandwidth. Ideally, we want to spend 50% of the *disk* time servicing sequential writes and 50% of the time servicing seeky writes - that way neither get penalised unfairly by the other type of workload. Switching inodes during writeback implies a seek to the new write location, while continuing to write the same inode has no seek penalty because the writeback is sequential. It follows from this that allowing larges file a disproportionate amount of data writeback is desirable. Also, cycling rapidly through all the large files to write 4MB to each is going to cause us to spend time seeking rather than writing compared to cycling slower and writing 40MB from each large file at a time. i.e. servicing one large file for 100ms is going to result in higher writeback throughput than servicing 10 large files for 10ms each because there's going to be less seeking and more writing done by the disks. That is, think of large file writes like process scheduler batch jobs - bulk throughput is what matters, so the larger the time slice you give them the higher the throughput. IMO, the sort of result we should be looking at is a writeback design that results in cycling somewhat like: slice 1: iterate over small files slice 2: flush large file 1 slice 3: iterate over small files slice 4: flush large file 2 ...... slice n-1: flush large file N slice n: iterate over small files slice n+1: flush large file N+1 So that we keep the disk busy with a relatively fair mix of small and large I/Os while both are necessary. Furthermore, as disk bandwidth goes up, the relationship between large file and small file writes changes if we want to maintain writeback at a significant ...
True. But IMO, locality ordering really only impacts the small file data writes and the inodes themselves because there is typically lots of seeks in doing that. For large sequential writes to a file, writing a significant chunk of data gives that bit of writeback it's own locality because it does not cause seeks. Hence simply writing large enough chunks avoids any need to order the writeback by locality. Hence I writeback ordering by locality more a function of Then the filesystem is not doing it's job. Fragmentation does not happen very frequently in XFS for large files - that is one of the reasons it is extremely good for large files and high For large files it is overkill. For filesystems that do delayed allocation, it is often impossible (no block mapping until the writeback is executed unless it's an overwrite). At this point, I'd say it is best to leave it to the filesystem and the elevator to do their jobs properly. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group --
If we can sync fast enough, the lower layer would be able to merge those 4MB requests. Whatever the slice size is, we end up sending the same set of pages to disk: the writable pages of expired inodes. We only have to worry about choosing the right set of pages being written in one single batch(in every pdflush wakeup time). That set is Slow queues go full first. Currently the writeback code will skip _and_ congestion_wait() for congested filesystems. The better policy is to congestion_wait() _after_ all other writable pages have been synced. The same have been done for blocked inodes/pages in my We might do write-behind for large files :-) --
