Andrew, Here are the two writeback livelock fixes (patch 3, 4) from Jan Kara. The series starts with a preparation patch and carries two more debugging patches. Thanks, Fengguang --
From: Jan Kara <jack@suse.cz>
Check whether background writeback is needed after finishing each work.
When bdi flusher thread finishes doing some work check whether any kind
of background writeback needs to be done (either because
dirty_background_ratio is exceeded or because we need to start flushing
old inodes). If so, just do background write back.
This way, bdi_start_background_writeback() just needs to wake up the
flusher thread. It will do background writeback as soon as there is no
other work.
This is a preparatory patch for the next patch which stops background
writeback as soon as there is other work to do.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 61 +++++++++++++++++++++++++++++++++-----------
1 file changed, 46 insertions(+), 15 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2010-11-06 23:55:25.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2010-11-07 21:54:19.000000000 +0800
@@ -84,13 +84,9 @@ static inline struct inode *wb_inode(str
return list_entry(head, struct inode, i_wb_list);
}
-static void bdi_queue_work(struct backing_dev_info *bdi,
- struct wb_writeback_work *work)
+/* Wakeup flusher thread or forker thread to fork it. Requires bdi->wb_lock. */
+static void bdi_wakeup_flusher(struct backing_dev_info *bdi)
{
- trace_writeback_queue(bdi, work);
-
- spin_lock_bh(&bdi->wb_lock);
- list_add_tail(&work->list, &bdi->work_list);
if (bdi->wb.task) {
wake_up_process(bdi->wb.task);
} else {
@@ -98,15 +94,26 @@ static void bdi_queue_work(struct backin
* The bdi thread isn't there, wake up the forker thread which
* will create and run it.
*/
- trace_writeback_nothread(bdi, work);
wake_up_process(default_backing_dev_info.wb.task);
}
+}
+
+static void bdi_queue_work(struct backing_dev_info *bdi,
+ struct wb_writeback_work *work)
+{
+ trace_writeback_queue(bdi, ...In WB_SYNC_ALL mode, filesystems are not expected to skip dirty pages on temporal lock contentions or non fatal errors, otherwise sync() will return without actually syncing the skipped pages. Add a check to catch possible redirty_page_for_writepage() callers that violate this expectation. I'd recommend to keep this check in -mm tree for some time and fixup the possible warnings before pushing it to upstream. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 1 + 1 file changed, 1 insertion(+) --- linux-next.orig/fs/fs-writeback.c 2010-11-07 22:01:06.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2010-11-07 22:01:15.000000000 +0800 @@ -527,6 +527,7 @@ static int writeback_sb_inodes(struct su * buffers. Skip this inode for now. */ redirty_tail(inode); + WARN_ON_ONCE(wbc->sync_mode == WB_SYNC_ALL); } spin_unlock(&inode_lock); iput(inode); --
On Tue, 09 Nov 2010 07:09:21 +0800 This is quite kernel-developer-unfriendly. Suppose the warning triggers. Now some poor schmuck looks at the warning and doesn't have a *clue* why it was added. He has to run off and grovel through git trees finding changelogs, which is a real pain if the code has been trivially altered since it was first added. As a general rule, a kernel developer should be able to look at a warning callsite and then work out why the warning was emitted! IOW, you owe us a code comment, please. --
Good point! I'll add this comment. + /* + * There's no logic to retry skipped pages for sync(), + * filesystems are assumed not to skip dirty pages on + * temporal lock contentions or non fatal errors. + */ + WARN_ON_ONCE(wbc->sync_mode == WB_SYNC_ALL); IOW, if some FS triggers this warning and it's non-trivial to fix the FS, we'll have to work out a sync retry scheme for skipped pages. Thanks, Fengguang --
From: Jan Kara <jack@suse.cz> Background writeback are easily livelockable (from a definition of their target). This is inconvenient because it can make sync(1) stall forever waiting on its queued work to be finished. Generally, when a flusher thread has some work queued, someone submitted the work to achieve a goal more specific than what background writeback does. So it makes sense to give it a priority over a generic page cleaning. Thus we interrupt background writeback if there is some other work to do. We return to the background writeback after completing all the queued work. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 9 +++++++++ 1 file changed, 9 insertions(+) --- linux-next.orig/fs/fs-writeback.c 2010-11-07 21:56:42.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2010-11-07 22:00:51.000000000 +0800 @@ -651,6 +651,15 @@ static long wb_writeback(struct bdi_writ break; /* + * Background writeout and kupdate-style writeback are + * easily livelockable. Stop them if there is other work + * to do so that e.g. sync can proceed. + */ + if ((work->for_background || work->for_kupdate) && + !list_empty(&wb->bdi->work_list)) + break; + + /* * For background writeout, stop when we are below the * background dirty threshold */ --
On Tue, 09 Nov 2010 07:09:19 +0800 *why* is background writeback easily livelockable? Under which Again, why? Because there are works queued from the flusher thread, but that thread is stuck in a livelocked state in <unspecified code location> so it is unable to service the other works? But the pocess which called sync() will as a last resort itself perform all the So... what prevents higher priority works (eg, sync(1)) from livelocking or seriously retarding background or kudate writeout? --
Hi, New description which should address above questions: Background writeback is easily livelockable in a loop in wb_writeback() by a process continuously re-dirtying pages (or continuously appending to a file). This is in fact intended as the target of background writeback is to write dirty pages it can find as long as we are over dirty_background_threshold. But the above behavior gets inconvenient at times because no other work queued in the flusher thread's queue gets processed. In particular, since e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1) can hang forever waiting for flusher thread to do the work. Generally, when a flusher thread has some work queued, someone submitted the work to achieve a goal more specific than what background writeback does. Moreover by working on the specific work, we also reduce amount of dirty pages which is exactly the target of background writeout. So it makes sense to give specific work a priority over a generic page cleaning. Thus we interrupt background writeback if there is some other work to do. We return to the background writeback after completing all the queued work. If other work than background or kupdate writeout livelocks, it's a bug which should be fixed (either by setting sensible nr_to_write or by tagging like we do it for WB_SYNC_ALL writeback). Of course, higher priority work can be running when background or kupdate writeout would need to run as well. But the idea here is that the purpose of background/kupdate types of writeout is to get rid of dirty data and any type of writeout does this so working on it we also work on background/kupdate writeout only possibly less efficiently. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
On Tue, 9 Nov 2010 23:28:27 +0100 That's fixable by doing the work synchronously within sync_inodes_sb(), rather than twiddling thumbs wasting a thread resource while waiting for kernel threads to do it. As an added bonus, this even makes cpu time accounting more accurate ;) Please remind me why we decided to hand the sync_inodes_sb() work off The kupdate function is a data-integrity/quality-of-service sort of thing. And what I'm asking is whether this change enables scenarios in which these threads can be kept so busy that the kupdate function gets interrupted so frequently that we can have dirty memory not being written back for arbitrarily long periods of time? --
Because when sync(1) does IO on it's own, it competes for the device with
So let me compare:
What kupdate writeback does:
queue inodes older than dirty_expire_centisecs
while some inode in the queue
write MAX_WRITEBACK_PAGES from each inode queued
break if nr_to_write <= 0
What any other WB_SYNC_NONE writeback (let me call it "normal WB_SYNC_NONE
writeback") does:
queue all dirty inodes
while some inode in the queue
write MAX_WRITEBACK_PAGES from each inode queued
break if nr_to_write <= 0
There only one kind of WB_SYNC_ALL writeback - the one which writes
everything.
So after WB_SYNC_ALL writeback (provided all livelocks are fixed ;)
obviously no old data should be unwritten in memory. Normal WB_SYNC_NONE
writeback differs from a kupdate one *only* in the fact that we queue all
inodes instead of only the old ones. We start writing old inodes first and
go inode by inode writing MAX_WRITEBACK_PAGES from each. Now because the
queue can be longer for normal WB_SYNC_NONE writeback, it can take longer
before we return to the old inodes. So if normal writeback interrupts
kupdate one, it can take longer before all data of old inodes get to disk.
But we always get the old data to disk - essentially at the same time at
which kupdate writeback would get them to disk if dirty_expire_centisecs
was 0.
Is this enough? Do you want any of this in the changelog?
Thanks for the inquiry btw. It made me cleanup my thoughts on the subject ;)
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
On Wed, 10 Nov 2010 00:56:32 +0100 Skeptical. Has that effect been demonstrated? Has it been shown to be a significant problem? A worse problem than livelocking the machine? ;) If this _is_ a problem then it's also a problem for fsync/msync. But whoa. That's only true for sync(), or sync_filesystem(). It isn't true for, say, fsync() and msync(). Now when someone comes along and changes fsync/msync to use flusher threads ("resulting in less seeks") then we could get into a situation where fsync-serving flusher threads never OT, but: your faith in those time-ordered inode lists is touching ;) Put a debug function in there which checks that the lists _are_ time-ordered, and call that function from every site in the kernel --
Seriously, I also doubt the value of doing sync() in the flusher thread. sync() is by definition inefficient. In the block layer, it's served with less emphasis on throughput. In the VFS layer, it may sleep in inode_wait_for_writeback() and filemap_fdatawait(). In various FS, pages won't be skipped at the cost of more lock waiting. So when a flusher thread is serving sync(), it has difficulties saturating the storage device. btw, it seems current sync() does not take advantage of the flusher threads to sync multiple disks in parallel. And I guess (concurrent) sync/fsync/msync calls will be rare, especially for really performance demanding workloads (which will optimize sync away in the first place). And I'm still worrying about the sync work (which may take long time to serve even without livelock) to delay other works considerably -- may not be a problem for now, but it will be a real priority dilemma I'm more confident on that time orderness ;) But there is a caveat: redirty_tail() may touch dirtied_when. So it merely keeps the time orderness of b_dirty on the surface. Thanks, Fengguang --
sys_sync does a wakeup_flusher_threads(0) which kicks all flusher
threads to write back all data. We then do another serialized task
of kicking per-sb writeback, and then do synchronous per-sb writeback,
interwinded with non-blocking and blocking quota sync, ->sync_fs
and __sync_blockdev calls.
The way sync ended up looking like it did is rather historic:
First Jan and I fixed sync to actually get data correctly to disk,
the way is was done traditionally had a lot of issues with ordering
it's steps. For that we changed it to just loop around sync_filesystem
to have one common place to define the proper order for it.
That caused a large performance regression, which Yanmin Zhang found
and fixed, which added back the wakeup_pdflush(0) (which later became
wakeup_flusher_threads).
The introduction of the per-bdi writeback threads by Jens changed
writeback_inodes_sb and sync_inodes_sb to offload the I/O submission
to the I/O thread.
I'm not overly happy with the current situation. Part of that is
the rather complex callchain in __sync_filesystem. If we moved the
quota sync and the sync_blockdev into ->sync_fs we'd already be down
to a much more managable level, and we could optimize sync down to:
wakeup_flusher_threads(0);
for_each_sb
sb->sync_fs(sb, 0)
for_each_sb {
sync_inodes_sb(sb);
sb->sync_fs(sb, 1)
}
We would still try to do most of the I/O from the flusher thread,
but only if we can do it non-blocking. If we have to block we'll
get less optimal I/O patterns, but at least we don't block other
writeback while waiting for it.
I suspect a big problem for the statving workloads is that we only
do the live-lock avoidance tagging inside sync_inodes_sb, so
any page written by wakeup_flusher_threads, or the writeback_inodes_sb
in the two first passes that gets redirties is synced out again.
But I'd feel very vary doing this without a lot of performance testing.
dpkg package install workloads, the ffsb create_4k test Yanmin ...I don't think Jens has tested this when merging per-bdi writeback patches. Just the argument makes sense to me (I've seen some loads where application doing IO from balance_dirty_pages() together with pdflush doing writeback did seriously harm performance and this looks like a similar scenario). But I'd be also interested whether the theory and practice Yes, but nobody submits fsync() work to a flusher thread (currently, there's no way to tell flusher thread to operate only on a single inode). But as you write below *if* somebody started doing it and fsync() would be ;-) Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Jan: for your convenience, attached is the combined patch for linux-next. Thanks, Fengguang
This tracks when balance_dirty_pages() tries to wakeup the flusher thread for background writeback (if it was not started already). Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 1 + include/trace/events/writeback.h | 1 + 2 files changed, 2 insertions(+) --- linux-next.orig/include/trace/events/writeback.h 2010-11-07 21:54:19.000000000 +0800 +++ linux-next/include/trace/events/writeback.h 2010-11-07 21:56:42.000000000 +0800 @@ -81,6 +81,7 @@ DEFINE_EVENT(writeback_class, name, \ TP_ARGS(bdi)) DEFINE_WRITEBACK_EVENT(writeback_nowork); +DEFINE_WRITEBACK_EVENT(writeback_wake_background); DEFINE_WRITEBACK_EVENT(writeback_wake_thread); DEFINE_WRITEBACK_EVENT(writeback_wake_forker_thread); DEFINE_WRITEBACK_EVENT(writeback_bdi_register); --- linux-next.orig/fs/fs-writeback.c 2010-11-07 21:54:19.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2010-11-07 21:56:42.000000000 +0800 @@ -169,6 +169,7 @@ void bdi_start_background_writeback(stru * We just wake up the flusher thread. It will perform background * writeback as soon as there is no other work to do. */ + trace_writeback_wake_background(bdi); spin_lock_bh(&bdi->wb_lock); bdi_wakeup_flusher(bdi); spin_unlock_bh(&bdi->wb_lock); --
