This series contains the initial writeback tracing patches from Jens, as well as the extensions I added to provide visibility into writeback control structures as the are used by the writeback code. The visibility given is sufficient to understand what is happening in the writeback path - what path is writing data, what path is blocking on congestion, etc, and to determine the differences in behaviour for different sync modes and calling contexts. This tracing really needs to be integrated into mainline so that anyone can improve the tracing as they use it to track down problems in our convoluted writeback paths. The remaining patches are fixes to problems that the new tracing highlighted. --
From: Dave Chinner <dchinner@redhat.com>
Tracing high level background writeback events is good, but it doesn't
give the entire picture. Add IO dispatched by foreground throttling to the
writeback events.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 5 ++
include/trace/events/writeback.h | 77 ++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 4 ++
3 files changed, 86 insertions(+), 0 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3f5f0a5..5214b61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -752,7 +752,11 @@ static long wb_writeback(struct bdi_writeback *wb,
wbc.more_io = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
+
+ trace_wbc_writeback_start(&wbc);
writeback_inodes_wb(wb, &wbc);
+ trace_wbc_writeback_written(&wbc);
+
args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
@@ -780,6 +784,7 @@ static long wb_writeback(struct bdi_writeback *wb,
if (!list_empty(&wb->b_more_io)) {
inode = list_entry(wb->b_more_io.prev,
struct inode, i_list);
+ trace_wbc_writeback_wait(&wbc);
inode_wait_for_writeback(inode);
}
spin_unlock(&inode_lock);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index df76457..02f34a5 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -165,6 +165,83 @@ TRACE_EVENT(writeback_bdi_register,
TP_printk("%s: %s", __entry->name,
__entry->start ? "registered" : "unregistered")
);
+
+/* pass flags explicitly */
+DECLARE_EVENT_CLASS(wbc_class,
+ TP_PROTO(struct writeback_control *wbc),
+ TP_ARGS(wbc),
+ TP_STRUCT__entry(
+ __field(unsigned int, wbc)
+ __array(char, name, 16)
+ __field(long, nr_to_write)
+ __field(long, pages_skipped)
+ __field(int, sb)
+ __field(int, sync_mode)
+ __field(int, nonblocking)
+ __field(int, ...From: Dave Chinner <dchinner@redhat.com>
If a filesystem writes more than one page in ->writepage, write_cache_pages
fails to notice this and continues to attempt writeback when wbc->nr_to_write
has gone negative - this trace was captured from XFS:
wbc_writeback_start: towrt=1024
wbc_writepage: towrt=1024
wbc_writepage: towrt=0
wbc_writepage: towrt=-1
wbc_writepage: towrt=-5
wbc_writepage: towrt=-21
wbc_writepage: towrt=-85
This has adverse effects on filesystem writeback behaviour. write_cache_pages()
needs to terminate after a certain number of pages are written, not after a
certain number of calls to ->writepage are made. Make it observe the current
value of wbc->nr_to_write and treat a value of <= 0 as though it is a either a
termination condition or a trigger to reset to MAX_WRITEḆACK_PAGES for data
integrity syncs.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/fs-writeback.c | 9 ---------
include/linux/writeback.h | 9 +++++++++
include/trace/events/writeback.h | 1 +
mm/page-writeback.c | 20 +++++++++++++++++++-
4 files changed, 29 insertions(+), 10 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5214b61..d8271d5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -675,15 +675,6 @@ void writeback_inodes_wbc(struct writeback_control *wbc)
writeback_inodes_wb(&bdi->wb, wbc);
}
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
static inline bool over_bground_thresh(void)
{
unsigned long background_thresh, dirty_thresh;
diff --git a/include/linux/writeback.h ...Honestly, this is an ugly hack. I'd rather work towards ignoring nr_to_write completely in WB_SYNC_ALL mode since it doesn't really make any sence to say "write me *safely* 5 pages". Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Be careful here. If you are going to write more pages than what the
writeback code has requested (the stupid no more than 1024 pages
restriction in the writeback code before it jumps to start writing
some other inode), you actually need to let the returned
wbc->nr_to_write go negative, so that wb_writeback() knows how many
pages it has written.
In other words, the writeback code assumes that
<orignal value of nr_to_write> - <returned wbc->nr_to_write>
is
<number of pages actually written>
If you don't let wbc->nr_to_write go negative, the writeback code will
be confused about how many pages were _actually_ written, and the
writeback code ends up writing too much. See commit 2faf2e1.
All of this is a crock of course. The file system shouldn't be
second-guessing the writeback code. Instead the writeback code should
be adaptively measuring how long it takes to were written out N pages
to a particular block device, and then decide what's the appropriate
setting for nr_to_write. What makes sense for a USB stick, or a 4200
RPM laptop drive, may not make sense for a massive RAID array....
But since we don't have that, both XFS and ext4 have workarounds for
brain-damaged writeback behaviour. (I did some testing, and even for
standard laptop drives the cap of 1024 pages is just Way Too Small;
that limit was set something like a decade ago, and everyone has been
afraid to change it, even though disks have gotten a wee bit faster
since those days.)
- Ted
--
Yes, but that does not require a negative value to get right. None of the code relies on negative nr_to_write values to do anything correctly, and all the termination checks are for wbc->nr_to-write <= 0. And the tracing shows it behaves correctly when wbc->nr_to_write = 0 on return. Requiring a negative number is not documented in any of the comments, write_cache_pages() does not return a negative number, etc, so I can't see why you think this is ext4 added a "bump" to wbc->nr_to_write, then in some cases forgot to remove it so it never returned to <= 0. Well, of course this causes writeback to write too much! But that's an ext4 bug not allowing nr_to_write to reach zero (not negative, but zero), not a Why? Writeback should just keep pushing pages down until it congests the block device. Then it throttles itself in get_request() and so writeback already adapts to the load on the device. Multiple passes of 1024 pages per dirty inode is fine for this - a larger nr_to_write doesn't get the block device to congestion any faster or XFS put a workaround in for a different reason to ext4. ext4 put it in to improve delayed allocation by working with larger chunks of pages. XFS put it in to get large IOs to be issued through submit_bio(), not to help the allocator... And to be the nasty person to shoot down your modern hardware theory: nr_to_write = 1024 pages works just fine on my laptop (XFS on indilix SSD) as well as my big test server (XFS on 12 disk RAID0) The server gets 1.5GB/s with pretty much perfect IO patterns with the fixes I posted, unlike the mess of single page IOs that occurs without them.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
In fs/fs-writeback.c, wb_writeback(), around line 774: wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write; If we want "wrote" to be reflect accurately the number of pages that the filesystem actually wrote, then if you write more pages than what That's why I put in ext4 at least initially, yes. I'm working on rewriting the ext4_writepages() code to make this unnecessary.... Have you tested with multiple files that are subject to writeout at the same time? After all, if your I/O allocator does a great job of keeping the files contiguous in chunks larger tham 4MB, then if you have two or more files that need to be written out, the page allocator will round robin between the two files in 4MB chunks, and that might not be considered an ideal I/O pattern. Regards, - Ted --
Yes, but the change I made: a) prevented it from writing more than requested in the async writeback case, and b) prevented it from going massively negative so that the higher levels wouldn't have over-accounted for pages written. And if we consider that for the sync case we actaully return the number of pages written - it's gets capped at zero even when we write a lot more than that. Hence exactly accounting for pages written is really not important. Indeed, exact number of written pages is not actually used for anything specific - only to determine if there was activity or not: 919 pages_written = wb_do_writeback(wb, 0); 920 921 if (pages_written) 4MB chunks translate into 4-8 IOs at the block layer with typical setups that set the maximum IO size to 512k or 1MB. So that is _plenty_ to keep a single disk or several disks in a RAID stripe busy before seeking to another location to do the next set of 4-8 writes. And if the drive has any amount of cache (we're seeing 64-128MB in SATA drives now), then it will be aggregating these writes in the cache into even larger sequential chunks. Hence seeks in _modern hardware_ are going to be almost entirely mitigated for most large sequential write workloads as long as the contiguous chunks are more than a few MB in size. Some numbers for you: One 4GB file (baseline): $ dd if=/dev/zero of=/mnt/scratch/$i/test bs=1024k count=4000 ..... $ sudo xfs_bmap -vp /mnt/scratch/*/test /mnt/scratch/0/test: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..4710271]: 96..4710367 0 (96..4710367) 4710272 00000 1: [4710272..8191999]: 5242976..8724703 1 (96..3481823) 3481728 00000 Ideal layout - the AG size is about 2.4GB, so it'll be two extents as we see (average gives 2GB per extent). This completed at about 440MB/s. Two 4GB files in parallel into the same directory: $ for i in `seq 0 1 1`; do dd if=/dev/zero ...
On Tue, 20 Apr 2010 12:41:53 +1000 Bug. I suggest that what you do here is remove the local `nr_to_write' from write_cache_pages() and go back to directly using wbc->nr_to_write within the loop. And thus we restore the convention that if the fs writes back more than a single page, it subtracts (nr_written - 1) from wbc->nr_to_write. --
My mistake i never expected writepage to write more than one page. The
interface said 'writepage' so it was natural to expect that it writes only
one page. BTW the reason for the change is to give file system which
accumulate dirty pages using write_cache_pages and attempt to write
them out later a chance to properly manage nr_to_write. Something like
ext4_da_writepages
-- write_cache_pages
---- collect dirty page
---- return
--return
--now try to writeout all the collected dirty pages ( say 100)
----Only able to allocate blocks for 50 pages
so update nr_to_write -= 50 and mark rest of 50 pages as dirty
again
So we want wbc->nr_to_write updated only by ext4_da_writepages.
-aneesh
--
On Fri, 30 Apr 2010 11:31:53 +0530 So you want a ->writepage() implementation which doesn't actually write a page at all - it just remembers that page for later. Maybe that fs shouldn't be calling write_cache_pages() at all. After all, write_cache_pages() is a wrapper which emits a sequence of calls to ->writepage(), and ->writepage() writes a page. Rather than hacking around, subverting things and breaking core kernel code, let's step back and more clearly think about what to do? One option would be to implement a new address_space_operation which provides the new semantics in a well-understood fashion. Let's call it writepage_prepare(?). Then reimplement write_cache_pages() so that if ->writepage_prepare() is available, it handles it in a sensible fashion and doesn't break traditional filesystems. Or simply implement a new, different version of write_cache_pages() for filesystems which wish to buffer in this fashion. The new write_cache_pages_prepare()(?) would call ->writepage_prepare(). Internally it might share implementation with write_cache_pages(). There are lots of options. But the way in which write_cache_pages() was extended to handle this ext4 requirement was rather unclean, non-obvious and, umm, broken! --
On my todo list is to fix ext4 to not call write_cache_pages() at all. We are seriously abusing that function ATM, since we're not actually writing the pages when we call write_cache_pages(). I won't go into what we're doing, because it's too embarassing, but suffice it to say that we end up calling pagevec_lookup() or pagevec_lookup_tag() *four*, count them *four* times while trying to do writeback. I have a simple patch that gives ext4 our own copy of write_cache_pages(), and then simplifies it a lot, and fixes a bunch of problems, but then I discarded it in favor of fundamentally redoing how we do writeback at all, but it's going to take a while to get things completely right. But I am working to try to fix this. If it would help, I can ressurect the "fork write_cache_pages() and simplify" patch, so ext4 isn't dependent on the mm/page-writeback.c's write_cache_pages(), if there is an immediate, short-term need to fix that function. - Ted --
sync can currently take a really long time if a concurrent writer is
extending a file. The problem is that the dirty pages on the address
space grow in the same direction as write_cache_pages scans, so if
the writer keeps ahead of writeback, the writeback will not
terminate until the writer stops adding dirty pages.
For a data integrity sync, we only need to write the pages dirty at
the time we start the writeback, so we can stop scanning once we get
to the page that was at the end of the file at the time the scan
started.
This will prevent operations like copying a large file preventing
sync from completing as it will not write back pages that were
dirtied after the sync was started. This does not impact the
existing integrity guarantees, as any dirty page (old or new)
within the EOF range at the start of the scan will still be
captured.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
mm/page-writeback.c | 15 +++++++++++++++
1 files changed, 15 insertions(+), 0 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e22af84..4ba2728 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -852,7 +852,22 @@ int write_cache_pages(struct address_space *mapping,
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
cycled = 1; /* ignore range_cyclic tests */
+
+ /*
+ * If this is a data integrity sync, cap the writeback to the
+ * current end of file. Any extension to the file that occurs
+ * after this is a new write and we don't need to write those
+ * pages out to fulfil our data integrity requirements. If we
+ * try to write them out, we can get stuck in this scan until
+ * the concurrent writer stops adding dirty pages and extending
+ * EOF.
+ */
+ if (wbc->sync_mode == WB_SYNC_ALL &&
+ wbc->range_end == LLONG_MAX) {
+ end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
+ }
}
+
retry:
done_index = index;
while (!done && (index <= end)) {
--
I guess it can still get stuck if someone does ftruncate() first, then writes to the hole? -- Jamie --
Yes, it would. It only deals with extending files because fixing the problem w.r.t. writes into holes requires something much more invasive like Jan's radix tree mark-and-sweep algorithm.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Looks good. -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Hi Dave, Thanks for adding tracing to this, it will be really useful. The fix to write_cache_pages looks really interesting, I'm going to test it on my machine. Maybe it should be a separate patch to get more visibility? Ext4 also multiplies nr_to_write, so will that need fixing too? regards Richard --
I don't see a big need to separate the series at this point. Once there's been a review and testing we can decide how to push them into mainline. IMO, the tracing is just as important as the bug No idea. I don't claim to understand ext4's convoluted delayed allocation path and all it's constraints, so I guess you'd need to ask the ext4 developers about that one. After all, with the tracing they'd be able to see if there is a problem. ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I've been snowed under trying to get several different things done all at the same time, so this has slipped. I'm trying to get back to it early next week. The nr_to_pages fix is badly needed, as that causes severe fragmentation in XFS for the exact workloads it fixes the severe fragmentation in ext4.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
