[PATCH 0/4] writeback: tracing and wbc->nr_to_write fixes

Previous thread: [PATCH 4/4] xfs: remove nr_to_write writeback windup. by Dave Chinner on Monday, April 19, 2010 - 7:41 pm. (3 messages)

Next thread: [BUGFIX] [PATCH v2] x86: update all PGDs for direct mapping changes on 64bit. by Haicheng Li on Monday, April 19, 2010 - 8:28 pm. (3 messages)
From: Dave Chinner
Date: Monday, April 19, 2010 - 7:41 pm

This series contains the initial writeback tracing patches from
Jens, as well as the extensions I added to provide visibility into
writeback control structures as the are used by the writeback code.
The visibility given is sufficient to understand what is happening
in the writeback path - what path is writing data, what path is
blocking on congestion, etc, and to determine the differences in
behaviour for different sync modes and calling contexts. This
tracing really needs to be integrated into mainline so that anyone
can improve the tracing as they use it to track down problems
in our convoluted writeback paths.

The remaining patches are fixes to problems that the new tracing
highlighted.

--

From: Dave Chinner
Date: Monday, April 19, 2010 - 7:41 pm

From: Dave Chinner <dchinner@redhat.com>

Tracing high level background writeback events is good, but it doesn't
give the entire picture. Add IO dispatched by foreground throttling to the
writeback events.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c                |    5 ++
 include/trace/events/writeback.h |   77 ++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c              |    4 ++
 3 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3f5f0a5..5214b61 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -752,7 +752,11 @@ static long wb_writeback(struct bdi_writeback *wb,
 		wbc.more_io = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
+
+		trace_wbc_writeback_start(&wbc);
 		writeback_inodes_wb(wb, &wbc);
+		trace_wbc_writeback_written(&wbc);
+
 		args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 
@@ -780,6 +784,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (!list_empty(&wb->b_more_io))  {
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_list);
+			trace_wbc_writeback_wait(&wbc);
 			inode_wait_for_writeback(inode);
 		}
 		spin_unlock(&inode_lock);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index df76457..02f34a5 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -165,6 +165,83 @@ TRACE_EVENT(writeback_bdi_register,
 	TP_printk("%s: %s", __entry->name,
 			__entry->start ? "registered" : "unregistered")
 );
+
+/* pass flags explicitly */
+DECLARE_EVENT_CLASS(wbc_class,
+	TP_PROTO(struct writeback_control *wbc),
+	TP_ARGS(wbc),
+	TP_STRUCT__entry(
+		__field(unsigned int, wbc)
+		__array(char, name, 16)
+		__field(long, nr_to_write)
+		__field(long, pages_skipped)
+		__field(int, sb)
+		__field(int, sync_mode)
+		__field(int, nonblocking)
+		__field(int, ...
From: Dave Chinner
Date: Monday, April 19, 2010 - 7:41 pm

From: Dave Chinner <dchinner@redhat.com>

If a filesystem writes more than one page in ->writepage, write_cache_pages
fails to notice this and continues to attempt writeback when wbc->nr_to_write
has gone negative - this trace was captured from XFS:


    wbc_writeback_start: towrt=1024
    wbc_writepage: towrt=1024
    wbc_writepage: towrt=0
    wbc_writepage: towrt=-1
    wbc_writepage: towrt=-5
    wbc_writepage: towrt=-21
    wbc_writepage: towrt=-85

This has adverse effects on filesystem writeback behaviour. write_cache_pages()
needs to terminate after a certain number of pages are written, not after a
certain number of calls to ->writepage are made. Make it observe the current
value of wbc->nr_to_write and treat a value of <= 0 as though it is a either a
termination condition or a trigger to reset to MAX_WRITEḆACK_PAGES for data
integrity syncs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c                |    9 ---------
 include/linux/writeback.h        |    9 +++++++++
 include/trace/events/writeback.h |    1 +
 mm/page-writeback.c              |   20 +++++++++++++++++++-
 4 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5214b61..d8271d5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -675,15 +675,6 @@ void writeback_inodes_wbc(struct writeback_control *wbc)
 	writeback_inodes_wb(&bdi->wb, wbc);
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
-
 static inline bool over_bground_thresh(void)
 {
 	unsigned long background_thresh, dirty_thresh;
diff --git a/include/linux/writeback.h ...
From: Jan Kara
Date: Thursday, April 22, 2010 - 12:07 pm

Honestly, this is an ugly hack. I'd rather work towards ignoring
nr_to_write completely in WB_SYNC_ALL mode since it doesn't really make
any sence to say "write me *safely* 5 pages".

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: tytso
Date: Saturday, April 24, 2010 - 8:33 pm

Be careful here.  If you are going to write more pages than what the
writeback code has requested (the stupid no more than 1024 pages
restriction in the writeback code before it jumps to start writing
some other inode), you actually need to let the returned
wbc->nr_to_write go negative, so that wb_writeback() knows how many
pages it has written.

In other words, the writeback code assumes that 

  <orignal value of nr_to_write> - <returned wbc->nr_to_write>

is

  <number of pages actually written>

If you don't let wbc->nr_to_write go negative, the writeback code will
be confused about how many pages were _actually_ written, and the
writeback code ends up writing too much.  See commit 2faf2e1.

All of this is a crock of course.  The file system shouldn't be
second-guessing the writeback code.  Instead the writeback code should
be adaptively measuring how long it takes to were written out N pages
to a particular block device, and then decide what's the appropriate
setting for nr_to_write.  What makes sense for a USB stick, or a 4200
RPM laptop drive, may not make sense for a massive RAID array....

But since we don't have that, both XFS and ext4 have workarounds for
brain-damaged writeback behaviour.  (I did some testing, and even for
standard laptop drives the cap of 1024 pages is just Way Too Small;
that limit was set something like a decade ago, and everyone has been
afraid to change it, even though disks have gotten a wee bit faster
since those days.)

    	   	      	       	     	 - Ted
--

From: Dave Chinner
Date: Sunday, April 25, 2010 - 6:49 pm

Yes, but that does not require a negative value to get right.  None
of the code relies on negative nr_to_write values to do anything
correctly, and all the termination checks are for wbc->nr_to-write
<= 0. And the tracing shows it behaves correctly when
wbc->nr_to_write = 0 on return. Requiring a negative number is not
documented in any of the comments, write_cache_pages() does not
return a negative number, etc, so I can't see why you think this is

ext4 added a "bump" to wbc->nr_to_write, then in some cases forgot
to remove it so it never returned to <= 0. Well, of course this
causes writeback to write too much! But that's an ext4 bug not
allowing nr_to_write to reach zero (not negative, but zero), not a

Why? Writeback should just keep pushing pages down until it congests
the block device. Then it throttles itself in get_request() and so
writeback already adapts to the load on the device.  Multiple passes
of 1024 pages per dirty inode is fine for this - a larger
nr_to_write doesn't get the block device to congestion any faster or

XFS put a workaround in for a different reason to ext4. ext4 put it
in to improve delayed allocation by working with larger chunks of
pages. XFS put it in to get large IOs to be issued through
submit_bio(), not to help the allocator...

And to be the nasty person to shoot down your modern hardware
theory: nr_to_write = 1024 pages works just fine on my laptop (XFS
on indilix SSD) as well as my big test server (XFS on 12 disk RAID0)
The server gets 1.5GB/s with pretty much perfect IO patterns with
the fixes I posted, unlike the mess of single page IOs that occurs
without them....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: tytso
Date: Sunday, April 25, 2010 - 7:43 pm

In fs/fs-writeback.c, wb_writeback(), around line 774:

   		      wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;

If we want "wrote" to be reflect accurately the number of pages that
the filesystem actually wrote, then if you write more pages than what

That's why I put in ext4 at least initially, yes.  I'm working on
rewriting the ext4_writepages() code to make this unnecessary....


Have you tested with multiple files that are subject to writeout at
the same time?  After all, if your I/O allocator does a great job of
keeping the files contiguous in chunks larger tham 4MB, then if you
have two or more files that need to be written out, the page allocator
will round robin between the two files in 4MB chunks, and that might
not be considered an ideal I/O pattern.

Regards,

					- Ted
--

From: Dave Chinner
Date: Monday, April 26, 2010 - 8:30 pm

Yes, but the change I made:

	a) prevented it from writing more than requested in the
	   async writeback case, and
	b) prevented it from going massively negative so that the
	   higher levels wouldn't have over-accounted for pages
	   written.

And if we consider that for the sync case we actaully return the
number of pages written - it's gets capped at zero even when we
write a lot more than that.

Hence exactly accounting for pages written is really not important.
Indeed, exact number of written pages is not actually used for
anything specific - only to determine if there was activity or not:

 919                 pages_written = wb_do_writeback(wb, 0);
 920
 921                 if (pages_written)


4MB chunks translate into 4-8 IOs at the block layer with typical
setups that set the maximum IO size to 512k or 1MB. So that is
_plenty_ to keep a single disk or several disks in a RAID stripe
busy before seeking to another location to do the next set of 4-8
writes. And if the drive has any amount of cache (we're seeing
64-128MB in SATA drives now), then it will be aggregating these writes in
the cache into even larger sequential chunks. Hence seeks in _modern
hardware_ are going to be almost entirely mitigated for most large
sequential write workloads as long as the contiguous chunks are more
than a few MB in size.

Some numbers for you:

One 4GB file (baseline):

$ dd if=/dev/zero of=/mnt/scratch/$i/test bs=1024k count=4000
.....
$ sudo xfs_bmap -vp /mnt/scratch/*/test
/mnt/scratch/0/test:
 EXT: FILE-OFFSET         BLOCK-RANGE      AG AG-OFFSET     TOTAL FLAGS
   0: [0..4710271]:       96..4710367       0 (96..4710367) 4710272 00000
   1: [4710272..8191999]: 5242976..8724703  1 (96..3481823) 3481728 00000

Ideal layout - the AG size is about 2.4GB, so it'll be two extents
as we see (average gives 2GB per extent). This completed at about 440MB/s.

Two 4GB files in parallel into the same directory:

$ for i in `seq 0 1 1`; do dd if=/dev/zero ...
From: Andrew Morton
Date: Thursday, April 29, 2010 - 2:39 pm

On Tue, 20 Apr 2010 12:41:53 +1000

Bug.


I suggest that what you do here is remove the local `nr_to_write' from
write_cache_pages() and go back to directly using wbc->nr_to_write
within the loop.

And thus we restore the convention that if the fs writes back more than
a single page, it subtracts (nr_written - 1) from wbc->nr_to_write.

--

From: Aneesh Kumar K. V
Date: Thursday, April 29, 2010 - 11:01 pm

My mistake i never expected writepage to write more than one page. The
interface said 'writepage' so it was natural to expect that it writes only
one page. BTW the reason for the change is to give file system which
accumulate dirty pages using write_cache_pages and attempt to write
them out later a chance to properly manage nr_to_write. Something like

ext4_da_writepages
-- write_cache_pages
---- collect dirty page
---- return
--return
--now try to writeout all the collected dirty pages ( say 100)
----Only able to allocate blocks for 50 pages
    so update nr_to_write -= 50 and mark rest of 50 pages as dirty
    again

So we want wbc->nr_to_write updated only by ext4_da_writepages. 


-aneesh
--

From: Andrew Morton
Date: Friday, April 30, 2010 - 12:43 pm

On Fri, 30 Apr 2010 11:31:53 +0530


So you want a ->writepage() implementation which doesn't actually write
a page at all - it just remembers that page for later.

Maybe that fs shouldn't be calling write_cache_pages() at all.  After
all, write_cache_pages() is a wrapper which emits a sequence of calls
to ->writepage(), and ->writepage() writes a page.

Rather than hacking around, subverting things and breaking core kernel
code, let's step back and more clearly think about what to do?

One option would be to implement a new address_space_operation which
provides the new semantics in a well-understood fashion.  Let's call it
writepage_prepare(?).  Then reimplement write_cache_pages() so that if
->writepage_prepare() is available, it handles it in a sensible fashion
and doesn't break traditional filesystems.

Or simply implement a new, different version of write_cache_pages() for
filesystems which wish to buffer in this fashion.  The new
write_cache_pages_prepare()(?) would call ->writepage_prepare(). 
Internally it might share implementation with write_cache_pages().

There are lots of options.  But the way in which write_cache_pages()
was extended to handle this ext4 requirement was rather unclean,
non-obvious and, umm, broken!


--

From: tytso
Date: Saturday, May 1, 2010 - 12:47 pm

On my todo list is to fix ext4 to not call write_cache_pages() at all.
We are seriously abusing that function ATM, since we're not actually
writing the pages when we call write_cache_pages().  I won't go into
what we're doing, because it's too embarassing, but suffice it to say
that we end up calling pagevec_lookup() or pagevec_lookup_tag()
*four*, count them *four* times while trying to do writeback.

I have a simple patch that gives ext4 our own copy of
write_cache_pages(), and then simplifies it a lot, and fixes a bunch
of problems, but then I discarded it in favor of fundamentally redoing
how we do writeback at all, but it's going to take a while to get
things completely right.  But I am working to try to fix this.

If it would help, I can ressurect the "fork write_cache_pages() and
simplify" patch, so ext4 isn't dependent on the mm/page-writeback.c's
write_cache_pages(), if there is an immediate, short-term need to fix
that function.

						- Ted
--

From: Dave Chinner
Date: Monday, April 19, 2010 - 8:40 pm

sync can currently take a really long time if a concurrent writer is
extending a file. The problem is that the dirty pages on the address
space grow in the same direction as write_cache_pages scans, so if
the writer keeps ahead of writeback, the writeback will not
terminate until the writer stops adding dirty pages.

For a data integrity sync, we only need to write the pages dirty at
the time we start the writeback, so we can stop scanning once we get
to the page that was at the end of the file at the time the scan
started.

This will prevent operations like copying a large file preventing
sync from completing as it will not write back pages that were
dirtied after the sync was started. This does not impact the
existing integrity guarantees, as any dirty page (old or new)
within the EOF range at the start of the scan will still be
captured.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/page-writeback.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e22af84..4ba2728 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -852,7 +852,22 @@ int write_cache_pages(struct address_space *mapping,
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		cycled = 1; /* ignore range_cyclic tests */
+
+		/*
+		 * If this is a data integrity sync, cap the writeback to the
+		 * current end of file. Any extension to the file that occurs
+		 * after this is a new write and we don't need to write those
+		 * pages out to fulfil our data integrity requirements. If we
+		 * try to write them out, we can get stuck in this scan until
+		 * the concurrent writer stops adding dirty pages and extending
+		 * EOF.
+		 */
+		if (wbc->sync_mode == WB_SYNC_ALL &&
+		    wbc->range_end == LLONG_MAX) {
+			end = i_size_read(mapping->host) >> PAGE_CACHE_SHIFT;
+		}
 	}
+
 retry:
 	done_index = index;
 	while (!done && (index <= end)) {
--

From: Jamie Lokier
Date: Tuesday, April 20, 2010 - 4:28 pm

I guess it can still get stuck if someone does ftruncate() first, then
writes to the hole?

-- Jamie
--

From: Dave Chinner
Date: Tuesday, April 20, 2010 - 4:31 pm

Yes, it would. It only deals with extending files because fixing the
problem w.r.t.  writes into holes requires something much more
invasive like Jan's radix tree mark-and-sweep algorithm....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Jan Kara
Date: Thursday, April 22, 2010 - 12:13 pm

Looks good.

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Richard Kennedy
Date: Tuesday, April 20, 2010 - 5:02 am

Hi Dave,

Thanks for adding tracing to this, it will be really useful.

The fix to write_cache_pages looks really interesting, I'm going to test
it on my machine. Maybe it should be a separate patch to get more
visibility?

Ext4 also multiplies nr_to_write, so will that need fixing too?

regards
Richard
--

From: Dave Chinner
Date: Tuesday, April 20, 2010 - 4:29 pm

I don't see a big need to separate the series at this point. Once
there's been a review and testing we can decide how to push them
into mainline. IMO, the tracing is just as important as the bug

No idea. I don't claim to understand ext4's convoluted delayed
allocation path and all it's constraints, so I guess you'd need to
ask the ext4 developers about that one. After all, with the tracing
they'd be able to see if there is a problem. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Dave Chinner
Date: Friday, May 21, 2010 - 5:09 pm

I've been snowed under trying to get several different things done
all at the same time, so this has slipped. I'm trying to get back to
it early next week. The nr_to_pages fix is badly needed, as that
causes severe fragmentation in XFS for the exact workloads it fixes
the severe fragmentation in ext4....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

Previous thread: [PATCH 4/4] xfs: remove nr_to_write writeback windup. by Dave Chinner on Monday, April 19, 2010 - 7:41 pm. (3 messages)

Next thread: [BUGFIX] [PATCH v2] x86: update all PGDs for direct mapping changes on 64bit. by Haicheng Li on Monday, April 19, 2010 - 8:28 pm. (3 messages)