Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?

Previous thread: [GIT]: Networking by David Miller on Friday, May 30, 2008 - 3:04 am. (1 message)

Next thread: [PATCH 1/3] KVM: Expose get_eoi_gsi for IRQ acking for assigned devices. by Amit Shah on Friday, May 30, 2008 - 3:27 am. (4 messages)
From: Pavel Machek
Date: Friday, May 30, 2008 - 3:26 am

Hi!

sync_file_range(SYNC_FILE_RANGE_WRITE) blocks ... which makes problem
for s2disk: there we want to start writeout as early as possible
(system is going to shut down after write, and we need the data on
disk).

Unfortuantely, sync_file_range(SYNC_FILE_RANGE_WRITE) blocks, which
does not work for us. Is there non-blocking variant? "Start writeout
on this fd, but don't block me"?

For now I'm doing:

static inline int start_writeout(int fd)
{
#ifdef SYS_sync_file_range
        if (fork())
                return 0;
        else {
                syscall(SYS_sync_file_range, fd,
                        (loff_t)0, (loff_t)0, SYNC_FILE_RANGE_WRITE);
                exit(0);
        }
#else
        errno = ENOSYS;
        return -1;
#endif
}

...but I'd prefer something more elegant...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Hugh Dickins
Date: Friday, May 30, 2008 - 6:58 am

I guess there are lots of reasons why it may block (get rescheduled)
briefly, but why would that matter to you?  Are you saying that its
whole design has got broken somehow, and now SYNC_FILE_RANGE_WRITE
is behaving as if SYNC_FILE_RANGE_WAIT_AFTER had been supplied too?

Hugh
--

From: Pavel Machek
Date: Friday, May 30, 2008 - 1:43 pm

It appears to me like it includes WAIT_AFTER, yes.

I was not sure what the expected behaviour was... lets say we have a
lot of dirty data (like 40MB) and system with enough free memory. Is
it reasonable to expect SYNC_FILE_RANGE_WRITE to return pretty much
immediately? (like in less than 10msec)? Because it seems to take more
like a second here...

(Underlying 'file' is actually /dev/sda1 -- aka my swap partition, but
that should not matter --right?)
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Hugh Dickins
Date: Saturday, May 31, 2008 - 11:44 am

Right (so long as you're not swapping to it at the same time!).
And it seems to be behaving the same way on a regular file.

All I can say so far is that I find the same as you do:
SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.

Which make the sync_file_range call pretty pointless: your usage seems
perfectly reasonable to me, but somehow we've broken its behaviour.
I'll be investigating ...

Hugh
--

From: Andrew Morton
Date: Saturday, May 31, 2008 - 5:39 pm

It will block on disk queue fullness - sysrq-W will tell.

--

From: Hugh Dickins
Date: Sunday, June 1, 2008 - 12:23 am

Ah, thank you.  What a disappointment, though it's understandable.
Doesn't that very severely limit the usefulness of the system call?

I admit the flag isn't called SYNC_FILE_RANGE_WRITE_WITHOUT_WAITING,
but I don't suppose Pavel and I are the only ones misled by it.

Hugh
--

From: Andrew Morton
Date: Sunday, June 1, 2008 - 1:15 am

A bit.  The request queue size is runtime tunable though.

I expect major users of this system call will be applications which do
small-sized overwrites into large files, mainly databases.  That is,
once the application developers discover its existence.  I'm still
getting expressions of wonder from people who I tell about the

Yup, this caveat/restriction should be in the manpage.
--

From: Pavel Machek
Date: Sunday, June 1, 2008 - 4:40 am

Which /sys is that? What happens if I set the queue size to pretty

Hey, you have one user now, its called s2disk. But for this call to be
useful, we'd need asynchronous variant... is there such thing?


Michael, this is something for you I guess?

And andrew, something for you:

--- 

SYNC_FILE_RANGE_WRITE may and will block. Document that.

Signed-off-by: Pavel Machek <pavel@suse.cz>

---
commit 5db78da3d8e6fa527bfe384ded2ff7c835592fe2
tree 4c405e07be12f0a2260492fb43d19802ff7ebab1
parent 0ea376de01be797f9563c2c2464149f8f0af6329
author Pavel <pavel@amd.ucw.cz> Sun, 01 Jun 2008 13:39:25 +0200
committer Pavel <pavel@amd.ucw.cz> Sun, 01 Jun 2008 13:39:25 +0200

 fs/sync.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 228e17b..54e9f20 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -139,7 +139,8 @@ asmlinkage long sys_fdatasync(unsigned i
  * before performing the write.
  *
  * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the
- * range which are not presently under writeback.
+ * range which are not presently under writeback. Notice that even this this
+ *  may and will block if you attempt to write more than request queue size.
  *
  * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range
  * after performing the write.


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Andrew Morton
Date: Sunday, June 1, 2008 - 1:37 pm

In theory, no - it's always caused problems when the VM/VFS/FS layer
has relied upon request-queue exhaustion for throttling.  Hence all
that code is supposed to work OK when there is no request-queue

Well if you're asking the syscall to shove more data into the block
layer than it can concurrently handle, sure, the block layer will
block.  It's tunable...

It can still block in places, of course - we might need to do

I sense a strangeness.  What are you actually trying to do with all of this?

Bear in mind that sync_file_range() doesn't sync metadata (ie: indirect
blocks).  So if they weren't already known to have been written, the

um, OK. I'll fix the grammar a bit there.
--

From: Rafael J. Wysocki
Date: Sunday, June 1, 2008 - 3:00 pm

Pavel is trying to avoid me doing multithreaded s2disk, more or less. ;-)

However, I have some numbers showing that multithreaded image saving actually
helps a lot in case the image is compressed and encrypted.

Thanks,
Rafael
--

From: Pavel Machek
Date: Sunday, June 1, 2008 - 3:22 pm

Okay, so I have around 400MB of data, I want it compressed, optionally
encrypted and written to partition.

Now, if I do it "naturally", I do writes, followed by fsync.

That's bad, because kernel does not start write out immediately, and
we waste time with idle disk. (If data compress really well, or
encryption is off, this is significant).

So we improve on this, by doing sync_file_range(SYNC_FILE_RANGE_WRITE)
periodically. That keeps the disk busy, but occassionaly blocks the
cpu... wasting time (which mostly hurts in compression+encryption
case).


I'm not trying to use this for correctness; I'm optimizing for

Thanks.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Andrew Morton
Date: Sunday, June 1, 2008 - 3:47 pm

yep.  That's another use of sync_file_range(): to allow smart userspace

pthread_create() ;)

How about this:

- Add a new SYNC_FILE_RANGE_NON_BLOCKING

- If userspace set that flag, turn on writeback_control.nonblocking
  in __filemap_fdatawrite_range().

- test it a lot.

It will be userspace's responsibility to avoid burning huge amounts of
CPU repeatedly calling sync_file_range() and having it not actually write
anything.


--

From: Pavel Machek
Date: Sunday, June 1, 2008 - 4:00 pm

Works for me. Is the expectation that I code this? I can certainly

Ok... I guess doing 10x sync_file_range() when writing 400MB of data
is not excessive?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Andrew Morton
Date: Sunday, June 1, 2008 - 4:11 pm

Something like this:

 fs/sync.c          |    3 ++-
 include/linux/fs.h |    4 +++-
 mm/filemap.c       |    9 +++++----
 3 files changed, 10 insertions(+), 6 deletions(-)

diff -puN mm/filemap.c~a mm/filemap.c
--- a/mm/filemap.c~a
+++ a/mm/filemap.c
@@ -207,7 +207,7 @@ static int sync_page_killable(void *word
  * be waited upon, and not just skipped over.
  */
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
-				loff_t end, int sync_mode)
+				loff_t end, int sync_mode, int nonblocking)
 {
 	int ret;
 	struct writeback_control wbc = {
@@ -215,6 +215,7 @@ int __filemap_fdatawrite_range(struct ad
 		.nr_to_write = mapping->nrpages * 2,
 		.range_start = start,
 		.range_end = end,
+		.nonblocking = !!nonblocking,
 	};
 
 	if (!mapping_cap_writeback_dirty(mapping))
@@ -227,7 +228,7 @@ int __filemap_fdatawrite_range(struct ad
 static inline int __filemap_fdatawrite(struct address_space *mapping,
 	int sync_mode)
 {
-	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
+	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode, 0);
 }
 
 int filemap_fdatawrite(struct address_space *mapping)
@@ -239,7 +240,7 @@ EXPORT_SYMBOL(filemap_fdatawrite);
 static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end)
 {
-	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
+	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL, 0);
 }
 
 /**
@@ -430,7 +431,7 @@ int filemap_write_and_wait_range(struct 
 
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
-						 WB_SYNC_ALL);
+						 WB_SYNC_ALL, 0);
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
 			int err2 = wait_on_page_writeback_range(mapping,
diff -puN fs/sync.c~a fs/sync.c
--- a/fs/sync.c~a
+++ a/fs/sync.c
@@ -268,7 +268,8 @@ int do_sync_mapping_range(struct address
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
 		ret = ...
From: Hugh Dickins
Date: Monday, June 2, 2008 - 1:43 am

Though this fits very easily into the current kernel implementation,
I don't think it's the right interface for userspace.

If we do go this kind of a way, then I'd say SYNC_FILE_RANGE_NON_BLOCKING
needs to tell the caller how far it got before giving up, rather than just
success or failure.  Why? um, um, because it feels right; and would help
the caller help the kernel by not overloading it with needlessly repeated
loop ranges - any stronger reasons?  But sync_file_range() was defined
to return int rather than ssize_t, so that becomes awkward.

Never mind, I don't think it is the right way anyway.  We don't need
additions to the existing sync_file_range() interface, we just need it
to behave as naive people like Pavel and I expected it to behave in the
first place: SYNC_FILE_RANGE_WRITE should be nonblocking (with respect
to queue congestion, and maybe page locking also).

I was imagining that where the existing nonblocking code just gives up,
the SYNC_FILE_RANGE_WRITE case should schedule the remaining work to be
done a little later: possibly by poking and/or leaving info for pdflush.

I guess there may be some resource fairness issues, with either approach:
it ought not to be unreasonable for a process to proceed by writing a
page then SYNC_FILE_RANGE_WRITing that page, page by page, should it?

But once we claim nonblocking at the user interface, I expect we'll
come up against the raciness in the current nonblocking treatment:
just because bdi is not congested when it's tested doesn't mean we
won't block when the write is submitted.  Perhaps a BIO_RW_NONBLOCK
could fix that up?

Hugh
--

From: Rafael J. Wysocki
Date: Monday, June 2, 2008 - 4:18 am

Well, frankly, I'm not sure if we need anything better than we already have.
In fact my numbers show that SYNC_FILE_RANGE_WRITE works quite well -
please see
http://sourceforge.net/mailarchive/message.php?msg_name=200806020122.36193.rjw%40sisk.pl
"early writeout" means that SYNC_FILE_RANGE_WRITE is used and the file with
the results is attached for convenience.

My interpretation of the results is here:
http://sourceforge.net/mailarchive/forum.php?thread_name=200806021238.17100.rjw%40sisk...

Thanks,
Rafael
From: Hugh Dickins
Date: Monday, June 2, 2008 - 5:11 am

If the current SYNC_FILE_RANGE_WRITE is good enough for you, great,
keep on using it.  Forgive me, I'm not much concerned with your
particular use, I'll happily leave that to you and Pavel!  What
bothers me is Pavel's discovery that a WAIT|WRITE|WAIT interface,
designed to help callers by allowing those stages to be separated,
ends up doing a significant amount of WAITing in the WRITE stage.

Hugh
--

From: Jens Axboe
Date: Monday, June 2, 2008 - 4:43 am

Ehm, lets get the history right, please :-)

The block layer pretty much doesn't care about how large the queue
size is, it's largely at 128 to prevent the vm from shitting itself
like it has done in the past (and continues to do I guess, though
your reply leaves me wondering).

So you think the vm will be fine with a huge number of requests?
It wont go nuts scanning and reclaiming, wasting oodles of CPU
cycles?

-- 
Jens Axboe

--

From: Hugh Dickins
Date: Monday, June 2, 2008 - 5:40 am

Interesting.  I wonder.  I may be quite wrong (Cc'ed Rik and Lee
who I think are currently most in touch with what goes on there),
but my impression is that whereas vmscan.c takes pages off LRU
while it's doing writeback on them, and arranges for them to go
back to the reclaimable tail of the LRU once writeback completes
(the rotate reclaimable business), pages under writeback which
it did not itself initiate will get put back on LRU at the head
(with write completion still pending).

If unbounded queues turn out to be viable in other respects
(which would be very attractive for SYNC_PAGE_RANGE_WRITE), even
if not, then maybe shrink_page_list needs to be able to take pages
already under Writeback off the LRU and set them Reclaimable.

Hugh
--

From: Rik van Riel
Date: Monday, June 16, 2008 - 1:53 pm

On Mon, 2 Jun 2008 13:40:28 +0100 (BST)

The problem with lots of CFQ requests is simpler.

If you look at balance_dirty_pages() you will see that it only
takes dirty and NFS unstable pages into account when checking
the dirty limit.  The pages that are in flight (under IO) are
not counted at all.

If you have 8192 CFQ requests and large streaming IO, most of
the IOs will be mergeable and it is possible to pin all of
memory in in-flight (PG_writeback - and other?) pages.

I suspect that balance_dirty_pages() will have to take the
writeback pages into account, though that may cause problems
with FUSE.  Maybe it should at least wait for some IO to
complete?

-- 
All rights reversed.
--

From: Andrew Morton
Date: Monday, June 16, 2008 - 9:54 pm

[Empty message]
From: Rik van Riel
Date: Tuesday, June 17, 2008 - 6:38 am

On Mon, 16 Jun 2008 21:54:06 -0700

Mmm, you are right.   It is counting the number of writeback
pages.

It just isn't waiting on them.  After moving enough pages from
dirty state to writeback state, it breaks out of the loop and
allows memory dirtying to continue.

                if (pages_written >= write_chunk)
                        break;          /* We've done our duty */

I wonder if we should wait for some IO to complete, or at least
do a simple congestion_wait() if the number of writeback pages
is very large?

-- 
All rights reversed.
--

From: Andrew Morton
Date: Monday, June 2, 2008 - 9:50 am

The VFS did screw up a couple of times with unbounded queues.  It did
get fixed and it is a design objective for the writeback code to _not_
depend upon request exhaustion for proper behaviour.

But it hasn't had a large amount of testing with unbounded queues and
there may still be problems in there.

--

From: Michael Kerrisk
Date: Tuesday, June 3, 2008 - 1:01 am

Pavel,

Just to confirm: you are meaning that the sentence

    Notice that even this this may and will block if you attempt
    to write more than request queue size.

should be added to the man page under the description of
SYNC_FILE_RANGE_WRITE, right?

Cheers,




-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html
--

Previous thread: [GIT]: Networking by David Miller on Friday, May 30, 2008 - 3:04 am. (1 message)

Next thread: [PATCH 1/3] KVM: Expose get_eoi_gsi for IRQ acking for assigned devices. by Amit Shah on Friday, May 30, 2008 - 3:27 am. (4 messages)