Hi!
sync_file_range(SYNC_FILE_RANGE_WRITE) blocks ... which makes problem
for s2disk: there we want to start writeout as early as possible
(system is going to shut down after write, and we need the data on
disk).
Unfortuantely, sync_file_range(SYNC_FILE_RANGE_WRITE) blocks, which
does not work for us. Is there non-blocking variant? "Start writeout
on this fd, but don't block me"?
For now I'm doing:
static inline int start_writeout(int fd)
{
#ifdef SYS_sync_file_range
if (fork())
return 0;
else {
syscall(SYS_sync_file_range, fd,
(loff_t)0, (loff_t)0, SYNC_FILE_RANGE_WRITE);
exit(0);
}
#else
errno = ENOSYS;
return -1;
#endif
}
...but I'd prefer something more elegant...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
I guess there are lots of reasons why it may block (get rescheduled) briefly, but why would that matter to you? Are you saying that its whole design has got broken somehow, and now SYNC_FILE_RANGE_WRITE is behaving as if SYNC_FILE_RANGE_WAIT_AFTER had been supplied too? Hugh --
It appears to me like it includes WAIT_AFTER, yes. I was not sure what the expected behaviour was... lets say we have a lot of dirty data (like 40MB) and system with enough free memory. Is it reasonable to expect SYNC_FILE_RANGE_WRITE to return pretty much immediately? (like in less than 10msec)? Because it seems to take more like a second here... (Underlying 'file' is actually /dev/sda1 -- aka my swap partition, but that should not matter --right?) Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Right (so long as you're not swapping to it at the same time!). And it seems to be behaving the same way on a regular file. All I can say so far is that I find the same as you do: SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time, more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too. Which make the sync_file_range call pretty pointless: your usage seems perfectly reasonable to me, but somehow we've broken its behaviour. I'll be investigating ... Hugh --
It will block on disk queue fullness - sysrq-W will tell. --
Ah, thank you. What a disappointment, though it's understandable. Doesn't that very severely limit the usefulness of the system call? I admit the flag isn't called SYNC_FILE_RANGE_WRITE_WITHOUT_WAITING, but I don't suppose Pavel and I are the only ones misled by it. Hugh --
A bit. The request queue size is runtime tunable though. I expect major users of this system call will be applications which do small-sized overwrites into large files, mainly databases. That is, once the application developers discover its existence. I'm still getting expressions of wonder from people who I tell about the Yup, this caveat/restriction should be in the manpage. --
Which /sys is that? What happens if I set the queue size to pretty Hey, you have one user now, its called s2disk. But for this call to be useful, we'd need asynchronous variant... is there such thing? Michael, this is something for you I guess? And andrew, something for you: --- SYNC_FILE_RANGE_WRITE may and will block. Document that. Signed-off-by: Pavel Machek <pavel@suse.cz> --- commit 5db78da3d8e6fa527bfe384ded2ff7c835592fe2 tree 4c405e07be12f0a2260492fb43d19802ff7ebab1 parent 0ea376de01be797f9563c2c2464149f8f0af6329 author Pavel <pavel@amd.ucw.cz> Sun, 01 Jun 2008 13:39:25 +0200 committer Pavel <pavel@amd.ucw.cz> Sun, 01 Jun 2008 13:39:25 +0200 fs/sync.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/sync.c b/fs/sync.c index 228e17b..54e9f20 100644 --- a/fs/sync.c +++ b/fs/sync.c @@ -139,7 +139,8 @@ asmlinkage long sys_fdatasync(unsigned i * before performing the write. * * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the - * range which are not presently under writeback. + * range which are not presently under writeback. Notice that even this this + * may and will block if you attempt to write more than request queue size. * * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range * after performing the write. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
In theory, no - it's always caused problems when the VM/VFS/FS layer has relied upon request-queue exhaustion for throttling. Hence all that code is supposed to work OK when there is no request-queue Well if you're asking the syscall to shove more data into the block layer than it can concurrently handle, sure, the block layer will block. It's tunable... It can still block in places, of course - we might need to do I sense a strangeness. What are you actually trying to do with all of this? Bear in mind that sync_file_range() doesn't sync metadata (ie: indirect blocks). So if they weren't already known to have been written, the um, OK. I'll fix the grammar a bit there. --
Pavel is trying to avoid me doing multithreaded s2disk, more or less. ;-) However, I have some numbers showing that multithreaded image saving actually helps a lot in case the image is compressed and encrypted. Thanks, Rafael --
Okay, so I have around 400MB of data, I want it compressed, optionally encrypted and written to partition. Now, if I do it "naturally", I do writes, followed by fsync. That's bad, because kernel does not start write out immediately, and we waste time with idle disk. (If data compress really well, or encryption is off, this is significant). So we improve on this, by doing sync_file_range(SYNC_FILE_RANGE_WRITE) periodically. That keeps the disk busy, but occassionaly blocks the cpu... wasting time (which mostly hurts in compression+encryption case). I'm not trying to use this for correctness; I'm optimizing for Thanks. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
yep. That's another use of sync_file_range(): to allow smart userspace pthread_create() ;) How about this: - Add a new SYNC_FILE_RANGE_NON_BLOCKING - If userspace set that flag, turn on writeback_control.nonblocking in __filemap_fdatawrite_range(). - test it a lot. It will be userspace's responsibility to avoid burning huge amounts of CPU repeatedly calling sync_file_range() and having it not actually write anything. --
Works for me. Is the expectation that I code this? I can certainly Ok... I guess doing 10x sync_file_range() when writing 400MB of data is not excessive? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Something like this:
fs/sync.c | 3 ++-
include/linux/fs.h | 4 +++-
mm/filemap.c | 9 +++++----
3 files changed, 10 insertions(+), 6 deletions(-)
diff -puN mm/filemap.c~a mm/filemap.c
--- a/mm/filemap.c~a
+++ a/mm/filemap.c
@@ -207,7 +207,7 @@ static int sync_page_killable(void *word
* be waited upon, and not just skipped over.
*/
int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
- loff_t end, int sync_mode)
+ loff_t end, int sync_mode, int nonblocking)
{
int ret;
struct writeback_control wbc = {
@@ -215,6 +215,7 @@ int __filemap_fdatawrite_range(struct ad
.nr_to_write = mapping->nrpages * 2,
.range_start = start,
.range_end = end,
+ .nonblocking = !!nonblocking,
};
if (!mapping_cap_writeback_dirty(mapping))
@@ -227,7 +228,7 @@ int __filemap_fdatawrite_range(struct ad
static inline int __filemap_fdatawrite(struct address_space *mapping,
int sync_mode)
{
- return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
+ return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode, 0);
}
int filemap_fdatawrite(struct address_space *mapping)
@@ -239,7 +240,7 @@ EXPORT_SYMBOL(filemap_fdatawrite);
static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
loff_t end)
{
- return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
+ return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL, 0);
}
/**
@@ -430,7 +431,7 @@ int filemap_write_and_wait_range(struct
if (mapping->nrpages) {
err = __filemap_fdatawrite_range(mapping, lstart, lend,
- WB_SYNC_ALL);
+ WB_SYNC_ALL, 0);
/* See comment of filemap_write_and_wait() */
if (err != -EIO) {
int err2 = wait_on_page_writeback_range(mapping,
diff -puN fs/sync.c~a fs/sync.c
--- a/fs/sync.c~a
+++ a/fs/sync.c
@@ -268,7 +268,8 @@ int do_sync_mapping_range(struct address
if (flags & SYNC_FILE_RANGE_WRITE) {
ret = ...Though this fits very easily into the current kernel implementation, I don't think it's the right interface for userspace. If we do go this kind of a way, then I'd say SYNC_FILE_RANGE_NON_BLOCKING needs to tell the caller how far it got before giving up, rather than just success or failure. Why? um, um, because it feels right; and would help the caller help the kernel by not overloading it with needlessly repeated loop ranges - any stronger reasons? But sync_file_range() was defined to return int rather than ssize_t, so that becomes awkward. Never mind, I don't think it is the right way anyway. We don't need additions to the existing sync_file_range() interface, we just need it to behave as naive people like Pavel and I expected it to behave in the first place: SYNC_FILE_RANGE_WRITE should be nonblocking (with respect to queue congestion, and maybe page locking also). I was imagining that where the existing nonblocking code just gives up, the SYNC_FILE_RANGE_WRITE case should schedule the remaining work to be done a little later: possibly by poking and/or leaving info for pdflush. I guess there may be some resource fairness issues, with either approach: it ought not to be unreasonable for a process to proceed by writing a page then SYNC_FILE_RANGE_WRITing that page, page by page, should it? But once we claim nonblocking at the user interface, I expect we'll come up against the raciness in the current nonblocking treatment: just because bdi is not congested when it's tested doesn't mean we won't block when the write is submitted. Perhaps a BIO_RW_NONBLOCK could fix that up? Hugh --
Well, frankly, I'm not sure if we need anything better than we already have. In fact my numbers show that SYNC_FILE_RANGE_WRITE works quite well - please see http://sourceforge.net/mailarchive/message.php?msg_name=200806020122.36193.rjw%40sisk.pl "early writeout" means that SYNC_FILE_RANGE_WRITE is used and the file with the results is attached for convenience. My interpretation of the results is here: http://sourceforge.net/mailarchive/forum.php?thread_name=200806021238.17100.rjw%40sisk... Thanks, Rafael
If the current SYNC_FILE_RANGE_WRITE is good enough for you, great, keep on using it. Forgive me, I'm not much concerned with your particular use, I'll happily leave that to you and Pavel! What bothers me is Pavel's discovery that a WAIT|WRITE|WAIT interface, designed to help callers by allowing those stages to be separated, ends up doing a significant amount of WAITing in the WRITE stage. Hugh --
Ehm, lets get the history right, please :-) The block layer pretty much doesn't care about how large the queue size is, it's largely at 128 to prevent the vm from shitting itself like it has done in the past (and continues to do I guess, though your reply leaves me wondering). So you think the vm will be fine with a huge number of requests? It wont go nuts scanning and reclaiming, wasting oodles of CPU cycles? -- Jens Axboe --
Interesting. I wonder. I may be quite wrong (Cc'ed Rik and Lee who I think are currently most in touch with what goes on there), but my impression is that whereas vmscan.c takes pages off LRU while it's doing writeback on them, and arranges for them to go back to the reclaimable tail of the LRU once writeback completes (the rotate reclaimable business), pages under writeback which it did not itself initiate will get put back on LRU at the head (with write completion still pending). If unbounded queues turn out to be viable in other respects (which would be very attractive for SYNC_PAGE_RANGE_WRITE), even if not, then maybe shrink_page_list needs to be able to take pages already under Writeback off the LRU and set them Reclaimable. Hugh --
On Mon, 2 Jun 2008 13:40:28 +0100 (BST) The problem with lots of CFQ requests is simpler. If you look at balance_dirty_pages() you will see that it only takes dirty and NFS unstable pages into account when checking the dirty limit. The pages that are in flight (under IO) are not counted at all. If you have 8192 CFQ requests and large streaming IO, most of the IOs will be mergeable and it is possible to pin all of memory in in-flight (PG_writeback - and other?) pages. I suspect that balance_dirty_pages() will have to take the writeback pages into account, though that may cause problems with FUSE. Maybe it should at least wait for some IO to complete? -- All rights reversed. --
On Mon, 16 Jun 2008 21:54:06 -0700
Mmm, you are right. It is counting the number of writeback
pages.
It just isn't waiting on them. After moving enough pages from
dirty state to writeback state, it breaks out of the loop and
allows memory dirtying to continue.
if (pages_written >= write_chunk)
break; /* We've done our duty */
I wonder if we should wait for some IO to complete, or at least
do a simple congestion_wait() if the number of writeback pages
is very large?
--
All rights reversed.
--
The VFS did screw up a couple of times with unbounded queues. It did get fixed and it is a design objective for the writeback code to _not_ depend upon request exhaustion for proper behaviour. But it hasn't had a large amount of testing with unbounded queues and there may still be problems in there. --
Pavel,
Just to confirm: you are meaning that the sentence
Notice that even this this may and will block if you attempt
to write more than request queue size.
should be added to the man page under the description of
SYNC_FILE_RANGE_WRITE, right?
Cheers,
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html
--
Yes. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
