In my last set of numbers for my buffered-write deadlock fix using 2 copies per page, I realised there is no real performance hit for !uptodate pages as opposed to uptodate ones. This is unexpected because the uptodate pages only require a single copy... The problem turns out to be operator error. I forgot tmpfs won't use this prepare_write path, so sorry about that. On ext2, copy 64MB of data from /dev/zero (IO isn't involved), using 4K and 64K block sizes, and conv=notrunc for testing overwriting of uptodate pages. Numbers is elapsed time in seconds, lower is better. 2.6.20 bufferd write fix 4K 0.0742 0.1208 (1.63x) 4K-uptodate 0.0493 0.0479 (0.97x) 64K 0.0671 0.1068 (1.59x) 64K-uptodate 0.0357 0.0362 (1.01x) So we get about a 60% performance hit, which is more expected. I guess if 0.5% doesn't fly, then 60% is right out ;) If there were any misconceptions, the problem is not that the code is incredibly tricky or impossible to fix with good performance. The problem is that the existing aops interface is crap. "correct, fast, compatible -- choose any 2" So I have finally finished a first slightly-working draft of my new aops op (perform_write) proposal. I would be interested to hear comments about it. Most of my issues and concerns are in the patch headers themselves, so reply to them. The patches are against my latest buffered-write-fix patchset. This means filesystems not implementing the new aop, will remain safe, if slow. Here's some numbers after converting ext2 to the new aop: 2.6.20 perform_write aop 4K 0.0742 0.0769 (1.04x) 4K-uptodate 0.0493 0.0475 (0.96x) 64K 0.0671 0.0613 (0.91x) 64K-uptodate 0.0357 0.0343 (0.96x) Thanks, Nick -- SuSE Labs -
Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.
include/linux/fs.h | 32 ++++++++++++
mm/filemap.c | 132 ++++++++++++++++++++++++++++++++++++++++++-----------
mm/filemap.h | 103 -----------------------------------------
3 files changed, 137 insertions(+), 130 deletions(-)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -395,6 +395,38 @@ struct page;
struct address_space;
struct writeback_control;
+struct iovec_iterator {
+ const struct iovec *iov;
+ unsigned long nr_segs;
+ size_t iov_offset;
+ size_t count;
+};
+
+size_t iovec_iterator_copy_from_user_atomic(struct page *page,
+ struct iovec_iterator *i, unsigned long offset, size_t bytes);
+size_t iovec_iterator_copy_from_user(struct page *page,
+ struct iovec_iterator *i, unsigned long offset, size_t bytes);
+void iovec_iterator_advance(struct iovec_iterator *i, size_t bytes);
+int iovec_iterator_fault_in_readable(struct iovec_iterator *i);
+
+static inline void iovec_iterator_init(struct iovec_iterator *i,
+ const struct iovec *iov, unsigned long nr_segs,
+ size_t count, size_t written)
+{
+ i->iov = iov;
+ i->nr_segs = nr_segs;
+ i->iov_offset = 0;
+ i->count = count + written;
+
+ iovec_iterator_advance(i, written);
+}
+
+static inline size_t iovec_iterator_count(struct iovec_iterator *i)
+{
+ return i->count;
+}
+
+
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
#include <linux/security.h>
#include <linux/syscalls.h>
#include <linux/cpuset.h>
-#include ...iovec_iterator is an awfully long and not very descriptive name. In past discussions we named this thingy iodesc and wanted to pass it down all the I/O path, including the file operations. -
Hi Christoph, Sure I think it would be a good idea to shorten the name. And yes, although I just construct the iterator to pass into perform_write, I think it should make sense to go much further up the call stack instead of passing all those args around. iodesc seems like a fine name, so I'll use that unless anyone objects. Thanks, Nick -
i had a patch integrating the iodesc idea, but after some thought, had decided to call it struct file_io. That name reflects the fact that it's doing I/O in arbitrary lengths with byte offsets, and struct file_io *fio contrasts well with struct bio (block_io). I also had used the field ->nbytes instead of ->count, to clarify the difference between segment iterators, segment offsets, and absolute bytecount. FYI, the patch is still in the works and would convert the whole file I/O stack to use the new structure. I would like to base it off of this work as well if this makes it into -mm (as I think it should) NATE -
The field name is a good suggestion. What I have there is not actually a full-blown file io descriptor, because there is no file or offset. It is just an iovec iterator (so maybe I should rename it to iov_iter, rather than iodesc). I think it might be a nice idea to keep this iov_iter as a standalone structure, and it could be embedded into a struct file_io? Thanks, Nick -
Yeah, maybe. I'm not sure we need something as generic as a "file_io" struct. To recap, I've hoped for the expression of the state of an iovec array with a richer structure to avoid the multiple walks of the array at different parts of the kernel that previously only had access to the raw iovec * and size_t count. Stuff like the alignment checks in __blockdev_direct_IO() and the pages_in_io accounting in direct_io_worker(). I imagined building up the state in this 'iodesc' structure as we first copied and verified the structure from userspace. (say in rw_copy_check_uvector()). If as we copied we, say, stored in the bits of the buffer and length fields then by the time we got to __blockdev_direct_IO() we'd just test the bits for misalignment instead of iterating over the array again. It starts to get gross as some paths currently modify the kernel copy of the array as they process it :/. - z -
struct file_io sounds rather ugly to me, I don't know why. And it's really user I/O so we could call it struct uio (historical punt intended) :) -
Hi Nick, I really like this iterator structure - it brings together some fields that seem to get passed around seperately, even though they're linked. Also IMHO, it makes things in filemap.c easer to read... --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com -
Add a new "perform_write" aop, which replaces prepare_write and commit_write
as a single call to copy a given amount of userdata at the given offset. This
is more flexible, because the implementation can determine how to best handle
errors, or multi-page ranges (eg. it may use a gang lookup), and only requires
one call into the fs.
One problem with this interface is that it cannot be used to write into the
filesystem by any means other than already-initialised buffers via iovecs. So
prepare/commit have to stay around for non-user data...
Another thing is that it seems to be less able to be implemented in generic,
reusable code. It should be possible to introduce a new 2-op interface (or
maybe just a new error handler op) which can be used correctly in generic code.
I avoided that way in my first attempt because this seemed more elegant from
a logical POV. But after seeing the implementation, I'm having second
thoughts.
include/linux/fs.h | 6 ++++
mm/filemap.c | 64 +++++++++++++++++++++++++++++++++--------------------
2 files changed, 47 insertions(+), 23 deletions(-)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -447,6 +447,12 @@ struct address_space_operations {
*/
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+ /*
+ * perform_write replaces prepare and commit_write callbacks.
+ */
+ int (*perform_write)(struct file *, struct iovec_iterator *, loff_t);
+
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1920,8 +1920,7 @@ ...Hi Nick, sorry for my later reply, this has been on my to answer list for the last month and I only managed to get back to it now. I really like this idea, especially for avoiding to call into the allocator for every block. Have you contacted the reiser4 folks whether this would Actually I think that's a a good thing to a certain extent. It reminds us that all other users are horrible abuse of the interface. I'd even go so far as to make batch_write a callback that the filesystem passes to generic_file_aio_write to make clear it's not a generic thing but a helper. (It's not a generic thing because it's the upper layer writing into the pagecache, not a pagecache to fs below operation). The still leaves open on how to get rid of ->prepare_write and ->commit_write compltely, and for that we'll probably need ->kernel_read and ->kernel_write file operations. But that's a step you shouldn't consider yet when doing This is a rather useless comment :) Better remove it and add a proper descriptions to Documentation/filesystems/vfs.txt and Documentation/filesystems/Locking -
Hi Christoph, No worries, I haven't had much time to work on it since then anyway. I haven't yet, although that's been on my todo list when I get the API into a more final state. batch_write seems quite similar, however theirs is still page based, and a bit crufty, IMO. I found it to be really clean to just pass down offsets, but that may be a matter for debate. What they _do_ have is a write actor function that will do the data copy. This could be one possible way to get rid of ->prepare_write and ->commit_write, but I haven't tried that yet, because I don't like adding I had a couple of possibilities for that. First is passing in a write actor (eg. defaulting to the normal iovec usercopy), but as I said I consider this more like fixing the problem with brute force (ie. just making the interface more complex). Maybe as a last resort, though. Another thing that would be much nicer from _my_ point of view would be to just make all kernel users set up their data in an iovec, and use the normal call with KERNEL_DS. Unfortunately, this is not the expected way for a lot Will do. Thanks! -
Indeed. FWIW my NTFS driver does not use the generic file write helper function and instead has its own code that does the allocation first and then does the writing a page at a time much the same as generic file write does so depending on what this new interface ends up looking like exactly I may well be able to switch the NTFS driver to use it instead of doing everything by myself and duplicating a ton of code from the VFS... Best regards, -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ -
->kernel_write() as opposed to genericizing ->perform_write() would be fine with me. Just so long as we get rid of ->prepare_write and ->commit_write in that other kernel code doesn't call them directly. That interface just doesn't work for Ocfs2. There, we have the triple whammy of having to order cluster locks with page locks, avoiding nesting cluster locks in the case that the user data has to be paged in (causing a lock in ->readpage()) and grabbing / zeroing adjacent pages to fill holes. So, a combination of ->perform_write and ->kernel_write() could really help me solve my write woes. Right now I've got Ocfs2 implementing it's own lowest-level buffered write code - think generic_file_buffered_write() replacement for Ocfs2. With some duplicated code above that layer. What's nice is that I can abstract away the "copy data into some target pages" bits such that the majority of that code is re-usable for ocfs2's splice write operation. I'm not sure we could have that low a level of abstraction for anyhing above individual the file system though which also has to deal with non-kernel writes though. That's where a ->kernel_write() might come in handy. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com -
It doesn't work for any filesystem that needs slightly fancy locking. That and the reason that's an interface that doesn't fit into our layering is why I want to get rid of it. Note that fops->kernel_write might in fact use ->perform_write with an actor as Nick suggested. I'm not quite sure how it'll look like - I'd rather take care of the buffered write path first and then handle this issue once the first Why do you need your own low level buffered write functionality? As in past times when filesystems want to come up I'd like to have a very good exaplanation on why you think it's needed and whether we shouldn't improve the generic buffered write code instead. This codepath is so nasty that any duplication will be a maintaince horror. -
Fair enough - I personally tried everything I could before coming to the
conclusion that for the time being, Ocfs2 should have a seperate write path.
As you know, I've been adding sparse file support for Ocfs2. Putting aside
all the reasons to have real support for sparse files (as opposed to zeroing
allocated regions), the tree code changes alone has gotten us 90% the way to
supporting unwritten extents (much like xfs does).
Ocfs2 supports atomic data allocation units ('clusters', to use an
overloaded term) which can range in size from 4k to 1 meg. This means that
for allocating writes on page size < cluster size file systems, we have to
zero pages adjacent to the one being written so that a re-read doesn't
return dirty data. This alone requires page locking which we can't
realistically achieve via ->prepare_write() and ->commit_write(). I believe
NTFS has a similar restriction, which has lead to their own file write.
So, page locking was definitely the straw that broke the camels back. Some
other things which were akward or slightly less critically broken than the
page locking:
Since ocfs2 has a rather large (compared to a local file system) context to
build up during an allocating write, it became uncomfortable to pass that
around ->prepare_write() and ->commit_write() without putting that context
on our struct inode and protecting it with a lock. And since the existing
interfaces were so rigid, it actually required a lot more context to be
passed around than in my current code.
There's also the cluster lock / page lock inversion which we have to deal
with (it gets even worse if we fault in pages in the middle of the user copy
for a write). Granted, we fixed a lot of that before merging, but allocating
in write means taking even more cluster locks and I don't really feel
comfortable nesting so many of those within the page locks.
Finally, we get to the optimization problem - writing stuff one page at a
time. To be fair, my current stuff doesn't do a very good job ...So I've tried a different approach - the 2-op API rather than an actor. perform_write stays around as a higher performance API, but it isn't required if the filesystem implements the 2-op API. I've called them write_begin/write_end for now. There are a few upshots to doing this rather than the actor approach. First of all, this is what callers expect, they want to write into the page directly rather than making an actor. More importantly, it allows us to implement generic block versions of the API which is much more reusable than block_perform_write (which was basically useless for anything more than ext2). The API calls for the filesystem to find and lock the page itself, and pass that up to the caller, as well as handle short-writes properly, so we can solve this deadlock properly. The nice thing about this is that write_begin is basically a single page case of the first half (before the iovec copy) of perform_write, and write_end is the single page case of the second half (after the copy). So any filesystem that implements perform_write should be able to reuse write_begin/write_end components. Anyway, I'm attaching the top patches of my stack (underneath are the initial patches to solve prepare_write deadlock -- I'll repost the complete set once I get some more feedback). Sorry it is a bit under commented and not stress tested. However it does boot and run with ext2/3/shmem and other simplefses. Mark has been providing some helpful advice about the new 2-op interface, but it is still pretty much up in the air. Also note that a _lot_ of crud is there to support prepare_write (eg. pagecache_write_begin/pagecache_write_end basically become single liners once we get rid of the old API). Anyway, I think I should throw this out for comments before investing too much more time, in case everyone hates it ;) Anyway, comments much appreciated... -
Generally thiis look pretty cool. But even if we go with perform_write as aop for now (which I think is a bad idea aswell, but moving it out would better be done after all filesystems are converted) these should just stay callbacks passed to generic_perform_write. -
Convert ext2 to use ->perform_write. This uses the main loop out of
generic_perform_write, but when encountering a short usercopy, it
zeroes out new uninitialised blocks, and passes in a short-length commit
to __block_commit_write, which does the right thing (in terms of not
setting things uptodate).
fs/buffer.c | 143 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 7 ++
include/linux/buffer_head.h | 1
include/linux/pagemap.h | 2
4 files changed, 153 insertions(+)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1866,6 +1866,50 @@ next_bh:
return err;
}
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+{
+ unsigned int block_start, block_end;
+ struct buffer_head *head, *bh;
+
+ BUG_ON(!PageLocked(page));
+ if (!page_has_buffers(page))
+ return;
+
+ bh = head = page_buffers(page);
+ block_start = 0;
+ do {
+ block_end = block_start + bh->b_size;
+
+ if (buffer_new(bh)) {
+ if (block_end > from && block_start < to) {
+ if (!PageUptodate(page)) {
+ unsigned start, end;
+ void *kaddr;
+
+ start = max(from, block_start);
+ end = min(to, block_end);
+
+ kaddr = kmap_atomic(page, KM_USER0);
+ memset(kaddr+start, 0, block_end-end);
+ flush_dcache_page(page);
+ kunmap_atomic(kaddr, KM_USER0);
+ set_buffer_uptodate(bh);
+ }
+
+ /*
+ * XXX: make buffer_new behaviour more
+ * consistent.
+ * clear_buffer_new(bh);
+ */
+ mark_buffer_dirty(bh);
+ }
+ }
+
+ block_start = block_end;
+ bh = bh->b_this_page;
+ } while (bh != head);
+}
+
static int __block_commit_write(struct inode *inode, struct page *page,
unsigned from, unsigned to)
{
@@ -1900,6 +1944,105 @@ static int __block_commit_write(struct i
return 0;
}
+ssize_t block_perform_write(struct file *file, struct ...<<<<<<<<<<< here fs cat do some fs-specific stuff without making
<<<<<<<<<<< If i've understand correctly folowing scenario possible:
iteration 1: ->iovec_iterator_fault_in_readable(...) = 0
iteration 1: __block_prepare_write = {blocks allocated}
iteration 1: iovec_iterator_copy_from_user_atomic(...) = 0
iteration 1: while(iovec_iterator_count(i)) == goto next loop
iteration 2: ->iovec_iterator_fault_in_readable(...) = -EFAULT
Than breack loop .
At this point prepare_write() may have instantiated a few blocks
-
I don't see how this differs from the previous attempts to solve the deadlock via atomic copt_from_user(). Here we temporarily zero out the pagecache page then block_perform_write() unlocks the page. So another thread can come in, read the page and see the temporary zeroes? If so, that might be preventable by leaving the buffer nonuptodate. -
oh, OK, it was buffer_new(), so zeroes are the right thing for a reader to see. But if it wasn't buffer_new() then the appropriate thing for the reader to see is what's on the disk. But __block_prepare_write() won't read a buffer which is fully-inside the write area from disk. And that's seemingly OK, because if a reader gets in there after the short copy, that reader will see the non-uptodate buffer and will populate it from disk. But doing that will overwrite the data which the write() caller managed to copy into the page before it took a fault. And that's not OK because block_perform_write() does iovec_iterator_advance(i, copied) in this case and hence will not rerun the copy after acquiring the page lock? -
Hmm, yeah. This can be handled by not advancing partially into a !uptodate buffer. -
On Sat, 10 Feb 2007 02:34:07 +0100 Think so, yeah. Overall, the implementation you have there seems reasonable to me. Basically it's passing the responsibility for preventing the deadlock and the exposure-of-zeroes problem down into the filesystem itself, where we have visibility of the state of the various subsections of the page and can take appropriate actions in response to that. It's got conceptually harder to follow as a result, which is a shame. But still no magic bullet is on offer. I pity the poor schmuck who has to write ext3_journalled_perform_write(), ext3_ordered_perform_write(), ext3_writeback_perform_write(), ext3_writeback_nobh_perform_write() and all that other stuff. But I think we need to do that pretty soon to validate the whole approach. Also xfs and reiser3. NTFS will be interesting from the can-this-be-made-to-work POV. Is NFS vulnerable to the deadlock? It looks to be. Shudder. We'd need to find a way of communicating all this to the poor old fs maintainers. -
Agreed. There's lots of problems with the interface (imho), but my biggest two issues are the page lock being held on ->prepare_write() / ->commit_write() and the fact that the file system only sees the write one I like ->perform_write(). It allows the file system to make re-use of the checks in the generic write path, but gives a maximum amount of information about the overall operation to be done. There's an added advantage in that some file systems (ocfs2 is one of these) want to be more careful about ordering page locks, which should be much easier with it. If this goes in, it could probably be helpful to me in some of the code I'm currently writing which needs to be better about page lock / cluster lock ordering and wants to see as much of the (allocating) writes as possible. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com -
Yeah, if possible I like a range based interface rather than page I think it would be important to have a non trivial user of this new API before it goes into mainline. It would be great if you could look at using it, after it passes some review. Thanks, Nick -
What happened with Linus's proposal to instantiate the page as pinned, non-uptodate, unlocked and in pagecache while we poke the user address? -
That's still got a deadlock, and also it doesn't work if we want to lock the page when performing a minor fault (which I want to fix fault vs invalidate), and also assumes nobody's ->nopage locks the page or requires any resources that are held by prepare_write (something not immediately clear to me with the cluster filesystems, at least). But that all becomes legacy path, so do we really care? Supposing fs maintainers like perform_write, then after the main ones have implementations we could switch over to the slow-but-correct prepare_write legacy path. Or we could leave it, or we could use Linus's slightly-less-buggy scheme... by that point I expect I'd be sick of arguing about it ;) -
It's hard to discuss this without a description of what you want to fix It's worth "arguing" about. This is write(). What matters more?? -
That's the legacy path that uses prepare/commit (ie. supposing that all major filesystems did get converted to perform_write). Of course I would still want my correct-but-slow version in that case, but I just wouldn't care to argue if you still wanted to keep it fast. -
Where? It requires that someone hold mmap_sem for writing as well as a page lock (in an order which would require some thought). Do we ever do mutter. Could perhaps fix that by running ClearPageUptodate in invalidate, thus forcing the pagefault code to take the page lock (which we already hold). That does mean that we'll fleetingly have a non-uptodate page in pagetables which is a bit nasty. Or, probably better, we could add a new page flag (heh) telling nopage that So the rule is that ->nopage handlers against pagecache mustn't lock the page if it's already uptodate. That's OK. But it might conflict with the above invalidate proposal. We'll never, ever, ever update and test all filesytems. What you're calling "legacy" code will be there for all time. This is write(). We just cannot go and double-copy all the memory or take mmap_sem and do a full pagetable walk in there. It just means that we haven't found a suitable solution yet. -
task1 task2 task3 lock_page(page) down_read(mmap_sem) down_write(mmap_sem) down_read(mmap_sem) lock_page(page) t1 is blocked by t3 t2 is blocked by t1 t3 is blocked by t2 OK, it isn't quite ABBA, but that's shorthand if we remember that Well that _will_ be the rule if you want to do Linus's half-fix. It will I didn't say all; I still prefer correct than fast; you are still free You prefer speed over correctness even for little used filessytems, which is fine because I'm sick of arguing about it. The main thing for me is that important filesystems can be correct and fast. -
For gawd's sake. You can make the kernel less buggy by removing SMP support. You make it sound like this is a choice. It isn't. Nobody is going to go I wouldn't characterise it as "arguing". It's development. Going and sticking enormous slowdowns into write() to fix some bug which nobody is hitting is insane. We need to find a better fix, that's all. -
I agree that 60% is much too big of a hit for all filesystems. Which is IMO, once all the maintained filesystems are converted then it would be Actually I'm doing this because I try to fix real data corruption problems which people are hitting. You told me I can't get those fixes in until I I actually found perform_write to be a speedup. And if perform_write is merged then I would be happy to not fix the prepare_write path, or wait for someone to come up with a better fix. -
