Re: [patch 3/3] ext2: use perform_write aop

Previous thread: [PATCH] block: blk_max_pfn is somtimes wrong by Vasily Tarasov on Thursday, February 8, 2007 - 5:39 am. (2 messages)

Next thread: OT: I don't recive mail from linux-kernel by Levitsky Maxim on Thursday, February 8, 2007 - 6:09 am. (3 messages)
From: Nick Piggin
Date: Thursday, February 8, 2007 - 6:07 am

In my last set of numbers for my buffered-write deadlock fix using 2 copies
per page, I realised there is no real performance hit for !uptodate pages
as opposed to uptodate ones. This is unexpected because the uptodate pages
only require a single copy...

The problem turns out to be operator error. I forgot tmpfs won't use this
prepare_write path, so sorry about that.

On ext2, copy 64MB of data from /dev/zero (IO isn't involved), using
4K and 64K block sizes, and conv=notrunc for testing overwriting of
uptodate pages. Numbers is elapsed time in seconds, lower is better.

		2.6.20		bufferd write fix
4K		0.0742		0.1208 (1.63x)
4K-uptodate	0.0493		0.0479 (0.97x)
64K		0.0671		0.1068 (1.59x)
64K-uptodate	0.0357		0.0362 (1.01x)

So we get about a 60% performance hit, which is more expected. I guess if
0.5% doesn't fly, then 60% is right out ;)

If there were any misconceptions, the problem is not that the code is
incredibly tricky or impossible to fix with good performance. The problem
is that the existing aops interface is crap. "correct, fast, compatible
-- choose any 2"

So I have finally finished a first slightly-working draft of my new aops
op (perform_write) proposal. I would be interested to hear comments about
it.  Most of my issues and concerns are in the patch headers themselves,
so reply to them.

The patches are against my latest buffered-write-fix patchset. This
means filesystems not implementing the new aop, will remain safe, if slow.
Here's some numbers after converting ext2 to the new aop:

		2.6.20		perform_write aop
4K		0.0742		0.0769 (1.04x)
4K-uptodate	0.0493		0.0475 (0.96x)
64K		0.0671		0.0613 (0.91x)
64K-uptodate	0.0357		0.0343 (0.96x)

Thanks,
Nick

--
SuSE Labs

-

From: Nick Piggin
Date: Thursday, February 8, 2007 - 6:07 am

Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.

 include/linux/fs.h |   32 ++++++++++++
 mm/filemap.c       |  132 ++++++++++++++++++++++++++++++++++++++++++-----------
 mm/filemap.h       |  103 -----------------------------------------
 3 files changed, 137 insertions(+), 130 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -395,6 +395,38 @@ struct page;
 struct address_space;
 struct writeback_control;
 
+struct iovec_iterator {
+	const struct iovec *iov;
+	unsigned long nr_segs;
+	size_t iov_offset;
+	size_t count;
+};
+
+size_t iovec_iterator_copy_from_user_atomic(struct page *page,
+		struct iovec_iterator *i, unsigned long offset, size_t bytes);
+size_t iovec_iterator_copy_from_user(struct page *page,
+		struct iovec_iterator *i, unsigned long offset, size_t bytes);
+void iovec_iterator_advance(struct iovec_iterator *i, size_t bytes);
+int iovec_iterator_fault_in_readable(struct iovec_iterator *i);
+
+static inline void iovec_iterator_init(struct iovec_iterator *i,
+			const struct iovec *iov, unsigned long nr_segs,
+			size_t count, size_t written)
+{
+	i->iov = iov;
+	i->nr_segs = nr_segs;
+	i->iov_offset = 0;
+	i->count = count + written;
+
+	iovec_iterator_advance(i, written);
+}
+
+static inline size_t iovec_iterator_count(struct iovec_iterator *i)
+{
+	return i->count;
+}
+
+
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/cpuset.h>
-#include ...
From: Christoph Hellwig
Date: Thursday, February 8, 2007 - 12:49 pm

iovec_iterator is an awfully long and not very descriptive name.
In past discussions we named this thingy iodesc and wanted to pass it
down all the I/O path, including the file operations.

-

From: Nick Piggin
Date: Thursday, February 8, 2007 - 6:46 pm

Hi Christoph,

Sure I think it would be a good idea to shorten the name. And yes, although
I just construct the iterator to pass into perform_write, I think it should
make sense to go much further up the call stack instead of passing all those
args around. iodesc seems like a fine name, so I'll use that unless
anyone objects.


Thanks,
Nick
-

From: Nate Diller
Date: Thursday, February 8, 2007 - 7:03 pm

i had a patch integrating the iodesc idea, but after some thought, had
decided to call it struct file_io.  That name reflects the fact that
it's doing I/O in arbitrary lengths with byte offsets, and struct
file_io *fio contrasts well with struct bio (block_io).  I also had
used the field ->nbytes instead of ->count, to clarify the difference
between segment iterators, segment offsets, and absolute bytecount.

FYI, the patch is still in the works and would convert the whole file
I/O stack to use the new structure.  I would like to base it off of
this work as well if this makes it into -mm (as I think it should)

NATE
-

From: Nick Piggin
Date: Thursday, February 8, 2007 - 8:31 pm

The field name is a good suggestion.

What I have there is not actually a full-blown file io descriptor, because
there is no file or offset. It is just an iovec iterator (so maybe I should
rename it to iov_iter, rather than iodesc). 

I think it might be a nice idea to keep this iov_iter as a standalone
structure, and it could be embedded into a struct file_io?

Thanks,
Nick
-

From: Zach Brown
Date: Friday, February 9, 2007 - 10:28 am

Yeah, maybe.

I'm not sure we need something as generic as a "file_io" struct.

To recap, I've hoped for the expression of the state of an iovec  
array with a richer structure to avoid the multiple walks of the  
array at different parts of the kernel that previously only had  
access to the raw iovec * and size_t count.

Stuff like the alignment checks in __blockdev_direct_IO() and the  
pages_in_io accounting in direct_io_worker().

I imagined building up the state in this 'iodesc' structure as we  
first copied and verified the structure from userspace.  (say in  
rw_copy_check_uvector()).

If as we copied we, say, stored in the bits of the buffer and length  
fields then by the time we got to __blockdev_direct_IO() we'd just  
test the bits for misalignment instead of iterating over the array  
again.

It starts to get gross as some paths currently modify the kernel copy  
of the array as they process it :/.

- z
-

From: Christoph Hellwig
Date: Friday, March 9, 2007 - 3:40 am

struct file_io sounds rather ugly to me, I don't know why.  And it's
really user I/O so we could call it struct uio (historical punt intended) :)

-

From: Mark Fasheh
Date: Thursday, February 8, 2007 - 4:04 pm

Hi Nick,


I really like this iterator structure - it brings together some fields that
seem to get passed around seperately, even though they're linked.

Also IMHO, it makes things in filemap.c easer to read...
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
-

From: Nick Piggin
Date: Thursday, February 8, 2007 - 6:07 am

Add a new "perform_write" aop, which replaces prepare_write and commit_write
as a single call to copy a given amount of userdata at the given offset. This
is more flexible, because the implementation can determine how to best handle
errors, or multi-page ranges (eg. it may use a gang lookup), and only requires
one call into the fs.

One problem with this interface is that it cannot be used to write into the
filesystem by any means other than already-initialised buffers via iovecs. So
prepare/commit have to stay around for non-user data... 

Another thing is that it seems to be less able to be implemented in generic,
reusable code. It should be possible to introduce a new 2-op interface (or
maybe just a new error handler op) which can be used correctly in generic code.

I avoided that way in my first attempt because this seemed more elegant from
a logical POV. But after seeing the implementation, I'm having second
thoughts.

 include/linux/fs.h |    6 ++++
 mm/filemap.c       |   64 +++++++++++++++++++++++++++++++++--------------------
 2 files changed, 47 insertions(+), 23 deletions(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -447,6 +447,12 @@ struct address_space_operations {
 	 */
 	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
 	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+	/*
+	 * perform_write replaces prepare and commit_write callbacks.
+	 */
+	int (*perform_write)(struct file *, struct iovec_iterator *, loff_t);
+
 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
 	sector_t (*bmap)(struct address_space *, sector_t);
 	void (*invalidatepage) (struct page *, unsigned long);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1920,8 +1920,7 @@ ...
From: Christoph Hellwig
Date: Friday, March 9, 2007 - 3:39 am

Hi Nick,

sorry for my later reply, this has been on my to answer list for the last
month and I only managed to get back to it now.


I really like this idea, especially for avoiding to call into the allocator
for every block.  Have you contacted the reiser4 folks whether this would

Actually I think that's a a good thing to a certain extent.  It reminds
us that all other users are horrible abuse of the interface.  I'd even
go so far as to make batch_write a callback that the filesystem passes
to generic_file_aio_write to make clear it's not a generic thing but
a helper.  (It's not a generic thing because it's the upper layer writing
into the pagecache, not a pagecache to fs below operation).

The still leaves open on how to get rid of ->prepare_write and ->commit_write
compltely, and for that we'll probably need ->kernel_read and ->kernel_write
file operations.  But that's a step you shouldn't consider yet when doing


This is a rather useless comment :)  Better remove it and add a proper
descriptions to Documentation/filesystems/vfs.txt and
Documentation/filesystems/Locking

-

From: Nick Piggin
Date: Friday, March 9, 2007 - 5:52 am

Hi Christoph,


No worries, I haven't had much time to work on it since then anyway.

I haven't yet, although that's been on my todo list when I get the API
into a more final state.

batch_write seems quite similar, however theirs is still page based, and
a bit crufty, IMO. I found it to be really clean to just pass down offsets,
but that may be a matter for debate.

What they _do_ have is a write actor function that will do the data copy.
This could be one possible way to get rid of ->prepare_write and
->commit_write, but I haven't tried that yet, because I don't like adding


I had a couple of possibilities for that. First is passing in a write actor
(eg. defaulting to the normal iovec usercopy), but as I said I consider this
more like fixing the problem with brute force (ie. just making the interface
more complex). Maybe as a last resort, though.

Another thing that would be much nicer from _my_ point of view would be to
just make all kernel users set up their data in an iovec, and use the normal
call with KERNEL_DS. Unfortunately, this is not the expected way for a lot

Will do. Thanks!

-

From: Anton Altaparmakov
Date: Friday, March 9, 2007 - 3:01 pm

Indeed.  FWIW my NTFS driver does not use the generic file write  
helper function and instead has its own code that does the allocation  
first and then does the writing a page at a time much the same as  
generic file write does so depending on what this new interface ends  
up looking like exactly I may well be able to switch the NTFS driver  
to use it instead of doing everything by myself and duplicating a ton  
of code from the VFS...

Best regards,


-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-

From: Mark Fasheh
Date: Friday, March 9, 2007 - 4:33 pm

->kernel_write() as opposed to genericizing ->perform_write() would be fine
with me. Just so long as we get rid of ->prepare_write and ->commit_write in
that other kernel code doesn't call them directly. That interface just
doesn't work for Ocfs2. There, we have the triple whammy of having to order
cluster locks with page locks, avoiding nesting cluster locks in the case
that the user data has to be paged in (causing a lock in ->readpage()) and
grabbing / zeroing adjacent pages to fill holes.

So, a combination of ->perform_write and ->kernel_write() could really help
me solve my write woes.

Right now I've got Ocfs2 implementing it's own lowest-level buffered write
code - think generic_file_buffered_write() replacement for Ocfs2. With some
duplicated code above that layer. What's nice is that I can abstract away
the "copy data into some target pages" bits such that the majority of that
code is re-usable for ocfs2's splice write operation. I'm not sure we could
have that low a level of abstraction for anyhing above individual the file
system though which also has to deal with non-kernel writes though. That's
where a ->kernel_write() might come in handy.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
-

From: Christoph Hellwig
Date: Saturday, March 10, 2007 - 2:25 am

It doesn't work for any filesystem that needs slightly fancy locking.
That and the reason that's an interface that doesn't fit into our
layering is why I want to get rid of it.  Note that fops->kernel_write
might in fact use ->perform_write with an actor as Nick suggested.
I'm not quite sure how it'll look like - I'd rather take care of the
buffered write path first and then handle this issue once the first

Why do you need your own low level buffered write functionality?  As in
past times when filesystems want to come up I'd like to have a very
good exaplanation on why you think it's needed and whether we shouldn't
improve the generic buffered write code instead.  This codepath is so nasty
that any duplication will be a maintaince horror.

-

From: Mark Fasheh
Date: Sunday, March 11, 2007 - 7:13 pm

Fair enough - I personally tried everything I could before coming to the
conclusion that for the time being, Ocfs2 should have a seperate write path.

As you know, I've been adding sparse file support for Ocfs2. Putting aside
all the reasons to have real support for sparse files (as opposed to zeroing
allocated regions), the tree code changes alone has gotten us 90% the way to
supporting unwritten extents (much like xfs does).

Ocfs2 supports atomic data allocation units ('clusters', to use an
overloaded term) which can range in size from 4k to 1 meg. This means that
for allocating writes on page size < cluster size file systems, we have to
zero pages adjacent to the one being written so that a re-read doesn't
return dirty data. This alone requires page locking which we can't
realistically achieve via ->prepare_write() and ->commit_write(). I believe
NTFS has a similar restriction, which has lead to their own file write.

So, page locking was definitely the straw that broke the camels back. Some
other things which were akward or slightly less critically broken than the
page locking:

Since ocfs2 has a rather large (compared to a local file system) context to
build up during an allocating write, it became uncomfortable to pass that
around ->prepare_write() and ->commit_write() without putting that context
on our struct inode and protecting it with a lock. And since the existing
interfaces were so rigid, it actually required a lot more context to be
passed around than in my current code.

There's also the cluster lock / page lock inversion which we have to deal
with (it gets even worse if we fault in pages in the middle of the user copy
for a write). Granted, we fixed a lot of that before merging, but allocating
in write means taking even more cluster locks and I don't really feel
comfortable nesting so many of those within the page locks.

Finally, we get to the optimization problem - writing stuff one page at a
time. To be fair, my current stuff doesn't do a very good job ...
From: Nick Piggin
Date: Wednesday, March 14, 2007 - 6:30 am

So I've tried a different approach - the 2-op API rather than an actor.

perform_write stays around as a higher performance API, but it isn't
required if the filesystem implements the 2-op API. I've called them
write_begin/write_end for now.

There are a few upshots to doing this rather than the actor approach.
First of all, this is what callers expect, they want to write into the
page directly rather than making an actor.

More importantly, it allows us to implement generic block versions of
the API which is much more reusable than block_perform_write (which was
basically useless for anything more than ext2).

The API calls for the filesystem to find and lock the page itself, and
pass that up to the caller, as well as handle short-writes properly, so
we can solve this deadlock properly.

The nice thing about this is that write_begin is basically a single
page case of the first half (before the iovec copy) of perform_write,
and write_end is the single page case of the second half (after the
copy). So any filesystem that implements perform_write should be able
to reuse write_begin/write_end components.

Anyway, I'm attaching the top patches of my stack (underneath are the
initial patches to solve prepare_write deadlock -- I'll repost the
complete set once I get some more feedback). Sorry it is a bit under
commented and not stress tested. However it does boot and run with
ext2/3/shmem and other simplefses.

Mark has been providing some helpful advice about the new 2-op interface,
but it is still pretty much up in the air.

Also note that a _lot_ of crud is there to support prepare_write (eg.
pagecache_write_begin/pagecache_write_end basically become single liners
once we get rid of the old API).

Anyway, I think I should throw this out for comments before investing too
much more time, in case everyone hates it ;)

Anyway, comments much appreciated...

-

From: Christoph Hellwig
Date: Wednesday, March 14, 2007 - 8:17 am

Generally thiis look pretty cool.  But even if we go with perform_write
as aop for now (which I think is a bad idea aswell, but moving it out
would better be done after all filesystems are converted) these should
just stay callbacks passed to generic_perform_write.
-

From: Nick Piggin
Date: Thursday, February 8, 2007 - 6:07 am

Convert ext2 to use ->perform_write. This uses the main loop out of
generic_perform_write, but when encountering a short usercopy, it
zeroes out new uninitialised blocks, and passes in a short-length commit
to __block_commit_write, which does the right thing (in terms of not
setting things uptodate).

 fs/buffer.c                 |  143 ++++++++++++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c             |    7 ++
 include/linux/buffer_head.h |    1 
 include/linux/pagemap.h     |    2 
 4 files changed, 153 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1866,6 +1866,50 @@ next_bh:
 	return err;
 }
 
+void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+{
+	unsigned int block_start, block_end;
+	struct buffer_head *head, *bh;
+
+	BUG_ON(!PageLocked(page));
+	if (!page_has_buffers(page))
+		return;
+
+	bh = head = page_buffers(page);
+	block_start = 0;
+	do {
+		block_end = block_start + bh->b_size;
+
+		if (buffer_new(bh)) {
+			if (block_end > from && block_start < to) {
+				if (!PageUptodate(page)) {
+					unsigned start, end;
+					void *kaddr;
+
+					start = max(from, block_start);
+					end = min(to, block_end);
+
+					kaddr = kmap_atomic(page, KM_USER0);
+					memset(kaddr+start, 0, block_end-end);
+					flush_dcache_page(page);
+					kunmap_atomic(kaddr, KM_USER0);
+					set_buffer_uptodate(bh);
+				}
+
+				/*
+				 * XXX: make buffer_new behaviour more
+				 * consistent.
+				 * clear_buffer_new(bh);
+				 */
+				mark_buffer_dirty(bh);
+			}
+		}
+
+		block_start = block_end;
+		bh = bh->b_this_page;
+	} while (bh != head);
+}
+
 static int __block_commit_write(struct inode *inode, struct page *page,
 		unsigned from, unsigned to)
 {
@@ -1900,6 +1944,105 @@ static int __block_commit_write(struct i
 	return 0;
 }
 
+ssize_t block_perform_write(struct file *file, struct ...
From: Dmitriy Monakhov
Date: Thursday, February 8, 2007 - 7:47 am

<<<<<<<<<<< here fs cat do some fs-specific stuff without making
<<<<<<<<<<<  If i've understand correctly folowing scenario possible:
         iteration 1: ->iovec_iterator_fault_in_readable(...)  = 0           
         iteration 1: __block_prepare_write  = {blocks allocated}
         iteration 1: iovec_iterator_copy_from_user_atomic(...) = 0 
         iteration 1: while(iovec_iterator_count(i))  == goto next loop

         iteration 2: ->iovec_iterator_fault_in_readable(...)  = -EFAULT
                      Than breack loop .
         At this point prepare_write() may have instantiated a few blocks

-

From: Andrew Morton
Date: Friday, February 9, 2007 - 12:14 pm

I don't see how this differs from the previous attempts to solve the
deadlock via atomic copt_from_user().  Here we temporarily zero out the
pagecache page then block_perform_write() unlocks the page.  So another
thread can come in, read the page and see the temporary zeroes?  

If so, that might be preventable by leaving the buffer nonuptodate.

-

From: Andrew Morton
Date: Friday, February 9, 2007 - 12:45 pm

oh, OK, it was buffer_new(), so zeroes are the right thing for a reader to
see.

But if it wasn't buffer_new() then the appropriate thing for the reader to
see is what's on the disk.  But __block_prepare_write() won't read a buffer
which is fully-inside the write area from disk.

And that's seemingly OK, because if a reader gets in there after the short
copy, that reader will see the non-uptodate buffer and will populate it
from disk.

But doing that will overwrite the data which the write() caller managed to
copy into the page before it took a fault.  And that's not OK because
block_perform_write() does iovec_iterator_advance(i, copied) in this case
and hence will not rerun the copy after acquiring the page lock?
-

From: Nick Piggin
Date: Friday, February 9, 2007 - 6:34 pm

Hmm, yeah. This can be handled by not advancing partially into a !uptodate
buffer.
-

From: Andrew Morton
Date: Friday, February 9, 2007 - 6:50 pm

On Sat, 10 Feb 2007 02:34:07 +0100

Think so, yeah.

Overall, the implementation you have there seems reasonable to me. 
Basically it's passing the responsibility for preventing the deadlock and
the exposure-of-zeroes problem down into the filesystem itself, where we
have visibility of the state of the various subsections of the page and can
take appropriate actions in response to that.

It's got conceptually harder to follow as a result, which is a shame.  But
still no magic bullet is on offer.

I pity the poor schmuck who has to write ext3_journalled_perform_write(),
ext3_ordered_perform_write(), ext3_writeback_perform_write(),
ext3_writeback_nobh_perform_write() and all that other stuff.  But I think
we need to do that pretty soon to validate the whole approach.  Also xfs
and reiser3.

NTFS will be interesting from the can-this-be-made-to-work POV.

Is NFS vulnerable to the deadlock?  It looks to be.  Shudder.

We'd need to find a way of communicating all this to the poor old
fs maintainers.

-

From: Mark Fasheh
Date: Thursday, February 8, 2007 - 5:38 pm

Agreed. There's lots of problems with the interface (imho), but my biggest
two issues are the page lock being held on ->prepare_write() /
->commit_write() and the fact that the file system only sees the write one

I like ->perform_write(). It allows the file system to make re-use of the
checks in the generic write path, but gives a maximum amount of information
about the overall operation to be done. There's an added advantage in that
some file systems (ocfs2 is one of these) want to be more careful about
ordering page locks, which should be much easier with it.

If this goes in, it could probably be helpful to me in some of the code I'm
currently writing which needs to be better about page lock / cluster lock
ordering and wants to see as much of the (allocating) writes as possible.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
-

From: Nick Piggin
Date: Thursday, February 8, 2007 - 7:04 pm

Yeah, if possible I like a range based interface rather than page

I think it would be important to have a non trivial user of this new API
before it goes into mainline. It would be great if you could look at
using it, after it passes some review.

Thanks,
Nick
-

From: Andrew Morton
Date: Friday, February 9, 2007 - 1:41 am

What happened with Linus's proposal to instantiate the page as pinned,
non-uptodate, unlocked and in pagecache while we poke the user address?
-

From: Nick Piggin
Date: Friday, February 9, 2007 - 2:54 am

That's still got a deadlock, and also it doesn't work if we want to lock
the page when performing a minor fault (which I want to fix fault vs
invalidate), and also assumes nobody's ->nopage locks the page or
requires any resources that are held by prepare_write (something not
immediately clear to me with the cluster filesystems, at least).

But that all becomes legacy path, so do we really care? Supposing fs
maintainers like perform_write, then after the main ones have implementations
we could switch over to the slow-but-correct prepare_write legacy path.
Or we could leave it, or we could use Linus's slightly-less-buggy scheme...
by that point I expect I'd be sick of arguing about it ;)

-

From: Andrew Morton
Date: Friday, February 9, 2007 - 3:09 am

It's hard to discuss this without a description of what you want to fix


It's worth "arguing" about.  This is write().  What matters more??
-

From: Nick Piggin
Date: Friday, February 9, 2007 - 3:32 am

That's the legacy path that uses prepare/commit (ie. supposing that all
major filesystems did get converted to perform_write).

Of course I would still want my correct-but-slow version in that case,
but I just wouldn't care to argue if you still wanted to keep it fast.

-

From: Andrew Morton
Date: Friday, February 9, 2007 - 3:52 am

Where?  It requires that someone hold mmap_sem for writing as well as a
page lock (in an order which would require some thought).  Do we ever do

mutter.

Could perhaps fix that by running ClearPageUptodate in invalidate, thus
forcing the pagefault code to take the page lock (which we already hold).

That does mean that we'll fleetingly have a non-uptodate page in pagetables
which is a bit nasty.

Or, probably better, we could add a new page flag (heh) telling nopage that

So the rule is that ->nopage handlers against pagecache mustn't lock the
page if it's already uptodate.  That's OK.  But it might conflict with the
above invalidate proposal.


We'll never, ever, ever update and test all filesytems.  What you're
calling "legacy" code will be there for all time.


This is write().  We just cannot go and double-copy all the memory or take
mmap_sem and do a full pagetable walk in there.  It just means that we
haven't found a suitable solution yet.

-

From: Nick Piggin
Date: Friday, February 9, 2007 - 4:31 am

task1			task2			task3
lock_page(page)		down_read(mmap_sem)
						down_write(mmap_sem)
down_read(mmap_sem)
			lock_page(page)

t1 is blocked by t3
t2 is blocked by t1
t3 is blocked by t2

OK, it isn't quite ABBA, but that's shorthand if we remember that


Well that _will_ be the rule if you want to do Linus's half-fix. It will

I didn't say all; I still prefer correct than fast; you are still free

You prefer speed over correctness even for little used filessytems, which
is fine because I'm sick of arguing about it. The main thing for me is that
important filesystems can be correct and fast.

-

From: Andrew Morton
Date: Friday, February 9, 2007 - 4:46 am

For gawd's sake.  You can make the kernel less buggy by removing SMP
support.


You make it sound like this is a choice.  It isn't.  Nobody is going to go

I wouldn't characterise it as "arguing".  It's development.  Going and
sticking enormous slowdowns into write() to fix some bug which nobody is
hitting is insane.

We need to find a better fix, that's all.
-

From: Nick Piggin
Date: Friday, February 9, 2007 - 5:11 am

I agree that 60% is much too big of a hit for all filesystems. Which is

IMO, once all the maintained filesystems are converted then it would be

Actually I'm doing this because I try to fix real data corruption problems
which people are hitting. You told me I can't get those fixes in until I

I actually found perform_write to be a speedup. And if perform_write is
merged then I would be happy to not fix the prepare_write path, or wait for
someone to come up with a better fix.
-

Previous thread: [PATCH] block: blk_max_pfn is somtimes wrong by Vasily Tarasov on Thursday, February 8, 2007 - 5:39 am. (2 messages)

Next thread: OT: I don't recive mail from linux-kernel by Levitsky Maxim on Thursday, February 8, 2007 - 6:09 am. (3 messages)