Re: [rfc][patch 3/5] afs: new aops

Previous thread: [PATCH -v3] SELinux: Add get, set, and cloning of superblock security information by Eric Paris on Friday, November 9, 2007 - 6:22 pm. (5 messages)

Next thread: [PATCH 2/2] FIEMAP ioctl for ext4 by Kalpak Shah on Monday, November 12, 2007 - 5:00 pm. (2 messages)
To: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>, <dhowells@...>
Date: Monday, November 12, 2007 - 3:12 am

Hi,

These are a set of patches to convert the last few filesystems to use
the new deadlock-free write aops, and remove the core code to handle the
legacy write path.

I don't really have setups to sufficiently test these filesystems. So I
would really appreciate if filesystem maintainers can pick these patches
up, bear with my bugs, and send them upstream when they're ready.

The benefit to you is that you get to use the fast and well tested code
paths! Actually, it is interesting: several of the conversions I've done
(including these) take a relatively naive aproach of simply prefilling
the whole page if it isn't uptodate. It might be the case that some
filesystems actually prefer to do something similar to the legacy
double-copy path which they're being converted away from! (then again,
it would be probably even more ideal to have simple sub-page state
tracking structures).

There is still quite a lot of work left to be done. Several filesystems
still use prepare_write() helpers, and when they're fixed up, all the
old helpers themselves have to be removed. But this step is probably
most important to getting rid of complex code.

-

To: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>, <dhowells@...>
Date: Monday, November 12, 2007 - 3:23 am

Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c
+++ linux-2.6/drivers/block/loop.c
@@ -40,8 +40,7 @@
* Heinz Mauelshagen <mge@sistina.com>, Feb 2002
*
* Support for falling back on the write file operation when the address space
- * operations prepare_write and/or commit_write are not available on the
- * backing filesystem.
+ * operations write_begin is not available on the backing filesystem.
* Anton Altaparmakov, 16 Feb 2005
*
* Still To Fix:
@@ -761,7 +760,7 @@ static int loop_set_fd(struct loop_devic
*/
if (!file->f_op->splice_read)
goto out_putf;
- if (aops->prepare_write || aops->write_begin)
+ if (aops->write_begin)
lo_flags |= LO_FLAGS_USE_AOPS;
if (!(lo_flags & LO_FLAGS_USE_AOPS) && !file->f_op->write)
lo_flags |= LO_FLAGS_READ_ONLY;
Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c
+++ linux-2.6/fs/fat/inode.c
@@ -175,7 +175,7 @@ static ssize_t fat_direct_IO(int rw, str

if (rw == WRITE) {
/*
- * FIXME: blockdev_direct_IO() doesn't use ->prepare_write(),
+ * FIXME: blockdev_direct_IO() doesn't use ->write_begin(),
* so we need to update the ->mmu_private to block boundary.
*
* But we must fill the remaining area or hole by nul for
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c
+++ linux-2.6/fs/ocfs2/file.c
@@ -790,8 +790,7 @@ leave:

/* Some parts of this taken from generic_cont_expand, which turned out
* to be too fragile to do exactly what we need without us having to
- * worry about recursive locking in ->prepare_write() and
- * ->commit_write(). */
+ * worry about recursive locking in ->write_begin() and ->write_end(). */
static int ocfs2_write_zero_page(struct in...

To: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>, <dhowells@...>
Date: Monday, November 12, 2007 - 3:20 am

This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block
device which serves data out of its own buffer cache. It relies on the dirty
bit being set, to pin its backing store in cache, however there are non
trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
which had recently lead to data corruption. And in general it is completely
wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea
about vm/vfs stuff. It's backing store is similar to the buffer cache
(a simple radix-tree of pages), but it doesn't know anything about page
cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem
metadata access goes through an extra copy and gets stored in RAM twice.
However, this downside is only slight, because the real buffercache of the
device is now reclaimable (because we're not playing crazy games with it),
so under memory intensive situations, footprint should effectively be the
same -- maybe even a slight advantage to the new driver because it can also
reclaim buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it
much more useful for testing, too.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -311,7 +311,7 @@ config BLK_DEV_UB
If unsure, say N.

config BLK_DEV_RAM
- tristate "RAM disk support"
+ tristate "RAM block device support"
---help---
Saying Y here will allow you to use a portion of your RAM memory as
a block device, so that you can make file systems on it, read and
@@ -346,16 +346,6 @@ config BLK_DEV_RAM_SIZE
The default value is 4096 kilobytes. Only change this if you know
what you are doin...

To: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>, <dhowells@...>
Date: Monday, November 12, 2007 - 3:14 am

Convert cifs to new aops.

This is another relatively naive conversion. Always do the read upfront when
the page is not uptodate (unless we're in the writethrough path).

Fix an uninitialized data exposure where SetPageUptodate was called before
the page was uptodate.

SetPageUptodate and switch to writeback mode in the case that the full page
was dirtied.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c
+++ linux-2.6/fs/cifs/file.c
@@ -103,7 +103,7 @@ static inline int cifs_open_inode_helper

/* want handles we can use to read with first
in the list so we do not have to walk the
- list to search for one in prepare_write */
+ list to search for one in write_begin */
if ((file->f_flags & O_ACCMODE) == O_WRONLY) {
list_add_tail(&pCifsFile->flist,
&pCifsInode->openFileList);
@@ -1411,49 +1411,53 @@ static int cifs_writepage(struct page *p
return rc;
}

-static int cifs_commit_write(struct file *file, struct page *page,
- unsigned offset, unsigned to)
+static int cifs_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
{
- int xid;
- int rc = 0;
- struct inode *inode = page->mapping->host;
- loff_t position = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
- char *page_data;
+ int rc;
+ struct inode *inode = mapping->host;
+ loff_t position;
+
+ cFYI(1, ("commit write for page %p from position %lld with %d bytes",
+ page, pos, copied));
+
+ if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+ SetPageUptodate(page);

- xid = GetXid();
- cFYI(1, ("commit write for page %p up to position %lld for %d",
- page, position, to));
- spin_lock(&inode->i_lock);
- if (position > inode->i_size) {
- i_size_write(inode, position);
- }
- spin_...

To: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>, <dhowells@...>
Date: Monday, November 12, 2007 - 3:14 am

Convert afs to new aops.

Cannot assume writes will fully complete, so this conversion goes the easy
way and always brings the page uptodate before the write.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/afs/file.c
===================================================================
--- linux-2.6.orig/fs/afs/file.c
+++ linux-2.6/fs/afs/file.c
@@ -50,8 +50,8 @@ const struct address_space_operations af
.launder_page = afs_launder_page,
.releasepage = afs_releasepage,
.invalidatepage = afs_invalidatepage,
- .prepare_write = afs_prepare_write,
- .commit_write = afs_commit_write,
+ .write_begin = afs_write_begin,
+ .write_end = afs_write_end,
.writepage = afs_writepage,
.writepages = afs_writepages,
};
Index: linux-2.6/fs/afs/internal.h
===================================================================
--- linux-2.6.orig/fs/afs/internal.h
+++ linux-2.6/fs/afs/internal.h
@@ -731,8 +731,12 @@ extern int afs_volume_release_fileserver
*/
extern int afs_set_page_dirty(struct page *);
extern void afs_put_writeback(struct afs_writeback *);
-extern int afs_prepare_write(struct file *, struct page *, unsigned, unsigned);
-extern int afs_commit_write(struct file *, struct page *, unsigned, unsigned);
+extern int afs_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata);
+extern int afs_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata);
extern int afs_writepage(struct page *, struct writeback_control *);
extern int afs_writepages(struct address_space *, struct writeback_control *);
extern int afs_write_inode(struct inode *, int);
Index: linux-2.6/fs/afs/write.c
===================================================================
--- linux-2.6.orig/fs/afs/write.c
+++ linux-2.6/fs/afs/write.c
@@ -84,15 +84,23 @@ void afs_put_writeback(struct afs_writeb
...

To: Nick Piggin <npiggin@...>
Cc: <dhowells@...>, Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Monday, November 12, 2007 - 11:29 am

Do you guarantee this will work if PAGE_CACHE_SIZE != PAGE_SIZE? If not, you
can't make this particular change.

Do we ever intend to have PAGE_CACHE_SIZE != PAGE_SIZE? If not, then surely

That can't be right, surely. Either 'eof' is the size of the file or it's the
length of the data to be read. It can't be both. The first case needs eof
masking off. Also, 'eof' isn't a good choice of name. 'len' would be better
were it not already taken:-/

I notice you removed the stuff that clears holes in the page to be written. Is
this is now done by the caller?

I notice also that you use afs_fill_page() in place of afs_prepare_page() to
prepare a page. You can't do this if the region to be filled currently lies
beyond the server's idea of EOF.

I'll try and get a look at fixing this patch tomorrow.

David
-

To: David Howells <dhowells@...>
Cc: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Monday, November 12, 2007 - 8:15 pm

It is supposed to bring the page uptodate first. So, no need to clear

No rush, it won't get into 2.6.24 obviously. But that would be nice, thanks.
-

To: Nick Piggin <npiggin@...>
Cc: <dhowells@...>, Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Monday, November 12, 2007 - 8:30 pm

Perhaps, but the function being called from there takes pages not page cache
slots. If I have to allow for PAGE_CACHE_SIZE > PAGE_SIZE then I need to

Hmmm... I suppose. However, it is wasteful in the common case as it is then
bringing the page up to date by filling/clearing the whole of it and not just
the bits that are not going to be written.

David
-

To: David Howells <dhowells@...>
Cc: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Monday, November 12, 2007 - 8:44 pm

It takes a pagecache page, yes. If you follow convention, you use
PAGE_CACHE_SIZE for that guy. You don't have to allow PAGE_CACHE_SIZE !=
PAGE_SIZE, and if all the rest of your code is in units of PAGE_SIZE, then
obviously my changing of just the one unit is even more confusing than

Yes, that's where you come in. You are free (and encouraged) to optimise
this.

Let's see, for a network filesystem this is what you could do:
- if the page is not uptodate at write_begin time, do not bring it uptodate
(at least, not the region that is going to be written to)
- if the page is not uptodate at write_end time, but the copy was fully
completed, just mark it uptodate (provided you brought the regions outside
the copy uptodate).
- if the page is not uptodate and you have a short copy, simply do not
mark the page uptodate or dirty, and return 0 from write_end, indicating
that you have committed 0 bytes. The generic code should DTRT.

Or you could:
Pass back a temporary (not pagecache) page in *pagep, and copy that yourself
into the _real_ pagecache page at write_end time, so you know exactly how
big the copy will be (this is basically what the 2copy method does now...
it is probably not as good as the first method I described, but for a
high latency filesystem it may be preferable to always bringing the page
uptodate).

Or: keep track of sub-page dirty / uptodate status eg. with a light weight
buffer_head like structure, so you can actually have partially dirty pages
that are not completely uptodate.

Or... if you can think of another way...

-

To: Nick Piggin <npiggin@...>
Cc: <dhowells@...>, Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Tuesday, November 13, 2007 - 6:56 am

The problem is that the code called assumes that the struct page * argument
points to a single page, not an array of pages as would presumably be the case
if PAGE_CACHE_SIZE > PAGE_SIZE. If I should allow for an array of pages then
the lower functions (specifically afs_deliver_fs_fetch_data()) need to change,
and until that time occurs, the assertion *must* remain as it is now. It
defends the lower functions against being asked to do something they weren't
designed to do.

So: you may not change the assertion unless you also fix the lower functions.

David
-

To: David Howells <dhowells@...>
Cc: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Wednesday, November 14, 2007 - 12:24 am

Incorrect. Christoph's patch for example does this by using compound pages.
Now I personally don't like the patch or see the point in PAGE_CACHE_SIZE /
PAGE_SIZE distinction, but I'm just telling you what the convention is. There

I won't change the assertion, because you haven't been following said
convention, so just changing it in one place is stupider than not changing
it at all, but not for the reason cited.
-

To: Nick Piggin <npiggin@...>
Cc: <dhowells@...>, Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Wednesday, November 14, 2007 - 8:18 am

No! You are wrong. I wrote the AFS code. I *know* it can only deal with
single pages. It has no knowledge of compound pages and does not handle page
arrays. This may be a flaw in my code, but it's there nonetheless. The
assertion is a guard against that. *That* was the point of my statement.
Perhaps I should've said 'my code' rather than 'the code'.

If Christoph has a patch to deal with that, it's either not upstream yet or it

The convention is not precisely clear. Just grep for PAGE_CACHE_SIZE in
Documentation/. It's only mentioned twice, and in neither case does it give
any information about what PAGE_CACHE_SIZE is, what it represents, or where it
applies. Therefore it's an ill-defined concept.

If you look in Documentation/filesystems/vfs.txt, you'll see that it almost
always talks about 'pages'. It only mentions 'pagecache pages' once - in the
description of write_begin(), but it's not clear whether that means anything.

However, I've now noted that I need to fix my code, so just keep the assertion
for now and I'll fix my code to handle multipage blocks.

David
-

To: David Howells <dhowells@...>
Cc: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Wednesday, November 14, 2007 - 11:18 am

No I'm talking about core code. In core code, the PAGE_CACHE_SIZE is
for page cache struct pages. Single struct pages (not page arrays).
Take a look at generic mapping read or something.

There is nothing to deal with page arrays there either, but that's simply

I'm not saying you need to do that. Leave it at PAGE_SIZE, really it
doesn't matter that much at present. This has just blown out of proportion.

-

To: Nick Piggin <npiggin@...>
Cc: <dhowells@...>, Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Wednesday, November 14, 2007 - 11:57 am

So you're saying a struct page controls an area of PAGE_CACHE_SIZE, not an

If it's not Documented, then it's irrelevant.

David
-

To: David Howells <dhowells@...>
Cc: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Wednesday, November 14, 2007 - 5:32 pm

No, a pagecache page is PAGE_CACHE_SIZE. And not all struct pages control

But you can't just decide yourself that it is irrelevant because you don't
grep hard enough ;)

include/linux/mm.h:
* A page may belong to an inode's memory mapping. In this case, page->mapping
* is the pointer to the inode, and page->index is the file offset of the page,
* in units of PAGE_CACHE_SIZE.

include/linux/mm_types.h
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */

Looks like documentation to me.

-

To: Nick Piggin <npiggin@...>
Cc: <dhowells@...>, Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Thursday, November 15, 2007 - 8:15 am

That doesn't answer my question. I didn't ask about 'pagecache pages' per se.

Are you saying then that a page struct always represents an area of PAGE_SIZE
to, say, the page allocator and PAGE_CACHE_SIZE to a filesystem's address
operations?

How about I state it this way: Please define what the coverage of a
(non-compound) struct page is, and how this relates to PAGE_SIZE and

Compound pages are irrelevant to my question. A compound page is actually a
regulated by a series of page structs, each of which represents a 'page' of
real memory.

Do you say, then, that all, say, readpage() and readpages() methods must
handle a compound page if that is given to them?

David
-

To: David Howells <dhowells@...>
Cc: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>
Date: Thursday, November 15, 2007 - 5:37 pm

No it's easy. It's PAGE_SIZE (which also happens to be PAGE_CACHE_SIZE).
An implementation that would (not trivially, but without changing the
basic concepts) allow a larger pagecache size is with compound pages. And

I'm not talking about a specific implementation that allows
PAGE_CACHE_SIZE > PAGE_SIZE. So no, I don't say anything about that. I
say that pagecache pages are PAGE_CACHE_SIZE! Yes it is easy now because
that is the same as PAGE_SIZE. Yes it will be harder if you wanted to
change that. What you would not have to change is the assumption that
pagecache pages are in PAGE_SIZE units.

-

To: Andrew Morton <akpm@...>, <linux-fsdevel@...>, <mhalcrow@...>, <phillip@...>, <sfrench@...>, <dhowells@...>
Date: Monday, November 12, 2007 - 3:13 am

Convert ecryptfs to new aops.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/ecryptfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -263,31 +263,38 @@ out:
return 0;
}

-static int ecryptfs_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ecryptfs_write_begin(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
{
+ pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+ struct page *page;
int rc = 0;

- if (from == 0 && to == PAGE_CACHE_SIZE)
+ page = __grab_cache_page(mapping, index);
+ if (!page)
+ return -ENOMEM;
+
+ if (flags & AOP_FLAG_UNINTERRUPTIBLE && len == PAGE_CACHE_SIZE)
goto out; /* If we are writing a full page, it will be
up to date. */
if (!PageUptodate(page)) {
- rc = ecryptfs_read_lower_page_segment(page, page->index, 0,
+ rc = ecryptfs_read_lower_page_segment(page, index, 0,
PAGE_CACHE_SIZE,
- page->mapping->host);
+ mapping->host);
if (rc) {
printk(KERN_ERR "%s: Error attemping to read lower "
"page segment; rc = [%d]\n", __FUNCTION__, rc);
- ClearPageUptodate(page);
goto out;
} else
SetPageUptodate(page);
}
- if (page->index != 0) {
+ if (index != 0) {
loff_t end_of_prev_pg_pos =
- (((loff_t)page->index << PAGE_CACHE_SHIFT) - 1);
+ (((loff_t)index << PAGE_CACHE_SHIFT) - 1);

- if (end_of_prev_pg_pos > i_size_read(page->mapping->host)) {
+ if (end_of_prev_pg_pos > i_size_read(mapping->host)) {
rc = ecryptfs_truncate(file->f_path.dentry,
end_of_prev_pg_pos);
if (rc) {
@@ -297,7 +304,7 @@ static int ecryptfs_prepare_write(struct
goto out;
}
}
- if (end_of_prev_pg_pos + 1 > i...

Previous thread: [PATCH -v3] SELinux: Add get, set, and cloning of superblock security information by Eric Paris on Friday, November 9, 2007 - 6:22 pm. (5 messages)

Next thread: [PATCH 2/2] FIEMAP ioctl for ext4 by Kalpak Shah on Monday, November 12, 2007 - 5:00 pm. (2 messages)