Re: [RFC] fuse writable mmap design

To: <linux-kernel@...>
Cc: <a.p.zijlstra@...>, <akpm@...>, <linux-fsdevel@...>, <linux-mm@...>
Date: Thursday, November 15, 2007 - 12:10 pm

Writable shared memory mappings for fuse are something I've been
trying to implement forever.

Now hopefully I've got it all worked out: it survives indefinitely
with bash-shared-mapping and fsx-linux, and I'd like to solicit
comments about the approach.

I'm not asking for comments on the patch itself. It needs to be
cleaned and split up. It's only included for reference.

Thanks,
Miklos

Fuse page writeback design
--------------------------

fuse_writepage() allocates a new temporary page with
GFP_NOFS|__GFP_HIGHMEM. It copies the contents of the original page,
and queues a WRITE request to the userspace filesystem using this temp
page.

From the VM's point of view, the writeback is finished instantly: the
page is removed from the radix trees, and the PageDirty and
PageWriteback flags are cleared.

The per-bdi writeback count is not decremented until the writeback
truly completes. There's also a new 'nr_writeback_temp' counter that
is used to track the global count of these writebacks instead of the
per-zone NR_WRITEBACK (it could be a new per-zone counter in vm_stat,
but for simplicity, current code just uses a single atomic counter).

If the writeout was due to memory pressure, in effect this migrates
data from a full zone to a less full zone.

On dirtying the page, fuse waits for a previous write to finish before
proceeding. This ensures that only one temporary page can be in use
at a time for any one cached page.
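
The scheme above can be illustrated with a minimal user-space model. This
is only a sketch, not the patch: struct model_page, model_writepage(),
model_write_done(), model_try_dirty() and model_nr_writeback_temp are
hypothetical names, malloc() stands in for the temporary page allocation,
and refusing to dirty stands in for sleeping until the write finishes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical model of one cached page; not the kernel's struct page. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool dirty;
	unsigned char *temp;	/* copy owned by the in-flight WRITE, or NULL */
};

/* Stands in for the global 'nr_writeback_temp' counter. */
long model_nr_writeback_temp;

/*
 * Model of the writepage step: copy the page into a temporary buffer and
 * treat the cached page as clean immediately.  Returns 0 on success, or -1
 * if the allocation fails, in which case the page stays dirty.
 */
int model_writepage(struct model_page *pg)
{
	assert(pg->temp == NULL);	/* at most one temp copy per cached page */

	pg->temp = malloc(MODEL_PAGE_SIZE);
	if (!pg->temp)
		return -1;

	memcpy(pg->temp, pg->data, MODEL_PAGE_SIZE);
	pg->dirty = false;
	model_nr_writeback_temp++;
	return 0;
}

/* Model of the userspace filesystem finally answering the WRITE request. */
void model_write_done(struct model_page *pg)
{
	assert(pg->temp != NULL);

	free(pg->temp);
	pg->temp = NULL;
	model_nr_writeback_temp--;
}

/*
 * Model of dirtying the page.  Real code would wait for the previous write
 * to finish; this single-threaded model just refuses while one is in flight.
 */
bool model_try_dirty(struct model_page *pg, unsigned char byte)
{
	if (pg->temp)
		return false;

	memset(pg->data, byte, MODEL_PAGE_SIZE);
	pg->dirty = true;
	return true;
}
```

In this model the cached page is clean again as soon as model_writepage()
returns, while the counter stays raised until model_write_done() runs,
which mirrors the split between "finished for the VM" and "truly
completed" described above.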

This approach is wasteful in both memory and CPU bandwidth, so why is
this complication needed?

The basic problem is that there can be no guarantee about the time in
which the userspace filesystem will complete a write. It may be buggy
or even malicious, and fail to complete WRITE requests. We don't want
unrelated parts of the system to grind to a halt in such cases.

Also a filesystem may need additional resources (particularly memory)
to complete a WRITE request. There's a great danger of a deadlock if
that allocation may wait for the ...

To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <akpm@...>, <linux-fsdevel@...>, <linux-mm@...>
Date: Thursday, November 15, 2007 - 3:22 pm

I'm somewhat confused by the complexity. Currently we can already have a
lot of dirty pages from FUSE (up to the per BDI dirty limit - so
basically up to the total dirty limit).

How is having them dirty from mmap'ed writes different?

-

To: <a.p.zijlstra@...>
Cc: <linux-kernel@...>, <akpm@...>, <linux-fsdevel@...>, <linux-mm@...>
Date: Thursday, November 15, 2007 - 3:37 pm

Nope, fuse never had dirty pages. It does normal writes
synchronously, just updating the cache.

The dirty accounting and then the per-bdi throttling basically made it
possible _at_all_ to have a chance at a writepage implementation which
is not deadlocky (so thanks for those ;).

But there's still the throttle_vm_writeout() thing, and the other
places where the kernel is waiting for a write to complete, which just
cannot be done within a constrained time if an unprivileged userspace
process is involved.

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <akpm@...>, <linux-fsdevel@...>, <linux-mm@...>
Date: Thursday, November 15, 2007 - 3:42 pm

Ah, ok, your initial story missed this part (not being intimately
familiar with FUSE made all that somewhat obscure).

The next point then: I'd expect your fuse_page_mkwrite() to push
writeout of your 32-odd mmap pages instead of polling.

-

To: <a.p.zijlstra@...>
Cc: <miklos@...>, <linux-kernel@...>, <akpm@...>, <linux-fsdevel@...>, <linux-mm@...>
Date: Thursday, November 15, 2007 - 3:57 pm

You're talking about this:

+	wait_event(fc->writeback_waitq,
+		   fc->numwrite < FUSE_WRITEBACK_THRESHOLD);

right? It's one of the things I need to clean out; there's no point
in fc->numwrite, which is essentially the same as the BDI_WRITEBACK
counter.

OTOH, I'm thinking about adding a per-fs limit (adjustable for
privileged mounts) of dirty+writeback.

I'm not sure how hard it would be to add support for this into
balance_dirty_pages(). So I'm thinking of a parameter in struct
backing_dev_info that is used to clip the calculated per-bdi threshold
below this maximum.
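
As a rough illustration of that clipping idea (a sketch only: struct
model_bdi, its max_dirty_pages field and model_clip_bdi_thresh() are
hypothetical names, not existing kernel interfaces, and the threshold is
assumed to have been computed already by the proportions code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct backing_dev_info. */
struct model_bdi {
	unsigned long max_dirty_pages;	/* hypothetical cap; 0 means no cap */
};

/*
 * Clip an already calculated per-bdi threshold (in pages) so that it never
 * exceeds the optional per-bdi maximum.
 */
unsigned long model_clip_bdi_thresh(const struct model_bdi *bdi,
				    unsigned long calculated_thresh)
{
	assert(bdi != NULL);

	if (bdi->max_dirty_pages && calculated_thresh > bdi->max_dirty_pages)
		return bdi->max_dirty_pages;
	return calculated_thresh;
}
```

With a cap of 100 pages and a calculated threshold of 500, the function
returns 100; the remaining 400 pages are the unused portion.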

How would that affect the proportions algorithm? What would happen to
the unused portion? Would it adapt to the slowed writeback and
allocate it to some other writer?

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <linux-kernel@...>, <akpm@...>, <linux-fsdevel@...>, <linux-mm@...>
Date: Thursday, November 15, 2007 - 4:01 pm

The unused part is gone; I've not yet found a way to re-distribute this
fairly.

[ It's one of my open problems: I can do a min_ratio per bdi, but not
yet a max_ratio ]

-

To: <a.p.zijlstra@...>
Cc: <miklos@...>, <linux-kernel@...>, <akpm@...>, <linux-fsdevel@...>, <linux-mm@...>
Date: Thursday, November 15, 2007 - 4:11 pm

OK, I'll bear this in mind.

Limiting the number of dirty+writeback pages to << dirty_thresh could still
make sense, since it could prevent a nasty filesystem from pinning
lots of kernel memory (which it can do without fuse in other ways, so
this is not very important IMO).

Miklos
-
