Writable shared memory mappings for fuse are something I've been trying to implement forever. Now I've hopefully got it all worked out: it survives indefinitely with bash-shared-mapping and fsx-linux. I'd like to solicit comments about the approach.

I'm not asking for comments on the patch itself. It needs to be cleaned up and split up. It's only included for reference.

Thanks,
Miklos

Fuse page writeback design
--------------------------

fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM. It copies the contents of the original page, and queues a WRITE request to the userspace filesystem using this temp page.

From the VM's point of view, the writeback is finished instantly: the page is removed from the radix trees, and the PageDirty and PageWriteback flags are cleared.

The per-bdi writeback count is not decremented until the writeback truly completes. There's also a new 'nr_writeback_temp' counter, which is used to track the global count of these writebacks instead of the per-zone NR_WRITEBACK (it could be a new per-zone counter in vm_stat, but for simplicity, the current code just uses a single atomic counter).

If the writeout was due to memory pressure, in effect this migrates data from a full zone to a less full zone.

On dirtying the page, fuse waits for a previous write to finish before proceeding. This ensures that only one temporary page can be in use at a time for any one cached page.

This approach is wasteful in both memory and CPU bandwidth, so why is this complication needed?

The basic problem is that there can be no guarantee about the time in which the userspace filesystem will complete a write. It may be buggy or even malicious, and fail to complete WRITE requests. We don't want unrelated parts of the system to grind to a halt in such cases.

Also, a filesystem may need additional resources (particularly memory) to complete a WRITE request. There's a great danger of a deadlock if that allocation may wait for ...
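As a rough illustration of the design described above, here is a minimal userspace model in C. It is not taken from the patch: struct cached_page, model_writepage(), model_write_done() and model_try_dirty() are hypothetical names, malloc() stands in for the GFP_NOFS|__GFP_HIGHMEM allocation, and returning false stands in for the sleep the real code would do when a page is dirtied while a previous write is still in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096

/* Hypothetical userspace model of one cached page; not kernel code. */
struct cached_page {
	unsigned char data[PAGE_SZ];
	bool dirty;
	unsigned char *temp;	/* in-flight temporary copy, or NULL */
};

/* Models the single global 'nr_writeback_temp' counter. */
int nr_writeback_temp;

/*
 * Model of writepage: copy the page into a temporary page and mark the
 * original clean right away, so the "VM" sees the writeback as finished.
 */
int model_writepage(struct cached_page *pg)
{
	if (pg->temp)
		return -1;	/* at most one temp page per cached page */

	pg->temp = malloc(PAGE_SZ);
	if (!pg->temp)
		return -1;

	memcpy(pg->temp, pg->data, PAGE_SZ);
	pg->dirty = false;
	nr_writeback_temp++;
	return 0;
}

/* Model of the userspace filesystem completing the WRITE request. */
void model_write_done(struct cached_page *pg)
{
	free(pg->temp);
	pg->temp = NULL;
	nr_writeback_temp--;
}

/*
 * Model of dirtying a page. The real code would wait for the previous
 * write to finish; this sketch just reports that the caller must wait.
 */
bool model_try_dirty(struct cached_page *pg, unsigned char byte)
{
	if (pg->temp)
		return false;

	memset(pg->data, byte, PAGE_SZ);
	pg->dirty = true;
	return true;
}
```

In this model the cached page is clean as soon as model_writepage() returns, while nr_writeback_temp stays elevated until model_write_done() runs, which is the split between the VM's view and the true completion that the design relies on.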
I'm somewhat confused by the complexity. Currently we can already have a lot of dirty pages from FUSE (up to the per-BDI dirty limit, so basically up to the total dirty limit). How is having them dirty from mmap'ed writes different?
Nope, fuse never had dirty pages. It does normal writes synchronously, just updating the cache.

The dirty accounting and then the per-bdi throttling basically made it possible _at_all_ to have a chance at a writepage implementation which is not deadlocky (so thanks for those ;).

But there's still the throttle_vm_writeout() thing, and the other places where the kernel is waiting for a write to complete, which just cannot be done within a constrained time if an unprivileged userspace process is involved.

Miklos
Ah, ok, your initial story missed this part (not being intimately familiar with FUSE made all that somewhat obscure).

The next point then: I'd expect your fuse_page_mkwrite() to push writeout of your 32-odd mmap pages instead of poll.
You're talking about this:

+	wait_event(fc->writeback_waitq,
+		   fc->numwrite < FUSE_WRITEBACK_THRESHOLD);

right? It's one of the things I need to clean out: there's no point in fc->numwrite, which is essentially the same as the BDI_WRITEBACK counter.

OTOH, I'm thinking about adding a per-fs limit (adjustable for privileged mounts) of dirty+writeback.

I'm not sure how hard it would be to add support for this into balance_dirty_pages(). So I'm thinking of a parameter in struct backing_dev_info that is used to clip the calculated per-bdi threshold below this maximum.

How would that affect the proportions algorithm? What would happen to the unused portion? Would it adapt to the slowed writeback and allocate it to some other writer?

Miklos
The unused part is gone; I've not yet found a way to re-distribute this fairly.

[ It's one of my open problems: I can do a min_ratio per bdi, but not yet a max_ratio ]
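To make the clipping question concrete, here is a small, purely illustrative C sketch. It assumes a simplified model in which each bdi gets a fixed fraction of the global dirty threshold and an optional cap; struct model_bdi, its fields and model_bdi_thresh() are hypothetical and do not correspond to the real struct backing_dev_info or to the actual proportions code.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for a per-bdi dirty share. */
struct model_bdi {
	unsigned long share_num;	/* this bdi's proportion: num/den */
	unsigned long share_den;
	unsigned long max_pages;	/* assumed cap; 0 means "no cap" */
};

/*
 * Compute the bdi's slice of the global dirty threshold, then clip it
 * to the assumed per-bdi maximum. Nothing is handed to other bdis.
 */
unsigned long model_bdi_thresh(const struct model_bdi *bdi,
			       unsigned long dirty_thresh)
{
	unsigned long long t;

	t = (unsigned long long)dirty_thresh * bdi->share_num / bdi->share_den;
	if (bdi->max_pages && t > bdi->max_pages)
		t = bdi->max_pages;

	return (unsigned long)t;
}
```

With a global threshold of 1000 pages, a bdi holding a 3/4 share but capped at 100 pages ends up with 100, and a second, uncapped bdi with a 1/4 share still gets only 250. The remaining 650 pages are simply not used in this model, which is the "unused part is gone" effect.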
OK, I'll bear this in mind.

Limiting the number of dirty+writeback pages to << dirty_thresh could still make sense, since it could prevent a nasty filesystem from pinning lots of kernel memory (which it can do without fuse in other ways, so this is not very important IMO).

Miklos
