On Sun, 28 Oct 2007, Erez Zadok wrote:That's very encouraging... ... but still a few problems, I'm afraid. The greatest problem is a tmpfs one, that would be for me to solve. But first... On reflection, I think I went too far in asking you to mask off __GFP_IO. Loop has to do so because it's a block device, down towards the IO layer; but unionfs is a filesystem, so masking off __GFP_FS is enough to prevent recursion into the FS layer with danger of deadlock, and leaving __GFP_IO on gives a better chance of success. Hmm, several points on that. When suggesting something like this, I did remark "what locking needed?". You've got none: which is problematic if two stacked mounts are playing with the same underlying file concurrently (yes, userspace would have a data coherency problem in such a case, but the kernel still needs to worry about its own internal integrity) - you'd be in danger of masking (__GFP_IO|__GFP_FS) permanently off the underlying file; and furthermore, losing error flags (AS_EIO, AS_ENOSPC) which share the same unsigned long. Neither likely but both wrong. See the comment on mapping_set_gfp_mask() in include/pagemap.h: * This is non-atomic. Only to be used before the mapping is activated. Strictly speaking, I guess loop was a little bit guilty even when just loop_set_fd() did it: the underlying mapping might already be active. It appears to be just as guilty as you in its do_loop_switch() case (done at BIO completion time), but that's for a LOOP_CHANGE_FD ioctl which would only be expected to be called once, during installation; whereas you're using mapping_set_gfp_mask here with great frequency. Another point on this is: loop masks __GFP_IO|__GFP_FS off the file for the whole duration while it is looped, whereas you're flipping it just in this preliminary section of unionfs_writepage. I think you're probably okay to be doing it only here within ->writepage: I think loop covered every operation because it's at the block device level, perhaps both reads and writes needed to serve reclaim at the higher FS level; and also easier to do it once for all. Are you wrong to be doing it only around the grab_cache_page, leaving the lower level ->writepage further down unprotected? Certainly doing it around the grab_cache_page is likely to be way more important than around the ->writepage (but rather depends on filesystem). And on reflection, I think that the lower filesystem's writepage should already be using GFP_NOFS to avoid deadlocks in any of its allocations when wbc->for_reclaim, so you should be okay just masking off around the grab_cache_page. (Actually, in the wbc->for_reclaim case, I think you don't really need to call the lower level writepage at all. Just set_page_dirty on the lower page, unlock it and return. In due course that memory pressure which has called unionfs_writepage, will come around to the lower level page and do writepage upon it. Whether that's a better strategy or not, I'm do not know.) There's an attractively simple answer to the mapping_set_gfp_mask locking problem, if we're confident that it's only needed around the grab_cache_page. Look at the declaration of grab_cache_page in linux/pagemap.h: it immediately extracts the gfp_mask from the mapping and passes that down to find_or_create_page, which doesn't use the mapping's gfp_mask at all. So, stop flipping and use find_or_create_page directly yourself. You need to unlock_page, don't you? Or move the "out" label up before the unlock_page. There seems to have been confusion about this even in the current 2.6.23-mm1 unionfs_writepage: the only case in which a writepage returns with its page still locked is that AOP_WRITEPAGE_ACTIVATE case we're going to get rid of. Better to use kmap_atomic. unionfs_writepage cannot get called at interrupt time, I see no reason to avoid KM_USER0 and KM_USER1: therefore simply use copy_highpage(lower_page, page) and let it do all the kmapping and copying. If PAGE_CACHE_SIZE ever diverges from PAGE_SIZE (e.g. Christoph Lameter's variable page_cache_size patches), then yes, this would need updating to a loop over several pages (or better, linux/highmem.h should then provide a function to do it). Another instance of that confusion: lower_page is already unlocked, on success or failure; it's only the anomalous AOP_WRITEPAGE_ACTIVATE case that leaves it locked. Page needs to be unlocked, whether here or at out. Right (once you've got the locking right). I'm not going to comment much on your unionfs_sync_page: it looks like a total misunderstanding of what sync_page does, assuming from the name that it syncs the page in a fsync/msync/sync manner. No, it would much better be named "unplug_page_io": please take a look at sync_page() in mm/filemap.c, observe how it gets called (via wait_on_page_bit) and what it ends up doing. (Don't pay much attention to what Documentation/filesystems says about it, either!) It's an odd business; I think Nick did have a patch to get rid of it completely, which would be nice; but changes to unplugging I/O (kicking off the I/O after saving up several requests to do all together) can be a hang-prone business. Do you need a unionfs_sync_page at all? I think not, since the I/O, plugged or unplugged, is below your lower level filesystem. But I started by mentioning a serious tmpfs problem. Now I've persuaded you to go back to grab_cache_page/find_or_create_page, I realize a nasty problem for tmpfs. Under memory pressure, you're liable to be putting tmpfs file pages into the page cache at the same time as they're already present but in disguise as swap cache pages. Perhaps the solution will be quite simple (since you're overwriting the whole page), but I do need to think about it. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Alan Cox | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
| Jan Engelhardt | intel iommu (Re: -mm merge plans for 2.6.23) |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | Re: [GIT]: Networking |
| Evgeniy Polyakov | Re: [BUG] New Kernel Bugs |
