Document design and implementation of union mounts (a.k.a. writable
overlays).
---
 Documentation/filesystems/union-mounts.txt |  899 ++++++++++++++++++++++++++++
 1 files changed, 899 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..ba830e8
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,899 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one read-only
+file system, with all writes going to the writable file system. The
+namespace of both file systems appears as a combined whole to
+userland, with files and directories on the writable file system
+covering up any files or directories with matching pathnames on the
+read-only file system. The read-write file system is the "topmost"
+or "upper" file system and the read-only file system is the "lower"
+file system. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts). However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level. COW block devices only
+increase their divergence as time goes on, and a fully coherent
+writable ...
I spent some time looking at patch 27 trying to figure it out for myself,
but my lack of splice()-fu doomed me. :)

A few quick questions:

1) For calls like chmod() that only touch the metadata, does it still
trigger a copyup of the data, or just the affected metadata?

2) Is the copyup of data synchronous, or is it done asynchronously in the
background? The comments in union_copyup_len() about "We raced with
someone else" imply this is synchronous. If so, a note should probably be
made that an open() may take a little while under some conditions. There's
a *lot* of code out there that assumes that open() calls are *really*
cheap.

I wonder how many programs don't correctly deal with an ENOSPC on open()
of an already existing file.

(The answers probably don't matter unless somebody ends up invoking a
copyup of a gigabyte file, which of course implies one of my users will
end up doing exactly that. :)
Yes, it copies up the whole file. Right now there's no concept of

It's synchronous. Code that assumes open() calls are cheap will have

I'm not too worried about that - how many programs deal correctly with
ENOSPC when it is normally returned?

-VAL
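The answers above (whole-file copyup even for a metadata-only change, done synchronously, with ENOSPC possible on open() of an existing file) can be illustrated with a small userspace model. This is only a sketch and not the kernel implementation: the class ToyUnion, its methods, and the byte-counting upper_free field are all hypothetical names invented for illustration.

```python
import errno
import os


class ToyUnion:
    """Hypothetical toy model: a read-only lower layer under a writable upper layer."""

    def __init__(self, lower, upper_capacity):
        self.lower = dict(lower)          # path -> (mode, data); never modified
        self.upper = {}                   # path -> (mode, data); receives all writes
        self.upper_free = upper_capacity  # bytes still available on the upper layer

    def lookup(self, path):
        # An upper entry covers any lower entry with the same pathname.
        if path in self.upper:
            return self.upper[path]
        if path in self.lower:
            return self.lower[path]
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)

    def copy_up(self, path):
        # Synchronous whole-file copy: the caller waits until it completes.
        if path in self.upper:
            return
        mode, data = self.lookup(path)
        if len(data) > self.upper_free:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC), path)
        self.upper[path] = (mode, bytes(data))
        self.upper_free -= len(data)

    def chmod(self, path, mode):
        # A metadata-only change still copies up all of the data first.
        self.copy_up(path)
        _, data = self.upper[path]
        self.upper[path] = (mode, data)

    def open_for_write(self, path):
        # Opening an existing lower file for writing can fail with ENOSPC.
        self.copy_up(path)
        return path
```

In this model, a caller that opens an existing lower-layer file for writing has to be prepared to catch OSError and check for errno.ENOSPC, which is the case raised in the question.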
Can copyup be interrupted? E.g. if I chmod an 80GB file, will the

Does it apply the same permission checks that a program doing copy+rename
would have to pass? I guess that is just write access to the directory.

Does it effectively "rename" all hard links referring to the file, to
point to the new version, or does it only affect the path that was used by
the writer/modifier, leaving the other links continue to refer

Why is O_DIRECT relevant? O_DIRECT doesn't imply writing, and copy+rename
behaviour is the same with O_DIRECT as not. Some programs use O_DIRECT to
read very large files, without intending they will ever be modified. For
example, qemu using O_DIRECT to

I'm finding it hard to imagine _guaranteeing_ really read-only. All you
can guarantee is that the NFS says it is read-only. For example, a
userspace NFS server cannot prevent the filesystem it's serving from
changing.

I can imagine some database-like programs getting confused by that. Maybe
it would be better to fail copyup operations when the file is currently
open O_RDONLY by anyone, analogous to the way writable mounts are refused
when any union holds it read-only? Are there uses likely to be broken by
that behaviour?

Thanks,
-- Jamie
The right behavior is that you should be able to control-C it, but I doubt
that currently works. Let me look into testing and implementing

In order to update all the hard links to a file, we would have to walk the
entire file system searching for links with a matching inode number and
copy them up too. We're never going to do a file-system-wide walk, so we
won't do that. The other hard links still point to the old copy of the
file. We hope applications don't

Each file system that wants to support union mounts will need to implement
the features necessary for that layer (hard read-only for

That's an interesting question. In general, this seems like a bad idea -
any process can prevent another process from writing to a file by opening
it. This is like chmod'ing it to 444.

-VAL
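The hard-link behaviour described above (only the pathname that was modified gets copied up, while other links keep pointing at the old copy) can be sketched as follows. This is a hypothetical userspace model, not kernel code: Inode, copy_up() and resolve() are names made up for the example, and each layer is reduced to a dict mapping pathnames to inode objects.

```python
class Inode:
    """Hypothetical stand-in for a file's inode; holds only the file data."""

    def __init__(self, data):
        self.data = data


def copy_up(lower_names, upper_names, path):
    """Copy a single pathname to the upper layer.

    Other names linked to the same lower inode are left alone, because
    finding them would require walking the whole lower file system.
    """
    if path not in upper_names:
        upper_names[path] = Inode(bytes(lower_names[path].data))
    return upper_names[path]


def resolve(lower_names, upper_names, path):
    # An upper entry covers a lower entry with the same pathname.
    if path in upper_names:
        return upper_names[path]
    return lower_names[path]


# Two hard links to one inode on the lower layer.
shared = Inode(b"v1")
lower = {"/a": shared, "/b": shared}
upper = {}

# Modifying the file through "/a" copies up only that pathname.
copy_up(lower, upper, "/a").data = b"v2"
```

After the write through "/a", resolving "/a" in this model returns the new upper inode, while "/b" still resolves to the unchanged lower inode, so the two names no longer refer to the same file.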
