Document design and implementation of union mounts (a.k.a. writable
overlays).

Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 Documentation/filesystems/union-mounts.txt |  752 ++++++++++++++++++++++++++++
 1 files changed, 752 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..977a2b5
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,752 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one or more
+read-only file systems, with all writes going to the writable file
+system. The namespace of both file systems appears as a combined
+whole to userland, with files and directories on the writable file
+system covering up any files or directories with matching pathnames on
+the read-only file system. The read-write file system is the
+"topmost" or "upper" file system and the read-only file systems are
+the "lower" file systems. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (include NFS mounts). However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level. COW block devices only
+increase ...
On Sun, 8 Aug 2010 11:52:31 -0400

Thanks for including lots of documentation!

Given how intrusive this patch set is, I would really like to see the
justification above fleshed out a bit more. What would be particularly
valuable would be real-life use cases where someone has put this to work
and found that it genuinely meets a need. I realise there can be a bit
of a chicken/egg issue there, but if you do have anything it would be
good to include it. A particular need for this is the fact that a number
of standard features are not going to be supported and it would be good
to be sure that there are real cases that don't need those.

I wonder if the restriction is not more serious than this. Given the
prevalence of "copy-up", particularly of directories, I would think that
even off-line upgrade would not be supported. If the upgrade adds a file
in a directory that has already been read (and hence copied-up), or
changes a file that has been chmodded, then the upgrade will not be
completely visible, which sounds dangerous. Don't you have to require
(or strongly recommend) that the underlying filesystem remain unchanged
while the on-top filesystem exists, not just while it is mounted?

As a counter-position for you or others to write cogent arguments
against, and to then include those arguments in the justification
section, I would like to present my preferred approach, which is
essentially that the problem is better solved at the block layer or the
distro layer.

A distro-layer solution would be appropriate when you want a common root
filesystem with per-host configuration, whether in an NFS cluster or a
virtual-machine cluster. This involves every file that might need
configuration being made a symlink to e.g. /local, and every instance
mounts some local directory on /local. e.g.

	mount --bind /local-`hostname` /local

This is obviously less transparent, but it is also more predictable (you
know exactly what can and cannot be changed by an upgrade on the ...
DM snapshot provides the COW block feature, and it will match your idea
since the COW device is generally much smaller. But it doesn't support
off-line upgrade either. If you do that, the result is equivalent to a
corrupted filesystem on the DM snapshot device.

Here are the pros and cons of DM snapshot compared with a union.
- The number of bytes to be copied between devices is much smaller.
- Only one filesystem type can be used.
- The fs must be writable; a read-only fs is not allowed, even for the
  lower original device, so a compressed fs will not be usable. But if
  we use a loopback mount, we may address this issue. For instance,
	mount /cdrom/squashfs.img /sq
	losetup /sq/ext2.img
	losetup /somewhere/cow
	dmsetup "snapshot /dev/loop0 /dev/loop1 ..."
- It will be difficult (or need more operations) to extract the
  difference between the original device and the COW.
- DM snapshot-merge may help a lot when users try merging. In the
  fs-layer union, users will use rsync(1).
- In an fs-based union, users can add/remove members (layers)
  dynamically without unmounting. Of course, no file on the layer being
  removed may be busy.

Also, here are my concerns about UnionMount. All these issues have been
reported before.
- For users, the inode number may change silently, e.g. on copy-up.
- link(2) may break by copy-up.
- read(2) may get obsolete file data (fstat(2) too).
- fcntl(F_SETLK) may be broken by copy-up.
- Unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after
  open(O_RDWR).

J. R. Okajima
--
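The first two concerns in the list above (silent inode number change and
broken hard links after copy-up) can be reproduced in userspace with
ordinary directories. The following is only a sketch under stated
assumptions: union_path() and copy_up() are hypothetical helpers written
for this illustration, two temporary directories stand in for the layers,
and none of this is UnionMount or aufs code.

```python
import os
import shutil
import tempfile


def union_path(upper, lower, name):
    """Resolve a name the way a two-layer union would: upper hides lower."""
    candidate = os.path.join(upper, name)
    if os.path.exists(candidate):
        return candidate
    return os.path.join(lower, name)


def copy_up(upper, lower, name):
    """Toy copy-up: duplicate the lower file into the upper layer."""
    shutil.copy2(os.path.join(lower, name), os.path.join(upper, name))


def demo():
    root = tempfile.mkdtemp()
    try:
        lower = os.path.join(root, "lower")
        upper = os.path.join(root, "upper")
        os.mkdir(lower)
        os.mkdir(upper)

        # "a" and "b" are hard links to the same inode on the lower layer.
        with open(os.path.join(lower, "a"), "w") as f:
            f.write("old")
        os.link(os.path.join(lower, "a"), os.path.join(lower, "b"))

        ino_before = os.stat(union_path(upper, lower, "a")).st_ino
        linked_before = ino_before == os.stat(union_path(upper, lower, "b")).st_ino

        # Writing through the union forces a copy-up of "a" only.
        copy_up(upper, lower, "a")
        with open(union_path(upper, lower, "a"), "w") as f:
            f.write("new")

        ino_after = os.stat(union_path(upper, lower, "a")).st_ino
        with open(union_path(upper, lower, "b")) as f:
            b_content = f.read()

        return {
            "linked_before": linked_before,
            "inode_changed": ino_before != ino_after,
            "b_content": b_content,
        }
    finally:
        shutil.rmtree(root)
```

In this model "a" and "b" start as one file, but after the copy-up the
union shows "a" with a new inode number while "b" still holds the old
contents, which is the link(2) breakage described in the list.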
On Tue, 17 Aug 2010 16:44:30 -0400

You present a good argument that "something must be done", but it gives
no pointers to what that something should be.

However, until it is merged into mainline it would be good to keep the
justification of this change well documented so you don't have to repeat
the same argument to every bozo who pops up and thinks they know better.
Ultimately the git commit log (or even an lwn.net article) could well be
a better place to store this rather than Documentation/, but I think
there is

Absolutely right - no argument about that. I just think that should be
explicit in the documentation. Right after the "Online upgrade"
paragraph:

   Even off-line upgrade - e.g. installing software on an exported
   filesystem and then remounting that on the client and union-mounting
   a pre-existing overlay on top of it - is significantly non-trivial
   and would require significant extra management software to create a
   working solution.

That may be enough justification to work on this as a research project,
but I don't think it is enough justification to merge it into mainline.
Just because aufs might be the best available solution to a particular
problem doesn't mean that making a better aufs (aka VFS union mounts)
will be the best possible solution. That can only be determined if the
key needs, and the problems with all available solutions, are publicly
known.

Thanks,
NeilBrown
--
I think that the safety of personal data and the ability to make changes
to the layers independently are very important features that justify
union mounts.

If you do block-level COW and then lose the lower filesystem layer (e.g.
lose the LiveCD or lose network access to the NFS master), then you have
no guarantee of being able to access the data you added (e.g. your
/home) since you'll only have a corrupted "filesystem piece" that fsck
may or may not be able to fix.

Also, you can't modify the lower layer at all (without rebuilding the
upper layer from scratch), while with a union mount minor changes can be
done with no issues (e.g. replacing the LiveCD with a new minor update,
or applying a security update to the NFS master), and major ones can be
done with some care.

Hence, in any case where the layers are even slightly separated, or
where you need to modify them independently, or extract the changes,
union mounts/unionfs are much better, and often actually the only viable
solution. This includes the LiveCD case, the NFS mount case and some use
cases with virtual machines.

This would be the case even more strongly if additional features like
online modification of lower layers, or path resolution to the most
recent file instead of the one in the highest layer, were added.

Of course, this is why people currently use unionfs or aufs, and a
VFS-based solution seems just better, since it is going to be more
efficient and guaranteed to be relatively bug-free once it satisfies the
high quality requirements for inclusion in the core kernel.
--
Sorry, all the documentation I have about union mounts is publicly
available. I'll announce any new documentation in the usual way.

If you are willing to do the research personally, you can start with the
list of projects using unionfs:

The problems with all available solutions, including union mounts, are
thoroughly documented in my four LWN articles on union mounts:

http://lwn.net/Articles/324291/
http://lwn.net/Articles/325369/
http://lwn.net/Articles/327738/
http://lwn.net/Articles/396020/

I understand your desire for better documentation. But contrary to
popular conception, I hate writing and do it as seldom as possible. :)

Thanks,
-VAL
--
Hmm, anyone who hits a crash in aufs, please let me know. While I never
say aufs is bug-free, I have not received such a report recently. I
always try to fix a bug as soon as possible when I get a report. A reply
I have to write repeatedly to those who have met a problem in aufs and
reported it to the aufs-users ML is "your aufs version is too old.
please get the latest one." This is because aufs is released every week
and some Linux distributions keep using a very old aufs version, even
one more than a year older than their release date.

By the way, I don't have objection to merge Val's UnionMount into

Is it (mostly) possible by receiving a notification via fsnotify? For
remote FS, their ->d_revalidate() will tell us something is changed.

J. R. Okajima
--
According to Al Viro, unionfs has some fundamental architectural
problems that prevent it from being correct and lead to crashes:

http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html

The main question for me is whether aufs has fixed these problems. If

Think about the case of two different RPM package database files. One
contains the info from newly installed packages on the top layer file
system. The lower layer contains info from packages newly installed on
the lower file system. You don't want either file; you want the merged
package database showing the info for all packages installed on both
layers. Any practical file system based system is only going to be able
to pick one file or the other, and it's going to be wrong in some cases.

-VAL
--
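The package database argument above can be made concrete with a small
sketch. Everything here is an assumption made for illustration: the
databases are plain dicts of package name to version, and union_view()
and merged_view() are hypothetical names, not part of RPM or of any union
file system. The merge policy shown is application knowledge, which is
exactly what a file system does not have.

```python
def union_view(upper_db, lower_db):
    """File-level union: the whole upper file hides the whole lower file."""
    return upper_db if upper_db is not None else lower_db


def merged_view(upper_db, lower_db):
    """What the package manager would want: entries from both layers.

    Letting the upper entry win on a name clash is an arbitrary choice
    made for this sketch.
    """
    merged = dict(lower_db)
    merged.update(upper_db)
    return merged


# The lower layer starts with one package; the database is copied up when
# a package is installed through the union.
lower_db = {"bash": "4.1"}
upper_db = dict(lower_db)
upper_db["vim"] = "7.2"

# Later, a package is installed directly on the lower layer.
lower_db["gcc"] = "4.4"

seen_through_union = union_view(upper_db, lower_db)
wanted = merged_view(upper_db, lower_db)
```

Here seen_through_union lacks "gcc" and the lower file alone lacks "vim",
so picking either file is wrong; only the merged view lists all three.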
Although I don't fully understand your question, aufs actually verifies
the parent-child relationship after lock_rename() on the writable layer.
Such verification is done in other operations too. And aufs provides
three options to specify the level of verification. When the highest
(most strict) level is given, aufs_rename looks up again after
lock_rename() and compares the parent it obtained with the given
(cached) parent.

Let me make sure. Do you mean something like this?
- a user makes a union
- fileA exists on the lower layer but not on the upper
- modify fileA in the union
  --> the file is copied-up and updated on the upper layer.
- modify fileA on the lower layer directly (by-passing the union)
  --> the file on the lower is updated.
- and the user will not see the up-to-date fileA in the union; it lacks
  the modification made on the lower directly.

Then I'd say it is expected behaviour. Simply, the upper file hides the
lower.

While UnionMount takes a block device as the parameter when making a
union, aufs takes a directory.

	# mount /dev/sda /u
	# mount -o union /dev/sdb /u

	# mount /dev/sda /ro
	# mount /dev/sdb /rw
	# mount -t aufs -o br:/rw:/ro none /u

It means sda is hidden in UnionMount (generally) and users cannot access
it directly. But in aufs, it is possible via /ro. For those who want to
hide /ro and stop accessing it directly, the aufs document suggests
mounting another thing onto /ro. It can be an empty directory if you use
"mount -o bind".

J. R. Okajima
--
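The fileA scenario above can be traced with a toy model. This is a
minimal sketch, assuming two dicts stand in for the layers; ToyUnion and
its methods are hypothetical and do not correspond to aufs or UnionMount
internals.

```python
class ToyUnion:
    """Two-layer union over dicts mapping file name to contents."""

    def __init__(self, lower):
        self.lower = lower  # never written through the union
        self.upper = {}     # receives every write

    def read(self, name):
        # The upper file, if present, hides the lower one.
        if name in self.upper:
            return self.upper[name]
        return self.lower[name]

    def append(self, name, data):
        # First write to a lower-only file triggers a copy-up.
        if name not in self.upper:
            self.upper[name] = self.lower.get(name, "")
        self.upper[name] += data


lower = {"fileA": "v1"}
union = ToyUnion(lower)

union.append("fileA", "+union")  # copied up, then modified on the upper
lower["fileA"] = "v2"            # direct change, bypassing the union

visible = union.read("fileA")
```

After these steps visible is "v1+union": the direct update to the lower
layer ("v2") is hidden by the copied-up file, which is the behaviour the
message calls expected.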
Thank you very much for the explanation.

Because users can hide the layers (as in UnionMount) if they want, and
that totally prohibits bypassing aufs. Additionally, they modify the
layer directly (bypassing aufs) only when it is really necessary. So the
default value of the option is not a strict one. And users can change

When a rename happens on a layer directly, aufs receives an
inotify/fsnotify event. Depending on the event type, aufs marks the
cached dentry/inode obsolete, and they will be looked up again on the
next access. Finally aufs will know that the upper parent_dir1 is not
covering the lower parent_dir2 anymore. This notification is the main
purpose of the strict option which is

No, deadlock will not happen since aufs knows the new parent-child
relationship. By using inotify/hinotify in the above answer, I hope you

I am afraid that I still may not understand well what you wrote. Do you
mean that upgrading a package involves updating several files whose
versions have to match each other within the package, and that upgrading
different packages in both the upper and lower layers directly causes a
mismatch among those files?

Although I don't think you are talking about the aufs utility aubrsync,
which runs rsync between layers, I don't understand "putting this policy
decision into the VFS". The simple rule "the upper file hides the lower"
is outside the VFS.

J. R. Okajima
--
No, that's not a sufficient description and leaves open questions about
all sorts of deadlocks and race conditions. For example, inotify events
occur while holding locks only on one layer. You obviously need to lock
the top layer to update the inheritance and parent-child relationships.
Now you are locking the lower layer first and the top layer second,
which is the reverse of the usual order.

Also, it should not be an option. If Al Viro says it's wrong, you need a
very detailed explanation of why it is right. See
Documentation/filesystems/directory-locking for an example of the
argument you have to make to show that moving things around on the lower
layer is safe. In general, your first task is to show a global lock
ordering to prove lack of deadlocks (which I don't think you should
spend time on because most VFS experts think it is impossible to do with
two read-write layers).

I'm not going to explain any more how aufs is wrong; it's the
maintainer's job to convince Al Viro and other maintainers that aufs is
right. But I hope this gave you a start and showed why union mounts is a
preferred approach for many people.

Thanks,
-VAL
--
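The lock-ordering objection above amounts to saying that the set of
"A is held while B is taken" relations must form a graph without cycles.
The sketch below checks that property for a list of acquisition
sequences. It is a generic illustration only; the lock names "upper" and
"lower" and the function names are hypothetical, and the sequences are a
simplified reading of the message, not a trace of aufs.

```python
def lock_order_edges(sequences):
    """Return (held, taken_later) pairs implied by each acquisition order."""
    edges = set()
    for seq in sequences:
        for i, held in enumerate(seq):
            for later in seq[i + 1:]:
                edges.add((held, later))
    return edges


def has_lock_order_cycle(sequences):
    """True if the sequences admit no single global lock order."""
    graph = {}
    for held, later in lock_order_edges(sequences):
        graph.setdefault(held, set()).add(later)
        graph.setdefault(later, set())

    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        # Depth-first search; reaching a GREY node means a back edge.
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# The usual path takes the top layer first; a notify-driven path that
# already holds a lower-layer lock and then needs the top layer reverses it.
usual_path = ("upper", "lower")
notify_path = ("lower", "upper")
conflict = has_lock_order_cycle([usual_path, notify_path])
consistent = has_lock_order_cycle([usual_path, usual_path])
```

A cycle does not mean a deadlock will occur on every run, only that one
is possible, which is why a global ordering argument is asked for.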
Aloha Everybody;

This all reminds me of the five dining philosophers problem and its
solutions, especially the waiter and the resource hierarchy solutions
(see [1]). And I do think that such problems can always be solved in a
real world

[1] http://en.wikipedia.org/wiki/Dining_philosophers_problem

Have fun
Christian
--
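For readers unfamiliar with the resource hierarchy solution mentioned
above, here is a minimal sketch using Python threads. The function name
dine() and its parameters are hypothetical; the only point is that every
philosopher takes the lower-numbered fork first, so no circular wait can
form.

```python
import threading


def dine(num_philosophers=5, rounds=100):
    """Run the dinner and return how many times each philosopher ate.

    Requires at least two philosophers, so that each has two distinct forks.
    """
    forks = [threading.Lock() for _ in range(num_philosophers)]
    meals = [0] * num_philosophers

    def philosopher(i):
        left, right = i, (i + 1) % num_philosophers
        # Resource hierarchy: always take the lower-numbered fork first.
        first, second = min(left, right), max(left, right)
        for _ in range(rounds):
            with forks[first]:
                with forks[second]:
                    meals[i] += 1  # only thread i writes meals[i]

    threads = [
        threading.Thread(target=philosopher, args=(i,))
        for i in range(num_philosophers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return meals
```

Whether a comparable single ordering exists across two writable file
system layers is the open question in this thread; the sketch says
nothing about that.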
I don't agree about the deadlock and race condition.
When a user modifies the dir hierarchy on the layer directly while
aufs_rename() is running, aufs will detect it after lock_rename().
It behaves like this.
- decide the layer where the actual rename operates, and create the dir
  hierarchy on it if necessary.
- lock_rename() for the layer
- call ->rename()
or
- if the file being renamed exists on the lower read-only layer, aufs
  copies it up to the upper writable layer under the rename target name.
  In this case, ->rename() is not called.
If a user changes the dir hierarchy directly on the layer before
aufs_rename(), then the notify event tells aufs about it and aufs gets
the latest hierarchy.
If it happens before lock_rename() in aufs_rename(), aufs verifies the
relationship between the target child and the locked dir. If it differs,
aufs returns EBUSY. Of course, lock_rename() follows the "ancestors first"
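The "lock, then verify, else EBUSY" pattern described above can be
sketched in userspace as follows. This is a toy model under stated
assumptions: Layer and checked_rename() are hypothetical names, the layer
is a dict of child to parent directory, and taking the two directory
locks in sorted-name order is only a stand-in for lock_rename(), whose
real ordering rules are different.

```python
import errno
import threading


class Layer:
    """Toy layer: records each child's parent directory, one lock per dir."""

    def __init__(self):
        self.parent_of = {}
        self.dir_locks = {}

    def lock_for(self, directory):
        return self.dir_locks.setdefault(directory, threading.Lock())


def checked_rename(layer, child, cached_parent, new_parent):
    """Move child to new_parent, trusting cached_parent only after locking."""
    # Stand-in for lock_rename(): lock both dirs in one fixed order.
    locks = [layer.lock_for(d) for d in sorted({cached_parent, new_parent})]
    for lock in locks:
        lock.acquire()
    try:
        # Verify the cached parent-child relationship under the locks.
        if layer.parent_of.get(child) != cached_parent:
            raise OSError(errno.EBUSY, "hierarchy changed on the layer")
        layer.parent_of[child] = new_parent
    finally:
        for lock in reversed(locks):
            lock.release()
```

A caller that cached "dir1" as the parent, while someone moved the file
directly on the layer in the meantime, gets EBUSY instead of acting on
the stale relationship.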
Since you may not read this anymore and other people don't seem to
be interested in aufs, it may not be meaningful to write down the
locking in aufs. But I will try.
First,
- since aufs is an FS, it has its own super_block, dentry and inode.
- super_block, dentry and inode in aufs have private data which contains
  an rwsem.
- the locking order for these rwsems is child-first.
- aufs specifies FS_RENAME_DOES_D_MOVE.
locking order in aufs_rename
+ down_read() for aufs sb
protects sb from branch-add, delete.
+ two down_write()s for src and dest child
protects them from other processes in aufs.
+ down_write() for the dst_parent.
+ decide the layer where we will operate, by comparing the index of
layers where the targets exist and the layer attribute (ro, rw).
+ copyup the dest dir hierarchy if necessary, by repeating
- dget_parent(), down/up_read() for the parent (in aufs)
- mutex_lock() for the dir (on the layer) to mkdir the non-existing
child dir on the layer and verify the parent-child relationship.
- mkdir and setattr on the layer.
- mutex_unlock() the dir on ...