Re: [PATCH 14/39] union-mount: Union mounts documentation

Previous thread: [PATCH 08/39] whiteout: Allow removal of a directory with whiteouts by Valerie Aurora on Sunday, August 8, 2010 - 8:52 am. (1 message)

Next thread: [PATCH 05/39] whiteout/NFSD: Don't return information about whiteouts to userspace by Valerie Aurora on Sunday, August 8, 2010 - 8:52 am. (1 message)
From: Valerie Aurora
Date: Sunday, August 8, 2010 - 8:52 am

Document design and implementation of union mounts (a.k.a. writable
overlays).

Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 Documentation/filesystems/union-mounts.txt |  752 ++++++++++++++++++++++++++++
 1 files changed, 752 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..977a2b5
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,752 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one or more
+read-only file systems, with all writes going to the writable file
+system.  The namespace of both file systems appears as a combined
+whole to userland, with files and directories on the writable file
+system covering up any files or directories with matching pathnames on
+the read-only file system.  The read-write file system is the
+"topmost" or "upper" file system and the read-only file systems are
+the "lower" file systems.  A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts).  However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level.  COW block devices only
+increase ...
From: Neil Brown
Date: Monday, August 9, 2010 - 3:56 pm

On Sun,  8 Aug 2010 11:52:31 -0400

Thanks for including lots of documentation!
Given how intrusive this patch set is, I would really like to see the
justification above fleshed out a bit more.

What would be particularly valuable would be real-life use cases where
someone has put this to work and found that it genuinely meets a need.
I realise there can be a bit of a chicken/egg issue there, but if you do have
anything it would be good to include it.
A particular need for this is the fact that a number of standard features
are not going to be supported and it would be good to be sure that there are
real cases that don't need those.


I wonder if the restriction is not more serious than this.
Given the prevalence of "copy-up", particularly of directories, I would think
that even off-line upgrade would not be supported.
If the upgrade adds a file in a directory that has already been read (and
hence copied-up), or changes a file that has been chmodded, then the upgrade
will not be completely visible, which sounds dangerous.

Don't you have to require (or strongly recommend) that the underlying
filesystem remain unchanged while the on-top filesystem exists, not just
while it is mounted ??
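
To make the concern concrete, here is a toy Python model (not kernel code)
of the scenario described above. It assumes, as stated in this mail, that
reading a directory copies it up, and that the copied-up directory is then
served from the upper layer alone. The class and all names are hypothetical.

```python
# Toy model: one read-only lower layer, one writable upper layer.
# Assumption: the first readdir of a directory copies its entries up.

class DirCopyUpUnion:
    def __init__(self, lower, upper=None):
        self.lower = lower                              # dict: directory -> set of names
        self.upper = upper if upper is not None else {}  # copied-up directories

    def readdir(self, directory):
        # The first read copies the lower entries to the upper layer;
        # later reads are answered from that copy only.
        if directory not in self.upper:
            self.upper[directory] = set(self.lower.get(directory, set()))
        return sorted(self.upper[directory])


lower = {"/usr/bin": {"ls", "cat"}}
union = DirCopyUpUnion(lower)
before = union.readdir("/usr/bin")        # triggers the directory copy-up

# "Off-line upgrade": the union is unmounted, the lower layer gains a file,
# and the pre-existing upper layer is mounted on top again.
lower["/usr/bin"].add("newtool")
remounted = DirCopyUpUnion(lower, upper=union.upper)
after = remounted.readdir("/usr/bin")

print(before, after)
```

In this model the upgraded lower layer contains "newtool", but the
remounted union does not list it, which is the incomplete-visibility
problem raised above.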



As a counter-position for you or others to write cogent arguments against,
and to then include those arguments in the justification section,  I would
like to present my preferred approach, which is essentially that the problem
is better solved at the block layer or the distro layer.

A distro-layer solution would be appropriate when you want a common root
filesystem with per-host configuration, whether in an NFS cluster or a
virtual-machine cluster.
This involves every file that might need configuration being made a symlink
to e.g. /local, and every instance mounts some local directory on /local.
e.g.  mount --bind /local-`hostname` /local

This is obviously less transparent, but it is also more predictable (you
know exactly what can and cannot be changed by an upgrade on the ...
From: J. R. Okajima
Date: Tuesday, August 10, 2010 - 6:51 pm

:::

DM snapshot provides the COW block feature and it will match your idea
since the size of the COW device is generally much smaller. But it doesn't
support off-line upgrade either. If you do that, it is equivalent to
corrupting the filesystem on the DM snapshot device.

Here are the pros and cons of DM snapshot compared with a union.
- the number of bytes to be copied between devices is much smaller.

- only a single filesystem type can be used.
- the fs must be writable, no readonly fs, even for the lower original
  device. so a compressed fs will not be usable. but if we use
  loopback mount, we may address this issue.
  for instance,
	mount /cdrom/squashfs.img /sq
	losetup /sq/ext2.img
	losetup /somewhere/cow
	dmsetup "snapshot /dev/loop0 /dev/loop1 ..."

- it will be difficult (or needs more operations) to extract the
  difference between the original device and COW.

- DM snapshot-merge may help a lot when users try merging. in the
  fs-layer union, users will use rsync(1).

- in fs-based union, users can add/remove members (layers) dynamically
  without unmounting. of course, all files on the layer being removed
  should not be busy.


Also here is my concern about UnionMount. All these issues have been
reported before.
- for users, the inode number may change silently. eg. copy-up.
- link(2) may break by copy-up.
- read(2) may return stale file data (fstat(2) too).
- fcntl(F_SETLK) may be broken by copy-up.
- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after
  open(O_RDWR).
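
The first two items can be illustrated from userland with a minimal Python
sketch. It is not a union mount: "copy-up" is approximated by a plain copy
into a second directory, and the directory names are made up. It assumes
the temporary directory lives on a filesystem that supports hard links.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    lower = os.path.join(root, "lower")
    upper = os.path.join(root, "upper")
    os.mkdir(lower)
    os.mkdir(upper)

    # Two hard links to one file on the "lower layer".
    a = os.path.join(lower, "a")
    b = os.path.join(lower, "b")
    with open(a, "w") as f:
        f.write("data")
    os.link(a, b)
    same_inode_before = os.stat(a).st_ino == os.stat(b).st_ino

    # Approximate "copy-up" of one name: copy it into the upper directory.
    a_up = os.path.join(upper, "a")
    shutil.copy2(a, a_up)
    inode_changed = os.stat(a_up).st_ino != os.stat(a).st_ino

    # A write through the copied-up name no longer reaches the other link.
    with open(a_up, "a") as f:
        f.write("+more")
    with open(b) as f:
        link_broken = f.read() == "data"

print(same_inode_before, inode_changed, link_broken)
```

The copy gets a new inode number, and "b" no longer follows changes made
through "a", which is how link(2) semantics break after a copy-up.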


J. R. Okajima
--

From: Neil Brown
Date: Tuesday, August 17, 2010 - 3:53 pm

On Tue, 17 Aug 2010 16:44:30 -0400

You present a good argument that "something must be done", but it gives no
pointers to what that something should be.

However, until it is merged in to mainline it would be good to keep the
justification of this change well documented so you don't have to repeat the
same argument to every bozo who pops up and thinks they know better.
Ultimately the git commit log (or even an lwn.net article) could well be a
better place to store this rather than Documentation/, but I think there is

Absolutely right - no argument about that.
I just think that should be explicit in the documentation.
Right after the "Online upgrade" paragraph:

  Even off-line upgrade - e.g. installing software on an exported filesystem
  and then remounting that on the client and union-mounting a pre-existing
  overlay on top of it - is significantly non-trivial and would require
  significant extra management software to create a working solution.

That may be enough justification to work on this as a research project, but I
don't think it is enough justification to merge it into mainline.

Just because aufs might be the best available solution to a particular problem
doesn't mean that making a better aufs (aka VFS union mounts) will be the best
possible solution.  That can only be determined if the key needs, and the
problems with all available solutions, are publicly known.

Thanks,
NeilBrown

--

From: Luca Barbieri
Date: Tuesday, August 17, 2010 - 5:15 pm

I think that the safety of personal data and the ability to make
changes to the layers independently are very important features that
justify union mounts.

If you do block-level COW and then lose the lower filesystem layer
(e.g. lose the LiveCD or lose network access to the NFS master), then
you have no guarantee of being able to access the data you added (e.g.
your /home) since you'll only have a corrupted "filesystem piece" that
fsck may or may not be able to fix.

Also, you can't modify the lower layer at all (without rebuilding the
upper layer from scratch), while with a union mount minor changes can
be done with no issues (e.g. replacing the LiveCD with a new minor
update, or applying a security update to the NFS master), and major
ones can be done with some care.

Hence, in any case where the layers are even slightly separated, or
where you need to modify them independently, or extract the changes,
union mounts/unionfs are much better, and often actually the only
viable solution.

This includes the LiveCD case, the NFS mount case and some use cases
with virtual machines.

This would be the case even more strongly if additional features like
online modification of lower layers, or path resolution to the most
recent file instead of the one in the highest layer, were added.

Of course, this is why people currently use unionfs or aufs, and a
VFS-based solution seems just better, since it is going to be more
efficient and guaranteed to be relatively bug-free once it satisfies
the high quality requirements for inclusion in the core kernel.
--

From: Valerie Aurora
Date: Wednesday, August 18, 2010 - 12:04 pm

Sorry, all the documentation I have about union mounts is publicly
available.  I'll announce any new documentation in the usual way.

If you are willing to do the research personally, you can start with
the list of projects using unionfs:



The problems with all available solutions, including union mounts, are
thoroughly documented in my four LWN articles on union mounts:

http://lwn.net/Articles/324291/
http://lwn.net/Articles/325369/
http://lwn.net/Articles/327738/
http://lwn.net/Articles/396020/

I understand your desire for better documentation.  But contrary to
popular conception, I hate writing and do it as seldom as possible. :)

Thanks,

-VAL
--

From: J. R. Okajima
Date: Tuesday, August 17, 2010 - 6:23 pm

Hmm, anyone who hits a crash in aufs, please let me know.
While I never say aufs is bug-free, I haven't received such a report
recently. I always try to fix a bug as soon as possible when I get a
report.

The reply I have to write repeatedly to those who have met a problem in
aufs and reported it to the aufs-users ML is "your aufs version is too old.
please get the latest one."
That is because aufs is released every week while some linux distributions
keep using a very old aufs version, even more than one year older than
their release date.

By the way, I don't have objection to merge Val's UnionMount into

Is it (mostly) possible by receiving a notification via fsnotify?
For remote FS, their ->d_revalidate() will tell us something is changed.


J. R. Okajima
--

From: Valerie Aurora
Date: Wednesday, August 18, 2010 - 11:55 am

According to Al Viro, unionfs has some fundamental architectural problems
that prevent it from being correct and lead to crashes:

http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html

The main question for me is whether aufs has fixed these problems.  If


Think about the case of two different RPM package database files.  One
contains the info from newly installed packages on the top layer file
system.  The lower layer contains info from packages newly installed
on the lower file system.  You don't want either file; you want the
merged package database showing the info for all packages installed
on both layers.  Any practical file system based system is only going
to be able to pick one file or the other, and it's going to be wrong
in some cases.
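
A small Python illustration of that point follows. The package names and
versions are invented for the example, and the databases are reduced to
dictionaries; real RPM databases are of course not this simple.

```python
# Hypothetical package records; names and versions are made up.
lower_db = {"bash": "4.1", "openssl": "1.0.0a"}   # installed on the lower layer
upper_db = {"bash": "4.1", "vim": "7.2"}          # installed on the top layer

# A file-level union can only expose one whole file or the other.
pick_upper = upper_db
pick_lower = lower_db

# What the user actually wants: the merged record of both layers.
wanted = {**lower_db, **upper_db}

missing_if_upper = sorted(set(wanted) - set(pick_upper))
missing_if_lower = sorted(set(wanted) - set(pick_lower))
print(missing_if_upper, missing_if_lower)
```

Whichever file is chosen, one package known to the other layer is missing
from the visible database.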

-VAL
--

From: J. R. Okajima
Date: Wednesday, August 18, 2010 - 6:34 pm

Although I don't fully understand your question, aufs actually verifies
the parent-child relationship after lock_rename() on the writable layer.
Such verification is done in other operations too.
And aufs provides three options to specify the level of
verification. When the highest (most strict) level is given, aufs_rename
looks up again after lock_rename() and compares the parent it got with the
given (cached) parent.
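
The general "lock, then look up again and compare" pattern described here
can be sketched in Python as below. This is only a userland toy, not aufs
code: ToyLayer, checked_rename() and the single lock standing in for
lock_rename() are all hypothetical, and returning EBUSY on a mismatch is
taken from the description of aufs later in this thread.

```python
import errno
import threading


class ToyLayer:
    """Hypothetical stand-in for a writable layer: name -> parent directory."""

    def __init__(self):
        self.parent_of = {}
        self.lock = threading.Lock()    # stands in for lock_rename()


def checked_rename(layer, name, cached_parent, new_parent):
    with layer.lock:
        # Look the parent up again under the lock and compare it with the
        # cached value; refuse the rename if the layer changed underneath.
        if layer.parent_of.get(name) != cached_parent:
            return -errno.EBUSY
        layer.parent_of[name] = new_parent
        return 0


layer = ToyLayer()
layer.parent_of["f"] = "dir1"
cached = "dir1"                         # what the union remembered
layer.parent_of["f"] = "dir2"           # someone moves "f" directly on the layer

stale_result = checked_rename(layer, "f", cached, "dir3")
fresh_result = checked_rename(layer, "f", "dir2", "dir3")
print(stale_result, fresh_result)
```

The rename based on the stale cached parent is rejected, while the one made
after a fresh lookup succeeds.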

Let me make sure.
Do you mean something like this?
- a user makes a union
- fileA exists on the lower layer but not on the upper
- modify fileA in the union
  --> the file is copied-up and updated on the upper layer.
- modify fileA on the lower layer directly (by-passing union)
  --> the file on the lower is updated.
- and the user will not see the up-to-date fileA in the union; it lacks the
  modification made on the lower directly.

Then I'd say it is expected behaviour. Simply the upper file hides
the lower.
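
The steps above can be replayed with a toy Python model. It is not aufs or
UnionMount code; the FileUnion class is hypothetical and only implements
the rule "the upper file hides the lower" plus copy-up on first write.

```python
class FileUnion:
    def __init__(self, lower):
        self.lower = lower      # dict: name -> contents, read-only via the union
        self.upper = {}         # writable layer, starts empty

    def read(self, name):
        # The upper file, if present, hides the lower one.
        if name in self.upper:
            return self.upper[name]
        return self.lower[name]

    def append(self, name, data):
        # Copy-up on first write, then modify the upper copy only.
        if name not in self.upper:
            self.upper[name] = self.lower[name]
        self.upper[name] += data


lower = {"fileA": "v1"}
union = FileUnion(lower)

union.append("fileA", "+union")     # copied up and updated on the upper layer
lower["fileA"] = "v2"               # direct change on the lower, bypassing the union

seen = union.read("fileA")
print(seen)
```

The union keeps returning the copied-up contents, so the direct change on
the lower layer stays invisible.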

While UnionMount takes a block device as a parameter of the union-making
operation, aufs takes a directory.
# mount /dev/sda /u
# mount -o union /dev/sdb /u

# mount /dev/sda /ro
# mount /dev/sdb /rw
# mount -t aufs -o br:/rw:/ro none /u

It means sda is hidden in UnionMount (generally) and users cannot access
it directly. But in aufs, it is possible via /ro. For those who want to
hide /ro and stop accessing it directly, the aufs document suggests mounting
another thing onto /ro. It can be an empty directory if you use "mount -o
bind".


J. R. Okajima
--

From: J. R. Okajima
Date: Monday, August 23, 2010 - 7:28 pm

Thank you very much for the explanation.



Because users can hide the layers (as in UnionMount) if they want,
and that totally prohibits bypassing aufs. Additionally they modify the
layer directly (bypassing aufs) only when it is really necessary. So the
default value of the option is not a strict one. And users can change

When a rename happens on a layer directly, aufs receives an
inotify/fsnotify event. Depending on the event type, aufs marks the cached
dentry/inode as obsolete and they will be looked up again on the
next access. Finally aufs will know the upper parent_dir1 is not
covering the lower parent_dir2 anymore.
This notification is the main purpose of the strict option which is
	:::

No, deadlock will not happen since aufs knows the new parent-child
relationship. By using inotify/hinotify in above answer, I hope you

I am afraid that I still may not understand well what you wrote.
Do you mean that upgrading a package involves updating several files whose
versions have to match each other within the package, and that
upgrading different packages directly in both the upper and lower layers
causes a mismatch among those files?

Although I don't think you are talking about an aufs utility aubrsync
which runs rsync between layers, I don't understand about "putting this
policy decision into the VFS". The simple rule "the upper file hides the
lower" is out of VFS.


J. R. Okajima
--

From: Valerie Aurora
Date: Tuesday, August 24, 2010 - 1:48 pm

No, that's not a sufficient description and leaves open questions
about all sorts of deadlocks and race conditions.  For example,
inotify events occur while holding locks only on one layer.  You
obviously need to lock the top layer to update the inheritance and
parent-child relationships.  Now you are locking the lower layer first
and the top layer second, which is the reverse of the usual order.
Also, it should not be an option.

If Al Viro says it's wrong, you need a very detailed explanation of
why it is right.  See Documentation/filesystems/directory-locking for
an example of the argument you have to make to show that moving things
around on the lower layer is safe.  In general, your first task is to
show a global lock ordering to prove lack of deadlocks (which I don't
think you should spend time on because most VFS experts think it is
impossible to do with two read-write layers).
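
As a rough illustration of what "a global lock ordering" rules out, the
Python sketch below records lock acquisition orders as edges in a graph and
checks for a cycle. It is a generic technique, not anything from the kernel
or aufs; the layer names "top" and "lower" and the has_cycle() helper are
hypothetical.

```python
from collections import defaultdict


def has_cycle(orderings):
    """orderings: iterable of (first, second) lock acquisition pairs."""
    graph = defaultdict(set)
    for first, second in orderings:
        graph[first].add(second)

    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)            # every node starts WHITE

    def visit(node):
        # Depth-first search; reaching a GREY node means a back edge.
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))


usual_order = [("top", "lower")]                    # top layer first
with_notify = usual_order + [("lower", "top")]      # event path locks lower first

usual_ok = not has_cycle(usual_order)
notify_conflict = has_cycle(with_notify)
print(usual_ok, notify_conflict)
```

A cycle in this graph means two code paths can take the same pair of locks
in opposite orders, which is the classic precondition for a deadlock.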

I'm not going to explain any more how aufs is wrong; it's the
maintainer's job to convince Al Viro and other maintainers that aufs
is right.  But I hope this gave you a start and showed why union
mounts is a preferred approach for many people.

Thanks,

-VAL
--

From: Christian Stroetmann
Date: Tuesday, August 24, 2010 - 7:59 pm

Aloha Everybody;



This all reminds me of the 5/dining philosophers problem and its 
solutions, especially the waiter and the resource hierarchy solutions 
(see [1]).
And I do think that such problems can always be solved in a real world 

[1] http://en.wikipedia.org/wiki/Dining_philosophers_problem

Have fun
Christian
--

From: J. R. Okajima
Date: Tuesday, August 24, 2010 - 10:03 pm

I don't agree about deadlock and race condition.
When a user modifies the dir hierarchy on the layer directly while
aufs_rename() is running, aufs will detect it after lock_rename().
It behaves like this.
- decide the layer where the actual rename operates. create the dir
  hierarchy on it if necessary.
- lock_rename() for the layer
- call ->rename()
or
- if the file being renamed exists on the lower readonly layer, aufs will
  copy it up to the upper writable layer under the rename target name.
  In this case, ->rename() is not called.

If a user changes the dir hierarchy directly on the layer before
aufs_rename(), then the notify event tells aufs about it and aufs gets the
latest hierarchy.

If it happens before lock_rename() in aufs_rename(), aufs verifies the
relationship between the target child and the locked dir. If it differs,
it returns EBUSY. Of course, lock_rename() follows the "ancestors first"

Since you may not read this anymore and other people don't seem to
be interested in aufs, it may not be meaningful to write down the
locking in aufs. But I will try.

First,
- since aufs is a FS, it has its own super_block, dentry and inode.
- super_block, dentry and inode in aufs have private data which contains
  rwsem.
- the locking order for these rwsem is child-first.
- aufs specifies FS_RENAME_DOES_D_MOVE.

locking order in aufs_rename
+ down_read() for aufs sb
  protects sb from branch-add, delete.
+ two down_write()s for src and dest child
  protects them from other processes in aufs.
+ down_write() for the dst_parent.
+ decide the layer where we will operate, by comparing the index of
  layers where the targets exist and the layer attribute (ro, rw).
+ copyup the dest dir hierarchy if necessary, by repeating
  - dget_parent(), down/up_read() for the parent (in aufs)
  - mutex_lock() for the dir (on the layer) to mkdir the non-existing
    child dir on the layer and verify the parent-child relationship.
  - mkdir and setattr on the layer.
  - mutex_unlock() the dir on ...