Re: [PATCH 17/39] union-mount: Union mounts documentation

Previous thread: [PATCH 22/39] union-mount: Support for mounting union mount file systems by Valerie Aurora on Monday, May 3, 2010 - 4:12 pm. (1 message)

Next thread: [PATCH 25/39] VFS: Split inode_permission() and create path_permission() by Valerie Aurora on Monday, May 3, 2010 - 4:12 pm. (1 message)
From: Valerie Aurora
Date: Monday, May 3, 2010 - 4:12 pm

Document design and implementation of union mounts (a.k.a. writable
overlays).
---
 Documentation/filesystems/union-mounts.txt |  899 ++++++++++++++++++++++++++++
 1 files changed, 899 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..ba830e8
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,899 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one read-only
+file system, with all writes going to the writable file system.  The
+namespace of both file systems appears as a combined whole to
+userland, with files and directories on the writable file system
+covering up any files or directories with matching pathnames on the
+read-only file system.  The read-write file system is the "topmost"
+or "upper" file system and the read-only file system is the "lower"
+file system.  A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts).  However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level.  COW block devices only
+increase their divergence as time goes on, and a fully coherent
+writable ...
From: Valdis.Kletnieks
Date: Monday, May 3, 2010 - 6:54 pm

I spent some time looking at patch 27 trying to figure it out for myself,
but my lack of splice()-fu doomed me. :)

A few quick questions:

1) For calls like chmod() that only touch the metadata, does it still
trigger a copyup of the data, or just the affected metadata?

2) Is the copyup of data synchronous, or is it done asynchronously in
the background?  The comments in union_copyup_len() about "We raced with
someone else" imply this is synchronous - if so, a note should probably
be made that an open() may take a little while under some conditions.
There's a *lot* of code out there that assumes that open() calls are
*really* cheap.

I wonder how many programs don't correctly deal with an ENOSPC on open() of
an already existing file.

(The answers probably don't matter unless somebody ends up invoking a
copyup of a gigabyte file, which of course implies one of my users will end up
doing exactly that. :)
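
As a hedged illustration of the ENOSPC concern above, the sketch below shows
one way a program could treat ENOSPC from open() on an already existing file
as a distinct, reportable condition. The helper name
open_existing_for_write() is hypothetical and not part of the patch set. It
uses only the standard POSIX open() call, and it assumes (as the question
above suggests, without confirmation) that a synchronous copyup may fail with
ENOSPC when the upper file system is full.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: open an existing file for writing.
 * Returns a file descriptor, or -1 with errno preserved.
 * ENOSPC is reported separately because, on a union mount,
 * the open() itself might have to copy the file up first
 * (an assumption based on the discussion, not on tested behavior).
 */
int open_existing_for_write(const char *path)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0) {
        int saved = errno;  /* fprintf() may clobber errno */

        if (saved == ENOSPC)
            fprintf(stderr, "%s: no space left (possibly during copyup)\n",
                    path);
        else
            fprintf(stderr, "%s: %s\n", path, strerror(saved));
        errno = saved;
    }
    return fd;
}
```

A caller can then check errno after a -1 return and decide whether to retry,
free space, or give up, instead of assuming open() on an existing file only
fails with ENOENT or EACCES.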
From: Valerie Aurora
Date: Wednesday, May 5, 2010 - 6:06 am

Yes, it copies up the whole file.  Right now there's no concept of

It's synchronous.  Code that assumes open() calls are cheap will have

I'm not too worried about that - how many programs deal correctly with
ENOSPC when it is normally returned?

-VAL

From: Jamie Lokier
Date: Tuesday, May 4, 2010 - 2:12 pm

Can copyup be interrupted?  E.g. if I chmod an 80GB file, will the

Does it apply the same permission checks that a program doing
copy+rename would have to pass?  I guess that is just write access to
the directory.

Does it effectively "rename" all hard links referring to the file, to
point to the new version, or does it only affect the path that was
used by the writer/modifier, leaving the other links continue to refer

Why is O_DIRECT relevant?  O_DIRECT doesn't imply writing, and
copy+rename behaviour is the same with O_DIRECT as not.

Some programs use O_DIRECT to read very large files, without intending
they will ever be modified.  For example, qemu using O_DIRECT to

I'm finding it hard to imagine _guaranteeing_ really read-only.  All
you can guarantee is that the NFS server says it is read-only.

For example, a userspace NFS server cannot prevent the filesystem it's
serving from changing.



I can imagine some database-like programs getting confused by that.

Maybe it would be better to fail copyup operations when the file is
currently open O_RDONLY by anyone, analogous to the way writable
mounts are refused when any union holds it read-only?

Are there uses likely to be broken by that behaviour?

Thanks,
-- Jamie

From: Valerie Aurora
Date: Wednesday, May 5, 2010 - 6:19 am

The right behavior is that you should be able to control-C it, but I
doubt that currently works.  Let me look into testing and implementing


In order to update all the hard links to a file, we would have to walk
the entire file system searching for links with a matching inode
number and copy them up too.  We're never going to do a
file-system-wide walk, so we won't do that.  The other hard links
still point to the old copy of the file.  We hope applications don't



Each file system that wants to support union mounts will need to
implement the features necessary for that layer (hard read-only for


That's an interesting question.  In general, this seems like a bad
idea - any process can prevent another process from writing to a file
by opening it.  This is like chmod'ing it to 444.

-VAL
