Introduce white-out support to ext2.
Known Bugs:
- Needs a reserved inode number for white-outs
- S_OPAQUE isn't persistently stored
Signed-off-by: Jan Blunck <j.blunck@tu-harburg.de>
---
fs/ext2/dir.c | 2 ++
fs/ext2/namei.c | 18 ++++++++++++++++++
fs/ext2/super.c | 5 ++++-
include/linux/ext2_fs.h | 4 ++++
4 files changed, 28 insertions(+), 1 deletion(-)
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -230,6 +230,7 @@ static unsigned char ext2_filetype_table
 	[EXT2_FT_FIFO] = DT_FIFO,
 	[EXT2_FT_SOCK] = DT_SOCK,
 	[EXT2_FT_SYMLINK] = DT_LNK,
+	[EXT2_FT_WHT] = DT_WHT,
 };
 
 #define S_SHIFT 12
@@ -241,6 +242,7 @@ static unsigned char ext2_type_by_mode[S
 	[S_IFIFO >> S_SHIFT] = EXT2_FT_FIFO,
 	[S_IFSOCK >> S_SHIFT] = EXT2_FT_SOCK,
 	[S_IFLNK >> S_SHIFT] = EXT2_FT_SYMLINK,
+	[S_IFWHT >> S_SHIFT] = EXT2_FT_WHT,
 };
 
 static inline void ext2_set_de_type(ext2_dirent *de, struct inode *inode)
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -288,6 +288,23 @@ static int ext2_rmdir (struct inode * di
 	return err;
 }
 
+static int ext2_whiteout(struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode;
+	int err;
+
+	inode = ext2_new_inode (dir, S_IFWHT | S_IRUGO);
+	err = PTR_ERR(inode);
+	if (IS_ERR(inode))
+		goto out;
+
+	init_special_inode(inode, inode->i_mode, 0);
+	mark_inode_dirty(inode);
+	err = ext2_add_nondir(dentry, inode);
+out:
+	return err;
+}
+
 static int ext2_rename (struct inode * old_dir, struct dentry * old_dentry,
 	struct inode * new_dir, struct dentry * new_dentry )
 {
@@ -382,6 +399,7 @@ const struct inode_operations ext2_dir_i
 	.mkdir = ext2_mkdir,
 	.rmdir = ext2_rmdir,
 	.mknod = ext2_mknod,
+	.whiteout = ext2_whiteout,
 	.rename = ext2_rename,
 #ifdef CONFIG_EXT2_FS_XATTR
 	.setxattr = generic_setxattr,
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -752,6 +752,9 @@ static int ext2_fill_super(struct super_
 	ext2_xip_verify_sb(sb); /* s...

I think storing whiteouts on the branches is wrong. It creates all sorts of nasty cases when people actually try to use unioning. Imagine a (not-so-unlikely) scenario where you have 2 unions, and they share a branch. If you create a whiteout in one union on that shared branch, the whiteout magically affects the other union as well! Whiteouts are a union-level construct, and therefore storing them at the branch level is wrong.

If you store whiteouts on the branches, you'll probably want readdir to not include them. That's relatively cheap if you have a whiteout bit in the inode, but I don't think filesystems should be forced to use up rather precious inode bits for whiteouts/opaqueness [1].

Really, the only sane way of keeping track of whiteouts seems to be some external store. We did an experiment with Unionfs, and moving the whiteout handling to effectively a "library" that did all the dirty work cleaned up the code.

Out of curiosity, how do you keep track of opaqueness while the fs is mounted?

Josef 'Jeff' Sipek.

[1] http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg02904.html
[2] http://www.filesystems.org/unionfs-odf.txt
[3] http://download.filesystems.org/unionfs/unionfs-2.0-odf/linux-2.6.20-rc6-odf1.diff.gz
--
UNIX is user-friendly ... it's just selective about who its friends are
-
Inode GID bits - are you reducing my 32 bits of gid_t to 31 bits? That does not work out either.

Jan
-
No. The ODF code just uses the GID bits to store extra info. The GID is _NOT_ used to store the GID of the file. The GID of the file still comes from the branches.

Josef 'Jeff' Sipek.
--
I abhor a system designed for the "user", if that word is a coded pejorative meaning "stupid and unsophisticated." - Ken Thompson
-
What about keeping track of whiteouts in a special file (or files) in the top level filesystem of the union? For instance, having a /.whiteouts file at the root of the top FS in the stack, instead of storing union-specific data in the flags / inode numbers of the lower levels. This file could also e.g. store the UUID of the lower level FS (if appropriate) so that in subsequent mounts (which might attempt a union with a different lower level branch) you can tell if the whiteouts have meaning. The whiteout history could be flushed by directly mounting the FS and doing rm .whiteouts.

This might avoid requiring a store external to the stack of filesystems, and I believe it would solve the problem with shared branches and arbitrary stacking that you described?

I guess a rather similar effect could be had by somehow storing loopback mountable ODF filesystems in the top layer of a union somewhere (e.g. with the default path /.odf) and allowing the user to specify an alternate location at mount time if necessary. So maybe these approaches are quite similar after all...

Cheers,
Mark
--
Dave: Just a question. What use is a unicycle with no seat? And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!
-
What is needed is a "filesystem" that has all the directory bits only. For ODF, we opted to "abuse" existing filesystems to see if it actually helped Unionfs, and I think it did help. Really, what we (unionfs) need now is a cleanup of the ODF code, with a somewhat better-defined interface.

Very :)

We forced the user to mount the fs in the odf loopback manually, but there's no reason why we couldn't do an in-kernel mount at unionfs mount time.

Josef 'Jeff' Sipek.
--
Once you have their hardware. Never give it back.
(The First Rule of Hardware Acquisition)
-
So you think that just because you mounted the filesystem somewhere else it should look different? This is what sharing is all about. If you share a

Haven't checked if you could use ODF for a generic store for filesystems that

It's an inode flag (S_OPAQUE).
-
No. At least I don't.

Usage case: I heavily depend on using union mounts in diskless nfs setups, since it drops the amount of administration of many systems to _near_ one. It boils down to installing the distribution of your choice in a directory, union mounting it ro, overlaid with a node-private one (doing this in initrd on the client for several reasons), adding a little boot and automatic setup machinery, and being done. Since all changes are persistent, any system can be set up individually, and still mostly only one tree needs to be kept up to date. This has been in production in an office environment for two years without major hassle (*).

This setup is likely to be useful for virtualization needs, too, but side effects via the base directory from one node to another would render this setup void.

Cheers,
Pete

*) The amount of administration work of any (necessary, unfortunately) VMware XP instance running on top of those diskless clients exceeds that of all diskless clients by an order of magnitude.
-
Hardly :) Install XP, snapshot it when done. Copy .vmdk to 'all' machines. On security upgrades, revert to snapshot (well - if the workflow allows it), install, snapshot again. Etc. Work: 1 1/2.

Jan
-
You're not sharing the rw layer so it's a different scenario, and will not have the problem I'm talking about. See my other post [1] for the exact scenario.

Unionfs is used by many people in this way.

Josef 'Jeff' Sipek.

[1] http://lkml.org/lkml/2007/7/31/365
--
Intellectuals solve problems; geniuses prevent them - Albert Einstein
-
The removal happens at the union level, not the branch level. Say you have:
/a/
/b/foo
/c/foo
And you mount /u1 as a union of {a,b}, and /u2 as union of {a,c}.
$ find /u*
/u1
/u1/foo
/u2
/u2/foo
$ rm /u1/foo # this creates whiteout for "foo" in /a
$ find /u*
/u1
/u2
Is that what you'd expect as a user? I don't think so.
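As a rough illustration of the shared-branch case above, here is a small userspace model in C. It is only a sketch: the structures and the names `union_lookup` and `union_remove` are hypothetical and are not taken from Unionfs or the union mount patches. It assumes whiteouts are stored as entries inside the top branch of the union that performed the removal.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 4
#define MAX_LAYERS  2

/* Hypothetical model: a branch is a flat list of names, some of them whiteouts. */
struct entry { const char *name; int is_whiteout; };
struct branch { struct entry entries[MAX_ENTRIES]; size_t count; };

/* A union is an ordered stack of branches, highest priority first. */
struct union_mount { struct branch *layers[MAX_LAYERS]; size_t nlayers; };

/* Returns 0 if name is visible through the union, -ENOENT otherwise. */
int union_lookup(const struct union_mount *u, const char *name)
{
	for (size_t i = 0; i < u->nlayers; i++) {
		const struct branch *b = u->layers[i];

		for (size_t j = 0; j < b->count; j++) {
			if (strcmp(b->entries[j].name, name) != 0)
				continue;
			/* A whiteout in a higher layer hides everything below it. */
			return b->entries[j].is_whiteout ? -ENOENT : 0;
		}
	}
	return -ENOENT;
}

/* "rm" through a union: record a whiteout in that union's top branch. */
void union_remove(struct union_mount *u, const char *name)
{
	struct branch *top = u->layers[0];

	assert(top->count < MAX_ENTRIES);
	top->entries[top->count].name = name;
	top->entries[top->count].is_whiteout = 1;
	top->count++;
}
```

With `u1 = {a, b}` and `u2 = {a, c}` built over the same `a`, calling `union_remove(&u1, "foo")` makes `union_lookup(&u2, "foo")` return `-ENOENT` as well, because the whiteout lives in the shared branch rather than in the union.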
Yes, since the ODF is completely separate, you can use _any_ filesystem,
regardless of whether or not it supports whiteouts.
Josef 'Jeff' Sipek.
--
Once you have their hardware. Never give it back.
(The First Rule of Hardware Acquisition)
-
Yes, although that might sound strange: you are sharing the topmost writable

Completely separate? It is totally tied to UnionFS and tries to solve purely the problems that this kind of VFS-emulating filesystem has.
-
Who does this? I'm assuming that a is the "top" layer. Aren't union mounts typically about sharing lower layers and having a separate rw

That's exactly what I would expect. If I were to:
$ echo "this is new" > /u1/foo
I would expect:
$ cat /u2/foo
this is new
So why should rm behave differently?

I haven't really been tuned into union mounts, so maybe I'm missing out on something basic here.

Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center
-
Alright, not the greatest of examples; there is something to be said about
symmetry, so... let me try again :)
/a/
/b/bar (whiteout for bar)
/c/foo/qwerty
Now, let's mount a union of {a,b,c}, and we'll see:
$ find /u
/u
/u/foo
/u/foo/qwerty
$ mv /u/foo /u/bar
Now what? How do you rename? Do you rename in the same branch (assuming it
is rw)? If you do, you'll get:
$ find /u
/u
Oops! There's a whiteout in /b that hides the directory in /c -- rename(2)
shouldn't make directory subtrees disappear.
There are two ways to solve this:
1) "cp -r" the entire subtree being renamed to highest-priority branch, and
rename there (you might have to recreate a series of directories to have a
place to "cp" to...so you got "cp -r" _AND_ "mkdir -p"-like code in the VFS!
1/2 a :) )
2) Don't store whiteouts within branches. This makes it really easy to
rename and remove the whiteout.
Sure, you could try to rename in-place and remove the whiteout, but what if
you have:
/a/
/b/bar (whiteout)
/c/bar/blah
/d/foo/qwerty
$ mv /u/foo /u/bar
You can't just remove the whiteout, because that'd uncover the whited-out
directory bar in /c.
Josef 'Jeff' Sipek.
--
Bad pun of the week: The formula 1 control computer suffered from a race
condition
-
Sorry for making uninformed guesses, but if there are already special nodes (whiteout), why not extend them to some more general format - specifying a (source, destination) pair at the topmost level?
- A delete is a (source, NULL) pair
- A rename is a (source, destination) pair, which causes lookups on source to use the string destination in the lower branches.
Would that work?

Regards,
Phil
-
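The (source, destination) idea above can be sketched as a small lookup table kept at the topmost level. This is a hypothetical userspace model in C, not code from any of the posted patches; `struct redirect` and `redirect_resolve` are names invented for the illustration, and it follows the wording of the proposal literally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One record at the topmost level: destination == NULL means "deleted". */
struct redirect { const char *source; const char *destination; };

/*
 * Returns the name to use in the lower branches for a lookup of `name`:
 *   - NULL if the table records a delete (a plain whiteout),
 *   - the destination string if the table records a rename,
 *   - `name` itself if the table has no record for it.
 */
const char *redirect_resolve(const struct redirect *table, size_t n,
			     const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(table[i].source, name) == 0)
			return table[i].destination;
	}
	return name;
}
```

A table holding `{"old.txt", NULL}` and `{"foo", "bar"}` makes a lookup of `old.txt` fail, sends a lookup of `foo` to `bar` in the lower branches, and leaves every other name untouched.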
Originally I had the idea that whiteouts are a special kind of symlink. After discussing that with various people, I stuck to the simplest approach.
-
Er, no. According to Documentation/filesystems/union-mounts.txt, "only
--
David Kleikamp
IBM Linux Technology Center
-
This brings up a very interesting (but painful) question... which makes more sense? Allowing modifications in only the top-most branch, or in any branch (given the user allows it at mount-time)?

Right. Doing something like this at the filesystem level (as we do in unionfs) seems less painful - filesystems are places full of all sorts of nefarious activities to begin with. Having it in the VFS seems... even uglier.

Josef 'Jeff' Sipek.
--
*NOTE: This message is ROT-13 encrypted twice for extra protection*
-
Only write to the top-most layer. There are two reasons for this.

First, it allows users to create a union mount, test something (e.g. update the distribution) and remove every trace of the test by unmounting the top-most layer. Such a thing can be quite valuable.

The second reason is simplicity. I personally couldn't even start to describe the semantics. If the user does a rename, which layer will the change end up in? What if source or target exist in multiple layers? How to rename a directory in a lower layer containing a new file in an upper layer? Finding new and interesting corner cases for such a beast can be quite entertaining. And until someone has properly documented the semantics for _all_ the corner cases, my enthusiasm is below freezing point.

Does such documentation exist?

Jörn
--
A surrounded army must be given a way out.
-- Sun Tzu
-
Josef did specifically state that modification to the lower layers would

I think that if someone can come up with consistent (and useful) semantics for a mount option that allows modifications to other layers as well, it would be a useful additional feature to support. It seems that it should be possible to add this feature at a later time in any case. Perhaps referring to the plan9 semantics could be helpful.
--
Jeremy Maitin-Shepard
-
My implementation is keeping things simple for a reason. There have been many attempts to get unioning working on the filesystem layer. Most of them failed because of complexity. E.g. BSD threw away all of the filesystem stacking support after they tried to fix unionfs for years. Writing to lower

The userspace is doing it since I return -EXDEV. And that even comes for free. I don't need to hack around and call back into VFS as you do. It is so simple and straightforward in the VFS.
-
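To make the -EXDEV point concrete: when rename(2) fails with EXDEV, a userspace tool such as mv falls back to copying the source and removing the original. The following is a minimal sketch of that fallback for a single regular file, assuming POSIX; `copy_file` and `move_file` are hypothetical helper names, and a real mv also handles directories, permissions, timestamps and partial-failure cleanup.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Copy a regular file byte for byte; returns 0 on success, -1 on error. */
int copy_file(const char *src, const char *dst)
{
	char buf[4096];
	ssize_t n = 0;
	int rc = -1;

	assert(src != NULL && dst != NULL);
	int in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (out < 0) {
		close(in);
		return -1;
	}
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		ssize_t off = 0;

		/* write(2) may be short, so loop until the chunk is out. */
		while (off < n) {
			ssize_t w = write(out, buf + off, (size_t)(n - off));
			if (w < 0)
				goto done;
			off += w;
		}
	}
	if (n == 0)
		rc = 0;		/* clean end of file */
done:
	close(in);
	if (close(out) != 0)
		rc = -1;
	return rc;
}

/* Try rename(2) first; on EXDEV fall back to copy + unlink. */
int move_file(const char *src, const char *dst)
{
	if (rename(src, dst) == 0)
		return 0;
	if (errno != EXDEV)
		return -1;
	if (copy_file(src, dst) != 0)
		return -1;
	return unlink(src);
}
```

The kernel side only has to refuse the cross-layer rename; the copy happens entirely in the caller, which is why this behaviour "comes for free" from the VFS point of view.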
Your examples point out the complexity of trying to allow modifications at lower levels. It seems to me to be simpler (even if recursive copies

I haven't looked at either implementation closely enough to offer an opinion here that I would be able to defend. I'm sure others have their

Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center
-
[...]
There are three other reasons why Unionfs and our users like to have
multiple writable branches:
1. If only the topmost layer is writable, then every little change tends to
cause a copyup, which tends to clutter the top layer more quickly. Some
of our users didn't like that idea, while others explicitly wanted it --
so we let them decide, on a per-layer/branch basis, whether it
should be writable or readonly.
2. Some users unify different packages together. Imagine you union under
/union, several installed packages: /X11R6/{bin,man,lib,conf},
/apache/{bin,man,lib,etc}, and /mysql/{bin,man,lib,etc}, and so on. If a
user modifies /union/apache/etc/apache.conf, they sometimes want
apache.conf to remain in the writable branch it came from, not copied up.
That way all apache related files are logically left where they came
from, which makes administration easier. Again, some users like to have
multiple writable branches, and some don't -- so in Unionfs we give them
the choice. And yes, it does make our implementation more complex.
3. Some people use Unionfs in the scenario described in point #2 above, as a
poor man's space- and load- distribution system. Some of our users like
the idea of controlling how much storage space they give each branch, and
how much it might grow, and even how much CPU or I/O load might be placed
on each of the lower filesystems which serve a given branch. That way
they worry less about the top-layer's space filling up more quickly than
expected. Now Unionfs was never designed to be a load-balancing f/s (we
have RAIF for that, see <http://www.filesystems.org/project-raif.html>),
but users always seem to find creative ways to [ab]use one's software in
ways one never thought of. :-)
BTW, does Union Mounts copyup on meta-data changes (e.g., chmod, chgrp,
etc.)?
Erez.
-
And error-prone and inflexible wrt changes. When XIP was introduced, unionfs crashed all over these changes. I don't know if this has changed yet. Not speaking of other issues like calling back into VFS (stack usage),

No. But it was proposed during one of the last postings.
-
You picked different reserved inodes for the ext2 and ext3 filesystems. That's good for a NACK right there. The codepoints (i.e., reserved inode numbers, feature bit masks, etc.) for ext2, ext3, and ext4 MUST NOT overlap. After all, someone might use tune2fs -j to convert an ext2 filesystem to ext3, and it's REALLY BAD that you're using a reserved inode of 7 for ext2, and 9 for ext3.

Also, I note that you have created a new INCOMPAT feature flag for whiteout support. That's really unfortunate; we try to avoid introducing incompatible feature flags unless absolutely necessary; note that even adding a COMPAT feature flag means that you need a new version of e2fsprogs if you want e2fsck to be willing to touch that filesystem.

So --- if you're looking for a way to add whiteout support to ext2/ext3 without needing a feature bit, here's how.

We allocate a new inode flag in struct ext3_inode.i_flags:

#define EXT2_WHTOUT_FL 0x00040000

We also allocate a new field in the ext2 superblock to store the "whiteout inode". (Please coordinate with me so it's a superblock field not in use by ext3/ext4, and so it's reserved so that no one else uses it.) The superblock field, call it s_whtout_ino, stores the inode number for the "white out inode".

When you create a new whiteout file, the code checks sb->s_whtout_ino, and if it is zero, it allocates a new inode, creates it as a zero-length regular file (i_mode |= S_IFREG) with the EXT2_WHTOUT_FL flag set in the inode, and then stores the inode number in sb->s_whtout_ino. If sb->s_whtout_ino is non-zero, you must read in the inode and make sure that the EXT2_WHTOUT_FL is set. If it is not, then allocate a new whiteout inode as described previously. Then link the inode into the directory as before.

When reading an inode, if the EXT2_WHTOUT_FL flag is set, then set the in-memory mode of the inode to be S_IFWHT.

That's pretty much about it. For cleanliness' sake, it would be good if ext2_delete_inode clears sb-&...
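The whiteout-inode scheme described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not ext2 code: the `model_*` structures and functions are hypothetical, the inode table is a plain array, and `MODEL_S_IFWHT` borrows the BSD value of `S_IFWHT` purely for illustration. Only `EXT2_WHTOUT_FL` and the field name `s_whtout_ino` come from the message.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_WHTOUT_FL 0x00040000u /* flag value proposed in the message */
#define MODEL_S_IFMT   0170000u
#define MODEL_S_IFREG  0100000u
#define MODEL_S_IFWHT  0160000u    /* assumption: BSD's value, illustration only */
#define MODEL_NINODES  16u

struct model_inode { uint32_t i_flags, i_mode, i_links; int in_use; };
struct model_sb { uint32_t s_whtout_ino; struct model_inode inodes[MODEL_NINODES]; };

/* Allocate the first free inode (numbers start at 1); returns 0 when full. */
uint32_t model_alloc_inode(struct model_sb *sb)
{
	for (uint32_t ino = 1; ino < MODEL_NINODES; ino++) {
		if (!sb->inodes[ino].in_use) {
			sb->inodes[ino] = (struct model_inode){ .in_use = 1 };
			return ino;
		}
	}
	return 0;
}

/* Return the shared whiteout inode, creating it on first use or when the
 * inode recorded in the superblock does not carry the whiteout flag. */
uint32_t model_get_whiteout_ino(struct model_sb *sb)
{
	uint32_t ino = sb->s_whtout_ino;

	if (ino != 0 && ino < MODEL_NINODES && sb->inodes[ino].in_use &&
	    (sb->inodes[ino].i_flags & EXT2_WHTOUT_FL))
		return ino;

	ino = model_alloc_inode(sb);
	if (ino == 0)
		return 0;
	sb->inodes[ino].i_mode = MODEL_S_IFREG; /* zero-length regular file on disk */
	sb->inodes[ino].i_flags = EXT2_WHTOUT_FL;
	sb->s_whtout_ino = ino;
	return ino;
}

/* Creating a whiteout links the shared inode under one more name. */
uint32_t model_create_whiteout(struct model_sb *sb)
{
	uint32_t ino = model_get_whiteout_ino(sb);

	if (ino != 0)
		sb->inodes[ino].i_links++;
	return ino;
}

/* Mode the in-memory inode would get when the on-disk inode is read. */
uint32_t model_read_mode(const struct model_sb *sb, uint32_t ino)
{
	assert(ino > 0 && ino < MODEL_NINODES);
	const struct model_inode *di = &sb->inodes[ino];

	if (di->i_flags & EXT2_WHTOUT_FL)
		return (di->i_mode & ~MODEL_S_IFMT) | MODEL_S_IFWHT;
	return di->i_mode;
}
```

In this model every whiteout in the filesystem is a hard link to one inode, so no reserved inode number and no feature bit are needed; an old kernel simply sees a zero-length regular file.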
Ok, this is pretty similar to the way I implemented this for tmpfs. The problem is that the union mount code explicitly checks whether the filesystem supports whiteouts. I used to use a new filesystem flag (FS_WHITEOUT) for this but thought that disk filesystems like ext2/3/4 will have problems with

At the moment I still rely on this for the current readdir implementation. Viro already said that he doesn't want to see this (the readdir changes) in the kernel but in userspace.

Thanks,
Jan
-
Without the method I described to you, *any* ext2/3/4 filesystem will support whiteouts (as long as you have the support code compiled into

Life gets very messy if you have to do this in userspace. Example: statically linked programs that were compiled with a version of glibc that didn't know about whiteout records. Unfortunately, the memory needed to collate directory entries so that whiteout records can be dropped is painful enough that I completely understand why Al doesn't want to see this in userspace. Unfortunately this is going to be one of those things that will make union mounts problematic, compared to something like unionfs.

- Ted
-
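The memory cost of collating directory entries mentioned above is easy to see in a small model. The sketch below is hypothetical userspace C, not the union mount readdir code; `union_readdir` and the structures are invented for the illustration. It merges per-layer listings top-down and has to remember every name it has met, whiteouts included, to suppress lower duplicates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SEEN 64

struct dirent_model { const char *name; int is_whiteout; };
struct layer_dir { const struct dirent_model *entries; size_t count; };

/*
 * Merge directory listings, highest-priority layer first. Writes up to
 * `cap` visible names to `out` and returns how many were written.
 */
size_t union_readdir(const struct layer_dir *layers, size_t nlayers,
		     const char **out, size_t cap)
{
	const char *seen[MAX_SEEN]; /* every name met so far, whiteouts too */
	size_t nseen = 0, nout = 0;

	for (size_t i = 0; i < nlayers; i++) {
		for (size_t j = 0; j < layers[i].count; j++) {
			const struct dirent_model *de = &layers[i].entries[j];
			int dup = 0;

			for (size_t k = 0; k < nseen; k++) {
				if (strcmp(seen[k], de->name) == 0) {
					dup = 1;
					break;
				}
			}
			if (dup)
				continue; /* hidden by a higher entry or whiteout */
			assert(nseen < MAX_SEEN);
			seen[nseen++] = de->name;
			if (!de->is_whiteout && nout < cap)
				out[nout++] = de->name;
		}
	}
	return nout;
}
```

The `seen` array grows with the total number of distinct names across all layers, which is the state that somebody, kernel or libc, has to hold for the duration of a directory read.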
Well, also if root deletes something, it should be _gone_, and a user should not be able to work around that just by bringing a statically linked ls..

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
You also need whiteout support for extents. This could potentially be done with unwritten extents, or as I previously proposed (RFC) in linux-ext4.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
-
Maybe. But this is about something totally different: a whiteout filetype, an existing file that, when found, makes the VFS return -ENOENT.

Cheers,
Jan
-
