Inspired by a discussion with Christoph Hellwig, I tried to
recreate a patch that he did a few years ago to add support
for writing to a mounted cramfs file system. It still has
known problems (and likely unknown ones), but should be
good enough for practical use. I've been able to boot
a full Ubuntu installation from a cramfs image and work with
it normally.

The intention is to use it for instance on read-only root
file systems like CD-ROM, or on compressed initrd images.
In either case, no data is written back to the medium, but
remains in the page/inode/dentry cache, like ramfs does.

Many existing systems currently use unionfs or aufs for this
purpose, by overlaying a tmpfs over a read-only file
system like cramfs, squashfs or iso9660. IMHO, it would
be a much nicer solution to not require unionfs for a simple
case like this, but rather have support for it in the file
system. If people find this useful, we can do the same in
other read-only file systems.

Writing to existing files is broken in at least two corner
cases, and I'm still looking for a solution here:

* When you truncate an on-disk file to make it larger, reading
  beyond the old end of the file will make cramfs try to
  read from disk instead of filling with zeroes. I'm not sure
  if this can be solved without adding additional members to
  the inode structure (using a private inode cache) to remember
  the end of the on-disk file.
* Deleting a preexisting file currently does not free the inode
  and page cache for that file, which I assume is easy to fix.

Also, the i_nlink field of directories is always 1, and
has always been on cramfs. Getting the count right should
simplify the code a bit and make it more correct according
to posix, but will cost a bit of performance on 'stat'.

The patch series also lives on
git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground.git cramfs

Comments?

Arnd <><
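
The truncate corner case described above comes down to remembering where
the on-disk data ends, separately from the current file size. The following
is only a user-space sketch of that idea in C, not code from the patch
series: struct cow_file, cow_truncate() and cow_read() are hypothetical
names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a file backed by a read-only image.
 * disk_size is the end of the data stored in the image,
 * cur_size is the file size after any truncate() calls. */
struct cow_file {
	const unsigned char *disk_data;	/* contents in the image */
	size_t disk_size;		/* end of the on-disk file */
	size_t cur_size;		/* current file size */
};

/* Change the file size. Shrinking below disk_size forgets the
 * on-disk tail for good, so growing again exposes only zeroes. */
static void cow_truncate(struct cow_file *f, size_t new_size)
{
	if (new_size < f->disk_size)
		f->disk_size = new_size;
	f->cur_size = new_size;
}

/* Read up to len bytes at offset off into buf. Bytes below
 * disk_size come from the image, bytes between disk_size and
 * cur_size read as zeroes. Returns the number of bytes read. */
static size_t cow_read(const struct cow_file *f, size_t off,
		       unsigned char *buf, size_t len)
{
	size_t from_disk = 0;

	assert(f && buf);
	if (off >= f->cur_size)
		return 0;
	if (len > f->cur_size - off)
		len = f->cur_size - off;

	if (off < f->disk_size) {
		from_disk = f->disk_size - off;
		if (from_disk > len)
			from_disk = len;
		memcpy(buf, f->disk_data + off, from_disk);
	}
	/* Everything past the old end of the file is a hole. */
	memset(buf + from_disk, 0, len - from_disk);
	return len;
}
```

The extra disk_size member is the "additional member" the text worries
about; whether it can live in the inode without a private inode cache is
exactly the open question.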
--
I think it's a good idea, and I have been thinking about adding
Patch 2 ([RFC 2/7] cramfs: create unique inode numbers) changes the
inode number to be based on the dentry location rather than the file
location. This is a user-visible change, not only do empty directories,
char, block, pipe, and sockets get real inode numbers rather than 1 (a
good thing IMHO), but files that were hard-linked (in the original
source directory) now get different inode numbers. Obviously cramfs has
never properly supported hard links, but the duplicate file check in
cramfs did ensure hard linked files got the same inode number.

This change in behaviour may break some existing users of cramfs
filesystems. It may be worth sending the RFC and patches etc. to the
new linux-embedded mailing list to get some feedback from the embedded
folks who use cramfs.

Phillip
--
I don't agree that it is nicer to do this in cramfs. I prefer the
technique of union of a tmpfs over some other fs because a single
solution that works with all filesystems is better than re-implementing
the same idea in multiple filesystems. Multiple implementations is a
recipe for bugs and feature mismatch.
--
You're right in principle, but unfortunately there is to date no working
implementation of union mounts. Giving users the option of using an
existing file system with a few tweaks can only be better than
forcing them to use hacks like unionfs.

Arnd <><
--
I've not used unionfs (nor aufs) so I'm not aware of its foibles, but I
can say that it's the right kind of solution. Rather than spend effort
implementing write support for read-only filesystems, why not put your
time into fixing whatever you see wrong with one or both of those?
--
There is a strong argument to be made for fixing some problem once
instead of N times. But when that solution is M times more complicated,
with M being significantly larger than N, said argument becomes rather
weak.

And having looked at unionfs, I claim that your argument is paper-thin.

Jörn
--
/* Keep these two variables together */
int bar;
--
I have to join in. Unionfs and AUFS may be bigger in bytes than the
embedded developer wants to sacrifice, but that is what it takes for
a solid implementation that has to deal with things like NFS and
mmap. Even so, there is a fs called mini_fo you can try using if
you disagree with the size of unionfs/aufs, at the cost of not having
support for all corner cases.
--
I tend to agree with Arnd Bergmann. While I prefer the aesthetic
cleanliness of stackable filesystems, the lack of proper stacking
support in the Linux VFS makes other techniques necessary. Unionfs is
complex, and for many embedded systems with constrained resources Unionfs
adds a lot of extra overhead.

If I read the patches correctly, when a file page is written to, only
that page gets copied into the page cache and locked, the other pages
continue to be read off disk from cramfs? With Unionfs a page write
causes the entire file to be copied up to the r/w tmpfs and locked into
the page cache, causing unnecessary RAM overhead.

Phillip
--
Ok, so why not fix that in unionfs? An option so that holes in the
overlay file let through data from the underlying file sounds like it
would be generally useful, and quite easy to implement.

If not unionfs, a "union-tmpfs" combination would be good. Many
filesystems aren't well suited to being the overlay filesystem -
adding to the implementation's complexity - but a modified tmpfs could
be very well suited.

-- Jamie
--
I can imagine a lot of unexpected effects with that. Think of e.g.
someone replacing the underlying file with a new one. Then enlarge
the file using truncate() and read from it -- suddenly you see
the old contents instead of zeroes. Probably fixable as well, but
certainly not in a nice way.

Besides, there are many more problems with unionfs, which have
all been mentioned in the previous review cycles. Aufs doesn't
address those either AFAIK, with the exception of at least
not making additional copies in the page cache when writing to
a file.

The real solution of course is VFS based union mounts (think
'mount --union -t tmpfs none /'), but the patches for that

Yes, that is similar to one of my earlier ideas as well. Christoph
managed to convince me that it's not as easy as I thought, though
I can't remember the exact arguments any more. I'll try to think
about that some more.

One of the problems is certainly the complexity involved in tmpfs
to start with, which is the reason I based the code on ramfs instead.

Arnd <><
--
Hello Arnd,
While I don't have particular objection to your idea and approach to
cramfs, I'd point out that modern LiveCDs tend to save their
modifications to disk.
And AUFS did address all known problems. If anything is left, please
let me know.

Junjiro Okajima
--
Sure, and I wasn't trying to address those of course. I have a rather
specific setup in mind myself, and I figured the same would be useful
for others as well, while we are waiting for a generic union mount

Ok, I'm sorry for not having looked at it myself. I saw an older version
and assumed it was not going to improve much. I'll have another look
when I find the time. Unionfs was suffering from severe feature creep
(multiple writable branches, runtime branch modification), and aufs
seemed to add even more features instead of removing them.

Without reading either again, the top problems in unionfs at the time were:
* data inconsistency problems when simultaneously accessing the underlying
  fs and the union.
* duplication of dentry and inode data structures in the union wastes
  memory and cpu cycles.
* whiteouts are in the same namespace as regular files, so conflicts are
  possible.
* mounting a large number of aufs on top of each other eventually
  overflows the kernel stack, e.g. in readdir.
* allowing multiple writable branches (instead of just stacking
  one rw copy on a number of ro file systems) is confusing to the user
  and complicates the implementation a lot.

With the exception of the last two, I assumed that these were all
unfixable with a file system based approach (including the hypothetical
union-tmpfs). If you have addressed them, how?

Arnd <><
--
Re: feature creep. Unionfs had more features initially, but we removed
those that users didn't seem to want/use. The bottom line, we've been
maintaining unionfs publicly for 5+ years now, so the set of features we
have is based exactly on what users want. If anyone can give the users what
they want/need in a different, more elegant way, that's great; if not, users

That's not an issue when using vm_ops->fault for data.
There is still an issue wrt dentries and topology changes, as Al mentioned
here before. Al suggested to me (at OLS 08) that the superblock struct
might need the same writers-count as has been done for vfsmounts recently;
then you can prevent topology changes during union'ed operations

Yes, but I don't think it's much more than any other layered solution will
have (including ecryptfs and union mounts). This is a general problem in
stackable file systems. Union Mounts, being in the VFS, has the chance to

Agreed. We have a different version of unionfs, called unionfs-odf, which
moves the whiteouts and all unioning-related meta-data to a separate, small
persistent partition.

But the better long-term solution is to get WH support in every native f/s.
These patches have been floating around for a while now, and they seem simple
enough that I don't see why it has taken so long to get basic WH support
into mainline (or at least -mm). (Bharata, can you ask akpm to add just the

Yes. That's a general problem with stackable file systems. Each layer you
add increases the depth of the stack. There are already known paths
(involving xfs/nfs/dm, etc.) which overrun the stack and the solution I've
heard was "don't do it." That seems silly to me. Instead, the kernel stack
should be growable dynamically, at the cost of performance.

However, the vast majority of unioning users use just one layer, so even for
us, blowing up the stack has been a rather rare user complaint. And we've
been very mindful of stack usage (i.e., checking and optimizing based on

I...
I will try to explain each point individually.
Here is what I implemented in AUFS.

Aufs has three levels of detecting direct access to the lower
(branch) filesystems (i.e. bypassing aufs). I guess the most strict level
is a good answer to your question. It is based on the inotify
feature. Aufs sets an inotify watch on every accessed directory on the
lower fs. While those inodes are cached, aufs receives the inotify events
for their children/files and marks the aufs data for the file is
obsoleted. When the file is accessed later, aufs retrieves the latest
inode (or dentry) again.
The inotify watch will be removed when the aufs dir inode is discarded

Aufs has its own dentry and inode objects, as a normal fs has. And they have
pointers to the corresponding ones on the lower fs. If you make a union
from two real filesystems, then an aufs inode will have (at most) two
pointers as its private data.

Yes, that's right.
Aufs reserves ".wh." as a whiteout prefix, and prohibits users from handling
such filenames inside aufs. It might be a problem as you wrote, but users
can create/remove them directly on the lower fs and I have never

Aufs readdir operation consumes memory, but it is not stack. If it was
implemented as a recursive function, it might cause a stack
overflow. But actually it is a loop.
The memory is used for storing entry names and eliminating whited-out
ones, and the result will be cached for a specified time. So the memory

Probably you are right. Initially aufs had only one policy to select the
writable branch. But several users requested other policies such as
round-robin or most-free-space, and aufs has implemented them.
I don't think users will be confused by these policies. While I tried to
keep it simple, I guess some people will say it is complex.

Junjiro Okajima
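
For readers unfamiliar with the two branch-selection policies named above,
here is a small C sketch of what "most-free-space" and "round-robin" could
mean. It is not taken from aufs: struct branch, pick_most_free() and
pick_round_robin() are made-up names, and the free space figures are assumed
to be supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one writable branch. */
struct branch {
	const char *path;		/* where the branch lives */
	unsigned long long free_bytes;	/* free space, caller supplied */
};

/* Most-free-space: pick the branch with the largest free_bytes.
 * Ties go to the lowest index, i.e. the first branch listed. */
static size_t pick_most_free(const struct branch *b, size_t n)
{
	size_t best = 0;

	assert(b && n > 0);
	for (size_t i = 1; i < n; i++)
		if (b[i].free_bytes > b[best].free_bytes)
			best = i;
	return best;
}

/* Round-robin: hand out branches in turn. *cursor holds the
 * index to use next and is advanced on every call. */
static size_t pick_round_robin(size_t n, size_t *cursor)
{
	size_t pick;

	assert(n > 0 && cursor);
	pick = *cursor % n;
	*cursor = (pick + 1) % n;
	return pick;
}
```

With round-robin the branch a new file lands on depends on how many files
were created before it, which is the kind of behaviour the thread goes on
to argue about.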
--
This is a very complicated approach, and I'm not sure if it even addresses
the case where you have a shared mmap on both files. With VFS based union
mounts, they share one inode, so you don't need to use idiotify in the first

I mean having your own dentry and inode object is duplication. The
underlying file system already has them, so if you have your own,
you need to keep them synchronized. I guess that in order to do
a lookup on a file, you need the steps of

1. lookup in aufs dentry cache -> fail
2. lookup in underlying dentry cache -> fail
3. try to read dentry from disk -> fail
4. repeat 2-3 until found, or arrive at lowest level
5. create an inode in memory for the lower file system
6. create dentry in memory on lower file system, pointing
   to that
7. create an aufs specific inode pointing to the underlying
   inode
8. create an aufs specific dentry object to point to that
9. create a struct inode representing the aufs inode
10. create another VFS dentry to point to that

when you really should just return the dentry found by the

It's not so much a practical limitation as an exploitable feature.
E.g. an unprivileged user may use this to get an application into
an error condition by asking for an invalid file name.

Posix reserves a well-defined set of invalid file names, and
deviation from this means that you are not compliant, and that

How does aufs know that one of its branches is an aufs itself?
If you detect this, do you fold it into a single aufs instance with
more branches?
In case you don't do it, I don't see how you get around the stack
overflow, but if you do it, you have again added a whole lot of

I personally think that a policy other than writing to the top is crazy
enough, but randomly writing to multiple places is much worse, as it
becomes unpredictable what the file system does, not just unexpected.

Arnd <><
--
Hi Arnd.
Inotify has nothing in common with that - it notifies about inode updates,
which is the only thing needed for unionfs. VM and aufs vmops will take care of

Or it is a feature, and you should not return the dentry for the lower file
system, when you can have different objects pointing to the

Hmm... I believe if an exploit wants to do bad things and the system prevents
it, it is actually the right decision? But since you asked, I'm not sure

Everything has its own limitations. 256 bytes per name is a much stronger
problem, but everyone works with that.

Is this a double rot13 encoded "people will never use computers with
more than 640 kb of ram" phrase? :)

While working VFS union mounting does not exist, AUFS does work.
It is just another filesystem, which works and has a big userbase. Any VFS
approach (when implemented) will work on its own and its implementation
does not depend on this particular fs.

--
Evgeniy Polyakov
--
No, it's more the "people don't need variable block size drives" argument.
They've been working fine for decades on mainframes, are incredibly
complicated to build and entirely pointless in practice ;-)

Arnd <><
--
As you might know, aufs doesn't have its own file mapped pages. Aufs
overrides vm_operations and redirects the page fault to the lower file's
vm_operation. So the shared mmap has no problem.
I am afraid that I should have written "marks the attributes in aufs is
obsoleted" instead of "marks the aufs data for the file is obsoleted" in

I see.
Then the solution must be union-mount.
Your 10 steps seem to be rather verbose. Generally, 'lookup' means to
create (or get) inode and dentry, and the fs inode and VFS inode are
allocated at the same time.
Aufs does 'lookup' for the lower dentry (yes, it must be repeated if

- To detect the filesystem type is easy. Aufs can know whether the
  branch is aufs or not by checking s_magic or s_type->name.
- aufs doesn't fold? expand? the nested aufs branch.

You might be pointing out a general matter of stacking filesystems.
When one of the branches is a stacking fs, and it is nested deeper and
deeper,
- /aufs1 = /rw1 + /aufs2
- /aufs2 = /rw2 + /aufs3
- /aufs3 = /rw3 + /aufs4
:::
then the stack overflow may happen. It is not limited to readdir, it can
happen in every operation. Basically aufs rejects 'aufs/unionfs branch',
in other words "aufs branch of another aufs mount."
But aufs has a configuration to enable this. When a user enables it and
sets a deeply nested aufs branch, it could happen. But this is the same thing

I don't want you to call aufs users who are using such policies crazy.
By the way, what do you think about link(2) or rename(2)? When the source
file exists on the lower writable branch, do you think copy-up is the best
way? Or do you think all lower branches should be readonly?
There is an exception in aufs's branch-select policy. That is the
link/rename case. When the source file exists on a writable branch, aufs
tries to link/rename it on that branch in every policy. Do you think it
best to do it on the top branch only?

Junjiro Okajima
--
Yes, I tend to consider the union case identical to the cross-mount
move or link, so I'd expect the kernel to return errno=EXDEV and user
space to handle this by doing the appropriate copy/unlink as it does
for other cases already.

Arnd <><
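
What "user space handles this by doing the appropriate copy/unlink" can
look like is sketched below in C. This is not how mv(1) is implemented;
it is a minimal illustration that assumes regular files only, creates the
copy with mode 0600 instead of preserving permissions, and uses the made-up
helper names copy_file() and move_file(). Only rename(2), open(2), read(2),
write(2) and unlink(2) are real interfaces here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Copy src to dst byte for byte. Returns 0 on success, -1 on error.
 * Simplification: dst is created with mode 0600, metadata is not kept. */
static int copy_file(const char *src, const char *dst)
{
	char buf[4096];
	ssize_t n;
	int in, out, ret = -1;

	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		char *p = buf;

		/* write() may be short, so loop until the chunk is out */
		while (n > 0) {
			ssize_t w = write(out, p, (size_t)n);

			if (w < 0)
				goto done;
			p += w;
			n -= w;
		}
	}
	if (n == 0)	/* clean end of file */
		ret = 0;
done:
	close(in);
	if (close(out) < 0)
		ret = -1;
	return ret;
}

/* rename(), falling back to copy + unlink when the kernel says EXDEV. */
static int move_file(const char *src, const char *dst)
{
	assert(src && dst);
	if (rename(src, dst) == 0)
		return 0;
	if (errno != EXDEV)
		return -1;
	if (copy_file(src, dst) < 0)
		return -1;
	return unlink(src);
}
```

The point of the EXDEV convention is that callers written this way need no
knowledge of unions at all; they already cope with cross-mount moves.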
--
Aure rename returns EXDEV when the target is a dir and it has child
entr(y|ies) on lower branch(es). And mv(1) handles this case.
My English might be misunderstood. Do you think link(2) should return an
error when the target exists on a lower writable branch?

Junjiro Okajima
--
Any writes should always just go to the top level. If the source file
for link() exists on the top level, link should succeed even if a target
exists on a lower level (given that the user has permissions to
unlink that file), but should return EXDEV if the source comes from
a lower level.

Arnd <><
--
Then what will happen when a user builds a union by "empty tmpfs" +
"cramfs"? Following your design, link(2) becomes useless in a stacking fs.

You may be considering implementing a new dynamic link library for
stacking.
Hmm, that is interesting. It may be worth thinking about.

Junjiro Okajima
--
I agree w/ Jan E.
Folks, I've said it before: unioning is a deceptively simple idea in
principle, and &^@%*$&^@ hard in practice. And anyone who thinks otherwise
is welcome to write a *versatile* unioning implementation on their own. Once
you get through all corner cases and satisfy all the features which users
want, you have a complex large file system.

I believe that implementing unioning inside actual filesystems is totally the
wrong direction: going to lower layers is wrong, instead of going up to a
VFS-based solution. Unioning is a namespace operation that should not be
done deep inside a lower f/s.

People often wonder why FScache is (reportedly) so complex and big. It's
b/c in some part it has to deal with similar issues: unioning is
copy-on-write, whereas caching is copy-on-read.

Nevertheless, I can understand if the embedded community wants lightweight
unioning. Union Mounts initially may not support everything that unionfs
does, but it should be smaller, and it should be enough I believe for the
basic unioning uses --- perhaps even for the embedded community. If so,
then I suggest people offer to help Bharata and Jan Blunk's efforts, rather
than [sic] cramming unioning into a single file system.

Erez.
--
To the original posters:
I urge those who do believe {au,union}fs is too fat to go and build
their unioning into their on-disk filesystems, then let users run it
(remark: iff you can convince (or force) them why they should not be
using existing fs), let users report issues and iron it out for
perhaps 2-3 years, and then see how much your implementation has
grown. That is, if you actually added code (see remark 1).

About last year (June 2007), SLAX sought a solution that enhances
VFAT with UNIX permissions -- much like the old umsdosfs. A kernel
solution was initially preferred by Tomas (SLAX developer), yet I
(who got to write posixovl then) went for FUSE. It was about 20 KB
when it was moderately usable. The end result? Posixovl is a 46 KB C
file today. For userspace code. I bet it would be much more if it was
in-kernel.

Take that as a hint when developing your fs-specific unioning.
--
Though Union Mount effort has become slow and silent lately, some of
us are still working on it. While I worked on readdir support lately,
Jan Blunck and David Woodhouse are working on having a generic
whiteout support for linux.

Talking about help, the Union Mount effort could use generous help in
getting directory listing implementation right. We first tried to
handle duplicate elimination (during readdir) inside the kernel
entirely. The outcome was neither clean nor efficient.
(http://lkml.org/lkml/2007/12/5/147). Then there was a suggestion to
push the duplicate elimination to userspace. When that was tried out
(http://lkml.org/lkml/2008/4/29/248), we were told that NFS support is
going to be an issue. (BTW NFS support is going to be an issue
irrespective of where directory listing is implemented: kernel or
userspace). Some insights into feasibility of supporting NFS with
Union Mount from people who understand NFS better would be very
helpful.

Regards,
Bharata.
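
To make the duplicate elimination problem concrete, here is a rough C
sketch of merging the listings of two layers. It is not the Union Mount
code from either of the postings linked above. It borrows the ".wh." name
prefix that aufs uses for whiteouts purely as an assumption for the
example, and union_readdir() and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption for this sketch: whiteouts are upper-layer entries
 * named ".wh.<name>", as aufs does it. */
#define WH_PREFIX ".wh."
#define WH_LEN (sizeof(WH_PREFIX) - 1)

static int is_whiteout(const char *name)
{
	return strncmp(name, WH_PREFIX, WH_LEN) == 0;
}

/* Is a lower-layer name hidden, either by a whiteout or by an
 * upper-layer entry of the same name (a duplicate)? */
static int hidden_by_upper(const char *name,
			   const char *const *upper, size_t n_upper)
{
	for (size_t i = 0; i < n_upper; i++) {
		const char *u = upper[i];

		if (is_whiteout(u))
			u += WH_LEN;	/* compare against the hidden name */
		if (strcmp(u, name) == 0)
			return 1;
	}
	return 0;
}

/* Merge two listings into out (at most max entries), upper first.
 * Returns the number of names written. */
static size_t union_readdir(const char *const *upper, size_t n_upper,
			    const char *const *lower, size_t n_lower,
			    const char **out, size_t max)
{
	size_t n = 0;

	assert(out);
	for (size_t i = 0; i < n_upper && n < max; i++)
		if (!is_whiteout(upper[i]))
			out[n++] = upper[i];
	for (size_t i = 0; i < n_lower && n < max; i++)
		if (!hidden_by_upper(lower[i], upper, n_upper))
			out[n++] = lower[i];
	return n;
}
```

The sketch needs the whole upper listing in memory and compares every
lower name against it, which hints at why doing this inside the kernel for
large directories was found to be neither clean nor efficient.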
--
http://bharata.sulekha.com/blog/posts.htm
--
Yes, unionfs does copyup whole files, but it doesn't lock the entire file
into the page cache. But I agree, that copying up large files to a tmpfs
partition adds more memory pressure, at least temporarily (until pdflush

If I understand you right, you want to copyup one page at a time, right?
That's not nearly as easy as one might imagine. First, you can't do it on
file systems which don't support holes. Second, holes are a file-system
specific implementation issue, and the knowledge of holes AFAIC, is hidden
from the VFS (IIRC, FreeBSD has a specific "zfod" page flag, which is turned
on when the VM has a page that came out of a f/s hole).

You'll need a way to tell if a given page was copied up or not, and
distinguish b/t pages which are naturally filled with zeros vs. those which
came from f/s holes.

Copyup is also providing persistency: you can copyup to a persistent f/s
such as ext2. So you'll need a bitmap or some sort of record that will
survive file system remount and system reboot; such a bitmap will have to
tell which pages of a file have been copied up or not.

I'm not saying it's not possible, but it's harder to do this page-wise
caching at a stackable layer than inside a native f/s such as ext2. Now, if
there was a generic VFS op that allowed me to query a file system whether a
page in a given file is a hole or not, then unionfs would be able to do
page-wise copyup easily.

Frankly, I think something like support for a copied-up file, page-by-page,
should probably be supported by a block layer virtual driver (this might be
easier in a BSD-like geom layer.)

BTW, I believe FSCache has page-wise caching, right? Caching is a
copy-on-read operation, and it doesn't take much to make it cache (read:
copy) on writes. So FScache might be a good starting point for such an

I think a union-tmpfs is a better solution than a cramfs-specific one, b/c
at least with union-tmpfs, many more users could use it. Even if you
restrict yourself to using tmpfs as the r-w layer, and re...

1: I'm thinking systems which have union-over-cramfs probably don't have
swap at all...

2: It's a problem when you modify a very large file, even on a fast PC
with plenty of RAM. LVM snapshots might be better for this sort of

True, although the new FIEMAP ioctl is supposed to make holes more

See FIEMAP. Is it any use?

-- Jamie
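
The per-file record of copied-up pages discussed in this message could be as
simple as one bit per page. The C sketch below shows only the in-memory
side; struct copyup_map and its functions are hypothetical names, and the
hard part raised above (making the bitmap survive remount and reboot on a
persistent upper f/s) is deliberately not shown.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-file record: bit N set means page N has been
 * copied up and must be read from the writable layer. */
struct copyup_map {
	unsigned long *bits;
	size_t n_pages;
};

/* Allocate a map with all bits clear. Returns 0 or -1 on error. */
static int copyup_map_init(struct copyup_map *m, size_t n_pages)
{
	size_t words = (n_pages + BITS_PER_WORD - 1) / BITS_PER_WORD;

	m->bits = calloc(words ? words : 1, sizeof(unsigned long));
	m->n_pages = n_pages;
	return m->bits ? 0 : -1;
}

/* Record that a page now lives in the writable layer. */
static void copyup_map_mark(struct copyup_map *m, size_t page)
{
	assert(page < m->n_pages);
	m->bits[page / BITS_PER_WORD] |= 1UL << (page % BITS_PER_WORD);
}

/* Returns 1 if the page was copied up, 0 if it still comes from
 * the read-only layer. */
static int copyup_map_test(const struct copyup_map *m, size_t page)
{
	assert(page < m->n_pages);
	return (int)((m->bits[page / BITS_PER_WORD] >>
		      (page % BITS_PER_WORD)) & 1UL);
}

static void copyup_map_free(struct copyup_map *m)
{
	free(m->bits);
	m->bits = NULL;
	m->n_pages = 0;
}
```

A set bit also answers the "zeros vs. hole" question from above: a
zero-filled page with its bit set was written by the user, while a clear
bit means the lower file is still authoritative for that page.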
--
Correction: Unionfs doesn't make additional copies in the page cache.
Arnd, I favor a more generic approach, one that will work with the vast
majority of file systems that people use w/ unioning, preferably all of
them. Supporting copy-on-write in cramfs will only help a small subset of
users. Yes, it might be simple, but I fear it won't be useful enough to
convince existing users of unioning to switch over. And I don't think we
should add CoW support in every file system -- the complexity will be much
more than using unionfs or some other VFS-based solution.

I can see some advantages (re: cache coherency) by hacking CoW support
directly into a f/s. If you want to use a filesystem-specific solution,
then I suggest you don't modify a file system used as a source in a union,
but one used as a destination. You'll have better coverage that way. The
vast majority of times, unionfs users will either write to tmpfs or ext2;
but the source readonly f/s can be a lot of different ones (most popular are
ext*, nfs*, isofs, and cramfs/squashfs).

I find it somewhat ironic to hear the argument that "union mounts isn't
stable yet, so let's come up with a new solution inside cramfs." Why should
your solution become stable much faster than union mounts (which also had
patches floating around for a long time already)?

If you have cycles to spare, why not help Bharata and Jan?
Cheers,
Erez.
--
My idea was to have it in cramfs, squashfs and iso9660 at most, I agree
that doing it in even a single writable file system would add far too
much complexity. I did not mean to start a fundamental discussion about
how to do it the right way, just noticed that there are half a dozen
implementations that have been around for years without getting close to
inclusion in the mainline kernel, while a much simpler approach gives

Yes, that absolutely makes sense. I don't care much about persistent
storage for the overlay, so tmpfs (if not ramfs) should be the only place
to do it in. It does introduce some of the same old problems though,
because you could still write to a bind mounted copy of the underlying
file system (unlike cramfs, which is guaranteed to be read-only), which
forces you to either do a full copy-up, or can result in inconsistent
file contents. Also, stacking multiple union-tmpfs copies on top of each
other would be hard to do without the potential to overflow the kernel
stack.

I'll probably try implementing a '-o union' option tmpfs anyway, just

Because the patches are not trying to solve any of the hard problems at all:
Persistent storage of overlays, readdir traversal through more than two
layers, stable inode numbers, opening a file through two different overlays,
copyup, and so on. I'm sure you know more about these problems than I do,
but as long as I don't have to care about them, I don't see a problem

I spent a lot of time on discussing the initial implementation with Jan
years ago, and will keep reviewing their patches, but I have neither the
time nor the brains to really contribute much to them. As you mentioned
in your reply to Jan E., it's on an entirely different scale than doing
a small hack to cramfs or tmpfs.

Arnd <><
--
Yes, that's right.
Arnd <><
--
