Linux : Reiser4 and VFS Issues

Submitted by Kedar Sovani
on August 30, 2004 - 10:06am

Continuing the debate over the right way to go about implementing some of the features found in the newly released Reiser4 [story], Hans Reiser asked Al Viro to clarify the issues he thinks could arise from the current implementation. The result was a brilliant explanation of what problems Al sees, specifically related to dentry aliasing, and how the current VFS architecture handles some of these problems.

Read on for Al's response and further clarification from Linus Torvalds. The interesting exchange provides some good insight into the Linux VFS layer.


                                                                                                    
From:       Al Viro [ email blocked ]
To:         linux-kernel
Subject:    Re: silent semantic changes with reiser4
Date:       2004-08-29 21:27:00
                                                                                                    
On Sun, Aug 29, 2004 at 01:06:41PM -0700, Hans Reiser wrote:
                                                                                                    
> How about if you educate me on the problems you see for a bit before I
> respond? I think it might help us move into a constructive discussion.
                                                                                                    
A slew of cache coherency issues (and memory consumption on top of that).
Rigth now we have a signle dentry tree for all fs instances, no matter
how many times it and its subtrees are visible in the tree.  Which,
obviously, avoids all that crap.  As soon as we start trying to have
multiple trees over the same fs, we are in for a *lot* of fun.
                                                                                                    
The basic question is how do you propagate the changes from one tree
to another and what do you do when change to "invisible" entry happens
via a tree where it is visible.  Unfortunately, all obvious solutions
either hit the pathname resolution and hit it *hard* (both in scalability
and in price of uncontended case) or create a mess for stuff like
NFS silly-rename semantics, VFAT aliases, etc.
                                                                                                    
Saying that some unspecified data structures might work doesn't help.
obviously - any variant will have tradeoffs of its own and has to be
discussed individually.  fsdevel is there for exactly that sort of
stuff, so if you have any specific proposals you want to discuss, you
are more than welcome.
                                                                                                    
Again, right now all cache coherency issues are sidestepped by having a single
instance of dentry tree per superblock.  Each mount/binding is represented by
vfsmount and _refers_ to a (sub)tree of dentry tree of that fs - dentries
and inodes are the same (ditto for private data structures owned by filesystem,
obviously).

So we have a forest of dentry trees (one per filesystem) and forest of
vfsmount trees (one tree per namespace).  Each vfsmount corresponds to
a chunk of namespace - think of the full user-visible tree cut into
pieces by mountpoints.  For every piece we know
       a) what fs it's from (->mnt_sb)
       b) what subtree of that fs it corresponds (->mnt_root)
       c) what piece it's attached to (->mnt_parent)
       d) where in that piece it's attached (->mnt_mountpoint)
       e) where to find its siblings and children
       f) which namespace (== tree in vfsmount forest) it's from
                                                                                                    
Point in a namespace is determined by pair (vfsmount, dentry) - which
piece we are looking at and where in that piece we are.  Pathname lookups
operate on such pairs - as long as we do not cross into another piece
we just step from dentry to dentry, when we cross mountpoint towards root,
we flip to (mnt_parent, mnt_mountpoint) first, when we cross mountpoint
into mounted filesystem, we do hash lookup (lookup_mnt()) by our pair,
find vfsmount with such mnt_parent and mnt_mountpoint and step into
(vfsmount, vfsmount->mnt_root).
                                                                                                    
Lifetime of vfsmounts is controlled by a simple refcount.  Having vfsmount
attached to tree contributes +1, so does having it pinned down by lookup/
chdir/chroot/opened file.  When the last vfsmount goes away, filesystem is
shut down (basically, that's what happens on umount).  Binding simply creates
a vfsmount pointing to target and attaches it at source.
                                                                                                    
Cloning a namespace copies the vfsmount tree of parent process' namespace
and flips vfsmounts of our root and cwd into the corresponding nodes in that
copy.  Normally, fork()/clone() just increments namespace refcount instead
of cloning it, so all children share namespace with parent.
                                                                                                    
When all processes in a namespace are gone its refcount goes to zero.  At
that point we dissolve all attachments in its vfsmount tree (which drops
a reference to each vfsmount and can in turn lead to fs shutdowns).
                                                                                                    
Lazy umount does similar operation with specified subtree - it detaches
all pieces mounted anywhere in that subtree and drops reference to each.
Ones that are not busy will be shut down immediately, ones that are will
go away as soon as they stop being busy.

                                                                                                    
vfsmounts can have flags of their own - one of the pending projects is to
allow individual vfsmounts to be read-only.  Noexec, nosuid and nodev are
already per-mountpoint.
                                                                                                    
Tree of vfsmounts is protected by
     a) vfsmount spinlock (protects hash lookups, mostly)
     b) per-namespace semaphore (that one is obviously blocking)
     c) in cases when we can go from 0 to 1 vfsmounts with given dentry
as a mountpoint, we also hold ->d_inode->i_sem on the mountpoint-to-be
(that closes races between rmdir/mount, rename/mount, etc.)
                                                                                                    
Dentry tree is messier - these days we have dcache_lock and RCU stuff.
If you can help getting the documentation out of these guys, you've got
a *lot* of thanks from a lot of people.  As it is, it's RTFS country.
I can answer specific questions there, but I won't even try to produce
readable manual covering the entire thing.
                                                                                                    
Directory operations have exclusion based on ->i_sem - see
Documentation/filesystems/directory-locking for description and proof
of correctness.
                                                                                                    
For operations that can destroy an object (unlink/rmdir/overwriting rename)
we try to unhash the victim dentry first, so the filesystems that can't
handle unlinked-but-busy stuff can detect such attempt by seeing the
victim still hashed _and_ can be sure that if it's not hashed, nobody will
come and see it (lookup would have to go to filesystem code, since we
have the sucker unhashed and we have exclusion there).  If operation is
unsuccessful, we rehash the victim.
                                                                                                    
Files involved:
      fs/dcache.c
      fs/super.c
      fs/inode.c
      fs/namespace.c
      include/linux/dcache.h
      include/linux/fs.h
      include/linux/mount.h
                                                                                                    
If you need more details on any specific area (or gaps in the above) - just
ask.
                                                                                                    
                                                                                                    
From:       Linus Torvalds [ email blocked ]
Subject:    Re: silent semantic changes with reiser4
Date:       2004-08-29 21:50:07
                                                                                                    
                                                                                                    
On Sun, 29 Aug 2004 [email blocked] wrote:
>
> A slew of cache coherency issues (and memory consumption on top of that).
> Rigth now we have a signle dentry tree for all fs instances, no matter
> how many times it and its subtrees are visible in the tree.  Which,
> obviously, avoids all that crap.  As soon as we start trying to have
> multiple trees over the same fs, we are in for a *lot* of fun.
                                                                                                    
Al, I think you should make the argument a bit more specific, because I
doubt a lot of people understand just what the problems are with aliased
names. Just a few examples of the problems involved will illuminate things
very well, I think. People who haven't been intimate with the name caches
probably simply don't understand why you worry so much.
                                                                                                    
I'll start out with some trivial examples, just to let Hans and others get
an idea about what the issues are. I think examples of problems are often
better ways to explain them than the abstract issues themselves.
                                                                                                    
Aliases: let's say that you have filename "a" hard-linked to filename "b",
and you have a directory structure of streams under there. So you have
                                                                                                    
    a/file1
    a/dir1/file2
    a/dir2/file3
                                                                                                    
and (through the hard-link with "b") you have aliases of all these same
names available as "b/file1", "b/dir1/file2" etc).
                                                                                                    
Now, imagine that you have two processes doing
                                                                                                    
     mv a/dir1 a/dir2/newdir
                                                                                                    
and
                                                                                                    
        mv b/dir2 b/dir1/newdir
                                                                                                    
at the same time. Both of them MUST NOT SUCCEED, for pretty obvious
reasons (you'd have moved two directories within each other, and now
neither would be accessible any more).
                                                                                                    
How do you handle locking for this situation?
                                                                                                    
Another interesting case is what happens when you have looked up and cache
the filename "a/file1" and then another process does "rm b/file1". How do
Another interesting case is what happens when you have looked up and cache
the filename "a/file1" and then another process does "rm b/file1". How do
you update the _other_ cached copy, since they had two different names,
but _both_ names went away at the same time. Also again, how do you handle
locking?
                                                                                                    
The general VFS layer has a lot of rules, and avoids these problems by
simply never having aliases between two directories. If the same directory
shows up multiple times (which can happen with bind mounts), they have the
exact same dentry for the directory, it's just found through two different
vfsmount instances. That's why vfsmounts exist - they allow the same name
cache entry to show up in different places at the same time.
                                                                                                    
So when we do a bind mount, and the same directory shows up under two
different names "a" and "b", and we do a "rm b/file1", it _automatically_
disappears from "a/file1" too, simply by virtue of "a" and "b" literally
being the same dentry. No aliasing ever happens, and this makes coherency
and locking much easier (which is not to say that they are trivial, but
they are pretty damn clear in comparison to the alternatives).
                                                                                                    
What Al (and others) worries about is that the reiser4 name handling has
_none_ of these issues figured out and protected against. You can protect
against them by taking very heavy locks (you can trivially protect against
all races by taking one large lock around any operation), but the fact is,
that is just not an option for high-performance name lookup.
                                                                                                    
These aliasing/locking rules need to be global and wll-though-out. Not
just fix the two examples above, but be shown to be safe _in_general_.
                                                                                                    
         Linus

Related Links:

Complexity

Anonymous
on
August 30, 2004 - 2:34pm

[...] Now, imagine that you have two processes doing



mv a/dir1 a/dir2/newdir

mv b/dir2 b/dir1/newdir



at the same time. Both of them MUST NOT SUCCEED, [...]

Wrong: just one of them needs to fail. It seems an easy locking problem: you cannot move what doesn't exist anymore. The VFS code should have lockings to prevent this things already, right? What's the difference between doing that and doing:



$ mkdir abc def

$ # same-time paralelism begin #

$ mv abc def/abc

$ mv def abc/def

$ # same-time paralelism end #



The result here is pretty obvious: one will succed, the other will fail. The same goes to the aliasing problem: when you move 'a/dir1' to 'b/dir2/newdir', 'a/dir1' AND 'b/dir1' won't exist anymore, as 'a' and 'b' are hardlinks, and the second command will fail.

Another interesting case is what happens when you have looked up and cache the filename "a/file1" and then another process does "rm b/file1". How do you update the _other_ cached copy, since they had two different names, [...]

Maintain a list of aliases for each file. They could be names or just pointers to the other(s) file(s) structure(s).

[...] but _both_ names went away at the same time. [...]

That's the idea.

The general VFS layer has a lot of rules, and avoids these problems by simply never having aliases between two directories.

The most known generic solution is avoid the problem. Sometimes it work, but just when you don't really need to solve it.

Graphs are more complex than trees, but if you really need that complexity to express relations between the information, in a simpler way, you should re-think cons of the complexity.

What Al (and others) worries about is that the reiser4 name handling has _none_ of these issues figured out and protected against. You can protect against them by taking very heavy locks (you can trivially protect against all races by taking one large lock around any operation), but the fact is, that is just not an option for high-performance name lookup.

Exactly. But acquire large locks in this case it's just laziness: even not knowing the internals of this fs, I can imagine a simple RCU to solve this problems. If this is really an issue, Reiser probably will find a good and simpler answer.



--
ofranja

at the same time. Both of t

Anonymous
on
August 30, 2004 - 4:11pm

at the same time. Both of them MUST NOT SUCCEED, [...]

Wrong: just one of them needs to fail

I think you may be confusing the meaning of "both" here. "Both" refers to the pair, as a pair. It does not mean "each," which refers to each member of the pair considered individually.

So, "[b]oth of them must not succeed" means "both X and Y must not succeed," which only implies that at least one must fail, not that each must fail.

this is the most classy way o

Anonymous
on
August 30, 2004 - 4:58pm

this is the most classy way of saying "you can't read" to someone I've ever saw :p

Hmmm...

Anonymous
on
August 30, 2004 - 6:26pm

> ...this is the most classy way of saying "you can't read" to someone I've ever saw :p

Yes, maybe, but being classy is a must, after all.

Anyway, maybe this is a problem of different mother tongues.

In my language, for instance, ["both x and y"] and ["both, x and y,"] have not the same meaning. English seems to be very frugal in the comma usage, hence probably the resulting confusion.

After all these years of speaking English -- some 25 -- I still can't say it's easy.

Esperanto

Anonymous
on
August 30, 2004 - 6:37pm

Agreed. English is my native and only language and it still confuses the hell out of me. Maybe it's time that we all sat down and learned Esperanto.

enrish

Anonymous
on
August 30, 2004 - 11:14pm

I'm a big fan of engrish, ise it all the tume.

Not suited

Anonymous
on
September 7, 2004 - 8:26am

For talking I would much prefer Interlingua (mostly because I have a reasonable background in Romance languages).

For discussions such as these and eliminating confusion in general you would be better off with Lojban (or similar). "The grammar is based on predicate logic, and is capable of expressing complex logical constructs precisely."

Given the porting effort, let's just stick to English.

Grammar confusion

Anonymous
on
September 10, 2004 - 1:30am

For what it's worth, to me it would look like "both of them can not succeed" was a clumsy way to say "neither of them can succeed", while the other meaning would look like "not both of them can succeed". Non-native speakers trying to interpret the English of other non-native speakers can be quite funny, I agree.

Sorry

Anonymous
on
August 31, 2004 - 2:00pm

Sorry about that. As you can see - and others have also noted -, English is not my mother tongue, and sometimes I don't get the (important) details.


Well, i take back my first paragraph, but i keep the rest. And that's it.


-- ofranja

Understandable

Anonymous
on
September 1, 2004 - 12:13pm

It is quite understandable that you could be confused by the wording, since it is not technically correct. At best, it could be taken either way -- what Linus actually meant, or what you thought he meant. At worst, it meant exactly what you thought it meant (and technically this is the case).

Word order is generally more important in English than in most languages. This can be an issue even for native English speakers, and they make mistakes such as the one Linus made here all the time. Most other native English speakers are so familiar with the mistakes that they understand the actual meaning immediately. So although Linus didn't word that correctly, he was immediately understood correctly by nearly everybody reading it. Where the "not" is placed in this sentence is important. So Linus saying that they 'both must not succeed' is technically the same thing as 'neither must succeed.' What he actually meant was they 'must not both succeed,' or 'one or the other must fail.'

This type of thing is so common that I can think of a proverb that is more often stated incorrectly than correctly. Generally people incorrectly say, "All that glitters is not gold," when what they actually mean is, "Not all that glitters is gold."

Of course, being a native English speaker, I knew what Linus intended to say and only noted that it was incorrectly worded in passing. I would never have brought it up, but your confusion illustrates why wording can be important.

It's not a proverb, it's a Shakespeare quotation

Anonymous
on
September 1, 2004 - 2:16pm

and you quoted it wrong.

"All that glisters is not gold."

--From The Merchant of Venice (II, vii)

Not a proverb or do you mean not from Proverbs?

Anonymous
on
September 2, 2004 - 2:23am

The quote from Shakespeare is a proverb, just not a Biblical Proverb.

The Bible doesn't own the sole rights to proverbs. They existed long before in other religions as well as other mystical doctrines, or just plain insightful writings from observant writers, such as the mythical Shakespeare (man or many men, so on and so forth).

In truth, since when are Programmers known to be well-versed grammarians of any language other than the programming languages they profess to be experts using? And even those they cannot claim, for if they could we would not have so many syntactical bugs flying around seemingly endless bits of past and current known or unknown software projects, correct?

Both will succeed and kill each other.

Anonymous
on
September 1, 2004 - 12:14am

I think, if they run at the same instance and have the same priority, then the moving will go like this --

a/dir1 to a/dir2/newdir
b/dir2 to b/dir1/newdir

and that means

b/dir1 to b/dir2/newdir
b/dir2 to b/dir1/newdir

and it will never end, creating a loop and the system will fail only when the media's free space is used up.

If one runs a bit faster then the other, then both should kill each other because its a move function, and will delete one of the directories.

hardlinks

Anonymous
on
August 31, 2004 - 2:55pm

ack, its just a locking issue, but:

this discussion reminds me when we discussed hard links once in a low level os design class... and there easily some evil issues came up... issues like this... u have 2 files hard linked and all the great transparency and intuitivity goes to hell when it comes to even basic file operations like rm... (i finly guess hard links suck. realy. i didnt need them and i dont need them - for what?)

and now i see all this multiple stream issues comming up... we already had this one here and there from time to time... and i dont think ill like this idea... a file should just be a file... it should be one chain of bytes. if i do a directory list i see its name, its position and its length (and some flags etc..) and when i copy it i can be sure the copy is 100% the same bytes with the same length... (no strange backup issues for example).

when u need streams or other "extensions" to an fs, it should be somewhere in userspace i guess... there simply can be a program wich handles a file like another filesystem (consisting of streams, whatever)... u can do everything u want but dont cause any fs or kernel problems...

besides... the tree algorithms and otehr fancy stuff in raiser4, let alone be speed are great so lets finaly use some bits :-)

Eugene

you sir, is a debaser

Anonymous
on
September 1, 2004 - 4:23pm

so you really have nothing to add? great, now stfu.

Locking could be easy

Anonymous
on
September 1, 2004 - 12:19pm

You say that it is simple to prevent this. That is true if you don't care about performance. Just make every fs operation atomic and synchronize all of userspace on all processors. But in the real world you want on process running on processor A to be able to delete the file without locking all other processors from accessing the filesystem. The question is how to tell if the referenced file is aliased anywhere else, and if so, update it, and to do so without any race conditions.

hardlink of dirs?

Anonymous
on
August 31, 2004 - 12:45am


Aliases: let's say that you have filename "a" hard-linked to filename "b",
and you have a directory structure of streams under there. So you have


a/file1

a/dir1/file2

a/dir2/file3



and (through the hard-link with "b") you have aliases of all these same
names available as "b/file1", "b/dir1/file2" etc).



Since when I can set a hard link onto a dir?
Bad example, bad

ln: `b': hard link not allowed for directory

RE: hardlink of dirs?

Anonymous
on
August 31, 2004 - 1:07am

>Since when I can set a hard link onto a dir? Bad example, bad
>ln: `b': hard link not allowed for directory

Because a and b _are_ not directories, If you followed the thread you will understand. a and b are the new "file stream" that Reiser4 introduced, so you have a _file_ xmms it can have streams xmms/icons/32x32.png xmms/icons/64x64.png xmms/skin that is the problem, you can _have_ a hardlink to xmms because it is a _file_ which causes the problem Linus is talkin about

easy sollution:

Anonymous
on
August 31, 2004 - 2:58pm

forbid hard links for files as well ;-)

can anyone tell me why we would realy need that hard link feature? does anyone use this?

Eugene

yes

on
August 31, 2004 - 4:27pm

You can use hard links to make incremental backups.
If you backup the whole server each day, you only actually back it up once every week, and the other 6 days you backup only the files that changed and make hard-links to older backups for the rest of the files (unchanged). This way, you can save a lot of space on the back-up server.

This whole strategy is actually already implemented by rsync.

http://www.mikerubel.org/computers/rsync_snapshots/#Incremental

Also, you can implement somet

on
September 1, 2004 - 8:38am

Also, you can implement something like ACLs by hard-linking the same file into different subdirectories. Each subdirectory might have different ownership, and the hard-linked file might be world or group readable.

share files

Anonymous
on
September 3, 2004 - 4:16am

You can easily share files between directories with hardlinks. For example you may want to make .bash_profile and .bashrc owned by root and writable only by root, and share those files across all home directories.

> You can easily share files

Anonymous
on
October 26, 2004 - 8:45am

> You can easily share files between directories with
> hardlinks. For example you may want to make
> .bash_profile and .bashrc owned by root and writable
> only by root, and share those files across all home directories.

Good idea for sharing the DEFAULT settings.
Bad idea if you think you can use that method to FORCE settings
on a user, because you cannot prevent a user from:

$ rm $HOME/.bashrc
$ vi $HOME/.bashrc # create a new, customized one

unless you also do:

# chmod u-w /home/*

as root. And to prevent the users to "chmod u+w $HOME", you must:

# chown root.root /home/*

That'd be weird!

Yes

Anonymous
on
September 3, 2004 - 9:04pm

Whenever you go to copy a file from one place to another, think to yourself: "Can I just hard link this instead?". I think you'll find that your use of "cp" will drop dramatically since often you won't actually need separate instances of the same data, just different ways to reference it.

symlinks seems to be good eno

Anonymous
on
September 4, 2004 - 7:55am

symlinks seems to be good enough if you need to share files or make incremental backups...

No, because symlinks are diff

on
September 4, 2004 - 8:48am

No, because symlinks are different from normal files. If you want to backup /lib you do that with cp -a to preserve the symlinks among other things, you really want to backup symlinks as symlinks, and restore them as symlinks.

symlinks

Anonymous
on
September 4, 2004 - 11:27am

you are right... thanks

Hard vs. Soft links

Anonymous
on
September 4, 2004 - 2:26pm

Not to mention symlinks to a file aren't enough to prevent it from being removed. A hard link is no different than the original directory entry. All hard links to a file have to be removed before the inode is freed.

RE: hardlink of dirs?

Anonymous
on
September 13, 2004 - 9:49am

But a and b _are_ _treated_ as directories. You can't have hardlinks to directories, end of history.

The plugin scheme for reiser4 should prevent conflicts of this kind. If inode of 'a' is opened as a dirstream, it (the inode) cannot be opened as dirstream by any other name, as that would be a hardlink.

Or maybe it could be prevented earlier: if something can be treated as dirstream, it cannot have hardlinks?

Or maybe a dirstream could be inserted in the dentry stuff just like a new mount point (only for the opening process, if that's possible at all), and treated as such?

Hans already replied to this explanations?

on
August 31, 2004 - 8:15am

... and ... will reiser4 support 4k stacks?

Thanks!

Yes

on
August 31, 2004 - 8:52am

Someone's working on it.

Who cares about the size of t

Anonymous
on
August 31, 2004 - 8:54am

Who cares about the size of the stack? I doubt a designer of some tiny embedded computer is going to cram in reiserfs when the design is so cramped that the kernel's stack size is a worry.

Size of stack matters.

Anonymous
on
August 31, 2004 - 2:24pm

Each process needs allocated memory for its stack in kernel. 8k instead of 4k cuts the max number of processes to 50%. 4k stack is a must for lots-and-lots of processes.

stack size is a major design factor

on
August 31, 2004 - 7:30pm

Thats true a whole family of OS designs came out just because of this. Some have only one kernel stack.Some have shared number of stacks. Some have really small stacks. This designs give different opprutunities for interesting optimisations in other areas of the OS.

Stack needs to be contiguous pages

on
September 4, 2004 - 2:15am

In order to work, a stack needs to be made up of contiguous pages; for simplicity, the Linux kernel maps its own memory as a 1-1 mapping between physical RAM and virtual addresses. Thus, an 8K stack (two hardware pages) involves searching for two contiguous free physical pages in the kernel memory area. A 4K stage (one hardware page) just involves finding a free memory area.

In most systems, the 8K stack is not a huge issue; the kernel rarely allocates then later deallocates an odd number of pages, so normally there is an excess of pairs of contiguous pages. However, it is possible for the system to run out of these pairs, and to solve this problem there are two solutions:

  1. Make the kernel memory map more complex (like user process memory maps). This allows you to take two physical pages from anywhere in physical RAM, and form a contiguous pair in virtual memory. It also makes the VM much more complex, and Linux has traditionally had problems finding people who understand the VM.
  2. Reduce the kernel stack to one hardware page. Then, the only way to run out of pages for kernel stacks is to run out of free memory altogether. By swapping before you completely run out of physical RAM (when there is space left for a kernel stack), you guarantee that there will always be space for a kernel stack, unless the system is completely out of memory (in which case it's doomed anyway).

So, 4K stacks are important in making Linux more scalable, and more robust, without increasing the complexity of the kernel.

Solution for uniprocessors?

Anonymous
on
October 5, 2004 - 1:59am

I'm still only a first year cpsc major, but here goes:

Why not just have the operation itself be performed atomically? I mean, Reiser4 is an atomic FS so it can delete files atomically? That is what I understood. So that means that if the FS locks the stream so that it cannot delete it because it's busy (until it atomically finalizes its operation) and then that operation deletes the file... then there is no multiprocessing on that file.

If not, then why don't we just add a meta to the file to lock it?
That's another asset of Reiser4...

Maybe I'm entirely off base, but I thought you guys were saying Reiser4 cannot handle locking and therefore couldn't manage those hardlinks. In fact, using metas, the file itself would be the only thing locked and it would allow hardlinked directories?

If I'm wrong, I'll learn eventually. My 2 cents. Dan.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.