Re: [Tux3] Feature interaction between multiple volumes and atomic update

Previous thread: [Tux3] Feature interaction between multiple volumes and atomic update by Daniel Phillips on Friday, August 29, 2008 - 8:41 pm. (2 messages)

Next thread: [Tux3] Hi all, I'm a weak coder but you've my full support by Michael Keulkeul on Saturday, August 30, 2008 - 8:13 pm. (5 messages)
To: Daniel Phillips <phillips@...>
Cc: <tux3@...>
Date: Friday, August 29, 2008 - 11:31 pm

I had a lot of trouble trying to implement multiple logs in HAMMER
(the idea being to improve I/O throughput). I eventually gave up
and went with a single log (well, UNDO fifo in HAMMER's case). So
e.g. even though HAMMER does implement pseudo-filesystem spaces
for mirroring slaves and such, everything still uses a single log

If you synchronize the transaction id spaces between the subvolumes
then the crash recovery code could use a single number to determine

I tried this in an earlier HAMMER implementation and it was a
nightmare. I gave up on it. Also, in an earlier iteration, I
had a blockmap translation layer to support the above. That
worked fairly well as long as the blocks were very large (at least
8MB). When I went to the single global B-Tree model I didn't
need the layer any more and devolved it back down to a simple

Yah. We do support pseudo-filesystems within a HAMMER filesystem,
but they are implemented using a field in the B-Tree element key.
They aren't actually separate filesystems, they just use totally
independent key spaces within the global B-Tree.
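
A minimal sketch of how a field in the B-Tree element key can give each PFS its own key space. The struct layout and the names (`elem_key`, `pfs_id`, `obj_id`, `key_cmp`) are invented for illustration and are not HAMMER's actual on-disk format. The only point is that comparing the PFS field first keeps each PFS in a contiguous, disjoint range of the one global tree.

```c
#include <stdint.h>

/* Hypothetical B-Tree element key; NOT HAMMER's real layout. */
struct elem_key {
	uint32_t pfs_id;   /* which pseudo-filesystem owns the record */
	uint64_t obj_id;   /* object (inode) number, private to that PFS */
	uint64_t offset;   /* key within the object */
};

/* Order by PFS first, so every PFS occupies its own key range. */
static int key_cmp(const struct elem_key *a, const struct elem_key *b)
{
	if (a->pfs_id != b->pfs_id)
		return a->pfs_id < b->pfs_id ? -1 : 1;
	if (a->obj_id != b->obj_id)
		return a->obj_id < b->obj_id ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

With this ordering, the same `obj_id` can exist in two different PFSs without the keys ever colliding.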

We use the PFSs as replication sources and targets. This also allows
the inode numbers to be replicated (each PFS gets its own inode

-Matt
Matthew Dillon
<dillon@backplane.com>

_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3

To: <tux3@...>
Date: Thursday, December 11, 2008 - 3:20 pm

Hi,

I think this adds some improvements, and it also turns up some TODO items.

With this, ->inum holds the real inode number (64bit), and ->i_ino
(which may be 32bit) is used only as a hash value. However, this has
the issues listed below. (So we may want to add a ->read_ino() handler
or something similar. ->i_ino is more or less hidden from userland,
but not perfectly. If we had a ->read_ino() handler that returns the
real inode number (the same value as stat64->ino), the generic_* code
could become more generic.)
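
A userspace sketch of the split described above: the 64bit inum is the identity, and ->i_ino only carries a hash value. The callbacks are shaped like the test/set callbacks that iget5_locked() takes, but everything here (`sketch_inode`, `inum_hash`, `inum_test`, `inum_set`) is a hypothetical stand-in, not the Tux3 code.

```c
#include <stdint.h>

typedef uint64_t inum_t;

/* Hypothetical in-memory inode: i_ino is a hash key, inum is the identity. */
struct sketch_inode {
	unsigned long i_ino;  /* may be only 32 bits wide; hash value only */
	inum_t inum;          /* real inode number */
};

/* Fold a 64-bit inum into an unsigned long usable as a hash value. */
static unsigned long inum_hash(inum_t inum)
{
	if (sizeof(unsigned long) >= sizeof(inum_t))
		return (unsigned long)inum;
	return (unsigned long)(inum ^ (inum >> 32));
}

/* "test" callback: two inodes may share a hash, so compare the real inum. */
static int inum_test(const struct sketch_inode *inode, const inum_t *want)
{
	return inode->inum == *want;
}

/* "set" callback: initialize a freshly allocated inode for this inum. */
static int inum_set(struct sketch_inode *inode, const inum_t *want)
{
	inode->inum = *want;
	inode->i_ino = inum_hash(*want);
	return 0;
}
```

Any code that prints or exports ->i_ino directly sees only the folded value, which is why the users listed at the end of this mail matter.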

tux_new_inode() was introduced; new inodes are initialized with it.
TODO: it currently saves almost all attributes, and we will want to
optimize this by dropping unneeded attributes.

Preparation for device special file support is done. TODO: we have to
load/store the rdev value from/to the ileaf, so we have to decide the
size of rdev. (The kernel currently uses a 32bit device number
internally (new_*), but there is preparation for a 64bit device number
(huge_*), and glibc seems to use 64bit already to prepare for the future.)
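
A sketch of the size question for rdev. The bit layout below is an assumption modeled on the kernel's 32bit new_* encoding (12bit major, 20bit minor) and should be checked against include/linux/kdev_t.h; the `sketch_*` names are hypothetical. It shows that a 64bit on-disk slot can hold today's 32bit value unchanged while leaving room for a wider encoding later.

```c
#include <stdint.h>

/*
 * Assumed 32-bit layout: minor bits 0-7 in bits 0-7, major (12 bits)
 * in bits 8-19, minor bits 8-19 in bits 20-31.
 */
static uint32_t sketch_new_encode_dev(unsigned major, unsigned minor)
{
	return (minor & 0xff) | (major << 8) | ((minor & ~0xffu) << 12);
}

static unsigned sketch_new_major(uint32_t dev)
{
	return (dev & 0xfff00) >> 8;
}

static unsigned sketch_new_minor(uint32_t dev)
{
	return (dev & 0xff) | ((dev >> 12) & 0xfff00);
}

/* A 64-bit on-disk field simply carries the 32-bit value for now. */
static uint64_t sketch_huge_encode_dev(unsigned major, unsigned minor)
{
	return sketch_new_encode_dev(major, minor);
}
```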

NOTE: this changes the inum field size in the dirent from 32bit to
64bit, in order to store the 64bit inum. This means it changes the
on-disk format.
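
A sketch of why widening the field is a format change. It assumes, purely for illustration, that the inum sits at the start of the dirent and is stored big-endian; the helper names are hypothetical. Code that still expects a 32bit field reads only the high half of the new 64bit value.

```c
#include <stdint.h>

/* Store a 64-bit value big-endian, as the new (assumed) dirent field. */
static void put_be64(uint8_t *p, uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		p[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/* Read the field the new way: all 8 bytes. */
static uint64_t get_be64(const uint8_t *p)
{
	uint64_t v = 0;
	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Read the field the old way: only the first 4 bytes. */
static uint32_t get_be32(const uint8_t *p)
{
	uint32_t v = 0;
	for (int i = 0; i < 4; i++)
		v = (v << 8) | p[i];
	return v;
}
```

An old reader applied to a new record gets a wrong inode number and also misplaces every field that follows, so old and new code cannot share a volume.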

Per-patch comments are in the repo as usual:

static-http://userweb.kernel.org/~hirofumi/tux3/

Please review.

Thanks.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

[->i_ino users] (If I'm missing something, please let me know)

1) printk()
In some places, the kernel uses ->i_ino for printk()

2) /proc/locks, /proc/<pid>/{maps,smaps}
Those show the ->i_ino

3) pipefs - pipefs_dname()
It uses ->i_ino as part of the filename. This may not matter?

4) fs/nfsd/export.c:exp_hash()
This uses ->i_ino. Since it is probably just a hash key, this may be ok?

5) audit (CONFIG_AUDIT)
audit (CONFIG_AUDIT) uses ->i_ino to check whether it is an interesting inode.

6) selinux
The current tree seems _not_ to use ino; however, variants might?

7) NFS ->get_name()
The default handler uses ->i_ino ...

To: <tux3@...>
Cc: OGAWA Hirofumi <hirofumi@...>
Date: Thursday, December 11, 2008 - 10:51 pm

The change to 64 bit ino in dir.c is a disk format change and requires
a rev of the date part of the superblock magic. I will pull first,
then do that... pulled, user built perfectly, make tests ran, kernel
built perfectly, booted up... the tests I use for ext2 ran...

Check_present is a nice add, the iget5 stuff is very cool, supporting
48 bit ino on 32 bit arch is very very cool.

Awesome patch set :-)

Regards,

Daniel


To: <tux3@...>
Cc: OGAWA Hirofumi <hirofumi@...>
Date: Thursday, December 11, 2008 - 5:23 pm

I upgraded from mercurial 0.9.1 to 1.0.1 to get some of the fancy new
extensions, and was not able to pull/clone your repository any more.
Bug report here:

http://groups.google.com/group/linux.debian.bugs.dist/browse_thread/thre...

Patch here:

http://hg.intevation.org/mercurial/crew-stable/raw-rev/98b6c3dde237

This patches mercurial 1.0.1 so that it can pull/clone from a 0.9.1
format repo, and thus avoids breaking 0.9.1 users' static-http pull/clone.

Regards,

Daniel


To: Daniel Phillips <phillips@...>
Cc: <tux3@...>
Date: Thursday, December 11, 2008 - 6:33 pm

Yes. As you know, I downgraded the version because of this.
--
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>


To: Matthew Dillon <dillon@...>
Cc: <tux3@...>
Date: Saturday, August 30, 2008 - 5:56 am

The concept of log in Tux3 is a little different, consisting of mini
transactions placed wherever the data goal happens to be, with a commit
block placed opportunistically somewhere near each transaction body. I
think this lends itself to parallelizing, though I agree that starting
off that way would be an unwise implementation choice.

At first there will be exactly one replay order, though transactions
will not necessarily be written in that order. What I see is throwing
several hundred mini transactions (I need to come up with a name for
such things) at the block device in parallel, but waiting on
completions in a linear order, in turn unblocking waiters in a linear
order. This is to avoid such a faux pas as allowing an fsync to
complete when there would be gaps in the linear replay sequence on
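
The scheme above (submit in parallel, retire in linear order) can be sketched as a small completion window. This is a single-threaded illustration with hypothetical names (`commit_window`, `cw_submit`, `cw_complete`, `cw_fsync_may_return`) and an arbitrary window size, not Tux3 code. Completions may arrive in any order, but the durable point only advances across a gap-free prefix of the sequence.

```c
#include <stdint.h>
#include <stdbool.h>

#define CW_WINDOW 256  /* hypothetical cap on in-flight mini transactions */

struct commit_window {
	uint64_t next_seq;     /* sequence number for the next submission */
	uint64_t durable;      /* every seq < durable has completed, no gaps */
	bool done[CW_WINDOW];  /* completion flags, indexed by seq % CW_WINDOW */
};

/* Hand out the next sequence number; fails when the window is full. */
static bool cw_submit(struct commit_window *w, uint64_t *seq)
{
	if (w->next_seq - w->durable >= CW_WINDOW)
		return false;
	*seq = w->next_seq++;
	return true;
}

/* Record an out-of-order completion, then advance over any finished prefix. */
static void cw_complete(struct commit_window *w, uint64_t seq)
{
	w->done[seq % CW_WINDOW] = true;
	while (w->durable < w->next_seq && w->done[w->durable % CW_WINDOW]) {
		w->done[w->durable % CW_WINDOW] = false;
		w->durable++;
	}
}

/* An fsync waiting on seq may return only once no earlier gap remains. */
static bool cw_fsync_may_return(const struct commit_window *w, uint64_t seq)
{
	return seq < w->durable;
}
```

If transactions 0, 1 and 2 are in flight and 2 completes first, a waiter on 2 stays blocked until 0 and 1 have also completed.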

Right, that is not too hard. What I don't like about (3) is the idea
that a bunch of work will be done on behalf of subvolumes that are not
even being asked for yet, which means that the one that actually is
being mounted will have to wait for some perhaps irrelevant work to be
done before it gets back online.

Another thing I don't like about this is, it violates the design
decision that there is no classic "replay" on mount with Tux3, only
recreation of the relevant cache state. The cache state of a subvolume
that is not yet being mounted does not qualify as relevant, and yet

Number 4 is a service that any volume manager should be able to
perform. Unfortunately on Linux, the volume manager performs the job
poorly, not in a way that a filesystem can control on the fly. So a
filesystem cannot expand itself, there is just no suitable internal
interface to direct the volume manager to do its part of the work. I
presume that a similar sad situation must exist on Solaris, which set
the stage for ZFS taking on the role of volume manager itself, a
factoring that I find... a little disturbing.

I am now leaning towards the idea of dropping subvolumes. There is
exactly one advanta...
