On Friday 29 August 2008 20:31, Matthew Dillon wrote:
The concept of a log in Tux3 is a little different, consisting of mini
transactions placed wherever the data goal happens to be, with a commit
block placed opportunistically somewhere near each transaction body. I
think this lends itself to parallelizing, though I agree that starting
off that way would be an unwise implementation choice.
At first there will be exactly one replay order, though transactions
will not necessarily be written in that order. What I see is throwing
several hundred mini transactions (I need to come up with a name for
such things) at the block device in parallel, but waiting on
completions in a linear order, in turn unblocking waiters in a linear
order. This is to avoid such a faux pas as allowing an fsync to
complete when there would be gaps in the linear replay sequence on
which the fsync depends.
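
To make the ordering rule concrete, here is a minimal sketch in Python of
the bookkeeping it implies. This is a toy model, not Tux3 code: the class
name CommitFrontier, its methods and the use of integer sequence numbers
starting at 1 are all assumptions made for illustration. Completions may
arrive in any order, but the durable point only advances across a
gap-free prefix of the replay sequence, and waiters are released in
sequence order.

```python
import heapq


class CommitFrontier:
    """Toy model: writes complete out of order, waiters unblock in order.

    Hypothetical names throughout; sequence numbers are assumed to start
    at 1 and to match the single linear replay order.
    """

    def __init__(self):
        self.durable = 0        # every transaction with seq <= durable is on disk
        self.completed = set()  # completed transactions that sit beyond a gap
        self.waiters = []       # min-heap of (seq, name), e.g. blocked fsync callers

    def wait_for(self, seq, name):
        """Register a waiter that depends on transaction `seq`.

        Returns the list of waiters released immediately (either [name] or []).
        """
        if seq <= self.durable:
            return [name]
        heapq.heappush(self.waiters, (seq, name))
        return []

    def complete(self, seq):
        """Record an I/O completion; return waiters released, in linear order."""
        self.completed.add(seq)
        # Advance only across a contiguous run, so no gap is ever skipped.
        while self.durable + 1 in self.completed:
            self.durable += 1
            self.completed.remove(self.durable)
        released = []
        while self.waiters and self.waiters[0][0] <= self.durable:
            released.append(heapq.heappop(self.waiters)[1])
        return released
```

With this model, an fsync that depends on transaction 2 stays blocked
when 3 and then 2 complete, and is released only once 1 completes and
closes the gap.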
> : 3) When the first subvolume is remounted after a crash, implicitly
Right, that is not too hard. What I don't like about (3) is the idea
that a bunch of work will be done on behalf of subvolumes that are not
even being asked for yet, which means that the one that actually is
being mounted will have to wait for some perhaps irrelevant work to be
done before it gets back online.
Another thing I don't like about this is that it violates the design
decision that there is no classic "replay" on mount with Tux3, only
recreation of the relevant cache state. The cache state of a subvolume
that is not yet being mounted does not qualify as relevant, and yet
it would be recreated anyway, then discarded.
> : 4) Partition the allocation space so that each subvolume allocates
Number 4 is a service that any volume manager should be able to
perform. Unfortunately, on Linux the volume manager performs the job
poorly, and not in a way that a filesystem can control on the fly. So a
filesystem cannot expand itself: there is just no suitable internal
interface to direct the volume manager to do its part of the work. I
presume that a similar sad situation must exist on Solaris, which set
the stage for ZFS taking on the role of volume manager itself, a
factoring that I find... a little disturbing.
I am now leaning towards the idea of dropping subvolumes. There is
exactly one advantage I was able to think of that would be out of scope
of what a volume manager could do, which is the idea of "budding" and
"melding" directories to/from separate volumes. And I hardly think
that having multiple subvolumes available is the only way to do that.
Subvolumes can always be added later if there turn out to be real use
cases that cannot possibly be performed by an improved volume manager.
> :I CC'd this one to Matt Dillon, perhaps mainly for sympathy. Hammer
I envy the lovely symmetry of your fat key space. But I am convinced
that the extra complexity of a hierarchical structure encoded with
narrow keys (48 bits to handle exabyte filesystems) will pay off in
cache performance. And the complexity is not actually all that much.
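
As a quick check of the "48 bits to handle exabyte filesystems" figure,
here is the arithmetic as a short Python sketch. It assumes the 48-bit
key is a block address and that blocks are 4 KiB; neither assumption is
stated above, and the constant names are made up for illustration.

```python
KEY_BITS = 48        # key width quoted above
BLOCK_SIZE = 4096    # assumption: 4 KiB blocks
EXBIBYTE = 1 << 60   # 2**60 bytes

# Total bytes addressable if each 48-bit key names one block.
addressable_bytes = (1 << KEY_BITS) * BLOCK_SIZE

print(addressable_bytes // EXBIBYTE, "EiB addressable")
```

Under those assumptions, 2^48 blocks of 2^12 bytes come to 2^60 bytes.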
Most of the structural details are in place now and the code base is
about 5,000 lines, of which about 1/3 is unit tests and about 10% is
buffer emulation. I expect the finished size of the kernel code to come
in around 10,000 lines, but I am guessing wildly about how complex the
fsync will be, the details of which are just starting to take shape.
> We use the PFSs as replication sources and targets. This also allows
Your PFS sounds like what I would call a snapshot.
Indeed, it is essential to replicate inode numbers faithfully if you
want to export via NFS on the downstream side. We got that for free in
ddsnap, which replicates an entire volume, but I had overlooked that
detail so far in thinking about filesystem-based replication. Ooh.
Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3