:Tux3 never backs anything out, so there is some skew to clear up.
Got it. I'm still not clear on the physical vs logical log
interactions required for crash recovery, but I think I understand

The only major interlock between frontend searches and backend flushes
occurs on an inode-by-inode basis and only for the duration of the
modifying B-Tree operation, which does *not* include the writing of
the modified meta-data buffer to disk. Once the operation concludes
we are left with dirty meta-data buffers which for all intents and
purposes have transferred the in-memory cached B-Tree records (held
as separate entities in a red-black tree in-memory) into the media
view of the B-Tree (having performed the actual B-Tree operations
necessary to insert or delete the record), and the inode lock is
then released. The flusher flushes each inode on the flush list in
turn, with certain long-lived inodes (e.g. truncation of a 1TB
file to 0) only getting partial goodness and then being moved to
the next flush list. The inode lock is held for each individual
inode as it is being flushed, one inode at a time.

The actual dirty meta-data buffers are flushed later on, without
any inode lock held and without having any blocking effect on the
frontend. This media-write sequence occurs at the end, after all
inodes have been processed. UNDO finishes writing, we wait for
I/O to complete, we update the volume header, we wait for I/O
to complete, then we asynchronously flush the dirty media buffers
(and don't wait for that I/O to complete so the writes can be merged
with the beginning of the next flush group).

That is what I meant when I said the frontend was disconnected
from the backend. The worst the frontend ever does is issue reads
on media meta-data as part of any merged memory-media B-Tree searches
it must perform to locate information. Any modifying operations made
by the frontend will create records in the...
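
The flush ordering described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not HAMMER's actual code: every name here (`Inode`, `Volume`, `flush_inode`, `flush_group`, `budget`) is hypothetical, plain dicts stand in for the in-memory red-black tree and the media B-Tree, and the I/O steps are just recorded in a list rather than issued.

```python
import threading


class Inode:
    def __init__(self, ino):
        self.ino = ino
        self.lock = threading.Lock()   # held only while this inode is flushed
        self.cached_records = {}       # stand-in for the in-memory red-black tree


class Volume:
    def __init__(self):
        self.media_btree = {}          # stand-in for the media view of the B-Tree
        self.dirty_buffers = set()     # dirty meta-data buffers, written later
        self.io_log = []               # order in which I/O steps are issued


def flush_inode(vol, inode, budget):
    """Move up to `budget` cached records into the media view.

    The inode lock covers only the B-Tree modification, not the disk write.
    Returns True if the inode still has records left (a long-lived flush).
    """
    with inode.lock:
        for key in sorted(inode.cached_records)[:budget]:
            value = inode.cached_records.pop(key)
            if value is None:          # None marks a deletion (assumption)
                vol.media_btree.pop((inode.ino, key), None)
            else:
                vol.media_btree[(inode.ino, key)] = value
            # Simplification: one dirty buffer per touched record.
            vol.dirty_buffers.add((inode.ino, key))
        return bool(inode.cached_records)


def flush_group(vol, flush_list, budget=2):
    """Flush each inode in turn, then run the media-write sequence."""
    next_flush_list = []
    for inode in flush_list:           # one inode at a time
        if flush_inode(vol, inode, budget):
            next_flush_list.append(inode)   # partial progress, carry over

    # No inode lock is held from here on.
    vol.io_log.append("write UNDO")
    vol.io_log.append("wait for I/O")
    vol.io_log.append("write volume header")
    vol.io_log.append("wait for I/O")
    vol.io_log.append("async flush of %d meta-data buffers" % len(vol.dirty_buffers))
    vol.dirty_buffers.clear()          # not waited on in this sketch
    return next_flush_list
```

The `budget` parameter is one made-up way to model an inode that only makes partial progress and is moved to the next flush list.
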
The reason for my unclear crash recovery algorithm is, it was not fully
designed when we started chatting, only the basic principles of robust
micro-transaction commit. You forced me to go work out the details of
the high level transaction architecture.

The big points you got me focussed on were, what about mass commits a la
fsync, and how do you sort out the ties that arise between otherwise
independent high level operations, and how do you preserve the ordering
of low level operations to reflect the ordering constraints of the high
level operations? Which is what you were trying to say with your create

I am well down the path to identifying a similar frontend/backend
separation in Tux3. The back end data consists of all the dirty
metadata and file data cache blocks (buffer cache and page cache(s)
respectively) and the "phase" boundaries, which separate the results of
successive large groups of high level operations so that there are no
disk-image dependencies between them.

I now have two nice tools for achieving such a clean separation between
transaction phases:

1) Logical log records
2) Dirty cache block cloning

The latter takes advantage of logical log records to limit the effect of
a high level operation to just the leaf blocks affected. Where two
operations that are supposed to be in two different transaction phases
affect the same leaf block(s) then the block(s) will be cloned to
achieve proper physical separation. So I can always force a clean
boundary between phases without generating a lot of dirty blocks to do
it, an important property when it comes to anti-deadlock memory
provisions.

The payoff for a clean frontend/backend separation is, as you say, a
clean design. The art in this is to accomplish the separation without
any throughput compromise, which I am willing to believe you have done
in Hammer, and which I now wish to achieve by a somewhat different

I need a means of remembering which blocks are members of which phases,
and an rbtree may b...
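
To make the cloning idea concrete, here is a minimal sketch of tracking phase membership and cloning a dirty block when a later phase touches it. It is not Tux3 code: `Block`, `PhaseCache` and their methods are hypothetical names, and a plain dict per phase is used for the membership lookup purely as an assumption for the illustration.

```python
class Block:
    def __init__(self, blocknr, data=b""):
        self.blocknr = blocknr
        self.data = bytearray(data)    # bytearray(...) copies its argument


class PhaseCache:
    def __init__(self):
        self.phase = 0                 # phase the front end is currently filling
        self.members = {0: {}}         # phase -> {blocknr: Block} (assumed layout)

    def end_phase(self):
        """Close the current phase; return its number for the back end to flush."""
        closed = self.phase
        self.phase += 1
        self.members[self.phase] = {}
        return closed

    def get_for_write(self, blocknr):
        """Return the block the current phase may modify, cloning if needed."""
        current = self.members[self.phase]
        if blocknr in current:
            return current[blocknr]
        # Look for the newest earlier phase that still holds this block dirty.
        for phase in sorted(self.members, reverse=True):
            if phase < self.phase and blocknr in self.members[phase]:
                old = self.members[phase][blocknr]
                block = Block(blocknr, old.data)   # clone: older phase keeps its image
                break
        else:
            block = Block(blocknr)     # not dirty in any earlier phase
        current[blocknr] = block
        return block

    def retire(self, phase):
        """Forget a phase once its blocks have reached disk."""
        return self.members.pop(phase)
```

For example, if block 7 is dirtied in phase 0 and again after `end_phase()`, the second `get_for_write(7)` returns a separate copy, so the phase 0 image can go to disk unchanged while the front end keeps modifying the clone.
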
