On Monday 28 July 2008 09:58, Matthew Dillon wrote:
Then I need to rewrite the post so it seems as simple as it is :-)
> How would you deal with incidental
In this case a log transaction is created containing all of the split
nodes as physical updates and possibly some merged nodes from the free
tree. Btree splitting can always be committed atomically and
independently of any other activity on the filesystem. It is nicely
bounded by the btree depth. Allocations that do not require splitting
the free tree are logged logically, leaving the affected free tree
blocks dirty in cache. So the allocation updates come "for free",
being tucked into the same commit block that governs the split btree
leaves.
> Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
I now see how UNDO works, thankyou for your patient explanations. The
point I overlooked is, fsync (and friends) is indeed the only barrier
you have to worry about.
Still, I think it is good to try to get as much as possible of what was
going on in a bit burst of activity with no fsync durably onto disk.
Redo will clearly leave the filesystem in a consistent state less far
back in time than undo.
So my general strategy is to log big changes like splits as "physical"
full block updates and small ones like allocating an extent as logical
edit records in the commit blocks of the physical updates.
The complexity you noted above is due to making sure that the on-disk
image of a block is always what the logical edit record expects it
to be at the time the logical edit is replayed.
> Having to allocate new blocks for B-Tree deletions could create a big
Yes, allocate-in-free is usually nasty. I need to ensure that there is
always a reserve of blocks available on disk that is more than the
maximum possible transaction that can be accepted. This simple idea
can get really complex, I know. One nice trick to simplify things a
little is to have a really generous limit on the maximum number of
dirty blocks that are allowed, when the filesystem has lots of free
space, and shrink that limit progressively as free space approaches
zero.
I now see what you were driving at with your point about interlocking
namespace transactions, etc. While each VFS transaction can indeed be
committed on its own, atomically and independently, to do so would be
death for throughput. So it is necessary to batch up a lot of these
transactions, and be mindful of the interdependencies between the VFS
transactions and the underlying data structures. The rule is,
underlying data changes required by any given VFS transaction must
never lie in more than one commit. This can be accomplished without
actually tracking the physical representation interdependencies between
VFS transactions. Instead, just count the dirty cache blocks and send
out an atomic commit for the entire set of transactions when some
threshold is passed. Now the challenge is to figure out how to avoid
stalling during the commit.
It is thanks to you, your searching questions and the example of Hammer
that I was forced to understand these issues clearly. :-)
Note! In Linux, VFS means "Virtual Filesystem Switch", that is, the
security, synchronization and abstraction layer that exists above
filesystems in Linux. I think it means "Virtual File System" to you,
which is just "filesystem" to Linux hackers.
Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3@tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Alan Cox | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
| Jan Engelhardt | intel iommu (Re: -mm merge plans for 2.6.23) |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | Re: [GIT]: Networking |
| Evgeniy Polyakov | Re: [BUG] New Kernel Bugs |
