Ok, here's the final design document that I am now implementing.
Again, I expect most or all of these features to be ready and the
filesystem to be beta-quality by the December release.
Hammer Filesystem
(I) General Storage Abstraction
HAMMER uses a basic 16K filesystem buffer for all I/O. Buffers are
collected into clusters, clusters are collected into volumes, and a
single HAMMER filesystem may span multiple volumes.
HAMMER maintains a small hinted radix tree for block management in
each layer. A small radix tree in the volume header manages cluster
allocations within a volume, one in the cluster header manages buffer
allocations within a cluster, and most buffers (pure data buffers
excepted) will embed a small tree to manage item allocations within
the buffer.
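To make the layered allocation idea concrete, here is a minimal sketch of a hinted allocator. It is not HAMMER's on-disk structure: the design only says each layer uses a small hinted radix tree, so the two-level layout, the fan-out of 32, and the names `HintedRadixAllocator`, `alloc`, and `free` are all assumptions made for illustration.

```python
class HintedRadixAllocator:
    """Two-level radix bitmap with a search hint (illustrative only)."""

    LEAF_BITS = 32  # hypothetical fan-out: slots tracked per leaf

    def __init__(self, nslots):
        self.nslots = nslots
        nleaves = (nslots + self.LEAF_BITS - 1) // self.LEAF_BITS
        self.leaves = [0] * nleaves     # leaf level: bit set = slot in use
        self.full = [False] * nleaves   # root level: True = leaf has no free slot
        self.hint = 0                   # leaf index where the next search starts

    def _capacity(self, leaf):
        # The last leaf may cover fewer than LEAF_BITS slots.
        return min(self.LEAF_BITS, self.nslots - leaf * self.LEAF_BITS)

    def alloc(self):
        """Return a free slot number, or None if everything is allocated."""
        n = len(self.leaves)
        for i in range(n):
            leaf = (self.hint + i) % n
            if self.full[leaf]:
                continue                # root level lets us skip full leaves
            cap = self._capacity(leaf)
            for bit in range(cap):
                if not self.leaves[leaf] & (1 << bit):
                    self.leaves[leaf] |= 1 << bit
                    self.full[leaf] = self.leaves[leaf] == (1 << cap) - 1
                    self.hint = leaf    # remember where free space was found
                    return leaf * self.LEAF_BITS + bit
        return None

    def free(self, slot):
        """Release a slot and pull the hint back so it is found again."""
        leaf, bit = divmod(slot, self.LEAF_BITS)
        self.leaves[leaf] &= ~(1 << bit)
        self.full[leaf] = False
        self.hint = min(self.hint, leaf)


allocator = HintedRadixAllocator(40)
first = allocator.alloc()
second = allocator.alloc()
allocator.free(first)
reused = allocator.alloc()  # the hint steers the search back to the freed slot
```

The same shape could in principle be stacked three times, once per layer (clusters in a volume, buffers in a cluster, items in a buffer), which is the arrangement the paragraph above describes.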
Volumes are typically specified as disk partitions, with one volume
designated as the root volume containing the root cluster. The root
cluster does not need to be contained in volume 0 nor does it have to
be located at any particular offset.
Data can be migrated on a cluster-by-cluster or volume-by-volume basis
and any given volume may be expanded or contracted while the filesystem
is live. Whole volumes can be added and (with appropriate data
migration) removed.
HAMMER's storage management limits it to 32768 volumes, 32768 clusters
per volume, and 32768 16K filesystem buffers per cluster. A volume
is thus limited to 16TB and a HAMMER filesystem as a whole is limited
to 524288TB. HAMMER's on-disk structures are designed to allow future
expansion through expansion of these limits. In particular, the volume
id is intended to be expanded to a full 32 bits in the future and using
a larger buffer size will also greatly increase the cluster and volume
size limitations by increasing the number of elements the buffer-
restricted radix trees can manage.
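The size limits quoted above follow directly from the three 32768 counts and the 16K buffer size. The short calculation below reproduces them; the constant names are made up for this sketch, and binary units (1TB = 1024^4 bytes) are assumed.

```python
# Limits as stated in the design document.
BUFFER_SIZE = 16 * 1024            # 16K filesystem buffer
BUFFERS_PER_CLUSTER = 32768
CLUSTERS_PER_VOLUME = 32768
VOLUMES_PER_FILESYSTEM = 32768

MB = 1024 ** 2
TB = 1024 ** 4

cluster_bytes = BUFFER_SIZE * BUFFERS_PER_CLUSTER
volume_bytes = cluster_bytes * CLUSTERS_PER_VOLUME
filesystem_bytes = volume_bytes * VOLUMES_PER_FILESYSTEM

print(cluster_bytes // MB, "MB per cluster")        # 512 MB
print(volume_bytes // TB, "TB per volume")          # 16 TB
print(filesystem_bytes // TB, "TB per filesystem")  # 524288 TB
```

The per-cluster figure of 512MB is not stated in the text, but it is implied by the other two numbers.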
HAMMER breaks all of its ...

Hi, I hope this question has not been implicitly answered before, but how does Hammer handle quotas? Filesystems like XFS and ZFS maintain quota information internally so that a quotacheck after a system crash does not take ages. It seems to me that Hammer could manage quotas as a part of its cluster allocation strategy. Is this the case? TIA, RIggs
Wow, this seems pretty good. What about data corruption issues? Have you thought about implementing some sort of checksumming mechanism? We cannot assume hardware to be absolutely reliable. There may be some silent corruption going on in the disk or network layers, etc... More on this in this article: http://kerneltrap.org/Linux/Data_Errors_During_Drive_Communication -- Francois Tigeot
Quoting from Matt's announcement: " All information in a HAMMER filesystem is CRCd to detect corruption." 'All' So the question - if there is one - is 'how good' that check is. Otherwise, not the fs' job. It *must* presume a 'generally reliable' environment beyond a certain point. Error prevention, detection, (possible) correction, and friends more properly should exist in the storage hardware, I/O, and link layers. As they do. Or do not. .. just as the article you cited points out.... hardware and driver selection issues, or even suboptimal silicon. Bill
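For readers wondering what "CRCd to detect corruption" buys in practice, here is a minimal sketch. The thread does not say which CRC HAMMER uses or where it is stored, so `zlib.crc32`, the 4-byte prefix layout, and the helper names `seal` and `verify` are stand-ins chosen for illustration.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32 (hypothetical layout)."""
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload


def verify(block: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the stored one."""
    stored = int.from_bytes(block[:4], "big")
    return zlib.crc32(block[4:]) == stored


block = seal(b"x" * 16)

# Flip one bit in the payload to simulate silent corruption.
corrupted = block[:10] + bytes([block[10] ^ 1]) + block[11:]

print(verify(block))      # intact block passes
print(verify(corrupted))  # single-bit corruption is caught
```

A CRC of this kind detects corruption but cannot repair it, which is the distinction the replies above are drawing between detection in the filesystem and correction elsewhere in the stack.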
According to Matt's design document:
"All information in a HAMMER filesystem is CRCd to detect
corruption."
Regards,
Michael

Any specific reason not to go with a B+-Tree or B#-Tree, which have been shown
to have advantageous effects?
Also, what, if any, will be the locking policy for multiple I/O threads
accessing a single HAMMER filesystem? Shared/exclusive mutexes on the
cluster level?
(Whoever invented early mornings does not deserve brownie points),
--
Thomas E. Spanjaard
tgen@netphreax.net

Interesting. What exactly are those database files used for? Is a database file attached to each file to store ACLs, for example? Or can it be used like btree(3)? Do they have their own namespace in the filesystem? Wow! I am really looking forward to trying out HAMMER!!! Regards, Michael
Because HAMMER uses a B-Tree (maybe a B+Tree the more I look at it)..
in any case, because HAMMER uses a B-Tree all lookups are basically
key searches, even when looking up an offset in a file. Since B-Tree
elements specify records which can reference variable-length data,
there really is very little difference between a database record
indexed with a key and regular file data indexed with an offset.
Records are typed so any given filesystem object can contain multiple
key spaces. One space will hold ACLs, one will be for regular file
offsets, and there's nothing preventing us from having a key space
directly accessible by userland.
A HAMMER-aware database would be able to store its records using the
key space directly. It opens up some intriguing possibilities.
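The "everything is a key search" idea can be sketched as follows. This is not HAMMER's B-Tree: a sorted list searched with `bisect` stands in for the tree, and the record type constants, the `(object, type, key)` tuple layout, and the `KeyedStore` class are hypothetical names chosen for illustration.

```python
import bisect

# Hypothetical record types; the post only says that records are typed.
REC_DATA = 1   # regular file data, keyed by byte offset
REC_ACL = 2    # ACL entries
REC_USER = 3   # a key space exposed directly to userland


class KeyedStore:
    """Sorted-list stand-in for a B-Tree keyed by (object, type, key)."""

    def __init__(self):
        self._keys = []
        self._vals = []

    def insert(self, obj_id, rec_type, key, data):
        k = (obj_id, rec_type, key)
        i = bisect.bisect_left(self._keys, k)
        if i < len(self._keys) and self._keys[i] == k:
            self._vals[i] = data            # overwrite an existing record
        else:
            self._keys.insert(i, k)
            self._vals.insert(i, data)

    def lookup(self, obj_id, rec_type, key):
        """Return (base_key, data) for the greatest key <= key, or None."""
        i = bisect.bisect_right(self._keys, (obj_id, rec_type, key)) - 1
        if i >= 0 and self._keys[i][:2] == (obj_id, rec_type):
            return self._keys[i][2], self._vals[i]
        return None

    def read_at(self, obj_id, offset):
        """A file read is just a key search in the data key space."""
        hit = self.lookup(obj_id, REC_DATA, offset)
        if hit is None:
            return None
        base, data = hit
        rel = offset - base
        return data[rel:] if rel < len(data) else None


store = KeyedStore()
store.insert(7, REC_DATA, 0, b"hello ")     # file data at offset 0
store.insert(7, REC_DATA, 6, b"world")      # file data at offset 6
store.insert(7, REC_ACL, 0, b"owner:rw")    # an ACL record on the same object
store.insert(7, REC_USER, 42, b"db row")    # a userland database record

print(store.read_at(7, 8))                  # b'rld'
print(store.lookup(7, REC_USER, 42))        # (42, b'db row')
```

Because the record type is part of the key, file offsets, ACLs, and userland records for one object sit side by side in the same index without colliding.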
-Matt
Matthew Dillon
<dillon@backplane.com>

.. including implementing a ZFS-like DB/fs crossbreed *ATOP* HAMMER. Or a Venti workalike (surpassed already on feature-set, but not storage efficiency AFAICS). IF one really wanted to do either badly enough. And had a need. Too complicated for my taste, but PostgreSQL data store might be another matter entirely. Bill
*snip* Matt, Awesome! Tells me: "ZFS, bend over, grab your ankles and kiss your an(atomy) 'Goodbye'" From the amount of work that has HAD to go into this, it also tells me you are: A) probably single, or soon will be and B) don't sleep much anyway! ;-) Looking forward to a 'test drive'... Bill Hacker
