Linux: Btrfs, File Data and Metadata Checksums

Submitted by Jeremy
on June 14, 2007 - 8:33am

Chris Mason announced an early alpha release of his new Btrfs filesystem, "after the last FS summit, I started working on a new filesystem that maintains checksums of all file data and metadata." He listed the following features as "mostly implemented": "extent based file storage (2^64 max file size), space efficient packing of small files, space efficient indexed directories, dynamic inode allocation, writable snapshots, subvolumes (separate internal filesystem roots), checksums on data and metadata (multiple algorithms available), very fast offline filesystem check". He listed the following features as yet to be implemented: "object level mirroring and striping, strong integration with device mapper for multiple device support, online filesystem check, efficient incremental backup and FS mirroring". Regarding the current state of the project, Chris said:

"The current status is a very early alpha state, and the kernel code weighs in at a sparsely commented 10,547 lines. I'm releasing now in hopes of finding people interested in testing, benchmarking, documenting, and contributing to the code. I've gotten this far pretty quickly, and plan on continuing to knock off the features as fast as I can. Hopefully I'll manage a release every few weeks or so. The disk format will probably change in some major way every couple of releases."


From: Chris Mason [email blocked]
To:  linux-kernel, [email blocked]
Subject: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date:	Tue, 12 Jun 2007 12:10:29 -0400

Hello everyone,

After the last FS summit, I started working on a new filesystem that
maintains checksums of all file data and metadata.  Many thanks to Zach
Brown for his ideas, and to Dave Chinner for his help on 
benchmarking analysis.

The basic list of features looks like this:

	* Extent based file storage (2^64 max file size)
	* Space efficient packing of small files
	* Space efficient indexed directories
	* Dynamic inode allocation
	* Writable snapshots
	* Subvolumes (separate internal filesystem roots)
	- Object level mirroring and striping
	* Checksums on data and metadata (multiple algorithms available)
	- Strong integration with device mapper for multiple device support
	- Online filesystem check
	* Very fast offline filesystem check
	- Efficient incremental backup and FS mirroring

The ones with marked with * are mostly working, and the others are on
my todo list.  There are more details on the FS design, some benchmarks
and download links here:

http://oss.oracle.com/~mason/btrfs/

The current status is a very early alpha state, and the kernel code
weighs in at a sparsely commented 10,547 lines.  I'm releasing now in
hopes of finding people interested in testing, benchmarking,
documenting, and contributing to the code.

I've gotten this far pretty quickly, and plan on continuing to knock off
the features as fast as I can.  Hopefully I'll manage a release every
few weeks or so.  The disk format will probably change in some major way
every couple of releases.

The TODO list has some critical stuff:

	* Ability to return -ENOSPC instead of oopsing
	* mmap()ed writes
	* Fault tolerance, (EIO, bad metadata etc)
	* Concurrency.  I use one mutex for all operations today
	* ACLs and extended attributes
	* Reclaim dead roots after a crash
	* Various other bits from the feature list above

And finally, here's a quick and dirty summary of the FS design points:

	* One large Btree per subvolume
	* Copy on write logging for all data and metadata
	* Reference count snapshots are the basis of the transaction
	  system.  A transaction is just a snapshot where the old root
	  is immediately deleted on commit
	* Subvolumes can be snapshotted any number of times
	* Snapshots are read/write and can be snapshotted again
	* Directories are doubly indexed to improve readdir speeds

So, please give it a try or a look and let me know what you think.

-chris


From: "Mike Snitzer" [email blocked] Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS Date: Tue, 12 Jun 2007 15:53:03 -0400 On 6/12/07, Chris Mason [email blocked] wrote: > Hello everyone, > > After the last FS summit, I started working on a new filesystem that > maintains checksums of all file data and metadata. Many thanks to Zach > Brown for his ideas, and to Dave Chinner for his help on > benchmarking analysis. > > The basic list of features looks like this: > > * Extent based file storage (2^64 max file size) > * Space efficient packing of small files > * Space efficient indexed directories > * Dynamic inode allocation > * Writable snapshots > * Subvolumes (separate internal filesystem roots) > - Object level mirroring and striping > * Checksums on data and metadata (multiple algorithms available) > - Strong integration with device mapper for multiple device support > - Online filesystem check > * Very fast offline filesystem check > - Efficient incremental backup and FS mirroring > > The ones with marked with * are mostly working, and the others are on > my todo list. There are more details on the FS design, some benchmarks > and download links here: > > http://oss.oracle.com/~mason/btrfs/ > > The current status is a very early alpha state, and the kernel code > weighs in at a sparsely commented 10,547 lines. I'm releasing now in > hopes of finding people interested in testing, benchmarking, > documenting, and contributing to the code. > > I've gotten this far pretty quickly, and plan on continuing to knock off > the features as fast as I can. Hopefully I'll manage a release every > few weeks or so. The disk format will probably change in some major way > every couple of releases. > > The TODO list has some critical stuff: > > * Ability to return -ENOSPC instead of oopsing > * mmap()ed writes > * Fault tolerance, (EIO, bad metadata etc) > * Concurrency. I use one mutex for all operations today > * ACLs and extended attributes > * Reclaim dead roots after a crash > * Various other bits from the feature list above > > And finally, here's a quick and dirty summary of the FS design points: > > * One large Btree per subvolume > * Copy on write logging for all data and metadata > * Reference count snapshots are the basis of the transaction > system. A transaction is just a snapshot where the old root > is immediately deleted on commit > * Subvolumes can be snapshotted any number of times > * Snapshots are read/write and can be snapshotted again > * Directories are doubly indexed to improve readdir speeds > > So, please give it a try or a look and let me know what you think. Chris, Given the substantial work that you've already put into btrfs and the direction you're Todo list details; it feels as though Btrfs will quickly provide the features that only Sun's ZFS provides. Looking at your Btrfs benchmark and design pages it is clear that you're motivation is a filesystem that addresses modern concerns (performance that doesn't degrade over time, online fsck, fast offline fsck, data/metadata checksums, unlimited snapshots, efficient remote mirroring, etc). There is still much "Todo" but you've made very impressive progress for the first announcement! I have some management oriented questions/comments. 1) Regarding the direction of Btrfs as it relates to integration with DM. The allocation policies, the ease of configuring DM-based striping/mirroring, management of large pools of storage all seems to indicate that Btrfs will manage the physical spindles internally. This is very ZFS-ish (ZFS pools) so I'd like to understand where you see Btrfs going in this area. Your initial benchmarks were all done ontop of a single disk with an LVM stack yet your roadmap/todo and design speaks to a tighter integration of the volume management features. So long term is traditional LVM/MD functionality to be pulled directly into Btrfs? 2) The Btrfs notion of subvolumes and snapshots is very elegant and provides for a fluid management of the filesystem system data. It feels as though each subvolume/snapshot is just folded into the parent Btrfs volumes' namespace. Was there any particular reason you elected to do this? I can see that it lends itself to allowing snapshots of snapshots. If you could elaborate I'd appreciate it. In practice subvolumes and/or snapshots appear to be implicitly mounted upon creation (refcount of parent is incremented). Is this correct? For snapshots, this runs counter to mapping the snapshots' data into the namespace of the origin Btrfs (e.g. with a .snapshot dir, but this is only useful for read-only snaps). Having snapshot namespaces in terms of monolithic subvolumes puts a less intuitive face on N Btrfs snapshots. The history of a given file/dir feels to be lost with this model. Aside from folding snapshot history into the origin's namespace... It could be possible to have a mount.btrfs that allows subvolumes and/or snapshot volumes to be mounted as unique roots? I'd imagine a bind mount _could_ provide this too? Anyway, I'm just interested in understanding the vision for managing the potentially complex nature of a Btrfs namespace. Thanks for doing all this work; I think the Linux community got a much needed shot in the arm with this Btrfs announcement. regards, Mike
From: Chris Mason [email blocked] Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS Date: Tue, 12 Jun 2007 16:14:39 -0400 On Tue, Jun 12, 2007 at 03:53:03PM -0400, Mike Snitzer wrote: > On 6/12/07, Chris Mason [email blocked] wrote: > >Hello everyone, > > > >After the last FS summit, I started working on a new filesystem that > >maintains checksums of all file data and metadata. Many thanks to Zach > >Brown for his ideas, and to Dave Chinner for his help on > >benchmarking analysis. > > Chris, > > Given the substantial work that you've already put into btrfs and the > direction you're Todo list details; it feels as though Btrfs will > quickly provide the features that only Sun's ZFS provides. > > Looking at your Btrfs benchmark and design pages it is clear that > you're motivation is a filesystem that addresses modern concerns > (performance that doesn't degrade over time, online fsck, fast offline > fsck, data/metadata checksums, unlimited snapshots, efficient remote > mirroring, etc). There is still much "Todo" but you've made very > impressive progress for the first announcement! > > I have some management oriented questions/comments. > > 1) > Regarding the direction of Btrfs as it relates to integration with DM. > The allocation policies, the ease of configuring DM-based > striping/mirroring, management of large pools of storage all seems to > indicate that Btrfs will manage the physical spindles internally. > This is very ZFS-ish (ZFS pools) so I'd like to understand where you > see Btrfs going in this area. There's quite a lot of hand waving in that section. What I'd like to do is work closely with the LVM/DM/MD maintainers and come up with something that leverages what linux already does. I don't want to rewrite LVM into the FS, but I do want to make better use of info about the underlying storage. > > Your initial benchmarks were all done ontop of a single disk with an > LVM stack yet your roadmap/todo and design speaks to a tighter > integration of the volume management features. So long term is > traditional LVM/MD functionality to be pulled directly into Btrfs? > > 2) > The Btrfs notion of subvolumes and snapshots is very elegant and > provides for a fluid management of the filesystem system data. It > feels as though each subvolume/snapshot is just folded into the parent > Btrfs volumes' namespace. Was there any particular reason you elected > to do this? I can see that it lends itself to allowing snapshots of > snapshots. If you could elaborate I'd appreciate it. > Yes, I wanted snapshots to be writable and resnapshottable. It also lowers the complexity to keep each snapshot as a subvolume/tree. subvolumes are only slightly more expensive than a directory. So, even though a subvolume is a large grained unit for a snapshot, you can get around this by just making more subvolumes. > In practice subvolumes and/or snapshots appear to be implicitly > mounted upon creation (refcount of parent is incremented). Is this > correct? For snapshots, this runs counter to mapping the snapshots' > data into the namespace of the origin Btrfs (e.g. with a .snapshot > dir, but this is only useful for read-only snaps). Having snapshot > namespaces in terms of monolithic subvolumes puts a less intuitive > face on N Btrfs snapshots. The history of a given file/dir feels to > be lost with this model. That's somewhat true, the disk format does have enough information to show you that history, but cleanly expressing it to the user is a daunting task. > > Aside from folding snapshot history into the origin's namespace... It > could be possible to have a mount.btrfs that allows subvolumes and/or > snapshot volumes to be mounted as unique roots? I'd imagine a bind > mount _could_ provide this too? Anyway, I'm just interested in > understanding the vision for managing the potentially complex nature > of a Btrfs namespace. One option is to put the real btrfs root into some directory in (/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind outside of that. I wanted to wait to get fancy until I had a better idea of how people would use the feature. > > Thanks for doing all this work; I think the Linux community got a much > needed shot in the arm with this Btrfs announcement. > Thanks for the comments. -chris
From: "John Stoffel" [email blocked] Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS Date: Tue, 12 Jun 2007 23:46:20 -0400 >>>>> "Chris" == Chris Mason [email blocked] writes: Chris> After the last FS summit, I started working on a new filesystem Chris> that maintains checksums of all file data and metadata. Many Chris> thanks to Zach Brown for his ideas, and to Dave Chinner for his Chris> help on benchmarking analysis. Chris> The basic list of features looks like this: Chris> * Extent based file storage (2^64 max file size) Chris> * Space efficient packing of small files Chris> * Space efficient indexed directories Chris> * Dynamic inode allocation Chris> * Writable snapshots Chris> * Subvolumes (separate internal filesystem roots) Chris> - Object level mirroring and striping Chris> * Checksums on data and metadata (multiple algorithms available) Chris> - Strong integration with device mapper for multiple device support Chris> - Online filesystem check Chris> * Very fast offline filesystem check Chris> - Efficient incremental backup and FS mirroring So, can you resize a filesystem both bigger and smaller? Or is that implicit in the Object level mirroring and striping? As a user of Netapps, having quotas (if only for reporting purposes) and some way to migrate non-used files to slower/cheaper storage would be great. Ie. being able to setup two pools, one being RAID6, the other being RAID1, where all currently accessed files are in the RAID1 setup, but if un-used get migrated to the RAID6 area. And of course some way for efficient backups and more importantly RESTORES of data which is segregated like this. John
From: Chris Mason [email blocked] Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS Date: Wed, 13 Jun 2007 06:35:22 -0400 On Tue, Jun 12, 2007 at 11:46:20PM -0400, John Stoffel wrote: > >>>>> "Chris" == Chris Mason [email blocked] writes: > > Chris> After the last FS summit, I started working on a new filesystem > Chris> that maintains checksums of all file data and metadata. Many > Chris> thanks to Zach Brown for his ideas, and to Dave Chinner for his > Chris> help on benchmarking analysis. > > Chris> The basic list of features looks like this: > > Chris> * Extent based file storage (2^64 max file size) > Chris> * Space efficient packing of small files > Chris> * Space efficient indexed directories > Chris> * Dynamic inode allocation > Chris> * Writable snapshots > Chris> * Subvolumes (separate internal filesystem roots) > Chris> - Object level mirroring and striping > Chris> * Checksums on data and metadata (multiple algorithms available) > Chris> - Strong integration with device mapper for multiple device support > Chris> - Online filesystem check > Chris> * Very fast offline filesystem check > Chris> - Efficient incremental backup and FS mirroring > > So, can you resize a filesystem both bigger and smaller? Or is that > implicit in the Object level mirroring and striping? Growing the FS is just either extending or adding a new extent tree. Shrinking is more complex. The extent trees do have back pointers to the objectids that own the extent, but snapshotting makes that a little non-deterministic. The good news is there are no fixed locations for any of the metadata. So it is at least possible to shrink and pop out arbitrary chunks. > > As a user of Netapps, having quotas (if only for reporting purposes) > and some way to migrate non-used files to slower/cheaper storage would > be great. So far, I'm not planning quotas beyond the subvolume level. > > Ie. being able to setup two pools, one being RAID6, the other being > RAID1, where all currently accessed files are in the RAID1 setup, but > if un-used get migrated to the RAID6 area. HSM in general is definitely interesting. I'm afraid it is a long ways off, but it could be integrated into the scrubber that wanders the trees in the background. -chris
From: "John Stoffel" [email blocked] Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS Date: Wed, 13 Jun 2007 10:00:56 -0400 >>>>> "Chris" == Chris Mason [email blocked] writes: >> So, can you resize a filesystem both bigger and smaller? Or is that >> implicit in the Object level mirroring and striping? Chris> Growing the FS is just either extending or adding a new extent Chris> tree. Shrinking is more complex. The extent trees do have Chris> back pointers to the objectids that own the extent, but Chris> snapshotting makes that a little non-deterministic. The good Chris> news is there are no fixed locations for any of the metadata. Chris> So it is at least possible to shrink and pop out arbitrary Chris> chunks. That's good to know. Being able to grow (online of course!) is great, but so would shrinking as well. It makes life so much more flexible for the SysAdmins, which is my particular focus... since it's my day job. >> As a user of Netapps, having quotas (if only for reporting purposes) >> and some way to migrate non-used files to slower/cheaper storage would >> be great. Chris> So far, I'm not planning quotas beyond the subvolume level. So let me get this straight. Are you saying that quotas would only be on the volume level, and for the initial level of sub-volumes below that level? Or would *all* sub-volumes have quota support? And does that include snapshots as well? >> Ie. being able to setup two pools, one being RAID6, the other being >> RAID1, where all currently accessed files are in the RAID1 setup, but >> if un-used get migrated to the RAID6 area. Chris> HSM in general is definitely interesting. I'm afraid it is a Chris> long ways off, but it could be integrated into the scrubber Chris> that wanders the trees in the background. Neat. As long as the idea is kept around a bit, that would be nice for an eventual addition. Or maybe someone needs to come up with a stackable filesystems to take care of this... John
From: Chris Mason [email blocked] Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS Date: Wed, 13 Jun 2007 10:54:42 -0400 On Wed, Jun 13, 2007 at 10:00:56AM -0400, John Stoffel wrote: > >>>>> "Chris" == Chris Mason [email blocked] writes: > >> As a user of Netapps, having quotas (if only for reporting purposes) > >> and some way to migrate non-used files to slower/cheaper storage would > >> be great. > > Chris> So far, I'm not planning quotas beyond the subvolume level. > > So let me get this straight. Are you saying that quotas would only be > on the volume level, and for the initial level of sub-volumes below > that level? Or would *all* sub-volumes have quota support? And does > that include snapshots as well? On disk, snapshots and subvolumes are identical...the only difference is their starting state (sorry, it's confusing, and it doesn't help that I interchange the terms when describing features). Every subvolume will have a quota on the number of blocks it can consume. I haven't yet decided on the best way to account for blocks that are actually shared between snapshots, but it'll be in there somehow. So if you wanted to make a snapshot readonly, you just set the quota to 1 block. But, I'm not planning on adding a way to say user X in subvolume Y has quota Z. I'll just be: this subvolume can't get bigger than a given size. (at least for version 1.0). -chris
From: "John Stoffel" [email blocked] Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS Date: Wed, 13 Jun 2007 12:12:23 -0400 >>>>> "Chris" == Chris Mason [email blocked] writes: Chris> On Wed, Jun 13, 2007 at 10:00:56AM -0400, John Stoffel wrote: >> >>>>> "Chris" == Chris Mason [email blocked] writes: >> >> As a user of Netapps, having quotas (if only for reporting purposes) >> >> and some way to migrate non-used files to slower/cheaper storage would >> >> be great. >> Chris> So far, I'm not planning quotas beyond the subvolume level. >> >> So let me get this straight. Are you saying that quotas would only be >> on the volume level, and for the initial level of sub-volumes below >> that level? Or would *all* sub-volumes have quota support? And does >> that include snapshots as well? Chris> On disk, snapshots and subvolumes are identical...the only Chris> difference is their starting state (sorry, it's confusing, and Chris> it doesn't help that I interchange the terms when describing Chris> features). Ok, that's fine. A sub-volume is the unit and depending on it's state, it's either a snapshot of an existing volume, or it's a volume on it's own, though it still has a parent (?) which it is mounted below? Do I have it right now? Chris> Every subvolume will have a quota on the number of blocks it Chris> can consume. I haven't yet decided on the best way to account Chris> for blocks that are actually shared between snapshots, but Chris> it'll be in there somehow. So if you wanted to make a snapshot Chris> readonly, you just set the quota to 1 block. Ok, so you really aren't talking about Quotas here, but space reservations instead. Also, I think you're wrong here when you state that making a snapshot (sub-volume?) RO just requires you to set the quota to 1 block. What is to stop me from writing 1 block to a random file that already exists? Chris> But, I'm not planning on adding a way to say user X in Chris> subvolume Y has quota Z. I'll just be: this subvolume can't Chris> get bigger than a given size. (at least for version 1.0). Ok, so version 1.0 isn't as interesting to me in a production environment, since we pretty much need quotas (or a quick way to monitor how much space a user has been allocated on a volume. But for a home system, it's certainly looking interesting as well, since I could give each home directory it's own sub-volume and just grow/shrink them as needed. Maybe. :] Thanks for your work on this. John
From: Chris Mason [email blocked] Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS Date: Wed, 13 Jun 2007 12:34:05 -0400 On Wed, Jun 13, 2007 at 12:12:23PM -0400, John Stoffel wrote: > >>>>> "Chris" == Chris Mason [email blcoked] writes: [ nod ] > Also, I think you're wrong here when you state that making a snapshot > (sub-volume?) RO just requires you to set the quota to 1 block. What > is to stop me from writing 1 block to a random file that already > exists? It's copy on write, so changing one block means allocating a new one and putting the new contents there. The old blocks don't become available for reuse until the transaction commits. -chris

Related Links:

Easy?

Anonymous (not verified)
on
June 14, 2007 - 10:52am

He seems to be able to get a filesystem that is "current status is a very early alpha state" into the kernel.

But Hans Reiser couldn't get Reiser4 into the kernel?

The point is: how much

Anonymous (not verified)
on
June 14, 2007 - 11:35am

The point is: how much impact does it have on the rest of the kernel? If it changes semantics of the whole VFS-Layer it will be harder to push through than if there are no changes.

Not that easy ;)

Anonymous (not verified)
on
June 14, 2007 - 12:25pm

Btrfs is not in the kernel, only posted for discussion.

why there is no lkml link?

Anonymous (not verified)
on
June 14, 2007 - 12:49pm

why there is no lkml link?

where can i download source?

Anonymous (not verified)
on
June 14, 2007 - 3:04pm

I can imagine where this goes next:
CRC/md5sum/whatnot for a HUGE file totally screws you up when you perform file checks. SO: does the FS have a check for each 'block' (whatever that may mean to a device) or is it a single check for a huge file? Remember that the CRC, like the MD5 hash, does not produce unique results so the size of block checked must be significantly smaller than the size expected to start causing problems.

Now, with a check on blocks rather than the entire file, a file checking system can use tiny amounts of time and use some (not yet invented) algorithm to slowly check the entire filesystem. It would also be possible to check data as it is requested by programs (although it's crazy to do this all the time on servers because it really kills performance).

Maybe one day HDs will have more sophisticated dedicated checking hardware ...

That's exactly what ZFS

Anonymous (not verified)
on
June 14, 2007 - 5:22pm

That's exactly what ZFS does, it keeps checksums on a per-block basis and it verifies them every time a block is read.

And the "not yet invented" algorithm to check the entire filesystem that you mentioned is also in ZFS, which is actually pretty simple - it just traverses (scrubs) the whole disk and sees if the checksums match on every block.

actually hard drives today

Anonymous (not verified)
on
June 14, 2007 - 11:38pm

actually hard drives today already have that, and not just error detection but error *correction* mechanisms. They'd be a lot less reliable than they are if they didn't.

link

Blocklevel Checksum

Anonymous (not verified)
on
June 14, 2007 - 11:04pm

I always find it amusing how Sun-fanboys think that checksums at block level are some amazing new technology. My Amiga did that in 1988 :-o.

http://www.osdev.org/wiki/FFS_(Amiga)

NOT block level checksum

Anonymous (not verified)
on
June 14, 2007 - 11:25pm

The good thing is that the checksum is not block based. If you'd read the wrong block the checksum would be OK, but still the data would be the wrong data.

ZFS (and I hope btrfs) stores the checksum in the inode (which makes it part of the chain). The inode is read anyway when accessing the file, so the checksum is available for free.

This checksum mechanism goes up all the way to the root. All checksums match: you know you didn't have any data corruption or misaddressing. Also, after a crash and fsck, if an old inode comes up, or an old block, it can be seen the data is not the right one.

If you have whole file

Anonymous (not verified)
on
June 15, 2007 - 12:01am

If you have whole file chekcsum, then append to this (big) file is pain. You must recalculate checksum for all file. This is long operation. Lets say we have a big log file...
This means append to file is operation O(n). This is same thing like do append on tape device.
Better/clever way to chekcsum every block. And maybe checksum all checksums.

Depends on the algorithm

Mr_Z
on
June 15, 2007 - 4:55pm

You can recompute the CRC on a file without scanning the whole file. You can change any byte you want or even append to the file or truncate it, and you can successfully recompute the CRC taking into account only the new data, and in the case of a change to the file, the data it replaces or that got truncated. CRCs are linear codes, which is what makes that possible. Other linear codes will have the same property.

CRCs are computed approximately as follows:

sum = (sum * constant1 + next_byte) modulo constant2

To remove a byte's impact on the checksum, you need to know how far from the end it was. Say it's 100 bytes from the end. You simply subtract (constant1^100 * old_byte) from the checksum, modulo constant2. That's a nearly constant time operation. (Actually it's O(log n), where n is the distance to end of file, and that's to compute the exponentiation. The ^ in the preceding is exponentiation.) To add in a new byte in that position, you add (constant1^100 * new_byte) to the checksum, modulo constant2. The two steps can be combined as (sum + constant1^100 * (new_byte - old_byte)) modulo constant2.

Turns out that this scales pretty well, too. Given the CRC checksums of an old block and a new block, the checksum for the entire file, and the position of the block in the file, you can actually update the file-level checksum by this method. You don't even need to know the data in the block, just the block-level checksum.

(CRCs are computed in a Galois field, so it may not be immediately obvious that you're doing multiplication and addition when you look at the implementation. Galois fields wrap around, so the modulo constant2 is kinda built in.)

If you change 1 byte in a

Anonymous (not verified)
on
June 17, 2007 - 3:30pm

If you change 1 byte in a file on ZFS, only that 1 byte will have to be scanned; not the while file. So I'd bet ZFS does it the way you describe. You can put the whoel checksumming off btw, but the overhead is not that much (Google will show you some benchmarks).

I'm rather worried about incompatibilities between the *BSD/*Solaris and Linux camps though. Also, ZFS is (almost) production-ready whereas the Linux solution is not quite there yet.

What ZFS lacks is file/disk encryption. In Linux this should be easy to implement via dm-crypt. On FreeBSD it is easy to implement via GELI. OpenBSD seems more focussed on firewalls and routers these days.

Interesting times. :)

ZFS has this.

Anonymous (not verified)
on
June 17, 2007 - 3:48pm

ZFS has this.

At last! Great! We need

gapsf (not verified)
on
June 15, 2007 - 2:47am

At last! Great!
We need modern FS with the best features of XFS, ZFS and others. No compromises! Only the best technologies and ideas in one FS.
- Checksums on data and metadata (multiple algorithms available) - great!
- Online filesystem check - great!
- Copy on Write - great!
- The disk format allows Btrfs to deal with most corruptions at run time, without crashing the system and without requiring offline filesystem checking - great!
Good luck!

PS
What is max size of filesystem?
What about something like ZFS's RAID-Z2, redundant copies for data (http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection)?

The post says "Extent based

Anonymous (not verified)
on
June 15, 2007 - 7:05am

The post says "Extent based file storage (2^64 max file size)", which I take to mean that a file can be 2^64 blocks. It might be 2^64 bits, but either way that's large enough files for now. It prompts a question, though, in line with the assumptions made by ZFS regarding storage capacities (that the pace of storage increase predicts that 64-bit filesystems will be exhausted in around 20 years) which spurred their adoption of a 128-bit model: why not a larger block-number counter?

Hmmm

Mr_Z
on
June 15, 2007 - 5:06pm

If I write 4GB per second (2^32), it'd still take me 2^32 seconds to exhaust 2^64. At that rate it'll take me over 130 years. And that's for one file. Current computers can only write about that fast to RAM.

The mind boggles.

Then again, when MS Word 2027 comes out....

This is Linux, not Windows or Solaris.

Anonymous (not verified)
on
June 17, 2007 - 9:40am

In 20 years, when people start hitting the wall, just write a new file system. At the current rate of change, popular Linux file systems only last about 10 years anyway. 20 years from now we will all be moving to btrfs's successor, which will solve this problem, as well as many others we are sure to find along the way.

Is this reiserfs+ ?

umka
on
June 16, 2007 - 4:04am

All this looks to me similar to reiserfs ;) Why to work on a cycle which probably will have square wheels? ;) Why not to expand existing FS?

--
umka

Seek-optimization

Anonymous (not verified)
on
June 16, 2007 - 11:57am

One thing where ReiserFS always stood out, and Reiser4 improves this even further, is seek-optimization, especially for writes. (Future read patterns are impossible to predict precisely, so read seek-optimization is impossible. The write pattern is obviously known at write; However no other Linux filesystem approaches Reiser's 99% efficient seek-optimization by far.)
BSD-LFS (ie in NetBSD) achieves 100% efficient write seek-optimization, but that is natural as it's a log-structured filesystem. ZFS also comes close, however it's nowhere near Reiser4.

If Btrfs comes close to Reiser4 here, I am sold. However if it does not (and that's not something that can be tuned once the disk layout format is set), I think I will use Reiser4 instead, as it supports all features I would use in Btrfs already. Now I really hope the ReiserFS development process will resume...

Re: Seek-optimization

umka
on
June 17, 2007 - 5:26am

In fact Reiser4 takes even better care of seeks due to using so called "dancing trees". This means, that any balanced tree change is done without taking into account final block numbers, and only later, at the flush time, internal tree gets block numbers assigned to new and old nodes so that order "parent first" is preserved. This improves tree traversal a lot. Additionally, reiser4 tends to keep data and metadata nodes in same order as names in directory, this also was found to be very useful. There is also much more features: fibers, re-allocations, etc.

--
umka

secure file sistems

werner (not verified)
on
June 17, 2007 - 11:27am

I want to suggest something 'very stupid', but warranting to almost 100% that after accidents all remains of files still existing can be recovered.

Include an option, that the user can enable, if he want, that in each sector, at the end (f.ex. in the last 6 or 8 bytes) is wroten a unique id number of the file to what it belongs.

Typically, the number of files on a HD is a few millions, so that 6 or 8 byte are sufficiently. This would increase the size by 1.5% , but the security very.

After a bad crash, one simply scan the HD, and put all sectors with the same ID-no. together, to get back his files.

Its less important, to get the exact file/path names. However, one could provide this, too, by writing by default, at the last sector after the #EOF mark of the file, its complete name (with path), and perhaps a few additional infos like size, date/time, checksum. Doing it in this simple manner, is suficient and a) would not need bigger sectors for long path names (other than if storaged the whole name in each sector); b) dont have the same weakness like storage this infos for all files into one file or root directory (which can be lost); c) permits to verify the integrity if in the middle are missing sectors.

The id no. for the files, could have within the first digit the partition no., for the case that the partition table also disappears. However, for the most files, which have the same no., during a scan one find out on what partition they were if of each of them the last sector with the EOF mark / name was preserved.

Not that simple

Mr_Z
on
June 17, 2007 - 10:54pm

You wouldn't be able to implement mmap() very easily on such a filesystem because the blocks are no longer power-of-2 sized. This would complicate the VM vs. filesystem interaction somewhat. Ordinarily, system pages cover a small integer number of filesystem blocks. This makes it very easy to mmap() them. Executable pages from programs and libraries use pretty much the same mechanism. The only special case that comes up often is tail-packing.

Also, you'd need a sequence number in addition to the ID. Sure, the ID number gets you the file, but you also need to know where that block fits in the file.

These problems aren't unsolvable, but I'm not sure the solution is terribly light weight.

Filesystem with random access insertion support.

David Pottage (not verified)
on
June 18, 2007 - 7:40am

One new feature that I think it would be very useful to have in a new filesystem would be the ability to insert and remove data at any point in the file, and have the rest of the move to accommodate it, in the same way that you can add or remove lines in a text editor.

I know that the reason this feature was not available earlier computers was that the underlying filesystem records did not support it, so the only way to insert bytes into a file would be to move all the other data in that file, (a job that could just as easily be done in userland), however modern filesystems support fragmentation, and they are often designed so that files larger than a certain size will always be fragmented, even if those fragments are consecutive on the disc. Those fragments can be used to support an insertion API.

Suppose we have a 100 Mb file on disc, which is currently contiguous. Via the new insertion filesystem API, a userland program opens it, seeks to 30Mb, inserts 10Mb of data and closes the file. From the userland perspective, the file is still contiguous, and is now 110Mb long. On disc it is now in three fragments of 30, 10 and 70Mb, with the 30 and 70Mb fragments consecutive on the disc. The filesystem treats this fragmented file the same as any other fragmented file. Data can be removed in a similar way, by truncating some disc fragments and updating the disc records that describe how the join up.

The big advantage of such a feature is that it would make it much easer to write multimedia editing programs, and they would run much faster. Currently if you edit a large video file, and cut out sections, there is a wait of several minutes while your edits are saved. With this feature it would be done in a fraction of a second.

An obvious criticism of this plan is that it will lead to a lot more fragmented files, that need defragging, FS performance will be harmed, and that filesystem defrag utilities will end up doing the job that should have been done by the userland program that created the fragments in the first place.

Firstly, I don’t thing the extra fragmentation will be a serous problem. For large files (that this proposal is aimed at), file read and write times will be dominated by the data transfer time from the disc, rather than the disc head seek time. For small files, it will take very little time to defrag them if necessary and that defrag could be done automatically when the file is closed after an insertion write.

Secondly while a background defrag demon would be needed, and that utility will do the work that would otherwise be done by the foreground userland program, it would run in the background, not while the user is staring at an hourglass or percent complete bar. Also such a demon would combine the effects of several edits so less work is done overall. The demon could be tuned to defrag small frequently accessed files over large infrequently accessed one. I would expect the defrag demon to run in kernel space as an idle priority thread of the filesystem.

Another criticism is that a new API would be needed, (or an extension of existing APIs), and that this would require VFS extensions that would affect all filesystems in Linux. This is a valid criticism, but as I think this is a potentially very useful feature, I think it is worth trying to find a way to support it. Obviously other filesystems will not support it, either because it is not implemented, or because it is technically impossible, but those filesystems (or the VFS layer) could be modified to just return an exception if any program attempted to use this new insertion API. Programs that wanted to use the insertion API would have to check if the underlying FS supported it before using it.

This should be done at a higher level

Mr_Z
on
June 18, 2007 - 8:41am

If a particular class of files benefits greatly from splicing, it seems like that could be done at a higher level, such as in a library or what not. Even MS Word documents are stored as a database of text fragments that are linked together with an index table, to support precisely what you suggest.

You say:

Firstly, I don’t thing the extra fragmentation will be a serous problem. For large files (that this proposal is aimed at), file read and write times will be dominated by the data transfer time from the disc, rather than the disc head seek time.

This is true for large, linear files with few or no edit points. If non-linear editing is so important—which seems to be your original supposition—a large file may have many of these cut points.

Non-linear editing does sound like it would be a good job for an abstraction library, certainly, say combining a metafile and a large flat file. The metafile would track the contents and layout of the flat file. The library would hide the fact that the flat file isn't stored contiguously. (That is, the sequence of bytes seen by the app is not the same as the sequence of bytes seen by "cat flatfile".)

You further say:

For small files, it will take very little time to defrag them if necessary and that defrag could be done automatically when the file is closed after an insertion write.

If the files are small, then the cost to defrag them is very similar to the cost of rewriting them to begin with, as you note. Again, this could be handled at the library level.

Many distributed hard-disks

Anonymous (not verified)
on
June 18, 2007 - 9:41am

Many distributed hard-disks over a network of hard-disks ATA-over-ethernet.

I need an efficient RAID-5 distributed filesystem, don't reinvent the old wheel for single PC.

Older operating systems (up

Anonymous (not verified)
on
June 20, 2007 - 3:15pm

Older operating systems (up to VMS) often provided the ability to insert data in files through record level i/o.

Ah yes... strongly typed and structured files

Mr_Z
on
June 22, 2007 - 10:46am

Didn't these OSes enforce the record structure at the OS level, though? Structured files were such a pervasive concept that even my TI-99/4A implemented them. (OPEN #1: "DSK1.FOO", INTERNAL, OUTPUT, FIXED 80. Will some things not leave my brain?) One of UNIX's big "innovations" was to make everything flat streams of bytes and let userspace decide what they meant.

I put "innovation" in quotes because not everyone agreed it was the Right Thing. Look up "Worse is Better" for a different take on the genius (or not) of the UNIX philosophy.

At any rate, you'd have to use a record size of 1 to implement the byte-stream editor concept in such a structured file. Somehow I have a hard time envisioning that as being very efficient.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.