Published on KernelTrap (http://kerneltrap.org)

Linux: Btrfs, File Data and Metadata Checksums

By Jeremy
Created Jun 14 2007 - 11:33

Chris Mason announced an early alpha release of his new Btrfs filesystem: "After the last FS summit, I started working on a new filesystem that maintains checksums of all file data and metadata." He listed the following features as "mostly working": "extent based file storage (2^64 max file size), space efficient packing of small files, space efficient indexed directories, dynamic inode allocation, writable snapshots, subvolumes (separate internal filesystem roots), checksums on data and metadata (multiple algorithms available), very fast offline filesystem check". He listed the following features as yet to be implemented: "object level mirroring and striping, strong integration with device mapper for multiple device support, online filesystem check, efficient incremental backup and FS mirroring". Regarding the current state of the project, Chris said:

"The current status is a very early alpha state, and the kernel code weighs in at a sparsely commented 10,547 lines. I'm releasing now in hopes of finding people interested in testing, benchmarking, documenting, and contributing to the code. I've gotten this far pretty quickly, and plan on continuing to knock off the features as fast as I can. Hopefully I'll manage a release every few weeks or so. The disk format will probably change in some major way every couple of releases."

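To make the headline feature concrete, here is a minimal sketch of per-block checksumming with a selectable algorithm. It is not Btrfs code: the block size, the record layout, and the two algorithms (CRC32 and SHA-256) are assumptions chosen for illustration, since the announcement only says that multiple algorithms are available.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only

# Hypothetical registry of checksum algorithms; the announcement does not
# say which algorithms Btrfs actually offers.
ALGORITHMS = {
    "crc32": lambda data: zlib.crc32(data).to_bytes(4, "big"),
    "sha256": lambda data: hashlib.sha256(data).digest(),
}


def write_block(data, algo="crc32"):
    """Return a (data, algo, checksum) record, as if it were stored on disk."""
    return (data, algo, ALGORITHMS[algo](data))


def read_block(record):
    """Recompute the checksum and refuse to return data that does not match."""
    data, algo, stored = record
    if ALGORITHMS[algo](data) != stored:
        raise IOError("checksum mismatch: block is corrupt")
    return data


good = write_block(b"x" * BLOCK_SIZE)

# Simulate silent corruption: change one byte without updating the checksum.
data, algo, stored = good
bad = (b"y" + data[1:], algo, stored)
try:
    read_block(bad)
    detected = False
except IOError:
    detected = True
```

The point of the sketch is only that a read path which verifies a stored checksum turns silent corruption into a reported error.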

From: Chris Mason [email blocked]
To:  linux-kernel, [email blocked]
Subject: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date:	Tue, 12 Jun 2007 12:10:29 -0400

Hello everyone,

After the last FS summit, I started working on a new filesystem that
maintains checksums of all file data and metadata.  Many thanks to Zach
Brown for his ideas, and to Dave Chinner for his help on 
benchmarking analysis.

The basic list of features looks like this:

	* Extent based file storage (2^64 max file size)
	* Space efficient packing of small files
	* Space efficient indexed directories
	* Dynamic inode allocation
	* Writable snapshots
	* Subvolumes (separate internal filesystem roots)
	- Object level mirroring and striping
	* Checksums on data and metadata (multiple algorithms available)
	- Strong integration with device mapper for multiple device support
	- Online filesystem check
	* Very fast offline filesystem check
	- Efficient incremental backup and FS mirroring

The ones marked with * are mostly working, and the others are on
my todo list.  There are more details on the FS design, some benchmarks
and download links here:

http://oss.oracle.com/~mason/btrfs/

The current status is a very early alpha state, and the kernel code
weighs in at a sparsely commented 10,547 lines.  I'm releasing now in
hopes of finding people interested in testing, benchmarking,
documenting, and contributing to the code.

I've gotten this far pretty quickly, and plan on continuing to knock off
the features as fast as I can.  Hopefully I'll manage a release every
few weeks or so.  The disk format will probably change in some major way
every couple of releases.

The TODO list has some critical stuff:

	* Ability to return -ENOSPC instead of oopsing
	* mmap()ed writes
	* Fault tolerance, (EIO, bad metadata etc)
	* Concurrency.  I use one mutex for all operations today
	* ACLs and extended attributes
	* Reclaim dead roots after a crash
	* Various other bits from the feature list above

And finally, here's a quick and dirty summary of the FS design points:

	* One large Btree per subvolume
	* Copy on write logging for all data and metadata
	* Reference count snapshots are the basis of the transaction
	  system.  A transaction is just a snapshot where the old root
	  is immediately deleted on commit
	* Subvolumes can be snapshotted any number of times
	* Snapshots are read/write and can be snapshotted again
	* Directories are doubly indexed to improve readdir speeds

So, please give it a try or a look and let me know what you think.

-chris
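
The design points in the announcement (copy on write for everything, snapshots that share blocks with their origin, a transaction as a snapshot whose old root is dropped) can be sketched with a small path-copying tree. This is a simplified illustration, not Btrfs code: it uses a binary search tree instead of a Btree, relies on Python's garbage collector instead of explicit reference counts, and the names `Node`, `Subvolume`, `insert`, and `lookup` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable tree node: a 'block' that is never modified in place."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return a new root. Nodes on the search path are copied, the rest shared."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)


def lookup(root, key):
    """Walk down from the root; return the value or None if the key is absent."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


class Subvolume:
    def __init__(self, root=None):
        self.root = root

    def snapshot(self):
        # A snapshot is just another reference to the same root.
        return Subvolume(self.root)

    def write(self, key, value):
        # A "transaction": build a new root, then drop the old one on commit.
        self.root = insert(self.root, key, value)

    def read(self, key):
        return lookup(self.root, key)


vol = Subvolume()
for k, v in [(5, "a"), (2, "b"), (8, "c")]:
    vol.write(k, v)

snap = vol.snapshot()
vol.write(2, "changed")            # the origin diverges from the snapshot
untouched_shared = vol.root.right is snap.root.right  # subtree not on the path

snap.write(9, "only in snapshot")  # snapshots are writable
snap2 = snap.snapshot()            # and can be snapshotted again
```

After the writes, `vol` and `snap` see different contents for the keys that changed, while subtrees that neither side touched are still the same objects in memory.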


From: "Mike Snitzer" [email blocked]
Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date: Tue, 12 Jun 2007 15:53:03 -0400

On 6/12/07, Chris Mason [email blocked] wrote:
> Hello everyone,
>
> After the last FS summit, I started working on a new filesystem that
> maintains checksums of all file data and metadata. Many thanks to Zach
> Brown for his ideas, and to Dave Chinner for his help on
> benchmarking analysis.
>
> The basic list of features looks like this:
>
> 	* Extent based file storage (2^64 max file size)
> 	* Space efficient packing of small files
> 	* Space efficient indexed directories
> 	* Dynamic inode allocation
> 	* Writable snapshots
> 	* Subvolumes (separate internal filesystem roots)
> 	- Object level mirroring and striping
> 	* Checksums on data and metadata (multiple algorithms available)
> 	- Strong integration with device mapper for multiple device support
> 	- Online filesystem check
> 	* Very fast offline filesystem check
> 	- Efficient incremental backup and FS mirroring
>
> The ones marked with * are mostly working, and the others are on
> my todo list. There are more details on the FS design, some benchmarks
> and download links here:
>
> http://oss.oracle.com/~mason/btrfs/
>
> The current status is a very early alpha state, and the kernel code
> weighs in at a sparsely commented 10,547 lines. I'm releasing now in
> hopes of finding people interested in testing, benchmarking,
> documenting, and contributing to the code.
>
> I've gotten this far pretty quickly, and plan on continuing to knock off
> the features as fast as I can. Hopefully I'll manage a release every
> few weeks or so. The disk format will probably change in some major way
> every couple of releases.
>
> The TODO list has some critical stuff:
>
> 	* Ability to return -ENOSPC instead of oopsing
> 	* mmap()ed writes
> 	* Fault tolerance, (EIO, bad metadata etc)
> 	* Concurrency. I use one mutex for all operations today
> 	* ACLs and extended attributes
> 	* Reclaim dead roots after a crash
> 	* Various other bits from the feature list above
>
> And finally, here's a quick and dirty summary of the FS design points:
>
> 	* One large Btree per subvolume
> 	* Copy on write logging for all data and metadata
> 	* Reference count snapshots are the basis of the transaction
> 	  system. A transaction is just a snapshot where the old root
> 	  is immediately deleted on commit
> 	* Subvolumes can be snapshotted any number of times
> 	* Snapshots are read/write and can be snapshotted again
> 	* Directories are doubly indexed to improve readdir speeds
>
> So, please give it a try or a look and let me know what you think.

Chris,

Given the substantial work that you've already put into btrfs and the
direction your Todo list details; it feels as though Btrfs will
quickly provide the features that only Sun's ZFS provides.

Looking at your Btrfs benchmark and design pages it is clear that
your motivation is a filesystem that addresses modern concerns
(performance that doesn't degrade over time, online fsck, fast offline
fsck, data/metadata checksums, unlimited snapshots, efficient remote
mirroring, etc). There is still much "Todo" but you've made very
impressive progress for the first announcement!

I have some management oriented questions/comments.

1)
Regarding the direction of Btrfs as it relates to integration with DM.
The allocation policies, the ease of configuring DM-based
striping/mirroring, management of large pools of storage all seems to
indicate that Btrfs will manage the physical spindles internally.
This is very ZFS-ish (ZFS pools) so I'd like to understand where you
see Btrfs going in this area.

Your initial benchmarks were all done on top of a single disk with an
LVM stack yet your roadmap/todo and design speaks to a tighter
integration of the volume management features. So long term is
traditional LVM/MD functionality to be pulled directly into Btrfs?

2)
The Btrfs notion of subvolumes and snapshots is very elegant and
provides for a fluid management of the filesystem system data. It
feels as though each subvolume/snapshot is just folded into the parent
Btrfs volumes' namespace. Was there any particular reason you elected
to do this? I can see that it lends itself to allowing snapshots of
snapshots. If you could elaborate I'd appreciate it.

In practice subvolumes and/or snapshots appear to be implicitly
mounted upon creation (refcount of parent is incremented). Is this
correct? For snapshots, this runs counter to mapping the snapshots'
data into the namespace of the origin Btrfs (e.g. with a .snapshot
dir, but this is only useful for read-only snaps). Having snapshot
namespaces in terms of monolithic subvolumes puts a less intuitive
face on N Btrfs snapshots. The history of a given file/dir feels to
be lost with this model.

Aside from folding snapshot history into the origin's namespace... It
could be possible to have a mount.btrfs that allows subvolumes and/or
snapshot volumes to be mounted as unique roots? I'd imagine a bind
mount _could_ provide this too? Anyway, I'm just interested in
understanding the vision for managing the potentially complex nature
of a Btrfs namespace.

Thanks for doing all this work; I think the Linux community got a much
needed shot in the arm with this Btrfs announcement.

regards,
Mike


From: Chris Mason [email blocked]
Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date: Tue, 12 Jun 2007 16:14:39 -0400

On Tue, Jun 12, 2007 at 03:53:03PM -0400, Mike Snitzer wrote:
> On 6/12/07, Chris Mason [email blocked] wrote:
> >Hello everyone,
> >
> >After the last FS summit, I started working on a new filesystem that
> >maintains checksums of all file data and metadata. Many thanks to Zach
> >Brown for his ideas, and to Dave Chinner for his help on
> >benchmarking analysis.
>
> Chris,
>
> Given the substantial work that you've already put into btrfs and the
> direction your Todo list details; it feels as though Btrfs will
> quickly provide the features that only Sun's ZFS provides.
>
> Looking at your Btrfs benchmark and design pages it is clear that
> your motivation is a filesystem that addresses modern concerns
> (performance that doesn't degrade over time, online fsck, fast offline
> fsck, data/metadata checksums, unlimited snapshots, efficient remote
> mirroring, etc). There is still much "Todo" but you've made very
> impressive progress for the first announcement!
>
> I have some management oriented questions/comments.
>
> 1)
> Regarding the direction of Btrfs as it relates to integration with DM.
> The allocation policies, the ease of configuring DM-based
> striping/mirroring, management of large pools of storage all seems to
> indicate that Btrfs will manage the physical spindles internally.
> This is very ZFS-ish (ZFS pools) so I'd like to understand where you
> see Btrfs going in this area.

There's quite a lot of hand waving in that section. What I'd like to do
is work closely with the LVM/DM/MD maintainers and come up with
something that leverages what linux already does. I don't want to
rewrite LVM into the FS, but I do want to make better use of info about
the underlying storage.

>
> Your initial benchmarks were all done on top of a single disk with an
> LVM stack yet your roadmap/todo and design speaks to a tighter
> integration of the volume management features. So long term is
> traditional LVM/MD functionality to be pulled directly into Btrfs?
>
> 2)
> The Btrfs notion of subvolumes and snapshots is very elegant and
> provides for a fluid management of the filesystem system data. It
> feels as though each subvolume/snapshot is just folded into the parent
> Btrfs volumes' namespace. Was there any particular reason you elected
> to do this? I can see that it lends itself to allowing snapshots of
> snapshots. If you could elaborate I'd appreciate it.
>

Yes, I wanted snapshots to be writable and resnapshottable. It also
lowers the complexity to keep each snapshot as a subvolume/tree.

subvolumes are only slightly more expensive than a directory. So, even
though a subvolume is a large grained unit for a snapshot, you can get
around this by just making more subvolumes.

> In practice subvolumes and/or snapshots appear to be implicitly
> mounted upon creation (refcount of parent is incremented). Is this
> correct? For snapshots, this runs counter to mapping the snapshots'
> data into the namespace of the origin Btrfs (e.g. with a .snapshot
> dir, but this is only useful for read-only snaps). Having snapshot
> namespaces in terms of monolithic subvolumes puts a less intuitive
> face on N Btrfs snapshots. The history of a given file/dir feels to
> be lost with this model.

That's somewhat true, the disk format does have enough information to
show you that history, but cleanly expressing it to the user is a
daunting task.

>
> Aside from folding snapshot history into the origin's namespace... It
> could be possible to have a mount.btrfs that allows subvolumes and/or
> snapshot volumes to be mounted as unique roots? I'd imagine a bind
> mount _could_ provide this too? Anyway, I'm just interested in
> understanding the vision for managing the potentially complex nature
> of a Btrfs namespace.

One option is to put the real btrfs root into some directory in
(/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind
outside of that. I wanted to wait to get fancy until I had a better
idea of how people would use the feature.

>
> Thanks for doing all this work; I think the Linux community got a much
> needed shot in the arm with this Btrfs announcement.
>

Thanks for the comments.

-chris


From: "John Stoffel" [email blocked]
Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date: Tue, 12 Jun 2007 23:46:20 -0400

>>>>> "Chris" == Chris Mason [email blocked] writes:

Chris> After the last FS summit, I started working on a new filesystem
Chris> that maintains checksums of all file data and metadata. Many
Chris> thanks to Zach Brown for his ideas, and to Dave Chinner for his
Chris> help on benchmarking analysis.

Chris> The basic list of features looks like this:

Chris> 	* Extent based file storage (2^64 max file size)
Chris> 	* Space efficient packing of small files
Chris> 	* Space efficient indexed directories
Chris> 	* Dynamic inode allocation
Chris> 	* Writable snapshots
Chris> 	* Subvolumes (separate internal filesystem roots)
Chris> 	- Object level mirroring and striping
Chris> 	* Checksums on data and metadata (multiple algorithms available)
Chris> 	- Strong integration with device mapper for multiple device support
Chris> 	- Online filesystem check
Chris> 	* Very fast offline filesystem check
Chris> 	- Efficient incremental backup and FS mirroring

So, can you resize a filesystem both bigger and smaller? Or is that
implicit in the Object level mirroring and striping?

As a user of Netapps, having quotas (if only for reporting purposes)
and some way to migrate non-used files to slower/cheaper storage would
be great.

Ie. being able to setup two pools, one being RAID6, the other being
RAID1, where all currently accessed files are in the RAID1 setup, but
if un-used get migrated to the RAID6 area.

And of course some way for efficient backups and more importantly
RESTORES of data which is segregated like this.

John


From: Chris Mason [email blocked]
Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date: Wed, 13 Jun 2007 06:35:22 -0400

On Tue, Jun 12, 2007 at 11:46:20PM -0400, John Stoffel wrote:
> >>>>> "Chris" == Chris Mason [email blocked] writes:
>
> Chris> After the last FS summit, I started working on a new filesystem
> Chris> that maintains checksums of all file data and metadata. Many
> Chris> thanks to Zach Brown for his ideas, and to Dave Chinner for his
> Chris> help on benchmarking analysis.
>
> Chris> The basic list of features looks like this:
>
> Chris> 	* Extent based file storage (2^64 max file size)
> Chris> 	* Space efficient packing of small files
> Chris> 	* Space efficient indexed directories
> Chris> 	* Dynamic inode allocation
> Chris> 	* Writable snapshots
> Chris> 	* Subvolumes (separate internal filesystem roots)
> Chris> 	- Object level mirroring and striping
> Chris> 	* Checksums on data and metadata (multiple algorithms available)
> Chris> 	- Strong integration with device mapper for multiple device support
> Chris> 	- Online filesystem check
> Chris> 	* Very fast offline filesystem check
> Chris> 	- Efficient incremental backup and FS mirroring
>
> So, can you resize a filesystem both bigger and smaller? Or is that
> implicit in the Object level mirroring and striping?

Growing the FS is just either extending or adding a new extent tree.
Shrinking is more complex. The extent trees do have back pointers to
the objectids that own the extent, but snapshotting makes that a little
non-deterministic. The good news is there are no fixed locations for
any of the metadata. So it is at least possible to shrink and pop out
arbitrary chunks.

>
> As a user of Netapps, having quotas (if only for reporting purposes)
> and some way to migrate non-used files to slower/cheaper storage would
> be great.

So far, I'm not planning quotas beyond the subvolume level.

>
> Ie. being able to setup two pools, one being RAID6, the other being
> RAID1, where all currently accessed files are in the RAID1 setup, but
> if un-used get migrated to the RAID6 area.

HSM in general is definitely interesting. I'm afraid it is a long ways
off, but it could be integrated into the scrubber that wanders the
trees in the background.

-chris


From: "John Stoffel" [email blocked]
Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date: Wed, 13 Jun 2007 10:00:56 -0400

>>>>> "Chris" == Chris Mason [email blocked] writes:

>> So, can you resize a filesystem both bigger and smaller? Or is that
>> implicit in the Object level mirroring and striping?

Chris> Growing the FS is just either extending or adding a new extent
Chris> tree. Shrinking is more complex. The extent trees do have
Chris> back pointers to the objectids that own the extent, but
Chris> snapshotting makes that a little non-deterministic. The good
Chris> news is there are no fixed locations for any of the metadata.
Chris> So it is at least possible to shrink and pop out arbitrary
Chris> chunks.

That's good to know. Being able to grow (online of course!) is great,
but so would shrinking as well. It makes life so much more flexible
for the SysAdmins, which is my particular focus... since it's my day
job.

>> As a user of Netapps, having quotas (if only for reporting purposes)
>> and some way to migrate non-used files to slower/cheaper storage would
>> be great.

Chris> So far, I'm not planning quotas beyond the subvolume level.

So let me get this straight. Are you saying that quotas would only be
on the volume level, and for the initial level of sub-volumes below
that level? Or would *all* sub-volumes have quota support? And does
that include snapshots as well?

>> Ie. being able to setup two pools, one being RAID6, the other being
>> RAID1, where all currently accessed files are in the RAID1 setup, but
>> if un-used get migrated to the RAID6 area.

Chris> HSM in general is definitely interesting. I'm afraid it is a
Chris> long ways off, but it could be integrated into the scrubber
Chris> that wanders the trees in the background.

Neat. As long as the idea is kept around a bit, that would be nice
for an eventual addition. Or maybe someone needs to come up with a
stackable filesystem to take care of this...

John


From: Chris Mason [email blocked]
Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date: Wed, 13 Jun 2007 10:54:42 -0400

On Wed, Jun 13, 2007 at 10:00:56AM -0400, John Stoffel wrote:
> >>>>> "Chris" == Chris Mason [email blocked] writes:
> >> As a user of Netapps, having quotas (if only for reporting purposes)
> >> and some way to migrate non-used files to slower/cheaper storage would
> >> be great.
>
> Chris> So far, I'm not planning quotas beyond the subvolume level.
>
> So let me get this straight. Are you saying that quotas would only be
> on the volume level, and for the initial level of sub-volumes below
> that level? Or would *all* sub-volumes have quota support? And does
> that include snapshots as well?

On disk, snapshots and subvolumes are identical...the only difference is
their starting state (sorry, it's confusing, and it doesn't help that I
interchange the terms when describing features).

Every subvolume will have a quota on the number of blocks it can
consume. I haven't yet decided on the best way to account for blocks
that are actually shared between snapshots, but it'll be in there
somehow. So if you wanted to make a snapshot readonly, you just set the
quota to 1 block.

But, I'm not planning on adding a way to say user X in subvolume Y has
quota Z. It'll just be: this subvolume can't get bigger than a given
size. (at least for version 1.0).

-chris


From: "John Stoffel" [email blocked]
Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date: Wed, 13 Jun 2007 12:12:23 -0400

>>>>> "Chris" == Chris Mason [email blocked] writes:

Chris> On Wed, Jun 13, 2007 at 10:00:56AM -0400, John Stoffel wrote:
>> >>>>> "Chris" == Chris Mason [email blocked] writes:
>> >> As a user of Netapps, having quotas (if only for reporting purposes)
>> >> and some way to migrate non-used files to slower/cheaper storage would
>> >> be great.
>>
>> Chris> So far, I'm not planning quotas beyond the subvolume level.
>>
>> So let me get this straight. Are you saying that quotas would only be
>> on the volume level, and for the initial level of sub-volumes below
>> that level? Or would *all* sub-volumes have quota support? And does
>> that include snapshots as well?

Chris> On disk, snapshots and subvolumes are identical...the only
Chris> difference is their starting state (sorry, it's confusing, and
Chris> it doesn't help that I interchange the terms when describing
Chris> features).

Ok, that's fine. A sub-volume is the unit and depending on its state,
it's either a snapshot of an existing volume, or it's a volume on its
own, though it still has a parent (?) which it is mounted below? Do I
have it right now?

Chris> Every subvolume will have a quota on the number of blocks it
Chris> can consume. I haven't yet decided on the best way to account
Chris> for blocks that are actually shared between snapshots, but
Chris> it'll be in there somehow. So if you wanted to make a snapshot
Chris> readonly, you just set the quota to 1 block.

Ok, so you really aren't talking about Quotas here, but space
reservations instead.

Also, I think you're wrong here when you state that making a snapshot
(sub-volume?) RO just requires you to set the quota to 1 block. What
is to stop me from writing 1 block to a random file that already
exists?

Chris> But, I'm not planning on adding a way to say user X in
Chris> subvolume Y has quota Z. It'll just be: this subvolume can't
Chris> get bigger than a given size. (at least for version 1.0).

Ok, so version 1.0 isn't as interesting to me in a production
environment, since we pretty much need quotas (or a quick way to
monitor how much space a user has been allocated on a volume).

But for a home system, it's certainly looking interesting as well,
since I could give each home directory its own sub-volume and just
grow/shrink them as needed. Maybe. :]

Thanks for your work on this.

John


From: Chris Mason [email blocked]
Subject: Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
Date: Wed, 13 Jun 2007 12:34:05 -0400

On Wed, Jun 13, 2007 at 12:12:23PM -0400, John Stoffel wrote:
> >>>>> "Chris" == Chris Mason [email blocked] writes:

[ nod ]

> Also, I think you're wrong here when you state that making a snapshot
> (sub-volume?) RO just requires you to set the quota to 1 block. What
> is to stop me from writing 1 block to a random file that already
> exists?

It's copy on write, so changing one block means allocating a new one and
putting the new contents there. The old blocks don't become available
for reuse until the transaction commits.

-chris
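

Chris's last point, that a one-block quota leaves no room to overwrite even a single existing block, can be illustrated with a toy model. This is one possible reading, not Btrfs code: the thread leaves open how blocks shared between snapshots will be accounted, so the sketch simply assumes that every block a subvolume references counts against its quota, including old copies that stay pinned until commit. The names `Subvolume`, `NoSpace`, and `blocks_in_use` are hypothetical.

```python
class NoSpace(Exception):
    """Stands in for the kernel's -ENOSPC error."""


class Subvolume:
    def __init__(self, blocks, quota):
        self.committed = dict(blocks)  # block number -> contents on disk
        self.pending = {}              # new copies written in the open transaction
        self.quota = quota             # maximum number of blocks in use

    def blocks_in_use(self):
        # Assumption: old blocks are not reusable until commit, so the old
        # copy and its pending replacement both count against the quota.
        return len(self.committed) + len(self.pending)

    def write(self, blockno, data):
        # Copy on write: the first write to a block in this transaction
        # always allocates a fresh block, even for an overwrite.
        needs_new_block = blockno not in self.pending
        if needs_new_block and self.blocks_in_use() + 1 > self.quota:
            raise NoSpace("quota of %d block(s) exceeded" % self.quota)
        self.pending[blockno] = data

    def commit(self):
        # On commit the new copies replace the old ones, which are released.
        self.committed.update(self.pending)
        self.pending.clear()


# A snapshot that already holds one block, with its quota set to 1 block.
snap = Subvolume({0: b"old"}, quota=1)
try:
    snap.write(0, b"new")
    read_only = False
except NoSpace:
    read_only = True

# The same overwrite succeeds when there is room for one extra block.
roomy = Subvolume({0: b"old"}, quota=2)
roomy.write(0, b"new")
roomy.commit()
```

Under these assumptions the overwrite John asks about fails on the snapshot, because the replacement block must be allocated while the original is still held.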




Source URL:
http://kerneltrap.org/node/8376