Hi All,
For over a year, Next3 has been developed in-house by CTERA networks, as part of its NAS appliances. Now that the appliances are out in the market, Next3 project can finally be shared with the world.
Main Next3 features:
- Backward and forward compatible with Ext3
- Incremental, volume level, read-only snapshots
- Snapshots use available file system disk space
- Snapshot deletion frees up disk space
- Retains Ext3 stability including journaling and fsck
- Minimal performance overhead (in average usage scenarios)
- No upper limit on number or size of snapshots
Please visit Next3 wiki page:
http://sourceforge.net/apps/mediawiki/next3/
Next3 project is looking for code reviewers, beta testers and public attention.
Would love to read your comments on Next3-users mailing list:
https://lists.sourceforge.net/lists/listinfo/next3-users
Amir.
--
Hi Ted,
The Next3 project was introduced 2 weeks ago. I hope you've had a chance to visit the wiki page.
Since the snapshot support patches do not fall into standard patch categories (feature/bug fix), I would like to request your advice on the best way to move forward with code review.
There are 2 patch series available for download from:
http://sourceforge.net/projects/next3/files/Latest%20patch%20series
e2fsprogs-1.41.11-next3-020510_patches.tar.gz contains a "small and simple" patch series (1350 lines/8 patches), which adds Next3 snapshots awareness to e2fsprogs.
next3_snapshot-020510_patches.tar.gz contains a "large and complex" patch series (7000 lines/38 patches), which adds Next3 built-in snapshots support to an Ext3 clone.
In my opinion, accepting the Next3 patches to e2fsprogs makes sense for 3 reasons:
1. It is "small and simple" (mostly harmless).
2. It will help finalize the Next3 on-disk format, so the sooner the better.
3. It will be of service to all those who would like to start using/testing Next3 and still want to be up-to-date with the latest e2fsprogs.
Please let me know what you think. Your blessing means a lot to me (and, I should hope, to potential reviewers).
And now for a few personal requests from the list:
- For those of you who didn't visit the wiki yet, please find the time to visit, it is very friendly.
- For those of you who have read the wiki and found it interesting (or not), please send me some comments to feed my ego (or not).
- For those of you who will be willing to pick up one of the code review gloves above (small or large size), please let me know if you need anything else from me and I will provide.
Thanks in advance,
Amir.
--
The typical way to get code review is to post patches to the mailing list. However do it in smaller batches if you have a lot of them. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
It is in fact a rather big feature, that is, built-in snapshot support for Ext3. However, since I did not expect such a big feature to be added to Ext3 in its current state of development,
Very well, I will do that. I just figured there's not much sense in reviewing the patches before reading the basic design concepts in the wiki.
Thanks,
Amir.
--
I took a quick look, but to be honest, I've been swamped lately, and with the merge window close at hand, it was something I was going to put off for another 2-3 weeks. I didn't realize you were in need of some immediate comments so that you could finalize the on-disk format. So the high bit on that front is that it looks like at least some of the fields used, bit positions grabbed, etc., overlap with those used by ext4. Ext4 is where new development takes place in the ext2/3/4 series. So enhancements such as Next3 will probably not be received with great welcome into ext3. And as far as e2fsprogs (which handles the ext2, ext3, and ext4 file systems) is concerned, I don't want to deal with the complexity of certain fields that mean one thing for ext4, and something else for Next3. I'll try to carve out time to look at the Next3 patches in greater detail this week. Best regards, - Ted --
Indeed, on-disk format changes, that is, the first patch in each of the e2fsprogs and Next3 patch series,
I would like to explain a few things regarding on-disk format changes:
1. Whenever it was possible, I tried to grab fields from the end of the structs (i.e., super_block, s_features, i_flags) to stay as far away as possible from ongoing changes in Ext4. If you prefer, I could move these fields to a different location that you assign to them.
2. I plead guilty as charged in grabbing i_flags & 0x00F00000 for snapshot file non-persistent status flags, which overlaps with recent Ext4 flags. However, I do not store these flags on-disk; they are only used by lsattr -X, to display in-memory snapshot status along with the on-disk snapshot status stored in i_flags & 0x1F000000.
3. I plead guilty as charged in grabbing l_i_reserved1 for the snapshot on-disk list (i_next_snapshot), which overlaps with the Ext4 on-disk i_version. However, since i_version can take any arbitrary value, this doesn't "break" the Ext4 on-disk format.
The following wiki section lists the on-disk format changes in detail:
Yes, of course, I realize that. This is the reason I chose to introduce Next3 as a new f/s, which was branched from Ext3, and not as a new feature to Ext3. Unfortunately, merging the Next3 snapshots feature into Ext4 is not an easy task,
I was kind of expecting you to say that and I understand why that can be a problem for you. Let's try to address this issue in the code review and find
That would be great.
Thanks,
Amir.
--
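As a quick check of the mask arithmetic in the message above, here is a minimal sketch. Only the two mask values come from the message; the constant and function names are hypothetical and are not taken from the Next3 or e2fsprogs sources.

```python
# Mask values quoted in the message above; the names are illustrative only.
SNAPSHOT_STATUS_IN_MEMORY = 0x00F00000  # non-persistent status flags (not stored on-disk)
SNAPSHOT_STATUS_ON_DISK = 0x1F000000    # persistent status flags


def bit_positions(mask):
    """Return the positions of the set bits in a 32-bit mask, lowest first."""
    return [i for i in range(32) if mask & (1 << i)]


def overlap(a, b):
    """Return the bits that two masks have in common (0 means disjoint)."""
    return a & b


combined = SNAPSHOT_STATUS_IN_MEMORY | SNAPSHOT_STATUS_ON_DISK

if __name__ == "__main__":
    print(hex(combined))
    print(bit_positions(SNAPSHOT_STATUS_IN_MEMORY))
    print(bit_positions(SNAPSHOT_STATUS_ON_DISK))
```

The two masks are disjoint and together cover bits 20 through 28 of i_flags, so any other flag assigned in that range would collide with one of them.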
As I understand it the ext4 code base still supports not having extents enabled in the super block (although I'm not sure how well that variant is tested in practice) So in theory you could have a feature that requires disabling extents. It might not make users very happy though. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
In theory, it is possible to have 2 modes for Ext4 (extents or snapshots) and some would argue that it makes sense to do that. But I think that this decision can be deferred to a later time, after people have gained experience with Next3 and have decided whether they would like to have the snapshot feature merged into Ext4 or not.
Besides, it would take me a considerable amount of time to merge the snapshot feature into Ext4, and Next3 is ready to be used now.
Amir.
--
I think that the counter argument would be that moving features into ext3 is probably the wrong thing to do. I don't think that anyone is in a huge hurry given that we have LVM based snapshots with ext3 and btrfs snapshots around the corner. Probably this is most interesting when done to the latest version of the ext family. Best regards, Ric --
This is a valid argument, but it is important for me to clarify a few issues regarding the statements above:
1. No features are added to Ext3, so there is no concern for the stability of Ext3. The feature is added as a new f/s, with the slight overhead of duplicate code in the kernel tree and an extra loadable module in the system.
2. From the user's point of view, there is not much difference between "mount -t next3" and "mount -t ext4 -o snapshots", because in both cases it would not be possible to mount ext4 with extents support on that volume before discarding snapshots, and it would be possible to mount ext4 with extents support after discarding snapshots.
3. Next3 snapshots are much more scalable, durable and efficient than LVM snapshots. These are some of the benefits of built-in snapshots support.
4. I do not want to restart the discussion about when btrfs will be production ready. As for Next3 stability, I think that with the help of the community, Next3 can be production ready within a matter of months, because the Next3 code religiously attempts to retain the stability of its ancestor Ext3.
I dare you to prove me wrong ;-)
Amir.
--
This is where it's important to understand exactly what is meant by a ***file system***. Are you referring to the format, or the implementation? The way I've always treated it, and the way I believe most of the ext234 developers have treated it, is that what users call ext2, ext3, and ext4 are different _implementations_ of the same _file_ _system_ [format].
That is to say, ext4 simply happens to be a fuller, more complete implementation of the same file system as ext2 and ext3. Ext2 doesn't support certain features such as journaling and directory indexing; ext3 doesn't support some advanced features such as delayed allocation and extents, and requires that the journal always be present. Ext4 is a superset of ext2 plus ext3 plus delayed allocation, extents, a multi-block allocator, and a few other new features. But they are all the same file system.
Nor are they the only implementations of that file system. The BSD file systems have a compatible (although feature-restricted) implementation, which was independently implemented. So does the GNU HURD. And there are others. And note that all of these folks use the same userspace utilities, e2fsprogs, for all of these various implementations: BSD, GNU HURD, and the Linux implementations of ext2, ext3, and ext4 all use the same set of tools: mke2fs, e2fsck, tune2fs, debugfs, and so on.
The same thing is true for NTFS. There are features in NTFS that you will find in Windows Vista that don't exist in Windows NT. But everybody treats them as the same file system, even though they have more advanced features in newer versions of the operating system.
The "ext" in ext2 stands for "extended", as in "the second extended file system" for Linux. It perhaps would be better if we had used the term "extensible", since that's the main thing about ext2/3/4 that has given it so much staying power.
We've been able to add, in very carefully backwards and forwards compatible way, new features to the file system ...
Next3 is another implementation of the extended f/s format. Next3 is backward and forward compatible with ext3. The Next3 patch to e2fsprogs doesn't treat it as a different file system format.
It is _practically_ possible to support the snapshot features/fields in e2fsprogs today
It makes me very happy that you've studied Next3 enough to be able to make this almost correct observation. I do plan to use indirect mapped snapshot files when I merge them to Ext4. The only place where extent mapped files break the snapshots design is when doing move-on-write while writing in-place to an extent mapped file. Should the extent be broken into 2 extents + a single block, and the block then moved to the snapshot? Should the block be copied-on-write instead of moved-on-write, paying the performance penalty? There is an important design decision to make here.
Amir.
--
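The "2 extents + single block" option raised above can be pictured with a minimal sketch. It only models an extent as a (first logical block, length) pair; it does not reflect the ext4 extent tree layout or any Next3 code, and the function name is hypothetical.

```python
def split_extent(start, length, block):
    """Split the extent [start, start + length) around one logical block.

    Returns (before, single, after), where each part is a (start, length)
    pair and `before` or `after` is None when that side is empty.
    """
    end = start + length
    if not (start <= block < end):
        raise ValueError("block is outside the extent")
    before = (start, block - start) if block > start else None
    single = (block, 1)  # the block that would be moved to the snapshot
    after = (block + 1, end - block - 1) if block + 1 < end else None
    return before, single, after


if __name__ == "__main__":
    # An 8-block extent starting at logical block 100, written at block 103.
    print(split_extent(100, 8, 103))
```

Each in-place write into the middle of an extent turns one mapping into up to three, which is the fragmentation cost that the alternative (copying the block instead of moving it) trades against write performance.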
As long as Next3 uses fields which have already been assigned to ext4, this is a claim that you cannot make correctly. Because, you see, ext4 is also an implementation of the extended f/s format, and those
As long as you are willing to say that, then sure, let's work towards
If you do the "move-on-write" trick, you just have to split the extent and do a COW of the extent tree and/or the inode. So for a single block, the performance hit is the same, yes? But in the long run, it's
Technically speaking, it's possible to do it both ways, yes? I'm not sure why you consider this such an important design decision. We can even play games where for some files we might do copy-on-write, and for some files, we do move-on-write. It's always possible to check the COW bitmaps to decide what had happened.
In any case, if this is all you have to do, I'm not sure why you said it was fundamentally impossible to support extents with the Next3 design.
Best regards,
- Ted
--
Let me state my case then: Next3 uses 1 assigned field (i_version), but it does not "abuse" it. You see, Next3 only tampers with i_version of snapshot files. And by tamper I mean: set it to the next snapshot inode number on snapshot take. And snapshot files are not modifiable by users (only by the f/s itself). So if the f/s decides to assign an arbitrary value to i_version of snapshot files, it doesn't break the extended f/s format, does it?
Next3 also uses 9 i_flags bits (0x1FF00000), in snapshot file inodes only, some of which currently overlap flags recently assigned to Ext4 (you beat me to it). There is a big waste in the i_flags bit space, for example, the 4 bits reserved for compression, which are not in use by non-compressed files. Snapshot files are never compressed, so I wouldn't mind reusing those 4 bits for snapshot flags. Overloading auxiliary bits with different meanings depending on some other bit does not make this a different f/s format.
All metadata is COWed, inside the JBD hooks, so the extent tree and inode are taken care of. It is the data blocks which are being moved-on-write for efficiency. The problem with splitting the extent is that when an application does a lot of in-place writes to an extent mapped file, it will eventually end up being broken down into tiny extents or
Definitely yes! I never thought it would really have to come down to a "decision", because there is a trade-off at hand. Even in Next3, without extents, it makes sense to have a choice of write performance vs. fragmentation per file. The few applications that use random in-place write (db, virtual disk)
Wait just a minute! I said "not an easy task" and "break the design concepts", but I never said (as far as I recall) "fundamentally impossible". Well, perhaps "breaking the design concepts" was too harsh :-)
I quote from Next3 wiki FAQ: "Can Next3 snapshot support be applied to Ext4? Most of the snapshot code can work on Ext4 as is, but the move-on-write technique used for regular ...
You need to justify ALL new fields which are used by Next3, not just the ones which overlap ext4, since they are precious resources, not to be squandered for just one new file system feature/extension. The compression people were amazingly profligate with their flags, yes. It's one of the reasons why I push back now, and ask people to justify *every* single bit assignment or field usage.
For example: i_snapshot_blocks_count. Is that really necessary? Why can it not be computed by looking at i_size of the snapshot inode? Or by checking to see if the superblock has been COW'ed? If it hasn't, then the s_blocks_count in the fs superblock must be what would have been in i_snapshot_blocks_count. If the sb has been COW'ed, s_blocks_count in the COW'ed sb must be that value. Why allocate and waste a full 32-bit field out of _every_ inode in the file system if it's possible to get that value via other places?
I have a similar question about bg_cow_bitmap. Is that really necessary? The COW bitmap is just a copy of the base file system's block allocation bitmap, right? (The wiki documentation and the design PDF aren't completely clear on that point, but that seems to be what it is.) So why do you need to allocate a field out of the bg descriptor for it?
It's not clear why you need an exclude inode, if you are also storing the address of the exclude bitmap blocks in the bg descriptor. One or the other, but not both...
If s_snapshot_inum and s_snapshot_id refer to the "active" snapshot, why do they need to be in the on-disk structure? Why not just have the first item of the linked list whose head is s_last_snapshot be the "active" snapshot (if this needs to be in the on-disk state at all); wouldn't the active snapshot be the most recent one anyway?
Also, as far as i_next_snapshot is concerned, why not just use d_time for the linked list? That's what we do with the orphaned inode list, so we have code that maintains a linked list using d_time already. So that way you don't need ...
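The suggestion above for doing without i_snapshot_blocks_count can be sketched as follows. This is only an illustration of the reasoning in the message, assuming superblocks are modelled as plain dictionaries; the function name and the dictionary representation are hypothetical and are not part of Next3 or e2fsprogs.

```python
def snapshot_blocks_count(fs_superblock, cowed_superblock=None):
    """Derive the block count a snapshot refers to without a per-inode field.

    If the superblock was COW'ed into the snapshot, the copy holds the value
    from snapshot time. If it was not COW'ed, the superblock has not changed
    since the snapshot was taken, so the live value is still the right one.
    """
    source = cowed_superblock if cowed_superblock is not None else fs_superblock
    return source["s_blocks_count"]


if __name__ == "__main__":
    live = {"s_blocks_count": 2_000_000}           # after a later resize
    at_snapshot_time = {"s_blocks_count": 1_000_000}
    print(snapshot_blocks_count(live, at_snapshot_time))
    print(snapshot_blocks_count(live))
```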
I have 4 non-persistent flags. I will move them to i_state. I've kept them in i_flags out of laziness, since I use lsattr -X
i_version is used to chain the snapshot list on-disk, similar to the orphan list. I used i_dtime in the past, but I was concerned that a bug would result in cleanup of all snapshots, so I started using i_version instead. I can revert back to using i_dtime (snapshot files are non-truncatable, non-unlinkable) instead of i_version.
The snapshot file directory entry name is arbitrary and may be used by a "snapshot management system" as it wishes, to organize and display snapshots. As far as the snapshot sub-system is concerned, the on-disk snapshot
Very well, I can read snapshot_blocks_count from the COWed superblock (it
bg_cow_bitmap/bg_exclude_bitmap are used by Next3 as a non-persistent cache for the addresses of the bitmap blocks, which can be read from active_snapshot/exclude_inode. In other words, instead of allocating a per-group in-memory structure, I used the 2 unused fields in the in-memory group descriptor. The only side effect for the ext* on-disk format is that those fields are no longer 0 after mounting a volume with Next3. Is that a problem? Can the CSUM feature resolve that problem? In e2fsprogs, I only reference those fields for debugging purposes (dumpe2fs displays them). Also, create_exclude_inode and resize2fs set the bg_exclude_bitmap, but they don't have to, because Next3 re-reads all exclude_bitmap block addresses from the exclude inode at mount time. So please feel free to reject those field assignments. I can include
Good question. Again, there used to be only 1 field, s_last_snapshot, but I split it into 2 fields to recover from a crash in the middle of snapshot take. A half-taken snapshot is set as s_last_snapshot, but only a ready snapshot is set as s_snapshot_inum. Next3 will clean up a half-taken snapshot at mount time.
tune2fs -O ^has_snapshot will clean up (discard) all snapshot files,
I don't know if you noticed, but I reused the code ...
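The two-field crash recovery described above can be pictured with a small model. It is a sketch under several assumptions that the message does not confirm: the snapshot list is walked from newest to oldest starting at s_last_snapshot, inode number 0 means "none", a half-taken snapshot can only be the head of the list, and the active snapshot is otherwise the head. The function name and the dictionary representation are hypothetical and are not Next3 code.

```python
def cleanup_on_mount(sb, next_snapshot):
    """Discard a half-taken snapshot left behind by a crash.

    sb            -- dict with 's_last_snapshot' (list head) and
                     's_snapshot_inum' (last snapshot known to be ready)
    next_snapshot -- dict mapping a snapshot inode number to the next
                     (older) snapshot inode number in the chain, 0 at the end
    Returns the list of inode numbers that were discarded.
    """
    discarded = []
    head = sb["s_last_snapshot"]
    if head != 0 and head != sb["s_snapshot_inum"]:
        # The head was linked into the list but never marked ready:
        # unlink it and fall back to the snapshot it pointed to.
        sb["s_last_snapshot"] = next_snapshot.pop(head, 0)
        discarded.append(head)
    return discarded


if __name__ == "__main__":
    superblock = {"s_last_snapshot": 14, "s_snapshot_inum": 13}
    chain = {14: 13, 13: 12, 12: 0}
    print(cleanup_on_mount(superblock, chain), superblock)
```

With a single field there would be no way to tell a ready head from a half-taken one after a crash, which is the case the split into two fields is meant to cover.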
I have started making some changes to on-disk format based on the
points we seem to agree upon.
I would like to register only 1 ro_compat feature (has_snapshot) and 1
compat feature (exclude_inode).
I would like to move the rest of the "informational features" to
s_flags, including NEXT3_FLAGS_BIG_JOURNAL.
A Next3 big journal can be created with the option -J big or by
mkfs.next3/mkn3fs.
I will evacuate all the fields that Next3 can do without (i.e.,
{s_snapshot_blocks_count,bg_{cow,exclude}_bitmap}).
I will move the non-persistent snapshot flags to i_state.
I would like to stick with i_version list chaining and not revert to
using i_dtime, pending further discussion on that topic.
Awaiting permanent assignments for the rest of the fields/flags.
Per your request, I have added the information above to the Next3 wiki.
Please find the TODO items in red and WIP items in green (implemented
and not published):
http://sourceforge.net/apps/mediawiki/next3/index.php?title=Code_documentation#Reserve...
Also, if you could please drop a line about your view of how to
progress with Next3
(something along the lines of what you said in the conference call), that
would be nice.
Some people may have gotten the impression that the fork from Ext3 is
a show stopper for you,
see: http://lwn.net/SubscriberLink/387231/1310b1360769c12b/
Thanks,
Amir.
--
As Ted mentioned in his reply, the big concern is that you are forking ext3 instead of adding a new feature to the end of the ext* family of file systems. Since we have multiple snapshot mechanisms in place already (not just btrfs & lvm, but don't forget all of the builtin array snapshots), I think that we are not in a hurry to get this done quickly. I would strongly prefer we get this rebased onto the latest ext4 and resubmitted. As far as proof goes, I think that the unfortunate burden of proof is on your shoulders to prove to us that we should take and maintain those new features given the often conflicting priorities :-) Thanks! Ric --
That is a valid concern, but this is where Next3 stands today. There is no intention of replacing Ext4 with Next4 as the leading ext* f/s. The branch from Ext3 was made at the time to speed up the development
If I were you I would also prefer to get snapshots in ext4, and even snapshots alongside extent mapped files, but unfortunately, I cannot promise to deliver either anytime soon. I can only promise my support to anyone who wishes to participate in
What can I say, a Windows file server can display previous file versions without using a costly storage array. Can a RedHat server do that? Can LVM snapshots be used to do that? The CTERA NAS appliances do that using Next3.
Amir.
--
