Re: Introducing Next3 - built-in snapshots support for Ext3

Previous thread: [Bug 15806] New: Kernel Panic or Kernel becomes unstable with encrypted drives (TrueCrypt) by bugzilla-daemon on Sunday, April 18, 2010 - 4:15 am. (3 messages)

Next thread: Ext4: batched discard support by Lukas Czerner on Monday, April 19, 2010 - 3:55 am. (49 messages)
From: Amir G.
Date: Sunday, April 18, 2010 - 8:41 am

Hi All,

For over a year, Next3 has been developed in-house by CTERA networks,
as part of its NAS appliances.
Now that the appliances are on the market, the Next3 project can
finally be shared with the world.

Main Next3 features:
- Backward and forward compatible with Ext3
- Incremental, volume level, read-only snapshots
- Snapshots use available file system disk space
- Snapshot deletion frees up disk space
- Retains Ext3 stability including journaling and fsck
- Minimal performance overhead (in average usage scenarios)
- No upper limit on number or size of snapshots

Please visit the Next3 wiki page:
http://sourceforge.net/apps/mediawiki/next3/

The Next3 project is looking for code reviewers, beta testers and public attention.

I would love to read your comments on the Next3-users mailing list:
https://lists.sourceforge.net/lists/listinfo/next3-users

Amir.
--

From: Amir G.
Date: Monday, May 3, 2010 - 2:47 am

Hi Ted,

Next3 project was introduced 2 weeks ago. I hope you've had a chance
to visit the wiki page.

Since the snapshot support patches do not fall into standard patch
categories (feature/bug fix),
I would like to request your advice on the best way to move forward
with code review.

There are 2 patch series available for download from:
http://sourceforge.net/projects/next3/files/Latest%20patch%20series

e2fsprogs-1.41.11-next3-020510_patches.tar.gz contains a "small and
simple" patch series (1350 lines/8 patches),
which adds Next3 snapshots awareness to e2fsprogs.

next3_snapshot-020510_patches.tar.gz contains a "large and complex"
patch series (7000 lines/38 patches),
which adds Next3 built-in snapshots support to an Ext3 clone.

In my opinion, accepting the Next3 patches to e2fsprogs makes sense
for 3 reasons:
1. It is "small and simple" (mostly harmless).
2. It will help finalize the Next3 on-disk format, so the sooner the better.
3. It will be of service for all those who would like to start
using/testing Next3 and still want to be up-to-date with latest
e2fsprogs.

Please let me know what you think.
Your blessing means a lot to me (and, I should hope, to potential reviewers).

And now for a few personal requests from the list:
- For those of you who haven't visited the wiki yet, please find the time
to visit; it is very friendly.
- For those of you who have read the wiki and found it interesting (or
not), please send me some comments to feed my ego (or not).
- For those of you who will be willing to pick up one of the code
review gloves above (small or large size), please let me know if you
need anything else from me and I will provide.

Thanks in advance,
Amir.
--

From: Andi Kleen
Date: Tuesday, May 4, 2010 - 12:55 pm

The typical way to get code review is to post patches to the mailing
list.

However, do it in smaller batches if you have a lot of them.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Amir G.
Date: Tuesday, May 4, 2010 - 6:03 pm

It is in fact a rather big feature, that is, built-in snapshot support for Ext3.
However, since I did not expect such a big feature to be added to Ext3
in its current state of development,

Very well, I will do that. I just figured there's not much sense in
reviewing the patches before
reading the basic design concepts in the wiki.

Thanks,
Amir.
--

From: tytso
Date: Tuesday, May 4, 2010 - 3:42 pm

I took a quick look, but to be honest, I've been swamped lately, and
with the merge window close at hand, it was something I was going to
put off for another 2-3 weeks.  I didn't realize you were in need of
some immediate comments so that you could finalize the on-disk format.

So the high bit on that front is that it looks like at least some of
the fields used, bit positions grabbed, etc., overlap with those used
by ext4.  Ext4 is where new development takes place in the ext2/3/4
series.  So enhancements such as Next3 will probably not be received
with great welcome into ext3.  And as far as e2fsprogs (which handles
the ext2, ext3, and ext4 file systems) is concerned, I don't want to
deal with the complexity of certain fields that mean one thing for
ext4, and something else for Next3.

I'll try to carve out time to look at the Next3 patches in greater
detail this week.

Best regards,

					- Ted
--

From: Amir G.
Date: Tuesday, May 4, 2010 - 7:05 pm

Indeed, on-disk format changes, that is, the first patch in each of
the e2fsprogs and Next3 patch series,

I would like to explain a few things regarding on-disk format changes:

1. Whenever it was possible, I tried to grab fields from the end of
the structs (i.e., super_block, s_features, i_flags) to stay as far
away as possible from ongoing changes in Ext4. If you find it better,
I could move these fields to a different location you assign them to.

2. I plead guilty as charged to grabbing i_flags & 0x00F00000 for
snapshot file non-persistent status flags, which overlaps with recent
Ext4 flags. However, I do not store these flags on-disk; they are
only used by lsattr -X, to display in-memory snapshot status along
with the on-disk snapshot status stored in i_flags & 0x1F000000.

3. I plead guilty as charged to grabbing l_i_reserved1 for the snapshot
on-disk list (i_next_snapshot), which overlaps with the Ext4 on-disk
i_version. However, since i_version can take any arbitrary value, this
doesn't "break" the Ext4 on-disk format.

The following wiki section lists the on disk format changes in detail:

Yes, of course, I realize that. This is the reason I chose to
introduce Next3 as a new f/s,
which was branched from Ext3 and not as a new feature to Ext3.
Unfortunately, merging Next3 snapshots feature into Ext4 is not an easy task,

I was kind of expecting you to say that and I understand why that can
be a problem for you.
Let's try to address this issue in the code review and find

That would be great.
Thanks,
Amir.
--

From: Andi Kleen
Date: Friday, May 7, 2010 - 8:12 am

As I understand it the ext4 code base still supports not having
extents enabled in the super block (although I'm not sure how well
that variant is tested in practice)

So in theory you could have a feature that requires disabling extents.

It might not make users very happy though.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Amir G.
Date: Friday, May 7, 2010 - 12:22 pm

In theory, it is possible to have 2 modes for Ext4 (extents or snapshots)
and some would argue that it makes sense to do that.
But I think that making that decision can be deferred to a later time,
after people have gained experience with Next3 and have decided if they
would like to have
the snapshot feature merged into Ext4 or not.

Besides, it would take me a considerable amount of time to merge the
snapshot feature into Ext4,
and Next3 is ready to be used now.

Amir.
--

From: Ric Wheeler
Date: Friday, May 7, 2010 - 2:25 pm

I think that the counter argument would be that moving features into 
ext3 is probably the wrong thing to do.

I don't think that anyone is in a huge hurry given that we have LVM 
based snapshots with ext3 and btrfs snapshots around the corner.  
Probably this is most interesting when done to the latest version of the 
ext family.

Best regards,

Ric

--

From: Amir G.
Date: Friday, May 7, 2010 - 10:43 pm

This is a valid argument, but it is important for me to clarify a few
issues regarding the statements above:

1. No features are added to Ext3, so there is no concern for the
stability of Ext3.
The feature is added as a new f/s, with the slight overhead of
duplicate code in the
kernel tree and an extra loadable module in the system.

2. From the user's point of view, there is not much difference between
"mount -t next3"
and "mount -t ext4 -o snapshots", because in both cases it would not
be possible to
mount ext4 with extents support on that volume before discarding snapshots and
it will be possible to mount ext4 with extents support after
discarding snapshots.

3. Next3 snapshots are much more scalable, durable and efficient than
LVM snapshots.
These are some of the benefits of built-in snapshots support.

4. I do not want to restart the discussion about when btrfs will be
production ready.
As for Next3 stability, I think that with the help of the community,
Next3 can be production ready within a matter of months,
because the Next3 code religiously attempts to retain the stability of
its ancestor Ext3.

I dare you to prove me wrong ;-)

Amir.
--

From: Theodore Tso
Date: Saturday, May 8, 2010 - 4:48 am

This is where it's important to understand exactly what is meant by a ***file system***.  Are you referring to the format, or the implementation?  The way I've always treated it, and the way I believe most of the ext2/3/4 developers have treated it, is that what users call ext2, ext3, and ext4 are different _implementations_ of the same _file_ _system_ [format].  That is to say, ext4 simply happens to be a fuller, more complete implementation of the same file system as ext2 and ext3.  Ext2 doesn't support certain features such as journaling and directory indexing; ext3 doesn't support some advanced features such as delayed allocation and extents, and requires that the journal always be present.  Ext4 is a superset of ext2 plus ext3 plus delayed allocation, extents, a multi-block allocator, and a few other new features.  But they are all the same file system.

Nor are they the only implementations of that file system.  The BSD file systems have a compatible (although feature-restricted) implementation, which was independently implemented.  So does the GNU HURD.   And there are others.  And note that all of these folks all use the same userspace utilities, e2fsprogs, for all of these various implementations: BSD, GNU HURD, and the Linux implementations of ext2, ext3, and ext4 all use the same set of tools: mke2fs, e2fsck, tune2fs, debugfs, and so on.

The same thing is true for NTFS.  There are features in NTFS that you will find in Windows Vista that don't exist in Windows NT or other earlier versions of Windows.  But everybody treats them as the same file system, even though they have more advanced features in newer versions of the operating system.

The "ext" in ext2 stands for "extended", as in "the second extended file system" for Linux.  It perhaps would be better if we had used the term "extensible", since that's the main thing about ext2/3/4 that has given it so much staying power.  We've been able to add, in a very carefully backwards and forwards compatible way, new features to the file system ...
From: Amir G.
Date: Saturday, May 8, 2010 - 9:07 am

Next3 is another implementation of the extended f/s format.

Next3 is backward and forward compatible with ext3.
The Next3 patch to e2fsprogs doesn't treat it as a different file system format.

It is _practically_ possible to support the snapshot features/fields
in e2fsprogs today

It makes me very happy that you've studied Next3 enough to be able to
make this almost correct observation.
I do plan to use indirect mapped snapshot files when I merge them to Ext4.
The only place that extent mapped files break the snapshots design is
when doing move-on-write
when writing in-place to extent mapped file.
Should the extent be broken into 2 extents + single block and then
move the block to snapshot?
Should the block be copied-on-write instead of moved-on-write and pay
the performance penalty?
There is an important design decision to make here.
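
The first option in the question above (breaking the extent into 2 extents plus a single block) can be sketched with a toy model. This is only an illustration under stated assumptions: the Extent tuple and the split_for_move_on_write helper are hypothetical names, not Next3 or Ext4 code, and the block numbers are made up.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    logical: int   # first logical block of the file covered by this extent
    physical: int  # first physical block on disk
    length: int    # number of contiguous blocks


def split_for_move_on_write(ext: Extent, logical_block: int,
                            new_physical: int) -> List[Extent]:
    """Remap one block of `ext` to `new_physical`; the rest stays in place.

    The old physical block is left untouched so that a snapshot could keep it.
    """
    offset = logical_block - ext.logical
    if not 0 <= offset < ext.length:
        raise ValueError("block is outside the extent")
    pieces = []
    if offset > 0:
        # Blocks before the rewritten one keep their original mapping.
        pieces.append(Extent(ext.logical, ext.physical, offset))
    # The rewritten block now lives somewhere else on disk.
    pieces.append(Extent(logical_block, new_physical, 1))
    tail = ext.length - offset - 1
    if tail > 0:
        # Blocks after the rewritten one also keep their original mapping.
        pieces.append(Extent(logical_block + 1, ext.physical + offset + 1, tail))
    return pieces


# One 8-block extent, with an in-place write to logical block 3.
pieces = split_for_move_on_write(Extent(0, 1000, 8), 3, 5000)
```

In this model a single in-place write turns one extent into three, which is the fragmentation cost that has to be weighed against the extra data copy of copy-on-write.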

Amir.
--

From: tytso
Date: Saturday, May 8, 2010 - 10:25 am

As long as Next3 uses fields which have already been assigned to ext4, this
is a claim that you cannot make correctly.  Because, you see,
ext4 is also an implementation of the extended f/s format, and those

As long as you are willing to say that, then sure, let's work towards

If you do the "move-on-write" trick, you just have to split the extent
and do a COW of the extent tree and/or the inode.  So for a single
block, the performance hit is the same, yes?  But in the long-run, it's

Technically speaking, it's possible to do it both ways, yes?  I'm not
sure why you consider this such an important design decision.  We can
even play games where for some files we might do copy-on-write, and
for some files, we do move-on-write.  It's always possible to check
the COW bitmaps to decide what had happened.
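
A toy model of the per-file choice between copy-on-write and move-on-write might look as follows. Everything here is an assumption made for illustration: ToyVolume, its fields and what_happened are hypothetical, the "COW bitmap" is modelled as the set of blocks in use at snapshot time, and the thread does not say how Next3 actually records which path a block took.

```python
class ToyVolume:
    """Hypothetical model: one file, one snapshot, whole-block overwrites."""

    def __init__(self, file_map, policy):
        self.file_map = dict(file_map)                  # logical -> physical block
        self.cow_bitmap = set(self.file_map.values())   # blocks in use at snapshot time
        self.snapshot = {}        # original block -> block now holding the old data
        self.policy = policy      # "cow" or "mow", chosen per file
        self.next_free = max(self.cow_bitmap) + 1
        self.data_copies = 0      # extra block copies paid for by the writer

    def _alloc(self):
        block = self.next_free
        self.next_free += 1
        return block

    def overwrite(self, logical):
        phys = self.file_map[logical]
        if phys not in self.cow_bitmap or phys in self.snapshot:
            return  # block is newer than the snapshot, or already preserved
        if self.policy == "cow":
            # Copy the old data out; the file keeps writing in place.
            self.snapshot[phys] = self._alloc()
            self.data_copies += 1
        else:
            # The snapshot keeps the old block; the file is remapped.
            self.snapshot[phys] = phys
            self.file_map[logical] = self._alloc()

    def is_contiguous(self):
        blocks = [self.file_map[l] for l in sorted(self.file_map)]
        return all(b - a == 1 for a, b in zip(blocks, blocks[1:]))


def what_happened(volume, original_block):
    """Tell afterwards whether a snapshotted block was copied or moved."""
    where = volume.snapshot.get(original_block)
    if where is None:
        return "untouched"
    return "moved" if where == original_block else "copied"


layout = {0: 100, 1: 101, 2: 102, 3: 103}
cow = ToyVolume(layout, "cow")
mow = ToyVolume(layout, "mow")
for volume in (cow, mow):
    volume.overwrite(1)
    volume.overwrite(1)  # the second write to the same block costs nothing extra
```

In this model the copy-on-write file pays one data copy and stays contiguous, while the move-on-write file pays no copy and ends up fragmented.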

In any case, if this is all you have to do, I'm not sure why you said
it was fundamentally impossible to support extents with the Next3
design.

Best regards,

						- Ted
--

From: Amir G.
Date: Saturday, May 8, 2010 - 12:40 pm

Let me state my case then:

Next3 uses 1 assigned field (i_version), but it does not "abuse" it.
You see, Next3 only tampers with i_version of snapshot files.
And by tamper I mean: set it to next snapshot inode number on snapshot take.
And snapshot files are not modifiable by users (only by the f/s itself).
So if the f/s decides to assign an arbitrary value to i_version of
snapshot files,
it doesn't break the extended f/s format, does it?

Next3 also uses 9 i_flags bits (0x1FF00000), in snapshot file inodes only,
some of which currently overlap flags recently assigned to Ext4 (you beat me to it).
There is a big waste of i_flags bit space; for example, the 4 bits
reserved for compression,
which are not in use by non-compressed files.
Snapshot files are never compressed, so I wouldn't mind reusing those
4 bits for snapshot flags.
Overloading auxiliary bits with different meanings depending on some
other bit does not make this a different f/s format.
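
The bit arithmetic behind the i_flags discussion can be checked with a short sketch. The two snapshot masks are the ones quoted in this thread; the two Ext4 flag values are an assumption, quoted from memory of the ext4 headers of that period, and should be verified against the kernel source before relying on them.

```python
# Masks quoted in this thread.
NONPERSISTENT_MASK = 0x00F00000   # in-memory snapshot status flags
PERSISTENT_MASK = 0x1F000000      # on-disk snapshot status flags
ALL_SNAPSHOT_FLAGS = NONPERSISTENT_MASK | PERSISTENT_MASK

# Assumed Ext4 inode flag values (not taken from this thread).
EXT4_EA_INODE_FL = 0x00200000
EXT4_EOFBLOCKS_FL = 0x00400000


def bit_count(mask: int) -> int:
    """Number of bits set in the mask."""
    return bin(mask).count("1")


def overlaps(flag: int, mask: int) -> bool:
    """True if the flag uses any bit covered by the mask."""
    return (flag & mask) != 0


total_bits = bit_count(ALL_SNAPSHOT_FLAGS)
clash_in_memory = [overlaps(f, NONPERSISTENT_MASK)
                   for f in (EXT4_EA_INODE_FL, EXT4_EOFBLOCKS_FL)]
clash_on_disk = [overlaps(f, PERSISTENT_MASK)
                 for f in (EXT4_EA_INODE_FL, EXT4_EOFBLOCKS_FL)]
```

Under these assumed values the two masks combine to 0x1FF00000 (4 + 5 = 9 bits), and the overlap with Ext4 falls only in the non-persistent range.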

All metadata is COWed, inside the JBD hooks, so the extent tree and
inode are taken care of.
It is the data blocks which are being moved-on-write for efficiency.
The problem with splitting the extent is that when an application does
a lot of in-place writes to an extent mapped file,
it will eventually end up being broken down into tiny extents or

Definitely yes! I never thought it would really have to come down to a
"decision",
because there is a trade-off at hand.
Even in Next3, without extents, it makes sense to have a choice of
write performance vs. fragmentation per file.
The few applications that use random in-place write (db, virtual disk)

Wait just a minute! I said "not an easy task" and "break the design
concepts", but I never said (as far as I recall) "fundamentally
impossible". Well, perhaps "breaking the design concepts" was too
harsh :-)

I quote from Next3 wiki FAQ:
"Can Next3 snapshot support be applied to Ext4?
Most of the snapshot code can work on Ext4 as is, but the
move-on-write technique used for regular ...
From: Theodore Tso
Date: Saturday, May 8, 2010 - 7:25 pm

You need to justify ALL new fields which are used by Next3, not just the ones which overlap ext4, since they are precious resources, not to be squandered for just one new file system feature/extension.



The compression people were amazingly profligate with their flags, yes.  It's one of the reasons why I push back now, and ask people to justify *every* single bit assignment or field usage. 

For example: i_snapshot_blocks_count.  Is that really necessary?  Why can it not be computed by looking at i_size of the snapshot inode?  Or by checking to see if the superblock has been COW'ed?  If it hasn't, then the s_blocks_count in the fs superblock must be what would have been in i_snapshot_blocks_count.  If the sb has been COW'ed, s_blocks_count in the COW'ed sb must be that value.  Why allocate and waste a full 32-bit field out of _every_ inode in the file system if it's possible to get that value via other places?
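
The derivation suggested above can be written down as a small sketch. The function name and its arguments are hypothetical; the only logic taken from the message is "use the COW'ed superblock's s_blocks_count if there is one, otherwise the live one".

```python
from typing import Optional


def snapshot_blocks_count(live_s_blocks_count: int,
                          cowed_s_blocks_count: Optional[int] = None) -> int:
    """Derive the snapshot's block count without a dedicated inode field.

    cowed_s_blocks_count is None when the superblock has not been COW'ed
    since the snapshot was taken.
    """
    if cowed_s_blocks_count is None:
        # Superblock unchanged since the snapshot: the live value still applies.
        return live_s_blocks_count
    # Superblock was COW'ed: the preserved copy holds the value at snapshot time.
    return cowed_s_blocks_count


# Hypothetical numbers: a volume of 1000 blocks, later grown to 2000.
before_resize = snapshot_blocks_count(1000)
after_resize = snapshot_blocks_count(2000, cowed_s_blocks_count=1000)
```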

I have a similar question about bg_cow_bitmap.  Is that really necessary?  The COW bitmap is just a copy of the base file system's block allocation bitmap, right?  (The wiki documentation and the design PDF aren't completely clear on that point, but that seems to be what it is.)  So why do you need to allocate a field out of the bg descriptor for it?

It's not clear why you need an exclude inode, if you are also storing the address of the exclude bitmap blocks in the bg descriptor.  One or the other, but not both...

If s_snapshot_inum and s_snapshot_id refer to the "active" snapshot, why do they need to be in the on-disk structure?  Why not just have the first item of the linked list whose head is s_last_snapshot be the "active" snapshot (if this needs to be in the on-disk state at all); wouldn't the active snapshot be the most recent one anyway?  Also, as far as i_next_snapshot is concerned, why not just use i_dtime for the linked list?  That's what we do with the orphaned inode list, so we have code that maintains a linked list using i_dtime already.  So that way you don't need ...
From: Amir G.
Date: Sunday, May 9, 2010 - 4:56 am

I have 4 non-persistent flags. I will move them to i_state.
I've kept them in i_flags out of laziness, since I use lsattr -X

i_version is used to chain the snapshot list on-disk, similar to orphan list.
I used i_dtime in the past, but I was concerned that a bug would
result in cleanup of all snapshots,
so I started using i_version instead.
I can revert back to using i_dtime (snapshot files are non-truncatable
non-unlinkable) instead of i_version.
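
The on-disk chaining described above can be sketched as a singly linked list of inode numbers. This is a toy model, not the Next3 code: the dictionary stands in for whichever inode field holds the link (i_version or i_dtime), and it is an assumption that new snapshots are pushed at the head and that inode number 0 terminates the list, as with the orphan list.

```python
def push_snapshot(sb, next_of, new_ino):
    """Chain a new snapshot inode in front of the current list head."""
    next_of[new_ino] = sb["s_last_snapshot"]  # link stored in the inode field
    sb["s_last_snapshot"] = new_ino           # list head stored in the superblock


def walk_snapshot_list(sb, next_of):
    """Yield snapshot inode numbers from the newest to the oldest."""
    seen = set()
    ino = sb["s_last_snapshot"]
    while ino != 0:
        if ino in seen:
            raise ValueError("corrupted snapshot list: cycle at inode %d" % ino)
        seen.add(ino)
        yield ino
        ino = next_of[ino]


sb = {"s_last_snapshot": 0}   # empty list
next_of = {}                  # inode number -> next snapshot inode number
for ino in (12, 13, 14):      # hypothetical inode numbers
    push_snapshot(sb, next_of, ino)
chain = list(walk_snapshot_list(sb, next_of))
```

The cycle check reflects the concern raised in the message: a bug in list handling should be detected rather than silently acted upon.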

the snapshot file directory entry name is arbitrary and may be used by
a "snapshot management system" as it wishes,
to organize and display snapshots.
As far as the snapshot sub-system is concerned, the on-disk snapshot

very well, I can read snapshot_blocks_count from COWed superblock (it

bg_cow_bitmap/bg_exclude_bitmap are used by Next3 as a non-persistent
cache for the addresses of bitmap blocks,
which can be read from active_snapshot/exclude_inode.
In other words, instead of allocating a per-group in-memory structure, I
used the 2 unused fields in the in-memory group descriptor.
The only side effect for the ext* on-disk format is that those fields
are no longer 0 after mounting a volume with Next3.
Is that a problem? Can the CSUM feature resolve that problem?

In e2fsprogs, I only reference those fields for debugging purposes
(dumpe2fs displays them).
Also, create_exclude_inode and resize2fs set the bg_exclude_bitmap, but
they don't have to,
because Next3 re-reads all exclude_bitmap block addresses from the exclude
inode at mount time.
so please feel free to reject those field assignments. I can include

Good question. Again, there used to be only 1 field, s_last_snapshot,
but I split it into 2 fields to recover from a crash in the middle
of a snapshot take.
A half-taken snapshot is set as s_last_snapshot, but only a ready
snapshot is set as s_snapshot_inum.
Next3 will clean up a half-taken snapshot at mount time.
tune2fs -O ^has_snapshot will clean up (discard) all snapshot files,
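
The two-field crash recovery described above can be modelled roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the snapshot list is modelled as a dictionary of "next" links, and it is assumed that the two superblock fields agree whenever no snapshot take is in flight.

```python
def take_snapshot(sb, next_of, new_ino, crash_before_ready=False):
    """Model of a snapshot take that may be interrupted half way."""
    next_of[new_ino] = sb["s_last_snapshot"]
    sb["s_last_snapshot"] = new_ino      # half-taken snapshot is already "last"
    if crash_before_ready:
        return                           # simulated crash: never became active
    sb["s_snapshot_inum"] = new_ino      # only a ready snapshot becomes active


def cleanup_on_mount(sb, next_of):
    """Discard a half-taken snapshot, if any; return its inode number or None."""
    last = sb["s_last_snapshot"]
    if last == sb["s_snapshot_inum"]:
        return None                      # nothing was in flight at crash time
    sb["s_last_snapshot"] = next_of.pop(last)   # unlink the half-taken snapshot
    return last


sb = {"s_last_snapshot": 0, "s_snapshot_inum": 0}
next_of = {}
take_snapshot(sb, next_of, 12)                            # completes normally
take_snapshot(sb, next_of, 13, crash_before_ready=True)   # interrupted
discarded = cleanup_on_mount(sb, next_of)
```

With a single field there would be no way to tell, after a crash, whether the snapshot at the head of the list had ever become ready.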

I don't know if you noticed, but I reused the code ...
From: Amir G.
Date: Friday, May 14, 2010 - 11:14 pm

I have started making some changes to on-disk format based on the
points we seem to agree upon.

I would like to register only 1 ro_compat feature (has_snapshot) and 1
compat feature (exclude_inode).
the rest of the "informational features" I would like to move to
s_flags, including NEXT3_FLAGS_BIG_JOURNAL.
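
For readers less familiar with the ext2/3/4 feature sets, the practical difference between a compat and an ro_compat feature can be sketched as below. The rule itself is the standard ext2/3/4 one; the bit values chosen for has_snapshot and exclude_inode are hypothetical placeholders, since no permanent assignments had been made at this point in the thread.

```python
HAS_SNAPSHOT = 0x0080    # hypothetical ro_compat bit, for illustration only
EXCLUDE_INODE = 0x0100   # hypothetical compat bit, for illustration only


def mount_mode(compat, ro_compat, incompat, known_ro_compat, known_incompat):
    """Decide how an implementation may mount a volume with these features.

    Unknown compat bits are ignored, so `compat` never affects the result.
    """
    if incompat & ~known_incompat:
        return "refuse"        # unknown incompat feature: do not mount at all
    if ro_compat & ~known_ro_compat:
        return "read-only"     # unknown ro_compat feature: safe to read only
    return "read-write"


# A volume with snapshots, seen by an implementation that knows nothing of them.
unaware = mount_mode(EXCLUDE_INODE, HAS_SNAPSHOT, 0,
                     known_ro_compat=0, known_incompat=0)
# The same volume, seen by an implementation that understands has_snapshot.
aware = mount_mode(EXCLUDE_INODE, HAS_SNAPSHOT, 0,
                   known_ro_compat=HAS_SNAPSHOT, known_incompat=0)
```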

A Next3 big journal can be created with the option -J big or by
mkfs.next3/mkn3fs.

I will evacuate all the fields that Next3 can do without (i.e.,
{s_snapshot_blocks_count,bg_{cow,exclude}_bitmap}).

I will move the non-persistent snapshot flags to i_state.

I would like to stick with i_version list chaining and not revert to
using i_dtime. awaiting further discussion on that topic.

Awaiting permanent assignments for the rest of the fields/flags.

Per your request, I have added the information above to the Next3 wiki.
Please find the TODO items in red and WIP items in green (implemented
and not published):
http://sourceforge.net/apps/mediawiki/next3/index.php?title=Code_documentation#Reserve...

Also, if you could please drop a line about your view of how to
progress with Next3
(something along the lines of what you said in the conference call), that
would be nice.
Some people may have gotten the impression that the fork from Ext3 is
a show stopper for you,
see: http://lwn.net/SubscriberLink/387231/1310b1360769c12b/

Thanks,
Amir.
--

From: Ric Wheeler
Date: Saturday, May 8, 2010 - 5:51 am

As Ted mentioned in his reply, the big concern is that you are forking 
ext3 instead of adding a new feature to the end of the ext* family of 
file systems.

Since we have multiple snapshot mechanisms in place already (not just 
btrfs & lvm, but don't forget all of the builtin array snapshots), I 
think that we are not in a hurry to get this done quickly. I would 
strongly prefer we get this rebased onto the latest ext4 and resubmitted.

As far as proof goes, I think that the unfortunate burden of proof is on 
your shoulders to prove to us that we should take and maintain those new 
features given the often conflicting priorities :-)

Thanks!

Ric



--

From: Amir G.
Date: Saturday, May 8, 2010 - 3:56 pm

That is a valid concern, but this is where Next3 stands today.
There is no intention of replacing Ext4 with Next4  as the leading ext* f/s.
The branch from Ext3 was made at the time to speed up the development

If I were you I would also prefer to get snapshots in ext4, and even
snapshots alongside extent mapped files,
but unfortunately, I cannot promise to deliver either anytime soon.
I can only promise my support to anyone who wishes to participate in

What can I say, a Windows file server can display previous file
versions without using a costly storage array.
Can a RedHat server do that? Can LVM snapshots be used to do that?
The CTERA NAS appliances do that using Next3.

Amir.
--
