Hi,
Currently, ext3 doesn't have a freeze feature which suspends write
requests. So, we cannot use the storage device's features (snapshot,
replication) to get a consistent backup of the filesystem while it is
mounted.
Many commercial filesystems (e.g. VxFS) have a freeze feature, and it
is used to get a consistent backup.
So I am planning on implementing the ioctl of the freeze feature for ext3.
I think we can get the consistent backup with the following steps.
1. Freeze the filesystem with ioctl.
2. Separate the replication volume or get the snapshot
with the storage device's feature.
3. Unfreeze the filesystem with ioctl.
4. Get the backup from the separated replication volume
or the snapshot.
The usage of the ioctl is as below.
int ioctl(int fd, int cmd, long *timeval)
fd: The file descriptor of the mountpoint.
cmd: EXT3_IOC_FREEZE for the freeze or EXT3_IOC_THAW for the unfreeze.
timeval: The timeout value expressed in seconds.
If it's 0, the timeout isn't set.
Return value: 0 if the operation succeeds. Otherwise, -1.
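The four steps and the ioctl usage above can be sketched from userspace
roughly as follows. This is only a sketch under assumptions: the request
numbers are placeholders (the real values would come from the patched
ext3 header), passing a zero timeval to EXT3_IOC_THAW is a guess, and
snapshot_while_frozen(), ioctl_fn, snapshot_fn and the recording stubs
are hypothetical names introduced here. The ioctl is injected as a
function pointer so that the call order can be checked on a kernel
without the patch.

```c
#include <assert.h>
#include <stddef.h>

/* PLACEHOLDER request numbers, NOT the values defined by the patch. */
#define EXT3_IOC_FREEZE 1UL
#define EXT3_IOC_THAW   2UL

/* Hypothetical hooks: the ioctl entry point and the storage-side snapshot. */
typedef int (*ioctl_fn)(int fd, unsigned long cmd, long *timeval);
typedef int (*snapshot_fn)(void *ctx);

/*
 * Steps 1-3 of the procedure: freeze, snapshot, thaw.
 * The thaw is attempted even if the snapshot fails, so the filesystem
 * is not left frozen. Returns 0 on success, -1 otherwise.
 */
static int snapshot_while_frozen(int fd, long timeout_sec,
                                 ioctl_fn do_ioctl,
                                 snapshot_fn take_snapshot, void *ctx)
{
    long timeval = timeout_sec;   /* 0 means "no timeout" per the proposal */
    int snap_rc, thaw_rc;

    assert(do_ioctl != NULL && take_snapshot != NULL);

    if (do_ioctl(fd, EXT3_IOC_FREEZE, &timeval) != 0)
        return -1;                /* nothing was frozen, nothing to undo */

    snap_rc = take_snapshot(ctx);

    timeval = 0;                  /* assumption: THAW ignores the value */
    thaw_rc = do_ioctl(fd, EXT3_IOC_THAW, &timeval);

    return (snap_rc == 0 && thaw_rc == 0) ? 0 : -1;
}

/* Recording stubs, used only to check the call order in a dry run. */
enum step { STEP_FREEZE, STEP_SNAPSHOT, STEP_THAW };
static enum step step_log[8];
static int step_count;
static long last_freeze_timeout;

static int stub_ioctl(int fd, unsigned long cmd, long *timeval)
{
    (void)fd;
    if (step_count >= 8)
        return -1;
    if (cmd == EXT3_IOC_FREEZE) {
        last_freeze_timeout = *timeval;
        step_log[step_count++] = STEP_FREEZE;
    } else {
        step_log[step_count++] = STEP_THAW;
    }
    return 0;
}

/* ctx may point to an int holding the return code to simulate. */
static int stub_snapshot(void *ctx)
{
    if (step_count >= 8)
        return -1;
    step_log[step_count++] = STEP_SNAPSHOT;
    return ctx ? *(int *)ctx : 0;
}
```

In a real program do_ioctl would presumably be a thin wrapper around
ioctl(2) on a file descriptor opened on the mount point, and step 4
(reading the backup from the snapshot) would happen after this function
returns.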
I have confirmed that write requests are suspended with the experimental
patch for this feature, which is attached to this mail.
The main points of the implementation are as follows.
- Add calls of the freeze function (freeze_bdev) and
the unfreeze function (thaw_bdev) in ext3_ioctl().
- ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
is registered to the delayed work queue to unfreeze the filesystem
automatically after the lapse of the specified time.
Any comments are very welcome.
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
---
diff -uprN -X linux-2.6.24-rc8/Documentation/dontdiff linux-2.6.24-rc8/fs/ext3/ioctl.c linux-2.6.24-rc8-freeze/fs/ext3/ioctl.c
--- linux-2.6.24-rc8/fs/ext3/ioctl.c 2008-01-16 13:22:48.000000000 +0900
+++ linux-2.6.24-rc8-freeze/fs/ext3/ioctl.c 2008-01-22 18:20:33.000000000 +0900
@@ -254,6 +254,42 @@ flags_err:
return err;
}
...
I am also wondering whether we should have system call(s) for these:
And just convert XFS to use them too?
Pekka
--
First of all, Linux already has at least one open-source snapshot
solution (dm-snap) and several commercial ones. In fact, dm-snap is
not perfect:
a) Bitmap loading is not supported (this is useful for freezing only
used blocks), which causes significant slowdown even for new writes.
b) Unpatched dm-snap code has a significant performance slowdown for
all rewrite requests.
c) IMHO the memory footprint is too big.
You have to realize that the delay between stages 1-3 has to be
minimal. For example, dm-snap performs it only for explicit journal
flushing. From my experience, if the delay is more than 4-5 seconds the
whole system becomes unstable.
BTW: you always have to remember that while locking ext3 via
freeze_bdev, sb->ext3_write_super_lockfs() will be called, which is
implemented as a "simple" journal lock. This means that some bios may
still reach the original device even after the filesystem was locked
(I've observed this in real life).
WOW, timeout extending is not supported!? So you want to say that the
caller has to set the timer to the maximum possible timeout from the
very beginning. IMHO it is better to use a heartbeat timer approach,
for example: each second the caller extends its timeout for another two
seconds. With this approach, even if the caller is killed for any
reason, its timeout will expire within two seconds.
if (inode->i_sb->s_frozen == SB_FROZEN) /* extending timeout */
--
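The heartbeat idea above can be modelled in plain userspace C, purely
as an illustration. This sketch assumes that "extending" means
re-arming the deadline to two seconds from the current time. struct
freeze_lease and the lease_*() helpers are hypothetical names, and time
is passed in explicitly as seconds so the logic can be followed without
a real timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEARTBEAT_GRACE_SEC 2L   /* assumption: grace period per heartbeat */

/* Hypothetical model of a frozen filesystem guarded by a deadline. */
struct freeze_lease {
    bool frozen;
    long deadline;               /* absolute time, in seconds */
};

/* Freeze and arm the first deadline. */
static void lease_freeze(struct freeze_lease *l, long now)
{
    assert(l != NULL);
    l->frozen = true;
    l->deadline = now + HEARTBEAT_GRACE_SEC;
}

/* Timer side: auto-thaw once the deadline has passed.
 * Returns true if this call performed the thaw. */
static bool lease_tick(struct freeze_lease *l, long now)
{
    if (l->frozen && now >= l->deadline) {
        l->frozen = false;
        return true;
    }
    return false;
}

/* Caller side: re-arm the deadline while still frozen.
 * Returns 0 if extended, -1 if the freeze has already lapsed. */
static int lease_heartbeat(struct freeze_lease *l, long now)
{
    lease_tick(l, now);          /* a late heartbeat must not revive it */
    if (!l->frozen)
        return -1;
    l->deadline = now + HEARTBEAT_GRACE_SEC;
    return 0;
}
```

If the caller dies after its last heartbeat at time t, the model thaws
at t + 2 at the latest, which is the property the heartbeat approach is
meant to provide.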
Yes, but it requires that the filesystem be stored under LVM. Unlike
what EVMS v1 allowed us to do, we can't currently take a snapshot of a
bare block device. This patch could potentially be useful for systems
That's the problem. You can't afford to freeze for very long.
What you *could* do is to start putting processes to sleep if they
attempt to write to the frozen filesystem, and then detect the
deadlock case where the process holding the file descriptor used to
freeze the filesystem gets frozen because it attempted to write to the
filesystem --- at which point it gets some kind of signal (which
defaults to killing the process), and the filesystem is unfrozen and
as part of the unfreeze you wake up all of the processes that were put
to sleep for touching the frozen filesystem.
The other approach would be to say, "oh well, the freeze ioctl is
inherently dangerous, and root is allowed to shoot himself in the foot, so
who cares". :-)
But this concern is why ext3 never exported freeze
functionality to userspace, even though other commercial filesystems
do support this. It wasn't that it wasn't considered; the concern was
whether or not it was sufficiently safe to make available.
And I do agree that we probably should just implement this in a
filesystem-independent way, in which case all of the filesystems that
support this already have the super_operations functions
write_super_lockfs() and unlockfs().
So if this is done using a new system call, there should be no
filesystem-specific changes needed, and all filesystems which support
those super_operations method functions would be able to provide this
functionality to the new system call.
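As a rough illustration of the filesystem-independent idea, here is a
small userspace model of dispatching through per-filesystem method
pointers. It is not kernel code: the *_model structs, generic_freeze(),
generic_thaw() and the demo ops are hypothetical, and refusing with
-EOPNOTSUPP when a filesystem has no write_super_lockfs method is an
assumption made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct super_block_model;

/* Models the two super_operations methods named in the mail. */
struct super_ops_model {
    void (*write_super_lockfs)(struct super_block_model *sb);
    void (*unlockfs)(struct super_block_model *sb);
};

struct super_block_model {
    const struct super_ops_model *s_op;
    int frozen;
};

/* Generic entry point: no filesystem-specific code needed here. */
static int generic_freeze(struct super_block_model *sb)
{
    assert(sb != NULL);
    if (!sb->s_op || !sb->s_op->write_super_lockfs)
        return -EOPNOTSUPP;      /* this filesystem cannot be frozen */
    if (sb->frozen)
        return -EBUSY;           /* already frozen */
    sb->s_op->write_super_lockfs(sb);
    sb->frozen = 1;
    return 0;
}

static int generic_thaw(struct super_block_model *sb)
{
    assert(sb != NULL);
    if (!sb->frozen)
        return -EINVAL;          /* nothing to thaw */
    if (sb->s_op && sb->s_op->unlockfs)
        sb->s_op->unlockfs(sb);
    sb->frozen = 0;
    return 0;
}

/* Demo filesystem that counts how often each method is invoked. */
static int demo_lock_calls, demo_unlock_calls;
static void demo_lockfs(struct super_block_model *sb) { (void)sb; demo_lock_calls++; }
static void demo_unlockfs(struct super_block_model *sb) { (void)sb; demo_unlock_calls++; }

static const struct super_ops_model demo_ops = { demo_lockfs, demo_unlockfs };
static const struct super_ops_model no_freeze_ops = { NULL, NULL };
```

The point of the design is that the caller only sees generic_freeze()
and generic_thaw(); whether a given filesystem can be frozen is decided
by which methods it fills in.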
- Ted
P.S. Oh yeah, it should be noted that freezing at the filesystem
layer does *not* guarantee that changes to the block device aren't
happening via mmap()'ed files. The LVM needs to freeze writes at the
block device level if it wants to guarantee a completely stable
snapshot
...
I tend to agree. Either you need your fs frozen, or not, and if you do,
That's what I was thinking; can't the path to freeze_bdev just be
elevated out of dm-ioctl.c to fs/ioctl.c and exposed, such that any
filesystem which implements .write_super_lockfs can be frozen? This is
essentially what the xfs_freeze userspace does via
xfs_ioctl/XFS_IOC_FREEZE - which, AFAIK, isn't used much now that the
lvm hooks are in place.
I'm also not sure I see the point of the timeout in the original patch;
either you are done snapshotting and ready to unfreeze, or you're not;
1, or 2, or 3 seconds doesn't really matter. When you're done, you're
done, and you can only unfreeze then. Shouldn't this be done
programmatically, and not with some pre-determined timeout?
-Eric
--
That the admin would manage to deadlock him/herself and wedge up the
This is only a guess, but I suspect it was a fail-safe in case the
admin did manage to deadlock him/herself.
I would think a better approach would be to make the filesystem
unfreeze if the file descriptor that was used to freeze the filesystem
is closed, and then have explicit deadlock detection that kills the
process doing the freeze, at which point the filesystem unlocks and the
system can recover.
- Ted
--
Hmm, not sure that works. I have a shell that I used to freeze the
ext3. Then it is pushed out by dirty data waiting to be written to that
ext3. Deadlock, with the file descriptor still open, and very hard to
detect.
OK, the OOM killer will eventually hit the shell, close the fd and
unfreeze, but that is probably not what you want.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Hi Ted,
There are a few holes:
* The process may try to handle the signal and end up blocking on
the filesystem again.
* The process might pass the fd to another process by forking or
fd passing.
* The process holding the fd might be trying to take a lock held
by another process that is blocked on the filesystem, and infinite
variations on that theme.
Remembering the task that did the ioctl might work out better than
remembering the fd. Or just not try to be so fancy and rely on the
application to take appropriate measures to ensure it will not access
the filesystem, such as memlocking and not execing.
The freezer also needs to run in PF_MEMALLOC mode or similar
unless it can be sure it will not cause pageout to the frozen filesystem
under low memory conditions.
Regards,
Daniel
--
Seems like pointless complexity to me - what happens if a timeout
occurs while the filesystem is still freezing? It's not uncommon for a
freeze to take minutes if memory is full of dirty data that needs to be
flushed out, esp. if
That's inherently unsafe - you can have multiple unfreezes running in
parallel, which seriously screws with the bdev semaphore count that is
used to lock the device, due to doing multiple up()s for every down.
Your timeout thingy guarantees that at some point you will get multiple
up()s occurring due to the timer firing racing with a thaw ioctl.
If this interface is to be more widely exported, then it needs a
complete revamp of how the bdev is locked while it is frozen so that
there is no chance of a double up() ever occurring on the bd_mount_sem
due to racing thaws.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
Sorry, ignore this bit - I just realised the timer is set up after the
freeze has occurred.... Still, that makes it potentially dangerous to
whatever is being done while the filesystem is frozen....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
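The double up() concern raised in this thread (a timer-driven thaw
racing with a thaw ioctl) can be illustrated with a small userspace
model. This is not the kernel's freeze_bdev()/thaw_bdev() code:
struct bdev_model and the model_*() functions are hypothetical, the
semaphore is reduced to a counter, and a freeze on an already frozen
device simply fails here instead of blocking. The guarded thaw uses an
atomic exchange on a "frozen" flag so that only one of the racing
callers performs the up().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: mount_sem stands in for the bd_mount_sem count. */
struct bdev_model {
    atomic_bool frozen;
    atomic_int mount_sem;        /* 1 = available, 0 = held by a freeze */
};

static void bdev_model_init(struct bdev_model *b)
{
    assert(b != NULL);
    atomic_init(&b->frozen, false);
    atomic_init(&b->mount_sem, 1);
}

/* Freeze: takes the semaphore once. Returns -1 if already frozen. */
static int model_freeze(struct bdev_model *b)
{
    bool expected = false;

    if (!atomic_compare_exchange_strong(&b->frozen, &expected, true))
        return -1;
    atomic_fetch_sub(&b->mount_sem, 1);   /* models down() */
    return 0;
}

/* Unguarded thaw: every caller does an up(), as in the race described. */
static void model_thaw_unguarded(struct bdev_model *b)
{
    atomic_store(&b->frozen, false);
    atomic_fetch_add(&b->mount_sem, 1);   /* models up() */
}

/* Guarded thaw: only the caller that flips frozen -> false does the up().
 * Returns 0 if this call thawed, -1 if someone else already had. */
static int model_thaw(struct bdev_model *b)
{
    if (!atomic_exchange(&b->frozen, false))
        return -1;
    atomic_fetch_add(&b->mount_sem, 1);   /* models up() */
    return 0;
}
```

With the unguarded variant, a timer thaw followed by an ioctl thaw
leaves the counter at 2 after a single freeze; with the guarded variant
the second thaw is rejected and the counter returns to 1.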
