Hi all, I would like to present a new way to deal with TRIM in ext4 file system. The current solution is not ideal because of its bad performance impact. So basic idea to improve things is to avoid discarding every time some blocks are freed. and instead batching is together into bigger trims, which tends to be more effective. The basic idea behind my discard support is to create an ioctl which walks through all the free extents in each allocating group and discard those extents. As an addition to improve its performance one can specify minimum free extent length, so ioctl will not bother with shorter extents. This of course means, that with each invocation the ioctl must walk through whole file system, checking and discarding free extents, which is not very efficient. The best way to avoid this is to keep track of deleted (freed) blocks. Then the ioctl have to trim just those free extents which were recently freed. In order to implement this I have added new bitmap into ext4_group_info (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then walk through bb_bitmap_deleted, compare deleted extents with free extents trim them and then removes it from the bb_bitmap_deleted. But you may notice, that there is one problem. bb_bitmap_deleted does not survive umount. To bypass the problem the first ioctl call have to walk through whole file system trimming all free extents. But there is a better solution to this problem. The bb_bitmap_deleted can be stored on disk an can be restored in mount time along with other bitmaps, but I think it is a quite big change and should be discussed further. I have also benchmarked it a little. You can find results here: people.redhat.com/jmoyer/discard/ext4_batched_discard/ comparison with current solution included. Keep in mind that ideal ioctl invocation interval is yet to be determined, so in benchmark I have used the performance-worst scenario - without any sleep between execution. There are two patches for this. ...
Adds an filesystem independent ioctl to allow implementation of file
system batched discard support.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
fs/ioctl.c | 31 +++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 6c75110..a43eaea 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -537,6 +537,33 @@ static int ioctl_fsthaw(struct file *filp)
return thaw_bdev(sb->s_bdev, sb);
}
+static int ioctl_fstrim(struct file *filp, unsigned long arg)
+{
+ struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
+ unsigned int minlen;
+ int err;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* If filesystem doesn't support trim feature, return. */
+ if (sb->s_op->trim_fs == NULL)
+ return -EOPNOTSUPP;
+
+ /* If a blockdevice-backed filesystem isn't specified, return EINVAL. */
+ if (sb->s_bdev == NULL)
+ return -EINVAL;
+
+ err = get_user(minlen, (unsigned int __user *) arg);
+ if (err)
+ return err;
+
+ err = sb->s_op->trim_fs(minlen, sb);
+ if (err)
+ return err;
+ return 0;
+}
+
/*
* When you add any new common ioctls to the switches above and below
* please update compat_sys_ioctl() too.
@@ -587,6 +614,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
error = ioctl_fsthaw(filp);
break;
+ case FITRIM:
+ error = ioctl_fstrim(filp, arg);
+ break;
+
case FS_IOC_FIEMAP:
return ioctl_fiemap(filp, arg);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..f6ecdff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ struct inodes_stat_t {
#define FIGETBSZ _IO(0x00,2) /* get the block size used for bmap */
#define FIFREEZE _IOWR('X', 119, int) /* Freeze */
#define FITHAW _IOWR('X', 120, int) /* Thaw */
+#define FITRIM _IOWR('X', 121, int) /* Trim */
#define FS_IOC_GETFLAGS _IOR('f', 1, long)
...Create an ioctl which walks through all the free extents in each
allocating group and discard those extents. As an addition to improve
its performance one can specify minimum free extent length, so ioctl
will not bother with shorter extents.
This of course means, that with each invocation the ioctl must walk
through whole file system, checking and discarding free extents, which
is not very efficient. The best way to avoid this is to keep track of
deleted (freed) blocks. Then the ioctl have to trim just those free
extents which were recently freed.
In order to implement this I have added new bitmap into ext4_group_info
(bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
walk through bb_bitmap_deleted, compare deleted extents with free
extents trim them and then removes it from the bb_bitmap_deleted.
But you may notice, that there is one problem. bb_bitmap_deleted does
not survive umount. To bypass the problem the first ioctl call have to
walk through whole file system trimming all free extents.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
fs/ext4/ext4.h | 4 +
fs/ext4/mballoc.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++---
fs/ext4/super.c | 1 +
3 files changed, 202 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..e25f672 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
ext4_group_t, int);
+extern int ext4_trim_fs(unsigned int, struct super_block *);
+
/* inode.c */
struct buffer_head *ext4_getblk(handle_t *, struct inode *,
ext4_lblk_t, int, int *);
@@ -1682,6 +1684,8 @@ struct ext4_group_info {
#ifdef DOUBLE_CHECK
void *bb_bitmap;
#endif
+ void *bb_bitmap_deleted;
+ ext4_grpblk_t bb_deleted;
struct ...Mark, This is the patch implementing the new discard logic. You did the benchmarking last year, but I thought you found calling trim one contiguous sector range at a time was too inefficient. See my highlight below: Mark, unless I'm missing something, sb_issue_discard() above is going to trigger a trim command for just the one range. I thought the benchmarks you did showed that a collection of ranges needed to be built, then a single trim command invoked that trimmed that group of ranges. --
.. Mmm.. If that's what it is doing, then this patch set would be a complete disaster. It would take *hours* to do the initial TRIM. Lukas ? --
I'm confused; do we have an interface to send a trim command for multiple ranges? I didn't think so ... Lukas' patch is finding free ranges (above a size threshold) to discard; it's not doing it a block at a time, if that's the concern. -Eric --
Eric, I don't know what kernel APIs have been created to support discard, but the ATA8 draft spec. allows for specifying multiple ranges in one trim command. See section 7.10.3.1 and .2 of the latest draft spec. Both talk about multiple trim ranges per trim command (think thousands of ranges per command). Recent hdparm versions accept a trim command argument that causes multiple ranges to be trimmed per command. --trim-sector-ranges Tell SSD firmware to discard unneeded data sectors: lba:count .. --trim-sector-ranges-stdin Same as above, but reads lba:count pairs from stdin As I understand it, this is critical from a performance perspective for the SSDs Mark tested with. ie. He found a single trim command with 1000 ranges takes much less time than 1000 discrete trim commands. Per Mark's comment's in wiper.sh, a trim command can have a minimum of 128KB of associated range information, so it is thousands of ranges that can be discarded in a single command ie. hdparm can accept extremely large lists of ranges on stdin, but it parses the list into discrete trim commands with thousands of ranges per command. A kernel implementation which is trying to implement after that fact discards as this patch is doing, also needs to somehow craft trim commands with a large payload of ranges if it is going to be efficient. If the block layer cannot do this yet, then in my opinion this type of batched discarding needs to stay in user space as done with Mark's wiper.sh script and enhanced hdparm until the block layer grows that ability. Greg --
Greg, We have full support for this in the "discard" support at the file system layer for several file systems. The block layer effectively muxes the "discard" into the right target device command. TRIM for ATA, WRITE_SAME (with unmap) or UNMAP for SCSI... If your favourite fs supports this, you can enable this feature with "-o discard" for fine grained discards, ric --
Well, sb_issue_discard is what ext4 is using, and that takes a single Thanks, it's worth pointing out that TRIM is not the only backend to the discard API. However, even if we do implement a vectored API, we can translate that to dumber commands if a given spec doesn't support it. Getting back to the problem... From the file system, you want to discard discrete ranges of blocks. The API to support this can either take care of the data integrity guarantees by itself, or make the upper layer ensure that trim and write do not pass each other. The current implementation does the latter. In order to do the former, there is the potential for a lot of overhead to be introduced into the block allocation layers for the file systems. So, given the above, it is up to the file system to send down the biggest discard requests it can in order to reduce the overhead of the command. If a vectored approach is made available, then that would be even better. Christoph, is this something that's on your radar? Cheers, Jeff --
Looking at the benchmarks (for the first time) at http://people.redhat.com/jmoyer/discard/ext4_batched_discard/ I don't see anything that says how long the proposed trim ioctl takes to complete on the full filesystem. What they do show is that with the 3 test SSDs used for this benchmark, the current released discard implementation is a net loss. ie. You are better off running without the discards for all 3 vendors. (at least under the conditions tested.) After the patch is applied and optimizing the discards to large free extents only, it works out to same performance with or without the discards. ie. no net gain or loss. That is extremely cool because one assumes that the non-discard case would degrade over time, but that the discard case will not. So that argues for the current proposed patch going in. But quoting from the first email: == The basic idea behind my discard support is to create an ioctl which walks through all the free extents in each allocating group and discard those extents. As an addition to improve its performance one can specify minimum free extent length, so ioctl will not bother with shorter extents. This of course means, that with each invocation the ioctl must walk through whole file system, checking and discarding free extents, which is not very efficient. The best way to avoid this is to keep track of deleted (freed) blocks. Then the ioctl have to trim just those free extents which were recently freed. In order to implement this I have added new bitmap into ext4_group_info (bb_bitmap_deleted) which stores recently freed blocks. The ioctl then walk through bb_bitmap_deleted, compare deleted extents with free extents trim them and then removes it from the bb_bitmap_deleted. But you may notice, that there is one problem. bb_bitmap_deleted does not survive umount. To bypass the problem the first ioctl call have to walk through whole file system trimming all free extents. But there is a better solution to this problem. The ...
correcting Christoph's email address - no other edits/comments -- Greg Freemyer Head of EDD Tape Extraction and Processing team Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer CNN/TruTV Aired Forensic Imaging Demo - http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/ The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com --
All the other things aside, mount-time is interesting, but it's an infrequent operation, at least in my world. I think we need something that can be done runtime. For anything with uptime, I don't think it's acceptable to wait until the next mount to trim unused blocks. But as long as the mechanism can be called either at mount time and/or kicked off runtime somehow, I'm happy. -Eric --
That makes sense to me. Most enterprise servers will go without remounting a file system for (hopefully!) a very long time. It is really important to keep in mind that this is not just a laptop feature for laptop SSD's, this is also used by high end arrays and *could* be useful for virt IO, etc as well :-) ric --
Adding James Bottomley because high-end scsi is entering the discussion. James, I have a couple scsi questions for you at the end. I'm not arguing that a runtime solution is not needed. I'm arguing that at least for SSD backed filesystems Mark's userspace implementation shows how the mount time initialization of the runtime bitmap can be accomplished in a few seconds by leveraging the hardware and using vector'ed trims as opposed to having to build an additional on-disk structure. At least for SSDs, the primary purpose of the proposed on-disk structure seems to be to overcome the current lack of a vector'ed discard implementation. If it is too difficult to implement a fully functional vector'ed discard in the block layer due to locking issues, possibly a special purpose version could be written that is only used at mount time when one can be assured no other i/o is occurring to the filesystem. James, The ATA-8 spec. supports vectored trims and requires a minimum of 255 sectors worth of range payload be supported. That equates to a single trim being able to trim thousands of ranges in one command. Mark Lord has benchmarked in found a vectored trim to be drastically faster than calling trim individually for each of those ranges. Does scsi support vector'ed discard? (ie. write-same commands) Or are high-end scsi arrays so fast they can process tens of thousands of discard commands in a reasonable amount of time, unlike the SSDs have so far proven to do. It would be interesting to find out that a SSD can discard thousands of ranges drastically faster than a high-end scsi device can. But if true, that might argue for the on-disk bitmap to track previously discarded blocks/extents. Greg --
So what's wrong with using wiper.sh? It can do online discard of No ... they actually have two problems: firstly they can only use discard ranges which align with their internal block size (usually something huge like 3/4MB) and then a trim operation tends to be O(1) I think SSDs and Arrays both have discard problems, arrays more to do with the time and expense of the operation, SSDs because the TRIM command isn't queued. James --
.. Does anyone have an Intel-based SSD they could spare? :) Rumour has it they they do not comply with the ATA-8 spec for TRIM, in that they don't support more than one "sector" of range data. The SSDs I have here all support much more than that. Still, even if we just do a single 512-byte payload of TRIM ranges, it would be a huge win. Cheers --
Well, it strongly depends on how is the file system fragmented. On the fresh file system (147G) the initial ioctl takes 2 seconds to finish (it may be worth to mention that on another SSD (111G) it takes 56s). I will try to get some numbers for the "usual" file system (not full, not I do not know much about how production system is being used, but I doubt this is that big issue. Sure the initial ioctl takes long to finish and there is a place for improvement, there was a proposal to do the initial trim at mount time. I do not think that it is wise, why to block mount, when the trim can be run at background when the fs is mounted ? Of course there will be some performance loss, while ioctl will be in progress, but it will not block. There are also another way to overcome this problem. We can assure that the file system is left trimmed after umount. To do this, we can simply trim the fs at umount time. I think this won't be any problem and we even do not prolong the umount time too much, because we will not trim whole fs, but just recently freed blocks. This of course bring another problem, when the system is not properly terminated and the umount is not properly finished (or done at all). But this can be solved in fsck at boot time I think. This will entirely eliminate the need to trim the whole fs (except the fsck obviously), Vectored trim will be great, I did not tested anything like that but obviously it will strongly reduce time needed to trim the fs. But we do And also, currently I am rewriting the patch do use rbtree instead of the bitmap, because there were some concerns of memory consumption. It is a question whether or not the rbtree will be more memory friendly. Generally I think that in most "normal" cases it will, but there are some extreme scenarios, where the rbtree will be much worse. Any comment on this ? Thanks. -Lukas
I know I've been arguing against this patch for the single SSD case and I still think that use case should be handled by userspace as hdparm/wiper.sh currently does. In particular for those extreme scenarios with JBOD SSDs, the user space solution wins because it knows how to optimize the trim calls via vectorized ranges in the payload. Thus I think the community and distro's should be testing that pair and pushing it out in the distro's for typical laptop use. But, that still leaves high-end external raid arrays, mdraid, and lvm unaddressed. Those use cases will likely benefit from the approach this patch takes the most. In particular, mdraid with raid 5/6 requires an approach like this patch provides, or it has to create its own in kernel discard aggregator which seems like a waste of time. In general, those use cases have large minimum discard units. Thus I think this patch should be tuned to work with large discard units and ignore small ones. That means it needs to get the underlying block layer topology and ignore unused space smaller than underlying layers minimum discard unit. That should allow a rb tree to be used and eliminate the extreme scenarios. (ie. I assume your extreme scenarios involve large numbers of very small unused ranges.) That may mean the topology information needs to grow some discard info. Does anyone know if that info is easily derived from the currently existing topo info? Greg -- Greg Freemyer Head of EDD Tape Extraction and Processing team Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer CNN/TruTV Aired Forensic Imaging Demo - http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/ The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com --
I think that you have missed the broader point. This is not on by default, so you can mount without discard and use whatever user space utility you like at your discretion. --
Ric, I was trying to say the design should be driven by the large discard range use case, not the potentially pathological small discard range use case that would only benefit SSDs. Greg --
Bear in mind that this patch makes the discard range requests substantially -larger- than what mount -o discard does on ext4 today, in fact that was a main goal. If the kernel could assemble vectors of ranges and pass them down, I think it could be extended to use that as well. -Eric --
Eric/Ric, I was responding to the Lukas latest post which stated: == And also, currently I am rewriting the patch do use rbtree instead of the bitmap, because there were some concerns of memory consumption. It is a question whether or not the rbtree will be more memory friendly. Generally I think that in most "normal" cases it will, but there are some extreme scenarios, where the rbtree will be much worse. Any comment on this ? == If one optimizes for large discard ranges and ignores the pathological cases only beneficial to SSDs, then a rbtree wins. Correct? Greg --
Let's summarize. 1. Everyone agrees that doing larger discard instead of little discards is a good thing. 2. Some devices care about this more than others (various types of SSD's and arrays have different designs and performance with discards). Some devices do small discards well, others don't. 3. How you get to those bigger discards in our implementation - using a series of single range requests, using vectored requests, tracking extents that can be combined in an rbtree or not - is something that we are working on. Using rbtrees versus a bitmap efficiency is about DRAM consumption, not performance of the resulting discard on the target. 4. Devices (some devices) can export their preferences in a standard way (look in /sys/block/....). If you want to influence the code, please do try the various options on devices you have at hand and report results. That is what we are doing (we includes Lukas, Eric, Jeff and others on this thread) will real devices from vendors that have given us access. We are talking to them directly and trying out different work loads but certainly welcome real world results and suggestions. Thanks! Ric --
On Sat, Apr 24, 2010 at 1:04 PM, Ric Wheeler <rwheeler@redhat.com> wrote: Ric, Is it also agreed by all that the current ext4 kernel implementation of calling discard is a poor solution for most hardware / block layers stacks / workloads and therefore is not worth retaining nor performing further benchmarks? I've not seen anyone arguing to keep the current kernel implementation and I for one accept the previously posted benchmarks that show it is not adding any value relative to the traditional non-discard case. Therefore benchmarks between the current hdparm/wiper.sh userspace implementation and a proposed new kernel implementation would be the most beneficial? I have yet to see any of those benchmarks posted. Greg -- Greg Freemyer Head of EDD Tape Extraction and Processing team Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer CNN/TruTV Aired Forensic Imaging Demo - http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/ The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com --
Greg, I don't like the user space wiper.sh approach in general, but running wiper.sh is entirely your choice. Most users prefer having the file system and the IO stack take care of this for them, but again, entirely configurable. The benchmarks we have done are often done on hardware that is under NDA so we cannot publish results across the board. Feel free to run on hardware that you buy and share the results. Thanks! Ric --
On 24/04/10 02:41 PM, Ric Wheeler wrote: .. I wrote wiper.sh, and I would prefer more of an in-kernel solution, simply because it could potentially have the smarts to deal with RAID, LVM, and btrfs's own duplicate implementation of RAID/LVM. Those cannot be done from userspace on a mounted filesystem. So.. again.. in an ideal kernel, I'd like to see use of larger ranges (and _multiple_ ranges) per TRIM command. And options for the kernel to do it automatically (-o discard), as well as an ioctl() interface for userspace to "scrub" (or "wipe") all free ranges in a gradual fashion. Cheers --
>>>>> "Mark" == Mark Lord <kernel@teksavvy.com> writes: Mark> So.. again.. in an ideal kernel, I'd like to see use of larger Mark> ranges (and _multiple_ ranges) per TRIM command. And options for Mark> the kernel to do it automatically (-o discard), as well as an Mark> ioctl() interface for userspace to "scrub" (or "wipe") all free Mark> ranges in a gradual fashion. Again: Discard splitting, merging, and coalescing is coming. I'm working on it. It's not exactly trivial. My objection was purely wrt. nuking realtime discard in favor of scrubbing. We absolutely need both approaches. But I'm not going to let crappy SSD firmware implementations set the direction for what I'm working on. It is much more interesting to ensure that we work well for the devices that'll be shipping 6-12 months from now (at the time where this stuff will realistically end up in distro kernels). -- Martin K. Petersen Oracle Linux Engineering --
On Mon, Apr 26, 2010 at 10:42 AM, Martin K. Petersen Is this an agreed summary as relates to ext4: 1) Current "-o discard" causes real-time calls to discard. Although not optimized for current generation hardware, both block-layer optimizations and new hardware are coming, so it needs to be kept. 2) Kernel based discard scrubbing - Doesn't currently exist in 2.6.34, all agree that for the LVM, mdraid, btrfs cases it is needed and there is no linux tool (kernel or userspace) at present. The Lukas proposed patch is userspace invoked so a mount option is not needed. 3) Userspace discard is available today and works well with ext4 on a single JBOD SSD which will be the typical laptop use as one example. Mark, or anyone, do you think it would be a bad idea for me to push for Opensuse 11.3 (2.6.34 based) to use wiper.sh as a standard ext4 discard tool? hdparm and wiper.sh are already in the distro, it just needs a cron entry and maybe some basic management infrastructure. They're at feature freeze for 11.3, so I don't know if I can get it in or not. If and when the suse team want to move to the kernel based scrubber, they can. ie. a year from now when the next release after 11.3 comes out. Greg --
It is userspace invoked indeed, but once you set the nodiscard mount option it will not do anything. It seems only logical to me to NOT It (wiper) do not work on devices which does not support vectored trim -Lukas
On 26/04/10 11:27 AM, Greg Freemyer wrote: .. I'd hold off for now. Rumour has it that Intel SSDs are not 100% compliant with TRIM -- they fail with large amounts of range data. I don't have an Intel SSD to verify that or fix it with. Cheers --
And one thing to keep in mind is that we are likely to need both run time support for discard as well as occasional resync in bulk since the storage can choose to ignore the command and still be in spec. Kind of like the unfortunate "defrag" thing that windows people do, ric --
>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes: Greg> Is it also agreed by all that the current ext4 kernel Greg> implementation of calling discard is a poor solution for most Greg> hardware / block layers stacks / workloads and therefore is not Greg> worth retaining nor performing further benchmarks? Our discard implementation is meant to accommodate a wide range of devices. Just because some of the currently shipping low-end consumer SSDs implement TRIM poorly does not mean we're going to scrap what we have. We are not in the business of designing for the past. Especially not when the past can be handled by a shell script. For future devices TRIM/UNMAP is going to be an inherent part of the command set. And that's what our kernel support is aimed at. There are some deficiencies right now because the block layer was not built to handle what is essentially a hybrid between a filesystem and a blk_pc type request. I'm working on fixing that. Greg> I've not seen anyone arguing to keep the current kernel Greg> implementation and I for one accept the previously posted Greg> benchmarks that show it is not adding any value relative to the Greg> traditional non-discard case. For enterprise storage the cost of sending discards is limited to the overhead of sending the command. I.e. negligible. Eventually that's going to be the case for ATA devices as well. And let me just reiterate that Windows 7 does issue TRIM like we do (at runtime). And consequently Windows 7 will help weed out the crap SSD implementations from the market. That's already happening. There is also work underway to make TRIM a queueable command which would further alleviate the situation. Greg> Therefore benchmarks between the current hdparm/wiper.sh userspace Greg> implementation and a proposed new kernel implementation would be Greg> the most beneficial? I don't know what you mean by "new" kernel implementation. We're working on tweaking what we have so that we can ...
On 24/04/10 03:06 PM, Martin K. Petersen wrote: .. The current implementation works extremely poorly for the singlemost common style of hardware that's out there. Millions and millions of SATA SSDs, and they're becoming more and more ubiquitous. A "solution" that ignores reality isn't very helpful. .. We are in the business of supporting the hardware that people run Linux on. Today, and tomorrow, and the next few years, that means SATA SSDs by the gazillions, as well as a relatively smaller number of enterprise behemoths. A shell script cannot currently deal with LVM, RAID, or btrfs filesystems. Cheers --
>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes: Greg> That may mean the topology information needs to grow some discard Greg> info. Does anyone know if that info is easily derived from the Greg> currently existing topo info? The values are already part of the topology information (assuming the device reports them). Went into 2.6.33. -- Martin K. Petersen Oracle Linux Engineering --
I see two possible improvements here: a) At a cost of some code complexity, you can bound the worst case by combining RB-trees with bitmaps. The basic idea is that when space to TRIM gets too fragmented (memory to keep to-TRIM blocks in RB-tree for a given group exceeds the memory needed to keep it in a bitmap), you convert RB-tree for a problematic group to a bitmap and attach it to an appropriate RB-node. If you track with a bitmap also a number of to-TRIM extents in the bitmap, you can also decide whether it's benefitial to switch back to an RB-tree. b) Another idea might be: When to-TRIM space is fragmented (again, let's say in some block group), there's not much point in sending tiny trim commands anyway (at least that's what I've understood from this discussion). So you might as well stop maintaining information which blocks we need to trim for that group. When the situation gets better, you can always walk block bitmap and issue trim commands. You might even trigger this rescan from kernel - if you'd maintain number of free block extents for each block group (which is rather easy), you could trigger the bitmap rescan and trim as soon as ratio number of free blocks / number of extents gets above a reasonable threshold. Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs --
This sounds like a good idea, but I wonder if it is worth it : 1. The tree will have very short life, because with next ioctl all stored deleted extents will be trimmed and removed from the tree. 2. Also note, that the longer it lives the less fragmented it possibly became. 3. I do not expect, that deleted ranges can be too fragmented, and even if it is, it will be probably merged into one big extent very In what I am preparing now, I simple ignore small extents, which would be created by splitting the deleted extent into smaller pieces by chunks of used blocks. This, in my opinion, will prevent the fragmentation, which otherwise may occur in the longer term (between ioctl calls). Thanks for suggestions. -Lukas --
I am not convinced that ignoring small extents is a good idea. Remember that for SSD's specifically, they remap *everything* internally so our "fragmentation" set of small spaces could be useful for them. That does not mean that we should not try to send larger requests down to the target device which is always a good idea I think :-) ric --
That's right, so the other approach would be probably better. Merge small extents together into one, but there must be some limit, because I do not want two little extents at the beginning and the end of the group to force trimming whole group. The whole rbtree thing gets a little complicated :) -Lukas --
This discussion is getting a bit too abstract for me. Show us the code and we can make some progress. =) On the topic of discarding small blocks, I agree with Ric, it should be done. Cheers, Jeff --
Create an ioctl which walks through all the free extents in each allocating group and discard those extents. As an addition to improve its performance one can specify minimum free extent length, so ioctl will not bother with shorter extents. This of course means, that with each invocation the ioctl must walk through whole file system, checking and discarding free extents, which is not very efficient. The best way to avoid this is to keep track of deleted (freed) blocks. Then the ioctl have to trim just those free extents which were recently freed. In order to implement this I have created new structure ext4_deleted_data which represents deleted extent in per-group rbtree. When blocks are freed, extents are stored in the tree (possibly merged with other extents). The ioctl then can walk through the tree, take out extents from the tree, compare them with the free extents and possibly trim them. Note that there is support for setting minimum extent length in ioctl call, so it will not bother with shorter extents. Also, when the previously deleted range is taken from the tree and it is not entirely free, the free fragments are discarded and extents shorter than minlen are NOT returned back to the tree to avoid fragmentation of the tree which could lead to the big memory consumption. But you may notice, that there is one problem. bb_bitmap_deleted does not survive umount. To bypass the problem the first ioctl call have to walk through whole file system trimming all free extents. Signed-off-by: Lukas Czerner <lczerner@redhat.com> --- fs/ext4/ext4.h | 5 + fs/ext4/mballoc.c | 321 +++++++++++++++++++++++++++++++++++++++++++++++++++-- fs/ext4/mballoc.h | 9 ++ fs/ext4/super.c | 1 + 4 files changed, 325 insertions(+), 11 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index bf938cf..41fe9ec 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb, extern int ...
For now it just ignores the small extents to avoid fragmentation. As I said before, I agree that they should not be ignored, I just need to figure out the way to do this efficiently. Also it was not properly tested yet. -Lukas. --
I suggest to not start with optimisations: let's first take a look what is going on in the simple case. Benchmarks are our friends.. --
correcting Christoph's email address - no other edits/comments -- Greg Freemyer Head of EDD Tape Extraction and Processing team Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer CNN/TruTV Aired Forensic Imaging Demo - http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/ The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com --
Adding Mark Lord in cc. He wrote a preliminary discard solution last summer. I'm not sure how it has progressed. Mark, you can find the 2 patches at: http://patchwork.ozlabs.org/patch/50441/ http://patchwork.ozlabs.org/patch/50442/ Greg --
The difference here is that Mark's stuff wasn't as tightly integrated with the kernel, IIRC. What I saw was more at a user level - make a big file, map it, discard all the blocks, unlink the file. It was a good first step, but I think we can do a lot better by using fs-specific calls to be efficient & targeted about the discards. Christoph has a similar approach for XFS, FWIW. --
I haven't looked closely at this patch, but I recall Mark consolidated numerous discontinuous trim/discard/unmap ranges into a single command to the SSD drive. That was why he felt he was getting superior performance. ie. There was an overhead per command to the drive that was eliminated if a single more complex command with multiple ranges went to the SSD drive. But he's the one that did the work and the benchmarking, so I'll let him take it from here, especially if I mis-understood what he was doing. Greg --
This work certainly builds on Mark's early results which clearly showed that several devices see a win from doing larger/batched discards instead of the fine grained ones. Thanks! Ric --
.. Perfect. I proposed exactly this last year, but never found time to work it further. Please proceed with it! .. That's not necessary. It is a simple matter to TRIM a clean filesystem before it is mounted R/W during the boot sequence. No new information is required for that operation. My wiper.sh script already shows walking the freelists and trimming all available space, something which takes only a couple of seconds on a 120GB filesystem that has not been kept up-to-date with on-the-fly trim. Cheers --
