[PATCH 1/2] Add ioctl FITRIM.

Previous thread: Introducing Next3 - built-in snapshots support for Ext3 by Amir G. on Sunday, April 18, 2010 - 8:41 am. (19 messages)

Next thread: Re: [PATCH 13/35] fallthru: ext2 fallthru support by Jan Blunck on Monday, April 19, 2010 - 5:40 am. (5 messages)
From: Lukas Czerner
Date: Monday, April 19, 2010 - 3:55 am

Hi all,

I would like to present a new way to deal with TRIM in ext4 file system.
The current solution is not ideal because of its bad performance impact.
So basic idea to improve things is to avoid discarding every time some
blocks are freed. and instead batching is together into bigger trims,
which tends to be more effective.

The basic idea behind my discard support is to create an ioctl which
walks through all the free extents in each allocating group and discard
those extents. As an addition to improve its performance one can specify
minimum free extent length, so ioctl will not bother with shorter extents.

This of course means, that with each invocation the ioctl must walk
through whole file system, checking and discarding free extents, which
is not very efficient. The best way to avoid this is to keep track of
deleted (freed) blocks. Then the ioctl have to trim just those free
extents which were recently freed.

In order to implement this I have added new bitmap into ext4_group_info
(bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
walk through bb_bitmap_deleted, compare deleted extents with free
extents trim them and then removes it from the bb_bitmap_deleted. 

But you may notice, that there is one problem. bb_bitmap_deleted does
not survive umount. To bypass the problem the first ioctl call have to
walk through whole file system trimming all free extents. But there is a
better solution to this problem. The bb_bitmap_deleted can be stored on
disk an can be restored in mount time along with other bitmaps, but I
think it is a quite big change and should be discussed further.

I have also benchmarked it a little. You can find results here:

people.redhat.com/jmoyer/discard/ext4_batched_discard/

comparison with current solution included. Keep in mind that ideal ioctl
invocation interval is yet to be determined, so in benchmark I have used
the performance-worst scenario - without any sleep between execution.


There are two patches for this. ...
From: Lukas Czerner
Date: Monday, April 19, 2010 - 3:55 am

Adds an filesystem independent ioctl to allow implementation of file
system batched discard support.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ioctl.c         |   31 +++++++++++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/fs/ioctl.c b/fs/ioctl.c
index 6c75110..a43eaea 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -537,6 +537,33 @@ static int ioctl_fsthaw(struct file *filp)
 	return thaw_bdev(sb->s_bdev, sb);
 }
 
+static int ioctl_fstrim(struct file *filp, unsigned long arg)
+{
+	struct super_block *sb = filp->f_path.dentry->d_inode->i_sb;
+	unsigned int minlen;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* If filesystem doesn't support trim feature, return. */
+	if (sb->s_op->trim_fs == NULL)
+		return -EOPNOTSUPP;
+
+	/* If a blockdevice-backed filesystem isn't specified, return EINVAL. */
+	if (sb->s_bdev == NULL)
+		return -EINVAL;
+
+	err = get_user(minlen, (unsigned int __user *) arg);
+	if (err)
+		return err;
+
+	err = sb->s_op->trim_fs(minlen, sb);
+	if (err)
+		return err;
+	return 0;
+}
+
 /*
  * When you add any new common ioctls to the switches above and below
  * please update compat_sys_ioctl() too.
@@ -587,6 +614,10 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
 		error = ioctl_fsthaw(filp);
 		break;
 
+	case FITRIM:
+		error = ioctl_fstrim(filp, arg);
+		break;
+
 	case FS_IOC_FIEMAP:
 		return ioctl_fiemap(filp, arg);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 39d57bc..f6ecdff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,7 @@ struct inodes_stat_t {
 #define FIGETBSZ   _IO(0x00,2)	/* get the block size used for bmap */
 #define FIFREEZE	_IOWR('X', 119, int)	/* Freeze */
 #define FITHAW		_IOWR('X', 120, int)	/* Thaw */
+#define FITRIM		_IOWR('X', 121, int)	/* Trim */
 
 #define	FS_IOC_GETFLAGS			_IOR('f', 1, long)
 ...
From: Lukas Czerner
Date: Monday, April 19, 2010 - 3:55 am

Create an ioctl which walks through all the free extents in each
allocating group and discard those extents. As an addition to improve
its performance one can specify minimum free extent length, so ioctl
will not bother with shorter extents.

This of course means, that with each invocation the ioctl must walk
through whole file system, checking and discarding free extents, which
is not very efficient. The best way to avoid this is to keep track of
deleted (freed) blocks. Then the ioctl have to trim just those free
extents which were recently freed.

In order to implement this I have added new bitmap into ext4_group_info
(bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
walk through bb_bitmap_deleted, compare deleted extents with free
extents trim them and then removes it from the bb_bitmap_deleted. 

But you may notice, that there is one problem. bb_bitmap_deleted does
not survive umount. To bypass the problem the first ioctl call have to
walk through whole file system trimming all free extents.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h    |    4 +
 fs/ext4/mballoc.c |  207 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 fs/ext4/super.c   |    1 +
 3 files changed, 202 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..e25f672 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
 extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
 extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
 						ext4_group_t, int);
+extern int ext4_trim_fs(unsigned int, struct super_block *);
+
 /* inode.c */
 struct buffer_head *ext4_getblk(handle_t *, struct inode *,
 						ext4_lblk_t, int, int *);
@@ -1682,6 +1684,8 @@ struct ext4_group_info {
 #ifdef DOUBLE_CHECK
 	void            *bb_bitmap;
 #endif
+	void		*bb_bitmap_deleted;
+	ext4_grpblk_t	bb_deleted;
 	struct ...
From: Greg Freemyer
Date: Tuesday, April 20, 2010 - 2:21 pm

Mark,

This is the patch implementing the new discard logic.

You did the benchmarking last year, but I thought you found calling
trim one contiguous sector range at a time was too inefficient.

See my highlight below:


Mark, unless I'm missing something, sb_issue_discard() above is going
to trigger a trim command for just the one range.  I thought the
benchmarks you did showed that a collection of ranges needed to be
built, then a single trim command invoked that trimmed that group of
ranges.

--

From: Mark Lord
Date: Tuesday, April 20, 2010 - 7:26 pm

..

Mmm.. If that's what it is doing, then this patch set would be a complete disaster.
It would take *hours* to do the initial TRIM.

Lukas ?
--

From: Eric Sandeen
Date: Tuesday, April 20, 2010 - 7:45 pm

I'm confused; do we have an interface to send a trim command for multiple ranges?

I didn't think so ...  Lukas' patch is finding free ranges (above a size threshold)
to discard; it's not doing it a block at a time, if that's the concern.

-Eric
--

From: Greg Freemyer
Date: Wednesday, April 21, 2010 - 11:59 am

Eric,

I don't know what kernel APIs have been created to support discard,
but the ATA8 draft spec. allows for specifying multiple ranges in one
trim command.

See section 7.10.3.1 and .2 of the latest draft spec.

Both talk about multiple trim ranges per trim command (think thousands
of ranges per command).

Recent hdparm versions accept a trim command argument that causes
multiple ranges to be trimmed per command.

 --trim-sector-ranges        Tell SSD firmware to discard unneeded
data sectors: lba:count ..
 --trim-sector-ranges-stdin  Same as above, but reads lba:count pairs from stdin

As I understand it, this is critical from a performance perspective
for the SSDs Mark tested with.  ie. He found a single trim command
with 1000 ranges takes much less time than 1000 discrete trim
commands.

Per Mark's comment's in wiper.sh, a trim command can have a minimum of
128KB of associated range information, so it is thousands of ranges
that can be discarded in a single command

ie. hdparm can accept extremely large lists of ranges on stdin, but it
parses the list into discrete trim commands with thousands of ranges
per command.

A kernel implementation which is trying to implement after that fact
discards as this patch is doing, also needs to somehow craft trim
commands with a large payload of ranges if it is going to be
efficient.

If the block layer cannot do this yet, then in my opinion this type of
batched discarding needs to stay in user space as done with Mark's
wiper.sh script and enhanced hdparm until the block layer grows that
ability.

Greg
--

From: Ric Wheeler
Date: Wednesday, April 21, 2010 - 12:04 pm

Greg,

We have full support for this in the "discard" support at the file system layer 
for several file systems.

The block layer effectively muxes the "discard" into the right target device 
command. TRIM for ATA, WRITE_SAME (with unmap) or UNMAP for SCSI...

If your favourite fs supports this, you can enable this feature with "-o 
discard" for fine grained discards,

ric
--

From: Jeff Moyer
Date: Wednesday, April 21, 2010 - 12:22 pm

Well, sb_issue_discard is what ext4 is using, and that takes a single


Thanks, it's worth pointing out that TRIM is not the only backend to the
discard API.  However, even if we do implement a vectored API, we can
translate that to dumber commands if a given spec doesn't support it.

Getting back to the problem...

From the file system, you want to discard discrete ranges of blocks.
The API to support this can either take care of the data integrity
guarantees by itself, or make the upper layer ensure that trim and write
do not pass each other.  The current implementation does the latter.  In
order to do the former, there is the potential for a lot of overhead to
be introduced into the block allocation layers for the file systems.

So, given the above, it is up to the file system to send down the
biggest discard requests it can in order to reduce the overhead of the
command.  If a vectored approach is made available, then that would be
even better.  Christoph, is this something that's on your radar?

Cheers,
Jeff
--

From: Greg Freemyer
Date: Wednesday, April 21, 2010 - 1:44 pm

Looking at the benchmarks (for the first time) at
http://people.redhat.com/jmoyer/discard/ext4_batched_discard/

I don't see anything that says how long the proposed trim ioctl takes
to complete on the full filesystem.

What they do show is that with the 3 test SSDs used for this
benchmark, the current released discard implementation is a net loss.
ie. You are better off running without the discards for all 3 vendors.
 (at least under the conditions tested.)

After the patch is applied and optimizing the discards to large free
extents only, it works out to same performance with or without the
discards.  ie. no net gain or loss.

That is extremely cool because one assumes that the non-discard case
would degrade over time, but that the discard case will not.

So that argues for the current proposed patch going in.

But quoting from the first email:

==
The basic idea behind my discard support is to create an ioctl which
walks through all the free extents in each allocating group and discard
those extents. As an addition to improve its performance one can specify
minimum free extent length, so ioctl will not bother with shorter extents.

This of course means, that with each invocation the ioctl must walk
through whole file system, checking and discarding free extents, which
is not very efficient. The best way to avoid this is to keep track of
deleted (freed) blocks. Then the ioctl have to trim just those free
extents which were recently freed.

In order to implement this I have added new bitmap into ext4_group_info
(bb_bitmap_deleted) which stores recently freed blocks. The ioctl then
walk through bb_bitmap_deleted, compare deleted extents with free
extents trim them and then removes it from the bb_bitmap_deleted.

But you may notice, that there is one problem. bb_bitmap_deleted does
not survive umount. To bypass the problem the first ioctl call have to
walk through whole file system trimming all free extents. But there is a
better solution to this problem. The ...
From: Greg Freemyer
Date: Wednesday, April 21, 2010 - 1:53 pm

correcting Christoph's email address - no other edits/comments




-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
   http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--

From: Eric Sandeen
Date: Wednesday, April 21, 2010 - 2:01 pm

All the other things aside, mount-time is interesting, but it's an
infrequent operation, at least in my world.  I think we need something
that can be done runtime.

For anything with uptime, I don't think it's acceptable to wait until
the next mount to trim unused blocks.

But as long as the mechanism can be called either at mount time and/or
kicked off runtime somehow, I'm happy.

-Eric
--

From: Ric Wheeler
Date: Wednesday, April 21, 2010 - 2:03 pm

That makes sense to me.  Most enterprise servers will go without 
remounting a file system for (hopefully!) a very long time.

It is really important to keep in mind that this is not just a laptop 
feature for laptop SSD's, this is also used by high end arrays and 
*could* be useful for virt IO, etc as well :-)

ric

--

From: Greg Freemyer
Date: Wednesday, April 21, 2010 - 2:47 pm

Adding James Bottomley because high-end scsi is entering the
discussion.  James, I have a couple scsi questions for you at the end.


I'm not arguing that a runtime solution is not needed.

I'm arguing that at least for SSD backed filesystems Mark's userspace
implementation shows how the mount time initialization of the runtime
bitmap can be accomplished in a few seconds by leveraging the hardware
and using vector'ed trims as opposed to having to build an additional
on-disk structure.

At least for SSDs, the primary purpose of the proposed on-disk
structure seems to be to overcome the current lack of a vector'ed
discard implementation.

If it is too difficult to implement a fully functional vector'ed
discard in the block layer due to locking issues, possibly a special
purpose version could be written that is only used at mount time when
one can be assured no other i/o is occurring to the filesystem.

James,

The ATA-8 spec. supports vectored trims and requires a minimum of 255
sectors worth of range payload be supported.  That equates to a single
trim being able to trim thousands of ranges in one command.

Mark Lord has benchmarked in found a vectored trim to be drastically
faster than calling trim individually for each of those ranges.

Does scsi support vector'ed discard? (ie. write-same commands)

Or are high-end scsi arrays so fast they can process tens of thousands
of discard commands in a reasonable amount of time, unlike the SSDs
have so far proven to do.

It would be interesting to find out that a SSD can discard thousands
of ranges drastically faster than a high-end scsi device can.  But if
true, that might argue for the on-disk bitmap to track previously
discarded blocks/extents.

Greg
--

From: James Bottomley
Date: Wednesday, April 21, 2010 - 2:56 pm

So what's wrong with using wiper.sh?  It can do online discard of


No ... they actually have two problems: firstly they can only use
discard ranges which align with their internal block size (usually
something huge like 3/4MB) and then a trim operation tends to be O(1)

I think SSDs and Arrays both have discard problems, arrays more to do
with the time and expense of the operation, SSDs because the TRIM
command isn't queued.

James


--

From: Mark Lord
Date: Wednesday, April 21, 2010 - 2:59 pm

..

Does anyone have an Intel-based SSD they could spare?  :)

Rumour has it they they do not comply with the ATA-8 spec for TRIM,
in that they don't support more than one "sector" of range data.
The SSDs I have here all support much more than that.

Still, even if we just do a single 512-byte payload of TRIM ranges,
it would be a huge win.

Cheers
--

From: Lukas Czerner
Date: Friday, April 23, 2010 - 1:23 am

Well, it strongly depends on how is the file system fragmented. On the
fresh file system (147G) the initial ioctl takes 2 seconds to finish (it 
may be worth to mention that on another SSD (111G) it takes 56s). I will
try to get some numbers for the "usual" file system (not full, not

I do not know much about how production system is being used, but I
doubt this is that big issue. Sure the initial ioctl takes long to
finish and there is a place for improvement, there was a proposal to
do the initial trim at mount time. I do not think that it is wise,
why to block mount, when the trim can be run at background when the fs
is mounted ? Of course there will be some performance loss, while ioctl
will be in progress, but it will not block.

There are also another way to overcome this problem. We can assure that
the file system is left trimmed after umount. To do this, we can simply
trim the fs at umount time. I think this won't be any problem and we even
do not prolong the umount time too much, because we will not trim whole
fs, but just recently freed blocks.

This of course bring another problem, when the system is not properly
terminated and the umount is not properly finished (or done at all). But
this can be solved in fsck at boot time I think. This will entirely
eliminate the need to trim the whole fs (except the fsck obviously),

Vectored trim will be great, I did not tested anything like that but
obviously it will strongly reduce time needed to trim the fs. But we do


And also, currently I am rewriting the patch do use rbtree instead of the
bitmap, because there were some concerns of memory consumption. It is a
question whether or not the rbtree will be more memory friendly.
Generally I think that in most "normal" cases it will, but there are some
extreme scenarios, where the rbtree will be much worse. Any comment on
this ?


Thanks.
-Lukas
From: Greg Freemyer
Date: Saturday, April 24, 2010 - 6:24 am

I know I've been arguing against this patch for the single SSD case
and I still think that use case should be handled by userspace as
hdparm/wiper.sh currently does.  In particular for those extreme
scenarios with JBOD SSDs, the user space solution wins because it
knows how to optimize the trim calls via vectorized ranges in the
payload.

Thus I think the community and distro's should be testing that pair
and pushing it out in the distro's for typical laptop use.

But, that still leaves high-end external raid arrays, mdraid, and lvm
unaddressed.

Those use cases will likely benefit from the approach this patch takes
the most.  In particular, mdraid with raid 5/6  requires an approach
like this patch provides, or it has to create its own in kernel
discard aggregator which seems like a waste of time.

In general, those use cases have large minimum discard units.  Thus I
think this patch should be tuned to work with large discard units and
ignore small ones.  That means it needs to get the underlying block
layer topology and ignore unused space smaller than underlying layers
minimum discard unit.  That should allow a rb tree to be used and
eliminate the extreme scenarios.  (ie. I assume your extreme scenarios
involve large numbers of very small unused ranges.)

That may mean the topology information needs to grow some discard
info.  Does anyone know if that info is easily derived from the
currently existing topo info?

Greg






-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
   http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--

From: Ric Wheeler
Date: Saturday, April 24, 2010 - 6:48 am

I think that you have missed the broader point. This is not on by 
default, so you can mount without discard and use whatever user space 
utility you like at your discretion.


--

From: Greg Freemyer
Date: Saturday, April 24, 2010 - 7:30 am

Ric,

I was trying to say the design should be driven by the large discard
range use case, not the potentially pathological small discard range
use case that would only benefit SSDs.

Greg
--

From: Eric Sandeen
Date: Saturday, April 24, 2010 - 7:43 am

Bear in mind that this patch makes the discard range requests substantially
-larger- than what mount -o discard does on ext4 today, in fact that was
a main goal.

If the kernel could assemble vectors of ranges and pass them down, I think it
could be extended to use that as well.

-Eric
--

From: Greg Freemyer
Date: Saturday, April 24, 2010 - 8:03 am

Eric/Ric,

I was responding to the Lukas latest post which stated:

==
And also, currently I am rewriting the patch do use rbtree instead of the
bitmap, because there were some concerns of memory consumption. It is a
question whether or not the rbtree will be more memory friendly.
Generally I think that in most "normal" cases it will, but there are some
extreme scenarios, where the rbtree will be much worse. Any comment on
this ?
==

If one optimizes for large discard ranges and ignores the pathological
cases only beneficial to SSDs, then a rbtree wins.

Correct?

Greg
--

From: Ric Wheeler
Date: Saturday, April 24, 2010 - 10:04 am

Let's summarize.

1. Everyone agrees that doing larger discard instead of little discards 
is a good thing.

2. Some devices care about this more than others (various types of SSD's 
and arrays have different designs and performance with discards). Some 
devices do small discards well, others don't.

3. How you get to those bigger discards in our implementation - using a 
series of single range requests, using vectored requests, tracking 
extents that can be combined in an rbtree or not - is something that we 
are working on. Using rbtrees versus a bitmap efficiency is about DRAM 
consumption, not performance of the resulting discard on the target.

4. Devices (some devices) can export their preferences in a standard way 
(look in /sys/block/....).

If you want to influence the code, please do try the various options on 
devices you have at hand and report results.  That is what we are doing 
(we includes Lukas, Eric, Jeff and others on this thread) will real 
devices from vendors that have given us access. We are talking to them 
directly and trying out different work loads but certainly welcome real 
world results and suggestions.

Thanks!

Ric


--

From: Greg Freemyer
Date: Saturday, April 24, 2010 - 11:30 am

On Sat, Apr 24, 2010 at 1:04 PM, Ric Wheeler <rwheeler@redhat.com> wrote:

Ric,

Is it also agreed by all that the current ext4 kernel implementation
of calling discard is a poor solution for most hardware / block layers
stacks / workloads and therefore is not worth retaining nor performing
further benchmarks?

I've not seen anyone arguing to keep the current kernel implementation
and I for one accept the previously posted benchmarks that show it is
not adding any value relative to the traditional non-discard case.

Therefore benchmarks between the current hdparm/wiper.sh userspace
implementation and a proposed new kernel implementation would be the
most beneficial?

I have yet to see any of those benchmarks posted.

Greg



-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
   http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--

From: Ric Wheeler
Date: Saturday, April 24, 2010 - 11:41 am

Greg,

I don't like the user space wiper.sh approach in general, but running 
wiper.sh is entirely your choice.

Most users prefer having the file system and the IO stack take care of 
this for them, but again, entirely configurable.

The benchmarks we have done are often done on hardware that is under NDA 
so we cannot publish results across the board.

Feel free to run on hardware that you buy and share the results.

Thanks!

Ric

--

From: Mark Lord
Date: Monday, April 26, 2010 - 7:00 am

On 24/04/10 02:41 PM, Ric Wheeler wrote:
..

I wrote wiper.sh, and I would prefer more of an in-kernel solution,
simply because it could potentially have the smarts to deal with RAID,
LVM, and btrfs's own duplicate implementation of RAID/LVM.

Those cannot be done from userspace on a mounted filesystem.

So.. again.. in an ideal kernel, I'd like to see use of larger ranges
(and _multiple_ ranges) per TRIM command.  And options for the kernel
to do it automatically (-o discard), as well as an ioctl() interface
for userspace to "scrub" (or "wipe") all free ranges in a gradual fashion.

Cheers

--

From: Martin K. Petersen
Date: Monday, April 26, 2010 - 7:42 am

>>>>> "Mark" == Mark Lord <kernel@teksavvy.com> writes:

Mark> So.. again.. in an ideal kernel, I'd like to see use of larger
Mark> ranges (and _multiple_ ranges) per TRIM command.  And options for
Mark> the kernel to do it automatically (-o discard), as well as an
Mark> ioctl() interface for userspace to "scrub" (or "wipe") all free
Mark> ranges in a gradual fashion.

Again: Discard splitting, merging, and coalescing is coming.  I'm
working on it.  It's not exactly trivial.

My objection was purely wrt. nuking realtime discard in favor of
scrubbing.  We absolutely need both approaches.

But I'm not going to let crappy SSD firmware implementations set the
direction for what I'm working on.  It is much more interesting to
ensure that we work well for the devices that'll be shipping 6-12 months
from now (at the time where this stuff will realistically end up in
distro kernels).

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Greg Freemyer
Date: Monday, April 26, 2010 - 8:27 am

On Mon, Apr 26, 2010 at 10:42 AM, Martin K. Petersen

Is this an agreed summary as relates to ext4:

1) Current "-o discard" causes real-time calls to discard.  Although
not optimized for current generation hardware, both block-layer
optimizations and new hardware are coming, so it needs to be kept.

2) Kernel based discard scrubbing - Doesn't currently exist in 2.6.34,
all agree that for the LVM, mdraid, btrfs cases it is needed and there
is no linux tool (kernel or userspace) at present.  The Lukas proposed
patch is userspace invoked so a mount option is not needed.

3) Userspace discard is available today and works well with ext4 on a
single JBOD SSD which will be the typical laptop use as one example.

Mark, or anyone, do you think it would be a bad idea for me to push
for Opensuse 11.3 (2.6.34 based) to use wiper.sh as a standard ext4
discard tool?  hdparm and wiper.sh are already in the distro, it just
needs a cron entry and maybe some basic management infrastructure.
They're at feature freeze for 11.3, so I don't know if I can get it in
or not.

If and when the suse team want to move to the kernel based scrubber,
they can.  ie. a year from now when the next release after 11.3 comes
out.

Greg
--

From: Lukas Czerner
Date: Monday, April 26, 2010 - 8:51 am

It is userspace invoked indeed, but once you set the nodiscard mount
option it will not do anything. It seems only logical to me to NOT

It (wiper) do not work on devices which does not support vectored trim

-Lukas
From: Mark Lord
Date: Tuesday, April 27, 2010 - 6:25 pm

On 26/04/10 11:27 AM, Greg Freemyer wrote:
..

I'd hold off for now.  Rumour has it that Intel SSDs are not 100% compliant
with TRIM -- they fail with large amounts of range data.

I don't have an Intel SSD to verify that or fix it with.

Cheers
--

From: Ric Wheeler
Date: Monday, April 26, 2010 - 8:48 am

And one thing to keep in mind is that we are likely to need both run time 
support for discard as well as occasional resync in bulk since the storage can 
choose to ignore the command and still be in spec.  Kind of like the unfortunate 
"defrag" thing that windows people do,

ric
--

From: Martin K. Petersen
Date: Saturday, April 24, 2010 - 12:06 pm

>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:

Greg> Is it also agreed by all that the current ext4 kernel
Greg> implementation of calling discard is a poor solution for most
Greg> hardware / block layers stacks / workloads and therefore is not
Greg> worth retaining nor performing further benchmarks?

Our discard implementation is meant to accommodate a wide range of
devices.  Just because some of the currently shipping low-end consumer
SSDs implement TRIM poorly does not mean we're going to scrap what we
have.

We are not in the business of designing for the past.  Especially not
when the past can be handled by a shell script.

For future devices TRIM/UNMAP is going to be an inherent part of the
command set.  And that's what our kernel support is aimed at.  There are
some deficiencies right now because the block layer was not built to
handle what is essentially a hybrid between a filesystem and a blk_pc
type request.  I'm working on fixing that.


Greg> I've not seen anyone arguing to keep the current kernel
Greg> implementation and I for one accept the previously posted
Greg> benchmarks that show it is not adding any value relative to the
Greg> traditional non-discard case.

For enterprise storage the cost of sending discards is limited to the
overhead of sending the command.  I.e. negligible.

Eventually that's going to be the case for ATA devices as well.  And let
me just reiterate that Windows 7 does issue TRIM like we do (at
runtime).  And consequently Windows 7 will help weed out the crap SSD
implementations from the market.  That's already happening.

There is also work underway to make TRIM a queueable command which would
further alleviate the situation.


Greg> Therefore benchmarks between the current hdparm/wiper.sh userspace
Greg> implementation and a proposed new kernel implementation would be
Greg> the most beneficial?

I don't know what you mean by "new" kernel implementation.  We're
working on tweaking what we have so that we can ...
From: Mark Lord
Date: Monday, April 26, 2010 - 7:03 am

On 24/04/10 03:06 PM, Martin K. Petersen wrote:
..

The current implementation works extremely poorly for the singlemost
common style of hardware that's out there.  Millions and millions
of SATA SSDs, and they're becoming more and more ubiquitous.

A "solution" that ignores reality isn't very helpful.
..

We are in the business of supporting the hardware that people run Linux on.
Today, and tomorrow, and the next few years, that means SATA SSDs by the gazillions,
as well as a relatively smaller number of enterprise behemoths.

A shell script cannot currently deal with LVM, RAID, or btrfs filesystems.

Cheers
--

From: Martin K. Petersen
Date: Saturday, April 24, 2010 - 11:39 am

>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:

Greg> That may mean the topology information needs to grow some discard
Greg> info.  Does anyone know if that info is easily derived from the
Greg> currently existing topo info?

The values are already part of the topology information (assuming the
device reports them).  Went into 2.6.33.

-- 
Martin K. Petersen	Oracle Linux Engineering
--

From: Jan Kara
Date: Monday, April 26, 2010 - 9:55 am

I see two possible improvements here:
a) At a cost of some code complexity, you can bound the worst case by combining
RB-trees with bitmaps. The basic idea is that when space to TRIM gets too
fragmented (memory to keep to-TRIM blocks in RB-tree for a given group exceeds
the memory needed to keep it in a bitmap), you convert RB-tree for a
problematic group to a bitmap and attach it to an appropriate RB-node. If you
track with a bitmap also a number of to-TRIM extents in the bitmap, you can
also decide whether it's benefitial to switch back to an RB-tree.

b) Another idea might be: When to-TRIM space is fragmented (again, let's say
in some block group), there's not much point in sending tiny trim commands
anyway (at least that's what I've understood from this discussion). So you
might as well stop maintaining information which blocks we need to trim
for that group. When the situation gets better, you can always walk block
bitmap and issue trim commands. You might even trigger this rescan from
kernel - if you'd maintain number of free block extents for each block group
(which is rather easy), you could trigger the bitmap rescan and trim as soon
as ratio number of free blocks / number of extents gets above a reasonable
threshold.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
--

From: Lukas Czerner
Date: Monday, April 26, 2010 - 10:46 am

This sounds like a good idea, but I wonder if it is worth it :
 1. The tree will have very short life, because with next ioctl all
 stored deleted extents will be trimmed and removed from the tree.
 2. Also note, that the longer it lives the less fragmented it possibly
 became.
 3. I do not expect, that deleted ranges can be too fragmented, and
 even if it is, it will be probably merged into one big extent very

In what I am preparing now, I simple ignore small extents, which would
be created by splitting the deleted extent into smaller pieces by chunks
of used blocks. This, in my opinion, will prevent the fragmentation,
which otherwise may occur in the longer term (between ioctl calls).

Thanks for suggestions.
-Lukas
--

From: Ric Wheeler
Date: Monday, April 26, 2010 - 10:52 am

I am not convinced that ignoring small extents is a good idea. Remember that for 
SSD's specifically, they remap *everything* internally so our "fragmentation" 
set of small spaces could be useful for them.

That does not mean that we should not try to send larger requests down to the 
target device which is always a good idea I think :-)

ric

--

From: Lukas Czerner
Date: Monday, April 26, 2010 - 11:14 am

That's right, so the other approach would be probably better. Merge
small extents together into one, but there must be some limit, because I
do not want two little extents at the beginning and the end of the group
to force trimming whole group. The whole rbtree thing gets a little
complicated :)

-Lukas
--

From: Jeff Moyer
Date: Monday, April 26, 2010 - 11:28 am

This discussion is getting a bit too abstract for me.  Show us the code
and we can make some progress.  =)

On the topic of discarding small blocks, I agree with Ric, it should be
done.

Cheers,
Jeff
--

From: Lukas Czerner
Date: Monday, April 26, 2010 - 11:38 am

Create an ioctl which walks through all the free extents in each
allocating group and discard those extents. As an addition to improve
its performance one can specify minimum free extent length, so ioctl
will not bother with shorter extents.

This of course means, that with each invocation the ioctl must walk
through whole file system, checking and discarding free extents, which
is not very efficient. The best way to avoid this is to keep track of
deleted (freed) blocks. Then the ioctl have to trim just those free
extents which were recently freed.

In order to implement this I have created new structure
ext4_deleted_data which represents deleted extent in per-group rbtree.
When blocks are freed, extents are stored in the tree (possibly merged
with other extents). The ioctl then can walk through the tree, take out
extents from the tree, compare them with the free extents and possibly
trim them.

Note that there is support for setting minimum extent length in ioctl
call, so it will not bother with shorter extents. Also, when the
previously deleted range is taken from the tree and it is not entirely
free, the free fragments are discarded and extents shorter than minlen
are NOT returned back to the tree to avoid fragmentation of the tree
which could lead to the big memory consumption.

But you may notice, that there is one problem. bb_bitmap_deleted does
not survive umount. To bypass the problem the first ioctl call have to
walk through whole file system trimming all free extents.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/ext4.h    |    5 +
 fs/ext4/mballoc.c |  321 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/mballoc.h |    9 ++
 fs/ext4/super.c   |    1 +
 4 files changed, 325 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf938cf..41fe9ec 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1437,6 +1437,8 @@ extern int ext4_mb_add_groupinfo(struct super_block *sb,
 extern int ...
From: Lukas Czerner
Date: Monday, April 26, 2010 - 11:42 am

For now it just ignores the small extents to avoid fragmentation. As I
said before, I agree that they should not be ignored, I just need to
figure out the way to do this efficiently. 

Also it was not properly tested yet.


-Lukas.
--

From: Edward Shishkin
Date: Tuesday, April 27, 2010 - 8:29 am

I suggest to not start with optimisations: let's first take a look
what is going on in the simple case. Benchmarks are our friends..


--

From: Greg Freemyer
Date: Wednesday, April 21, 2010 - 1:52 pm

correcting Christoph's email address - no other edits/comments




-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
   http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--

From: Greg Freemyer
Date: Monday, April 19, 2010 - 9:20 am

Adding Mark Lord in cc.

He wrote a preliminary discard solution last summer.  I'm not sure how
it has progressed.

Mark, you can find the 2 patches at:

http://patchwork.ozlabs.org/patch/50441/
http://patchwork.ozlabs.org/patch/50442/

Greg

--

From: Eric Sandeen
Date: Monday, April 19, 2010 - 9:30 am

The difference here is that Mark's stuff wasn't as tightly integrated
with the kernel, IIRC.  What I saw was more at a user level - make a big
file, map it, discard all the blocks, unlink the file.

It was a good first step, but I think we can do a lot better by using
fs-specific calls to be efficient & targeted about the discards.

Christoph has a similar approach for XFS, FWIW.


--

From: Greg Freemyer
Date: Monday, April 19, 2010 - 10:58 am

I haven't looked closely at this patch, but I recall Mark consolidated
numerous discontinuous trim/discard/unmap ranges into a single command
to the SSD drive.

That was why he felt he was getting superior performance.  ie. There
was an overhead per command to the drive that was eliminated if a
single more complex command with multiple ranges went to the SSD
drive.

But he's the one that did the work and the benchmarking, so I'll let
him take it from here, especially if I mis-understood what he was
doing.

Greg
--

From: Ric Wheeler
Date: Monday, April 19, 2010 - 11:04 am

This work certainly builds on Mark's early results which clearly showed that 
several devices see a win from doing larger/batched discards instead of the fine 
grained ones.

Thanks!

Ric

--

From: Mark Lord
Date: Tuesday, April 20, 2010 - 1:24 pm

..

Perfect.  I proposed exactly this last year, but never found time to work it further.
Please proceed with it!

..

That's not necessary.  It is a simple matter to TRIM a clean filesystem
before it is mounted R/W during the boot sequence.  No new information
is required for that operation.

My wiper.sh script already shows walking the freelists and trimming all
available space, something which takes only a couple of seconds on a 120GB
filesystem that has not been kept up-to-date with on-the-fly trim.

Cheers
--

From: Mark Lord
Date: Tuesday, April 20, 2010 - 1:34 pm

..

Ahh. here it is/was:

--

Previous thread: Introducing Next3 - built-in snapshots support for Ext3 by Amir G. on Sunday, April 18, 2010 - 8:41 am. (19 messages)

Next thread: Re: [PATCH 13/35] fallthru: ext2 fallthru support by Jan Blunck on Monday, April 19, 2010 - 5:40 am. (5 messages)