This patchset fixes some problems with the ENOSPC code for block groups, but
more importantly it fixes a huge ENOSPC regression that has occurred. With my
fs_mark test

fs_mark -d /mnt/btrfs-test -D 512 -t 16 -n 4096 -F -S0

on a 2gb fs without these patches, fs_mark would exit with ENOSPC after writing
around 50mb. With these patches I can now fill up the disk.

Also, the new ENOSPC code is super aggressive about allocating metadata chunks,
to the point that even with the multi-writer regression fixed I was still only
able to fill about 900mb with data on a 2gb fs. With all of these patches I can
fill up the 2gb fs with about 1.9gb of data. This is much more reasonable.

There doesn't appear to be any performance regression, but I would appreciate
testing to make sure this is actually the case. Thanks,
Josef
--
With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC. So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing. The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves. This also moves all of the retrying and
flushing code into reserve_metadata_bytes so it's all uniform. This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
---
fs/btrfs/ctree.h | 4 +-
fs/btrfs/extent-tree.c | 230 ++++++++++++++++++++++++++++++++----------------
fs/btrfs/relocation.c | 14 +---
fs/btrfs/transaction.c | 7 +-
4 files changed, 160 insertions(+), 95 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 72f5e1a..9e923c1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2144,7 +2144,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
int btrfs_trans_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
- int num_items, int *retries);
+ int num_items);
void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
@@ -2165,7 +2165,7 @@ void btrfs_add_durable_block_rsv(struct btrfs_fs_info *fs_info,
int btrfs_block_rsv_add(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct btrfs_block_rsv *block_rsv,
- u64 ...
With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC. So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing. The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves. This also moves all of the retrying and
flushing code into reserve_metadata_bytes so it's all uniform. This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
---
V1->V2:
-Don't allocate chunks, it takes more metadata and can cause problems
-only flush if we say so, this keeps us from deadlocking with the tree lock and
such
-hold our reservation regardless of whether we are overcommitted or not, just
adjust how much we need to reclaim to succeed.
-always sync write out delalloc, we can't reclaim unless the io completes anyway
fs/btrfs/ctree.h | 4 +-
fs/btrfs/extent-tree.c | 238 ++++++++++++++++++++++++++----------------------
fs/btrfs/relocation.c | 14 +--
fs/btrfs/transaction.c | 7 +-
4 files changed, 136 insertions(+), 127 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 72f5e1a..9e923c1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2144,7 +2144,7 @@ int btrfs_check_data_free_space(struct inode *inode, u64 bytes);
void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
int btrfs_trans_reserve_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
- int num_items, int *retries);
+ int num_items);
void ...
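To make the race in the commit message above concrete, here is a minimal
userspace sketch of the two orderings. It is a toy model, not the kernel code:
struct toy_space, toy_flush(), toy_rival() and both reserve functions are
hypothetical names invented for illustration, and a single "rival" amount
stands in for the other writer threads that show up during the flush.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy accounting, not the kernel's struct btrfs_space_info. */
struct toy_space {
	uint64_t total;       /* bytes that can be promised to writers */
	uint64_t reserved;    /* bytes currently promised */
	uint64_t reclaimable; /* reserved bytes a delalloc flush can give back */
};

static uint64_t toy_min(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Flushing returns up to 'want' bytes of reserved space to the pool. */
static void toy_flush(struct toy_space *s, uint64_t want)
{
	uint64_t got = toy_min(want, s->reclaimable);

	s->reclaimable -= got;
	s->reserved -= got;
}

/* A second writer grabs up to 'rival' bytes of whatever is free right now. */
static void toy_rival(struct toy_space *s, uint64_t rival)
{
	uint64_t free_now = s->reserved < s->total ? s->total - s->reserved : 0;

	s->reserved += toy_min(rival, free_now);
}

/* Old ordering: flush, let the race window pass, then check for space. */
static bool toy_flush_then_reserve(struct toy_space *s, uint64_t bytes,
				   uint64_t rival)
{
	if (s->reserved + bytes > s->total) {
		toy_flush(s, bytes);
		toy_rival(s, rival);
	}
	if (s->reserved + bytes > s->total)
		return false; /* the freed space is gone: early ENOSPC */
	s->reserved += bytes;
	return true;
}

/* New ordering: claim the bytes first, then flush only the overcommit. */
static bool toy_reserve_then_flush(struct toy_space *s, uint64_t bytes,
				   uint64_t rival)
{
	/* Enough space, or no free space at all: nothing new to do here. */
	if (s->reserved + bytes <= s->total || s->reserved >= s->total)
		return toy_flush_then_reserve(s, bytes, rival);

	s->reserved += bytes; /* hold the reservation across the flush */
	toy_flush(s, s->reserved - s->total);
	toy_rival(s, rival);
	if (s->reserved <= s->total)
		return true;
	s->reserved -= bytes; /* could not reclaim enough, give it back */
	return false;
}
```

With total = 100, reserved = 90, reclaimable = 30, a request for 20 bytes and a
rival that wants 20 bytes, the old ordering fails because the rival consumes
what the flush freed, while the new ordering succeeds because the request is
already counted in reserved when the rival looks for free space.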
Currently we try to flush delalloc, but only in a fairly weak way. That works
fine in most cases, but under heavy pressure we need to be able to wait for
the flushing to happen. Also, instead of checking the bytes reserved in the
block_rsv, check the space info since it is more accurate. The sync option
will be used in a future patch.
Signed-off-by: Josef Bacik <josef@redhat.com>
---
fs/btrfs/ctree.h | 3 ++-
fs/btrfs/extent-tree.c | 26 ++++++++++++++------------
fs/btrfs/inode.c | 8 ++++++--
3 files changed, 22 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4833a01..72f5e1a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2439,7 +2439,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
u32 min_type);
int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
-int btrfs_start_one_delalloc_inode(struct btrfs_root *root, int delay_iput);
+int btrfs_start_one_delalloc_inode(struct btrfs_root *root, int delay_iput,
+ int sync);
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
struct extent_state **cached_state);
int btrfs_writepages(struct address_space *mapping,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d532f00..14a52dd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3342,9 +3342,10 @@ static int maybe_allocate_chunk(struct btrfs_trans_handle *trans,
* shrink metadata reservation for delalloc
*/
static int shrink_delalloc(struct btrfs_trans_handle *trans,
- struct btrfs_root *root, u64 to_reclaim)
+ struct btrfs_root *root, u64 to_reclaim, int sync)
{
struct btrfs_block_rsv *block_rsv;
+ struct btrfs_space_info *space_info;
u64 reserved;
u64 max_reclaim;
u64 reclaimed = 0;
@@ -3353,9 +3354,10 @@ static int shrink_delalloc(struct btrfs_trans_handle *trans,
int ret;
block_rsv = ...
Currently we try to flush delalloc, but only in a fairly weak way. That works
fine in most cases, but under heavy pressure we need to be able to wait for
the flushing to happen. Also, instead of checking the bytes reserved in the
block_rsv, check the space info since it is more accurate. The sync option
will be used in a future patch.
Signed-off-by: Josef Bacik <josef@redhat.com>
---
V1->V2: fix how we counted reclaimed, and do btrfs_wait_ordered_range if we're
syncing the file since compression does weird things with writeback.
fs/btrfs/ctree.h | 3 ++-
fs/btrfs/extent-tree.c | 26 ++++++++++++++------------
fs/btrfs/inode.c | 24 ++++++++++++++++++++++--
3 files changed, 38 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4833a01..72f5e1a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2439,7 +2439,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
u32 min_type);
int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
-int btrfs_start_one_delalloc_inode(struct btrfs_root *root, int delay_iput);
+int btrfs_start_one_delalloc_inode(struct btrfs_root *root, int delay_iput,
+ int sync);
int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
struct extent_state **cached_state);
int btrfs_writepages(struct address_space *mapping,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 080be22..e25525f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3342,9 +3342,10 @@ static int maybe_allocate_chunk(struct btrfs_trans_handle *trans,
* shrink metadata reservation for delalloc
*/
static int shrink_delalloc(struct btrfs_trans_handle *trans,
- struct btrfs_root *root, u64 to_reclaim)
+ struct btrfs_root *root, u64 to_reclaim, int sync)
{
struct btrfs_block_rsv *block_rsv;
+ struct btrfs_space_info *space_info;
u64 reserved;
u64 max_reclaim;
u64 ...
Because the ENOSPC code over-reserves super aggressively, we end up allocating
chunks way more often than we should. For example, with my fs_mark tests on a
2gb fs I can end up with 1gb reserved just for metadata when only 34mb of that
is being used. So instead check whether the amount of space actually used is
less than 30% of the total space, and if so don't allocate a chunk.
Signed-off-by: Josef Bacik <josef@redhat.com>
---
fs/btrfs/extent-tree.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 14a52dd..265d8e0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3224,7 +3224,8 @@ static void force_metadata_allocation(struct btrfs_fs_info *info)
rcu_read_unlock();
}
-static int should_alloc_chunk(struct btrfs_space_info *sinfo,
+static int should_alloc_chunk(struct btrfs_fs_info *info,
+ struct btrfs_space_info *sinfo,
u64 alloc_bytes)
{
u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
@@ -3237,6 +3238,9 @@ static int should_alloc_chunk(struct btrfs_space_info *sinfo,
alloc_bytes < div_factor(num_bytes, 8))
return 0;
+ if (sinfo->bytes_used < div_factor(num_bytes, 3))
+ return 0;
+
return 1;
}
@@ -3268,7 +3272,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
goto out;
}
- if (!force && !should_alloc_chunk(space_info, alloc_bytes)) {
+ if (!force && !should_alloc_chunk(fs_info, space_info, alloc_bytes)) {
spin_unlock(&space_info->lock);
goto out;
}
@@ -3317,7 +3321,8 @@ static int maybe_allocate_chunk(struct btrfs_trans_handle *trans,
return 0;
spin_lock(&sinfo->lock);
- ret = should_alloc_chunk(sinfo, num_bytes + 2 * 1024 * 1024);
+ ret = should_alloc_chunk(root->fs_info, sinfo,
+ num_bytes + 2 * 1024 * 1024);
spin_unlock(&sinfo->lock);
if (!ret)
return 0;
--
1.6.6.1
--
Self-NAK on this one, it seems to cause a few problems with -m single and
smaller fs's, so just drop it. It only creates too much overhead on really
small fs's anyway, and if you no likey that overhead, use mixed block groups :).
Thanks,
Josef
--
Because the ENOSPC code over-reserves super aggressively, we end up allocating
chunks way more often than we should. For example, with my fs_mark tests on a
2gb fs I can end up with 1gb reserved just for metadata when only 34mb of that
is being used. So instead check whether the amount of space actually used is
less than 30% of the total space, and if so don't allocate a chunk, but only if
we have at least 256mb of free space to make sure we don't put too much pressure
on free space.
Signed-off-by: Josef Bacik <josef@redhat.com>
---
V1->V2: Clean up should_alloc_chunk so it doesn't take fs_info; I was using it
before for something different and forgot to clean it up. Also added a 256mb
free floor at Chris's suggestion to help with the -m single case.
fs/btrfs/extent-tree.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 14a52dd..eac11b1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3224,8 +3224,7 @@ static void force_metadata_allocation(struct btrfs_fs_info *info)
rcu_read_unlock();
}
-static int should_alloc_chunk(struct btrfs_space_info *sinfo,
- u64 alloc_bytes)
+static int should_alloc_chunk(struct btrfs_space_info *sinfo, u64 alloc_bytes)
{
u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;
@@ -3237,6 +3236,10 @@ static int should_alloc_chunk(struct btrfs_space_info *sinfo,
alloc_bytes < div_factor(num_bytes, 8))
return 0;
+ if (num_bytes > 256 * 1024 * 1024 &&
+ sinfo->bytes_used < div_factor(num_bytes, 3))
+ return 0;
+
return 1;
}
--
1.6.6.1
--
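For anyone wanting to check the arithmetic, the new V2 test can be sketched in
userspace as below. This only models the hunk added by the patch, not the rest
of should_alloc_chunk(). It assumes div_factor(num, f) computes num * f / 10,
which is what the "30%" in the changelog implies; chunk_model_info,
chunk_model_div_factor() and chunk_model_skip_alloc() are hypothetical names.
Note that the diff compares num_bytes (total minus read-only) against the
256mb floor.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_MODEL_FLOOR (256ULL * 1024 * 1024)

/* Hypothetical subset of the space_info fields the new check reads. */
struct chunk_model_info {
	uint64_t total_bytes;
	uint64_t bytes_readonly;
	uint64_t bytes_used;
};

/* Assumed semantics of div_factor(): tenths of 'num'. */
static uint64_t chunk_model_div_factor(uint64_t num, int factor)
{
	return num * (uint64_t)factor / 10;
}

/*
 * Returns true when the check added in V2 would refuse a new chunk:
 * the space info is larger than the floor and less than 30% of it is used.
 */
static bool chunk_model_skip_alloc(const struct chunk_model_info *sinfo)
{
	uint64_t num_bytes = sinfo->total_bytes - sinfo->bytes_readonly;

	return num_bytes > CHUNK_MODEL_FLOOR &&
	       sinfo->bytes_used < chunk_model_div_factor(num_bytes, 3);
}
```

Plugging in the numbers from the changelog (1gb of metadata space, 34mb used)
the check refuses the allocation, since 34mb is far below 30% of 1gb. A 128mb
space info never trips the check because it is under the floor.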
The global reservation stuff tries to add together DATA and METADATA used in
order to figure out how much to reserve for everything, but this doesn't work
right for mixed block groups. Instead if we have mixed block groups just set
data used to 0. Also with mixed block groups we will use bytes_may_use for
keeping track of delalloc bytes, so we need to take that into account in our
reservation calculations.
Signed-off-by: Josef Bacik <josef@redhat.com>
---
fs/btrfs/extent-tree.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 72c3d5f..d532f00 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3444,7 +3444,8 @@ static int reserve_metadata_bytes(struct btrfs_block_rsv *block_rsv,
spin_lock(&space_info->lock);
unused = space_info->bytes_used + space_info->bytes_reserved +
- space_info->bytes_pinned + space_info->bytes_readonly;
+ space_info->bytes_pinned + space_info->bytes_readonly +
+ space_info->bytes_may_use;
if (unused < space_info->total_bytes)
unused = space_info->total_bytes - unused;
@@ -3738,6 +3739,8 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
sinfo = __find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
spin_lock(&sinfo->lock);
+ if (sinfo->flags & BTRFS_BLOCK_GROUP_DATA)
+ data_used = 0;
meta_used = sinfo->bytes_used;
spin_unlock(&sinfo->lock);
@@ -3765,7 +3768,8 @@ static void update_global_block_rsv(struct btrfs_fs_info *fs_info)
block_rsv->size = num_bytes;
num_bytes = sinfo->bytes_used + sinfo->bytes_pinned +
- sinfo->bytes_reserved + sinfo->bytes_readonly;
+ sinfo->bytes_reserved + sinfo->bytes_readonly +
+ sinfo->bytes_may_use;
if (sinfo->total_bytes > num_bytes) {
num_bytes = sinfo->total_bytes - num_bytes;
--
1.6.6.1
--
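The effect of adding bytes_may_use to the sums above can be sketched as
follows. This is a userspace model, not the kernel function: rsv_model_info
and rsv_model_unused() are hypothetical names, the fields mirror the counters
named in the diff, and returning 0 when the counters meet or exceed
total_bytes is an assumption because that branch is not visible in the hunk.

```c
#include <stdint.h>

/* Hypothetical mirror of the counters summed in the diff. */
struct rsv_model_info {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_readonly;
	uint64_t bytes_may_use; /* delalloc bytes with mixed block groups */
};

/* Space still unaccounted for once delalloc (bytes_may_use) is included. */
static uint64_t rsv_model_unused(const struct rsv_model_info *s)
{
	uint64_t committed = s->bytes_used + s->bytes_reserved +
			     s->bytes_pinned + s->bytes_readonly +
			     s->bytes_may_use;

	if (committed < s->total_bytes)
		return s->total_bytes - committed;
	return 0; /* assumed: nothing left when the counters cover the total */
}
```

For example, with total_bytes = 1000, 600 bytes spread over the first four
counters and bytes_may_use = 300, the model reports 100 unused bytes; leaving
bytes_may_use out of the sum would have reported 400.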
Care to add the test to xfstests? We already have a _scratch_mkfs_sized helper
to create filesystems of a specific size for ENOSPC tests.
--
Yup, I will do that first thing on Monday. Thanks,
Josef
--
