fbc->count is of type s64. The change was introduced by
0216bfcffe424a5473daa4da47440881b36c1f4 which changed the type
from long to s64. Moving to s64 also means on 32 bit architectures
we can get wrong values on fbc->count. Since fbc->count is read
more frequently and updated rarely use seqlocks. This should
reduce the impact of locking in the read path for 32bit arch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-kernel@vger.kernel.org
---
include/linux/percpu_counter.h | 28 ++++++++++++++++++++++++----
lib/percpu_counter.c | 20 ++++++++++----------
2 files changed, 34 insertions(+), 14 deletions(-)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 9007ccd..1b711a1 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -6,7 +6,7 @@
* WARNING: these things are HUGE. 4 kbytes per counter on 32-way P4.
*/
-#include <linux/spinlock.h>
+#include <linux/seqlock.h>
#include <linux/smp.h>
#include <linux/list.h>
#include <linux/threads.h>
@@ -16,7 +16,7 @@
#ifdef CONFIG_SMP
struct percpu_counter {
- spinlock_t lock;
+ seqlock_t lock;
s64 count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
@@ -53,10 +53,30 @@ static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
return __percpu_counter_sum(fbc);
}
-static inline s64 percpu_counter_read(struct percpu_counter *fbc)
+#if BITS_PER_LONG == 64
+static inline s64 fbc_count(struct percpu_counter *fbc)
{
return fbc->count;
}
+#else
+/* doesn't have atomic 64 bit operation */
+static inline s64 fbc_count(struct percpu_counter *fbc)
+{
+ s64 ret;
+ unsigned seq;
+ do {
+ seq = read_seqbegin(&fbc->lock);
+ ret = fbc->count;
+ } while (read_seqretry(&fbc->lock, seq));
+ return ret;
+
+}
+#endif
+
+static inline s64 ...
This patch adds dirty block accounting using percpu_counters.
Delayed allocation block reservation is now done by updating
the dirty block counter. In a later patch we switch to non-delalloc
mode if the filesystem's free block count is less than
150% of the total filesystem dirty blocks.
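The accounting described above can be sketched in plain C as follows. This is only a simplified illustration, not the patch itself: struct blk_counters, claim_blocks() and allocate_claimed() are hypothetical names, ordinary integers stand in for the percpu_counters, and the watermark handling is left out.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the two per-filesystem counters. */
struct blk_counters {
	int64_t free_blocks;   /* blocks not yet allocated on disk */
	int64_t dirty_blocks;  /* blocks reserved by delayed allocation */
};

/*
 * Reserve nblocks for delayed allocation. Space already promised to
 * other delalloc writers (dirty_blocks) counts against free_blocks.
 * Returns 0 on success or -ENOSPC.
 */
static int claim_blocks(struct blk_counters *c, int64_t root_blocks,
			int64_t nblocks)
{
	if (c->free_blocks < root_blocks + nblocks + c->dirty_blocks)
		return -ENOSPC;
	c->dirty_blocks += nblocks;
	return 0;
}

/* When the allocator really allocates, both counters go down. */
static void allocate_claimed(struct blk_counters *c, int64_t nblocks)
{
	c->free_blocks -= nblocks;
	c->dirty_blocks -= nblocks;
}
```

With 100 free blocks, a claim of 60 succeeds and a following claim of 50 fails, even though free_blocks itself has not changed yet.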
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
fs/ext4/balloc.c | 59 +++++++++++++++++++++++++++++++++-------------------
fs/ext4/ext4_sb.h | 1 +
fs/ext4/inode.c | 22 +++++++++---------
fs/ext4/mballoc.c | 17 ++------------
fs/ext4/super.c | 8 ++++++-
5 files changed, 59 insertions(+), 48 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 5767332..b19346a 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1605,26 +1605,38 @@ ext4_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
ext4_fsblk_t nblocks)
{
- s64 free_blocks;
+ s64 free_blocks, dirty_blocks;
ext4_fsblk_t root_blocks = 0;
struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
+ struct percpu_counter *dbc = &sbi->s_dirtyblocks_counter;
- free_blocks = percpu_counter_read(fbc);
+ free_blocks = percpu_counter_read_positive(fbc);
+ dirty_blocks = percpu_counter_read_positive(dbc);
if (!capable(CAP_SYS_RESOURCE) &&
sbi->s_resuid != current->fsuid &&
(sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
root_blocks = ext4_r_blocks_count(sbi->s_es);
- if (free_blocks - (nblocks + root_blocks) < EXT4_FREEBLOCKS_WATERMARK)
- free_blocks = percpu_counter_sum(&sbi->s_freeblocks_counter);
-
- if (free_blocks < (root_blocks + nblocks))
+ if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
+ EXT4_FREEBLOCKS_WATERMARK) {
+ free_blocks = percpu_counter_sum(fbc);
+ dirty_blocks = percpu_counter_sum(dbc);
+ if (dirty_blocks < 0) {
+ printk(KERN_CRIT "Dirty block accounting "
+ "went wrong %lld\n",
+ dirty_blocks);
+ }
+ }
+ /* Check whether we have ...
This makes the meta-data reservation logic simpler.
After each block allocation request, if we have allocated
some meta-data blocks, subtract them from the reserved
meta-data blocks. If the total reserved data blocks after
allocation is zero, free the remaining reserved meta-data
blocks. During reservation, if the total reserved blocks
need more meta-data blocks, add the extra meta-data blocks
needed to the reserved meta-data blocks.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
fs/ext4/inode.c | 75 +++++++++++++++++++++++++++----------------------------
1 files changed, 37 insertions(+), 38 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a45121f..3ef0822 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1019,31 +1019,34 @@ static int ext4_calc_metadata_amount(struct inode *inode, int blocks)
static void ext4_da_update_reserve_space(struct inode *inode, int used)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- int total, mdb, mdb_free;
spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
- /* recalculate the number of metablocks still need to be reserved */
- total = EXT4_I(inode)->i_reserved_data_blocks - used;
- mdb = ext4_calc_metadata_amount(inode, total);
-
- /* figure out how many metablocks to release */
- BUG_ON(mdb > EXT4_I(inode)->i_reserved_meta_blocks);
- mdb_free = EXT4_I(inode)->i_reserved_meta_blocks - mdb;
-
- if (mdb_free) {
- /* Account for allocated meta_blocks */
- mdb_free -= EXT4_I(inode)->i_allocated_meta_blocks;
-
- /* update fs dirty blocks counter */
- percpu_counter_sub(&sbi->s_dirtyblocks_counter, mdb_free);
+ if (EXT4_I(inode)->i_allocated_meta_blocks) {
+ /* update the reseved meta- blocks */
+ BUG_ON(EXT4_I(inode)->i_allocated_meta_blocks >
+ EXT4_I(inode)->i_reserved_meta_blocks);
+ EXT4_I(inode)->i_reserved_meta_blocks -=
+ EXT4_I(inode)->i_allocated_meta_blocks;
EXT4_I(inode)->i_allocated_meta_blocks = ...
Otherwise we skip group 0 during block allocation.
This causes ENOSPC even if we have free blocks in
group 0. This should be merged with defrag. The
expected_group changes are introduced by the defrag patches.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
fs/ext4/balloc.c | 1 +
fs/ext4/extents.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index b19346a..53fdb05 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -2023,6 +2023,7 @@ static ext4_fsblk_t do_blk_alloc(handle_t *handle, struct inode *inode,
ar.goal = goal;
ar.len = *count;
ar.logical = iblock;
+ ar.excepted_group = -1;
if (S_ISREG(inode->i_mode) && !(flags & EXT4_META_BLOCK))
/* enable in-core preallocation for data block allocation */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index bf612a7..268e96d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2879,6 +2879,7 @@ int ext4_ext_get_blocks(handle_t *handle, struct inode *inode,
ar.goal = ext4_ext_find_goal(inode, path, iblock);
ar.logical = iblock;
ar.len = allocated;
+ ar.excepted_group = -1;
if (S_ISREG(inode->i_mode))
ar.flags = EXT4_MB_HINT_DATA;
else
--
1.6.0.1.90.g27a6e
--
This patch converts some uses of ext4_fsblk_t to s64.
This is needed so that the signed comparisons in some
if conditions work as expected.
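A small userspace sketch of the problem, assuming ext4_fsblk_t behaves like an unsigned 64-bit type. The function names and the WATERMARK value are made up for illustration; WATERMARK merely stands in for EXT4_FREEBLOCKS_WATERMARK.

```c
#include <stdint.h>

#define WATERMARK 100	/* hypothetical stand-in for EXT4_FREEBLOCKS_WATERMARK */

/*
 * Mixed types: s64 - u64 is evaluated as u64, so a shortfall wraps
 * around to a huge positive number and the check never fires.
 */
static int near_full_unsigned(int64_t free_blocks, uint64_t nblocks,
			      uint64_t root_blocks)
{
	return free_blocks - (nblocks + root_blocks) < WATERMARK;
}

/* All s64: a shortfall stays negative and the check fires. */
static int near_full_signed(int64_t free_blocks, int64_t nblocks,
			    int64_t root_blocks)
{
	return free_blocks - (nblocks + root_blocks) < WATERMARK;
}
```

With free_blocks = 10 and nblocks = 20 the signed version reports "near full" while the mixed version does not; when there is no shortfall the two agree.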
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
fs/ext4/balloc.c | 19 ++++++++++---------
fs/ext4/ext4.h | 4 ++--
2 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 53fdb05..7fdc236 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1603,10 +1603,10 @@ ext4_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
}
int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
- ext4_fsblk_t nblocks)
+ s64 nblocks)
{
s64 free_blocks, dirty_blocks;
- ext4_fsblk_t root_blocks = 0;
+ s64 root_blocks = 0;
struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
struct percpu_counter *dbc = &sbi->s_dirtyblocks_counter;
@@ -1631,7 +1631,7 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
/* Check whether we have space after
* accounting for current dirty blocks
*/
- if (free_blocks < ((s64)(root_blocks + nblocks) + dirty_blocks))
+ if (free_blocks < ((root_blocks + nblocks) + dirty_blocks))
/* we don't have free space */
return -ENOSPC;
@@ -1650,10 +1650,10 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
* On success, return nblocks
*/
ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
- ext4_fsblk_t nblocks)
+ s64 nblocks)
{
- ext4_fsblk_t free_blocks, dirty_blocks;
- ext4_fsblk_t root_blocks = 0;
+ s64 free_blocks, dirty_blocks;
+ s64 root_blocks = 0;
struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
struct percpu_counter *dbc = &sbi->s_dirtyblocks_counter;
@@ -1667,14 +1667,15 @@ ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
EXT4_FREEBLOCKS_WATERMARK) {
- free_blocks = percpu_counter_sum_positive(fbc);
- dirty_blocks = ...
Make sure we set windowsz to zero if the free
blocks left are fewer than the window size. Otherwise
we skip some groups with a low free block count during
block allocation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
fs/ext4/balloc.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 7fdc236..a52fde3 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1809,8 +1809,10 @@ ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
* turn off reservation for this allocation
*/
if (my_rsv && (free_blocks < windowsz)
- && (rsv_is_empty(&my_rsv->rsv_window)))
+ && (rsv_is_empty(&my_rsv->rsv_window))) {
my_rsv = NULL;
+ windowsz = 0;
+ }
if (free_blocks > 0) {
bitmap_bh = ext4_read_block_bitmap(sb, group_no);
--
1.6.0.1.90.g27a6e
--
This makes sure that when block allocation fails
we don't have the inode added to the journal handle,
so the journal commit will not include the inode for which
block allocation failed.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
fs/ext4/balloc.c | 2 +-
fs/ext4/inode.c | 36 +++++++++++++++---------------------
2 files changed, 16 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index a52fde3..9a0239e 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -2061,7 +2061,7 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode,
/*
* Account for the allocated meta blocks
*/
- if (!(*errp)) {
+ if (!(*errp) && EXT4_I(inode)->i_delalloc_reserved_flag) {
spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
EXT4_I(inode)->i_allocated_meta_blocks += *count;
spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3ef0822..24381bb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1591,6 +1591,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free)
*/
mdb_free = EXT4_I(inode)->i_reserved_meta_blocks;
EXT4_I(inode)->i_reserved_meta_blocks = 0;
+ EXT4_I(inode)->i_allocated_meta_blocks = 0;
}
release = to_free + mdb_free;
@@ -2169,18 +2170,23 @@ static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
handle_t *handle = NULL;
handle = ext4_journal_current_handle();
- if (!handle) {
- ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
- bh_result, 0, 0, 0);
- BUG_ON(!ret);
- } else {
- ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
- bh_result, create, 0, EXT4_DELALLOC_RSVED);
- }
-
+ BUG_ON(!handle);
+ ret = ext4_get_blocks_wrap(handle, inode, iblock, max_blocks,
+ bh_result, create, 0, EXT4_DELALLOC_RSVED);
if (ret > 0) {
+
bh_result->b_size = (ret << inode->i_blkbits);
+ if ...
Current code has a way to try to prevent early ENOSPC with the old ext3
block reservation. After searching all block groups without being able to
do block reservation and allocation, it falls back to no block
reservation and scans the block groups from the beginning again.
But this doesn't work when the reservation was turned off in the
first goal block group allocation due to 0 free blocks, and the rest of the
block groups are skipped due to the check of "free_blocks < windowsz/2".
I think this causes the ENOSPC error you saw.
There are two issues. I am attaching the fix for both here.
Thanks,
From: Mingming Cao <cmm@us.ibm.com>
ext4: Fix ext4 nomballoc allocator for ENOSPC
We run into an ENOSPC error on non-mballoc ext4, even when there are free
blocks on the filesystem.
The problem is triggered when the goal block group has 0 free blocks
and the rest of the block groups are skipped due to the check of "free_blocks
< windowsz/2". Current code could fall back to non-reservation allocation
to prevent early ENOSPC after examining all the block groups with reservation
on, but this code was bypassed if the reservation window is turned off already,
which is true in this case.
This patch fixes two issues:
1) We don't need to turn off block reservation if the goal block group has
0 free blocks left and we continue searching the rest of the block groups.
The intention of the current code is to turn off block reservation if the
goal allocation group has a few (some) free blocks left (not enough
to make the desired reservation window), to try to allocate in the
goal block group, to get better locality. But if the goal block group has
0 free blocks, it should leave block reservation on and continue
searching the next block groups, rather than turn off block reservation
completely.
2) We don't need to check the window size if block reservation is off.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Index: ...
I don't see how this change is going to make a difference. The goal group
had free blocks < windowsz and that made my_rsv = NULL. I guess we
should not make my_rsv NULL in the first loop. Or in other words we can remove
/*
* if there is not enough free blocks to make a new
* resevation
* turn off reservation for this allocation
*/
if (my_rsv && (free_blocks < windowsz)
&& (free_blocks > 0)
&& (rsv_is_empty(&my_rsv->rsv_window)))
my_rsv = NULL;
And since we have the below check in the for loop
if (my_rsv && (free_blocks <= (windowsz/2)))
continue;
We would skip all the groups that have low free block count.
Now if we are not able to allocate any blocks (ENOSPC)
we loop back because of
if (my_rsv) {
my_rsv = NULL;
windowsz = 0;
group_no = goal_group;
goto retry_alloc;
}
and that would allocate blocks from the first group available.
This also gives a chance to scan all the groups to make sure
if we have any of them left with enough free blocks to
--
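The two-pass behaviour described in the message above can be modelled roughly like this. It is a deliberately simplified sketch: pick_group() is a hypothetical helper, the free block counts come from a plain array instead of group descriptors, and the separate handling of the goal group before the loop is folded into the loop.

```c
/*
 * Return the index of the group an allocation would come from,
 * or -1 for ENOSPC. Pass 1 runs with reservation on and skips
 * groups with free_blocks <= windowsz/2. If nothing is found,
 * pass 2 retries from the goal group with reservation off.
 */
static int pick_group(const unsigned *free_blocks, int ngroups,
		      int goal_group, unsigned windowsz)
{
	int use_rsv = 1;	/* models my_rsv != NULL */
	int i;

retry:
	for (i = 0; i < ngroups; i++) {
		int g = (goal_group + i) % ngroups;

		if (use_rsv && free_blocks[g] <= windowsz / 2)
			continue;
		if (free_blocks[g] > 0)
			return g;
	}
	if (use_rsv) {
		/* models: my_rsv = NULL; windowsz = 0; goto retry_alloc; */
		use_rsv = 0;
		windowsz = 0;
		goto retry;
	}
	return -1;
}
```

With free counts {0, 3, 2}, goal group 0 and windowsz 8, the first pass skips every group and the retry returns group 1. If the retry were never reached, the same input would end in ENOSPC despite 5 free blocks.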
Ok, how about this? The final change is the same as what you have done,
but it makes the code easier to understand. I also added a comment
explaining the details.
commit 0216ee1ac13270c1ab7b7517d41775727f7da02d
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Fri Aug 29 09:35:15 2008 +0530
ext4: Fix ext4 nomballoc allocator for ENOSPC
We run into an ENOSPC error on non-mballoc ext4, even when there are free
blocks on the filesystem.
The patch includes two changes:
a) Set reservation to NULL if we are trying to allocate near group_target_block
from the goal group and the free block count in the group is less than windowsz.
This should give us a better chance to allocate near group_target_block.
This also ensures that if we are not allocating near group_target_block
then we don't turn off reservation. This should enable us to allocate
with reservation from other groups that have a large free block count.
b) We don't need to check the window size if the block reservation is off.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index cfe01b4..399bec5 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -1802,15 +1802,17 @@ ext4_fsblk_t ext4_old_new_blocks(handle_t *handle, struct inode *inode,
goto io_error;
free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
- /*
- * if there is not enough free blocks to make a new resevation
- * turn off reservation for this allocation
- */
- if (my_rsv && (free_blocks < windowsz)
- && (rsv_is_empty(&my_rsv->rsv_window)))
- my_rsv = NULL;
if (free_blocks > 0) {
+ /*
+ * try to allocate with group target block
+ * in the goal group. If we have low free_blocks
+ * count turn off reservation
+ */
+ if (my_rsv && (free_blocks < windowsz)
+ && (rsv_is_empty(&my_rsv->rsv_window)))
+ my_rsv = NULL;
+
bitmap_bh = ...
Fine with me, I will update the patch in the ext4 patch queue with the additional comment. But Andrew has already taken the ext2/3 version into the mm tree; I am not sure if it is worth resending with a patch against the original
--
Hmm, if the goal block group had free blocks, why did the allocation fail (reservation is turned off by setting my_rsv to NULL)? I wonder if there are other threads trying to allocate in the same goal block group at the same time, stealing the last free blocks?
Mingming
--
We are trying block allocation with a grp_target_blk there and even if reservation is turned off it can return ENOSPC.
-aneesh
--
Reviewed-by: Mingming Cao <cmm@us.ibm.com>
--
With this change ext4 keeps unnecessary blocks reserved for metadata for a longer time (until all dirty data have been flushed). I am concerned this will lead to early ENOSPC. The current metadata reservation logic is a little complex, but it's not that bad. It's there to make sure we don't over-reserve the metadata.
--
Added to patch queue
--
...
(nitpick, I wish the changelog stated why the change was made, rather
Why was this part removed? Near as I can tell it's still needed; with all
patches in the queue applied, if I run fallocate to try and allocate 10G of
space to a file, on a filesystem with 30G free, I run out of space after only
1.6G is allocated!
# /mnt/test/fallocate-amit -f /mnt/test/testfile 0 10737418240
SYSCALL: received error 28, ret=-1
# FALLOCATE TEST REPORT #
New blocks preallocated = 0.
Number of bytes preallocated = 0
Old file size = 0, New file size -474484472.
Old num blocks = 0, New num blocks 0.
test_fallocate: ERROR ! ret=1
#!# TESTS FAILED #!#
I see the request for the original 2621440 blocks come in; this gets limited
to 32767 due to max uninit length. Somehow, though, we seem to be allocating
only 2048 blocks at a time (haven't worked out why, yet - this also seems
problematic) - but at any rate, losing (32767-2048) blocks in each loop from
fallocate seems to be causing this space loss and eventual ENOSPC.
fallocate loops 243 times for me; losing (32767-2048) each time accounts for
the 28G:
(32767-2048)*243*4096/1024/1024/1024
28
(plus the ~2G actually allocated gets us back to 30G that was originally free)
Anyway, fsck finds no errors, and remounting fixes it. It's apparently just
the in-memory counters that get off.
-Eric
--
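The arithmetic in the report above can be replayed with a few helper functions. The helper names and FS_BLOCK_SIZE are illustrative only; the 4096-byte block size is taken from the calculation in the mail.

```c
#include <stdint.h>

#define FS_BLOCK_SIZE 4096ULL	/* block size used in the report's calculation */

/* Size of a byte request in filesystem blocks. */
static uint64_t request_blocks(uint64_t bytes)
{
	return bytes / FS_BLOCK_SIZE;
}

/* Blocks that stay accounted as dirty after `loops` fallocate iterations. */
static uint64_t lost_blocks(uint64_t reserved_per_loop,
			    uint64_t allocated_per_loop, uint64_t loops)
{
	return (reserved_per_loop - allocated_per_loop) * loops;
}

/* Whole GiB covered by `blocks`, using the same integer division as the mail. */
static uint64_t blocks_to_gib(uint64_t blocks)
{
	return blocks * FS_BLOCK_SIZE / 1024 / 1024 / 1024;
}
```

request_blocks(10737418240) gives the 2621440 blocks of the original request, and blocks_to_gib(lost_blocks(32767, 2048, 243)) gives the 28G quoted in the mail.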
Can you test this patch
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 64eeb9a..6e81c38 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2800,7 +2800,7 @@ void exit_ext4_mballoc(void)
*/
static noinline_for_stack int
ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
- handle_t *handle)
+ handle_t *handle, unsigned long reserv_blks)
{
struct buffer_head *bitmap_bh = NULL;
struct ext4_super_block *es;
@@ -2893,7 +2893,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
/*
* Now reduce the dirty block count also. Should not go negative
*/
- percpu_counter_sub(&sbi->s_dirtyblocks_counter, ac->ac_b_ex.fe_len);
+ percpu_counter_sub(&sbi->s_dirtyblocks_counter, reserv_blks);
if (sbi->s_log_groups_per_flex) {
ext4_group_t flex_group = ext4_flex_group(sbi,
ac->ac_b_ex.fe_group);
@@ -4284,12 +4284,13 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
struct ext4_allocation_request *ar, int *errp)
{
+ int freed;
struct ext4_allocation_context *ac = NULL;
struct ext4_sb_info *sbi;
struct super_block *sb;
ext4_fsblk_t block = 0;
- int freed;
- int inquota;
+ unsigned long inquota;
+ unsigned long reserv_blks;
sb = ar->inode->i_sb;
sbi = EXT4_SB(sb);
@@ -4308,6 +4309,8 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
return 0;
}
}
+ /* Number of reserv_blks for both delayed an non delayed allocation */
+ reserv_blks = ar->len;
while (ar->len && DQUOT_ALLOC_BLOCK(ar->inode, ar->len)) {
ar->flags |= EXT4_MB_HINT_NOPREALLOC;
ar->len--;
@@ -4353,7 +4356,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
}
if (likely(ac->ac_status == AC_STATUS_FOUND)) {
- *errp = ext4_mb_mark_diskspace_used(ac, handle);
+ *errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_blks);
if (*errp == -EAGAIN) {
ac->ac_b_ex.fe_group = 0;
ac->ac_b_ex.fe_start = ...
This does fix my 10G-fallocate testcase.
--
I believe the 2048-block (8MB) allocation limit is imposed by mballoc to avoid scanning the whole filesystem looking for huge chunks of free disk. That said, it would be nice if there IS lots of free space that this is allocated optimistically if possible.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--
Any reason that we don't do percpu_counter_sum_and_sub() together? I --
On Wed, 27 Aug 2008 20:58:26 +0530
So... yesterday's suggestion to investigate implementing this at a
This change means that a percpu_counter_read() from interrupt context on a 32-bit machine is now deadlockable, whereas it previously was not deadlockable on either 32-bit or 64-bit.
This flows on to the lib/proportions.c, which uses percpu_counter_read() and also does spin_lock_irqsave() internally, indicating that it is (or was) designed to be used in IRQ contexts. It means that bdi_stat() can no longer be used from interrupt context.
So a whole lot of thought and review and checking is needed here. It should all be spelled out in the changelog. This will be a horridly rare deadlock, so suitable WARN_ON()s should be added to detect when callers are vulnerable to it. Or we make the whole thing irq-safe.
--
I think it's a good idea to investigate a generic atomic64_t type.
i386 could possibly use cmpxchg8 if and when available, although using
that to read might be rather too expensive.
Doing something like:
struct atomic64_t {
seqlock_t lock;
s64 val;
};
might be somewhat unexpected from the sizeof() angle of things. Then
percpu_counter() never was irq safe, which is why the proportion stuff
Actually, as long as the write side of the seqlock usage is done with
IRQs disabled, the read side should be good.
If the read loop gets preempted by a write action, the seq count will
not match up and we'll just try again.
The only lethal combination is trying to do the read loop while inside
the write side.
If you look at backing-dev.h, you'll see that all modifying operations
on a few archs.
--
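For readers who want to see the retry loop from the discussion above in isolation, here is a userspace sketch using C11 atomics. It is not the kernel's seqlock_t: struct seq_s64 and both functions are hypothetical, a single writer is assumed (the kernel serializes writers with a spinlock), and the 64-bit value is split into two 32-bit halves to mimic a 32-bit machine.

```c
#include <stdatomic.h>
#include <stdint.h>

struct seq_s64 {
	atomic_uint seq;		/* odd while a write is in progress */
	_Atomic uint32_t lo, hi;	/* the value as two 32-bit halves */
};

/* Writer side; assumes only one writer runs at a time. */
static void seq_s64_write(struct seq_s64 *s, int64_t v)
{
	unsigned seq = atomic_load_explicit(&s->seq, memory_order_relaxed);

	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->lo, (uint32_t)v, memory_order_relaxed);
	atomic_store_explicit(&s->hi, (uint32_t)((uint64_t)v >> 32),
			      memory_order_relaxed);
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side: retry until both halves were read under one even seq. */
static int64_t seq_s64_read(struct seq_s64 *s)
{
	unsigned seq;
	uint32_t lo, hi;

	do {
		seq = atomic_load_explicit(&s->seq, memory_order_acquire);
		lo = atomic_load_explicit(&s->lo, memory_order_relaxed);
		hi = atomic_load_explicit(&s->hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while ((seq & 1) ||
		 atomic_load_explicit(&s->seq, memory_order_relaxed) != seq);

	return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

The sketch also shows the lethal combination mentioned above: a reader that runs on top of the writer between its two seq stores (for example from an interrupt on the same CPU) sees an odd seq forever and never leaves the loop.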
On Wed, 27 Aug 2008 23:01:52 +0200
percpu_counter_read() was irq-safe. That changes here. Needs careful review, changelogging and, preferably, runtime checks. But perhaps they should be inside some CONFIG_thing which won't normally be done in production.
otoh, percpu_counter_read() is in fact a rare operation, so a bit of overhead probably won't matter.
(write-often, read-rarely is the whole point. This patch's changelog's assertion that "Since fbc->count is read more frequently and updated rarely" is probably wrong. Most percpu_counters will have their
Sure. I _expect_ that this interface change won't actually break anything. But it adds a restriction which we should think about, and document.
btw, what the heck is percpu_counter_init_irq()? Some mysterious lockdep-specific thing?
<does git-fiddle. Oh. crappy changelog.>
I let that one leak through uncommented. Must be getting old. Probably it will need an EXPORT_SYMBOL() sometime.
I expect that if we're going to go ahead and make percpu_counter_read() no longer usable from interrupt context then we'll eventually end up needing the full suite of _irq() and _irqsave() interface functions. percpu_counter_add_irqsave(), etc.
--
we may actually be doing percpu_counter_add. But that doesn't update fbc->count. Only if the local percpu values cross FBC_BATCH do we update fbc->count. If we are modifying fbc->count more frequently than reading fbc->count then I guess we would be contending on fbc->lock more.
-aneesh
--
Yep. The frequency of modification of fbc->count is of the order of a tenth or a hundredth of the frequency of percpu_counter_<modification>() calls. But in many cases the frequency of percpu_counter_read() calls is far far less than this. For example, the percpu_counter_read() may only happen when userspace polls a /proc file.
--
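The batching behaviour discussed in the two messages above can be illustrated with a single-threaded model. Everything here is a sketch under stated assumptions: struct pc_sketch, the pc_*() helpers, the CPU count and the batch size of 32 are made up, and the lock the kernel takes around the global update is only mentioned in a comment.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count */
#define SKETCH_BATCH 32		/* hypothetical stand-in for FBC_BATCH */

struct pc_sketch {
	int64_t count;			/* global value, like fbc->count */
	int32_t local[SKETCH_NR_CPUS];	/* per-CPU deltas */
};

/* Add on one CPU; the global count changes only when the delta hits the batch. */
static void pc_add(struct pc_sketch *c, int cpu, int32_t amount)
{
	int64_t v = (int64_t)c->local[cpu] + amount;

	if (v >= SKETCH_BATCH || v <= -SKETCH_BATCH) {
		c->count += v;		/* the kernel takes fbc->lock here */
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = (int32_t)v;
	}
}

/* Cheap, approximate read: global count only. */
static int64_t pc_read(const struct pc_sketch *c)
{
	return c->count;
}

/* Expensive, exact read: global count plus every per-CPU delta. */
static int64_t pc_sum(const struct pc_sketch *c)
{
	int64_t sum = c->count;
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

After adding 10 on CPU 0, pc_read() still returns 0 while pc_sum() returns 10; adding 30 more crosses the batch and folds 40 into the global count. This is why the global count is written far less often than the add function is called.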
The global counter is much more frequently accessed with delalloc. :( With delayed allocation, we have to read the free blocks counter at each write_begin(), to make sure there are enough free blocks to do block reservation, so that writepages does not return ENOSPC later.
Mingming
--
Basically all it does is break the percpu_counter lock into two classes. One for the irq-unsafe users and one for the irq-safe users. Without this lockdep goes splat complaining about irq recursion deadlocks and the like between these two separate users.
--
I wanted to send the entire patchset which fixes ENOSPC issues with delalloc. It happened to be on the next day you looked at the previous mail. Sending the patch again in no way means we should not have
How do we actually figure that out? I have been making that mistake
-aneesh
--
Well. Experience and guesswork, mainly. But a useful metric is to look at the /bin/size output before and after the inlining. In this case fs/ext3/ialloc.o's text shrunk 40-odd bytes, which we think is a net benefit due to reduced CPU cache pressure.
--
Weighed against the register save/restore, compiler barrier, and function call cost of the uninlined version. These can add up to 10s of cycles per call I've seen, so if it is called several times between each icache miss it can easily be worth inlining. Basically, measurement is required, and if it isn't important enough to measure, policy tends to default to uninline if that saves space.
--
