This patchset removes __GFP_NOFAIL from various allocations when those callers don't have __GFP_FS set, meaning they cannot invoke the oom killer to free memory. This is the second phase of four for the total removal of __GFP_NOFAIL: the remaining phases will introduce __GFP_KILLABLE to invoke the oom killer to free memory for all remaining callers since they allow __GFP_FS, and then __GFP_NOFAIL will be removed from the page allocator. /* FIXME */ comments were added where potentially infinite loops were added to the callers. These always had the potential to infinitely loop in the page allocator, but those loops will now occur in the caller instead and some documentation of this behavior was necessary (grep for __GFP_NOFAIL will no longer work when this effort is completed). There's a benefit for the removal of __GFP_NOFAIL: we save several branches in the page allocator if we can move this logic to the caller instead. --
Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.
Signed-off-by: David Rientjes <rientjes@google.com>
---
drivers/md/dm-region-hash.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -289,8 +289,12 @@ static struct dm_region *__rh_alloc(struct dm_region_hash *rh, region_t region)
struct dm_region *reg, *nreg;
nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC);
- if (unlikely(!nreg))
- nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL);
+ if (unlikely(!nreg)) {
+ /* FIXME: this may potentially loop forever */
+ do {
+ nreg = kmalloc(sizeof(*nreg), GFP_NOIO);
+ } while (!nreg);
+ }
nreg->state = rh->log->type->in_sync(rh->log, region, 1) ?
DM_RH_CLEAN : DM_RH_NOSYNC;
--
On Mon, 16 Aug 2010 19:57:51 -0700 (PDT) erm. The reason for adding GFP_NOFAIL in the first place was my observation that the kernel contained lots of open-coded retry-for-ever loops. All of these are wrong, bad, buggy and mustfix. So we consolidated the wrongbadbuggymustfix concept into the core MM so that miscreants could be easily identified and hopefully fixed. I think that simply undoing that change is a bad idea - it allows the wrongbadbuggymustfix code to hide from view. The correct way to remove __GFP_NOFAIL is to fix the wrongbadbuggymustfix code properly. --
That consolidation would have been unnecessary, then, since all allocations with order < PAGE_ALLOC_COSTLY_ORDER automatically loop indefinitely in the page allocator. struct dm_region allocations would already do that. So this retry loop doesn't actually do anything that the page allocator already doesn't, with or without __GFP_NOFAIL. The difference here is that - it doesn't depend on the page allocator's implementation, which may change over time, and - it adds documentation so that the subsystems doing these loops can (hopefully) fix these problems later, although their appear to be If the prerequisite for removing __GFP_NOFAIL is that nobody must ever loop indefinitely looking for memory or smaller order allocations don't implicitly retry, then there's little chance it'll ever get removed since they've existed for years without anybody cleaning them up. --
On Mon, 23 Aug 2010 12:35:22 -0700 (PDT) The difference is that an order-0 !__GFP_NOFAIL allocation attempt can The JBD one is hard - I haven't looked at the others. We should fix them. --
This is GFP_NOIO, which all the allocations in this patchset are (or GFP_NOFS), so the oom killer isn't involved. --
Hows about you add a helper function void *[kmalloc|alloc_page]_retrying_forever_because_i_suck(lots of args) then convert the callsites to use that, then nuke __GFP_NOFAIL? --
That would only serve as documentation of a caller that could potentially loop forever waiting for memory (which I did by adding "/* FIXME: this may potentially loop forever */") since all of the allocations in this patchset never loop in the code that was added, they already loop forever in the page allocator doing the same thing. The hope is that kswapd will eventually be able to free memory since direct reclaim will usually fail for GFP_NOFS and we simply need to wait long enough for there to be memory. --
On Mon, 23 Aug 2010 13:08:53 -0700 (PDT) While holding locks which will prevent kswapd from doing anything useful. --
It implies that these calls are special in some way, basically to mean we have no sane error handling for failures, when in reality this is only a function of the allocation order when the context does not have __GFP_FS. You may as well just add BUG_ON(!nreg) here instead. I thought my "/* FIXME: this may potentially loop forever */" was sufficient to mean that the allocation must succeed independent of the page allocator's implementation, but perhaps you find greping for [kmalloc|alloc_page.*]_nofail easier? I'm not sure what flags such a function would be checking for since the page allocator determines whether the oom killer is called or not. If current is killed, an order-0 allocation will succeed because of TIF_MEMDIE, and if another task is killed, we must still loop waiting for the free memory even with __GFP_FS. --
On Mon, Aug 23, 2010 at 11:01 PM, Andrew Morton I'd prefer killing off __GFP_NOFAIL properly :-) --
And how is this not done properly if it's not even needed for any of the allocations in this patchset since the page allocator loops forever for their orders and context? (This is why we don't need to add __GFP_NOWARN in its place.) This is a cleanup patchset to remove the unnecessary use of __GFP_NOFAIL, there _are_ GFP_KERNEL | __GFP_NOFAIL allocations that need to be addressed in phase three.
Hi David,
My point is that I don't think Andrew's helper will change all that
much. Fixing the actual callers is the hard part and I don't see your
patches helping that either. Hiding the looping in filesystem code is
only going to make problematic callers harder to find (e.g
kmem_alloc() in XFS code).
FWIW, I'd just add a nasty WARN_ON in the page allocator and put a big
fat "fix your shit" comment on top of it to motivate people to fix
their code.
Pekka
--
Nothing is getting hidden in the callers here, all of the patches in this series are using __GFP_NOFAIL unnecessarily. --
Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.
Signed-off-by: David Rientjes <rientjes@google.com>
---
fs/btrfs/extent-tree.c | 18 ++++++++++++++----
fs/btrfs/inode.c | 5 ++++-
2 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3822,6 +3822,8 @@ static int update_block_group(struct btrfs_trans_handle *trans,
spin_unlock(&cache->lock);
spin_unlock(&cache->space_info->lock);
} else {
+ int err;
+
old_val -= num_bytes;
btrfs_set_block_group_used(&cache->item, old_val);
cache->pinned += num_bytes;
@@ -3831,9 +3833,12 @@ static int update_block_group(struct btrfs_trans_handle *trans,
spin_unlock(&cache->lock);
spin_unlock(&cache->space_info->lock);
- set_extent_dirty(info->pinned_extents,
+ do {
+ /* FIXME: this may potentially loop forever */
+ err = set_extent_dirty(info->pinned_extents,
bytenr, bytenr + num_bytes - 1,
- GFP_NOFS | __GFP_NOFAIL);
+ GFP_NOFS);
+ } while (err);
}
btrfs_put_block_group(cache);
total -= num_bytes;
@@ -3861,6 +3866,8 @@ static int pin_down_extent(struct btrfs_root *root,
struct btrfs_block_group_cache *cache,
u64 bytenr, u64 num_bytes, int reserved)
{
+ int err;
+
spin_lock(&cache->space_info->lock);
spin_lock(&cache->lock);
cache->pinned += num_bytes;
@@ -3872,8 +3879,11 @@ static int pin_down_extent(struct btrfs_root *root,
spin_unlock(&cache->lock);
spin_unlock(&cache->space_info->lock);
- set_extent_dirty(root->fs_info->pinned_extents, bytenr,
- bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
+ do {
+ /* FIXME: this may potentially loop forever */
+ err = set_extent_dirty(root->fs_info->pinned_extents, bytenr,
+ bytenr + num_bytes - 1, GFP_NOFS);
+ } while (err);
return 0;
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
--- ...Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.
Signed-off-by: David Rientjes <rientjes@google.com>
---
fs/gfs2/log.c | 10 ++++++++--
fs/gfs2/meta_io.c | 5 ++++-
fs/gfs2/rgrp.c | 27 +++++++++++++++------------
3 files changed, 27 insertions(+), 15 deletions(-)
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -523,7 +523,10 @@ struct buffer_head *gfs2_log_fake_buf(struct gfs2_sbd *sdp,
u64 blkno = log_bmap(sdp, sdp->sd_log_flush_head);
struct buffer_head *bh;
- bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);
+ do {
+ /* FIXME: this may potentially loop forever */
+ bh = alloc_buffer_head(GFP_NOFS);
+ } while (!bh);
atomic_set(&bh->b_count, 1);
bh->b_state = (1 << BH_Mapped) | (1 << BH_Uptodate) | (1 << BH_Lock);
set_bh_page(bh, real->b_page, bh_offset(real));
@@ -709,7 +712,10 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl)
}
trace_gfs2_log_flush(sdp, 1);
- ai = kzalloc(sizeof(struct gfs2_ail), GFP_NOFS | __GFP_NOFAIL);
+ do {
+ /* FIXME: this may potentially loop forever */
+ ai = kzalloc(sizeof(struct gfs2_ail), GFP_NOFS);
+ } while (!ai);
INIT_LIST_HEAD(&ai->ai_ail1_list);
INIT_LIST_HEAD(&ai->ai_ail2_list);
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -289,7 +289,10 @@ void gfs2_attach_bufdata(struct gfs2_glock *gl, struct buffer_head *bh,
return;
}
- bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS | __GFP_NOFAIL);
+ do {
+ /* FIXME: this may potentially loop forever */
+ bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS);
+ } while (!bd);
bd->bd_bh = bh;
bd->bd_gl = gl;
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1440,8 +1440,11 @@ static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd *sdp, u64 bstart,
rgrp_blk++;
if (!bi->bi_clone) {
- bi->bi_clone = ...Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.
The error handling when kzalloc() returns NULL in start_this_handle()
was removed since it was unreachable.
Signed-off-by: David Rientjes <rientjes@google.com>
---
fs/jbd/journal.c | 5 ++++-
fs/jbd/transaction.c | 14 ++++++--------
2 files changed, 10 insertions(+), 9 deletions(-)
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -301,7 +301,10 @@ int journal_write_metadata_buffer(transaction_t *transaction,
*/
J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
- new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
+ do {
+ /* FIXME: this may potentially loop forever */
+ new_bh = alloc_buffer_head(GFP_NOFS);
+ } while (!new_bh);
/* keep subsequent assertions sane */
new_bh->b_state = 0;
init_buffer(new_bh, NULL, NULL);
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -98,14 +98,12 @@ static int start_this_handle(journal_t *journal, handle_t *handle)
}
alloc_transaction:
- if (!journal->j_running_transaction) {
- new_transaction = kzalloc(sizeof(*new_transaction),
- GFP_NOFS|__GFP_NOFAIL);
- if (!new_transaction) {
- ret = -ENOMEM;
- goto out;
- }
- }
+ if (!journal->j_running_transaction)
+ do {
+ /* FIXME: this may potentially loop forever */
+ new_transaction = kzalloc(sizeof(*new_transaction),
+ GFP_NOFS);
+ } while (!new_transaction);
jbd_debug(3, "New handle %p going live.\n", handle);
--
Thanks! I've added the patch to my tree. Since rc1 is over, I think this is a material for the next merge window, right? I can take care of pushing it. If you want to push the change yourself, feel free to add Acked-by: Jan Kara <jack@suse.cz> -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Yes, we still need to switch over GFP_KERNEL callers and remove the flag completely, so there's no hurry for this to go into 2.6.36. --
On Tue, 17 Aug 2010 11:51:03 +0200 Please unadd it. JBD should be fixed so that it can appropriately handle out-of-memory conditions. Until that time we shouldn't hide its shortcomings with this open-coded equivalent. --
Well, I wanted to make it easy for David so that he can proceed with his removal of __GFP_NOFAIL. I agree that pushing the looping from the allocator to the callers seems of a disputable value to me as well. So do you think that we should keep __GFP_NOFAIL as long as all callers are not able to handle allocation failures in more reasonable way? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
On Tue, 24 Aug 2010 00:03:47 +0200 The concept should be encapsulated in _some_ centralised fashion. Helper functions would work as well as __GFP_NOFAIL, and will move any runtime cost away from the good code and push it onto the bad code. --
Makes sense. Removed the patch. David, could you provide a function for non-failing allocation and then use this from JBD and whatever else code is also affected? That looks like a cleaner solution as Andrew points out... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
There's no runtime cost on the bad code, the calls never loop since the page allocator already loops itself. Regardless, I'll add a helper function to include/linux/gfp.h to do this with a WARN_ON_ONCE() inside the loop in case any order > PAGE_ALLOC_COSTLY_ORDER callers are ever added (and I really hope nobody merges those). --
Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.
Instead of maintaining a seperate _nofail() variant, it's possible to
remove it and use ntfs_malloc_nofs() instead.
Signed-off-by: David Rientjes <rientjes@google.com>
---
fs/ntfs/malloc.h | 17 -----------------
fs/ntfs/runlist.c | 6 ++++--
2 files changed, 4 insertions(+), 19 deletions(-)
diff --git a/fs/ntfs/malloc.h b/fs/ntfs/malloc.h
--- a/fs/ntfs/malloc.h
+++ b/fs/ntfs/malloc.h
@@ -66,23 +66,6 @@ static inline void *ntfs_malloc_nofs(unsigned long size)
return __ntfs_malloc(size, GFP_NOFS | __GFP_HIGHMEM);
}
-/**
- * ntfs_malloc_nofs_nofail - allocate memory in multiples of pages
- * @size: number of bytes to allocate
- *
- * Allocates @size bytes of memory, rounded up to multiples of PAGE_SIZE and
- * returns a pointer to the allocated memory.
- *
- * This function guarantees that the allocation will succeed. It will sleep
- * for as long as it takes to complete the allocation.
- *
- * If there was insufficient memory to complete the request, return NULL.
- */
-static inline void *ntfs_malloc_nofs_nofail(unsigned long size)
-{
- return __ntfs_malloc(size, GFP_NOFS | __GFP_HIGHMEM | __GFP_NOFAIL);
-}
-
static inline void ntfs_free(void *addr)
{
if (!is_vmalloc_addr(addr)) {
diff --git a/fs/ntfs/runlist.c b/fs/ntfs/runlist.c
--- a/fs/ntfs/runlist.c
+++ b/fs/ntfs/runlist.c
@@ -127,8 +127,10 @@ static inline runlist_element *ntfs_rl_realloc_nofail(runlist_element *rl,
if (old_size == new_size)
return rl;
- new_rl = ntfs_malloc_nofs_nofail(new_size);
- BUG_ON(!new_rl);
+ do {
+ /* FIXME: this may potentially loop forever */
+ new_rl = ntfs_malloc_nofs(new_size);
+ } while (!new_rl);
if (likely(rl != NULL)) {
if (unlikely(old_size > new_size))
--
Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.
Signed-off-by: David Rientjes <rientjes@google.com>
---
fs/reiserfs/journal.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -2593,8 +2593,11 @@ static int journal_read(struct super_block *sb)
static struct reiserfs_journal_list *alloc_journal_list(struct super_block *s)
{
struct reiserfs_journal_list *jl;
- jl = kzalloc(sizeof(struct reiserfs_journal_list),
- GFP_NOFS | __GFP_NOFAIL);
+
+ do {
+ /* FIXME: this may potentially loop forever */
+ jl = kzalloc(sizeof(struct reiserfs_journal_list), GFP_NOFS);
+ } while (!jl);
INIT_LIST_HEAD(&jl->j_list);
INIT_LIST_HEAD(&jl->j_working_list);
INIT_LIST_HEAD(&jl->j_tail_bh_list);
--
