Re: [patch 4/6] jbd: remove dependency on __GFP_NOFAIL

Previous thread: linux-next: build warning after merge of the tree by Stephen Rothwell on Monday, August 16, 2010 - 7:24 pm. (2 messages)

Next thread: linux-next: Tree for August 17 by Stephen Rothwell on Monday, August 16, 2010 - 8:32 pm. (2 messages)
From: David Rientjes
Date: Monday, August 16, 2010 - 7:57 pm

This patchset removes __GFP_NOFAIL from various allocations when those
callers don't have __GFP_FS set, meaning they cannot invoke the oom
killer to free memory.

This is the second phase of four for the total removal of __GFP_NOFAIL:
the remaining phases will introduce __GFP_KILLABLE to invoke the oom
killer to free memory for all remaining callers since they allow
__GFP_FS, and then __GFP_NOFAIL will be removed from the page allocator.

/* FIXME */ comments were added where potentially infinite loops were
added to the callers.  These always had the potential to infinitely loop
in the page allocator, but those loops will now occur in the caller
instead and some documentation of this behavior was necessary (grep for
__GFP_NOFAIL will no longer work when this effort is completed).

There's a benefit for the removal of __GFP_NOFAIL: we save several
branches in the page allocator if we can move this logic to the caller
instead.
--

From: David Rientjes
Date: Monday, August 16, 2010 - 7:57 pm

Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/md/dm-region-hash.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -289,8 +289,12 @@ static struct dm_region *__rh_alloc(struct dm_region_hash *rh, region_t region)
 	struct dm_region *reg, *nreg;
 
 	nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC);
-	if (unlikely(!nreg))
-		nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL);
+	if (unlikely(!nreg)) {
+		/* FIXME: this may potentially loop forever */
+		do {
+			nreg = kmalloc(sizeof(*nreg), GFP_NOIO);
+		} while (!nreg);
+	}
 
 	nreg->state = rh->log->type->in_sync(rh->log, region, 1) ?
 		      DM_RH_CLEAN : DM_RH_NOSYNC;
--

From: Andrew Morton
Date: Monday, August 23, 2010 - 12:26 pm

On Mon, 16 Aug 2010 19:57:51 -0700 (PDT)

erm.

The reason for adding GFP_NOFAIL in the first place was my observation
that the kernel contained lots of open-coded retry-for-ever loops.

All of these are wrong, bad, buggy and mustfix.  So we consolidated the
wrongbadbuggymustfix concept into the core MM so that miscreants could
be easily identified and hopefully fixed.

I think that simply undoing that change is a bad idea - it allows the
wrongbadbuggymustfix code to hide from view.

The correct way to remove __GFP_NOFAIL is to fix the
wrongbadbuggymustfix code properly.
--

From: David Rientjes
Date: Monday, August 23, 2010 - 12:35 pm

That consolidation would have been unnecessary, then, since all 
allocations with order < PAGE_ALLOC_COSTLY_ORDER automatically loop 
indefinitely in the page allocator.  struct dm_region allocations would 
already do that.

So this retry loop doesn't actually do anything that the page allocator 
already doesn't, with or without __GFP_NOFAIL.  The difference here is 
that

 - it doesn't depend on the page allocator's implementation, which may
   change over time, and

 - it adds documentation so that the subsystems doing these loops can
   (hopefully) fix these problems later, although their appear to be


If the prerequisite for removing __GFP_NOFAIL is that nobody must ever 
loop indefinitely looking for memory or smaller order allocations don't 
implicitly retry, then there's little chance it'll ever get removed since 
they've existed for years without anybody cleaning them up.
--

From: Andrew Morton
Date: Monday, August 23, 2010 - 12:51 pm

On Mon, 23 Aug 2010 12:35:22 -0700 (PDT)

The difference is that an order-0 !__GFP_NOFAIL allocation attempt can


The JBD one is hard - I haven't looked at the others.

We should fix them.
--

From: David Rientjes
Date: Monday, August 23, 2010 - 1:03 pm

This is GFP_NOIO, which all the allocations in this patchset are (or 
GFP_NOFS), so the oom killer isn't involved.
--

From: Andrew Morton
Date: Monday, August 23, 2010 - 1:01 pm

Hows about you add a helper function

	void *[kmalloc|alloc_page]_retrying_forever_because_i_suck(lots of args)

then convert the callsites to use that, then nuke __GFP_NOFAIL?
--

From: David Rientjes
Date: Monday, August 23, 2010 - 1:08 pm

That would only serve as documentation of a caller that could potentially 
loop forever waiting for memory (which I did by adding "/* FIXME: this may 
potentially loop forever */") since all of the allocations in this 
patchset never loop in the code that was added, they already loop forever 
in the page allocator doing the same thing.  The hope is that kswapd will 
eventually be able to free memory since direct reclaim will usually fail 
for GFP_NOFS and we simply need to wait long enough for there to be 
memory.
--

From: Andrew Morton
Date: Monday, August 23, 2010 - 1:23 pm

On Mon, 23 Aug 2010 13:08:53 -0700 (PDT)



While holding locks which will prevent kswapd from doing anything useful.

--

From: David Rientjes
Date: Monday, August 23, 2010 - 1:37 pm

It implies that these calls are special in some way, basically to mean 
we have no sane error handling for failures, when in reality this is only 
a function of the allocation order when the context does not have 
__GFP_FS.  You may as well just add BUG_ON(!nreg) here instead.

I thought my "/* FIXME: this may potentially loop forever */" was 
sufficient to mean that the allocation must succeed independent of the 
page allocator's implementation, but perhaps you find greping for 
[kmalloc|alloc_page.*]_nofail easier?

I'm not sure what flags such a function would be checking for since the 
page allocator determines whether the oom killer is called or not.  If 
current is killed, an order-0 allocation will succeed because of 
TIF_MEMDIE, and if another task is killed, we must still loop waiting for 
the free memory even with __GFP_FS.
--

From: Pekka Enberg
Date: Monday, August 23, 2010 - 1:09 pm

On Mon, Aug 23, 2010 at 11:01 PM, Andrew Morton

I'd prefer killing off __GFP_NOFAIL properly :-)
--

From: David Rientjes
Date: Monday, August 23, 2010 - 1:13 pm

And how is this not done properly if it's not even needed for any of the 
allocations in this patchset since the page allocator loops forever for 
their orders and context?  (This is why we don't need to add __GFP_NOWARN 
in its place.)

This is a cleanup patchset to remove the unnecessary use of __GFP_NOFAIL, 
there _are_ GFP_KERNEL | __GFP_NOFAIL allocations that need to be 
addressed in phase three.
From: Pekka Enberg
Date: Monday, August 23, 2010 - 1:29 pm

Hi David,



My point is that I don't think Andrew's helper will change all that
much. Fixing the actual callers is the hard part and I don't see your
patches helping that either. Hiding the looping in filesystem code is
only going to make problematic callers harder to find (e.g
kmem_alloc() in XFS code).

FWIW, I'd just add a nasty WARN_ON in the page allocator and put a big
fat "fix your shit" comment on top of it to motivate people to fix
their code.

                         Pekka
--

From: David Rientjes
Date: Monday, August 23, 2010 - 1:40 pm

Nothing is getting hidden in the callers here, all of the patches in this 
series are using __GFP_NOFAIL unnecessarily.
--

From: David Rientjes
Date: Monday, August 16, 2010 - 7:57 pm

Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/btrfs/extent-tree.c |   18 ++++++++++++++----
 fs/btrfs/inode.c       |    5 ++++-
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3822,6 +3822,8 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			spin_unlock(&cache->lock);
 			spin_unlock(&cache->space_info->lock);
 		} else {
+			int err;
+
 			old_val -= num_bytes;
 			btrfs_set_block_group_used(&cache->item, old_val);
 			cache->pinned += num_bytes;
@@ -3831,9 +3833,12 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			spin_unlock(&cache->lock);
 			spin_unlock(&cache->space_info->lock);
 
-			set_extent_dirty(info->pinned_extents,
+			do {
+				/* FIXME: this may potentially loop forever */
+				err = set_extent_dirty(info->pinned_extents,
 					 bytenr, bytenr + num_bytes - 1,
-					 GFP_NOFS | __GFP_NOFAIL);
+					 GFP_NOFS);
+			} while (err);
 		}
 		btrfs_put_block_group(cache);
 		total -= num_bytes;
@@ -3861,6 +3866,8 @@ static int pin_down_extent(struct btrfs_root *root,
 			   struct btrfs_block_group_cache *cache,
 			   u64 bytenr, u64 num_bytes, int reserved)
 {
+	int err;
+
 	spin_lock(&cache->space_info->lock);
 	spin_lock(&cache->lock);
 	cache->pinned += num_bytes;
@@ -3872,8 +3879,11 @@ static int pin_down_extent(struct btrfs_root *root,
 	spin_unlock(&cache->lock);
 	spin_unlock(&cache->space_info->lock);
 
-	set_extent_dirty(root->fs_info->pinned_extents, bytenr,
-			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
+	do {
+		/* FIXME: this may potentially loop forever */
+		err = set_extent_dirty(root->fs_info->pinned_extents, bytenr,
+			 bytenr + num_bytes - 1, GFP_NOFS);
+	} while (err);
 	return 0;
 }
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
--- ...
From: David Rientjes
Date: Monday, August 16, 2010 - 7:57 pm

Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/gfs2/log.c     |   10 ++++++++--
 fs/gfs2/meta_io.c |    5 ++++-
 fs/gfs2/rgrp.c    |   27 +++++++++++++++------------
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -523,7 +523,10 @@ struct buffer_head *gfs2_log_fake_buf(struct gfs2_sbd *sdp,
 	u64 blkno = log_bmap(sdp, sdp->sd_log_flush_head);
 	struct buffer_head *bh;
 
-	bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);
+	do {
+		/* FIXME: this may potentially loop forever */
+		bh = alloc_buffer_head(GFP_NOFS);
+	} while (!bh);
 	atomic_set(&bh->b_count, 1);
 	bh->b_state = (1 << BH_Mapped) | (1 << BH_Uptodate) | (1 << BH_Lock);
 	set_bh_page(bh, real->b_page, bh_offset(real));
@@ -709,7 +712,10 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl)
 	}
 	trace_gfs2_log_flush(sdp, 1);
 
-	ai = kzalloc(sizeof(struct gfs2_ail), GFP_NOFS | __GFP_NOFAIL);
+	do {
+		/* FIXME: this may potentially loop forever */
+		ai = kzalloc(sizeof(struct gfs2_ail), GFP_NOFS);
+	} while (!ai);
 	INIT_LIST_HEAD(&ai->ai_ail1_list);
 	INIT_LIST_HEAD(&ai->ai_ail2_list);
 
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -289,7 +289,10 @@ void gfs2_attach_bufdata(struct gfs2_glock *gl, struct buffer_head *bh,
 		return;
 	}
 
-	bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS | __GFP_NOFAIL);
+	do {
+		/* FIXME: this may potentially loop forever */
+		bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS);
+	} while (!bd);
 	bd->bd_bh = bh;
 	bd->bd_gl = gl;
 
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1440,8 +1440,11 @@ static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd *sdp, u64 bstart,
 		rgrp_blk++;
 
 		if (!bi->bi_clone) {
-			bi->bi_clone = ...
From: David Rientjes
Date: Monday, August 16, 2010 - 7:58 pm

Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.

The error handling when kzalloc() returns NULL in start_this_handle()
was removed since it was unreachable.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/jbd/journal.c     |    5 ++++-
 fs/jbd/transaction.c |   14 ++++++--------
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -301,7 +301,10 @@ int journal_write_metadata_buffer(transaction_t *transaction,
 	 */
 	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
 
-	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
+	do {
+		/* FIXME: this may potentially loop forever */
+		new_bh = alloc_buffer_head(GFP_NOFS);
+	} while (!new_bh);
 	/* keep subsequent assertions sane */
 	new_bh->b_state = 0;
 	init_buffer(new_bh, NULL, NULL);
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -98,14 +98,12 @@ static int start_this_handle(journal_t *journal, handle_t *handle)
 	}
 
 alloc_transaction:
-	if (!journal->j_running_transaction) {
-		new_transaction = kzalloc(sizeof(*new_transaction),
-						GFP_NOFS|__GFP_NOFAIL);
-		if (!new_transaction) {
-			ret = -ENOMEM;
-			goto out;
-		}
-	}
+	if (!journal->j_running_transaction)
+		do {
+			/* FIXME: this may potentially loop forever */
+			new_transaction = kzalloc(sizeof(*new_transaction),
+								GFP_NOFS);
+		} while (!new_transaction);
 
 	jbd_debug(3, "New handle %p going live.\n", handle);
 
--

From: Jan Kara
Date: Tuesday, August 17, 2010 - 2:51 am

Thanks! I've added the patch to my tree. Since rc1 is over, I think this
is a material for the next merge window, right? I can take care of pushing
it. If you want to push the change yourself, feel free to add
  Acked-by: Jan Kara <jack@suse.cz>

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: David Rientjes
Date: Tuesday, August 17, 2010 - 10:48 am

Yes, we still need to switch over GFP_KERNEL callers and remove the flag 
completely, so there's no hurry for this to go into 2.6.36.
--

From: Andrew Morton
Date: Monday, August 23, 2010 - 12:28 pm

On Tue, 17 Aug 2010 11:51:03 +0200

Please unadd it.  JBD should be fixed so that it can appropriately
handle out-of-memory conditions.  Until that time we shouldn't hide its
shortcomings with this open-coded equivalent.

--

From: Jan Kara
Date: Monday, August 23, 2010 - 3:03 pm

Well, I wanted to make it easy for David so that he can proceed with his
removal of __GFP_NOFAIL. I agree that pushing the looping from the
allocator to the callers seems of a disputable value to me as well.  So do
you think that we should keep __GFP_NOFAIL as long as all callers are not
able to handle allocation failures in more reasonable way?

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Andrew Morton
Date: Monday, August 23, 2010 - 3:11 pm

On Tue, 24 Aug 2010 00:03:47 +0200

The concept should be encapsulated in _some_ centralised fashion.

Helper functions would work as well as __GFP_NOFAIL, and will move any
runtime cost away from the good code and push it onto the bad code.

--

From: Jan Kara
Date: Monday, August 23, 2010 - 3:21 pm

Makes sense. Removed the patch.

  David, could you provide a function for non-failing allocation and then
use this from JBD and whatever else code is also affected? That looks like
a cleaner solution as Andrew points out...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: David Rientjes
Date: Monday, August 23, 2010 - 3:22 pm

There's no runtime cost on the bad code, the calls never loop since the 
page allocator already loops itself.  Regardless, I'll add a helper 
function to include/linux/gfp.h to do this with a WARN_ON_ONCE() inside 
the loop in case any order > PAGE_ALLOC_COSTLY_ORDER callers are ever 
added (and I really hope nobody merges those).
--

From: David Rientjes
Date: Monday, August 16, 2010 - 7:58 pm

Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.

Instead of maintaining a seperate _nofail() variant, it's possible to
remove it and use ntfs_malloc_nofs() instead.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/ntfs/malloc.h  |   17 -----------------
 fs/ntfs/runlist.c |    6 ++++--
 2 files changed, 4 insertions(+), 19 deletions(-)

diff --git a/fs/ntfs/malloc.h b/fs/ntfs/malloc.h
--- a/fs/ntfs/malloc.h
+++ b/fs/ntfs/malloc.h
@@ -66,23 +66,6 @@ static inline void *ntfs_malloc_nofs(unsigned long size)
 	return __ntfs_malloc(size, GFP_NOFS | __GFP_HIGHMEM);
 }
 
-/**
- * ntfs_malloc_nofs_nofail - allocate memory in multiples of pages
- * @size:	number of bytes to allocate
- *
- * Allocates @size bytes of memory, rounded up to multiples of PAGE_SIZE and
- * returns a pointer to the allocated memory.
- *
- * This function guarantees that the allocation will succeed.  It will sleep
- * for as long as it takes to complete the allocation.
- *
- * If there was insufficient memory to complete the request, return NULL.
- */
-static inline void *ntfs_malloc_nofs_nofail(unsigned long size)
-{
-	return __ntfs_malloc(size, GFP_NOFS | __GFP_HIGHMEM | __GFP_NOFAIL);
-}
-
 static inline void ntfs_free(void *addr)
 {
 	if (!is_vmalloc_addr(addr)) {
diff --git a/fs/ntfs/runlist.c b/fs/ntfs/runlist.c
--- a/fs/ntfs/runlist.c
+++ b/fs/ntfs/runlist.c
@@ -127,8 +127,10 @@ static inline runlist_element *ntfs_rl_realloc_nofail(runlist_element *rl,
 	if (old_size == new_size)
 		return rl;
 
-	new_rl = ntfs_malloc_nofs_nofail(new_size);
-	BUG_ON(!new_rl);
+	do {
+		/* FIXME: this may potentially loop forever */
+		new_rl = ntfs_malloc_nofs(new_size);
+	} while (!new_rl);
 
 	if (likely(rl != NULL)) {
 		if (unlikely(old_size > new_size))
--

From: David Rientjes
Date: Monday, August 16, 2010 - 7:58 pm

Removes the dependency on __GFP_NOFAIL by looping indefinitely in the
caller.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/reiserfs/journal.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -2593,8 +2593,11 @@ static int journal_read(struct super_block *sb)
 static struct reiserfs_journal_list *alloc_journal_list(struct super_block *s)
 {
 	struct reiserfs_journal_list *jl;
-	jl = kzalloc(sizeof(struct reiserfs_journal_list),
-		     GFP_NOFS | __GFP_NOFAIL);
+
+	do {
+		/* FIXME: this may potentially loop forever */
+		jl = kzalloc(sizeof(struct reiserfs_journal_list), GFP_NOFS);
+	} while (!jl);
 	INIT_LIST_HEAD(&jl->j_list);
 	INIT_LIST_HEAD(&jl->j_working_list);
 	INIT_LIST_HEAD(&jl->j_tail_bh_list);
--

Previous thread: linux-next: build warning after merge of the tree by Stephen Rothwell on Monday, August 16, 2010 - 7:24 pm. (2 messages)

Next thread: linux-next: Tree for August 17 by Stephen Rothwell on Monday, August 16, 2010 - 8:32 pm. (2 messages)