Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

Previous thread: [PATCH] Input: pxa27x_keypad - remove input_free_device() in pxa27x_keypad_remove() by Axel Lin on Tuesday, August 24, 2010 - 3:47 am. (1 message)

Next thread: [PATCH] fanotify: Return EPERM when a process is not privileged by Andreas Gruenbacher on Tuesday, August 24, 2010 - 3:58 am. (1 message)
From: David Rientjes
Date: Tuesday, August 24, 2010 - 3:50 am

Add kmalloc_nofail(), kcalloc_nofail(), and kzalloc_nofail().  These
functions are equivalent to kmalloc(), kcalloc(), and kzalloc(),
respectively, except that they will never return NULL and instead loop
forever trying to allocate memory.

If the first allocation attempt fails, a warning will be emitted,
including a call trace.  Subsequent failures will suppress this warning.

These were added as helper functions for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/md/dm-region-hash.c |    2 +-
 fs/btrfs/inode.c            |    2 +-
 fs/gfs2/log.c               |    2 +-
 fs/gfs2/rgrp.c              |   18 ++++---------
 fs/jbd/transaction.c        |   11 ++------
 fs/reiserfs/journal.c       |    3 +-
 include/linux/slab.h        |   55 +++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 68 insertions(+), 25 deletions(-)

diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -290,7 +290,7 @@ static struct dm_region *__rh_alloc(struct dm_region_hash *rh, region_t region)
 
 	nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC);
 	if (unlikely(!nreg))
-		nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL);
+		nreg = kmalloc_nofail(sizeof(*nreg), GFP_NOIO);
 
 	nreg->state = rh->log->type->in_sync(rh->log, region, 1) ?
 		      DM_RH_CLEAN : DM_RH_NOSYNC;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1967,7 +1967,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	if (atomic_add_unless(&inode->i_count, -1, 1))
 		return;
 
-	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
+	delayed = kmalloc_nofail(sizeof(*delayed), GFP_NOFS);
 	delayed->inode = inode;
 
 	spin_lock(&fs_info->delayed_iput_lock);
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -709,7 +709,7 @@ void ...
From: David Rientjes
Date: Tuesday, August 24, 2010 - 3:50 am

Add kmem_cache_zalloc_nofail().  This function is equivalent to
kmem_cache_zalloc(), except that it will never return NULL and instead
loop forever trying to allocate memory.

If the first allocation attempt fails, a warning will be emitted,
including a call trace.  Subsequent failures will suppress this warning.

This was added as a helper function for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/gfs2/meta_io.c    |    2 +-
 include/linux/slab.h |   18 ++++++++++++++++++
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -289,7 +289,7 @@ void gfs2_attach_bufdata(struct gfs2_glock *gl, struct buffer_head *bh,
 		return;
 	}
 
-	bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS | __GFP_NOFAIL);
+	bd = kmem_cache_zalloc_nofail(gfs2_bufdata_cachep, GFP_NOFS);
 	bd->bd_bh = bh;
 	bd->bd_gl = gl;
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -313,6 +313,24 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
 	return kmem_cache_alloc(k, flags | __GFP_ZERO);
 }
 
+/*
+ * NOTE: no new callers of this function should be implemented!
+ * All memory allocations should be failable whenever possible.
+ */
+static inline void *kmem_cache_zalloc_nofail(struct kmem_cache *k, gfp_t flags)
+{
+	void *ret;
+
+	for (;;) {
+		ret = kmem_cache_zalloc(k, flags);
+		if (ret)
+			return ret;
+		WARN_ONCE(1, "Out of memory, no fallback implemented "
+				"(flags=0x%x)\n",
+				flags);
+	}
+}
+
 /**
  * kzalloc - allocate memory. The memory is set to zero.
  * @size: how many bytes of memory are required.
--

From: David Rientjes
Date: Tuesday, August 24, 2010 - 3:50 am

Add alloc_buffer_head_nofail().  This function is equivalent to
alloc_buffer_head(), except that it will never return NULL and instead
loop forever trying to allocate memory.

If the first allocation attempt fails, a warning will be emitted,
including a call trace.  Subsequent failures will suppress this warning.

This was added as a helper function for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/buffer.c                 |   18 ++++++++++++++++++
 fs/gfs2/log.c               |    2 +-
 fs/jbd/journal.c            |    2 +-
 include/linux/buffer_head.h |    1 +
 4 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3238,6 +3238,24 @@ struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 }
 EXPORT_SYMBOL(alloc_buffer_head);
 
+/*
+ * NOTE: no new callers of this function should be implemented!
+ * All memory allocations should be failable whenever possible.
+ */
+struct buffer_head *alloc_buffer_head_nofail(gfp_t gfp_flags)
+{
+	struct buffer_head *ret;
+
+	for (;;) {
+		ret = alloc_buffer_head(gfp_flags);
+		if (ret)
+			return ret;
+		WARN_ONCE(1, "Out of memory; no fallback implemented "
+				"(flags=0x%x)\n",
+				gfp_flags);
+	}
+}
+
 void free_buffer_head(struct buffer_head *bh)
 {
 	BUG_ON(!list_empty(&bh->b_assoc_buffers));
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -523,7 +523,7 @@ struct buffer_head *gfs2_log_fake_buf(struct gfs2_sbd *sdp,
 	u64 blkno = log_bmap(sdp, sdp->sd_log_flush_head);
 	struct buffer_head *bh;
 
-	bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);
+	bh = alloc_buffer_head_nofail(GFP_NOFS);
 	atomic_set(&bh->b_count, 1);
 	bh->b_state = (1 << BH_Mapped) | (1 << BH_Uptodate) | (1 << BH_Lock);
 	set_bh_page(bh, real->b_page, bh_offset(real));
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
--- a/fs/jbd/journal.c
+++ ...
From: Jan Kara
Date: Tuesday, August 24, 2010 - 5:17 am

Acked-by: Jan Kara <jack@suse.cz>
for the JBD part here as well.

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: David Rientjes
Date: Tuesday, August 24, 2010 - 3:50 am

Add set_extent_dirty_nofail().  This function is equivalent to
set_extent_dirty(), except that it will never fail because of allocation
failure and instead loop forever trying to allocate memory.

If the first allocation attempt fails, a warning will be emitted,
including a call trace.  Subsequent failures will suppress this warning.

This was added as a helper function for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/btrfs/extent-tree.c |    8 ++++----
 fs/btrfs/extent_io.c   |   19 +++++++++++++++++++
 fs/btrfs/extent_io.h   |    2 ++
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3831,9 +3831,9 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			spin_unlock(&cache->lock);
 			spin_unlock(&cache->space_info->lock);
 
-			set_extent_dirty(info->pinned_extents,
+			set_extent_dirty_nofail(info->pinned_extents,
 					 bytenr, bytenr + num_bytes - 1,
-					 GFP_NOFS | __GFP_NOFAIL);
+					 GFP_NOFS);
 		}
 		btrfs_put_block_group(cache);
 		total -= num_bytes;
@@ -3872,8 +3872,8 @@ static int pin_down_extent(struct btrfs_root *root,
 	spin_unlock(&cache->lock);
 	spin_unlock(&cache->space_info->lock);
 
-	set_extent_dirty(root->fs_info->pinned_extents, bytenr,
-			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
+	set_extent_dirty_nofail(root->fs_info->pinned_extents, bytenr,
+			 bytenr + num_bytes - 1, GFP_NOFS);
 	return 0;
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -940,6 +940,25 @@ int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
 			      NULL, mask);
 }
 
+/*
+ * NOTE: no new callers of this function should be implemented!
+ * All memory allocations should be failable whenever possible.
+ */
+int set_extent_dirty_nofail(struct ...
From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 6:30 am

I'd much rather someone help mason to clean this up.
--

From: David Rientjes
Date: Tuesday, August 24, 2010 - 3:50 am

Reimplement ntfs_malloc_nofs_nofail() to loop forever calling
ntfs_malloc_nofs() until the allocation succeeds.

If the first allocation attempt fails, a warning will be emitted,
including a call trace.  Subsequent failures will suppress this warning.

This was added as a helper function for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/ntfs/malloc.h |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ntfs/malloc.h b/fs/ntfs/malloc.h
--- a/fs/ntfs/malloc.h
+++ b/fs/ntfs/malloc.h
@@ -76,11 +76,21 @@ static inline void *ntfs_malloc_nofs(unsigned long size)
  * This function guarantees that the allocation will succeed.  It will sleep
  * for as long as it takes to complete the allocation.
  *
- * If there was insufficient memory to complete the request, return NULL.
+ * NOTE: no new callers of this function should be implemented!
+ * All memory allocations should be failable whenever possible.
  */
 static inline void *ntfs_malloc_nofs_nofail(unsigned long size)
 {
-	return __ntfs_malloc(size, GFP_NOFS | __GFP_HIGHMEM | __GFP_NOFAIL);
+	void *ret;
+
+	for (;;) {
+		ret = ntfs_malloc_nofs(size);
+		if (ret)
+			return ret;
+		WARN_ONCE(1, "Out of memory, no fallback implemented "
+				"(size=%lu)\n",
+				size);
+	}
 }
 
 static inline void ntfs_free(void *addr)
--

From: Jan Kara
Date: Tuesday, August 24, 2010 - 5:15 am

This looks like a cleaner approach than the previous one. You can add
Acked-by: Jan Kara <jack@suse.cz>
  for the JBD part.

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 6:29 am

git grep GFP_NOFAIL isn't auditable enough?

might as well declare these functions depricated if you really want to
do this.

FWIW I think mason said he'd fix btrfs to not suck like this 'soon'.
--

From: Jens Axboe
Date: Tuesday, August 24, 2010 - 6:33 am

Agree, adding a helper function would seem to be going in the reverse
direction imho. Nobody will read that comment on top of the function.

Should be possible to warn at build time for anyone using __GFP_NOFAIL
without wrapping it in a function.

-- 
Jens Axboe

--

From: David Rientjes
Date: Tuesday, August 24, 2010 - 1:11 pm

We could make this __deprecated functions as Peter suggested if you think 
build time warnings for existing users would be helpful.
--

From: Ted Ts'o
Date: Wednesday, August 25, 2010 - 4:24 am

Let me take a few steps backwards and look at the problem from a
somewhat higher level.

Part of the problem is that we have a few places in the kernel where
failure is really not an option --- or rather, if we're going to fail
while we're in the middle of doing a commit, our choices really are
(a) retry the loop in the jbd layer (which Andrew really doesn't
like), (b) keep our own private cache of free memory so we don't fail
and/or loop, (c) fail the file system and mark it read-only, or (d)
panic.

There are other places where we can fail safely (for example, in jbd's
start_this_handle, although that just pushes the layer up the stack,
and ultimately, to userspace where most userspace programs don't
really expect ENOMEM to get returned by a random disk write --- how
much do _you_ trust a random GNOME or KDE developer to do correct
error checking and do something sane at the application?)

So we can mark the retry loop helper function as deprecated, and that
will make some of these cases go away, but ultimately if we're going
to fail the memory allocation, something bad is going to happen, and
the only question is whether we want to have something bad happen by
looping in the memory allocator, or to force the file system to
panic/oops the system, or have random application die and/or lose user
data because they don't expect write() to return ENOMEM.

So at some level it would be nice if we had a few different levels of
"we *really* need this memory".  Level 0 might be, "if we can't get
it, no biggie, we'll figure out some other way around it.  I doubt
there is much at level 0, but in theory, we could have some good
citizens that fall in that camp and who simply will bypass some
optimization if they can't get the memory.  Level 1 might be, if you
can't get the memory, we will propagate a failure up to userspace, but
it's probably a relatively "safe" place to fail (i.e., when the user
is opening a file).  Level 2 might be, "if you can't get the memory,
we will propgate a ...
From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 4:35 am

d) do the allocation before you're committed to going fwd and can still
fail and back out.
--

From: Ted Ts'o
Date: Wednesday, August 25, 2010 - 4:57 am

Sure in some cases that can be done, but the commit has to happen at
some point, or we run out of journal space, at which point we're back
to (c) or (d).

						- Ted


--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 5:48 am

Well (b) sounds a lot saner than either of those. Simply revert to a
state that is sub-optimal but has bounded memory use and reserve that
memory up-front. That way you can always get out of a tight memory spot.

Its what the block layer has always done to avoid the memory deadlock
situation, it has a private stash of BIOs that is big enough to always
service some IO, and as long as IO is happening stuff keeps moving fwd
and we don't deadlock.

Filesystems might have a slightly harder time creating such a bounded
state because there might be more involved like journals and the like,
but still it should be possible to create something like that (my swap
over nfs patches created such a state for the network rx side of
things).

Also, we cannot let our fear of crappy userspace get in the way of doing
sensible things. Your example of write(2) returning -ENOMEM is not
correct though, the syscall (and the page_mkwrite callback for mmap()s)
happens before we actually dirty the data and need to write things out,
so we can always simply wait for memory to become available to dirty.


--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 5:52 am

Also, there's a good reason for disliking (a), its a deadlock scenario,
suppose we need to write out data to get free pages, but the writing out
is blocked on requiring free pages.

There's really nothing the page allocator can do to help you there, its
a situation you have to avoid getting into.


--

From: Theodore Tso
Date: Wednesday, August 25, 2010 - 6:20 am

Well, if all of these users start having their own private pools of emergency memory, I'm not sure that's such a great idea either.

And in some cases, there *is* extra memory.  For example, if the reason why the page allocator failed is because there isn't enough memory in the current process's cgroup, maybe it's important enough that the kernel code might decide to say, "violate the cgroup constraints --- it's more important that we not bring down the entire system" than to honor whatever yahoo decided that a particular cgroup has been set down to something ridiculous like 512mb, when there's plenty of free physical memory --- but not in that cgroup.

-- Ted

--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 6:31 am

I'm not sure, but I think the cgroup thing doesn't account kernel
allocations, in which case the above problem doesn't exist.

For the cpuset case we punch through the cpuset constraints for kernel
allocations (unless __GFP_HARDWALL).
--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 1:43 pm

Right, the only cgroup that does is cpusets because it binds the memory 

__GFP_HARDWALL doesn't mean that the allocation won't be constrained, this 
is a common misconception.  __GFP_HARDWALL only prevents us from looking 
at our cpuset.mem_exclusive flag and checking our nearest common ancestor 
cpuset if we can block.

The cpusets case is actually the easiest to fix: use GFP_ATOMIC.  
GFP_ATOMIC allocations aren't bound by any cpuset and, in the general 
case, can allocate below the min watermark because of 
ALLOC_HARD | ALLOC_HARDER in the page allocator which creates the notion 
of "memory reserves" available to these tasks.  Then, success really 
depends on the setting of the watermarks instead.
--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 1:55 pm

I don't think that's a valid usage of GFP_ATOMIC, I think we should
fallback to outside the cpuset for kernel allocations by default.
--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 2:11 pm

Cpusets doesn't enforce isolation for only user memory, it's always bound 
_all_ allocations that aren't atomic or in irq context (or oom killed 
tasks).  Allowing slab, for example, to be allocated in other cpusets 
could cause them to oom themselves since they are bound by the same memory 
isolation policy that all other cpusets are.  We'd get random oom 
conditions in cpusets only depending on where the slab was allocated at 
now fault to those applications themselves, and that's certainly not a 
situation we want.  The memory controller cgroup also has slab accounting 
on their TODO list.

If you think GFP_ATOMIC is inappropriate in these contexts, then they are 
by definition blockable.  So this seems like a good candidate for using 
memory compaction since we're talking only about PAGE_ALLOC_COSTLY_ORDER 
and higher allocs, even though it's only currently configurable for 
hugepages.

There's still no hard guarantee that the memory will allocatable 
(GFP_KERNEL, the compaction, then GFP_ATOMIC could all still fail), but I 
don't see how continuously looping the page allocator is possibly supposed 
to help in these situations.
--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 2:27 pm

Why do you think I'm a proponent of that behaviour?

I've been arguing that the existance of GFP_NOFAIL is the bug, and I
started the whole discussion because your patchset didn't outline the
purpose of its existance, it merely changes __GFP_NOFAIL usage into
$foo_nofail() functions, which on its own is a rather daft change.

Optimizing the page allocator by removing those conditional from its
innards into an outer loop not used by most callers seems a fine goal,
but you didn't state that.

Also, I like the postfix proposed by Andi better: _i_suck() :-)
--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 4:11 pm

I originally pushed these to the callers, but Andrew thought it would be 
better so that we could audit the existing users that have no fallback 
behavior, even though they could go implement it on their own (like 
getblk() which actually does try _some_ memory freeing).  It eliminates 
the flag from the page allocator and saves branches in the slowpath.  We 
don't need this logic in the allocator itself, it can exist at a higher 
level and, with deprecation, will hopefully be incentive enough for those 
subsystems to fix the issue.

I'll repropose the patchset with __deprecated as you suggested.  Thanks!
--

From: Ted Ts'o
Date: Wednesday, August 25, 2010 - 5:19 pm

And what Dave and I are saying is that we'll either need to do our on
loop to avoid the deprecation warning, or the use of the deprecated
function will probably be used forever...

	      	       	     	 - Ted
--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 5:30 pm

We certainly hope that nobody will reimplement the same function without 
the __deprecated warning, especially for order < PAGE_ALLOC_COSTLY_ORDER 
where there's no looping at a higher level.  So perhaps the best 
alternative is to implement the same _nofail() functions but do a 
WARN_ON(get_order(size) > PAGE_ALLOC_COSTLY_ORDER) instead?

I think it's really sad that the caller can't know what the upper bounds 
of its memory requirement are ahead of time or at least be able to 
implement a memory freeing function when kmalloc() returns NULL.
--

From: Ted Ts'o
Date: Wednesday, August 25, 2010 - 6:48 pm

Oh, we can determine an upper bound.  You might just not like it.
Actually ext3/ext4 shouldn't be as bad as XFS, which Dave estimated to
be around 400k for a transaction.  My guess is that the worst case for
ext3/ext4 is probably around 256k or so; like XFS, most of the time,
it would be a lot less.  (At least, if data != journalled; if we are
doing data journalling and every single data block begins with
0xc03b3998U, we'll need to allocate a 4k page for every single data
block written.)  We could dynamically calculate an upper bound if we
had to.  Of course, if ext3/ext4 is attached to a network block
device, then it could get a lot worse than 256k, of course.

	      	      	      	    	- Ted

--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 8:09 pm

Ok, and we'll make it a WARN_ON_ONCE() to be nice to the kernel log.  
Although the current patchset does this with WARN_ON_ONCE(1, ...) instead, 
this serves to ensure that we aren't dependent on the page allocator's 
implementation to always loop for order < PAGE_ALLOC_COSTLY_ORDER in which 

On my 8GB machine, /proc/zoneinfo says the min watermark for ZONE_NORMAL 
is 5086 pages, or ~20MB.  GFP_ATOMIC would allow access to ~12MB of that, 
so perhaps we should consider this is an acceptable abuse of GFP_ATOMIC as 
a fallback behavior when GFP_NOFS or GFP_NOIO fails?
--

From: Dave Chinner
Date: Thursday, August 26, 2010 - 12:06 am

It would take a handful of concurrent transactions in XFS with
worst case memory allocation requirements to exhaust that pool, and
then we really would be in trouble.  Alternatively, it would take a
few allocations from each of a couple of thousand concurrent
transactions to get to the same point.

Bound memory pools only work when serialised access to the pool can
be enforced and there are no dependencies on other operations in
progress for completion of the work and freeing of the memory.
This is where it becomes exceedingly difficult to guarantee
progress.

One of the ideas that has floated around (I think Mel Gorman came up
with it first) was that if hardening the filesystem is so difficult,
why not just harden a single path via a single thread? e.g. we allow
the bdi flusher thread to have a separate reserve pool of free
pages, and when memory allocations start to fail, then that thread
can dip into it's pool to complete the writeback of the dirty pages
being flushed.  When a fileystem attaches to a bdi, it can specify
the size of the reserve pool it needs. 

This can be easily tested for during allocation (say a PF_ flag) and
switched to the reserve pool as necessary. because it is per-thread,
access to the pool is guaranteed to serialised. Memory reclaim can
then refill these pools before putting pages on freelists. This
could give us a mechanism for ensuring that allocations succeed in
the ->writepage path without needing to care about filesystem
implementation details.

And in the case of ext3/4, a pool could be attached to the jbd
thread as well so that it never starves of memory when commits are
required...

So, rather than turning filesystems upside down, maybe we should
revisit per-thread reserve pools for threads that are tasked with
cleaning pages for the VM?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Peter Zijlstra
Date: Thursday, August 26, 2010 - 1:29 am

Agreed with the fact that 400k isn't much to worry about.

Not agreed with the GFP_ATOMIC stmt.

Direct reclaim already has PF_MEMALLOC, but then we also need a
concurrency limit on that, otherwise you can still easily blow though
your reserves by having 100 concurrency users of it, resulting in an
upper bound of 40000k instead, which will be too much.

There were patches to limit the direct reclaim contexts, not sure they
ever got anywhere..

It is something to consider in the re-design of the whole
direct-reclaim/writeback paths though..
--

From: Dave Chinner
Date: Wednesday, August 25, 2010 - 11:38 pm

For a 4k block size filesystem.

If I use 64k block size directories (which XFS can even on 4k page
size machines), the maximum transaction reservation goes up to at
around 3MB, and that's just for blocks being _modified_. It's not
the limit on the amount of memory that may need to be allocated

Right, it usually is a lot less, but one of the big problems is that
during low memory situations memory reclaim of the metadata page
cache actually causes _more_ memory allocation during tranactions
than otherwise would occur.......

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 6:34 am

That's a secondary problem, and could be reduced by something like the
memory reservation system I provided in the swap-over-nfs patches. Of
course, when all these users can actually happen concurrently there's
really nothing much you can do about it.


--

From: Dave Chinner
Date: Wednesday, August 25, 2010 - 6:24 am

Filesystems are way more complex than the block layer - the block
layer simply doesn't have to handle situations were thread X is
holding A, B and C, while thread Y needs C to complete the
transaction. thread Y is the user of the low memory pool, but has
almost depleted it and so even if we swith to thread X, the pool doe
snot have enouhg memory for X to complete and allow us to switch
back to Y and have it complete, freeing the memory from the pool
that it holds.

That is, the guarantee that we will always make progress simply does
not exist in filesystems, so a mempool-like concept seems to me to
be doomed from the start....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 6:35 am

While I appreciate that it might be somewhat (a lot) harder for a
filesystem to provide that guarantee, I'd be deeply worried about your
claim that its impossible.

It would render a system without swap very prone to deadlocks. Even with
the very tight dirty page accounting we currently have you can fill all
your memory with anonymous pages, at which point there's nothing free
and you require writeout of dirty pages to succeed.

--

From: Ted Ts'o
Date: Wednesday, August 25, 2010 - 1:53 pm

For file systems that do delayed allocation, the situation is very
similar to swapping over NFS.  Sometimes in order to make some free
memory, you need to spend some free memory...  which implies that for
these file systems, being more aggressive about triggering writeout,
and being more aggressive about throttling processes which are
creating too many dirty pages, especially dirty delayed allocaiton
pages (regardless of whether this is via write(2) or accessing mmapped
memory), is a really good idea.

A pool of free pages which is reserved for routines that are doing
page cleaning would probably also be a good idea.  Maybe that's just
retrying with GFP_ATOMIC if a normal allocation fails, or maybe we
need our own special pool, or maybe we need to dynamically resize the
GFP_ATOMIC pool based on how many subsystems might need to use it....

Just brainstorming here; what do people think?

							- Ted
--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 1:59 pm

PF_MEMALLOC is used for those tasks, not GFP_ATOMIC.
--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 2:35 pm

Which means you need to be able to compute a bounded amount of that

That seems unrelated, the VM has a strict dirty limit and controls

We have a smallish reserve, accessible with PF_MEMALLOC, but its use is
not regulated nor bounded, it just mostly works good enough.


--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 1:58 pm

While I'd really love for callers to be able to gracefully handle getting 
a NULL back from the page allocator in all cases, it's not a prerequisite 
for this patchset.  This patchset actually does nothing interesting except 
removing the __GFP_NOFAIL bit from their gfp mask.  All of these 
allocations already loop looking for memory because they have orders that 
are less than PAGE_ALLOC_COSTLY_ORDER (which defaults to 3).  So the loops 
in kzalloc_nofail(), etc., never actually loop.

Demanding that the page allocator return order-3 memory in any context is 
a non-starter, so I'm not really interested in that.  I'm more concerned 
about proper error handling being implemented for these callers iff 
someone redefines PAGE_ALLOC_COSTLY_ORDER to something else, perhaps even 
0.

Callers can, when desperate for memory, use GFP_ATOMIC to use some memory 
reserves across zones, hopefully order-0 and not an egregious amount.  But 
the remainder of the burden really is back on the caller when this is 
depleted or it needs higher order allocs to be fixed in a way that doesn't 
rely on memory that doesn't exist.  That's an implementation choice by the 
caller and I agree that some failsafe behavior is the only way that we 
don't get really bad results like corrupted user data or filesystems.
--

From: Christoph Lameter
Date: Wednesday, August 25, 2010 - 2:11 pm

We already have __GFP_NOFAIL behavior for slab allocations since
a __GFP_NOFAIL flag is passed through to the page allocator if no objects
are available.

The page allocator will do the same as when called directly with
__GFP_NOFAIL and so we have consistent behavior regardless of the
allocator used.




--

From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 2:21 pm

Thank's for quoting the context to which you're replying, that really
helps parsing things..

--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 2:23 pm

It all depends on what flags are passed to kmalloc(), slab nor slub 
enforce __GFP_NOFAIL behavior themselves.  In slab, cache_grow() will 
return NULL depending on whether the page allocator returns NULL, and that 
would only happen for __GFP_NORETRY or
cachep->gfp->gfporder >= PAGE_ALLOC_COSTLY_ORDER.  In slub, the default 
order is tried with __GFP_NORETRY and if it returns NULL, the higher order 
alloc will fail under the same circumstances.  So the nofail behavior for 
slab depends only on the flags passed from the caller.
--

From: Christoph Lameter
Date: Wednesday, August 25, 2010 - 2:35 pm

If the higher order fails in slub then an order 0 alloc is attempted
without __GFP_NORETRY. In both cases the nofail behavior of the page
allocator determines the outcode.

True if the caller mixes in __GFP_NORETRY then you may still get NULL. But
that is an issue that can be resolved by the caller.

--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 4:05 pm

Right, I thought you said the slab layer passes __GFP_NOFAIL when there's 
no objects available.
--

From: Christoph Lameter
Date: Wednesday, August 25, 2010 - 6:30 pm

Yes, the slab layer calls the page allocator when there are no objects
available and passes the __GFP_NOFAIL that the user may have set in the
call to the page allocator.

Why then add new functions that do the same?



--

From: David Rientjes
Date: Wednesday, August 25, 2010 - 8:12 pm

Because we can remove the flag, remove branches from the page allocator 
slowpath, and none of these allocations actually are dependent on 
__GFP_NOFAIL since they are all under PAGE_ALLOC_COSTLY_ORDER.
--

From: Christoph Lameter
Date: Thursday, August 26, 2010 - 7:16 am

Then we can simply remove __GFP_NOFAIL? Functions are only needed for
higher order allocs that can fail?

--

From: David Rientjes
Date: Thursday, August 26, 2010 - 3:31 pm

Yes, that's the intent.  We'd like to add the 
WARN_ON_ONCE(get_order(size) >= PAGE_ALLOC_COSTLY_ORDER) warning, though, 
so we're ensured that redefinition of that #define doesn't cause 
allocations to fail to that don't have appropriate error handling or 
future callers use higher order allocs.  The _nofail() functions help that 
and do some due diligence in ensuring that we aren't changing gfp flags 
based only on the current page allocator implementation which may later 
change with very specialized corner cases.
--

From: Dave Chinner
Date: Wednesday, August 25, 2010 - 5:09 pm

I didn't say impossible, just that there's no way we can always
guarantee of forward progress with a specific, bound pool of memory.

Sure, we know what the worst case amount of log space is needed for
each transaction (i.e. how many pages that will be dirtied), but
that does not take into account all the blocks that need to be read
to make those modifications, the memory needed for stuff like btree
cursors, log tickets, transaction commit vectors, btree blocks
needed to do the searches, etc.  A typical transaction reservation
on a 4k block filesystem is between 200-400k (it's worst case), and
if you add in all the other allocations that might be required,
we're at the order of requiring megabytes of RAM to guarantee a
single transaction will succeed in low memory conditions. The exact
requirement is very difficult to quantify algorithmically, but for a
single transaction it should be possible.

However, consider the case of running a thousand concurrent
transactions and in the middle of that the system runs out of
memory. All the transactions need memory allocation to succeed, some
are blocked waiting for resources held in other transactions, etc.
Firstly, how to you stop all the transactions from making further
progress to serialise access to the low memory pool?  Secondly, how
do you select which transaction you want to use the low memory pool?
What do you do if the selected transaction then blocks on a resource
held by another transaction (which you can't know ahead of time)? Do
you switch to another thread and hope the pool doesn't run dry? What
do you do when (not if) the memory pool runs dry?

I'm sure this could be done, but it's lot of difficult, unrewarding
work that greatly increases code complexity, touches a massive
amount of the filesystem code base, exponentially increases the test
matrix, is likely to have significant operational overhead, and even
then there's no guarantee that we've got it right. That doesn't

Then don't allow anonymous pages to fill all ...
From: Peter Zijlstra
Date: Wednesday, August 25, 2010 - 7:13 am

While talking with Chris about this, if you can indeed push the error
out that far you can basically ensure this particular fs-op does not
complicate the journal commit and thereby limit the number of extra
entries in your journal, and thus the amount of memory required.

So once stuff starts failing, push out ops back out of the filesystem
code, force a journal commit, and then let these ops retry. There is no
need to actually push the -ENOMEM all the way back to userspace.


--

From: Dave Chinner
Date: Tuesday, August 24, 2010 - 6:55 am

Also, if you are going to add tight loops, you might want to put a
backoff in the loops like "congestion_wait(BLK_RW_ASYNC, HZ/50);" so
that they don't spin....

FWIW, in all this "allocations can't fail" churn, no one has noticed
that XFS has been doing these "allocations can't fail" loop in
kmem_alloc() and kmem_zone_alloc(), well, forever. I can't ever
remember seeing it report a potential deadlock, though....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Peter Zijlstra
Date: Tuesday, August 24, 2010 - 7:03 am

I occasionally check if its still there and cry a little ;-)
--

From: David Rientjes
Date: Tuesday, August 24, 2010 - 1:12 pm

These loops don't actually loop at all, all users are passing 
order < PAGE_ALLOC_COSTLY_ORDER which implicitly loop forever in the page 
allocator without killing anything (they are all GFP_NOIO or GFP_NOFS, so 
the oom killer isn't involved).
--

From: David Rientjes
Date: Tuesday, August 24, 2010 - 1:08 pm

I'm trying to remove that bit, and __GFP_REPEAT, to simplify and remove 

Great!
--

From: David Rientjes
Date: Wednesday, September 1, 2010 - 6:02 pm

Add kmalloc_nofail(), kcalloc_nofail(), and kzalloc_nofail().  These
functions are equivalent to kmalloc(), kcalloc(), and kzalloc(),
respectively, except that they will never return NULL and instead loop
forever trying to allocate memory.

If the first allocation attempt fails because the page allocator doesn't 
implicitly loop, a warning will be emitted, including a call trace.
Subsequent failures will suppress this warning.

These were added as helper functions for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 drivers/md/dm-region-hash.c |    2 +-
 fs/btrfs/inode.c            |    2 +-
 fs/gfs2/log.c               |    2 +-
 fs/gfs2/rgrp.c              |   18 +++++----------
 fs/jbd/transaction.c        |   11 ++------
 fs/reiserfs/journal.c       |    3 +-
 include/linux/slab.h        |   51 +++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 64 insertions(+), 25 deletions(-)

diff --git a/drivers/md/dm-region-hash.c b/drivers/md/dm-region-hash.c
--- a/drivers/md/dm-region-hash.c
+++ b/drivers/md/dm-region-hash.c
@@ -290,7 +290,7 @@ static struct dm_region *__rh_alloc(struct dm_region_hash *rh, region_t region)
 
 	nreg = mempool_alloc(rh->region_pool, GFP_ATOMIC);
 	if (unlikely(!nreg))
-		nreg = kmalloc(sizeof(*nreg), GFP_NOIO | __GFP_NOFAIL);
+		nreg = kmalloc_nofail(sizeof(*nreg), GFP_NOIO);
 
 	nreg->state = rh->log->type->in_sync(rh->log, region, 1) ?
 		      DM_RH_CLEAN : DM_RH_NOSYNC;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1967,7 +1967,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	if (atomic_add_unless(&inode->i_count, -1, 1))
 		return;
 
-	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
+	delayed = kmalloc_nofail(sizeof(*delayed), GFP_NOFS);
 	delayed->inode = inode;
 
 	spin_lock(&fs_info->delayed_iput_lock);
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
--- a/fs/gfs2/log.c
+++ ...
From: David Rientjes
Date: Wednesday, September 1, 2010 - 6:03 pm

Add kmem_cache_zalloc_nofail().  This function is equivalent to
kmem_cache_zalloc(), except that it will never return NULL and instead
loop forever trying to allocate memory.

If the first allocation attempt fails because the page allocator doesn't 
implicitly loop, a warning will be emitted, including a call trace.
Subsequent failures will suppress this warning.

This was added as a helper function for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/gfs2/meta_io.c    |    2 +-
 include/linux/slab.h |   17 +++++++++++++++++
 2 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -289,7 +289,7 @@ void gfs2_attach_bufdata(struct gfs2_glock *gl, struct buffer_head *bh,
 		return;
 	}
 
-	bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS | __GFP_NOFAIL);
+	bd = kmem_cache_zalloc_nofail(gfs2_bufdata_cachep, GFP_NOFS);
 	bd->bd_bh = bh;
 	bd->bd_gl = gl;
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -313,6 +313,23 @@ static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
 	return kmem_cache_alloc(k, flags | __GFP_ZERO);
 }
 
+/*
+ * NOTE: no new callers of this function should be implemented!
+ * All memory allocations should be failable whenever possible.
+ */
+static inline void *kmem_cache_zalloc_nofail(struct kmem_cache *k, gfp_t flags)
+{
+	void *ret;
+
+	for (;;) {
+		ret = kmem_cache_zalloc(k, flags);
+		if (ret)
+			return ret;
+		WARN_ON_ONCE(get_order(kmem_cache_size(k)) >
+						PAGE_ALLOC_COSTLY_ORDER);
+	}
+}
+
 /**
  * kzalloc - allocate memory. The memory is set to zero.
  * @size: how many bytes of memory are required.
--

From: David Rientjes
Date: Wednesday, September 1, 2010 - 6:03 pm

Add alloc_buffer_head_nofail().  This function is equivalent to
alloc_buffer_head(), except that it will never return NULL and instead
loop forever trying to allocate memory.

If the first allocation attempt fails because the page allocator doesn't
implicitly loop, a warning will be emitted, including a call trace.
Subsequent failures will suppress this warning.

This was added as a helper function for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/buffer.c                 |   17 +++++++++++++++++
 fs/gfs2/log.c               |    2 +-
 fs/jbd/journal.c            |    2 +-
 include/linux/buffer_head.h |    1 +
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3238,6 +3238,23 @@ struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 }
 EXPORT_SYMBOL(alloc_buffer_head);
 
+/*
+ * NOTE: no new callers of this function should be implemented!
+ * All memory allocations should be failable whenever possible.
+ */
+struct buffer_head *alloc_buffer_head_nofail(gfp_t gfp_flags)
+{
+	struct buffer_head *ret;
+
+	for (;;) {
+		ret = alloc_buffer_head(gfp_flags);
+		if (ret)
+			return ret;
+		WARN_ON_ONCE(get_order(sizeof(struct buffer_head)) >
+						PAGE_ALLOC_COSTLY_ORDER);
+	}
+}
+
 void free_buffer_head(struct buffer_head *bh)
 {
 	BUG_ON(!list_empty(&bh->b_assoc_buffers));
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -523,7 +523,7 @@ struct buffer_head *gfs2_log_fake_buf(struct gfs2_sbd *sdp,
 	u64 blkno = log_bmap(sdp, sdp->sd_log_flush_head);
 	struct buffer_head *bh;
 
-	bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);
+	bh = alloc_buffer_head_nofail(GFP_NOFS);
 	atomic_set(&bh->b_count, 1);
 	bh->b_state = (1 << BH_Mapped) | (1 << BH_Uptodate) | (1 << BH_Lock);
 	set_bh_page(bh, real->b_page, bh_offset(real));
diff --git a/fs/jbd/journal.c ...
From: David Rientjes
Date: Wednesday, September 1, 2010 - 6:03 pm

Reimplement ntfs_malloc_nofs_nofail() to loop forever calling
ntfs_malloc_nofs() until the allocation succeeds.

If the first allocation attempt fails because the page allocator doesn't
implicitly loop, a warning will be emitted, including a call trace.
Subsequent failures will suppress this warning.

This was added as a helper function for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/ntfs/malloc.h |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ntfs/malloc.h b/fs/ntfs/malloc.h
--- a/fs/ntfs/malloc.h
+++ b/fs/ntfs/malloc.h
@@ -76,11 +76,19 @@ static inline void *ntfs_malloc_nofs(unsigned long size)
  * This function guarantees that the allocation will succeed.  It will sleep
  * for as long as it takes to complete the allocation.
  *
- * If there was insufficient memory to complete the request, return NULL.
+ * NOTE: no new callers of this function should be implemented!
+ * All memory allocations should be failable whenever possible.
  */
 static inline void *ntfs_malloc_nofs_nofail(unsigned long size)
 {
-	return __ntfs_malloc(size, GFP_NOFS | __GFP_HIGHMEM | __GFP_NOFAIL);
+	void *ret;
+
+	for (;;) {
+		ret = ntfs_malloc_nofs(size);
+		if (ret)
+			return ret;
+		WARN_ON_ONCE(get_order(size) > PAGE_ALLOC_COSTLY_ORDER);
+	}
 }
 
 static inline void ntfs_free(void *addr)
--

From: Anton Altaparmakov
Date: Thursday, September 2, 2010 - 2:08 am

Hi David,


What is the relevance of the above get_order check?  ntfs_malloc_nofs() does a vmalloc() if size >= PAGE_SIZE so I cannot see how that has anything to do with get_order()?

Other than that patch looks fine.

Note the _nofail function is only called from one place and there will be no more call sites for it.  One day we ought to work something out so it is not needed at all but given it only gets called in very obscure circumstances (definitely not in the general code paths) it shouldn't matter too much...

Best regards,


Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

--

From: David Rientjes
Date: Sunday, September 5, 2010 - 3:55 pm

It probably shouldn't be doing vmalloc() for size == PAGE_SIZE, but I can 
understand how it would use it for higher order allocations.  I didn't 
look at that specifically because I assumed these were going through the 
page allocator (and no current user should have any order less than 
PAGE_ALLOC_COSTLY_ORDER, correct me if I'm wrong).

I'm pretty sure we want some sort of warning emitted once if this has the 
potential to loop forever, so it should probably be changed to WARN_ONCE() 

That would be great!
--

From: David Rientjes
Date: Wednesday, September 1, 2010 - 6:03 pm

Add set_extent_dirty_nofail().  This function is equivalent to
set_extent_dirty(), except that it will never fail because of allocation
failure and instead loop forever trying to allocate memory.

If the first allocation attempt fails because the page allocator doesn't
implicitly loop, a warning will be emitted, including a call trace.
Subsequent failures will suppress this warning.

This was added as a helper function for documentation and auditability.
No future callers should be added.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/btrfs/extent-tree.c |    8 ++++----
 fs/btrfs/extent_io.c   |   18 ++++++++++++++++++
 fs/btrfs/extent_io.h   |    2 ++
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3831,9 +3831,9 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 			spin_unlock(&cache->lock);
 			spin_unlock(&cache->space_info->lock);
 
-			set_extent_dirty(info->pinned_extents,
+			set_extent_dirty_nofail(info->pinned_extents,
 					 bytenr, bytenr + num_bytes - 1,
-					 GFP_NOFS | __GFP_NOFAIL);
+					 GFP_NOFS);
 		}
 		btrfs_put_block_group(cache);
 		total -= num_bytes;
@@ -3872,8 +3872,8 @@ static int pin_down_extent(struct btrfs_root *root,
 	spin_unlock(&cache->lock);
 	spin_unlock(&cache->space_info->lock);
 
-	set_extent_dirty(root->fs_info->pinned_extents, bytenr,
-			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
+	set_extent_dirty_nofail(root->fs_info->pinned_extents, bytenr,
+			 bytenr + num_bytes - 1, GFP_NOFS);
 	return 0;
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -940,6 +940,24 @@ int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
 			      NULL, mask);
 }
 
+/*
+ * NOTE: no new callers of this function should be implemented!
+ * All memory allocations should be failable whenever ...
From: Jiri Slaby
Date: Thursday, September 2, 2010 - 12:59 am

This doesn't work as you expect. kmalloc will warn every time it fails.
__GFP_NOFAIL used to disable the warning. Actually what's wrong with
__GFP_NOFAIL? I cannot find a reason in the changelogs why the patches



-- 
js
--

From: Jan Kara
Date: Thursday, September 2, 2010 - 7:51 am

David should probably add the reasoning to the changelogs so that he
doesn't have to explain again and again ;). But if I understood it
correctly, the concern is that the looping checks slightly impact fast path
of the callers which do not need it. Generally, also looping for a long
time inside allocator isn't a nice thing but some callers aren't able to do
better for now to the patch is imperfect in this sence...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Neil Brown
Date: Thursday, September 2, 2010 - 2:15 pm

On Thu, 2 Sep 2010 16:51:41 +0200

I'm actually a bit confused about this too.
I thought David said he was removing a branch on the *slow* path - which make
sense as you wouldn't even test NOFAIL until you had a failure.
Why are branches on the slow-path an issue??
This is an important question to me because I still hope to see the
swap-over-nfs patches merged eventually and they add a branch on the slow
path (if I remember correctly).

NeilBrown
--

From: David Rientjes
Date: Sunday, September 5, 2010 - 4:03 pm

They aren't necessarily an issue in the performance sense, this is a 
cleanup series since all converted callers to using these new functions 
(and the eventual removal of __GFP_NOFAIL entirely) are using the bit 
unnecessarily since they all have orders that implicitly loop in the page 
allocator forever already, with or without the flag.
--

From: David Rientjes
Date: Sunday, September 5, 2010 - 4:01 pm

It actually does work as I expect since the WARN_ON_ONCE() never even gets 
triggered here for any of kmalloc_nofail()'s callers since they all have 
sizes small enough that the conditional is never true.  The page allocator 
implicitly loops forever if the order <= PAGE_ALLOC_COSTLY_ORDER, so this 
warning is only emitted if the #define implementation of the page 
allocator only changes.  That's intended, we want to know the consequences 

No, it didn't, it was unnecessary for all of the kmalloc_nofail() callers 
since they already implicitly loop.  No warning is emitted (these are all 
GFP_NOFS or GFP_NOIO users, there are no order-4 or larger GFP_ATOMIC 

Couple reasons:

 - it's unnecessary in the page allocator, it can be implemented at a 
   higher level, which allows us to remove these branches from the 
   slowpath,

 - it's mostly unnecessary since all users have orders that will 
   implicitly loop forever anyway (although __GFP_NOFAIL does guarantee 
   that it won't fail even if we change the looping behavior internally,
   these warnings help to isolate cases where it's needed), and
--

From: David Rientjes
Date: Monday, September 6, 2010 - 2:05 am

Are there any objections to merging this series through -mm with the 
exception of the fifth patch for ntfs?  That particular patch needs to 
have its WARN_ON_ONCE() condition rewritten since it fallbacks to 
vmalloc for high order allocs.

Thanks.
--

Previous thread: [PATCH] Input: pxa27x_keypad - remove input_free_device() in pxa27x_keypad_remove() by Axel Lin on Tuesday, August 24, 2010 - 3:47 am. (1 message)

Next thread: [PATCH] fanotify: Return EPERM when a process is not privileged by Andreas Gruenbacher on Tuesday, August 24, 2010 - 3:58 am. (1 message)