Re: [PATCH 0/5] Candidate fix for increased number of GFP_ATOMIC failures V2

From: Mel Gorman
Date: Thursday, October 22, 2009 - 7:22 am

Sorry for the large cc list. Variations of this bug have cropped up in a
number of different places and so there are a fair few people that should
be vaguely aware of what's going on.

Since 2.6.31-rc1, there has been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and while they are generally brushed away, there has been a large
increase in them recently and there are a number of possible areas the
problem could be in: core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are:

[Bug #14141] order 2 page allocation failures in iwlagn
	Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
	allocations within the wireless driver. This has caused large numbers
	of failure reports to occur as reported by Frans Pop. Fixing this
	requires changes to the driver if it wants to use GFP_ATOMIC which
	is in the hands of Mohamed Abbas and Reinette Chatre. However,
	it is very likely that it has been compounded by core mm changes
	that this series is aimed at.

[Bug #14141] order 2 page allocation failures (generic)
	This problem is being tracked under bug #14141 but chances are it's
	unrelated to the wireless change. Tobi Oetiker has reported that a
	virtualised machine using a bridged interface is reporting a small
	number of order-5 GFP_ATOMIC failures. He has reported that the
	errors can be suppressed with kswapd patches in this series. However,
	I would like to confirm they are necessary.

[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
	Karol Lewandowski reported that e100 fails to allocate order-5
	GFP_ATOMIC when loading firmware during resume. This has started
	happening relatively recently.

[No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
	This apparently is easily reproducible, particularly in comparison to
	the other reports. The point of greatest interest is that this is
	order-0 GFP_ATOMIC failures. ...
From: Mel Gorman
Date: Thursday, October 22, 2009 - 7:22 am

When a high-order allocation fails, kswapd is kicked so that it reclaims
at a higher order to avoid direct reclaimers stalling and to help GFP_ATOMIC
allocations. Something has changed in recent kernels that affects the timing,
and high-order GFP_ATOMIC allocations are now failing more frequently,
particularly under pressure.

This patch pre-emptively checks if watermarks have been hit after a
high-order allocation completes successfully. If the watermarks have been
reached, kswapd is woken in the hope it fixes the watermarks before the
next GFP_ATOMIC allocation fails.

Warning, this patch is somewhat of a band-aid. If this makes a difference,
it still implies that something has changed that is either causing more
GFP_ATOMIC allocations to occur (such as the case with the iwlagn wireless
driver) or making them more likely to fail.
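
For illustration only, here is a user-space model of the idea described
above; it is not the patch itself. The names model_zone, model_watermark_ok
and model_alloc are hypothetical. The order-aware check is a simplified
rendering of the kernel's watermark test (it leaves out lowmem reserves and
the adjustments made for atomic allocations), and the toy allocator only
takes blocks of exactly the requested order.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 11 /* number of buddy free lists in this model */

/* Hypothetical, simplified stand-in for a memory zone. */
struct model_zone {
	unsigned long nr_free[MODEL_MAX_ORDER]; /* free blocks per order */
	unsigned long wmark_low;                /* low watermark, in pages */
	int kswapd_wakeups;                     /* wakeups issued by the model */
};

/* Total free pages across all orders. */
static unsigned long model_free_pages(const struct model_zone *z)
{
	unsigned long total = 0;

	for (int o = 0; o < MODEL_MAX_ORDER; o++)
		total += z->nr_free[o] << o;
	return total;
}

/*
 * Order-aware watermark test: pages sitting in blocks smaller than the
 * requested order cannot satisfy the request, so they are discounted
 * one order at a time while the required minimum is halved.
 */
static bool model_watermark_ok(const struct model_zone *z,
			       unsigned int order, unsigned long mark)
{
	long min = (long)mark;
	long free = (long)model_free_pages(z);

	if (free <= min)
		return false;
	for (unsigned int o = 0; o < order; o++) {
		free -= (long)(z->nr_free[o] << o);
		min >>= 1;
		if (free <= min)
			return false;
	}
	return true;
}

/*
 * Take one block of exactly the given order. After a successful
 * high-order allocation, re-check the low watermark at that order and
 * count a kswapd wakeup if it is no longer met.
 */
static bool model_alloc(struct model_zone *z, unsigned int order)
{
	assert(order < MODEL_MAX_ORDER);
	if (z->nr_free[order] == 0)
		return false;
	z->nr_free[order]--;
	if (order > 0 && !model_watermark_ok(z, order, z->wmark_low))
		z->kswapd_wakeups++;
	return true;
}
```

In this model a zone holding 100 free order-0 pages and a single order-2
block passes the order-0 check against a watermark of 32, yet fails the
order-2 check once that block is taken, which is the point at which the
early wakeup would fire.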

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/page_alloc.c |   33 ++++++++++++++++++++++-----------
 1 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7f2aa3e..851df40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1596,6 +1596,17 @@ try_next_zone:
 	return page;
 }
 
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+						enum zone_type high_zoneidx)
+{
+	struct zoneref *z;
+	struct zone *zone;
+
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order);
+}
+
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long pages_reclaimed)
@@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
-	return page;
-}
-
-static inline
-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
-						enum zone_type high_zoneidx)
-{
-	struct zoneref *z;
-	struct zone *zone;
+	/*
+	 * If after a high-order allocation we are now below ...
From: David Rientjes
Date: Thursday, October 22, 2009 - 12:41 pm

Hmm, is this really supposed to be added to __alloc_pages_high_priority()?  
By the patch description I was expecting kswapd to be woken up 
preemptively whenever the preferred zone is below ALLOC_WMARK_LOW and 
we're known to have just allocated at a higher order, not just when 
current was oom killed (when we should already be freeing a _lot_ of 
memory soon) or is doing a higher order allocation during direct reclaim.

For the best coverage, the branch would have to be added to the fastpath.
That seems fine for a debugging aid and to see if progress is being made
on the GFP_ATOMIC allocation issues, but doesn't seem like it should make
its way to mainline: the subsequent GFP_ATOMIC allocation could already be
happening and be in the page allocator's slowpath at this point, so this
wakeup becomes unnecessary.

If this is moved to the fastpath, why is this wake_all_kswapd() and not
wakeup_kswapd(preferred_zone, order)?  Do we need to kick kswapd in all 
zones even though they may be free just because preferred_zone is now 
below the watermark?

Wouldn't it be better to do this on page_zone(page) instead of 
preferred_zone anyway?
--

From: Mel Gorman
Date: Friday, October 23, 2009 - 2:13 am

It was a somewhat arbitrary choice to have it trigger in the event high


It probably makes no difference as zones are checked for their watermarks
before any real work happens. However, even if this patch makes a difference,
I don't want to see it merged.  At best, it is an extremely heavy-handed
hack which is why I asked for it to be tested in isolation. It shouldn't
be necessary at all because sort of pre-emptive waking of kswapd was never

No. The preferred_zone is the zone we should be allocating from. If we
failed to allocate from it, it implies the watermarks are not being met
so we want to wake it.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: David Rientjes
Date: Friday, October 23, 2009 - 2:36 am

I don't quite understand, users of PF_MEMALLOC shouldn't be doing these 
higher order allocations and if ALLOC_NO_WATERMARKS is by way of the oom 
killer, we should be freeing a substantial amount of memory imminently 

Ahh, that makes a ton more sense: this particular patch is a debugging 

Oops, I'm even more confused now :)  I thought the existing 
wake_all_kswapd() in the slowpath was doing that and that this patch was 
waking them prematurely because it speculates that a subsequent high 
order allocation will fail unless memory is reclaimed.  I thought we'd  
want to reclaim from the zone we just did a high order allocation from so 
that the fastpath could find the memory next time with ALLOC_WMARK_LOW.
--

From: Mel Gorman
Date: Friday, October 23, 2009 - 4:25 am

I agree. I think it's highly unlikely this patch will make any
difference but I wanted to eliminate it as a possibility. Patch 3 and 4



It should be doing that. This patch should be junk but because it was tested

The fastpath should be getting the pages it needs from the
preferred_zone. If it's not, we still want to get pages back in that
zone and the zone we actually ended up getting pages from.

It's probably best to ignore this patch except in the unlikely event Tobias
says it makes a difference to his testing. I'm hoping he's covered by patches
1+2 and maybe 3 and that patches 4 and 5 of this set get consigned to
the bit bucket.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Tobias Oetiker
Date: Friday, October 23, 2009 - 4:31 am

Mel,


hi hi ... I have tested '3 only' this morning, and the allocation
problems started again ... so for me 3 alone does not work while
3+4 does.

cheers
tobi

-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900
--

From: Mel Gorman
Date: Friday, October 23, 2009 - 6:39 am

Hi,

What was the outcome of 1+2?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KOSAKI Motohiro
Date: Monday, October 26, 2009 - 7:42 pm

Hmm, I'm confused. This description addresses generic high-order allocations.

__alloc_pages_high_priority() is only called if ALLOC_NO_WATERMARKS is set.
ALLOC_NO_WATERMARKS means PF_MEMALLOC or TIF_MEMDIE, and GFP_ATOMIC callers
do not make nested alloc_pages() calls (i.e. they do not hit the PF_MEMALLOC
case). So I don't understand why this patch improves the iwlagn GFP_ATOMIC case.
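
The reasoning above can be sketched as a small user-space model. The flag
values and the names model_no_watermarks and model_slowpath are hypothetical
and exist only for this illustration; the real allocator considers further
conditions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model only. */
enum {
	MODEL_PF_MEMALLOC = 1 << 0, /* task is itself reclaiming memory */
	MODEL_TIF_MEMDIE  = 1 << 1, /* task was picked by the OOM killer */
};

enum model_path {
	MODEL_PATH_NORMAL,        /* ordinary slowpath, watermarks apply */
	MODEL_PATH_HIGH_PRIORITY, /* watermarks ignored */
};

/* Watermarks are ignored only for the two task states named above. */
static bool model_no_watermarks(unsigned int task_flags)
{
	/* Reject flags this model does not know about. */
	assert((task_flags & ~(MODEL_PF_MEMALLOC | MODEL_TIF_MEMDIE)) == 0);
	return task_flags != 0;
}

/* Only no-watermark allocations reach the high-priority path. */
static enum model_path model_slowpath(unsigned int task_flags)
{
	return model_no_watermarks(task_flags) ? MODEL_PATH_HIGH_PRIORITY
					       : MODEL_PATH_NORMAL;
}
```

A GFP_ATOMIC caller such as a driver refilling receive buffers has neither
flag set in this model, so it never reaches the high-priority path where the
patch adds its check.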




--

From: Mel Gorman
Date: Tuesday, October 27, 2009 - 8:26 am

The description is misleading but in the patch's current form, it makes

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--


Testing by Frans Pop indicates that, at least in the 2.6.30..2.6.31
window, the commits 373c0a7e and 8aa7e847 dramatically increased the
number of GFP_ATOMIC failures that were occurring within a wireless
driver. It was never isolated which of the changes was the exact problem
and it's possible it has been fixed since. If problems are still
occurring with GFP_ATOMIC in 2.6.31-rc5, then this patch should be
applied to determine if the congestion_wait() callers are still broken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 arch/x86/lib/usercopy_32.c  |    2 +-
 drivers/block/pktcdvd.c     |   10 ++++------
 drivers/md/dm-crypt.c       |    2 +-
 fs/fat/file.c               |    2 +-
 fs/fuse/dev.c               |    8 ++++----
 fs/nfs/write.c              |    8 +++-----
 fs/reiserfs/journal.c       |    2 +-
 fs/xfs/linux-2.6/kmem.c     |    4 ++--
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   11 +++--------
 include/linux/blkdev.h      |   13 +++++++++----
 mm/backing-dev.c            |    7 ++++---
 mm/memcontrol.c             |    2 +-
 mm/page-writeback.c         |    2 +-
 mm/page_alloc.c             |    4 ++--
 mm/vmscan.c                 |    8 ++++----
 16 files changed, 42 insertions(+), 45 deletions(-)

diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
 
 			if (retval == -ENOMEM && is_global_init(current)) {
 				up_read(&current->mm->mmap_sem);
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
+				congestion_wait(WRITE, HZ/50);
 				goto survive;
 			}
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 2ddf03a..d69bf9c 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
 	wakeup = (pd->write_congestion_on > 0
 	 		&& pd->bio_queue_size <= pd->write_congestion_off);
 	spin_unlock(&pd->lock);
-	if (wakeup) ...

This is a clean revert against 2.6.31.4

==== CUT HERE ====
Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion

Testing by Frans Pop indicates that, at least in the 2.6.30..2.6.31 window,
the commits 373c0a7e and 8aa7e847 dramatically increased the number of
GFP_ATOMIC failures that were occurring within a wireless driver. It was
never isolated which of the changes was the exact problem and it's possible
it has been fixed since.

However the fixes, if they exist in mainline, have not been back-ported to
-stable so for the -stable series, it might be best just to revert.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 arch/x86/lib/usercopy_32.c  |    2 +-
 drivers/block/pktcdvd.c     |   10 ++++------
 drivers/md/dm-crypt.c       |    2 +-
 fs/fat/file.c               |    2 +-
 fs/fuse/dev.c               |    8 ++++----
 fs/nfs/write.c              |    8 +++-----
 fs/reiserfs/journal.c       |    2 +-
 fs/xfs/linux-2.6/kmem.c     |    4 ++--
 fs/xfs/linux-2.6/xfs_buf.c  |    2 +-
 include/linux/backing-dev.h |   11 +++--------
 include/linux/blkdev.h      |   13 +++++++++----
 mm/backing-dev.c            |    7 ++++---
 mm/memcontrol.c             |    2 +-
 mm/page-writeback.c         |    8 ++++----
 mm/page_alloc.c             |    4 ++--
 mm/vmscan.c                 |    8 ++++----
 16 files changed, 45 insertions(+), 48 deletions(-)

diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
 
 			if (retval == -ENOMEM && is_global_init(current)) {
 				up_read(&current->mm->mmap_sem);
-				congestion_wait(BLK_RW_ASYNC, HZ/50);
+				congestion_wait(WRITE, HZ/50);
 				goto survive;
 			}
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 99a506f..83650e0 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
 	wakeup = ...

I still think this is a complete red herring.

-- 
Jens Axboe

--


Oops. No, please no.
8aa7e847 is a regression-fixing commit. Reverting it means the regression
it fixed will occur again.
If we really need to revert it, we need to revert 1faa16d2287 too.
However, I doubt this commit really causes a regression in iwlagn. IOW,
I agree with Jens.

I would like to try to reproduce this problem in my test environment. Can
anyone please explain how to reproduce it?
Is special hardware necessary?


----------------------------------------------------
commit 8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Thu Jul 9 14:52:32 2009 +0200

    Fix congestion_wait() sync/async vs read/write confusion

    Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
    the bdi congestion wait queue logic, causing us to wait on congestion
    for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
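
A minimal user-space sketch of the mix-up that commit message describes. The
values BLK_RW_ASYNC == 0 and WRITE == 1 come from the message; the second
queue index (sync == 1), the struct model_queues type and both helper
functions are assumptions made for this illustration, which models each wait
queue as a simple count of sleepers.

```c
#include <assert.h>

/* Queue indices; the async/WRITE values are those quoted above. */
enum { MODEL_BLK_RW_ASYNC = 0, MODEL_BLK_RW_SYNC = 1 };
#define MODEL_WRITE 1

/* Hypothetical model: number of sleeping waiters per congestion queue. */
struct model_queues {
	int waiters[2]; /* [0] = async queue, [1] = sync queue */
};

/* A caller goes to sleep on the queue selected by idx. */
static void model_congestion_wait(struct model_queues *q, int idx)
{
	assert(idx == 0 || idx == 1);
	q->waiters[idx]++;
}

/* Congestion clears on one queue: wake its sleepers, return how many. */
static int model_clear_congested(struct model_queues *q, int idx)
{
	int woken;

	assert(idx == 0 || idx == 1);
	woken = q->waiters[idx];
	q->waiters[idx] = 0;
	return woken;
}
```

A caller that passes MODEL_WRITE ends up on queue 1, so clearing async
congestion on queue 0 wakes nobody; in the kernel such a waiter would
presumably sleep until its timeout expires, which is the kind of timing
change discussed in this thread.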



--


This is not intended as a patch for mainline, but just as a test to see if 
it improves things. It may be a regression fix, but it also creates a 
significant change in behavior during swapping in my test case.
If a fix is needed, it will probably be different from this revert.
Please read: http://lkml.org/lkml/2009/10/26/510.


Please see my mails in this thread for bug #14141: 
http://thread.gmane.org/gmane.linux.kernel/896714

You will probably need to read some of them to understand the context of 
the two mails linked above.

The most relevant ones are (all from the same thread; not sure why gmane 
gives such weird links):
http://article.gmane.org/gmane.linux.kernel.mm/39909
http://article.gmane.org/gmane.linux.kernel.kernel-testers/7228

Not special hardware, but you may need an encrypted partition and NFS; the 
test may need to be modified according to the amount of memory you have.
I think it should be possible to reproduce the freezes I see while ignoring 
the SKB allocation errors as IMO those are just a symptom, not the cause.
So you should not need wireless.

The severity of the freezes during my test often increases if the test is 
repeated (without rebooting).

Cheers,
FJP
--

From: Pekka Enberg
Date: Thursday, October 22, 2009 - 7:47 am

As explained by Jens Axboe, this changes timing but is not the source
of the OOMs so the revert is bogus even if it "helps" on some
workloads. IIRC the person who reported the revert to help things did
report that the OOMs did not go away, they were simply harder to
trigger with the revert.
--

From: Mel Gorman
Date: Thursday, October 22, 2009 - 9:03 am

Agreed, but I wanted to pin down where exactly we stand with this

IIRC, there were mixed reports as to how much the revert helped.  I'm hoping
that patches 1+2 cover the bases hence why I asked them to be tested on
their own. Patch 2 in particular might be responsible for watermarks being
impacted enough to cause timing problems. I left reverting with patch 5 as
a standalone test to see how much of a factor the timing changes introduced
are if there are still allocation problems.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Friday, October 23, 2009 - 6:52 pm

Bug fixes go into mainline, not linux-next. Let's make sure these fixes
really work and then merge.

--

From: Pekka Enberg
Date: Friday, October 23, 2009 - 11:48 pm

Regardless, patches 1-2 should _really_ go to Linus' tree (and
eventually -stable) while we figure out the rest of the problems. They
fix obvious regressions in the code paths and we have reports from
people that they help. Yes, they don't fix everything for everyone but
there's no upside in holding back fixes that are simple one-line
fixes to regressions.

		Pekka
--

From: reinette chatre
Date: Thursday, October 22, 2009 - 8:43 am

Driver has been changed to allocate paged skb for its receive buffers.
This reduces amount of memory needed from order-2 to order-1. This work
is significant and will thus be in 2.6.33. 

Reinette


--

From: Mel Gorman
Date: Tuesday, October 27, 2009 - 3:40 am

What do you want to do for -stable in 2.6.31?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: reinette chatre
Date: Tuesday, October 27, 2009 - 4:34 pm

I have just posted two patches to stable. The first is to address a bug
in which buffer loss occurs when there is an allocation failure and the
second is what Frans has been testing that reduces the noise when these
allocations fail. They are:

  iwlwifi: fix potential rx buffer loss
  de0bd50845eb5935ce3d503c5d2f565d6cb9ece1 in linux-2.6
  iwlwifi: reduce noise when skb allocation fails
  f82a924cc88a5541df1d4b9d38a0968cd077a051 in linux-2.6

Reinette


--

From: Sven Geggus
Date: Friday, October 23, 2009 - 12:31 am

I will see what I can do on the weekend. Unfortunately the crash happens on
a somewhat important machine and afterwards the Software-RAID needs a resync
which takes a few hours.

Sven

-- 
"Those who do not understand Unix are condemned to reinvent it, poorly"
(Henry Spencer)

/me is giggls@ircnet, http://sven.gegg.us/ on the Web
--

From: Karol Lewandowski
Date: Friday, October 23, 2009 - 9:58 am

On Thu, Oct 22, 2009 at 03:22:31PM +0100, Mel Gorman wrote:




No, problem doesn't go away with these patches (1+2+3).  However, from
my testing this particular patch makes it way, way harder to trigger
allocation failures (but these are still present).

This bothers me - should I test following patches with or without
above patch?  This patch makes bug harder to find, IMVHO it doesn't
fix the real problem.

(Rest not tested yet.)

Thanks.
--

From: Karol Lewandowski
Date: Friday, October 23, 2009 - 2:12 pm

Ok, I've tested patches 1+2+4 and bug, while very hard to trigger, is
still present. I'll test complete 1-4 patchset as time permits.

Thanks.
--

From: Mel LKML
Date: Saturday, October 24, 2009 - 6:46 am

Hi,

This is the same Mel as mel@csn.ul.ie. The mail server the address is
on has no power until Tuesday so I'm not going to be very responsive
until then. Monday is also a public holiday here and apparently they
are upgrading the power transformers near the building.


And also patch 5 please, which is the revert. Patch 5, as pointed out, is
probably a red herring. However, it has changed the timing and made a
difference for some testing so I'd like to know if it helps yours as
well.

As things stand, it looks like patches 1+2 should certainly go ahead.
I need to give more thought to patches 3 and 4 as to why they help
Tobias but not anyone else's testing.

Thanks
--

From: Karol Lewandowski
Date: Wednesday, October 28, 2009 - 4:42 am

I've tested patches 1+2+3+4 in my normal usage scenario (do some work,
suspend, do work, suspend, ...) and it failed today after 4 days (== 4
suspend-resume cycles).

I'll test 1-5 now.


Thanks.
--

From: Mel Gorman
Date: Wednesday, October 28, 2009 - 4:59 am

I was digging through commits for suspend-related changes. Rafael, is
there any chance that some change to suspend is responsible for this
regression? This commit for example is a vague possibility;
c6f37f12197ac3bd2e5a35f2f0e195ae63d437de: PM/Suspend: Do not shrink memory before suspend

I say vague because FREE_PAGE_NUMBER is so small.

Also, what was the behaviour of the e100 driver when suspending before
this commit?

6905b1f1a03a48dcf115a2927f7b87dba8d5e566: Net / e100: Fix suspend of devices that cannot be power managed

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Karol Lewandowski
Date: Friday, October 30, 2009 - 7:23 am

This was discussed before with e100 maintainers and Rafael.  Reverting
this patch didn't change anything.


Thanks.
--

From: Mel Gorman
Date: Monday, November 2, 2009 - 1:30 pm

Does applying the following on top make any difference?

==== CUT HERE ====
PM: Shrink memory before suspend

This is a partial revert of c6f37f12197ac3bd2e5a35f2f0e195ae63d437de. It
is an outside possibility for fixing the e100 bug where an order-5
allocation is failing during resume. The commit notes that the shrinking
of memory should be unnecessary but maybe it is in error.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>

diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 6f10dfc..4f6ae64 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -23,6 +23,9 @@ const char *const pm_states[PM_SUSPEND_MAX] = {
 	[PM_SUSPEND_MEM]	= "mem",
 };
 
+/* This is just an arbitrary number */
+#define FREE_PAGE_NUMBER (100)
+
 static struct platform_suspend_ops *suspend_ops;
 
 /**
@@ -78,6 +81,7 @@ static int suspend_test(int level)
 static int suspend_prepare(void)
 {
 	int error;
+	unsigned int free_pages;
 
 	if (!suspend_ops || !suspend_ops->enter)
 		return -EPERM;
@@ -92,10 +96,24 @@ static int suspend_prepare(void)
 	if (error)
 		goto Finish;
 
-	error = suspend_freeze_processes();
+	if (suspend_freeze_processes()) {
+		error = -EAGAIN;
+		goto Thaw;
+	}
+
+	free_pages = global_page_state(NR_FREE_PAGES);
+	if (free_pages < FREE_PAGE_NUMBER) {
+		pr_debug("PM: free some memory\n");
+		shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
+		if (nr_free_pages() < FREE_PAGE_NUMBER) {
+			error = -ENOMEM;
+			printk(KERN_ERR "PM: No enough memory\n");
+		}
+	}
 	if (!error)
 		return 0;
 
+ Thaw:
 	suspend_thaw_processes();
 	usermodehelper_enable();
  Finish:
--

From: Karol Lewandowski
Date: Tuesday, November 3, 2009 - 7:03 pm

No, this patch didn't change anything either.

IIRC I get failures while free(1) shows as much as 20MB free RAM
(i.e. without buffers/caches).  Additionally, nr_free_pages (from
/proc/vmstat) stays at about 800-1000 under heavy memory pressure
(gitk on full linux repository).


--- babbling follows ---

Hmm, I wonder: if it's really a timing issue, wouldn't lowering the
swappiness sysctl make the problem more visible?
I have vm.swappiness=15; would testing with a higher value make any sense?

Thanks.
--

From: Tobi Oetiker
Date: Wednesday, October 28, 2009 - 5:55 am

I have been testing 1+2, 1+2+3 as well as 3+4 and had been under the
assumption that 3+4 does help ... I have now been running a modified
version of 4 which prints a warning instead of doing anything ... I
have now seen the allocation issue again without the warning being
printed. So in other words

1+2+3 make the problem less severe, but do not solve it
4 seems to be a red herring.

cheers
tobi


-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900
--

From: Frans Pop
Date: Saturday, October 24, 2009 - 6:51 am

I needed a break and have thus been off-line for a few days. Good to see 
there's been progress. I'll try to do some testing tomorrow.

Cheers,
FJP
--

From: Sven Geggus
Date: Saturday, October 24, 2009 - 7:02 am

Problem persists. RAID resync in progress :(

Sven

-- 
"linux is evolution, not intelligent design"
(Linus Torvalds)

/me is giggls@ircnet, http://sven.gegg.us/ on the Web
--

From: Mel Gorman
Date: Tuesday, October 27, 2009 - 6:27 am

What about the rest of the patches, any luck?

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Tobias Oetiker
Date: Monday, October 26, 2009 - 10:37 am

Hi Mel,

I have now done additional tests ... and can report the following




3 alone does not help
3+4 does ...

cheers
tobi
-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900
--

From: Mel Gorman
Date: Tuesday, October 27, 2009 - 8:36 am

This is a bit surprising.....

Tell me, do you have an Intel IO-MMU on your system by any chance?  It should
be mentioned in either dmesg or lspci -v (please send the full output of
both). If you do have one of these things, I notice they abuse PF_MEMALLOC
which would explain why this patch makes a difference to your testing.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Frans Pop
Date: Monday, October 26, 2009 - 3:17 pm

I've tested against 2.6.31.1 as it's easier for me to compare behaviors 
with that than with .32. All patches applied without problems against .31.

I've also tested 2.6.31.1 with SLAB instead of SLUB, but that does not seem 

Does not look to make any difference. Possibly causes more variation in the 



Applied on top of patches 1-4. Despite Jens' scepticism, this is still the
patch that makes the most significant difference in my test.
The reading of commits in gitk is much more fluent and music skips are a
lot less severe. But most important is that there is no long total freeze
of the system halfway during the reading of commits and gitk loads
fastest. It also gives by far the most consistent results.
The likelihood of SKB allocation errors during the test is a lot smaller.
See also http://lkml.org/lkml/2009/10/26/455.


Detailed test results follow. I've done 2 test runs with each kernel (3 for 
the last).

The columns below give the following info:
- time at which all commits have been read by gitk
- time at which gitk fills in "branch", "follows" and "precedes" data for
  the current commit
- time at which there's no longer any disk activity, i.e. when gitk is
  fully loaded and all swapping is done
- total number of SKB allocation errors during the test
A "freeze" during the reading of commits is indicated by an "f" (short 
freeze) or "F" (long "hard" freeze). An "S" shows when there were SKB 
allocation errors.

                end commits     show branch     done            SKB errs
1) vanilla .31.1
run 1:          1:20 fFS        2:10 S          2:30            44 a)
run 2:          1:35 FS         1:45            2:10            13

2) .31.1 + patches 1-2
run 1:          2:30 fFS        2:45            3:00            58
run 2:          1:15 fS         2:00            2:20            2 a)

3) .31.1 + patches 1-3
run 1:          1:00 fS         1:15            1:45            1 *)
run 2:          3:00 fFS        3:15            3:30            33
*) unexpected; fortunate timing?

4) .31.1 + patches 1-4
run 1:          1:10 ffS        1:55 S          2:20            35 a)
run 2:          3:05 fFS        3:15            3:25            36

5) .31.1 + patches 1-5
run 1:          1:00            1:15            1:35            0
run 2:          0:50            1:15 S          1:45            45 ...
From: Frans Pop
Date: Monday, October 26, 2009 - 4:45 pm

Forgot to mention that each run was after a reboot, so they are not
interdependent.
--

From: Tobias Diedrich
Date: Thursday, November 5, 2009 - 11:03 pm

I've also seen order-0 failures on 2.6.31.5:
Note that this is with one process hogging and mlocking memory and
min_free_kbytes reduced to 100 to reproduce the problem more easily.

I tried bisecting the issue, but in the end without memory pressure
I can't reproduce it reliably and with the above mentioned pressure
I get allocation failures even on 2.6.30.o

Initially the issue was that the machine hangs after the allocation
failure, but that seems to be a netconsole related issue, since I
didn't get a hang on 2.6.31 compiled without netconsole.
http://lkml.org/lkml/2009/11/1/66
http://lkml.org/lkml/2009/11/5/100

[  375.398423] swapper: page allocation failure. order:0, mode:0x20
[  375.398483] Pid: 0, comm: swapper Not tainted 2.6.31.5-nokmem-tomodachi #3
[  375.398519] Call Trace:
[  375.398566]  [<c10395a8>] ? __alloc_pages_nodemask+0x40f/0x453
[  375.398613]  [<c104e988>] ? cache_alloc_refill+0x1f3/0x382
[  375.398648]  [<c104eb76>] ? __kmalloc+0x5f/0x97
[  375.398690]  [<c1228003>] ? __alloc_skb+0x44/0x101
[  375.398723]  [<c12289b5>] ? dev_alloc_skb+0x11/0x25
[  375.398760]  [<c11a53a1>] ? tulip_refill_rx+0x3c/0x115
[  375.398793]  [<c11a57f7>] ? tulip_poll+0x37d/0x416
[  375.398832]  [<c122cf56>] ? net_rx_action+0x3a/0xdb
[  375.398874]  [<c101d8b0>] ? __do_softirq+0x5b/0xcb
[  375.398908]  [<c101d855>] ? __do_softirq+0x0/0xcb
[  375.398937]  <IRQ>  [<c1003e0b>] ? do_IRQ+0x66/0x76
[  375.398989]  [<c1002d70>] ? common_interrupt+0x30/0x38
[  375.399026]  [<c100725c>] ? default_idle+0x25/0x38
[  375.399058]  [<c1001a1e>] ? cpu_idle+0x64/0x7a
[  375.399102]  [<c14105ff>] ? start_kernel+0x251/0x258
[  375.399133] Mem-Info:
[  375.399159] DMA per-cpu:
[  375.399184] CPU    0: hi:    0, btch:   1 usd:   0
[  375.399214] Normal per-cpu:
[  375.399241] CPU    0: hi:   90, btch:  15 usd:  28
[  375.399276] Active_anon:6709 active_file:1851 inactive_anon:6729
[  375.399278]  inactive_file:2051 unevictable:40962 dirty:998 writeback:914 unstable:0
[  375.399281]  ...
From: Mel Gorman
Date: Friday, November 6, 2009 - 2:24 am

To be honest, it's not entirely unexpected with min_free_kbytes set that
low. The system should cope with a certain amount of pressure but with
pressure and a low min_free_kbytes, the system will simply be reacting
too late to free memory in the non-atomic paths.
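
To put rough numbers on this, here is a small sketch of how the watermarks
follow from min_free_kbytes. It assumes a single zone that receives the whole
reserve and derives the low and high marks as min plus a quarter and min plus
a half, which is how kernels of this era are understood to compute them; the
names model_wmarks and model_wmarks_from_kbytes are hypothetical.

```c
#include <assert.h>

/* Watermarks in pages for one zone (hypothetical model type). */
struct model_wmarks {
	unsigned long min;
	unsigned long low;
	unsigned long high;
};

/*
 * Convert min_free_kbytes to pages and derive the low and high marks.
 * page_shift is log2 of the page size in bytes (12 for 4 KiB pages).
 */
static struct model_wmarks
model_wmarks_from_kbytes(unsigned long min_free_kbytes, unsigned int page_shift)
{
	struct model_wmarks w;

	assert(page_shift >= 10); /* a page must hold at least 1 KiB */
	w.min = min_free_kbytes >> (page_shift - 10);
	w.low = w.min + (w.min >> 2);
	w.high = w.min + (w.min >> 1);
	return w;
}
```

With min_free_kbytes set to 100 and 4 KiB pages this gives a min watermark of
25 pages, with low at 31 and high at 37, so reclaim only starts when almost
nothing is left for atomic allocations.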

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Tobias Diedrich
Date: Friday, November 6, 2009 - 4:15 am

That was on vanilla 2.6.31.5.

I tried 2.6.31.5 before with patches 1+2 and netconsole enabled and
still got the order-1 failures (apparently I get order-1 failures
Maybe I should try again on 2.6.30 without netconsole and try
increasing min_free_kbytes until the allocation failures
disappear and try to bisect again with that setting...

-- 
Tobias						PGP: http://8ef7ddba.uguu.de
--
