Sorry for the large cc list. Variations of this bug have cropped up in a number of different places, so there are a fair few people who should be vaguely aware of what's going on.

Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC failures. A significant number of these have been high-order GFP_ATOMIC failures, and while they are generally brushed away, there has been a large increase in them recently. The problem could be in a number of areas: core vm, page writeback or a specific driver. The bugs affected by this that I am aware of are:

[Bug #14141] order 2 page allocation failures in iwlagn
	Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC allocations within the wireless driver. This has caused large numbers of failure reports, as reported by Frans Pop. Fixing this requires changes to the driver if it wants to use GFP_ATOMIC, which is in the hands of Mohamed Abbas and Reinette Chatre. However, it is very likely that it has been compounded by the core mm changes that this series is aimed at.

[Bug #14141] order 2 page allocation failures (generic)
	This problem is being tracked under bug #14141 but chances are it's unrelated to the wireless change. Tobi Oetiker has reported that a virtualised machine using a bridged interface is reporting a small number of order-5 GFP_ATOMIC failures. He has reported that the errors can be suppressed with the kswapd patches in this series. However, I would like to confirm they are necessary.

[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
	Karol Lewandows reported that e100 fails to allocate order-5 GFP_ATOMIC when loading firmware during resume. This has started happening relatively recently.

[No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
	This apparently is easily reproducible, particularly in comparison to the other reports. The point of greatest interest is that these are order-0 GFP_ATOMIC failures. ...
When a high-order allocation fails, kswapd is kicked so that it reclaims
at a higher order to avoid direct reclaimers stalling and to help GFP_ATOMIC
allocations. Something has changed in recent kernels that affects the timing,
and high-order GFP_ATOMIC allocations are now failing more frequently,
particularly under pressure.
This patch pre-emptively checks if watermarks have been hit after a
high-order allocation completes successfully. If the watermarks have been
reached, kswapd is woken in the hope it fixes the watermarks before the
next GFP_ATOMIC allocation fails.
Warning, this patch is somewhat of a band-aid. If this makes a difference,
it still implies that something has changed that is either causing more
GFP_ATOMIC allocations to occur (as is the case with the iwlagn wireless
driver) or making them more likely to fail.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/page_alloc.c | 33 ++++++++++++++++++++++-----------
1 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7f2aa3e..851df40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1596,6 +1596,17 @@ try_next_zone:
return page;
}
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+ enum zone_type high_zoneidx)
+{
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+ wakeup_kswapd(zone, order);
+}
+
static inline int
should_alloc_retry(gfp_t gfp_mask, unsigned int order,
unsigned long pages_reclaimed)
@@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
- return page;
-}
-
-static inline
-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
- enum zone_type high_zoneidx)
-{
- struct zoneref *z;
- struct zone *zone;
+ /*
+ * If after a high-order allocation we are now below ...

Hmm, is this really supposed to be added to __alloc_pages_high_priority()? From the patch description I was expecting kswapd to be woken up preemptively whenever the preferred zone is below ALLOC_WMARK_LOW and we're known to have just allocated at a higher order, not just when current was oom killed (when we should already be freeing a _lot_ of memory soon) or is doing a higher order allocation during direct reclaim.

For the best coverage, the branch would have to be added to the fastpath. That seems fine for a debugging aid and to see if progress is being made on the GFP_ATOMIC allocation issues, but it doesn't seem like it should make its way to mainline: the subsequent GFP_ATOMIC allocation could already be happening and be in the page allocator's slowpath at this point, in which case this wakeup becomes unnecessary.

If this is moved to the fastpath, why is this wake_all_kswapd() and not wakeup_kswapd(preferred_zone, order)? Do we need to kick kswapd in all zones, even though they may be free, just because preferred_zone is now below the watermark?

Wouldn't it be better to do this on page_zone(page) instead of preferred_zone anyway?
--
It was a somewhat arbitrary choice to have it trigger in the event high

It probably makes no difference as zones are checked for their watermarks before any real work happens.

However, even if this patch makes a difference, I don't want to see it merged. At best, it is an extremely heavy-handed hack, which is why I asked for it to be tested in isolation. It shouldn't be necessary at all because sort of pre-emptive waking of kswapd was never

No. The preferred_zone is the zone we should be allocating from. If we failed to allocate from it, it implies the watermarks are not being met so we want to wake it.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
I don't quite understand, users of PF_MEMALLOC shouldn't be doing these higher order allocations and if ALLOC_NO_WATERMARKS is by way of the oom killer, we should be freeing a substantial amount of memory imminently

Ahh, that makes a ton more sense: this particular patch is a debugging

Oops, I'm even more confused now :) I thought the existing wake_all_kswapd() in the slowpath was doing that and that this patch was waking them prematurely because it speculates that a subsequent high order allocation will fail unless memory is reclaimed.

I thought we'd want to reclaim from the zone we just did a high order allocation from so that the fastpath could find the memory next time with ALLOC_WMARK_LOW.
--
I agree. I think it's highly unlikely this patch will make any difference but I wanted to eliminate it as a possibility. Patch 3 and 4

It should be doing that. This patch should be junk but because it was tested

The fastpath should be getting the pages it needs from the preferred_zone. If it's not, we still want to get pages back in that zone and the zone we actually ended up getting pages from.

It's probably best to ignore this patch except in the unlikely event Tobias says it makes a difference to his testing. I'm hoping he's covered by patches 1+2 and maybe 3 and that patches 4 and 5 of this set get consigned to the bit bucket.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
Mel, hi hi ... I have tested '3 only' this morning, and the allocation problems started again ... so for me 3 alone does not work while 3+4 does. cheers tobi -- Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900 --
Hi, What was the outcome of 1+2? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Hmm, I'm confused. This description addresses generic high-order allocation, but __alloc_pages_high_priority() is only called if ALLOC_NO_WATERMARKS is set. ALLOC_NO_WATERMARKS means PF_MEMALLOC or TIF_MEMDIE, and GFP_ATOMIC callers don't make nested alloc_pages() calls (i.e. they are not the PF_MEMALLOC case). So I don't understand why this patch would improve the iwlagn GFP_ATOMIC case. --
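The objection in this mail can be restated as a small userspace model. This is only a sketch of the rule as described here (watermarks are ignored only for PF_MEMALLOC or TIF_MEMDIE tasks), not the kernel source: the struct, the flag values and every `model_*` name are hypothetical stand-ins introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel flags; the values are illustrative only. */
#define MODEL_ALLOC_WMARK_MIN      0x01
#define MODEL_ALLOC_NO_WATERMARKS  0x04

/* Minimal model of the per-task state the mail refers to. */
struct model_task {
	bool pf_memalloc;  /* task is itself reclaiming memory (PF_MEMALLOC) */
	bool tif_memdie;   /* task was picked by the OOM killer (TIF_MEMDIE) */
};

/*
 * Return the modelled allocation flags for a task. Only a task with
 * PF_MEMALLOC or TIF_MEMDIE set is allowed to ignore the watermarks.
 */
int model_alloc_flags(const struct model_task *t)
{
	int flags = MODEL_ALLOC_WMARK_MIN;

	assert(t != NULL);
	if (t->pf_memalloc || t->tif_memdie)
		flags |= MODEL_ALLOC_NO_WATERMARKS;
	return flags;
}

/* True if the modelled slow path would take the high-priority branch. */
bool model_takes_high_priority_path(const struct model_task *t)
{
	return (model_alloc_flags(t) & MODEL_ALLOC_NO_WATERMARKS) != 0;
}
```

In this model an ordinary GFP_ATOMIC caller, such as a driver refilling receive buffers, has neither flag set and so never reaches the high-priority branch, which is why a wakeup placed there would not be expected to help it directly.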
The description is misleading but in the patch's current form, it makes

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
Testing by Frans Pop indicates that, in the 2.6.30..2.6.31 window at
least, the commits 373c0a7e 8aa7e847 dramatically increased the
number of GFP_ATOMIC failures that were occurring within a wireless
driver. It was never isolated which of the changes was the exact problem
and it's possible it has been fixed since. If problems are still
occurring with GFP_ATOMIC in 2.6.31-rc5, then this patch should be
applied to determine if the congestion_wait() callers are still broken.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
arch/x86/lib/usercopy_32.c | 2 +-
drivers/block/pktcdvd.c | 10 ++++------
drivers/md/dm-crypt.c | 2 +-
fs/fat/file.c | 2 +-
fs/fuse/dev.c | 8 ++++----
fs/nfs/write.c | 8 +++-----
fs/reiserfs/journal.c | 2 +-
fs/xfs/linux-2.6/kmem.c | 4 ++--
fs/xfs/linux-2.6/xfs_buf.c | 2 +-
include/linux/backing-dev.h | 11 +++--------
include/linux/blkdev.h | 13 +++++++++----
mm/backing-dev.c | 7 ++++---
mm/memcontrol.c | 2 +-
mm/page-writeback.c | 2 +-
mm/page_alloc.c | 4 ++--
mm/vmscan.c | 8 ++++----
16 files changed, 42 insertions(+), 45 deletions(-)
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
if (retval == -ENOMEM && is_global_init(current)) {
up_read(&current->mm->mmap_sem);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto survive;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 2ddf03a..d69bf9c 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
wakeup = (pd->write_congestion_on > 0
&& pd->bio_queue_size <= pd->write_congestion_off);
spin_unlock(&pd->lock);
- if (wakeup) ...

This is a clean revert against 2.6.31.4
==== CUT HERE ====
Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
Testing by Frans Pop indicates that, in the 2.6.30..2.6.31 window at least,
the commits 373c0a7e 8aa7e847 dramatically increased the number of
GFP_ATOMIC failures that were occurring within a wireless driver. It was
never isolated which of the changes was the exact problem and it's possible
it has been fixed since.
However the fixes, if they exist in mainline, have not been back-ported to
-stable so for the -stable series, it might be best just to revert.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
arch/x86/lib/usercopy_32.c | 2 +-
drivers/block/pktcdvd.c | 10 ++++------
drivers/md/dm-crypt.c | 2 +-
fs/fat/file.c | 2 +-
fs/fuse/dev.c | 8 ++++----
fs/nfs/write.c | 8 +++-----
fs/reiserfs/journal.c | 2 +-
fs/xfs/linux-2.6/kmem.c | 4 ++--
fs/xfs/linux-2.6/xfs_buf.c | 2 +-
include/linux/backing-dev.h | 11 +++--------
include/linux/blkdev.h | 13 +++++++++----
mm/backing-dev.c | 7 ++++---
mm/memcontrol.c | 2 +-
mm/page-writeback.c | 8 ++++----
mm/page_alloc.c | 4 ++--
mm/vmscan.c | 8 ++++----
16 files changed, 45 insertions(+), 48 deletions(-)
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
if (retval == -ENOMEM && is_global_init(current)) {
up_read(&current->mm->mmap_sem);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto survive;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 99a506f..83650e0 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
wakeup = ...

I still think this is a complete red herring.
--
Jens Axboe
--
Oops. No, please no.
8aa7e847 is a regression-fixing commit. Reverting it means the regression
will occur again.
If we really need to revert it, we need to revert 1faa16d2287 too.
However, I doubt this commit really causes a regression in iwlagn. IOW,
I agree with Jens.
I would like to try to reproduce this problem in my test environment. Can
anyone please explain how to reproduce it?
Is special hardware necessary?
----------------------------------------------------
commit 8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0
Author: Jens Axboe <jens.axboe@oracle.com>
Date: Thu Jul 9 14:52:32 2009 +0200
Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
--
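The mismatch described in the quoted commit message can be illustrated with a toy model. The values WRITE == 1 and BLK_RW_ASYNC == 0 come from the commit message; READ == 0 and BLK_RW_SYNC == 1 are filled in from memory, and the idea that the congestion wait queues form a two-element array indexed by the sync/async value is an assumption of this sketch. All `model_*` and `MODEL_*` names are hypothetical.

```c
#include <assert.h>

/* BLK_RW_ASYNC == 0 is from the commit message; SYNC == 1 is assumed. */
enum { MODEL_BLK_RW_ASYNC = 0, MODEL_BLK_RW_SYNC = 1 };
/* WRITE == 1 is from the commit message; READ == 0 is assumed. */
enum { MODEL_READ = 0, MODEL_WRITE = 1 };

/* Hypothetical model: one waiter count per queue, indexed by sync/async. */
struct model_congestion {
	int waiters[2];
};

/* Register a waiter on the queue selected by 'which'; returns that index. */
int model_congestion_wait(struct model_congestion *c, int which)
{
	assert(which == 0 || which == 1);
	c->waiters[which]++;
	return which;
}

/* Wake everyone on queue 'which'; returns how many waiters were woken. */
int model_clear_congested(struct model_congestion *c, int which)
{
	int woken;

	assert(which == 0 || which == 1);
	woken = c->waiters[which];
	c->waiters[which] = 0;
	return woken;
}
```

Under these assumptions, a caller that passes the WRITE value lands on queue 1, so an event that clears async congestion on queue 0 wakes nobody and the caller sleeps until its timeout expires. That is a change in timing rather than in correctness, which is consistent with the discussion in this thread about the revert altering timing.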
This is not intended as a patch for mainline, but just as a test to see if it improves things. It may be a regression fix, but it also creates a significant change in behavior during swapping in my test case. If a fix is needed, it will probably be different from this revert. Please read: http://lkml.org/lkml/2009/10/26/510.

Please see my mails in this thread for bug #14141:
http://thread.gmane.org/gmane.linux.kernel/896714
You will probably need to read some of them to understand the context of the two mails linked above.

The most relevant ones are (all from the same thread; not sure why gmane gives such weird links):
http://article.gmane.org/gmane.linux.kernel.mm/39909
http://article.gmane.org/gmane.linux.kernel.kernel-testers/7228

Not special hardware, but you may need an encrypted partition and NFS; the test may need to be modified according to the amount of memory you have. I think it should be possible to reproduce the freezes I see while ignoring the SKB allocation errors, as IMO those are just a symptom, not the cause. So you should not need wireless.

The severity of the freezes during my test often increases if the test is repeated (without rebooting).

Cheers,
FJP
--
As explained by Jens Axboe, this changes timing but is not the source of the OOMs so the revert is bogus even if it "helps" on some workloads. IIRC the person who reported the revert to help things did report that the OOMs did not go away, they were simply harder to trigger with the revert. --
Agreed, but I wanted to pin down where exactly we stand with this

IIRC, there were mixed reports as to how much the revert helped. I'm hoping that patches 1+2 cover the bases, hence why I asked for them to be tested on their own. Patch 2 in particular might be responsible for watermarks being impacted enough to cause timing problems. I left reverting with patch 5 as a standalone test to see how much of a factor the timing changes introduced are if there are still allocation problems.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
Bug fixes go into main, not linux-next. Let's make sure these fixes really work and then merge. --
Regardless, patches 1-2 should _really_ go to Linus' tree (and eventually -stable) while we figure out the rest of the problems. They fix obvious regressions in the code paths and we have reports from people that they help. Yes, they don't fix everything for everyone, but there's no upside in holding back fixes that are simple one-line fixes to regressions.

Pekka
--
Driver has been changed to allocate paged skb for its receive buffers. This reduces amount of memory needed from order-2 to order-1. This work is significant and will thus be in 2.6.33. Reinette --
What do you want to do for -stable in 2.6.31? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I have just posted two patches to stable. The first is to address a bug in which buffer loss occurs when there is an allocation failure and the second is what Frans has been testing that reduces the noise when these allocations fail. They are:

iwlwifi: fix potential rx buffer loss
	de0bd50845eb5935ce3d503c5d2f565d6cb9ece1 in linux-2.6
iwlwifi: reduce noise when skb allocation fails
	f82a924cc88a5541df1d4b9d38a0968cd077a051 in linux-2.6

Reinette
--
I will see what I can do on the weekend. Unfortunately the crash happens on a somewhat important machine and afterwards the Software-RAID needs a resync which takes a few hours. Sven -- "Those who do not understand Unix are condemned to reinvent it, poorly" (Henry Spencer) /me is giggls@ircnet, http://sven.gegg.us/ on the Web --
On Thu, Oct 22, 2009 at 03:22:31PM +0100, Mel Gorman wrote: No, problem doesn't go away with these patches (1+2+3). However, from my testing this particular patch makes it way, way harder to trigger allocation failures (but these are still present). This bothers me - should I test following patches with or without above patch? This patch makes bug harder to find, IMVHO it doesn't fix the real problem. (Rest not tested yet.) Thanks. --
Ok, I've tested patches 1+2+4 and bug, while very hard to trigger, is still present. I'll test complete 1-4 patchset as time permits. Thanks. --
Hi,

This is the same Mel as mel@csn.ul.ie. The mail server the address is on has no power until Tuesday so I'm not going to be very responsive until then. Monday is also a public holiday here and apparently they are upgrading the power transformers near the building.

And also patch 5 please, which is the revert. Patch 5, as pointed out, is probably a red herring. However, it has changed the timing and made a difference for some testing so I'd like to know if it helps yours as well.

As things stand, it looks like patches 1+2 should certainly go ahead. I need to give more thought to patches 3 and 4 as to why they help Tobias but not anyone else's testing.

Thanks
--
I've tested patches 1+2+3+4 in my normal usage scenario (do some work, suspend, do work, suspend, ...) and it failed today after 4 days (== 4 suspend-resume cycles). I'll test 1-5 now. Thanks. --
I was digging through commits for suspend-related changes. Rafael, is there any chance that some change to suspend is responsible for this regression? This commit for example is a vague possibility; c6f37f12197ac3bd2e5a35f2f0e195ae63d437de: PM/Suspend: Do not shrink memory before suspend I say vague because FREE_PAGE_NUMBER is so small. Also, what was the behaviour of the e100 driver when suspending before this commit? 6905b1f1a03a48dcf115a2927f7b87dba8d5e566: Net / e100: Fix suspend of devices that cannot be power managed -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
This was discussed before with e100 maintainers and Rafael. Reverting this patch didn't change anything. Thanks. --
Does applying the following on top make any difference?
==== CUT HERE ====
PM: Shrink memory before suspend
This is a partial revert of c6f37f12197ac3bd2e5a35f2f0e195ae63d437de. It
is an outside possibility for fixing the e100 bug where an order-5
allocation is failing during resume. The commit notes that the shrinking
of memory should be unnecessary but maybe it is in error.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 6f10dfc..4f6ae64 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -23,6 +23,9 @@ const char *const pm_states[PM_SUSPEND_MAX] = {
[PM_SUSPEND_MEM] = "mem",
};
+/* This is just an arbitrary number */
+#define FREE_PAGE_NUMBER (100)
+
static struct platform_suspend_ops *suspend_ops;
/**
@@ -78,6 +81,7 @@ static int suspend_test(int level)
static int suspend_prepare(void)
{
int error;
+ unsigned int free_pages;
if (!suspend_ops || !suspend_ops->enter)
return -EPERM;
@@ -92,10 +96,24 @@ static int suspend_prepare(void)
if (error)
goto Finish;
- error = suspend_freeze_processes();
+ if (suspend_freeze_processes()) {
+ error = -EAGAIN;
+ goto Thaw;
+ }
+
+ free_pages = global_page_state(NR_FREE_PAGES);
+ if (free_pages < FREE_PAGE_NUMBER) {
+ pr_debug("PM: free some memory\n");
+ shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
+ if (nr_free_pages() < FREE_PAGE_NUMBER) {
+ error = -ENOMEM;
+ printk(KERN_ERR "PM: No enough memory\n");
+ }
+ }
if (!error)
return 0;
+ Thaw:
suspend_thaw_processes();
usermodehelper_enable();
Finish:
--
No, this patch didn't change anything either.

IIRC I get failures while free(1) shows as much as 20MB free RAM (i.e. without buffers/caches). Additionally, nr_free_pages (from /proc/vmstat) stays at about 800-1000 under heavy memory pressure (gitk on full linux repository).

--- babbling follows ---

Hmm, I wonder: if it's really a timing issue, wouldn't lowering the swappiness sysctl make the problem more visible? I've vm.swappiness=15; would testing with a higher value make any sense?

Thanks.
--
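One way a high-order request can fail while the free-page count looks healthy is fragmentation. The sketch below is a userspace model written from memory of how the allocator's watermark check treats high-order requests (free blocks smaller than the request are discounted order by order while the required reserve is halved). It is an illustration under that assumption, not the kernel's code and not a diagnosis of this report; the struct, the constant and all `model_*` names are hypothetical, and lowmem reserves and allocation-flag adjustments are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 11  /* assumed number of buddy orders (0..10) */

/* Hypothetical model of one zone: count of free blocks at each order. */
struct model_zone {
	unsigned long nr_free[MODEL_MAX_ORDER];
};

/* Total free pages across all orders. */
long model_total_free(const struct model_zone *z)
{
	long total = 0;
	int o;

	for (o = 0; o < MODEL_MAX_ORDER; o++)
		total += (long)(z->nr_free[o] << o);
	return total;
}

/*
 * Would an allocation of 2^order pages leave the zone above 'mark'?
 * Blocks smaller than the request cannot satisfy it, so they are
 * discounted one order at a time while the required reserve is halved.
 */
bool model_watermark_ok(const struct model_zone *z, int order, long mark)
{
	long min = mark;
	long free_pages;
	int o;

	assert(order >= 0 && order < MODEL_MAX_ORDER);
	free_pages = model_total_free(z) - (1L << order) + 1;

	if (free_pages <= min)
		return false;
	for (o = 0; o < order; o++) {
		/* Pages held in blocks of this order are unusable above it. */
		free_pages -= (long)(z->nr_free[o] << o);
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}
```

With 900 free pages that are all single-page blocks and an arbitrary illustrative mark of 37 pages, the model passes an order-0 request but rejects an order-5 one, whereas 320 free pages held as ten order-5 blocks pass the order-5 check.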
I have been testing 1+2, 1+2+3 as well as 3+4 and had been under the assumption that 3+4 does help ...

I have now been running a modified version of 4 which prints a warning instead of doing anything ... I have now seen the allocation issue again without the warning being printed.

So in other words:
1+2+3 make the problem less severe, but do not solve it
4 seems to be a red herring.

cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900
--
I needed a break and have thus been off-line for a few days. Good to see there's been progress. I'll try to do some testing tomorrow. Cheers, FJP --
Problem persists. RAID resync in progress :( Sven -- "linux is evolution, not intelligent design" (Linus Torvalds) /me is giggls@ircnet, http://sven.gegg.us/ on the Web --
What about the rest of the patches, any luck? Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Hi Mel,

I have now done additional tests ... and can report the following:
3 alone does not help
3+4 does ...

cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900
--
This is a bit surprising..... Tell me, do you have an Intel IO-MMU on your system by any chance? It should be mentioned in either dmesg or lspci -v (please send the full output of both). If you do have one of these things, I notice they abuse PF_MEMALLOC which would explain why this patch makes a difference to your testing. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I've tested against 2.6.31.1 as it's easier for me to compare behaviors with that than with .32. All patches applied without problems against .31.

I've also tested 2.6.31.1 with SLAB instead of SLUB, but that does not seem

Does not look to make any difference. Possibly causes more variation in the

Applied on top of patches 1-4. Despite Jens' scepticism, this is still the patch that makes the most significant difference in my test. The reading of commits in gitk is much more fluent and music skips are a lot less severe. But most important is that there is no long total freeze of the system halfway during the reading of commits, and gitk loads fastest. It also gives by far the most consistent results. The likelihood of SKB allocation errors during the test is a lot smaller.

See also http://lkml.org/lkml/2009/10/26/455.

Detailed test results follow. I've done 2 test runs with each kernel (3 for the last).

The columns below give the following info:
- time at which all commits have been read by gitk
- time at which gitk fills in "branch", "follows" and "precedes" data for the current commit
- time at which there's no longer any disk activity, i.e. when gitk is fully loaded and all swapping is done
- total number of SKB allocation errors during the test

A "freeze" during the reading of commits is indicated by an "f" (short freeze) or "F" (long "hard" freeze). An "S" shows when there were SKB allocation errors.

                          end commits   show branch   done   SKB errs
1) vanilla .31.1
   run 1:                 1:20 fFS      2:10 S        2:30   44 a)
   run 2:                 1:35 FS       1:45          2:10   13
2) .31.1 + patches 1-2
   run 1:                 2:30 fFS      2:45          3:00   58
   run 2:                 1:15 fS       2:00          2:20   2 a)
3) .31.1 + patches 1-3
   run 1:                 1:00 fS       1:15          1:45   1 *)
   run 2:                 3:00 fFS      3:15          3:30   33
   *) unexpected; fortunate timing?
4) .31.1 + patches 1-4
   run 1:                 1:10 ffS      1:55 S        2:20   35 a)
   run 2:                 3:05 fFS      3:15          3:25   36
5) .31.1 + patches 1-5
   run 1:                 1:00          1:15          1:35   0
   run 2:                 0:50          1:15 S        1:45   45
...
Forgot to mention that each run was after a reboot, so they are not interdependent. --
I've also seen order-0 failures on 2.6.31.5:

Note that this is with one process hogging and mlocking memory and min_free_kbytes reduced to 100 to reproduce the problem more easily.

I tried bisecting the issue, but in the end without memory pressure I can't reproduce it reliably and with the above mentioned pressure I get allocation failures even on 2.6.30.o

Initially the issue was that the machine hangs after the allocation failure, but that seems to be a netconsole related issue, since I didn't get a hang on 2.6.31 compiled without netconsole.

http://lkml.org/lkml/2009/11/1/66
http://lkml.org/lkml/2009/11/5/100

[ 375.398423] swapper: page allocation failure. order:0, mode:0x20
[ 375.398483] Pid: 0, comm: swapper Not tainted 2.6.31.5-nokmem-tomodachi #3
[ 375.398519] Call Trace:
[ 375.398566] [<c10395a8>] ? __alloc_pages_nodemask+0x40f/0x453
[ 375.398613] [<c104e988>] ? cache_alloc_refill+0x1f3/0x382
[ 375.398648] [<c104eb76>] ? __kmalloc+0x5f/0x97
[ 375.398690] [<c1228003>] ? __alloc_skb+0x44/0x101
[ 375.398723] [<c12289b5>] ? dev_alloc_skb+0x11/0x25
[ 375.398760] [<c11a53a1>] ? tulip_refill_rx+0x3c/0x115
[ 375.398793] [<c11a57f7>] ? tulip_poll+0x37d/0x416
[ 375.398832] [<c122cf56>] ? net_rx_action+0x3a/0xdb
[ 375.398874] [<c101d8b0>] ? __do_softirq+0x5b/0xcb
[ 375.398908] [<c101d855>] ? __do_softirq+0x0/0xcb
[ 375.398937] <IRQ> [<c1003e0b>] ? do_IRQ+0x66/0x76
[ 375.398989] [<c1002d70>] ? common_interrupt+0x30/0x38
[ 375.399026] [<c100725c>] ? default_idle+0x25/0x38
[ 375.399058] [<c1001a1e>] ? cpu_idle+0x64/0x7a
[ 375.399102] [<c14105ff>] ? start_kernel+0x251/0x258
[ 375.399133] Mem-Info:
[ 375.399159] DMA per-cpu:
[ 375.399184] CPU 0: hi: 0, btch: 1 usd: 0
[ 375.399214] Normal per-cpu:
[ 375.399241] CPU 0: hi: 90, btch: 15 usd: 28
[ 375.399276] Active_anon:6709 active_file:1851 inactive_anon:6729
[ 375.399278] inactive_file:2051 unevictable:40962 dirty:998 writeback:914 unstable:0
[ 375.399281] ...
To be honest, it's not entirely unexpected with min_free_kbytes set that low. The system should cope with a certain amount of pressure but with pressure and a low min_free_kbytes, the system will simply be reacting too late to free memory in the non-atomic paths. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
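To put rough numbers on "reacting too late", here is a back-of-the-envelope sketch. It assumes a single zone that receives the whole min_free_kbytes budget and that the low and high watermarks sit at min + min/4 and min + min/2, which is how these kernels derive them as far as I recall; real systems split the budget across zones. The struct and the `model_*` names are hypothetical.

```c
#include <assert.h>

/* Hypothetical container for one zone's watermarks, in pages. */
struct model_wmarks {
	long min;
	long low;
	long high;
};

/*
 * Derive watermarks for a single zone that receives the whole
 * min_free_kbytes budget. Assumes low = min + min/4 and
 * high = min + min/2.
 */
struct model_wmarks model_wmarks_from_kbytes(long min_free_kbytes,
					     long page_size_kb)
{
	struct model_wmarks w;

	assert(min_free_kbytes >= 0 && page_size_kb > 0);
	w.min = min_free_kbytes / page_size_kb;
	w.low = w.min + w.min / 4;
	w.high = w.min + w.min / 2;
	return w;
}

/* Pages that can be handed out between the low and min watermarks. */
long model_reclaim_headroom(const struct model_wmarks *w)
{
	return w->low - w->min;
}
```

For min_free_kbytes set to 100 with 4 KiB pages this gives min = 25, low = 31 and high = 37 pages. If background reclaim only starts once free memory drops below the low mark, it has about 6 pages (24 KiB) of slack before atomic allocations start eating into the minimum reserve.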
That was on vanilla 2.6.31.5. I tried 2.6.31.5 before with patches 1+2 and netconsole enabled and still got the order-1 failures (apparently I get order-1 failures

Maybe I should try again on 2.6.30 without netconsole and try increasing min_free_kbytes until the allocation failures disappear, and try to bisect again with that setting...

--
Tobias
PGP: http://8ef7ddba.uguu.de
--
