Re: 2.6.26-rc5-mm2

From: Andrew Morton
Date: Monday, June 9, 2008 - 10:31 pm

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-mm2/

- This is a bugfixed version of 2.6.26-rc5-mm1 - mainly to repair a
  vmscan.c bug which would have prevented testing of the other vmscan.c
  bugs^Wchanges.


Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
  mm-commits mailing list.

        echo "subscribe mm-commits" | mail majordomo@vger.kernel.org

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
  most valuable if you can perform a bisection search to identify which patch
  introduced the bug.  Instructions for this process are at

        http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt

  But beware that this process takes some time (around ten rebuilds and
  reboots), so consider reporting the bug first and if we cannot immediately
  identify the faulty patch, then perform the bisection search.

- When reporting bugs, please try to Cc: the relevant maintainer and mailing
  list on any email.

- When reporting bugs in this kernel via email, please also rewrite the
  email Subject: in some manner to reflect the nature of the bug.  Some
  developers filter by Subject: when looking for messages to read.

- Occasional snapshots of the -mm lineup are uploaded to
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on
  the mm-commits list.  These probably are at least compilable.

- More-than-daily -mm snapshots may be found at
  http://userweb.kernel.org/~akpm/mmotm/.  These are almost certainly not
  compileable.
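
The bisection search mentioned in the boilerplate above is a plain binary search over the ordered patch series. The sketch below is only an illustration of why the number of rebuilds stays small; `bisect_first_bad`, `shows_bug_fn` and `toy_shows_bug` are hypothetical names, and the predicate stands in for one rebuild, reboot and test run.

```c
#include <stddef.h>

/*
 * Hypothetical predicate: nonzero if a tree with the first n_applied
 * patches of the series shows the bug. In real life each call is one
 * rebuild, reboot and test run.
 */
typedef int (*shows_bug_fn)(size_t n_applied, void *ctx);

/*
 * Return the 1-based index of the first patch that introduces the bug.
 * Precondition: with 0 patches applied the tree is good, with all
 * `total` patches applied it is bad.
 * If `steps` is non-NULL it receives the number of predicate calls.
 */
size_t bisect_first_bad(size_t total, shows_bug_fn shows_bug, void *ctx,
                        unsigned *steps)
{
    size_t good = 0;      /* largest prefix known to be good */
    size_t bad = total;   /* smallest prefix known to be bad */
    unsigned n = 0;

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (shows_bug(mid, ctx))
            bad = mid;
        else
            good = mid;
        n++;
    }
    if (steps)
        *steps = n;
    return bad;
}

/* Toy stand-in for a test run: the bug appears once patch *ctx is applied. */
int toy_shows_bug(size_t n_applied, void *ctx)
{
    return n_applied >= *(size_t *)ctx;
}
```

For a series of 1000 patches (an illustrative figure, not a count of this tree) the loop needs at most ten iterations, which is in line with the "around ten rebuilds and reboots" estimate.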



Changes since 2.6.26-rc5-mm1:

 origin.patch
 linux-next.patch
 git-jg-misc.patch
 git-leds.patch
 ...
From: Nick Piggin
Date: Monday, June 9, 2008 - 11:12 pm

BTW. this is known to be broken with x86 1GB pages and direct-IO, due
to interaction between huge pages patchset and lockless get_user_pages.

My fault. I was away from the screen over the long weekend here, and
didn't give Andrew the heads-up in time.

This isn't going to be a problem unless you explicitly enable GB pages
and run direct IO (or splice) into or out of them. I can give a fixup
patch to anyone interested in doing so.
--

From: Nick Piggin
Date: Tuesday, June 10, 2008 - 12:28 am

BTW. would be trying to test this more myself, but last mm I based the
lockless patches on didn't boot, and this one dies pretty quickly when
you try to get into reclaim:

------------[ cut here ]------------
kernel BUG at mm/swap_state.c:77!
invalid opcode: 0000 [1] SMP DEBUG_PAGEALLOC
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 7
Modules linked in:
Pid: 13550, comm: sh Not tainted 2.6.26-rc5-mm2-dirty #412
RIP: 0010:[<ffffffff80288689>]  [<ffffffff80288689>] add_to_swap_cache+0xd9/0x120
RSP: 0018:ffff81010c62d8a8  EFLAGS: 00010246
RAX: 2000000000020009 RBX: ffffe2000107da88 RCX: c000000000000000
RDX: 0000000000000020 RSI: 000000000000eea2 RDI: ffffe2000107da88
RBP: ffff81010c62d8c8 R08: fffffffffa48016e R09: 0000000000000000
R10: ffffffff80857fa0 R11: 2222222222222222 R12: ffff81012e126520
R13: 000000000000eea2 R14: ffff8100727bea20 R15: ffff81010c62d9b8
FS:  00002b5b33cafdc0(0000) GS:ffff81012ff07800(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000175e280 CR3: 000000012e292000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sh (pid: 13550, threadinfo ffff81010c62c000, task ffff810116b01110)
Stack:  ffff81010c62d8c8 ffffe2000107da88 ffff81012e126520 ffff81012e126400
 ffff81010c62d908 ffffffff80292851 000000000000eea2 ffff81012e126708
 ffffe2000107da88 ffffffff80701420 ffff81010c62db68 ffff81010c62dc88
Call Trace:
 [<ffffffff80292851>] shmem_writepage+0x121/0x200
 [<ffffffff80277479>] shrink_page_list+0x559/0x6b0
 [<ffffffff802777ec>] shrink_list+0x21c/0x520
 [<ffffffff80273365>] ? determine_dirtyable_memory+0x15/0x30
 [<ffffffff802733a2>] ? get_dirty_limits+0x22/0x2a0
 [<ffffffff80277d31>] shrink_zone+0x241/0x330
 [<ffffffff80278207>] try_to_free_pages+0x237/0x3a0
 [<ffffffff80276530>] ? isolate_pages_global+0x0/0x270
 [<ffffffff80272546>] __alloc_pages_internal+0x206/0x4b0
 ...
From: Andrew Morton
Date: Tuesday, June 10, 2008 - 1:34 am

It would be good if you could find a day to look through those changes
please.  It's pretty important.

--

From: Nick Piggin
Date: Tuesday, June 10, 2008 - 1:48 am

Doesn't look like it, but I hadn't followed the changes too closely:

OK, I could have a look through them at some point.

Just something very quick while I have Rik's attention: there are
atomic SetPageSwapBacked bitops all over a lot of mm/ fastpaths, which
is the kind of thing I have been slowly working to get rid of over the
past years. Maybe some don't consider it a big deal, but a single one
costs anywhere from 100 - 500 instructions on desktop CPUs, not
including secondary effects of memory ordering and the compiler barrier.
Please go through and ensure you know your page references and ->flags
concurrency, and cut these down to a bare minimum.
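
The distinction being asked for can be sketched in user-space C11. This is an analogy only, not the kernel's page-flags code: `toy_page`, `TOY_PG_SWAPBACKED` and both helpers are hypothetical names. The point is that an atomic read-modify-write is only needed when another CPU may update the same flags word concurrently; while a page is still private to its allocator, a plain update is enough (the kernel's non-atomic bit helpers, such as `__set_bit()`, follow the same idea).

```c
#include <stdatomic.h>

#define TOY_PG_SWAPBACKED 3   /* hypothetical bit number */

struct toy_page {
    atomic_ulong flags;
};

/*
 * Atomic read-modify-write: safe while other threads may be changing
 * other bits in the same word, but it implies a locked instruction
 * and ordering on most CPUs.
 */
static inline void toy_set_flag_atomic(struct toy_page *p, unsigned bit)
{
    atomic_fetch_or_explicit(&p->flags, 1UL << bit, memory_order_seq_cst);
}

/*
 * Plain load and store: only correct while nobody else can see the
 * page yet, e.g. right after allocation and before it is published.
 */
static inline void toy_set_flag_private(struct toy_page *p, unsigned bit)
{
    unsigned long v = atomic_load_explicit(&p->flags, memory_order_relaxed);

    atomic_store_explicit(&p->flags, v | (1UL << bit), memory_order_relaxed);
}
```

Which variant is legal at a given call site depends on who else can hold a reference to the page at that point, which is exactly the "know your page references and ->flags concurrency" review requested above.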

Is the plan to merge all reclaim changes in a big hit, rather than
slowly trickle in the different independent changes?
--

From: Andrew Morton
Date: Tuesday, June 10, 2008 - 2:15 am

It's going to take a lot of work to get such extensive reclaim changes
landed.

We need to convince ourselves that these changes are the right way to
fix <whatever they fix>.  We need to review and test the crap out of
them.  The 64-bit-only thing is a concern.  I wonder about whether
we've "fixed" anon pages but didn't do anything about file-backed
mapped pages.  Plus all the other stuff, plus stuff we haven't thought
of yet.

It's huge.
--

From: Rik van Riel
Date: Tuesday, June 10, 2008 - 5:34 am

On Tue, 10 Jun 2008 02:15:19 -0700

Quite possible.  The reclaim policy for file-backed pages has not
changed.  We don't know yet whether we'll have to change that, too.

-- 
All rights reversed.
--

From: Rik van Riel
Date: Wednesday, June 11, 2008 - 11:09 am

On Tue, 10 Jun 2008 18:48:21 +1000



My original plan was to merge them incrementally, but Andrew is
right that we should give the whole set as much testing as
possible.

I have done all the cleanups Andrew asked and fixed the bugs
that I found after that merge/cleanup.  Your bug is the one
I still need to fix before giving Andrew a whole new set of
split LRU patches to merge.

(afterwards, I will go incremental fixes only - the cleanups
he asked for were just too big to do as incrementals)

-- 
All rights reversed.
--

From: Nick Piggin
Date: Wednesday, June 11, 2008 - 4:58 pm

I'm sorry, hmm I didn't look closely enough and forgot that
write_begin/write_end requires the callee to allocate the page
as well, and that Hugh had nicely unified most of that.

So maybe it's not that. It's pretty easy to hit I found with

OK.
--

From: Rik van Riel
Date: Thursday, June 12, 2008 - 12:29 pm

On Thu, 12 Jun 2008 09:58:38 +1000

Turns out the loopback driver uses splice, which moves
the pages from one place to another.  This is why you
were seeing the problem with loopback, but not with
just a really big file on tmpfs.

I'm trying to make sense of all the splice code now
and will send fix as soon as I know how to fix this
problem in a nice way.

-- 
All Rights Reversed
--

From: Hugh Dickins
Date: Thursday, June 12, 2008 - 2:15 pm

The loop-on-tmpfs write side is okay nowadays, but the read side

There's no need to make sense of all the splice code, it's just
that it's doing add_to_page_cache_lru (on a page not marked as
SwapBacked), then shmem and swap_state consistency relies on it
as having been marked as SwapBacked.  Normally, yes, shmem_getpage
is the one that allocates the page, but in this case it's already
been done outside, awkward (and long predates loop's use of splice).

It's remarkably hard to correct the LRU of a page once it's been
launched towards one.  Is it still on this cpu's pagevec?  Have we
been preempted and it's on another cpu's pagevec?  If it's reached
the LRU, has vmscan whisked it off for a moment, even though it's
PageLocked?  Until now it's been that the LRUs are self-correcting,
but these patches move away from that.

I don't know how to fix this problem in a nice way.  For the moment,
to proceed with testing, I'm using the hack below.  But perhaps that
screws things up for the other !mapping_cap_account_dirty filesystems
e.g. ramfs, I just haven't tried them yet - nor shall in the next
couple of days.

It could be turned into a proper bdi check of its own, instead of
parasiting off cap_account_dirty.  But I'm not yet convinced by any
of the PageSwapBacked stuff, so currently preferring a quick hack
to a grand scheme.

It's not clear to me why tmpfs file pages should be counted as anon
pages rather than file pages; though it is clear that switching their
LRU midstream, when swizzled to swap, can have implementation problems.

I don't really get why SwapBacked is the important consideration:
I can see that you may want different balancing for pages mapped
into userspace from pages just cached in kernel; but SwapBacked?

Am I right to think that the memcontrol stuff is now all broken,
because memcontrol.c hasn't yet been converted to the more LRUs?
Certainly I'm now hanging when trying to run in a restricted memcg.

Unrelated fix to compiler warning and silly ...
From: Rik van Riel
Date: Friday, June 13, 2008 - 10:45 am

On Thu, 12 Jun 2008 22:15:54 +0100 (BST)

Yeah, it will break ramfs.  Also, we need to take care of
splice going in the opposite direction (moving a page from
SwapBacked to filesystem backed).

I guess we'll need per-mapping flags to help determine where
a page goes at add_to_page_cache_lru() time.
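
A toy model of that per-mapping idea, purely as a sketch: none of the `toy_` names exist in the kernel, and the real add_to_page_cache_lru() does far more. It only shows the shape of the proposal, where the page inherits its SwapBacked state from the mapping before the LRU list is chosen, so a generic caller such as the splice read path would not need to know it is dealing with tmpfs.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_lru { TOY_LRU_FILE, TOY_LRU_ANON };

/* Hypothetical per-mapping flag, set once by the filesystem (e.g. tmpfs). */
struct toy_mapping {
    bool swap_backed;
};

struct toy_page {
    bool swap_backed;              /* models the PageSwapBacked page flag */
    enum toy_lru lru;              /* which LRU the page was put on */
    struct toy_mapping *mapping;
};

/*
 * Model of deciding the LRU at add time: copy the mapping's property
 * into the page flag first, then pick the list from the page flag.
 */
void toy_add_to_page_cache_lru(struct toy_page *page,
                               struct toy_mapping *mapping)
{
    page->mapping = mapping;
    page->swap_backed = mapping->swap_backed;
    page->lru = page->swap_backed ? TOY_LRU_ANON : TOY_LRU_FILE;
}
```

The page flag is still kept in this model because, as noted below, it has to outlive the page's association with the mapping.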

This does not remove our need for the page flags, because
those need to survive until the del_page_from_lru() call
in __page_cache_release(), by which time the page->mapping

I believe memcontrol has been converted.  Of course, maybe

I sent the fix for that one to Andrew already.  I believe
it's in his mmotm tree.

-- 
All Rights Reversed
--

From: Hugh Dickins
Date: Friday, June 13, 2008 - 2:15 pm

No, that's a different, and blessedly non-existent, problem.

The swap_state.c:77s we're seeing with loop-on-tmpfs-file just comes
from __generic_file_splice_read doing add_to_page_cache_lru without
knowing that the filesystem it's dealing with is tmpfs, which unlike
every other filesystem sets and expects PageSwapBacked on its pages.
(I expect you started out without that, then hit problems when tmpfs
moved its file pages to swap cache, so you therefore elected to make
them SwapBacked from the start.)

You could certainly argue that tmpfs should therefore have its own
shmem_file_splice_read instead of using generic_file_splice_read;
but I'd rather hate to duplicate that splice code within shmem.c just
for this reason, would prefer that __generic_file_splice_read deduce it's
dealing with tmpfs and SetPageSwapBacked before add_to_page_cache_lru
(probably better that way than within add_to_page_cache_lru as I did).

Though I'd even more prefer to find a way of avoiding it altogether:
I've yet to think through on that.

But this is hardly a splice problem, it's just that splice is the
only thing which ever goes the problematic shmem_readpage route.

When above you say that we also need to take care of going the
opposite direction, you're thinking about splice stealing pages
from one mapping and giving them to another, the essence of splice.
But see Nick's year-old 485ddb4b9741bafb70b22e5c1f9b4f37dc3e85bd
"splice: dont steal" patch: that stealing is currently dead code,
so you shouldn't spend time worrying about how to deal with it.

The better way would be to add a backing_dev_info flag.  (At one
point I had been going to criticize your per-mapping AS_UNEVICTABLE,
to say that one should be a backing_dev_info flag; but no, you're


Ah, yes, there are NR_LRU_LISTS arrays in there now, so it has
the appearance of having been converted.  Fine, then it's worth
my looking into why it isn't actually working as intended.

Hugh
--

From: Rik van Riel
Date: Friday, June 13, 2008 - 3:03 pm

On Fri, 13 Jun 2008 22:15:01 +0100 (BST)


I believe that Lee and Kosaki-san have tested this code,
so the breakage could be pretty new.

-- 
All rights reversed.
--

From: Lee Schermerhorn
Date: Tuesday, June 10, 2008 - 8:34 am

I put those C++ TODO comments in there specifically to raise their
visibility in hopes that someone [like you :)] would notice and maybe
have an answer to the question.  I noted the issue in the change log as
well--i.e., that I had moved set_pte_at() to after the lru_cache_add and
new_rmap.  The existing order may be that way for a reason, but it's
not clear [to me] what that reason is.  As I noted, do_anonymous_page()
sets the pte after the lru_add and new_rmap.

I agree, these questions need to be answered and the TODO's resolved
before merging.   Any thoughts as to the ordering?

Lee


--

From: Hugh Dickins
Date: Tuesday, June 10, 2008 - 9:50 am

The ordering of lru_cache_add*, page_add_*_rmap and set_pte_at does
not matter (but update_mmu_cache must come after set_pte_at not before).

Even if the page table lock were not held across them (it is), I think
their ordering would not matter much (just benign races); though it's
always worth keeping in mind that once you've done the lru_cache_add,
that page is now visible to vmscan.c.

But I'm all in favour of you imposing consistency there (as part of
a wider patch? perhaps not; and do_swap_page does now look out of step).
It can sometimes help when inserting debug checks e.g. on page_mapcount.

I think you'll find the lru_cache_add_active_or_noreclaim could
actually be moved into page_add_new_rmap - I found that helpful when
working on eliminating the PageSwapCache flag (work now grown out of
date, I'm afraid), to know that the page was not publicly visible
until I did lru_cache_add_active at the end of page_add_new_rmap.

Hugh
--

From: Grant Coady
Date: Tuesday, June 10, 2008 - 3:20 am

No it's not :)

-mm1 worked fine here but -mm2 locks up just after saying:
agpgart: Detected 7164K stolen memory.

Nothing in logs (session not recorded - hit reset to restart).

config and dmesg for -mm1 at (same .config for mm2):

  http://bugsplatter.mine.nu/test/boxen/pooh/config-2.6.26-rc5-mm1a.gz
  http://bugsplatter.mine.nu/test/boxen/pooh/dmesg-2.6.26-rc5-mm1a.gz

Grant.
--

From: Andrew Morton
Date: Tuesday, June 10, 2008 - 11:18 am

hm, intel-agp gtt stuff.

Can you please see whether reverting Keith's stuff fixes it?

 drivers/char/agp/agp.h       |    3 ---
 drivers/char/agp/backend.c   |    2 --
 drivers/char/agp/generic.c   |   28 ----------------------------
 drivers/char/agp/intel-agp.c |    5 -----
 include/linux/agp_backend.h  |    5 -----
 5 files changed, 43 deletions(-)

diff -puN drivers/char/agp/agp.h~revert-intel-agp-rewrite-gtt-on-resume drivers/char/agp/agp.h
--- a/drivers/char/agp/agp.h~revert-intel-agp-rewrite-gtt-on-resume
+++ a/drivers/char/agp/agp.h
@@ -148,9 +148,6 @@ struct agp_bridge_data {
 	char minor_version;
 	struct list_head list;
 	u32 apbase_config;
-	/* list of agp_memory mapped to the aperture */
-	struct list_head mapped_list;
-	spinlock_t mapped_lock;
 };
 
 #define KB(x)	((x) * 1024)
diff -puN drivers/char/agp/backend.c~revert-intel-agp-rewrite-gtt-on-resume drivers/char/agp/backend.c
--- a/drivers/char/agp/backend.c~revert-intel-agp-rewrite-gtt-on-resume
+++ a/drivers/char/agp/backend.c
@@ -183,8 +183,6 @@ static int agp_backend_initialize(struct
 		rc = -EINVAL;
 		goto err_out;
 	}
-	INIT_LIST_HEAD(&bridge->mapped_list);
-	spin_lock_init(&bridge->mapped_lock);
 
 	return 0;
 
diff -puN drivers/char/agp/generic.c~revert-intel-agp-rewrite-gtt-on-resume drivers/char/agp/generic.c
--- a/drivers/char/agp/generic.c~revert-intel-agp-rewrite-gtt-on-resume
+++ a/drivers/char/agp/generic.c
@@ -426,10 +426,6 @@ int agp_bind_memory(struct agp_memory *c
 
 	curr->is_bound = TRUE;
 	curr->pg_start = pg_start;
-	spin_lock(&agp_bridge->mapped_lock);
-	list_add(&curr->mapped_list, &agp_bridge->mapped_list);
-	spin_unlock(&agp_bridge->mapped_lock);
-
 	return 0;
 }
 EXPORT_SYMBOL(agp_bind_memory);
@@ -462,34 +458,10 @@ int agp_unbind_memory(struct agp_memory 
 
 	curr->is_bound = FALSE;
 	curr->pg_start = 0;
-	spin_lock(&curr->bridge->mapped_lock);
-	list_del(&curr->mapped_list);
-	spin_unlock(&curr->bridge->mapped_lock);
 	return 0;
 }
 ...
From: Grant Coady
Date: Tuesday, June 10, 2008 - 2:48 pm

Yes, it does :)

config + dmesg at: http://bugsplatter.mine.nu/test/boxen/pooh/ (*-mm2b.gz)


--

From: Helge Hafting
Date: Tuesday, June 10, 2008 - 4:50 am

Interesting to try out, but I got this:

  $ make
   CHK     include/linux/version.h
   CHK     include/linux/utsrelease.h
   CALL    scripts/checksyscalls.sh
   CHK     include/linux/compile.h
   CC      mm/vmscan.o
mm/vmscan.c: In function 'show_page_path':
mm/vmscan.c:2419: error: 'struct mm_struct' has no member named 'owner'
make[1]: *** [mm/vmscan.o] Error 1
make: *** [mm] Error 2


I then tried to configure with "Track page owner", but that did not 
change anything.

Helge Hafting
--

From: Johannes Weiner
Date: Tuesday, June 10, 2008 - 5:23 am

Hi,


CONFIG_PAGE_OWNER is something else, owner is only active if
CONFIG_MM_OWNER is set.
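
A minimal illustration of why the build fails, using made-up names (`TOY_CONFIG_MM_OWNER`, `toy_mm`, `toy_task`) rather than the real kernel structures: when a struct member only exists under a config symbol, every access to it has to be guarded by the same symbol.

```c
#include <stddef.h>

/* Remove this define to model a kernel built without the option. */
#define TOY_CONFIG_MM_OWNER 1

struct toy_task {
    int pid;
};

struct toy_mm {
    int users;
#ifdef TOY_CONFIG_MM_OWNER
    struct toy_task *owner;   /* member exists only with the option set */
#endif
};

/* Return the owner's pid, or -1 if there is no owner information. */
int toy_mm_owner_pid(const struct toy_mm *mm)
{
#ifdef TOY_CONFIG_MM_OWNER
    return mm->owner ? mm->owner->pid : -1;
#else
    (void)mm;                 /* nothing to look at without the option */
    return -1;
#endif
}
```

Note that `#ifdef` on a symbol that is undefined, or misspelled, silently compiles the guarded block out with no diagnostic, so the spelling of the guard matters.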

	Hannes
--

From: Andrew Morton
Date: Tuesday, June 10, 2008 - 11:37 am

Thanks.  I guess this will get you going.

--- a/mm/vmscan.c~mm-only-vmscan-noreclaim-lru-scan-sysctl-fix
+++ a/mm/vmscan.c
@@ -2400,6 +2400,7 @@ static void show_page_path(struct page *
 		       dentry_path(dentry, buf, 256), pgoff);
 		spin_unlock(&mapping->i_mmap_lock);
 	} else {
+#ifdef CONFIG_MM_OWNER
 		struct anon_vma *anon_vma;
 		struct vm_area_struct *vma;
 
@@ -2413,6 +2414,7 @@ static void show_page_path(struct page *
 			break;
 		}
 		page_unlock_anon_vma(anon_vma);
+#endif
 	}
 }
 
_


--

From: Helge Hafting
Date: Thursday, June 12, 2008 - 1:13 am

Thanks, that did the trick. It compiled fine this time.

Helge Hafting
--

From: Yasunori Goto
Date: Tuesday, June 10, 2008 - 7:26 pm

There is a compile error in mm/memory_hotplug.c; the patch below fixes it.
Evidently this call site was simply left behind when the interface of
isolate_lru_page() changed. :-(

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>


---
 mm/memory_hotplug.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c
+++ current/mm/memory_hotplug.c
@@ -595,8 +595,9 @@ do_migrate_range(unsigned long start_pfn
 		 * We can skip free pages. And we can only deal with pages on
 		 * LRU.
 		 */
-		ret = isolate_lru_page(page, &source);
+		ret = isolate_lru_page(page);
 		if (!ret) { /* Success */
+			list_add_tail(&page->lru, &source);
 			move_pages--;
 		} else {
 			/* Becasue we don't have big zone->lock. we should

-- 
Yasunori Goto 


--

From: Alexey Dobriyan
Date: Tuesday, June 10, 2008 - 11:00 pm

OOM condition happened with 1G free swap.

4G RAM, 1G swap partition, normally LTP survives during much, much higher
load.

vm.overcommit_memory = 0
vm.overcommit_ratio = 50

[    0.442034] TCP bind hash table entries: 65536 (order: 9, 3670016 bytes)
[    0.447278] TCP: Hash tables configured (established 262144 bind 65536)
[    0.447411] TCP reno registered
[    0.459744] NET: Registered protocol family 1
[    0.477840] msgmni has been set to 7862
[    0.477840] io scheduler noop registered
[    0.477840] io scheduler cfq registered (default)
[    0.478136] pci 0000:01:00.0: Boot video device
[    0.487568] Real Time Clock Driver v1.12ac
[    0.487568] Linux agpgart interface v0.103
[    0.487701] ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 19 (level, low) -> IRQ 19
[    0.487869] Int: type 0, pol 3, trig 3, bus 03, IRQ 00, APIC ID 2, APIC INT 13
[    0.488008] PCI: Setting latency timer of device 0000:03:00.0 to 64
[    0.488132] atl1 0000:03:00.0: version 2.1.3
[    0.507047] Switched to high resolution mode on CPU 1
[    0.508123] Switched to high resolution mode on CPU 0
[    0.524910] 8139too Fast Ethernet driver 0.9.28
[    0.524910] ACPI: PCI Interrupt 0000:05:02.0[A] -> GSI 23 (level, low) -> IRQ 23
[    0.524910] Int: type 0, pol 3, trig 3, bus 05, IRQ 08, APIC ID 2, APIC INT 17
[    0.525909] eth1: RealTek RTL8139 at 0xb800, 00:80:48:2e:06:2e, IRQ 23
[    0.525909] eth1:  Identified 8139 chip type 'RTL-8100B/8139D'
[    0.526049] netconsole: local port 6665
[    0.526049] netconsole: local IP 192.168.0.1
[    0.526052] netconsole: interface eth0
[    0.526136] netconsole: remote port 9353
[    0.526220] netconsole: remote IP 192.168.0.42
[    0.526307] netconsole: remote ethernet address 00:1b:38:af:22:49
[    0.526410] netconsole: device eth0 not up yet, forcing it
[    2.599764] atl1 0000:03:00.0: eth0 link is up 1000 Mbps full duplex
[    2.611844] console [netcon0] enabled
[    2.639955] netconsole: network logging started
[    2.640951] Driver 'sd' needs ...
From: Nick Piggin
Date: Tuesday, June 10, 2008 - 11:11 pm

Seems like you've got little or no anon pages left, so 1GB free swap

I would hope it is not a memory leak (which might point to lockless
pagecache). It doesn't look like it because there is still lots of
inactive file pages, so that points to the page reclaim changes
(which is not to say page reclaim changes couldn't cause a memory
leak themselves).

Curious: if you kill off all the LTP tests after the OOM condition,
what does your /proc/meminfo look like before and after running
sync ; echo 3 > /proc/sys/vm/drop_caches
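
For comparing the two /proc/meminfo snapshots, a small helper along these lines may be handy. It is a sketch: `meminfo_kb_from_text` is a hypothetical name, and it assumes the usual `Key:   value kB` line layout of /proc/meminfo.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one field in a /proc/meminfo-style text buffer.
 * Returns the value in kB, or -1 if the key is not present.
 * The key must match the whole field name up to the colon.
 */
long meminfo_kb_from_text(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        long val;

        if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
            sscanf(line + klen + 1, "%ld", &val) == 1)
            return val;

        line = strchr(line, '\n');   /* advance to the next line */
        if (line)
            line++;
    }
    return -1;
}
```

Read /proc/meminfo into a buffer before and after the drop_caches step, then compare fields such as `MemFree` and `Cached` between the two buffers.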
--

From: Nick Piggin
Date: Tuesday, June 10, 2008 - 11:15 pm

Hey, I'm liking this kernel-testers list, btw. Makes it much easier
to help people with problems.

Luckily I suggested it at last KS. Oh wait, I recall everybody
laughed or ignored :) I guess I lack the managerial qualities to
make those kinds of suggestions!
--

From: Andrew Morton
Date: Tuesday, June 10, 2008 - 11:27 pm

OK, weird.

Zero pages on active_anon and inactive_anon.  I suspect we lost those pages.

And what's up with the all_unreclaimable logic?  If that isn't working
then we'll spend lots of CPU scanning zones which aren't releasing any
pages.  Hopefully that won't be needed at all if all these patches work
as hoped, but I don't think Rik intentionally disabled it at this
--

From: Nick Piggin
Date: Tuesday, June 10, 2008 - 11:31 pm

It is init that invokes the OOM killer, the actual process killed
comes at the end I believe:

--

From: KOSAKI Motohiro
Date: Tuesday, June 10, 2008 - 11:36 pm

At least, I ran LTP last week and this error didn't happen.
I'll investigate more.

Thanks.




--

From: Frederik Deweerdt
Date: Wednesday, June 11, 2008 - 12:31 am

Hi,

FWIW, I can reproduce it reliably:
$ cd <ltp-dir>/testcases/bin
$ ./growfiles -W gf15 -b -e 1 -u -r 1-49600 -I r -u -i 0 -L 120 Lgfile1
And then wait for a few secs before the OOM triggers.

Regards,
Frederik
--

From: Rik van Riel
Date: Wednesday, June 11, 2008 - 5:57 am

On Tue, 10 Jun 2008 23:27:05 -0700

Known problem.  I fixed this one in the updates I sent you last night.

-- 
All rights reversed.
--

From: Nick Piggin
Date: Wednesday, June 11, 2008 - 6:44 am

Oh good. Yeah I was just running some tests, and got as far as verifying
that the upstream kernel + lockless pagecache patches reclaims file pages
like a dream, but -mm2 sucks very badly at it.

During which, I also did find by inspection a little problem with my
speculative references patch. Andrew please apply this fix.

From: Kamalesh Babulal
Date: Wednesday, June 11, 2008 - 10:56 am

Hi Andrew,

The 2.6.26-rc5-mm2 kernel panics while booting up on the x86_64
box with the attached .config file.

kernel BUG at arch/x86/kernel/setup.c:388!
invalid opcode: 0000 [1] SMP DEBUG_PAGEALLOC
last sysfs file: 
CPU 0 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.26-rc5-mm2-autokern1 #1
RIP: 0010:[<ffffffff80210492>]  [<ffffffff80210492>] _node_to_cpumask_ptr+0x54/0x6a
RSP: 0000:ffff8100bf683d30  EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000040
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff806907c0
RBP: ffff8100bf683d40 R08: 0000000000000000 R09: ffff8100bf683c90
R10: ffffffff806a30e0 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000000 R15: ffff81000104da58
FS:  0000000000000000(0000) GS:ffffffff8073fac0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff8100bf682000, task ffff8100bf688000)
Stack:  ffff8100bf683c90 0000000000000001 ffff8100bf683da0 ffffffff8022ee0a
 00000000ffffffff 7fffffff00000001 0000000000000001 0000000000000000
 0000000000000000 ffff81000104da58 ffff81000104da40 ffff8100bf64e030
Call Trace:
 [<ffffffff8022ee0a>] sched_domain_node_span+0x56/0xcb
 [<ffffffff8022f199>] __build_sched_domains+0x1aa/0x64d
 [<ffffffff8025730b>] mark_held_locks+0x4a/0x6a
 [<ffffffff8020b360>] mcount_call+0x5/0x35
 [<ffffffff803c1934>] do_check_likely+0x9/0x65
 [<ffffffff802a0d20>] kmem_cache_alloc+0xb6/0xd6
 [<ffffffff8022face>] arch_init_sched_domains+0x63/0x71
 [<ffffffff80763694>] sched_init_smp+0x60/0x119
 [<ffffffff80750999>] kernel_init+0xf9/0x2bf
 [<ffffffff8020b360>] mcount_call+0x5/0x35
 [<ffffffff8020b360>] mcount_call+0x5/0x35
 [<ffffffff80526b17>] trace_hardirqs_on_thunk+0x3a/0x3f
 ...
From: Dave Hansen
Date: Wednesday, June 11, 2008 - 11:28 am

Just to save everyone the trouble, it looks like this is a new BUG_ON().

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-m...

The machine in question is a single-node machine, but with
CONFIG_NUMA=y.






--

From: Vegard Nossum
Date: Wednesday, June 11, 2008 - 11:37 am

Yes. Sorry, I already responded in a separate e-mail (see below), but
that obviously missed all the Ccs. So here it goes again...:

I'm betting

commit a953e4597abd51b74c99e0e3b7074532a60fd031
Author: Mike Travis <travis@sgi.com>
Date:   Mon May 12 21:21:12 2008 +0200

    sched: replace MAX_NUMNODES with nr_node_ids in kernel/sched.c

will fix this if it's not in -mm2 already.

The BUG() is simply there to prevent silent corruption. Mike already
has a patch that changes it to a WARN(), but it obviously didn't get
through (either)...
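
One plausible reading of the commit title, which the thread does not confirm in detail, is that loops bounded by the compile-time maximum hand node ids to the lookup that do not exist on the running machine. The toy model below uses hypothetical names throughout (`toy_nr_node_ids`, `toy_node_to_cpumask`, and so on) and returns NULL where the real kernel hit its BUG().

```c
#include <stddef.h>

#define TOY_MAX_NUMNODES 64            /* compile-time upper bound (made-up value) */

unsigned toy_nr_node_ids = 1;          /* node ids possible on this machine */
unsigned long toy_node_cpumask[TOY_MAX_NUMNODES];  /* only [0, toy_nr_node_ids) is meaningful */

/* Checked lookup: NULL stands in for the assertion failure. */
const unsigned long *toy_node_to_cpumask(unsigned node)
{
    if (node >= toy_nr_node_ids)
        return NULL;
    return &toy_node_cpumask[node];
}

/* Count the lookups that fail when a loop walks node ids in [0, limit). */
unsigned toy_count_bad_lookups(unsigned limit)
{
    unsigned node, bad = 0;

    for (node = 0; node < limit; node++)
        if (!toy_node_to_cpumask(node))
            bad++;
    return bad;
}
```

On a single-node machine with NUMA support compiled in, a loop bounded by `TOY_MAX_NUMNODES` trips the check for every id except 0, while a loop bounded by `toy_nr_node_ids` never does.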


Vegard



-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
From: Kamalesh Babulal
Date: Wednesday, June 11, 2008 - 11:55 pm

Hi,



-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--

From: Jiri Slaby
Date: Wednesday, June 11, 2008 - 11:08 am

Hi,

I am seeing problems after some of the pnp changes. If this is not already
known, I can bisect it; it's 100% reproducible. I have no real logs: it panics
before the network is woken up, so nothing shows up on netconsole. I only
captured the function name and the offset of the place where it oopses.

pnpacpi_encode_resources, ACPI_RESOURCE_TYPE_DMA case, pnp_get_resource(dev, 
IORESOURCE_DMA, dma) returns NULL, which is dereferenced at pnpacpi_encode_dma 
at p->flags.

It happens on resume after mem > /sys/power/state.
--

From: Bjorn Helgaas
Date: Wednesday, June 11, 2008 - 12:03 pm

Thanks for the report, I hadn't heard about this.

We used to always have a resource from the static table to encode
(assuming the table was big enough), even if that resource was
disabled or unassigned.  But now we don't keep those around, so
we can end up with null pointers like you're seeing.

Before you go to all the trouble of bisecting it, can you turn on
CONFIG_PNP_DEBUG and try the following debug patch?  I think this
will prevent the oops, but it's just papering over the real problem,
so please capture the complete dmesg log.

Bjorn


Index: work10/drivers/pnp/pnpacpi/rsparser.c
===================================================================
--- work10.orig/drivers/pnp/pnpacpi/rsparser.c	2008-06-11 12:46:28.000000000 -0600
+++ work10/drivers/pnp/pnpacpi/rsparser.c	2008-06-11 12:59:43.000000000 -0600
@@ -749,6 +749,11 @@ static void pnpacpi_encode_irq(struct pn
 	struct acpi_resource_irq *irq = &resource->data.irq;
 	int triggering, polarity, shareable;
 
+	if (!p) {
+		dev_err(&dev->dev, "  no irq resource to encode!\n");
+		return;
+	}
+
 	decode_irq_flags(dev, p->flags, &triggering, &polarity, &shareable);
 	irq->triggering = triggering;
 	irq->polarity = polarity;
@@ -771,6 +776,11 @@ static void pnpacpi_encode_ext_irq(struc
 	struct acpi_resource_extended_irq *extended_irq = &resource->data.extended_irq;
 	int triggering, polarity, shareable;
 
+	if (!p) {
+		dev_err(&dev->dev, "  no extended irq resource to encode!\n");
+		return;
+	}
+
 	decode_irq_flags(dev, p->flags, &triggering, &polarity, &shareable);
 	extended_irq->producer_consumer = ACPI_CONSUMER;
 	extended_irq->triggering = triggering;
@@ -791,6 +801,11 @@ static void pnpacpi_encode_dma(struct pn
 {
 	struct acpi_resource_dma *dma = &resource->data.dma;
 
+	if (!p) {
+		dev_err(&dev->dev, "  no dma resource to encode!\n");
+		return;
+	}
+
 	/* Note: pnp_assign_dma will copy pnp_dma->flags into p->flags */
 	switch (p->flags & IORESOURCE_DMA_SPEED_MASK) {
 	case ...
From: Jiri Slaby
Date: Thursday, June 12, 2008 - 2:10 pm

ACPI: PCI interrupt for device 0000:00:02.0 disabled
serial 00:07: disabled
serial 00:06: disabled
ACPI handle has no context!
ACPI: PCI interrupt for device 0000:00:1d.7 disabled
...
serial 00:06:   no dma resource to encode!
serial 00:06: activated
serial 00:07:   no dma resource to encode!
serial 00:07: activated
ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 16

--

From: Bjorn Helgaas
Date: Thursday, June 12, 2008 - 2:22 pm

Interesting.  I wonder why a serial device would have a DMA resource.
We encode resources by following a template from _CRS, so evidently
that template had a DMA resource.  Or something deeper is wrong.

Can you send me the rest of that dmesg log?

I take it that with the debug patch, your system is functional
after resume?

Bjorn
--

From: Jiri Slaby
Date: Thursday, June 12, 2008 - 2:39 pm

Yes, it is :).

Linux version 2.6.26-rc5-mm3_64 (ku@bellona) (gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE Linux) ) #421 SMP Thu Jun 12 22:59:48 CEST 2008
Command line: root=/dev/md1 vga=1 ro reboot=a,w slub_debug
BIOS-provided physical RAM map:
  BIOS-e820: 0000000000000000 - 000000000009ec00 (usable)
  BIOS-e820: 000000000009ec00 - 00000000000a0000 (reserved)
  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
  BIOS-e820: 0000000000100000 - 000000007d5b0000 (usable)
  BIOS-e820: 000000007d5b0000 - 000000007d5be000 (ACPI data)
  BIOS-e820: 000000007d5be000 - 000000007d5f0000 (ACPI NVS)
  BIOS-e820: 000000007d5f0000 - 000000007d600000 (reserved)
  BIOS-e820: 00000000fed90000 - 00000000fed94000 (reserved)
  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
  BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
last_pfn = 513456 max_arch_pfn = 17179869183
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
init_memory_mapping
DMI present.
ACPI: RSDP 000F9990, 0024 (r2 ACPIAM)
ACPI: XSDT 7D5B0100, 0064 (r1 A M I  OEMXSDT   5000708 MSFT       97)
ACPI: FACP 7D5B0290, 00F4 (r3 A M I  OEMFACP   5000708 MSFT       97)
ACPI: DSDT 7D5B0490, 6643 (r1 SDBLI9 SDBLI944       44 INTL 20051117)
ACPI: FACS 7D5BE000, 0040
ACPI: APIC 7D5B0390, 006C (r1 A M I  OEMAPIC   5000708 MSFT       97)
ACPI: MCFG 7D5B0450, 003C (r1 A M I  OEMMCFG   5000708 MSFT       97)
ACPI: OEMB 7D5BE040, 0079 (r1 A M I  AMI_OEM   5000708 MSFT       97)
ACPI: HPET 7D5B6AE0, 0038 (r1 A M I  OEMHPET   5000708 MSFT       97)
ACPI: GSCI 7D5BE0C0, 2024 (r1 A M I  GMCHSCI   5000708 MSFT       97)
ACPI: iEIT 7D5C00F0, 00B0 (r1 A M I  EITTABLE  5000708 MSFT       97)
ACPI: DMAR 7D5B6BC0, 0118 (r1 A M I  OEMDMAR         1 MSFT       97)
   early res: 0 [0-fff] BIOS data page
   early res: 1 [6000-7fff] TRAMPOLINE
   early res: 2 [200000-7cd447] TEXT DATA BSS
   early res: 3 [9ec00-fffff] BIOS reserved
   early res: 4 [8000-afff] PGTABLE
Scan SMP ...
From: Bjorn Helgaas
Date: Thursday, June 12, 2008 - 2:57 pm

Thanks, but it looks like CONFIG_PNP_DEBUG is not turned on.  Can
you turn that on and capture the log again, please?

Bjorn

--

From: Jiri Slaby
Date: Thursday, June 12, 2008 - 2:57 pm

Sorry, too tired, so I overlooked it. Tomorrow. Thanks.
--

From: Jiri Slaby
Date: Friday, June 13, 2008 - 9:05 am

Here it goes:
Linux version 2.6.26-rc5-mm3_64-pnp (ku@bellona) (gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE Linux) ) #1 SMP Fri Jun 13 17:49:16 CEST 2008
Command line: root=/dev/md1 vga=1 ro reboot=a,w slub_debug 2
BIOS-provided physical RAM map:
  BIOS-e820: 0000000000000000 - 000000000009ec00 (usable)
  BIOS-e820: 000000000009ec00 - 00000000000a0000 (reserved)
  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
  BIOS-e820: 0000000000100000 - 000000007d5b0000 (usable)
  BIOS-e820: 000000007d5b0000 - 000000007d5be000 (ACPI data)
  BIOS-e820: 000000007d5be000 - 000000007d5f0000 (ACPI NVS)
  BIOS-e820: 000000007d5f0000 - 000000007d600000 (reserved)
  BIOS-e820: 00000000fed90000 - 00000000fed94000 (reserved)
  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
  BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
last_pfn = 513456 max_arch_pfn = 17179869183
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
init_memory_mapping
DMI present.
ACPI: RSDP 000F9990, 0024 (r2 ACPIAM)
ACPI: XSDT 7D5B0100, 0064 (r1 A M I  OEMXSDT   5000708 MSFT       97)
ACPI: FACP 7D5B0290, 00F4 (r3 A M I  OEMFACP   5000708 MSFT       97)
ACPI: DSDT 7D5B0490, 6643 (r1 SDBLI9 SDBLI944       44 INTL 20051117)
ACPI: FACS 7D5BE000, 0040
ACPI: APIC 7D5B0390, 006C (r1 A M I  OEMAPIC   5000708 MSFT       97)
ACPI: MCFG 7D5B0450, 003C (r1 A M I  OEMMCFG   5000708 MSFT       97)
ACPI: OEMB 7D5BE040, 0079 (r1 A M I  AMI_OEM   5000708 MSFT       97)
ACPI: HPET 7D5B6AE0, 0038 (r1 A M I  OEMHPET   5000708 MSFT       97)
ACPI: GSCI 7D5BE0C0, 2024 (r1 A M I  GMCHSCI   5000708 MSFT       97)
ACPI: iEIT 7D5C00F0, 00B0 (r1 A M I  EITTABLE  5000708 MSFT       97)
ACPI: DMAR 7D5B6BC0, 0118 (r1 A M I  OEMDMAR         1 MSFT       97)
   early res: 0 [0-fff] BIOS data page
   early res: 1 [6000-7fff] TRAMPOLINE
   early res: 2 [200000-7cd447] TEXT DATA BSS
   early res: 3 [9ec00-fffff] BIOS reserved
   early res: 4 [8000-afff] PGTABLE
Scan SMP from ...
From: Bjorn Helgaas
Date: Friday, June 13, 2008 - 10:23 am

Thanks a lot!  Your BIOS clearly claims that at least one of your
serial ports can be configured with DMA:

  pnp 00:07:   dependent set 5 (acceptable) io  min 0x3f8 max 0x3f8 align 1 size 8 flags 0x1
  pnp 00:07:   dependent set 5 (acceptable) irq 3 4 5 6 7 10 11 12 flags 0x1
  pnp 00:07:   dependent set 5 (acceptable) dma 0 1 2 3 (bitmask 0xf) flags 0x0

That's weird, but whatever, we still have to be careful to give the
BIOS back what it expects, and I think that means we have to keep
track of that disabled DMA resource in pnpacpi_allocated_resource().

Can you please replace the debug patch with the one below and send me
the results again?

Index: work10/drivers/pnp/pnpacpi/rsparser.c
===================================================================
--- work10.orig/drivers/pnp/pnpacpi/rsparser.c	2008-06-11 12:46:28.000000000 -0600
+++ work10/drivers/pnp/pnpacpi/rsparser.c	2008-06-13 11:13:21.000000000 -0600
@@ -240,6 +240,7 @@ static acpi_status pnpacpi_allocated_res
 	struct acpi_resource_fixed_memory32 *fixed_memory32;
 	struct acpi_resource_extended_irq *extended_irq;
 	int i, flags;
+	u8 channel;
 
 	switch (res->type) {
 	case ACPI_RESOURCE_TYPE_IRQ:
@@ -259,13 +260,13 @@ static acpi_status pnpacpi_allocated_res
 
 	case ACPI_RESOURCE_TYPE_DMA:
 		dma = &res->data.dma;
-		if (dma->channel_count > 0) {
-			flags = dma_flags(dma->type, dma->bus_master,
-					  dma->transfer);
-			if (dma->channels[0] == (u8) -1)
-				flags |= IORESOURCE_DISABLED;
-			pnp_add_dma_resource(dev, dma->channels[0], flags);
+		channel = dma->channels[0];
+		flags = dma_flags(dma->type, dma->bus_master, dma->transfer);
+		if (dma->channel_count == 0 || dma->channels[0] == (u8) -1) {
+			channel = -1;
+			flags = IORESOURCE_DISABLED;
 		}
+		pnp_add_dma_resource(dev, channel, flags);
 		break;
 
 	case ACPI_RESOURCE_TYPE_IO:
@@ -749,6 +750,11 @@ static void pnpacpi_encode_irq(struct pn
 	struct acpi_resource_irq *irq = &resource->data.irq;
 	int triggering, polarity, ...
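
The DMA hunk above can be read in isolation as the following minimal
sketch. It is not the kernel code: struct dma_desc, struct dma_result and
pick_dma() are hypothetical stand-ins for struct acpi_resource_dma and the
call to pnp_add_dma_resource(), and active_flags stands in for whatever
dma_flags() would return. Only the decision itself (no channel, or channel
0xff, means "record a disabled DMA resource") is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as IORESOURCE_DISABLED in include/linux/ioport.h. */
#define IORESOURCE_DISABLED 0x10000000

/* Hypothetical stand-in for the two acpi_resource_dma fields used here. */
struct dma_desc {
	uint8_t channel_count;
	uint8_t channels[1];
};

/* Hypothetical: what would be handed to pnp_add_dma_resource(). */
struct dma_result {
	uint8_t channel;
	int flags;
};

/*
 * Mirror of the patched logic: always produce a DMA resource, but mark it
 * disabled (channel 0xff, flags IORESOURCE_DISABLED) when the descriptor
 * carries no channel or carries the "no channel" value (u8) -1.
 */
static struct dma_result pick_dma(const struct dma_desc *dma, int active_flags)
{
	struct dma_result r;

	assert(dma != NULL);

	r.channel = dma->channels[0];
	r.flags = active_flags;

	if (dma->channel_count == 0 || dma->channels[0] == (uint8_t)-1) {
		r.channel = (uint8_t)-1;	/* prints as 255 in a %d debug line */
		r.flags = IORESOURCE_DISABLED;
	}
	return r;
}
```

Under these assumptions an empty descriptor yields channel 255 with flags
0x10000000, while a descriptor naming channel 4 passes through unchanged.
The difference from the old code is that the empty case no longer drops
the resource silently, so the encode path has something to fill the
template's DMA slot with.
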
From: Jiri Slaby
Date: Monday, June 16, 2008 - 3:43 am

Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp 00:00: parse allocated resources
pnp 00:00:   add io  0xcf8-0xcff flags 0x1
pnp 00:00: Plug and Play ACPI device, IDs PNP0a08 PNP0a03 (active)
pnp 00:01: parse allocated resources
pnp 00:01:   add mem 0xfed14000-0xfed19fff flags 0x1
pnp 00:01: PNP0c01: calling quirk_system_pci_resources+0x0/0x1d0
pnp 00:01: Plug and Play ACPI device, IDs PNP0c01 (active)
pnp 00:02: parse allocated resources
pnp 00:02:   add dma 4 flags 0x4
pnp 00:02:   add io  0x0-0xf flags 0x1
pnp 00:02:   add io  0x81-0x83 flags 0x1
pnp 00:02:   add io  0x87-0x87 flags 0x1
pnp 00:02:   add io  0x89-0x8b flags 0x1
pnp 00:02:   add io  0x8f-0x8f flags 0x1
pnp 00:02:   add io  0xc0-0xdf flags 0x1
pnp 00:02: Plug and Play ACPI device, IDs PNP0200 (active)
pnp 00:03: parse allocated resources
pnp 00:03:   add io  0x70-0x71 flags 0x1
pnp 00:03:   add irq 8 flags 0x1
pnp 00:03: Plug and Play ACPI device, IDs PNP0b00 (active)
pnp 00:04: parse allocated resources
pnp 00:04:   add io  0x61-0x61 flags 0x1
pnp 00:04: Plug and Play ACPI device, IDs PNP0800 (active)
pnp 00:05: parse allocated resources
pnp 00:05:   add io  0xf0-0xff flags 0x1
pnp 00:05:   add irq 13 flags 0x1
pnp 00:05: Plug and Play ACPI device, IDs PNP0c04 (active)
pnp 00:06: parse allocated resources
pnp 00:06:   add io  0x3f8-0x3ff flags 0x1
pnp 00:06:   add irq 4 flags 0x1
pnp 00:06:   add dma 255 flags 0x10000000
pnp 00:06: parse resource options
pnp 00:06:   dependent set 0 (preferred) io  min 0x3f8 max 0x3f8 align 1 size 8 flags 0x1
pnp 00:06:   dependent set 0 (preferred) irq 4 flags 0x1
pnp 00:06:   dependent set 1 (acceptable) io  min 0x3f8 max 0x3f8 align 1 size 8 flags 0x1
pnp 00:06:   dependent set 1 (acceptable) irq 3 4 5 6 7 10 11 12 flags 0x1
pnp 00:06:   dependent set 2 (acceptable) io  min 0x2f8 max 0x2f8 align 1 size 8 flags 0x1
pnp 00:06:   dependent set 2 (acceptable) irq 3 4 5 6 7 10 11 12 flags 0x1
pnp 00:06:   ...