ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-mm2/

- This is a bugfixed version of 2.6.26-rc5-mm1 - mainly to repair a
  vmscan.c bug which would have prevented testing of the other vmscan.c
  bugs^Wchanges.

Boilerplate:

- See the `hot-fixes' directory for any important updates to this patchset.

- To fetch an -mm tree using git, use (for example)

  git-fetch git://git.kernel.org/pub/scm/linux/kernel/git/smurf/linux-trees.git tag v2.6.16-rc2-mm1
  git-checkout -b local-v2.6.16-rc2-mm1 v2.6.16-rc2-mm1

- -mm kernel commit activity can be reviewed by subscribing to the
  mm-commits mailing list.

        echo "subscribe mm-commits" | mail majordomo@vger.kernel.org

- If you hit a bug in -mm and it is not obvious which patch caused it, it is
  most valuable if you can perform a bisection search to identify which patch
  introduced the bug. Instructions for this process are at

        http://www.zip.com.au/~akpm/linux/patches/stuff/bisecting-mm-trees.txt

  But beware that this process takes some time (around ten rebuilds and
  reboots), so consider reporting the bug first and if we cannot immediately
  identify the faulty patch, then perform the bisection search.

- When reporting bugs, please try to Cc: the relevant maintainer and mailing
  list on any email.

- When reporting bugs in this kernel via email, please also rewrite the
  email Subject: in some manner to reflect the nature of the bug. Some
  developers filter by Subject: when looking for messages to read.

- Occasional snapshots of the -mm lineup are uploaded to
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/ and are announced on
  the mm-commits list. These probably are at least compilable.

- More-than-daily -mm snapshots may be found at
  http://userweb.kernel.org/~akpm/mmotm/. These are almost certainly not
  compileable.

Changes since 2.6.26-rc5-mm1:

 origin.patch
 linux-next.patch
 git-jg-misc.patch
 git-leds.patch
 ...
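The "around ten rebuilds" figure in the boilerplate follows from binary search: every test build halves the range of candidate patches. A minimal userspace sketch in C, assuming a hypothetical series length and a hypothetical is_bad() callback (neither comes from the announcement; the real procedure is the one in bisecting-mm-trees.txt):

```c
#include <stddef.h>

/* Hypothetical callback: returns nonzero if a kernel built with the
 * first `applied` patches of the series shows the bug. */
typedef int (*is_bad_fn)(size_t applied, void *ctx);

/*
 * Find the first bad patch in a series of n patches, assuming the tree
 * with 0 patches applied is good and the tree with all n applied is bad.
 * Returns the 1-based index of the patch whose application makes the
 * bug appear; *builds receives the number of test builds that were used.
 */
size_t bisect_series(size_t n, is_bad_fn is_bad, void *ctx, size_t *builds)
{
    size_t good = 0;   /* known good with this many patches applied */
    size_t bad = n;    /* known bad with this many patches applied */

    *builds = 0;
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        (*builds)++;
        if (is_bad(mid, ctx))
            bad = mid;
        else
            good = mid;
    }
    return bad;
}

/* Example predicate: the bug is introduced by patch number *ctx. */
int bad_from(size_t applied, void *ctx)
{
    return applied >= *(size_t *)ctx;
}
```

For a hypothetical series of 1000 patches this needs at most ten builds, since 2^10 = 1024 covers the whole range.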
BTW. this is known to be broken with x86 1GB pages and direct-IO, due to interaction between huge pages patchset and lockless get_user_pages. My fault. I was away from the screen over the long weekend here, and didn't give Andrew the heads-up in time. This isn't going to be a problem unless you explicitly enable GB pages and run direct IO (or splice) into or out of them. I can give a fixup patch to anyone interested in doing so. --
BTW. would be trying to test this more myself, but last mm I based the lockless patches on didn't boot, and this one dies pretty quickly when you try to get into reclaim: ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:77! invalid opcode: 0000 [1] SMP DEBUG_PAGEALLOC last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map CPU 7 Modules linked in: Pid: 13550, comm: sh Not tainted 2.6.26-rc5-mm2-dirty #412 RIP: 0010:[<ffffffff80288689>] [<ffffffff80288689>] add_to_swap_cache+0xd9/0x120 RSP: 0018:ffff81010c62d8a8 EFLAGS: 00010246 RAX: 2000000000020009 RBX: ffffe2000107da88 RCX: c000000000000000 RDX: 0000000000000020 RSI: 000000000000eea2 RDI: ffffe2000107da88 RBP: ffff81010c62d8c8 R08: fffffffffa48016e R09: 0000000000000000 R10: ffffffff80857fa0 R11: 2222222222222222 R12: ffff81012e126520 R13: 000000000000eea2 R14: ffff8100727bea20 R15: ffff81010c62d9b8 FS: 00002b5b33cafdc0(0000) GS:ffff81012ff07800(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000175e280 CR3: 000000012e292000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process sh (pid: 13550, threadinfo ffff81010c62c000, task ffff810116b01110) Stack: ffff81010c62d8c8 ffffe2000107da88 ffff81012e126520 ffff81012e126400 ffff81010c62d908 ffffffff80292851 000000000000eea2 ffff81012e126708 ffffe2000107da88 ffffffff80701420 ffff81010c62db68 ffff81010c62dc88 Call Trace: [<ffffffff80292851>] shmem_writepage+0x121/0x200 [<ffffffff80277479>] shrink_page_list+0x559/0x6b0 [<ffffffff802777ec>] shrink_list+0x21c/0x520 [<ffffffff80273365>] ? determine_dirtyable_memory+0x15/0x30 [<ffffffff802733a2>] ? get_dirty_limits+0x22/0x2a0 [<ffffffff80277d31>] shrink_zone+0x241/0x330 [<ffffffff80278207>] try_to_free_pages+0x237/0x3a0 [<ffffffff80276530>] ? isolate_pages_global+0x0/0x270 [<ffffffff80272546>] __alloc_pages_internal+0x206/0x4b0 ...
It would be good if you could find a day to look through those changes please. It's pretty important. --
Doesn't look like it, but I hadn't followed the changes too closely: OK, I could have a look through them at some point. Just something very quick while I have Rik's attention are all the atomic SetPageSwapBacked bitops over a lot of mm/ fastpaths that I have been slowly working away to get rid of over the past years. Maybe some don't consider it a big deal, but a single one costs anywhere from 100 - 500 instructions on desktop CPUs, not including secondary effects of ordering memory and compiler barriers. Please go through and ensure you know your page references and ->flags concurrency, and cut these down to a bare minimum. Is the plan to merge all reclaim changes in a big hit, rather than slowly trickle in the different independent changes? --
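To illustrate the distinction being asked for here, a userspace sketch using C11 atomics, not kernel code. The struct, the bit position and the function names are assumptions made up for the example; the point is only that the non-atomic variant is valid while the caller holds the sole reference to the page.

```c
#include <stdatomic.h>

/* Hypothetical bit position; not the kernel's actual value. */
#define PG_SWAPBACKED_BIT 3UL

struct page_model {
    _Atomic unsigned long flags;
};

/* Atomic read-modify-write: safe when other CPUs may touch ->flags
 * concurrently, but it is a locked operation on x86. */
void set_swapbacked_atomic(struct page_model *page)
{
    atomic_fetch_or(&page->flags, 1UL << PG_SWAPBACKED_BIT);
}

/* Plain load then store: only valid while the caller holds the sole
 * reference, e.g. a freshly allocated page that is not yet on any LRU
 * list or in any page cache. */
void set_swapbacked_nonatomic(struct page_model *page)
{
    unsigned long f = atomic_load_explicit(&page->flags,
                                           memory_order_relaxed);

    atomic_store_explicit(&page->flags, f | (1UL << PG_SWAPBACKED_BIT),
                          memory_order_relaxed);
}

/* Returns 1 if the swap-backed bit is set, 0 otherwise. */
int page_swapbacked(struct page_model *page)
{
    return (atomic_load(&page->flags) >> PG_SWAPBACKED_BIT) & 1;
}
```

Both variants leave the same bit set; they differ only in what they guarantee when another CPU modifies the flags word at the same time.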
It's going to take a lot of work to get such extensive reclaim changes landed. We need to convince ourselves that these changes are the right way to fix <whatever they fix>. We need to review and test the crap out of them. The 64-bit-only thing is a concern. I wonder about whether we've "fixed" anon pages but didn't do anything about file-backed mapped pages. Plus all the other stuff, plus stuff we haven't thought of yet. It's huge. --
On Tue, 10 Jun 2008 02:15:19 -0700 Quite possible. The reclaim policy for file-backed pages has not changed. We don't know yet whether we'll have to change that, too. -- All rights reversed. --
On Tue, 10 Jun 2008 18:48:21 +1000 My original plan was to merge them incrementally, but Andrew is right that we should give the whole set as much testing as possible. I have done all the cleanups Andrew asked and fixed the bugs that I found after that merge/cleanup. Your bug is the one I still need to fix before giving Andrew a whole new set of split LRU patches to merge. (afterwards, I will go incremental fixes only - the cleanups he asked for were just too big to do as incrementals) -- All rights reversed. --
I'm sorry, hmm I didn't look closely enough and forgot that write_begin/write_end requires the callee to allocate the page as well, and that Hugh had nicely unified most of that. So maybe it's not that. It's pretty easy to hit I found with OK. --
On Thu, 12 Jun 2008 09:58:38 +1000 Turns out the loopback driver uses splice, which moves the pages from one place to another. This is why you were seeing the problem with loopback, but not with just a really big file on tmpfs. I'm trying to make sense of all the splice code now and will send fix as soon as I know how to fix this problem in a nice way. -- All Rights Reversed --
The loop-on-tmpfs write side is okay nowadays, but the read side There's no need to make sense of all the splice code, it's just that it's doing add_to_page_cache_lru (on a page not marked as SwapBacked), then shmem and swap_state consistency relies on it as having been marked as SwapBacked. Normally, yes, shmem_getpage is the one that allocates the page, but in this case it's already been done outside, awkward (and long predates loop's use of splice). It's remarkably hard to correct the LRU of a page once it's been launched towards one. Is it still on this cpu's pagevec? Have we been preempted and it's on another cpu's pagevec? If it's reached the LRU, has vmscan whisked it off for a moment, even though it's PageLocked? Until now it's been that the LRUs are self-correcting, but these patches move away from that. I don't know how to fix this problem in a nice way. For the moment, to proceed with testing, I'm using the hack below. But perhaps that screws things up for the other !mapping_cap_account_dirty filesystems e.g. ramfs, I just haven't tried them yet - nor shall in the next couple of days. It could be turned into a proper bdi check of its own, instead of parasiting off cap_account_dirty. But I'm not yet convinced by any of the PageSwapBacked stuff, so currently preferring a quick hack to a grand scheme. It's not clear to me why tmpfs file pages should be counted as anon pages rather than file pages; though it is clear that switching their LRU midstream, when swizzled to swap, can have implementation problems. I don't really get why SwapBacked is the important consideration: I can see that you may want different balancing for pages mapped into userspace from pages just cached in kernel; but SwapBacked? Am I right to think that the memcontrol stuff is now all broken, because memcontrol.c hasn't yet been converted to the more LRUs? Certainly I'm now hanging when trying to run in a restricted memcg. Unrelated fix to compiler warning and silly ...
On Thu, 12 Jun 2008 22:15:54 +0100 (BST) Yeah, it will break ramfs. Also, we need to take care of splice going in the opposite direction (moving a page from SwapBacked to filesystem backed). I guess we'll need per-mapping flags to help determine where a page goes at add_to_page_cache_lru() time. This does not remove our need for the page flags, because those need to survive until the del_page_from_lru() call in __page_cache_release(), by which time the page->mapping I believe memcontrol has been converted. Of course, maybe I sent the fix for that one to Andrew already. I believe it's in his mmotm tree. -- All Rights Reversed --
No, that's a different, and blessedly non-existent, problem. The swap_state.c:77s we're seeing with loop-on-tmpfs-file just comes from __generic_file_splice_read doing add_to_page_cache_lru without knowing that the filesystem it's dealing with is tmpfs, which unlike every other filesystem sets and expects PageSwapBacked on its pages. (I expect you started out without that, then hit problems when tmpfs moved its file pages to swap cache, so you therefore elected to make them SwapBacked from the start.) You could certainly argue that tmpfs should therefore have its own shmem_file_splice_read instead of using generic_file_splice_read; but I'd rather hate to duplicate that splice code within shmem.c just for this reason, would prefer that __generic_file_splice_read deduce it's dealing with tmpfs and SetPageSwapBacked before add_to_page_cache_lru (probably better that way than within add_to_page_cache_lru as I did). Though I'd even more prefer to find a way of avoiding it altogether: I've yet to think through on that. But this is hardly a splice problem, it's just that splice is the only thing which ever goes the problematic shmem_readpage route. When above you say that we also need to take care of going the opposite direction, you're thinking about splice stealing pages from one mapping and giving them to another, the essence of splice. But see Nick's year-old 485ddb4b9741bafb70b22e5c1f9b4f37dc3e85bd "splice: dont steal" patch: that stealing is currently dead code, so you shouldn't spend time worrying about how to deal with it. The better way would be to add a backing_dev_info flag. (At one point I had been going to criticize your per-mapping AS_UNEVICTABLE, to say that one should be a backing_dev_info flag; but no, you're Ah, yes, there are NR_LRU_LISTS arrays in there now, so it has the appearance of having been converted. Fine, then it's worth my looking into why it isn't actually working as intended. Hugh --
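A rough userspace model of the approach preferred above, where the splice read path marks the page before adding it, keyed off a backing_dev_info capability. Every name here (BDI_CAP_SWAP_BACKED_MODEL, the *_model structs and functions) is hypothetical and only stands in for the kernel objects being discussed; this is a sketch of the idea, not the eventual fix.

```c
#include <stdbool.h>

enum lru_model { LRU_FILE, LRU_ANON };

/* Hypothetical capability bit, standing in for the backing_dev_info
 * flag suggested above. */
#define BDI_CAP_SWAP_BACKED_MODEL 0x1u

struct bdi_model { unsigned int capabilities; };
struct mapping_model { const struct bdi_model *bdi; };

struct page_model {
    bool swap_backed;                    /* models PageSwapBacked */
    enum lru_model lru;                  /* LRU the page was launched to */
    const struct mapping_model *mapping;
};

static bool mapping_cap_swap_backed(const struct mapping_model *m)
{
    return m->bdi->capabilities & BDI_CAP_SWAP_BACKED_MODEL;
}

/* Models add_to_page_cache_lru(): the LRU is chosen from the page flag
 * at insertion time and is hard to correct afterwards. */
void add_to_page_cache_lru_model(struct page_model *page,
                                 const struct mapping_model *mapping)
{
    page->mapping = mapping;
    page->lru = page->swap_backed ? LRU_ANON : LRU_FILE;
}

/* Models the splice read path: mark the page before it is added, so a
 * tmpfs-like mapping never sees a page on the wrong list. */
void splice_read_add_page(struct page_model *page,
                          const struct mapping_model *mapping)
{
    if (mapping_cap_swap_backed(mapping))
        page->swap_backed = true;
    add_to_page_cache_lru_model(page, mapping);
}

/* Models the consistency expectation that shmem and swap_state have:
 * pages of a swap-backed mapping carry the flag, others do not. */
bool page_consistent(const struct page_model *page)
{
    return mapping_cap_swap_backed(page->mapping) == page->swap_backed;
}
```

In this model, calling add_to_page_cache_lru_model() directly on an unmarked page for a tmpfs-like mapping leaves it inconsistent and on the file LRU, which is the situation described for __generic_file_splice_read.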
On Fri, 13 Jun 2008 22:15:01 +0100 (BST) I believe that Lee and Kosaki-san have tested this code, so the breakage could be pretty new. -- All rights reversed. --
I put those C++ TODO comments in there specifically to raise their visibility in hopes that someone [like you :)] would notice and maybe have an answer to the question. I noted the issue in the change log as well--i.e., that I had moved set_pte_at() to after the lru_cache_add and 'new_rmap. The existing order may be that way for a reason, but it's not clear [to me] what that reason is. As I noted, do_anonymous_page() sets the pte after the lru_add and new_rmap. I agree, these questions need to be answered and the TODO's resolved before merging. Any thoughts as to the ordering? Lee --
The ordering of lru_cache_add*, page_add_*_rmap and set_pte_at does not matter (but update_mmu_cache must come after set_pte_at not before). Even if the page table lock were not held across them (it is), I think their ordering would not matter much (just benign races); though it's always worth keeping in mind that once you've done the lru_cache_add, that page is now visible to vmscan.c. But I'm all in favour of you imposing consistency there (as part of a wider patch? perhaps not; and do_swap_page does now look out of step). It can sometimes help when inserting debug checks e.g. on page_mapcount. I think you'll find the lru_cache_add_active_or_noreclaim could actually be moved into page_add_new_rmap - I found that helpful when working on eliminating the PageSwapCache flag (work now grown out of date, I'm afraid), to know that the page was not publicly visible until I did lru_cache_add_active at the end of page_add_new_rmap. Hugh --
No it's not :) -mm1 worked fine here but -mm2 locks up just after saying: agpgart: Detected 7164K stolen memory. Nothing in logs (session not recorded - hit reset to restart). config and dmesg for -mm1 at (same .config for mm2): http://bugsplatter.mine.nu/test/boxen/pooh/config-2.6.26-rc5-mm1a.gz http://bugsplatter.mine.nu/test/boxen/pooh/dmesg-2.6.26-rc5-mm1a.gz Grant. --
hm, intel-agp gtt stuff.
Can you please see whether reverting Keith's stuff fixes it?
drivers/char/agp/agp.h | 3 ---
drivers/char/agp/backend.c | 2 --
drivers/char/agp/generic.c | 28 ----------------------------
drivers/char/agp/intel-agp.c | 5 -----
include/linux/agp_backend.h | 5 -----
5 files changed, 43 deletions(-)
diff -puN drivers/char/agp/agp.h~revert-intel-agp-rewrite-gtt-on-resume drivers/char/agp/agp.h
--- a/drivers/char/agp/agp.h~revert-intel-agp-rewrite-gtt-on-resume
+++ a/drivers/char/agp/agp.h
@@ -148,9 +148,6 @@ struct agp_bridge_data {
char minor_version;
struct list_head list;
u32 apbase_config;
- /* list of agp_memory mapped to the aperture */
- struct list_head mapped_list;
- spinlock_t mapped_lock;
};
#define KB(x) ((x) * 1024)
diff -puN drivers/char/agp/backend.c~revert-intel-agp-rewrite-gtt-on-resume drivers/char/agp/backend.c
--- a/drivers/char/agp/backend.c~revert-intel-agp-rewrite-gtt-on-resume
+++ a/drivers/char/agp/backend.c
@@ -183,8 +183,6 @@ static int agp_backend_initialize(struct
rc = -EINVAL;
goto err_out;
}
- INIT_LIST_HEAD(&bridge->mapped_list);
- spin_lock_init(&bridge->mapped_lock);
return 0;
diff -puN drivers/char/agp/generic.c~revert-intel-agp-rewrite-gtt-on-resume drivers/char/agp/generic.c
--- a/drivers/char/agp/generic.c~revert-intel-agp-rewrite-gtt-on-resume
+++ a/drivers/char/agp/generic.c
@@ -426,10 +426,6 @@ int agp_bind_memory(struct agp_memory *c
curr->is_bound = TRUE;
curr->pg_start = pg_start;
- spin_lock(&agp_bridge->mapped_lock);
- list_add(&curr->mapped_list, &agp_bridge->mapped_list);
- spin_unlock(&agp_bridge->mapped_lock);
-
return 0;
}
EXPORT_SYMBOL(agp_bind_memory);
@@ -462,34 +458,10 @@ int agp_unbind_memory(struct agp_memory
curr->is_bound = FALSE;
curr->pg_start = 0;
- spin_lock(&curr->bridge->mapped_lock);
- list_del(&curr->mapped_list);
- spin_unlock(&curr->bridge->mapped_lock);
return 0;
}
...

Yes, it does :) config + dmesg at: http://bugsplatter.mine.nu/test/boxen/pooh/ (*-mm2b.gz) --
Interesting to try out, but I got this:

$ make
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  CALL    scripts/checksyscalls.sh
  CHK     include/linux/compile.h
  CC      mm/vmscan.o
mm/vmscan.c: In function 'show_page_path':
mm/vmscan.c:2419: error: 'struct mm_struct' has no member named 'owner'
make[1]: *** [mm/vmscan.o] Error 1
make: *** [mm] Error 2

I then tried to configure with "Track page owner", but that did not change anything. Helge Hafting --
Hi, CONFIG_PAGE_OWNER is something else, owner is only active if CONFIG_MM_OWNER is set. Hannes --
Thanks. I guess this will get you going.
--- a/mm/vmscan.c~mm-only-vmscan-noreclaim-lru-scan-sysctl-fix
+++ a/mm/vmscan.c
@@ -2400,6 +2400,7 @@ static void show_page_path(struct page *
dentry_path(dentry, buf, 256), pgoff);
spin_unlock(&mapping->i_mmap_lock);
} else {
+#ifdef CONFIG_MM_OWNER
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
@@ -2413,6 +2414,7 @@ static void show_page_path(struct page *
break;
}
page_unlock_anon_vma(anon_vma);
+#endif
}
}
_
--
Thanks, that did the trick. It compiled fine this time. Helge Hafting --
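For anyone hitting the same class of build error, a minimal standalone sketch of the guard pattern, assuming simplified stand-in structs (mm_model and task_model are hypothetical, not the kernel's definitions). A field that only exists under a config option has to be accessed under the same option.

```c
#include <stddef.h>

/* Build with -DCONFIG_MM_OWNER to compile the owner field in. */

struct task_model { const char *comm; };

struct mm_model {
    int users;
#ifdef CONFIG_MM_OWNER
    struct task_model *owner;   /* only exists when the option is set */
#endif
};

/* Returns the owning task's name, or NULL when owner tracking is not
 * compiled in or no owner has been set. */
const char *mm_owner_comm(const struct mm_model *mm)
{
#ifdef CONFIG_MM_OWNER
    return mm->owner ? mm->owner->comm : NULL;
#else
    (void)mm;
    return NULL;
#endif
}
```

Note that a misspelled macro name in an #ifdef is not a compile error: the preprocessor treats it as undefined and silently drops the guarded block, so the spelling of the guard is worth checking against the Kconfig symbol.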
This patch fixes a compile error in mm/memory_hotplug.c.
This call site was simply not updated when the interface of
isolate_lru_page() changed. :-(
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
mm/memory_hotplug.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c
+++ current/mm/memory_hotplug.c
@@ -595,8 +595,9 @@ do_migrate_range(unsigned long start_pfn
* We can skip free pages. And we can only deal with pages on
* LRU.
*/
- ret = isolate_lru_page(page, &source);
+ ret = isolate_lru_page(page);
if (!ret) { /* Success */
+ list_add_tail(&page->lru, &source);
move_pages--;
} else {
/* Becasue we don't have big zone->lock. we should
--
Yasunori Goto
--
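The interface change behind this fix can be modelled in userspace as follows. This is a sketch under assumptions: the list helpers and the *_model names are made up for the example, and the failure value -1 is a placeholder for whatever error the kernel function returns. What it shows is that the new-style call only removes the page from the LRU, and the caller links it onto its own list.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_node { struct list_node *prev, *next; };

void list_init(struct list_node *head)
{
    head->prev = head->next = head;
}

void list_del_node(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

void list_add_tail_node(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

struct page_model {
    bool on_lru;
    struct list_node lru;
};

/* New-style interface: only takes the page off the LRU.
 * Returns 0 on success, nonzero if the page was not on an LRU. */
int isolate_lru_page_model(struct page_model *page)
{
    if (!page->on_lru)
        return -1;
    list_del_node(&page->lru);
    page->on_lru = false;
    return 0;
}

/* Caller in the style of do_migrate_range(): links each isolated page
 * onto its own list and returns how many pages it gathered. */
size_t collect_for_migration(struct page_model **pages, size_t n,
                             struct list_node *source)
{
    size_t taken = 0;

    for (size_t i = 0; i < n; i++) {
        if (isolate_lru_page_model(pages[i]) == 0) {
            list_add_tail_node(&pages[i]->lru, source);
            taken++;
        }
    }
    return taken;
}
```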
OOM condition happened with 1G free swap. 4G RAM, 1G swap partition, normally LTP survives during much, much higher load. vm.overcommit_memory = 0 vm.overcommit_ratio = 50 [ 0.442034] TCP bind hash table entries: 65536 (order: 9, 3670016 bytes) [ 0.447278] TCP: Hash tables configured (established 262144 bind 65536) [ 0.447411] TCP reno registered [ 0.459744] NET: Registered protocol family 1 [ 0.477840] msgmni has been set to 7862 [ 0.477840] io scheduler noop registered [ 0.477840] io scheduler cfq registered (default) [ 0.478136] pci 0000:01:00.0: Boot video device [ 0.487568] Real Time Clock Driver v1.12ac [ 0.487568] Linux agpgart interface v0.103 [ 0.487701] ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 19 (level, low) -> IRQ 19 [ 0.487869] Int: type 0, pol 3, trig 3, bus 03, IRQ 00, APIC ID 2, APIC INT 13 [ 0.488008] PCI: Setting latency timer of device 0000:03:00.0 to 64 [ 0.488132] atl1 0000:03:00.0: version 2.1.3 [ 0.507047] Switched to high resolution mode on CPU 1 [ 0.508123] Switched to high resolution mode on CPU 0 [ 0.524910] 8139too Fast Ethernet driver 0.9.28 [ 0.524910] ACPI: PCI Interrupt 0000:05:02.0[A] -> GSI 23 (level, low) -> IRQ 23 [ 0.524910] Int: type 0, pol 3, trig 3, bus 05, IRQ 08, APIC ID 2, APIC INT 17 [ 0.525909] eth1: RealTek RTL8139 at 0xb800, 00:80:48:2e:06:2e, IRQ 23 [ 0.525909] eth1: Identified 8139 chip type 'RTL-8100B/8139D' [ 0.526049] netconsole: local port 6665 [ 0.526049] netconsole: local IP 192.168.0.1 [ 0.526052] netconsole: interface eth0 [ 0.526136] netconsole: remote port 9353 [ 0.526220] netconsole: remote IP 192.168.0.42 [ 0.526307] netconsole: remote ethernet address 00:1b:38:af:22:49 [ 0.526410] netconsole: device eth0 not up yet, forcing it [ 2.599764] atl1 0000:03:00.0: eth0 link is up 1000 Mbps full duplex [ 2.611844] console [netcon0] enabled [ 2.639955] netconsole: network logging started [ 2.640951] Driver 'sd' needs ...
Seems like you've got little or no anon pages left, so 1GB free swap I would hope it is not a memory leak (which might point to lockless pagecache). It doesn't look like it because there are still lots of inactive file pages, so that points to the page reclaim changes (which is not to say page reclaim changes couldn't cause a memory leak themselves). Curious: if you kill off all the LTP tests after the OOM condition, what does your /proc/meminfo look like before and after running

sync ; echo 3 > /proc/sys/vm/drop_caches

--
Hey, I'm liking this kernel-testers list, btw. Makes it much easier to help people with problems. Luckily I suggested it at last KS. Oh wait, I recall everybody laughed or ignored :) I guess I lack the managerial qualities to make those kinds of suggestions! --
OK, weird. Zero pages on active_anon and inactive_anon. I suspect we lost those pages. And what's up with the all_unreclaimable logic? If that isn't working then we'll spend lots of CPU scanning zones which aren't releasing any pages. Hopefully that won't be needed at all if all these patches work as hoped, but I don't think Rik intentionally disabled it at this --
It is init that invokes the OOM killer, the actual process killed comes at the end I believe: --
At least, when I ran LTP last week this error didn't happen. I'll investigate more. Thanks. --
Hi, FWIW, I can reproduce it reliably:

$ cd <ltp-dir>/testcases/bin
$ ./growfiles -W gf15 -b -e 1 -u -r 1-49600 -I r -u -i 0 -L 120 Lgfile1

And then wait for a few secs before the OOM triggers. Regards, Frederik --
On Tue, 10 Jun 2008 23:27:05 -0700 Known problem. I fixed this one in the updates I sent you last night. -- All rights reversed. --
Oh good. Yeah I was just running some tests, and got as far as verifying that the upstream kernel + lockless pagecache patches reclaims file pages like a dream, but -mm2 sucks very badly at it. During which, I also did find by inspection a little problem with my speculative references patch. Andrew please apply this fix.
Hi Andrew, The 2.6.26-rc5-mm2 kernel panics while booting up on the x86_64 box with the attached .config file. kernel BUG at arch/x86/kernel/setup.c:388! invalid opcode: 0000 [1] SMP DEBUG_PAGEALLOC last sysfs file: CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.26-rc5-mm2-autokern1 #1 RIP: 0010:[<ffffffff80210492>] [<ffffffff80210492>] _node_to_cpumask_ptr+0x54/0x6a RSP: 0000:ffff8100bf683d30 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000040 RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff806907c0 RBP: ffff8100bf683d40 R08: 0000000000000000 R09: ffff8100bf683c90 R10: ffffffff806a30e0 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000000 R15: ffff81000104da58 FS: 0000000000000000(0000) GS:ffffffff8073fac0(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 1, threadinfo ffff8100bf682000, task ffff8100bf688000) Stack: ffff8100bf683c90 0000000000000001 ffff8100bf683da0 ffffffff8022ee0a 00000000ffffffff 7fffffff00000001 0000000000000001 0000000000000000 0000000000000000 ffff81000104da58 ffff81000104da40 ffff8100bf64e030 Call Trace: [<ffffffff8022ee0a>] sched_domain_node_span+0x56/0xcb [<ffffffff8022f199>] __build_sched_domains+0x1aa/0x64d [<ffffffff8025730b>] mark_held_locks+0x4a/0x6a [<ffffffff8020b360>] mcount_call+0x5/0x35 [<ffffffff803c1934>] do_check_likely+0x9/0x65 [<ffffffff802a0d20>] kmem_cache_alloc+0xb6/0xd6 [<ffffffff8022face>] arch_init_sched_domains+0x63/0x71 [<ffffffff80763694>] sched_init_smp+0x60/0x119 [<ffffffff80750999>] kernel_init+0xf9/0x2bf [<ffffffff8020b360>] mcount_call+0x5/0x35 [<ffffffff8020b360>] mcount_call+0x5/0x35 [<ffffffff80526b17>] trace_hardirqs_on_thunk+0x3a/0x3f ...
Just to save everyone the trouble, it looks like this is a new BUG_ON(). http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.26-rc5/2.6.26-rc5-m... The machine in question is a single-node machine, but with CONFIG_NUMA=y. --
Yes. Sorry, I already responded in a separate e-mail (see below), but
that obviously missed all the Ccs. So here it goes again...:
I'm betting
commit a953e4597abd51b74c99e0e3b7074532a60fd031
Author: Mike Travis <travis@sgi.com>
Date: Mon May 12 21:21:12 2008 +0200
sched: replace MAX_NUMNODES with nr_node_ids in kernel/sched.c
will fix this if it's not in -mm2 already.
The BUG() is simply there to prevent silent corruption. Mike already
has a patch that changes it to a WARN(), but it obviously didn't get
through (either)...
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
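One plausible reading of the failure, sketched in userspace C. This is an inference from the commit title and the trace, not a copy of the scheduler or setup.c code; MAX_NUMNODES_MODEL, its value and the *_model names are assumptions. A loop bounded by the compile-time maximum asks for nodes that do not exist on a single-node machine, and the bounds check catches every such lookup.

```c
#include <stddef.h>

#define MAX_NUMNODES_MODEL 64   /* compile-time upper bound (assumed value) */

struct node_map_model {
    int nr_node_ids;                            /* nodes possible at runtime */
    unsigned long cpumask[MAX_NUMNODES_MODEL];  /* valid for [0, nr_node_ids) */
};

/* Returns the node's CPU mask, or NULL for an out-of-range node.
 * The check plays the role of the BUG()/WARN() discussed above. */
const unsigned long *node_to_cpumask_model(const struct node_map_model *m,
                                           int node)
{
    if (node < 0 || node >= m->nr_node_ids)
        return NULL;
    return &m->cpumask[node];
}

/* ORs together the masks of nodes [0, limit) into *span and returns
 * the number of out-of-range lookups the bounds check caught. */
int span_nodes(const struct node_map_model *m, int limit,
               unsigned long *span)
{
    int violations = 0;

    *span = 0;
    for (int node = 0; node < limit; node++) {
        const unsigned long *mask = node_to_cpumask_model(m, node);

        if (!mask) {
            violations++;
            continue;
        }
        *span |= *mask;
    }
    return violations;
}
```

Passing nr_node_ids as the limit instead of the compile-time maximum yields the same span with no violations, which is the shape of the change named in the commit title.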
Hi, -- Thanks & Regards, Kamalesh Babulal, Linux Technology Center, IBM, ISTL. --
Hi, I face problems after some of the pnp changes. If this is not known, I may bisect it, it's 100% reproducible. I have no real logs: it panics before the network is woken up, so nothing shows on netconsole. I just captured a function name and the offset of the place where it oopses. pnpacpi_encode_resources, ACPI_RESOURCE_TYPE_DMA case, pnp_get_resource(dev, IORESOURCE_DMA, dma) returns NULL, which is dereferenced in pnpacpi_encode_dma at p->flags. It happens on resume after echoing mem > /sys/power/state. --
Thanks for the report, I hadn't heard about this.
We used to always have a resource from the static table to encode
(assuming the table was big enough), even if that resource was
disabled or unassigned. But now we don't keep those around, so
we can end up with null pointers like you're seeing.
Before you go to all the trouble of bisecting it, can you turn on
CONFIG_PNP_DEBUG and try the following debug patch? I think this
will prevent the oops, but it's just papering over the real problem,
so please capture the complete dmesg log.
Bjorn
Index: work10/drivers/pnp/pnpacpi/rsparser.c
===================================================================
--- work10.orig/drivers/pnp/pnpacpi/rsparser.c 2008-06-11 12:46:28.000000000 -0600
+++ work10/drivers/pnp/pnpacpi/rsparser.c 2008-06-11 12:59:43.000000000 -0600
@@ -749,6 +749,11 @@ static void pnpacpi_encode_irq(struct pn
struct acpi_resource_irq *irq = &resource->data.irq;
int triggering, polarity, shareable;
+ if (!p) {
+ dev_err(&dev->dev, " no irq resource to encode!\n");
+ return;
+ }
+
decode_irq_flags(dev, p->flags, &triggering, &polarity, &shareable);
irq->triggering = triggering;
irq->polarity = polarity;
@@ -771,6 +776,11 @@ static void pnpacpi_encode_ext_irq(struc
struct acpi_resource_extended_irq *extended_irq = &resource->data.extended_irq;
int triggering, polarity, shareable;
+ if (!p) {
+ dev_err(&dev->dev, " no extended irq resource to encode!\n");
+ return;
+ }
+
decode_irq_flags(dev, p->flags, &triggering, &polarity, &shareable);
extended_irq->producer_consumer = ACPI_CONSUMER;
extended_irq->triggering = triggering;
@@ -791,6 +801,11 @@ static void pnpacpi_encode_dma(struct pn
{
struct acpi_resource_dma *dma = &resource->data.dma;
+ if (!p) {
+ dev_err(&dev->dev, " no dma resource to encode!\n");
+ return;
+ }
+
/* Note: pnp_assign_dma will copy pnp_dma->flags into p->flags */
switch (p->flags & IORESOURCE_DMA_SPEED_MASK) {
case ...

ACPI: PCI interrupt for device 0000:00:02.0 disabled
serial 00:07: disabled
serial 00:06: disabled
ACPI handle has no context!
ACPI: PCI interrupt for device 0000:00:1d.7 disabled
...
serial 00:06: no dma resource to encode!
serial 00:06: activated
serial 00:07: no dma resource to encode!
serial 00:07: activated
ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 16
--
Interesting. I wonder why a serial device would have a DMA resource. We encode resources by following a template from _CRS, so evidently that template had a DMA resource. Or something deeper is wrong. Can you send me the rest of that dmesg log? I take it that with the debug patch, your system is functional after resume? Bjorn --
Yes, it is :). Linux version 2.6.26-rc5-mm3_64 (ku@bellona) (gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE Linux) ) #421 SMP Thu Jun 12 22:59:48 CEST 2008 Command line: root=/dev/md1 vga=1 ro reboot=a,w slub_debug BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009ec00 (usable) BIOS-e820: 000000000009ec00 - 00000000000a0000 (reserved) BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000007d5b0000 (usable) BIOS-e820: 000000007d5b0000 - 000000007d5be000 (ACPI data) BIOS-e820: 000000007d5be000 - 000000007d5f0000 (ACPI NVS) BIOS-e820: 000000007d5f0000 - 000000007d600000 (reserved) BIOS-e820: 00000000fed90000 - 00000000fed94000 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved) last_pfn = 513456 max_arch_pfn = 17179869183 x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 init_memory_mapping DMI present. ACPI: RSDP 000F9990, 0024 (r2 ACPIAM) ACPI: XSDT 7D5B0100, 0064 (r1 A M I OEMXSDT 5000708 MSFT 97) ACPI: FACP 7D5B0290, 00F4 (r3 A M I OEMFACP 5000708 MSFT 97) ACPI: DSDT 7D5B0490, 6643 (r1 SDBLI9 SDBLI944 44 INTL 20051117) ACPI: FACS 7D5BE000, 0040 ACPI: APIC 7D5B0390, 006C (r1 A M I OEMAPIC 5000708 MSFT 97) ACPI: MCFG 7D5B0450, 003C (r1 A M I OEMMCFG 5000708 MSFT 97) ACPI: OEMB 7D5BE040, 0079 (r1 A M I AMI_OEM 5000708 MSFT 97) ACPI: HPET 7D5B6AE0, 0038 (r1 A M I OEMHPET 5000708 MSFT 97) ACPI: GSCI 7D5BE0C0, 2024 (r1 A M I GMCHSCI 5000708 MSFT 97) ACPI: iEIT 7D5C00F0, 00B0 (r1 A M I EITTABLE 5000708 MSFT 97) ACPI: DMAR 7D5B6BC0, 0118 (r1 A M I OEMDMAR 1 MSFT 97) early res: 0 [0-fff] BIOS data page early res: 1 [6000-7fff] TRAMPOLINE early res: 2 [200000-7cd447] TEXT DATA BSS early res: 3 [9ec00-fffff] BIOS reserved early res: 4 [8000-afff] PGTABLE Scan SMP ...
Thanks, but it looks like CONFIG_PNP_DEBUG is not turned on. Can you turn that on and capture the log again, please? Bjorn --
Sorry, too tired, so I overlooked it. Tomorrow. Thanks. --
Here it goes: Linux version 2.6.26-rc5-mm3_64-pnp (ku@bellona) (gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE Linux) ) #1 SMP Fri Jun 13 17:49:16 CEST 2008 Command line: root=/dev/md1 vga=1 ro reboot=a,w slub_debug 2 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009ec00 (usable) BIOS-e820: 000000000009ec00 - 00000000000a0000 (reserved) BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000007d5b0000 (usable) BIOS-e820: 000000007d5b0000 - 000000007d5be000 (ACPI data) BIOS-e820: 000000007d5be000 - 000000007d5f0000 (ACPI NVS) BIOS-e820: 000000007d5f0000 - 000000007d600000 (reserved) BIOS-e820: 00000000fed90000 - 00000000fed94000 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved) last_pfn = 513456 max_arch_pfn = 17179869183 x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 init_memory_mapping DMI present. ACPI: RSDP 000F9990, 0024 (r2 ACPIAM) ACPI: XSDT 7D5B0100, 0064 (r1 A M I OEMXSDT 5000708 MSFT 97) ACPI: FACP 7D5B0290, 00F4 (r3 A M I OEMFACP 5000708 MSFT 97) ACPI: DSDT 7D5B0490, 6643 (r1 SDBLI9 SDBLI944 44 INTL 20051117) ACPI: FACS 7D5BE000, 0040 ACPI: APIC 7D5B0390, 006C (r1 A M I OEMAPIC 5000708 MSFT 97) ACPI: MCFG 7D5B0450, 003C (r1 A M I OEMMCFG 5000708 MSFT 97) ACPI: OEMB 7D5BE040, 0079 (r1 A M I AMI_OEM 5000708 MSFT 97) ACPI: HPET 7D5B6AE0, 0038 (r1 A M I OEMHPET 5000708 MSFT 97) ACPI: GSCI 7D5BE0C0, 2024 (r1 A M I GMCHSCI 5000708 MSFT 97) ACPI: iEIT 7D5C00F0, 00B0 (r1 A M I EITTABLE 5000708 MSFT 97) ACPI: DMAR 7D5B6BC0, 0118 (r1 A M I OEMDMAR 1 MSFT 97) early res: 0 [0-fff] BIOS data page early res: 1 [6000-7fff] TRAMPOLINE early res: 2 [200000-7cd447] TEXT DATA BSS early res: 3 [9ec00-fffff] BIOS reserved early res: 4 [8000-afff] PGTABLE Scan SMP from ...
Thanks a lot! Your BIOS clearly claims that at least one of your
serial ports can be configured with DMA:
pnp 00:07: dependent set 5 (acceptable) io min 0x3f8 max 0x3f8 align 1 size 8 flags 0x1
pnp 00:07: dependent set 5 (acceptable) irq 3 4 5 6 7 10 11 12 flags 0x1
pnp 00:07: dependent set 5 (acceptable) dma 0 1 2 3 (bitmask 0xf) flags 0x0
That's weird, but whatever, we still have to be careful to give the
BIOS back what it expects, and I think that means we have to keep
track of that disabled DMA resource in pnpacpi_allocated_resource().
Can you please replace the debug patch with the one below and send me
the results again?
Index: work10/drivers/pnp/pnpacpi/rsparser.c
===================================================================
--- work10.orig/drivers/pnp/pnpacpi/rsparser.c 2008-06-11 12:46:28.000000000 -0600
+++ work10/drivers/pnp/pnpacpi/rsparser.c 2008-06-13 11:13:21.000000000 -0600
@@ -240,6 +240,7 @@ static acpi_status pnpacpi_allocated_res
struct acpi_resource_fixed_memory32 *fixed_memory32;
struct acpi_resource_extended_irq *extended_irq;
int i, flags;
+ u8 channel;
switch (res->type) {
case ACPI_RESOURCE_TYPE_IRQ:
@@ -259,13 +260,13 @@ static acpi_status pnpacpi_allocated_res
case ACPI_RESOURCE_TYPE_DMA:
dma = &res->data.dma;
- if (dma->channel_count > 0) {
- flags = dma_flags(dma->type, dma->bus_master,
- dma->transfer);
- if (dma->channels[0] == (u8) -1)
- flags |= IORESOURCE_DISABLED;
- pnp_add_dma_resource(dev, dma->channels[0], flags);
+ channel = dma->channels[0];
+ flags = dma_flags(dma->type, dma->bus_master, dma->transfer);
+ if (dma->channel_count == 0 || dma->channels[0] == (u8) -1) {
+ channel = -1;
+ flags = IORESOURCE_DISABLED;
}
+ pnp_add_dma_resource(dev, channel, flags);
break;
case ACPI_RESOURCE_TYPE_IO:
@@ -749,6 +750,11 @@ static void pnpacpi_encode_irq(struct pn
struct acpi_resource_irq *irq = &resource->data.irq;
int triggering, polarity, ...

Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
ACPI: bus type pnp registered
pnp 00:00: parse allocated resources
pnp 00:00: add io 0xcf8-0xcff flags 0x1
pnp 00:00: Plug and Play ACPI device, IDs PNP0a08 PNP0a03 (active)
pnp 00:01: parse allocated resources
pnp 00:01: add mem 0xfed14000-0xfed19fff flags 0x1
pnp 00:01: PNP0c01: calling quirk_system_pci_resources+0x0/0x1d0
pnp 00:01: Plug and Play ACPI device, IDs PNP0c01 (active)
pnp 00:02: parse allocated resources
pnp 00:02: add dma 4 flags 0x4
pnp 00:02: add io 0x0-0xf flags 0x1
pnp 00:02: add io 0x81-0x83 flags 0x1
pnp 00:02: add io 0x87-0x87 flags 0x1
pnp 00:02: add io 0x89-0x8b flags 0x1
pnp 00:02: add io 0x8f-0x8f flags 0x1
pnp 00:02: add io 0xc0-0xdf flags 0x1
pnp 00:02: Plug and Play ACPI device, IDs PNP0200 (active)
pnp 00:03: parse allocated resources
pnp 00:03: add io 0x70-0x71 flags 0x1
pnp 00:03: add irq 8 flags 0x1
pnp 00:03: Plug and Play ACPI device, IDs PNP0b00 (active)
pnp 00:04: parse allocated resources
pnp 00:04: add io 0x61-0x61 flags 0x1
pnp 00:04: Plug and Play ACPI device, IDs PNP0800 (active)
pnp 00:05: parse allocated resources
pnp 00:05: add io 0xf0-0xff flags 0x1
pnp 00:05: add irq 13 flags 0x1
pnp 00:05: Plug and Play ACPI device, IDs PNP0c04 (active)
pnp 00:06: parse allocated resources
pnp 00:06: add io 0x3f8-0x3ff flags 0x1
pnp 00:06: add irq 4 flags 0x1
pnp 00:06: add dma 255 flags 0x10000000
pnp 00:06: parse resource options
pnp 00:06: dependent set 0 (preferred) io min 0x3f8 max 0x3f8 align 1 size 8 flags 0x1
pnp 00:06: dependent set 0 (preferred) irq 4 flags 0x1
pnp 00:06: dependent set 1 (acceptable) io min 0x3f8 max 0x3f8 align 1 size 8 flags 0x1
pnp 00:06: dependent set 1 (acceptable) irq 3 4 5 6 7 10 11 12 flags 0x1
pnp 00:06: dependent set 2 (acceptable) io min 0x2f8 max 0x2f8 align 1 size 8 flags 0x1
pnp 00:06: dependent set 2 (acceptable) irq 3 4 5 6 7 10 11 12 flags 0x1
pnp 00:06: ...
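The DMA handling in the patch above can be restated as a small standalone function, which makes the "add dma 255 flags 0x10000000" line in the log easier to read. This is a userspace sketch: the structs, the array size and the single flags field (standing in for the result of dma_flags()) are assumptions for illustration, and the disabled-flag value is taken from the log output rather than from the headers.

```c
#include <stdint.h>

/* Matches the "flags 0x10000000" printed for the disabled resource
 * in the log above. */
#define IORESOURCE_DISABLED_MODEL 0x10000000u

struct dma_desc_model {
    uint8_t channel_count;
    uint8_t channels[8];     /* array size is an assumption */
    unsigned int flags;      /* stands in for dma_flags(type, bus_master, transfer) */
};

struct dma_resource_model {
    uint8_t channel;
    unsigned int flags;
};

/* Always produces a resource: an absent or unassigned channel becomes
 * a disabled placeholder instead of being dropped, so there is still
 * something to encode on resume. */
struct dma_resource_model decode_dma(const struct dma_desc_model *dma)
{
    struct dma_resource_model res;

    if (dma->channel_count == 0 || dma->channels[0] == (uint8_t)-1) {
        res.channel = (uint8_t)-1;   /* prints as 255 */
        res.flags = IORESOURCE_DISABLED_MODEL;
    } else {
        res.channel = dma->channels[0];
        res.flags = dma->flags;
    }
    return res;
}
```

With these definitions, a descriptor with one channel set to 4 decodes to channel 4 with its own flags, while an empty or 0xff descriptor decodes to channel 255 with the disabled flag, matching the two "add dma" lines for devices 00:02 and 00:06.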
