System panics within a few seconds of starting the test.
NaT == Not a Thing. Kernel reports null pointer deref as such. I
believe that NaT Consumption errors come from attempting to deref a
non-NULL pointer that points at non-existent memory.

I tried the workload again with an "unpatched kernel" -- i.e., no
automatic page migration nor replication, nor any other of my
experimental patches. Still happens with memory controller configured
-- same stack trace.

Then I tried an unpatched 23-rc4-mm1 with memory controller NOT
configured, still panic'ed, but with a different symptom: first a soft
lockup, then a NULL pointer deref--apparently in soft lockup detection
code. Panics because it OOPses in interrupt handler.

Tried again, same kernel--mem controller unconfig'd: this time I got
the original stack trace--NaT Consumption in shrink_active_list().
Then, softlockup with NULL pointer deref therein. It's the null pointer
deref that causes the panic: "Aiee, killing interrupt handler!"

So, maybe memory controller is "off the hook".
I guess I need to check the lists for 23-rc4-mm1 hot fixes, and try to
right. I noticed that after I sent the mail.
Also, config available at:
http://free.linux.hp.com/~lts/Temp/config-2.6.23-rc4-mm1-gwydyr-nomemcont

Later,
Lee
-
Be interested to know the outcome of any bisect you do. Given it's
tripping in reclaim.

What size of box is this? Wondering if we have anything big enough to
test with.

-apw
-
Problem isolated to memory controller patches. This patch seems to fix
this particular problem. I've only run the test for a few minutes with
and without memory controller configured, but I did observe reclaim
kicking in several times. W/o this patch, system would panic as soon as
I entered direct/zone reclaim--less than a minute.

Lee

--------------------------------

PATCH 2.6.23-rc4-mm1 Memory Controller: initialize all scan_controls'
isolate_pages member.

We need to initialize all scan_controls' isolate_pages member.
Otherwise, shrink_active_list() attempts to execute at undefined
location.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

Index: Linux/mm/vmscan.c
===================================================================
--- Linux.orig/mm/vmscan.c 2007-09-10 13:22:21.000000000 -0400
+++ Linux/mm/vmscan.c 2007-09-12 15:30:27.000000000 -0400
@@ -1758,6 +1758,7 @@ unsigned long shrink_all_memory(unsigned
 		.swap_cluster_max = nr_pages,
 		.may_writepage = 1,
 		.swappiness = vm_swappiness,
+		.isolate_pages = isolate_pages_global,
 	};

 	current->reclaim_state = &reclaim_state;
@@ -1941,6 +1942,7 @@ static int __zone_reclaim(struct zone *z
 					SWAP_CLUSTER_MAX),
 		.gfp_mask = gfp_mask,
 		.swappiness = vm_swappiness,
+		.isolate_pages = isolate_pages_global,
 	};
 	unsigned long slab_reclaimable;
-
Thanks, excellent catch! The patch looks sane. Thanks for your help in
sorting this issue out. Hmm.. that means I never hit direct/zone reclaim
in my tests (I'll make a mental note to enhance my test cases to cover
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
This is a 16-cpu, 4-node, 32GB HP rx8620. The test load that I'm
running is Dave Anderson's "usex" with a custom test script that runs:

5 built-in usex IO tests to a separate file system on a SCSI disk.
1 built-in usex IO rate test -- to/from same disk/fs.
1 POV ray tracing app--just because I had it :-)
1 script that does "find / -type f | xargs strings >/dev/null" to
pollute the page cache.
2 memtoy scripts to allocate various size anon segments--up to 20GB--
and mlock() them down to force reclaim.
1 32-way parallel kernel build
3 1GB random vm tests
3 1GB sequential vm tests
9 built-in usex "bin" tests--these run a series of programs
from /usr/bin to simulate users doing random things. Not really random,
tho'. Just walks a table of commands sequentially.

This load beats up on the system fairly heavily.
I can package up the usex input script and the other associated scripts
that it invokes, if you're interested. Let me know...

Lee
-
