On Thu, Nov 25, 2010 at 04:33:01PM +0100, Peter Sch??ller wrote:
Disclaimer: I have no idea what I'm doing. :)
Your buddyinfo looks to be pretty low for order 3 and above, before and
after the sudden eviction, so my guess is that it's probably related to
the issues I'm seeing with fragmentation, but maybe not fighting between
zones, since you seem to have a larger Normal zone than DMA32. (Not
sure, you didn't post /proc/zoneinfo). Also, you seem to be on an
actual NUMA system, so other things are happening there, too.
If you have munin installed (it looks like you do), try enabling the
buddyinfo plugin available since munin 1.4.4. It graphs the buddyinfo
data, so it could be lined up with the memory graphs (thanks Erik).
[snip]
Note that all of the above are actually attempting order-3 allocations
first; see /sys/kernel/slab/kmalloc-1024/order, for instance. The "8" is
means "8 pages per slab", which means order 3 is the attempted allocation
size.
I did the following on a system to test, but the free memory did not
actually improve. It seems that even only order 1 allocations are enough
to reclaim too much order 0. Even a "while true; sleep .01; done" caused
free memory to start increasing due to order 1 (task_struct allocation)
watermarks waking kswapd, while our other usual VM activity is happening.
#!/bin/bash
for i in /sys/kernel/slab/*/; do
if [ `cat $i/object_size` -le 4096 ]; then
echo 0 > $i/order
else
echo 1 > $i/order
fi
done
But this is on another machine, without Mel's patch, and with 8 GB
memory, so a bigger Normal zone.
[snip]
Try installing the "perf" tool. It can be built from the kernel tree in
tools/perf, and then you usually can just copy the binary around. You
can use it to trace the points which cause kswapd to wake up, which will
show which processes are doing it, the order, flags, etc.
Just before the eviction is about to happen (or whenever), try this:
perf record --event vmscan:mm_vmscan_wakeup_kswapd --filter 'order>=3' \
--call-graph -a sleep 30
Then view the recorded events with "perf trace", which should spit out
something like this:
lmtp-3531 [003] 432339.243851: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
lmtp-3531 [003] 432339.243856: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
The process which woke kswapd may not be directly responsible for the
allocation as a network interrrupt or something could have happened on
top of it. See "perf report", which is a bit dodgy at least for me, to
see the stack traces, which might make things clearer. For example, my
traces show that even kswapd wakes kswapd sometimes, but it's because of
a trace like this:
- 9.09% kswapd0 [kernel.kallsyms] [k] wakeup_kswapd
wakeup_kswapd
__alloc_pages_nodemask
alloc_pages_current
new_slab
__slab_alloc
__kmalloc_node_track_caller
__alloc_skb
__netdev_alloc_skb
bnx2_poll_work
bnx2_poll
net_rx_action
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
truncate_inode_pages
proc_evict_inode
evict
iput
dentry_iput
d_kill
__shrink_dcache_sb
shrink_dcache_memory
shrink_slab
kswapd
Anyway, maybe you'll see some interesting traces. If kswapd isn't waking
very often, you can also trace "kmem:mm_page_alloc" or similar (see "perf
list"), or try a smaller order or a longer sleep.
Cheers,
Simon-
--