Abstract:
Nasty OOM killings appear during massive parallel kernel builds with
SLUB, but not with SLAB. By nasty I mean, cc1 processes are killed --
object files in .ccache and tree are corrupted which makes me to
blow up them entirely.Setup:
92G ext2 partition mounted with noatime. .ccache, master tree, object
trees, build logs residing on it. 2G of RAM, 1G of swap partition.
CFQ scheduler. Core 2 Duo E6400. x86_64 arch.Workload:
build ~250 configs some fat, some not, in parallel from one source
using O= and save logs to separate dir. This is what I do regularly here.Observations:
With SLAB this workload never went to OOM killer.
With SLUB and pretty much all debugging enabled, it finishes to the end
(albeit slowly).
With SLUB and no debugging, OOM killer kicks in.After I notice it at evening, e.g ssh to box doesn't work -- new OOM killings
appear.Below is dmesg from fresh boot, OOM messages and .config. Please, give me a
clue on how to proceed from there. If needed I can prepare cut-down scripts
for reproducing and more precise description of setup.================================================================================
[this is the first OOM]
cc1 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0Call Trace:
[<ffffffff8024cfe6>] oom_kill_process+0xf6/0x110
[<ffffffff8024d49d>] out_of_memory+0x1ad/0x1f0
[<ffffffff8024f79e>] __alloc_pages+0x28e/0x3a0
[<ffffffff802518d1>] __do_page_cache_readahead+0x101/0x270
[<ffffffff8024c43a>] filemap_fault+0x28a/0x3c0
[<ffffffff8025742c>] __do_fault+0x6c/0x430
[<ffffffff802592db>] handle_mm_fault+0x3bb/0x700
[<ffffffff8021c09f>] do_page_fault+0x1bf/0x830
[<ffffffff803d1583>] thread_return+0x3d/0x53a
[<ffffffff8025d296>] do_munmap+0x296/0x2f0
[<ffffffff803d3679>] error_exit+0x0/0x51Mem-info:
DMA per-cpu:
CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
CPU 1: Hot: hi: ...
Not sure what this is. Maybe the slowing SLUB solves the race.
Could you try and boot withslub_debug=F
Regular Order 0 alloc .... but why is there no memory available ,
reclaimed?-
Hi,
What kind of race are you thinking of? What I initially thought was
that the problem is that SLUB messes up some other VM heuristics due
to different object sizes, not holding on to empty slabs, and/or page
allocator pass-through. I guess only object size is affected by
debugging, though?
-
The number of objects per page is reduced by enabling full debugging. That
triggers a potential of more order 1 allocations but we are failing at
order 0 allocs. slub_debug=F does not increase object size with debugging
information but keeps things as they are while doing as much verification
of slab integrity as possible.One--potentially far fetched theory--is that the difference in alloc
performance may trigger a race. But I guess it is more likely that we will
find something wrong with the gfp flags (that I recategorized for 2.6.24).-
Hi Christoph,
(I fixed linux-mm cc to kvack.org.)
Yeah, but we're _not failing_ when debugging is enabled. Thus, it's
likely, that the _failing_ (non-debug) case has potential for more
order 0 allocs, no? I am just guessing here but maybe it's
slab_order() behaving differently from calculate_slab_order() so that
we see more order 0 pressure in SLUB than SLAB?Pekka
-
Seesm that order 0 pressure is better than order 1?
-
Hi,
I am too wondering where all that memory is going. Logging /proc/meminfo
and slabinfo -S (see Documentation/vm/slabinfo.c) every few minutes or
so probably might help answering that question.Pekka
-
__GFP_HARDWALL and __GFP_MOVABLE is set. I guess we have no cpusets on the
box? Wonder what the antifrag status is? Can we get /proc/pagetypeinfo?-
I do not believe the anti-frag status will make a difference as this is
a order-0 failure. Looking at the OOM messages;No reclaimable memory, no free swap and free memory is below the min free
watermark on the DMA32 zone being allocated from. That system is genuinely
out of memory although I am surprised it didn't fallback to ZONE_DMA but
the likely explanation is the lowmem reserves.Same again - really OOM.
It seems to be the same all the way through.
I am somewhat at a loss to explain why SLUB would fail in this case when
SLAB wouldn't. One possibility is that SLAB is artifically holding up
processes in allocation so that fewer are running at the same time allowing
the overall job to complete. For a compile-time benchmark, it might not be
very noticable as the processes are already pretty short-lived. It would be
interesting to add more swap so that SLUB doesn't fail and see if there is
a difference in time-to-completion.If SLUB is faster, it might imply that it is letting more compile-jobs
running at the same time. Output of vmstat during both runs (preferably with
an indication on the SLUB failure where the OOM occured) may tell us. If the
"running" and "blocked" column for SLUB are significantly higher than the
averages for SLAB, would it support that theory Christoph?Another possibility is that SLUBs runtime overhead is higher although I find
that one harder to believe.--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
Thanks, guys for suggestions.
I redid workload with slub_debug=F and collected /proc/meminfo, slabinfo
-S and /proc/pagetyinfo. It OOMed. Here are first few messages from OOM
killer and few last messages from proc files. During run collecting
script managed to wakeup once in 1-3 minutes. There are 12 minutes between
last and second-to-last messages which is I assume when X happened.
If needed, I can send complete writeout.I'm going to add more swap now and make timings.
[ 36.146064] Adding 987956k swap on /dev/sda1. Priority:-1 extents:1 across:987956k
[12728.699150] cc1 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=-17
[12728.699299]
[12728.699299] Call Trace:
[12728.699422] [<ffffffff8024d176>] oom_kill_process+0xf6/0x110
[12728.699505] [<ffffffff8024d62d>] out_of_memory+0x1ad/0x1f0
[12728.699587] [<ffffffff8024f92e>] __alloc_pages+0x28e/0x3a0
[12728.699669] [<ffffffff80251a61>] __do_page_cache_readahead+0x101/0x270
[12728.699757] [<ffffffff8024c5ca>] filemap_fault+0x28a/0x3c0
[12728.699839] [<ffffffff802575bc>] __do_fault+0x6c/0x430
[12728.699920] [<ffffffff8025946b>] handle_mm_fault+0x3bb/0x700
[12728.700004] [<ffffffff8021c1ff>] do_page_fault+0x1bf/0x830
[12728.700086] [<ffffffff8025e94c>] do_mmap_pgoff+0x2fc/0x360
[12728.700168] [<ffffffff8025d426>] do_munmap+0x296/0x2f0
[12728.700249] [<ffffffff802e2d11>] __up_write+0x21/0x130
[12728.700331] [<ffffffff803d3c99>] error_exit+0x0/0x51
[12728.700409]
[12728.700465] Mem-info:
[12728.700526] DMA per-cpu:
[12728.700589] CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[12728.700732] CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
[12728.700874] DMA32 per-cpu:
[12728.700938] CPU 0: Hot: hi: 186, btch: 31 usd: 94 Cold: hi: 62, btch: 15 usd: 30
[12728.701081] CPU 1: Hot: hi: 186, btch: 31 usd: 115 Cold: hi: 62, btch: 15 usd: 57
[1272...
Ummm... all unreclaimable is set! Are you mlocking the pages in memory? Or
what causes this? All pages under writeback? What is the dirty ratio setAnd again.
-
Why is SLUB behaving differently, though.
Memory efficiency wouldn't be the reason, would it? I mean, SLUB
should be more efficient than SLAB, plus have less data lying around
in queues.
-
SLAB may have data around in queues which (if the stars align the right
way) may allow it to go longer without having to get a page from the page
allocator.
-
From the reports it seems pretty consistent. I guess it could well
be something that may occur with SLAB *if the conditions are a bitBut page allocs from slab isn't where the OOMs are occurring, so this
seems unlikely (also, the all_unreclaimable logic now should be pretty
strict, so you have to really run the machine out of memory (1GB of
swap gets fully used, then his DMA32 zone is scanned 8 times without
reclaiming a single page).That said, parallel kernel compiling can really change a lot in memory
footprint depending on small variations in timing. So it might not be
anything to worry about.
-
| Linus Torvalds | Re: O_DIRECT question |
| Greg Kroah-Hartman | [PATCH 002/196] Chinese: rephrase English introduction in HOWTO |
| Andrew Morton | -mm merge plans for 2.6.23 |
| David Miller | Re: [PATCH] net: Fix the prototype of call_netdevice_notifiers |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | Re: [GIT]: Networking |
| Radu Rendec | htb parallelism on multi-core platforms |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
