Abstract: Nasty OOM killings appear during massive parallel kernel builds with SLUB, but not with SLAB. By nasty I mean, cc1 processes are killed -- object files in .ccache and tree are corrupted which makes me to blow up them entirely. Setup: 92G ext2 partition mounted with noatime. .ccache, master tree, object trees, build logs residing on it. 2G of RAM, 1G of swap partition. CFQ scheduler. Core 2 Duo E6400. x86_64 arch. Workload: build ~250 configs some fat, some not, in parallel from one source using O= and save logs to separate dir. This is what I do regularly here. Observations: With SLAB this workload never went to OOM killer. With SLUB and pretty much all debugging enabled, it finishes to the end (albeit slowly). With SLUB and no debugging, OOM killer kicks in. After I notice it at evening, e.g ssh to box doesn't work -- new OOM killings appear. Below is dmesg from fresh boot, OOM messages and .config. Please, give me a clue on how to proceed from there. If needed I can prepare cut-down scripts for reproducing and more precise description of setup. ================================================================================ [this is the first OOM] cc1 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0 Call Trace: [<ffffffff8024cfe6>] oom_kill_process+0xf6/0x110 [<ffffffff8024d49d>] out_of_memory+0x1ad/0x1f0 [<ffffffff8024f79e>] __alloc_pages+0x28e/0x3a0 [<ffffffff802518d1>] __do_page_cache_readahead+0x101/0x270 [<ffffffff8024c43a>] filemap_fault+0x28a/0x3c0 [<ffffffff8025742c>] __do_fault+0x6c/0x430 [<ffffffff802592db>] handle_mm_fault+0x3bb/0x700 [<ffffffff8021c09f>] do_page_fault+0x1bf/0x830 [<ffffffff803d1583>] thread_return+0x3d/0x53a [<ffffffff8025d296>] do_munmap+0x296/0x2f0 [<ffffffff803d3679>] error_exit+0x0/0x51 Mem-info: DMA per-cpu: CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: ...
Not sure what this is. Maybe the slowing SLUB solves the race. Could you try and boot with slub_debug=F Regular Order 0 alloc .... but why is there no memory available , reclaimed? -
Hi, I am too wondering where all that memory is going. Logging /proc/meminfo and slabinfo -S (see Documentation/vm/slabinfo.c) every few minutes or so probably might help answering that question. Pekka -
__GFP_HARDWALL and __GFP_MOVABLE is set. I guess we have no cpusets on the box? Wonder what the antifrag status is? Can we get /proc/pagetypeinfo? -
I do not believe the anti-frag status will make a difference as this is a order-0 failure. Looking at the OOM messages; No reclaimable memory, no free swap and free memory is below the min free watermark on the DMA32 zone being allocated from. That system is genuinely out of memory although I am surprised it didn't fallback to ZONE_DMA but the likely explanation is the lowmem reserves. Same again - really OOM. It seems to be the same all the way through. I am somewhat at a loss to explain why SLUB would fail in this case when SLAB wouldn't. One possibility is that SLAB is artifically holding up processes in allocation so that fewer are running at the same time allowing the overall job to complete. For a compile-time benchmark, it might not be very noticable as the processes are already pretty short-lived. It would be interesting to add more swap so that SLUB doesn't fail and see if there is a difference in time-to-completion. If SLUB is faster, it might imply that it is letting more compile-jobs running at the same time. Output of vmstat during both runs (preferably with an indication on the SLUB failure where the OOM occured) may tell us. If the "running" and "blocked" column for SLUB are significantly higher than the averages for SLAB, would it support that theory Christoph? Another possibility is that SLUBs runtime overhead is higher although I find that one harder to believe. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -
Thanks, guys for suggestions. I redid workload with slub_debug=F and collected /proc/meminfo, slabinfo -S and /proc/pagetyinfo. It OOMed. Here are first few messages from OOM killer and few last messages from proc files. During run collecting script managed to wakeup once in 1-3 minutes. There are 12 minutes between last and second-to-last messages which is I assume when X happened. If needed, I can send complete writeout. I'm going to add more swap now and make timings. [ 36.146064] Adding 987956k swap on /dev/sda1. Priority:-1 extents:1 across:987956k [12728.699150] cc1 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=-17 [12728.699299] [12728.699299] Call Trace: [12728.699422] [<ffffffff8024d176>] oom_kill_process+0xf6/0x110 [12728.699505] [<ffffffff8024d62d>] out_of_memory+0x1ad/0x1f0 [12728.699587] [<ffffffff8024f92e>] __alloc_pages+0x28e/0x3a0 [12728.699669] [<ffffffff80251a61>] __do_page_cache_readahead+0x101/0x270 [12728.699757] [<ffffffff8024c5ca>] filemap_fault+0x28a/0x3c0 [12728.699839] [<ffffffff802575bc>] __do_fault+0x6c/0x430 [12728.699920] [<ffffffff8025946b>] handle_mm_fault+0x3bb/0x700 [12728.700004] [<ffffffff8021c1ff>] do_page_fault+0x1bf/0x830 [12728.700086] [<ffffffff8025e94c>] do_mmap_pgoff+0x2fc/0x360 [12728.700168] [<ffffffff8025d426>] do_munmap+0x296/0x2f0 [12728.700249] [<ffffffff802e2d11>] __up_write+0x21/0x130 [12728.700331] [<ffffffff803d3c99>] error_exit+0x0/0x51 [12728.700409] [12728.700465] Mem-info: [12728.700526] DMA per-cpu: [12728.700589] CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 [12728.700732] CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0 [12728.700874] DMA32 per-cpu: [12728.700938] CPU 0: Hot: hi: 186, btch: 31 usd: 94 Cold: hi: 62, btch: 15 usd: 30 [12728.701081] CPU 1: Hot: hi: 186, btch: 31 usd: 115 Cold: hi: 62, btch: 15 usd: 57 [12728.701224] Active:433892 inactive:22010 dirty:0 writeback:0 ...
Ummm... all unreclaimable is set! Are you mlocking the pages in memory? Or what causes this? All pages under writeback? What is the dirty ratio set And again. -
Why is SLUB behaving differently, though. Memory efficiency wouldn't be the reason, would it? I mean, SLUB should be more efficient than SLAB, plus have less data lying around in queues. -
SLAB may have data around in queues which (if the stars align the right way) may allow it to go longer without having to get a page from the page allocator. -
From the reports it seems pretty consistent. I guess it could well be something that may occur with SLAB *if the conditions are a bit But page allocs from slab isn't where the OOMs are occurring, so this seems unlikely (also, the all_unreclaimable logic now should be pretty strict, so you have to really run the machine out of memory (1GB of swap gets fully used, then his DMA32 zone is scanned 8 times without reclaiming a single page). That said, parallel kernel compiling can really change a lot in memory footprint depending on small variations in timing. So it might not be anything to worry about. -
Hi, What kind of race are you thinking of? What I initially thought was that the problem is that SLUB messes up some other VM heuristics due to different object sizes, not holding on to empty slabs, and/or page allocator pass-through. I guess only object size is affected by debugging, though? -
The number of objects per page is reduced by enabling full debugging. That triggers a potential of more order 1 allocations but we are failing at order 0 allocs. slub_debug=F does not increase object size with debugging information but keeps things as they are while doing as much verification of slab integrity as possible. One--potentially far fetched theory--is that the difference in alloc performance may trigger a race. But I guess it is more likely that we will find something wrong with the gfp flags (that I recategorized for 2.6.24). -
Hi Christoph,
(I fixed linux-mm cc to kvack.org.)
Yeah, but we're _not failing_ when debugging is enabled. Thus, it's
likely, that the _failing_ (non-debug) case has potential for more
order 0 allocs, no? I am just guessing here but maybe it's
slab_order() behaving differently from calculate_slab_order() so that
we see more order 0 pressure in SLUB than SLAB?
Pekka
-
Seesm that order 0 pressure is better than order 1? -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify t |
