Re: SLUB 0:1 SLAB (OOM during massive parallel kernel builds)

Previous thread: [PATCH] LinuxPPS - PPS support for Linux by Rodolfo Giometti on Tuesday, October 23, 2007 - 11:04 am. (8 messages)

Next thread: [GIT PULL] 9p: 2.6.24 patches (phase 2) by Eric Van Hensbergen on Tuesday, October 23, 2007 - 11:59 am. (1 message)
From: Alexey Dobriyan
Date: Tuesday, October 23, 2007 - 11:16 am

Abstract:

Nasty OOM killings appear during massive parallel kernel builds with
SLUB, but not with SLAB. By nasty I mean, cc1 processes are killed --
object files in .ccache and tree are corrupted which makes me to
blow up them entirely.


Setup:

92G ext2 partition mounted with noatime. .ccache, master tree, object
trees, build logs residing on it. 2G of RAM, 1G of swap partition.
CFQ scheduler. Core 2 Duo E6400. x86_64 arch.


Workload:

build ~250 configs some fat, some not, in parallel from one source
using O= and save logs to separate dir. This is what I do regularly here.


Observations:

With SLAB this workload never went to OOM killer.
With SLUB and pretty much all debugging enabled, it finishes to the end
(albeit slowly).
With SLUB and no debugging, OOM killer kicks in.

After I notice it at evening, e.g ssh to box doesn't work -- new OOM killings
appear.


Below is dmesg from fresh boot,  OOM messages and .config. Please, give me a
clue on how to proceed from there. If needed I can prepare cut-down scripts
for reproducing and more precise description of setup.

================================================================================
[this is the first OOM]
cc1 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0

Call Trace:
 [<ffffffff8024cfe6>] oom_kill_process+0xf6/0x110
 [<ffffffff8024d49d>] out_of_memory+0x1ad/0x1f0
 [<ffffffff8024f79e>] __alloc_pages+0x28e/0x3a0
 [<ffffffff802518d1>] __do_page_cache_readahead+0x101/0x270
 [<ffffffff8024c43a>] filemap_fault+0x28a/0x3c0
 [<ffffffff8025742c>] __do_fault+0x6c/0x430
 [<ffffffff802592db>] handle_mm_fault+0x3bb/0x700
 [<ffffffff8021c09f>] do_page_fault+0x1bf/0x830
 [<ffffffff803d1583>] thread_return+0x3d/0x53a
 [<ffffffff8025d296>] do_munmap+0x296/0x2f0
 [<ffffffff803d3679>] error_exit+0x0/0x51

Mem-info:
DMA per-cpu:
CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    1: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   ...
From: Christoph Lameter
Date: Tuesday, October 23, 2007 - 12:35 pm

Not sure what this is. Maybe the slowing SLUB solves the race.
Could you try and boot with

slub_debug=F


Regular Order 0 alloc .... but why is there no memory available , 
reclaimed?

-

From: Pekka Enberg
Date: Tuesday, October 23, 2007 - 12:48 pm

Hi,



I am too wondering where all that memory is going. Logging /proc/meminfo 
and slabinfo -S (see Documentation/vm/slabinfo.c) every few minutes or 
so probably might help answering that question.

				Pekka

-

From: Christoph Lameter
Date: Tuesday, October 23, 2007 - 12:57 pm

__GFP_HARDWALL and __GFP_MOVABLE is set. I guess we have no cpusets on the 
box? Wonder what the antifrag status is? Can we get /proc/pagetypeinfo?


-

From: Mel Gorman
Date: Wednesday, October 24, 2007 - 1:09 am

I do not believe the anti-frag status will make a difference as this is
a order-0 failure. Looking at the OOM messages;


No reclaimable memory, no free swap and free memory is below the min free
watermark on the DMA32 zone being allocated from. That system is genuinely
out of memory although I am surprised it didn't fallback to ZONE_DMA but
the likely explanation is the lowmem reserves.


Same again - really OOM.

It seems to be the same all the way through.

I am somewhat at a loss to explain why SLUB would fail in this case when
SLAB wouldn't. One possibility is that SLAB is artifically holding up
processes in allocation so that fewer are running at the same time allowing
the overall job to complete. For a compile-time benchmark, it might not be
very noticable as the processes are already pretty short-lived. It would be
interesting to add more swap so that SLUB doesn't fail and see if there is
a difference in time-to-completion.

If SLUB is faster, it might imply that it is letting more compile-jobs
running at the same time. Output of vmstat during both runs (preferably with
an indication on the SLUB failure where the OOM occured) may tell us. If the
"running" and "blocked" column for SLUB are significantly higher than the
averages for SLAB, would it support that theory Christoph?

Another possibility is that SLUBs runtime overhead is higher although I find
that one harder to believe.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-- 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Alexey Dobriyan
Date: Wednesday, October 24, 2007 - 12:28 pm

Thanks, guys for suggestions.

I redid workload with slub_debug=F and collected /proc/meminfo, slabinfo
-S and /proc/pagetyinfo. It OOMed. Here are first few messages from OOM
killer and few last messages from proc files. During run collecting
script managed to wakeup once in 1-3 minutes. There are 12 minutes between
last and second-to-last messages which is I assume when X happened.
If needed, I can send complete writeout.

I'm going to add more swap now and make timings.




[   36.146064] Adding 987956k swap on /dev/sda1.  Priority:-1 extents:1 across:987956k
[12728.699150] cc1 invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=-17
[12728.699299] 
[12728.699299] Call Trace:
[12728.699422]  [<ffffffff8024d176>] oom_kill_process+0xf6/0x110
[12728.699505]  [<ffffffff8024d62d>] out_of_memory+0x1ad/0x1f0
[12728.699587]  [<ffffffff8024f92e>] __alloc_pages+0x28e/0x3a0
[12728.699669]  [<ffffffff80251a61>] __do_page_cache_readahead+0x101/0x270
[12728.699757]  [<ffffffff8024c5ca>] filemap_fault+0x28a/0x3c0
[12728.699839]  [<ffffffff802575bc>] __do_fault+0x6c/0x430
[12728.699920]  [<ffffffff8025946b>] handle_mm_fault+0x3bb/0x700
[12728.700004]  [<ffffffff8021c1ff>] do_page_fault+0x1bf/0x830
[12728.700086]  [<ffffffff8025e94c>] do_mmap_pgoff+0x2fc/0x360
[12728.700168]  [<ffffffff8025d426>] do_munmap+0x296/0x2f0
[12728.700249]  [<ffffffff802e2d11>] __up_write+0x21/0x130
[12728.700331]  [<ffffffff803d3c99>] error_exit+0x0/0x51
[12728.700409] 
[12728.700465] Mem-info:
[12728.700526] DMA per-cpu:
[12728.700589] CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
[12728.700732] CPU    1: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
[12728.700874] DMA32 per-cpu:
[12728.700938] CPU    0: Hot: hi:  186, btch:  31 usd:  94   Cold: hi:   62, btch:  15 usd:  30
[12728.701081] CPU    1: Hot: hi:  186, btch:  31 usd: 115   Cold: hi:   62, btch:  15 usd:  57
[12728.701224] Active:433892 inactive:22010 dirty:0 writeback:0 ...
From: Christoph Lameter
Date: Wednesday, October 24, 2007 - 7:15 pm

Ummm... all unreclaimable is set! Are you mlocking the pages in memory? Or 
what causes this? All pages under writeback? What is the dirty ratio set 


And again.

-

From: Nick Piggin
Date: Wednesday, October 24, 2007 - 7:34 pm

Why is SLUB behaving differently, though.

Memory efficiency wouldn't be the reason, would it? I mean, SLUB
should be more efficient than SLAB, plus have less data lying around
in queues.
-

From: Christoph Lameter
Date: Wednesday, October 24, 2007 - 7:43 pm

SLAB may have data around in queues which (if the stars align the right 
way) may allow it to go longer without having to get a page from the page 
allocator.
-

From: Nick Piggin
Date: Wednesday, October 24, 2007 - 8:06 pm

From the reports it seems pretty consistent. I guess it could well
be something that may occur with SLAB *if the conditions are a bit

But page allocs from slab isn't where the OOMs are occurring, so this
seems unlikely (also, the all_unreclaimable logic now should be pretty
strict, so you have to really run the machine out of memory (1GB of
swap gets fully used, then his DMA32 zone is scanned 8 times without
reclaiming a single page).

That said, parallel kernel compiling can really change a lot in memory
footprint depending on small variations in timing. So it might not be
anything to worry about.
-

From: Pekka Enberg
Date: Tuesday, October 23, 2007 - 1:04 pm

Hi,



What kind of race are you thinking of? What I initially thought was
that the problem is that SLUB messes up some other VM heuristics due
to different object sizes, not holding on to empty slabs, and/or page
allocator pass-through. I guess only object size is affected by
debugging, though?
-

From: Christoph Lameter
Date: Tuesday, October 23, 2007 - 1:09 pm

The number of objects per page is reduced by enabling full debugging. That 
triggers a potential of more order 1 allocations but we are failing at 
order 0 allocs. slub_debug=F does not increase object size with debugging 
information but keeps things as they are while doing as much verification 
of slab integrity as possible.

One--potentially far fetched theory--is that the difference in alloc 
performance may trigger a race. But I guess it is more likely that we will 
find something wrong with the gfp flags (that I recategorized for 2.6.24).


-

From: Pekka Enberg
Date: Tuesday, October 23, 2007 - 1:41 pm

Hi Christoph,

(I fixed linux-mm cc to kvack.org.)


Yeah, but we're _not failing_ when debugging is enabled. Thus, it's
likely, that the _failing_ (non-debug) case has potential for more
order 0 allocs, no? I am just guessing here but maybe it's
slab_order() behaving differently from calculate_slab_order() so that
we see more order 0 pressure in SLUB than SLAB?

                                     Pekka
-

From: Christoph Lameter
Date: Wednesday, October 24, 2007 - 9:56 am

Seesm that order 0 pressure is better than order 1?


-

Previous thread: [PATCH] LinuxPPS - PPS support for Linux by Rodolfo Giometti on Tuesday, October 23, 2007 - 11:04 am. (8 messages)

Next thread: [GIT PULL] 9p: 2.6.24 patches (phase 2) by Eric Van Hensbergen on Tuesday, October 23, 2007 - 11:59 am. (1 message)