Re: SLUB performance regression vs SLAB

From: Christoph Lameter
Date: Tuesday, September 18, 2007 - 8:36 pm

SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
available then the conservative settings for higher order allocations are
overridden. We then request an order that can accommodate at minimum
100 objects. The size of an individual slab allocation is allowed to reach
up to 256k (order 6 on i386, order 4 on IA64).
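
A minimal sketch of the order selection described above, not the code from
the patch. It assumes 4k pages (as on i386); the names sketch_vfallback_order
and the SKETCH_* constants are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; real kernels derive these per architecture. */
#define SKETCH_PAGE_SIZE      4096UL            /* i386-style 4k pages */
#define SKETCH_MAX_SLAB_BYTES (256UL * 1024UL)  /* the 256k cap from the description */
#define SKETCH_MIN_OBJECTS    100UL             /* minimum objects per slab */

/*
 * Return the smallest page order whose slab holds at least
 * SKETCH_MIN_OBJECTS objects of the given size. The order is never raised
 * past the 256k cap; if even the largest permitted slab cannot hold that
 * many objects, the largest permitted order is returned.
 */
static unsigned int sketch_vfallback_order(size_t object_size)
{
	unsigned int order = 0;

	assert(object_size > 0);

	while ((SKETCH_PAGE_SIZE << (order + 1)) <= SKETCH_MAX_SLAB_BYTES &&
	       (SKETCH_PAGE_SIZE << order) / object_size < SKETCH_MIN_OBJECTS)
		order++;

	return order;
}
```

With 4k pages the cap works out to order 6 (4096 << 6 = 256k), matching the
i386 figure quoted above. A 552-byte object would land on order 4 in this
sketch, since an order-4 slab is the first that holds at least 100 of them.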

Implementing fallback requires special handling of virtual mappings in
the free path. However, the impact is minimal since we already check
whether the address is NULL or ZERO_SIZE_PTR. No additional cachelines are
touched if we do not fall back. However, if we need to handle a virtual
compound page then we walk the kernel page table in the free paths to
determine the page struct.
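
A user-space sketch of the decision the free path has to make, under stated
assumptions. It is not the patch's code: the vmalloc range bounds and all
sketch_* names are hypothetical. The only kernel convention borrowed is that
ZERO_SIZE_PTR is a small non-NULL marker address (16 in mainline), so one
comparison covers both NULL and ZERO_SIZE_PTR.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the kernel's ZERO_SIZE_PTR convention: a small non-NULL marker. */
#define SKETCH_ZERO_SIZE_PTR ((uintptr_t)16)

/* Hypothetical bounds of the vmalloc area, chosen only for this sketch. */
#define SKETCH_VMALLOC_START ((uintptr_t)0xf0000000u)
#define SKETCH_VMALLOC_END   ((uintptr_t)0xff000000u)

enum sketch_free_path {
	SKETCH_FREE_NOTHING,     /* NULL or ZERO_SIZE_PTR: nothing to free */
	SKETCH_FREE_DIRECT,      /* linear mapping: page struct found by arithmetic */
	SKETCH_FREE_WALK_TABLES  /* virtual compound page: walk the kernel page table */
};

/* Classify an address the way the description above outlines the free path. */
static enum sketch_free_path sketch_classify_free(uintptr_t addr)
{
	if (addr <= SKETCH_ZERO_SIZE_PTR)
		return SKETCH_FREE_NOTHING;
	if (addr >= SKETCH_VMALLOC_START && addr < SKETCH_VMALLOC_END)
		return SKETCH_FREE_WALK_TABLES;
	return SKETCH_FREE_DIRECT;
}
```

Only the last branch of the classification is new work; the first comparison
is the check that, per the description, the free path already performs.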

We also need special handling in the allocation paths since the virtual
addresses cannot be obtained via page_address(). SLUB exploits that
page->private is set to the vmalloc address to avoid a costly
vmalloc_address().

However, for diagnostics there is still the need to determine the
vmalloc address from the page struct. There we must use the costly
vmalloc_address().

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slab.h     |    1 
 include/linux/slub_def.h |    1 
 mm/slub.c                |   83 ++++++++++++++++++++++++++++++++---------------
 3 files changed, 60 insertions(+), 25 deletions(-)

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h	2007-09-18 17:03:30.000000000 -0700
+++ linux-2.6/include/linux/slab.h	2007-09-18 17:07:39.000000000 -0700
@@ -19,6 +19,7 @@
  * The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set.
  */
 #define SLAB_DEBUG_FREE		0x00000100UL	/* DEBUG: Perform (expensive) checks on free */
+#define SLAB_VFALLBACK		0x00000200UL	/* May fall back to vmalloc */
 #define SLAB_RED_ZONE		0x00000400UL	/* DEBUG: Red zone objs in a cache */
 #define SLAB_POISON		0x00000800UL	/* DEBUG: Poison objects */
 #define ...
From: Nick Piggin
Date: Thursday, September 27, 2007 - 2:42 pm

How come SLUB wants such a big amount of objects? I thought the
unqueued nature of it made it better than slab because it minimised
the amount of cache hot memory lying around in slabs...

vmalloc is incredibly slow and unscalable at the moment. I'm still working
on making it more scalable and faster -- hopefully to a point where it would
actually be usable for this... but you still get moved off large TLBs, and
also have to inevitably do tlb flushing.

Or do you have SLUB at a point where performance is comparable to SLAB,
and this is just a possible idea for more performance?
-

From: Christoph Lameter
Date: Friday, September 28, 2007 - 10:33 am

The more objects in a page the more the fast path runs. The more the fast 
path runs the lower the cache footprint and the faster the overall 
allocations etc.
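
To make the argument concrete, here is a rough model rather than a
measurement. It assumes that the slow path is entered exactly once per fresh
slab and that every other allocation is served by the fast path, ignoring
frees and partial slabs. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Number of slow-path entries needed to serve `allocations` objects when
 * each slab holds `objects_per_slab` of them: one entry per slab consumed,
 * i.e. ceil(allocations / objects_per_slab).
 */
static unsigned long sketch_slow_path_entries(unsigned long allocations,
					      unsigned long objects_per_slab)
{
	assert(objects_per_slab > 0);
	return (allocations + objects_per_slab - 1) / objects_per_slab;
}
```

Under this model, 7000 allocations from slabs holding 7 objects each enter
the slow path 1000 times, while slabs holding 118 objects each need only 60
entries.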

SLAB can be configured for large queues holding lots of objects. 
SLUB can only reach the same through large pages because it does not 
have queues. One could add the ability to manage pools of cpu slabs but 
that would be adding yet another layer to compensate for the problem of 
the small pages. Reliable large page allocations means that we can get rid 
of these layers and the many workarounds that we have in place right now.

The unqueued nature of SLUB reduces memory requirements and in general the 
more efficient code paths of SLUB offset the advantage that SLAB can reach 
by being able to put more objects onto its queues. SLAB necessarily 
introduces complexity and cache line use through the need to manage those 

Again I have not seen any fallbacks to vmalloc in my testing. What we are 
doing here is mainly to address your theoretical cases that we so far have 
never seen to be a problem and increase the reliability of allocations of
page orders larger than 3 to a usable level. So far I have not 
dared to enable orders larger than 3 by default.

AFAICT The performance of vmalloc is not really relevant. If this would 
become an issue then it would be possible to reduce the orders used to 

AFAICT SLUB's performance is superior to SLAB in most cases and it was like 
that from the beginning. I am still concerned about several corner cases 
though (I think most of them are going to be addressed by the per cpu 
patches in mm). Having a comparable or larger amount of per cpu objects as 
SLAB is something that also could address some of these concerns and could 
increase performance much further.
-

From: Peter Zijlstra
Date: Friday, September 28, 2007 - 10:55 am

take a recent -mm kernel, boot with mem=128M.

start 2 processes that each mmap a separate 64M file and do
sequential writes on it. start a 3rd process that does the same with
64M anonymous.

wait for a while, and you'll see order=1 failures.



-

From: Christoph Lameter
Date: Friday, September 28, 2007 - 11:20 am

Ok so only 32k pages to play with? I have tried parallel kernel compiles 

Really? That means we can no longer even allocate stacks for forking.

It's surprising that neither lumpy reclaim nor the mobility patches can 
deal with it? Lumpy reclaim should be able to free neighboring pages to 
avoid the order 1 failure unless there are lots of pinned pages.

I guess then that lots of pages are pinned through I/O?
-

From: Peter Zijlstra
Date: Friday, September 28, 2007 - 11:25 am

memory got massively fragmented, as anti-frag gets easily defeated.
setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
order blocks to stay available, so we don't mix types. however 12M on
128M is rather a lot.

it's still on my todo list to look at it further..

-

From: Christoph Lameter
Date: Friday, September 28, 2007 - 11:41 am

Yes, strict ordering would be much better. On NUMA it may be possible to 
completely forbid merging. We can fall back to other nodes if necessary. 
12M is not much on a NUMA system.

But this shows that (unsurprisingly) we may have issues on systems with 
small amounts of memory and we may not want to use higher orders on such 
systems.

The case you got may be good to use as a testcase for the virtual 
fallback. Hmmmm... Maybe it is possible to allocate the stack as a virtual 
compound page. Got some script/code to produce that problem?
-

From: Mel Gorman
Date: Friday, September 28, 2007 - 2:14 pm

The forbidding of merging is trivial and the code is isolated to one function
__rmqueue_fallback(). We don't do it because the decision at development
time was that it was better to allow fragmentation than take a reclaim step
for example[1] and slow things up. This is based on my initial assumption
of anti-frag being mainly of interest to hugepages which are happy to wait

This is another option if you want to use a higher order for SLUB by
default. Use order-0 unless you are sure there is enough memory. At boot
if there is loads of memory, set the higher order and up min_free_kbytes on
each node to reduce mixing[2]. We can test with Peter's uber-hostile


[1] It might be tunnel vision but I still keep hugepages in mind as the
    principal user of anti-frag. Andy used to have patches that force evicted
    pages of the "foreign" type when mixing occurred so the end result was
    no mixing. We never fully completed them because it was too costly
    for hugepages.

[2] This would require the identification of mixed blocks to be a
    statistic available in mainline. Right now, it's only available in -mm
    when PAGE_OWNER is set

[3] The definition of working in this case being that order-0
    allocations fail which he has produced

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Nick Piggin
Date: Friday, September 28, 2007 - 1:22 pm

Yeah, you could do that, but we generally don't have big problems allocating
stacks in mainline, because we have very few users of higher order pages,
the few that are there don't seem to be a problem.
-

From: Mel Gorman
Date: Friday, September 28, 2007 - 1:59 pm

The 12MB is related to the size of pageblock_order. I strongly suspect
that if you forced pageblock_order to be something like 4 or 5, the
min_free_kbytes would not need to be raised. The current values are

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Andrew Morton
Date: Saturday, September 29, 2007 - 1:13 am

That would be really really bad (as in: patch-dropping time) if those
order-1 allocations are not atomic.

What's the callsite? 
-

From: Peter Zijlstra
Date: Saturday, September 29, 2007 - 1:47 am

Ah, right, that was the detail... all this lumpy reclaim is useless for
atomic allocations. And with SLUB using higher order pages, atomic !0
order allocations will be very very common.

One I can remember was:

  add_to_page_cache()
    radix_tree_insert()
      radix_tree_node_alloc()
        kmem_cache_alloc()

which is an atomic callsite.

Which leaves us in a situation where we can load pages, because there is
free memory, but can't manage to allocate memory to track them.. 

-

From: Peter Zijlstra
Date: Saturday, September 29, 2007 - 1:53 am

Ah, I found a boot log of one of these sessions, its also full of
order-2 OOMs.. :-/

-

From: Andrew Morton
Date: Saturday, September 29, 2007 - 2:01 am

oom-killings, or page allocation failures?  The latter, one hopes.
-

From: Peter Zijlstra
Date: Saturday, September 29, 2007 - 2:14 am

Linux version 2.6.23-rc4-mm1-dirty (root@dyad) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007

...


mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
Call Trace:
611b3878:  [<6002dd28>] printk_ratelimit+0x15/0x17
611b3888:  [<60052ed4>] out_of_memory+0x80/0x100
611b38c8:  [<60054b0c>] __alloc_pages+0x1ed/0x280
611b3948:  [<6006c608>] allocate_slab+0x5b/0xb0
611b3968:  [<6006c705>] new_slab+0x7e/0x183
611b39a8:  [<6006cbae>] __slab_alloc+0xc9/0x14b
611b39b0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
611b39b8:  [<600980f2>] do_mpage_readpage+0x3b3/0x472
611b39e0:  [<6011f89f>] radix_tree_preload+0x70/0xbf
611b39f8:  [<6006cc81>] kmem_cache_alloc+0x51/0x98
611b3a38:  [<6011f89f>] radix_tree_preload+0x70/0xbf
611b3a58:  [<6004f8e2>] add_to_page_cache+0x22/0xf7
611b3a98:  [<6004f9c6>] add_to_page_cache_lru+0xf/0x24
611b3ab8:  [<6009821e>] mpage_readpages+0x6d/0x109
611b3ac0:  [<600d59f0>] ext3_get_block+0x0/0xf2
611b3b08:  [<6005483d>] get_page_from_freelist+0x8d/0xc1
611b3b88:  [<600d6937>] ext3_readpages+0x18/0x1a
611b3b98:  [<60056f00>] read_pages+0x37/0x9b
611b3bd8:  [<60057064>] __do_page_cache_readahead+0x100/0x157
611b3c48:  [<60057196>] do_page_cache_readahead+0x52/0x5f
611b3c78:  [<60050ab4>] filemap_fault+0x145/0x278
611b3ca8:  [<60022b61>] run_syscall_stub+0xd1/0xdd
611b3ce8:  [<6005eae3>] __do_fault+0x7e/0x3ca
611b3d68:  [<6005ee60>] do_linear_fault+0x31/0x33
611b3d88:  [<6005f149>] handle_mm_fault+0x14e/0x246
611b3da8:  [<60120a7b>] __up_read+0x73/0x7b
611b3de8:  [<60013177>] handle_page_fault+0x11f/0x23b
611b3e48:  [<60013419>] segv+0xac/0x297
611b3f28:  [<60013367>] segv_handler+0x68/0x6e
611b3f48:  [<600232ad>] get_skas_faultinfo+0x9c/0xa1
611b3f68:  [<60023853>] userspace+0x13a/0x19d
611b3fc8:  [<60010d58>] fork_handler+0x86/0x8d

Mem-info:
Normal per-cpu:
CPU    0: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
Active:11 inactive:9 dirty:0 writeback:1 unstable:0
 ...
From: Andrew Morton
Date: Saturday, September 29, 2007 - 2:27 am

OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
allocations aren't supposed to fail.

I'm suspecting that did_some_progress thing.
-

From: Nick Piggin
Date: Friday, September 28, 2007 - 1:19 pm

The allocation didn't fail -- it invoked the OOM killer because the kernel
ran out of unfragmented memory. Probably because higher order
allocations are the new vogue in -mm at the moment ;)
-

From: Andrew Morton
Date: Saturday, September 29, 2007 - 12:20 pm

We can't "run out of unfragmented memory" for an order-2 GFP_KERNEL
allocation in this workload.  We go and synchronously free stuff up to make
it work.


That's a different bug.

bug 1: We shouldn't be doing higher-order allocations in slub because of
the considerable damage this does to atomic allocations.

bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.


-

From: Nick Piggin
Date: Saturday, September 29, 2007 - 12:09 pm

Either no more order-2 pages could be freed, or the ones that were being

I think one causes 2 as well -- it isn't just considerable damage to atomic
allocations but to GFP_KERNEL allocations too.
-

From: Andrew Morton
Date: Sunday, September 30, 2007 - 1:12 pm

No.  The current design of reclaim (for better or for worse) is that for
order 0,1,2 and 3 allocations we just keep on trying until it works.  That
got broken and I think it got broken at a design level when that
did_some_progress logic went in.  Perhaps something else we did later

Well sure, because we already broke GFP_KERNEL allocations.
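
The retry policy Andrew describes can be sketched as follows. This is a
simplified model, not the allocator's code: it assumes that the only inputs
are the order and whether the caller may sleep, and the names are
hypothetical. Mainline calls the order-3 threshold PAGE_ALLOC_COSTLY_ORDER.

```c
#include <assert.h>
#include <stdbool.h>

/* Orders up to this value are retried until they succeed (3 in mainline). */
#define SKETCH_COSTLY_ORDER 3

/*
 * Decide whether a failed allocation attempt is retried. Atomic callers
 * cannot sleep, so reclaim cannot run on their behalf and the allocation
 * simply fails; sleeping callers keep trying for small orders.
 */
static bool sketch_should_retry(unsigned int order, bool can_sleep)
{
	if (!can_sleep)
		return false;
	return order <= SKETCH_COSTLY_ORDER;
}
```

In this model an order-2 GFP_KERNEL request is always retried, while an
atomic order-1 request, such as the radix tree node case discussed in this
thread, gets no retry at all.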
-

From: Nick Piggin
Date: Saturday, September 29, 2007 - 9:16 pm

It will keep trying until it works. It won't have stopped trying (unless
I'm very mistaken?), it's just oom killing things merrily along the way.
-

From: Andrew Morton
Date: Saturday, September 29, 2007 - 2:00 am

Oh OK.

I thought we'd already fixed slub so that it didn't do that.  Maybe that
fix is in -mm but I don't think so.

Trying to do atomic order-1 allocations on behalf of arbitrary slab caches
just won't fly - this is a significant degradation in kernel reliability,

Right.  Leading to application failure which for many is equivalent to a
complete system outage.

-

From: Christoph Lameter
Date: Monday, October 1, 2007 - 1:55 pm

Ummm... SLAB also does order 1 allocations. We have always done them.

See mm/slab.c

/*
 * Do not go above this order unless 0 objects fit into the slab.
 */
#define BREAK_GFP_ORDER_HI      1
#define BREAK_GFP_ORDER_LO      0
static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;

-

From: Andrew Morton
Date: Monday, October 1, 2007 - 2:30 pm

On Mon, 1 Oct 2007 13:55:29 -0700 (PDT)

Do slab and slub use the same underlying page size for each slab?

Single data point: the CONFIG_SLAB boxes which I have access to here are
using order-0 for radix_tree_node, so they won't be failing in the way in
which Peter's machine is.

I've never ever before seen reports of page allocation failures in the
radix-tree node allocation code, and that's the bottom line.  This is just
a drop-dead must-fix show-stopping bug.  We cannot rely upon atomic order-1
allocations succeeding so we cannot use them for radix-tree nodes.  Nor for
lots of other things which we have no chance of identifying.

Peter, is this bug -mm only, or is 2.6.23 similarly failing?
-

From: Christoph Lameter
Date: Monday, October 1, 2007 - 2:38 pm

SLAB cannot pack objects as densely as SLUB, and the two use different 
algorithms to choose the order. Thus the number of objects per slab 
may vary between SLAB and SLUB and therefore also the choice of order to 

Upstream SLUB uses order 0 allocations for the radix tree. MM varies 
because the use of higher order allocs is more loose if the mobility 
algorithms are found to be active:

2.6.23-rc8:

Name                   Objects Objsize    Space Slabs/Part/Cpu  O/S O %Fr %Ef Flg
radix_tree_node          14281     552     9.9M     2432/948/1    7 0  38  79
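
The O/S (objects per slab) and O (order) columns above can be checked with a
little arithmetic. The sketch below assumes 4k pages and that no per-slab
metadata is stored inside the slab itself; the struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_slab_fit {
	size_t objects;      /* whole objects that fit in the slab */
	size_t wasted_bytes; /* bytes left over at the end of the slab */
};

/* Compute how many objects of object_size fit into slab_bytes. */
static struct sketch_slab_fit sketch_fit(size_t slab_bytes, size_t object_size)
{
	struct sketch_slab_fit fit;

	assert(object_size > 0);
	fit.objects = slab_bytes / object_size;
	fit.wasted_bytes = slab_bytes - fit.objects * object_size;
	return fit;
}
```

For the 552-byte radix_tree_node in an order-0 (4096-byte) slab this gives 7
objects with 232 bytes left over, which agrees with the 7 shown in the O/S
column.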

-

From: Andrew Morton
Date: Monday, October 1, 2007 - 2:45 pm

On Mon, 1 Oct 2007 14:38:55 -0700 (PDT)


Ah.  So the already-dropped
slub-exploit-page-mobility-to-increase-allocation-order.patch was the
culprit?
-

From: Christoph Lameter
Date: Monday, October 1, 2007 - 2:52 pm

Yes without that patch SLUB will no longer take special action if antifrag 
is around.

-

From: Peter Zijlstra
Date: Tuesday, October 2, 2007 - 2:19 am

I'm mainly using -mm (so you have at least one tester :-), I think the
-mm specific SLUB patch that ups slub_min_order makes the problem -mm
specific, would have to test .23.
From: Peter Zijlstra
Date: Saturday, September 29, 2007 - 1:45 am

I think I'm running with 4k stacks...

-

From: Christoph Lameter
Date: Monday, October 1, 2007 - 2:01 pm

4k stacks will never fly on an SGI x86_64 NUMA configuration given the 
additional data that may be kept on the stack. We are currently 
considering going from 8k to 16k (or even 32k) to make things work. So 
having the ability to put the stacks in vmalloc space may be something to 
look at.

-

From: Nick Piggin
Date: Tuesday, October 2, 2007 - 1:37 am

i386 and x86-64 have used 8K stacks for years and they have never
really been much of a problem before.

They only started failing when contiguous memory is getting used up
by other things, _even with_ those anti-frag patches in there.

Bottom line is that you do not use higher order allocations when you do
not need them.
-

From: Mel Gorman
Date: Friday, September 28, 2007 - 2:05 pm

Large pages, flood gates etc. Be wary.

SLUB has to run 100% reliably or things go whoops. SLUB regularly depends on
atomic allocations and cannot take the necessary steps to get the contiguous
pages if it gets into trouble. This means that something like lumpy reclaim
cannot help you in its current state.

We currently do not take the pre-emptive steps with kswapd to ensure the
high-order pages are free. We also don't do something like have users that
can sleep keep the watermarks high. I had considered the possibility but
didn't have the justification for the complexity.

Minimally, SLUB by default should continue to use order-0 pages. Peter has
managed to bust order-1 pages with mem=128MB. Admittedly, it was a really
hostile workload but the point remains. It was artificially worked around
with min_free_kbytes (value set based on pageblock_order, could also have
been artificially worked around by dropping pageblock_order) and he eventually

A compromise may be to have per-cpu lists for higher-order pages in the page
allocator itself as they can be easily drained unlike the SLAB queues. The
thing to watch for would be excessive IPI calls which would offset any


If we're falling back to vmalloc ever, there is a danger that the
problem is postponed until vmalloc space is consumed. More an issue for

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Christoph Lameter
Date: Monday, October 1, 2007 - 2:10 pm

SLAB default is order 1 so is SLUB default upstream.

SLAB does runtime detection of the amount of memory and configures the max 
order correspondingly:

from mm/slab.c:

	/*
	 * Fragmentation resistance on low memory - only use bigger
	 * page orders on machines with more than 32MB of memory.
	 */
	if (num_physpages > (32 << 20) >> PAGE_SHIFT)
		slab_break_gfp_order = BREAK_GFP_ORDER_HI;


We could duplicate something like that for SLUB.
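
A user-space sketch of what such a check could look like, assuming it simply
reuses SLAB's 32MB threshold. The function and constant names are
hypothetical and this is not code from any SLUB patch.

```c
#include <assert.h>

#define SKETCH_ORDER_LO 0 /* small machines: stay at order 0 */
#define SKETCH_ORDER_HI 1 /* larger machines: allow order 1 */

/*
 * Pick the maximum slab order from the amount of physical memory.
 * num_physpages is the number of physical pages and page_shift is
 * log2 of the page size, as in the mm/slab.c fragment above.
 */
static int sketch_pick_max_order(unsigned long num_physpages,
				 unsigned int page_shift)
{
	unsigned long threshold_pages = (32UL << 20) >> page_shift;

	return num_physpages > threshold_pages ? SKETCH_ORDER_HI
					       : SKETCH_ORDER_LO;
}
```

With 4k pages the threshold is 8192 pages. A 128MB machine (32768 pages) is
above it, so a threshold copied unchanged from SLAB would still allow the
higher order there.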


-

From: Nick Piggin
Date: Thursday, September 27, 2007 - 10:14 pm

That doesn't sound very nice because you don't actually want to use up
higher order allocations if you can avoid it, and you definitely don't want
to be increasing your slab page size unit if you can help it, because it

I thought it was slower. Have you fixed the performance regression?
(OK, I read further down that you are still working on it but not confirmed

Basically, all that shows is that your testing isn't very thorough. 128MB
is an order of magnitude *more* memory than some users have. They
probably wouldn't be happy with a regression in slab allocator performance

OK, so long as it isn't going to depend on using higher order pages, that's
fine. (if they help even further as an optional thing, that's fine too. You
can turn them on your huge systems and not even bother about adding
this vmap fallback -- you won't have me to nag you about these
purely theoretical issues).
-

From: Christoph Lameter
Date: Monday, October 1, 2007 - 1:50 pm

The problem is with the weird way of Intel testing and communication. 
Every 3-6 months or so they will tell you the system is X% up or down on 
arch Y (and they won't give you details because it's somehow secret). And 
then there are conflicting statements by the two or so performance test 
departments. One of them repeatedly assured me that they do not see any 

Well the vmap fallback is generally useful AFAICT. Higher order 
allocations are common on some of our platforms. Order 1 failures even 
affect essential things like stacks that have nothing to do with SLUB and 
the LBS patchset.


-

From: Nick Piggin
Date: Tuesday, October 2, 2007 - 1:43 am

Just so long as there aren't known regressions that would require higher

I don't know if it is worth the trouble, though. The best thing to do is to
ensure that contiguous memory is not wasted on frivolous things... a few
order-1 or 2 allocations aren't too much of a problem.

The only high order allocation failure I've seen from fragmentation for a
long time IIRC are the order-3 failures coming from e1000. And obviously
they cannot use vmap.
-

From: Matthew Wilcox
Date: Thursday, October 4, 2007 - 9:16 am

Could you cut out the snarky remarks?  It takes a long time to run a
test, and testing every one of the patches you send really isn't high
on anyone's priority list.  The performance team have also been having
problems getting stable results with recent kernels, adding to the delay.
The good news is that we do now have commitment to testing upstream
kernels, so you should see results more frequently than you have been.

I'm taking over from Suresh as liaison for the performance team, so
if you hear *anything* from *anyone* else at Intel about performance,
I want you to cc me about it.  OK?  And I don't want to hear any more
whining about hearing different things from different people.

So, on "a well-known OLTP benchmark which prohibits publishing absolute
numbers" and on an x86-64 system (I don't think exactly which model
is important), we're seeing *6.51%* performance loss on slub vs slab.
This is with a 2.6.23-rc3 kernel.  Tuning the boot parameters, as you've
asked for before (slub_min_order=2, slub_max_order=4, slub_min_objects=8)
gets back 0.38% of that.  It's still down 6.13% over slab.

For what it's worth, 2.6.23-rc3 already has a 1.19% regression versus
RHEL 4.5, so the performance guys are really unhappy about going up to
almost 8% regression.
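
The figures above can be cross-checked with a short calculation. How the two
regressions combine is an assumption here: the sketch treats them as
successive multiplicative losses, and the function names are hypothetical.

```c
#include <assert.h>

/* Loss remaining after tuning recovers part of it, in percentage points. */
static double sketch_residual_loss(double loss_pct, double recovered_pct)
{
	return loss_pct - recovered_pct;
}

/*
 * Total loss in percent when a second regression is applied on top of a
 * first one, treating each as a multiplicative slowdown.
 */
static double sketch_compound_loss(double first_pct, double second_pct)
{
	double remaining = (1.0 - first_pct / 100.0) *
			   (1.0 - second_pct / 100.0);

	return (1.0 - remaining) * 100.0;
}
```

The residual is 6.51 - 0.38 = 6.13 points, as stated. Compounding 1.19% and
6.51% gives roughly 7.6%, and simply adding them gives 7.7%; either reading
is consistent with "almost 8%".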

In the detailed profiles, __slab_free is the third most expensive
function, behind only spin locks.  get_partial_node is right behind it
in fourth place, and kmem_cache_alloc is sixth.  __slab_alloc is eighth
and kmem_cache_free is tenth.  These positions don't change with the
slub boot parameters.

Now, where do we go next?  I suspect that 2.6.23-rc9 has significant
changes since -rc3, but I'd like to confirm that before kicking off
another (expensive) run.  Please tell me which kernels would be useful to test.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a ...
From: Christoph Lameter
Date: Thursday, October 4, 2007 - 10:38 am

Yeah the fastpath vs. slow path is not the issue as Siddha and I concluded 
earlier. Seems that we are mainly seeing cacheline bouncing due to two 
cpus accessing meta data in the same page struct. The patches in 
MM that are scheduled to be merged for .24 address that issue. I 
have repeatedly asked that these patches be tested. The patches were 

I thought Siddha has a test in the works with the per cpu structure 
patchset from MM? Could you sync up with Siddha?

-

From: Matthew Wilcox
Date: Thursday, October 4, 2007 - 11:32 am

I just checked with the guys who did the test.  When I said -rc3, I
mis-spoke; this is 2.6.23-rc3 *plus* the patches which Suresh agreed to
test for you.
-

From: Christoph Lameter
Date: Thursday, October 4, 2007 - 10:49 am

I was not aware of that. Would it be possible for you to summarize all the 
test data that you have right now about SLUB vs. SLAB with the patches 
listed? Exactly what kernel version and what version of the per cpu 
patches were tested? Was the page allocator pass through patchset 
separately applied as I requested?

Finally: Is there some way that I can reproduce the tests on my machines?
-

From: Matthew Wilcox
Date: Thursday, October 4, 2007 - 12:28 pm

We have three runs, all with 2.6.23-rc3 plus the patches that Suresh
applied from 20070922.  The first run is with slab.  The second run is
with SLUB and the third run is SLUB plus the tuning parameters you
recommended.

I have a spreadsheet with Vtune data in it that was collected during
each of these test runs, so we can see which functions are the hottest.

I don't believe so.  Suresh?

I think for future tests, it would be easiest if you send me a git

As usual for these kinds of setups ... take a two-CPU machine, 64GB
of memory, half a dozen fibre channel adapters, about 3000 discs,
a commercial database, a team of experts for three months worth of
tuning ...

I don't know if anyone's tried to replicate a benchmark like this using
Postgres.  Would be nice if they have ...
-

From: Christoph Lameter
Date: Thursday, October 4, 2007 - 12:05 pm

There was quite a bit of communication on tuning parameters. Guess we got 
more confusion there and multiple configuration settings that I wanted to 
be tested separately were merged. Setting slub_min_order to more than zero 
can certainly be detrimental to performance since higher order page 
allocations can cause cacheline bouncing on zone locks.

Which patches? 20070922 refers to a pull on the slab git tree on the 

Please do. Add the kernel .configs please. Is there any slab queue tuning 
going on on boot with the SLAB configuration?


If it was a git pull then the pass through was included and never taken 


Well we got our own performance test department here at SGI. If we get 
them involved then we can add another 3 months until we get the test 
results confirmed ;-). Seems that this is a small configuration. Why
does it take that long? And the experts knew SLAB and not SLUB right?

Let's look at all the data that you got and then see if this is enough to 
figure out what is wrong.
-

From: Siddha, Suresh B
Date: Thursday, October 4, 2007 - 12:46 pm

It was a git pull from the performance branch that you pointed out earlier
http://git.kernel.org/?p=linux/kernel/git/christoph/slab.git;a=log;h=performance

and the config is based on EL5 config with just the SLUB turned on.
-

From: David Miller
Date: Thursday, October 4, 2007 - 1:55 pm

From: willy@linux.intel.com (Matthew Wilcox)

Anything, I do mean anything, can be simulated using small test
programs.  Pointing at a big fancy machine with lots of storage
and disk is a passive aggressive way to avoid the real issues,
in that nobody is putting forth the effort to try and come up
with an at least publishable test case that Christoph can use to
help you guys.

If coming up with a reproducible and publishable test case is
the difference between this getting fixed and it not getting
fixed, are you going to invest the time to do that?
-

From: Matthew Wilcox
Date: Thursday, October 4, 2007 - 2:05 pm

If that's what it takes, then yes.  But I'm far from convinced that
it's as easy to come up with a TPC benchmark simulator as you think.
There have been efforts in the past (orasim, for example), but
presumably Christoph has already tried these benchmarks.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-

From: Christoph Lameter
Date: Thursday, October 4, 2007 - 7:43 pm

I just spent some time looking at the functions that you see high in the 
list. The trouble is that I have to speculate and that I have nothing to 
verify my thoughts. If you could give me the hitlist for each of the 
3 runs then this would help to check my thinking. I could be totally off 
here.

It seems that we miss the per cpu slab frequently on slab_free() which 
leads to the calling of __slab_free() and which in turn needs to take a 
lock on the page (in the page struct). Typically the page lock is 
uncontended which seems to not be the case here otherwise it would not be 
that high up.

The per cpu patch in mm should reduce the contention on the page struct by 
not touching the page struct on alloc and on free. Does not seem to work 
all the way though. slab_free() still has to touch the page struct if the 
free is not to the currently active cpu slab.

So there could still be page struct contention left if multiple processors 
frequently and simultaneously free to the same slab and that slab is not 
the per cpu slab of a cpu. That could be addressed by optimizing the 
object free handling further to not touch the page struct even if we miss 
the per cpu slab.

That get_partial* is far up indicates contention on the list lock that 
should be addressable by either increasing the slab size or by changing 
the object free handling to batch in some form.

This is an SMP system right? 2 cores with 4 cpus each? The main loop is 
always hitting on the same slabs? Which slabs would this be? Am I right in 
thinking that one process allocates objects and then lets multiple other 
processors do work and then the allocated object is freed from a cpu that 
did not allocate the object? If neighboring objects in one slab are 
allocated on one cpu and then are almost simultaneously freed from a set 
of different cpus then this may explain the situation.
-

From: Arjan van de Ven
Date: Thursday, October 4, 2007 - 7:53 pm

On Thu, 4 Oct 2007 19:43:58 -0700 (PDT)

one of the characteristics of the application in use is the following:
all cores submit IO (which means they allocate various scsi and block
structures on all cpus).. but only 1 will free it (the one the IRQ is
bound to). So it's allocate-on-one-free-on-another at a high rate.

That is assuming this is the IO slab; that's a bit of an assumption
obviously (it's one of the slab things that are hot, but it's a complex
workload, there could be others)
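
The pattern described here can be modelled in a few lines. This is a
deliberately crude sketch, not a benchmark: it assumes every CPU allocates
only from its own per-cpu slab and that a free is cheap only when the object
belongs to the freeing CPU's own slab. All names are hypothetical.

```c
#include <assert.h>

struct sketch_free_stats {
	unsigned long fast; /* frees that hit the freeing CPU's own slab */
	unsigned long slow; /* frees that must touch another slab's page struct */
};

/*
 * Every CPU allocates allocs_per_cpu objects from its own per-cpu slab,
 * then a single CPU (free_cpu, e.g. the one the IRQ is bound to) frees
 * all of them. Each free is classified one object at a time.
 */
static struct sketch_free_stats sketch_simulate(unsigned int ncpus,
						unsigned long allocs_per_cpu,
						unsigned int free_cpu)
{
	struct sketch_free_stats stats = { 0, 0 };
	unsigned int cpu;
	unsigned long i;

	assert(free_cpu < ncpus);

	for (cpu = 0; cpu < ncpus; cpu++) {
		for (i = 0; i < allocs_per_cpu; i++) {
			if (cpu == free_cpu)
				stats.fast++;
			else
				stats.slow++;
		}
	}
	return stats;
}
```

With 8 CPUs allocating 1000 objects each and CPU 0 doing all the frees, the
model classifies 7000 of the 8000 frees as slow, which is the kind of
workload where __slab_free would be expected to rank high in a profile.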
-

From: Chuck Ebbert
Date: Thursday, October 4, 2007 - 2:02 pm

How do you simulate reading 100TB of data spread across 3000 disks,
selecting 10% of it using some criterion, then sorting and summarizing
the result?
-

From: David Miller
Date: Thursday, October 4, 2007 - 2:11 pm

From: Chuck Ebbert <cebbert@redhat.com>

You repeatedly read zeros from a smaller disk into the same amount of
memory, and sort that as if it were real data instead.

You're not thinking outside of the box, and you need to do that to
write good test cases and fix kernel bugs effectively.
-

From: Chuck Ebbert
Date: Thursday, October 4, 2007 - 2:47 pm

You've just replaced 3000 concurrent streams of data with a single
stream.  That won't test the memory allocator's ability to allocate
memory to many concurrent users very well.
-

From: David Miller
Date: Thursday, October 4, 2007 - 3:07 pm

From: Chuck Ebbert <cebbert@redhat.com>

You've kindly removed my "thinking outside of the box" comment.

The point was not that my specific suggestion would be
perfect, but that if you used your creativity and thought
in similar directions you might find a way to do it.

People are too narrow minded when it comes to these things, and
that's the problem I want to address.
-

From: David Chinner
Date: Thursday, October 4, 2007 - 3:23 pm

And it's a good point, too, because often problems to one person are a
no-brainer to someone else.

Creating lots of "fake" disks is trivial to do, IMO.  Use loopback on sparse
files containing sparse files, use ramdisks containing sparse files or write
a sparse dm target for sparse block device mapping, etc. I'm sure there's
more than the few I just threw out...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-

From: Jens Axboe
Date: Thursday, October 4, 2007 - 11:48 pm

Or use scsi_debug to fake drives/controllers, works wonderfully as well
for some things and involves the full IO stack.

I'd like to second David's emails here, this is a serious problem. Having
a reproducible test case lowers the barrier for getting the problem
fixed by orders of magnitude. It's the difference between the problem
getting fixed in a day or two and it potentially lingering for months,
because email ping-pong takes forever and "the test team has moved on to
other tests, we'll let you know the results of test foo in 3 weeks time
when we have a new slot on the box" just removes any developer
motivation to work on the issue.

-- 
Jens Axboe

-

From: Pekka Enberg
Date: Friday, October 5, 2007 - 2:19 am

Hi,


What I don't understand is why the people who _have_ access
to the test case don't fix the problem. Unlike slab, slub is not a pile of
crap that only Christoph can hack on...

                                   Pekka
-

From: Jens Axboe
Date: Friday, October 5, 2007 - 2:28 am

Often the people testing are only doing just that, testing. So they
kindly offer to test any patches and so on, which usually takes forever
because of the above limitations in response time, machine availability,
etc.

Writing a small test module to exercise slub/slab in various ways
(allocating from all cpus freeing from one, as described) should not be
too hard. Perhaps that would be enough to find this performance
discrepancy between slab and slub?

-- 
Jens Axboe

-

From: Andi Kleen
Date: Friday, October 5, 2007 - 4:12 am

You could simulate that by just sending packets using unix sockets 
between threads bound to different CPUs. Sending a packet allocates; receiving 
deallocates.

But it's not clear that will really simulate the cache bounce environment
of the database test. I don't think all passing of data between CPUs 
using slub objects is slow.

-Andi
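
A minimal userspace sketch of the unix socket idea above, not a validated
reproducer. It assumes Linux, uses two processes created with fork() in
place of threads, and picks SOCK_SEQPACKET and a 64-byte payload
arbitrarily. The names pin_to_cpu() and bounce_packets() are made up for
illustration. Following the description above, each send is expected to
allocate on the sender's CPU and each receive to free on the receiver's CPU.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for sched_setaffinity() and CPU_SET() */
#endif
#include <assert.h>
#include <sched.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Best effort: pin the calling process to one CPU; errors are ignored. */
static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	(void)sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Send `count` packets from a process pinned to send_cpu to a child
 * pinned to recv_cpu. Returns the number of packets the child received,
 * or -1 on error.
 */
long bounce_packets(int send_cpu, int recv_cpu, long count)
{
	int sv[2], report[2];
	char buf[64];
	long i, received = 0;
	pid_t pid;

	assert(count >= 0);
	if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
		return -1;
	if (pipe(report) < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(sv[0]);
		close(sv[1]);
		close(report[0]);
		close(report[1]);
		return -1;
	}
	if (pid == 0) {
		/* Receiver: every recv() consumes one queued packet. */
		close(sv[0]);
		close(report[0]);
		pin_to_cpu(recv_cpu);
		while (received < count &&
		       recv(sv[1], buf, sizeof(buf), 0) > 0)
			received++;
		if (write(report[1], &received, sizeof(received)) !=
		    (ssize_t)sizeof(received))
			_exit(1);
		_exit(0);
	}

	/* Sender: every send() queues one packet for the receiver. */
	close(sv[1]);
	close(report[1]);
	pin_to_cpu(send_cpu);
	memset(buf, 0, sizeof(buf));
	for (i = 0; i < count; i++)
		if (send(sv[0], buf, sizeof(buf), MSG_NOSIGNAL) < 0)
			break;
	close(sv[0]);

	/* Collect the receiver's count and reap the child. */
	if (read(report[0], &received, sizeof(received)) !=
	    (ssize_t)sizeof(received))
		received = -1;
	close(report[0]);
	waitpid(pid, NULL, 0);
	return received;
}
```

Timing a call such as bounce_packets(0, 1, 1000000) under both allocators
would give a first comparison. Whether that reproduces the cache bounce
behaviour of the database test remains open, as said above.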
-

From: Jens Axboe
Date: Friday, October 5, 2007 - 5:39 am

It might not, it might. The point is trying to isolate the problem and
making a simple test case that could be used to reproduce it, so that
Christoph (or someone else) can easily fix it.

-- 
Jens Axboe

-

From: Christoph Lameter
Date: Friday, October 5, 2007 - 12:31 pm

In case there is someone who wants to hack on it: Here is what I got so 
far for batching the frees. I will try to come up with a test next week if 
nothing else happens before:

Patch 1/2 on top of mm:

SLUB: Keep counter of remaining objects on the per cpu list

Add a counter to keep track of how many objects are on the per cpu list.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |    1 +
 mm/slub.c                |    8 ++++++--
 2 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc8-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc8-mm2.orig/include/linux/slub_def.h	2007-10-04 22:41:58.000000000 -0700
+++ linux-2.6.23-rc8-mm2/include/linux/slub_def.h	2007-10-04 22:42:08.000000000 -0700
@@ -15,6 +15,7 @@ struct kmem_cache_cpu {
 	void **freelist;
 	struct page *page;
 	int node;
+	int remaining;
 	unsigned int offset;
 	unsigned int objsize;
 };
Index: linux-2.6.23-rc8-mm2/mm/slub.c
===================================================================
--- linux-2.6.23-rc8-mm2.orig/mm/slub.c	2007-10-04 22:41:58.000000000 -0700
+++ linux-2.6.23-rc8-mm2/mm/slub.c	2007-10-04 22:42:08.000000000 -0700
@@ -1386,12 +1386,13 @@ static void deactivate_slab(struct kmem_
 	 * because both freelists are empty. So this is unlikely
 	 * to occur.
 	 */
-	while (unlikely(c->freelist)) {
+	while (unlikely(c->remaining)) {
 		void **object;
 
 		/* Retrieve object from cpu_freelist */
 		object = c->freelist;
 		c->freelist = c->freelist[c->offset];
+		c->remaining--;
 
 		/* And put onto the regular freelist */
 		object[c->offset] = page->freelist;
@@ -1491,6 +1492,7 @@ load_freelist:
 
 	object = c->page->freelist;
 	c->freelist = object[c->offset];
+	c->remaining = s->objects - c->page->inuse - 1;
 	c->page->inuse = s->objects;
 	c->page->freelist = NULL;
 	c->node = page_to_nid(c->page);
@@ -1574,13 +1576,14 @@ static void __always_inline ...
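
The bookkeeping in the two visible hunks can be modelled in userspace. This
is a sketch under assumptions, not the kernel code: page_model, cpu_model,
NR_OBJECTS and demo_remaining() are invented names, and the fast path
decrement in fast_alloc() is a guess at what the truncated hunk does.

```c
#include <assert.h>
#include <stddef.h>

#define NR_OBJECTS 8	/* objects per slab in this model */
#define OBJ_WORDS  4	/* each object is OBJ_WORDS pointers wide */

struct page_model {
	void **freelist;	/* free objects owned by the page */
	int inuse;		/* objects not on the page freelist */
};

struct cpu_model {
	void **freelist;	/* per cpu list of free objects */
	struct page_model *page;
	int remaining;		/* number of objects on the per cpu list */
	unsigned int offset;	/* word index of the free pointer */
};

/* Slow path: take over the page freelist, as in the load_freelist hunk. */
static void **load_freelist(struct cpu_model *c, struct page_model *page)
{
	void **object = page->freelist;

	c->page = page;
	c->freelist = object[c->offset];
	c->remaining = NR_OBJECTS - page->inuse - 1;
	page->inuse = NR_OBJECTS;
	page->freelist = NULL;
	return object;
}

/* Fast path (assumed): pop one object and keep the counter in sync. */
static void **fast_alloc(struct cpu_model *c)
{
	void **object = c->freelist;

	assert(c->remaining > 0);
	c->freelist = object[c->offset];
	c->remaining--;
	return object;
}

/* Give the per cpu objects back, as in the deactivate_slab hunk. */
static void deactivate(struct cpu_model *c)
{
	struct page_model *page = c->page;

	while (c->remaining) {
		void **object = c->freelist;

		c->freelist = object[c->offset];
		c->remaining--;
		object[c->offset] = page->freelist;
		page->freelist = object;
		page->inuse--;
	}
	c->page = NULL;
}

/* Load a fresh slab, allocate two more objects, then deactivate. */
int demo_remaining(void)
{
	static void *objs[NR_OBJECTS][OBJ_WORDS];
	struct page_model page = { NULL, 0 };
	struct cpu_model c = { NULL, NULL, 0, 1 };
	int i;

	/* Link all objects onto the page freelist. */
	for (i = NR_OBJECTS - 1; i >= 0; i--) {
		objs[i][c.offset] = page.freelist;
		page.freelist = objs[i];
	}
	load_freelist(&c, &page);	/* hands out 1, remaining == 7 */
	fast_alloc(&c);
	fast_alloc(&c);			/* remaining == 5 */
	assert(c.remaining == 5);
	deactivate(&c);			/* 5 objects go back to the page */
	assert(c.remaining == 0 && c.freelist == NULL);
	return page.inuse;		/* 3 objects are still allocated */
}
```

In this model the counter and the list length always agree, so testing
c->remaining in the drain loop is equivalent to testing c->freelist.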
From: Christoph Lameter
Date: Friday, October 5, 2007 - 12:32 pm

Patch 2/2


SLUB: Allow foreign objects on the per cpu object lists.

In order to free objects we need to touch the page struct of the page that the
object belongs to. If this occurs too frequently then we could generate a bouncing
cacheline.

We do not want that to occur too frequently. We can avoid the page struct touching
for per cpu objects. Now we extend that to allow a limited number of objects that are
not part of the cpu slab. Allow up to 4 times the objects that fit into a page
in the per cpu list.

If the objects are allocated before we need to free them then we have saved touching
a page struct twice. The objects are presumably cache hot, so it is performance wise
good to recycle these locally.

Foreign objects are drained before deactivating cpu slabs and if too many objects
accumulate.

For kmem_cache_free() this also has the beneficial effect of getting virt_to_page()
operations eliminated or grouped together which may help reduce the cache footprint
and increase the speed of virt_to_page() lookups (they hopefully all come from the
same pages).

For kfree() we may have to do virt_to_page() in the worst case twice. Once grouped
together.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |    1 
 mm/slub.c                |   82 ++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 68 insertions(+), 15 deletions(-)

Index: linux-2.6.23-rc8-mm2/include/linux/slub_def.h
===================================================================
--- linux-2.6.23-rc8-mm2.orig/include/linux/slub_def.h	2007-10-04 22:42:08.000000000 -0700
+++ linux-2.6.23-rc8-mm2/include/linux/slub_def.h	2007-10-04 22:43:19.000000000 -0700
@@ -16,6 +16,7 @@ struct kmem_cache_cpu {
 	struct page *page;
 	int node;
 	int remaining;
+	int drain_limit;
 	unsigned int offset;
 	unsigned int objsize;
 };
Index: linux-2.6.23-rc8-mm2/mm/slub.c
===================================================================
--- ...
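
Since the mm/slub.c part of this patch is cut off here, the following is
only a userspace model of the behaviour the description states: frees go
onto the per cpu list regardless of the owning page, and the list is drained
once it holds more than 4 times the objects that fit into a page. All names
(page_model, obj, cpu_model, model_free() and so on) are invented, and the
page pointer stored in each object stands in for virt_to_page().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 64

struct obj;

struct page_model {
	struct obj *freelist;
	int inuse;
	int touches;	/* how often a free had to modify this page */
};

struct obj {
	struct obj *next;
	struct page_model *page;	/* owning page of the object */
};

struct cpu_model {
	struct obj *freelist;	/* may hold objects from any page */
	int remaining;
	int drain_limit;
};

/* Return every object on the per cpu list to its owning page. */
static void drain(struct cpu_model *c)
{
	while (c->remaining) {
		struct obj *o = c->freelist;

		c->freelist = o->next;
		c->remaining--;
		o->next = o->page->freelist;
		o->page->freelist = o;
		o->page->inuse--;
		o->page->touches++;
	}
}

/* Free without touching the page unless the list grew too long. */
static void model_free(struct cpu_model *c, struct obj *o)
{
	o->next = c->freelist;
	c->freelist = o;
	c->remaining++;
	if (c->remaining > c->drain_limit)
		drain(c);
}

/* Allocate from the per cpu list; NULL means "take the slow path". */
static struct obj *model_alloc(struct cpu_model *c)
{
	struct obj *o = c->freelist;

	if (o) {
		c->freelist = o->next;
		c->remaining--;
	}
	return o;
}

/*
 * Free nfrees objects and return how many page updates that caused.
 * With recycle set, each freed object is reallocated right away.
 */
int demo_foreign_frees(int nfrees, int objects_per_page, int recycle)
{
	static struct page_model pages[MAX_OBJS];
	static struct obj objs[MAX_OBJS];
	struct cpu_model c = { NULL, 0, 4 * objects_per_page };
	int i, touches = 0;

	assert(nfrees <= MAX_OBJS && objects_per_page > 0);
	for (i = 0; i < MAX_OBJS; i++) {
		pages[i].freelist = NULL;
		pages[i].inuse = 0;
		pages[i].touches = 0;
	}
	for (i = 0; i < MAX_OBJS; i++) {
		objs[i].next = NULL;
		objs[i].page = &pages[i / objects_per_page];
		objs[i].page->inuse++;
	}
	for (i = 0; i < nfrees; i++) {
		model_free(&c, &objs[i]);
		if (recycle)
			model_alloc(&c);
	}
	for (i = 0; i < MAX_OBJS; i++)
		touches += pages[i].touches;
	return touches;
}
```

With 4 objects per page the limit is 16: sixteen frees cause no page updates
in this model, the seventeenth drains all seventeen at once, and a
free/alloc ping-pong never touches a page at all.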
From: Matthew Wilcox
Date: Friday, October 5, 2007 - 4:56 am

I vaguely remembered something called orasim, so I went looking for it.
I found http://oss.oracle.com/~wcoekaer/orasim/ which is dated from
2004, and I found http://oss.oracle.com/projects/orasimjobfiles/ which
seems to be a stillborn project.  Is there anything else I should know
about orasim?  ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-

From: Jens Axboe
Date: Friday, October 5, 2007 - 5:37 am

I don't know much about orasim, except that internally we're trying to
use fio for that instead. As far as I know, it was a project that was
never feature complete (or completed altogether, for that matter).

-- 
Jens Axboe

-

From: Christoph Lameter
Date: Friday, October 5, 2007 - 12:27 pm

Too bad. If this would work then I would have a load to work against. I 
have a patch here that may address the issue for SMP (no NUMA for now) by 
batching all frees on the per cpu freelist and then dumping them in 
groups. But it is likely not too wise to have you run your weeklong 
tests on this one. Needs some more care first.



-

From: Peter Zijlstra
Date: Friday, October 5, 2007 - 1:32 pm

Focus on the slab allocator usage, instrument it, record a trace,
generate a statistical model that matches, and write a small
program/kernel module that has the same allocation pattern. Then verify
this statistical workload still shows the same performance difference.

Easy: no
Doable: yes
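
As a rough illustration of the first steps, a recorded trace could be
reduced to a few numbers that a synthetic load then has to match. The trace
format below (trace_event, trace_summary, summarize_trace()) is entirely
hypothetical, no such instrumentation is described in this thread, and the
sample trace in demo_remote_frees() is made up.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJECTS 1024

enum trace_op { TRACE_ALLOC, TRACE_FREE };

struct trace_event {
	enum trace_op op;
	int cpu;	/* CPU that executed the operation */
	int object;	/* object id, 0 <= object < MAX_OBJECTS */
};

struct trace_summary {
	long allocs;
	long frees;
	long remote_frees;	/* freed on a CPU other than the allocator */
};

/* Walk the trace once and count local versus remote frees. */
struct trace_summary summarize_trace(const struct trace_event *ev, size_t n)
{
	int alloc_cpu[MAX_OBJECTS];
	struct trace_summary s = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < MAX_OBJECTS; i++)
		alloc_cpu[i] = -1;	/* -1: not currently allocated */

	for (i = 0; i < n; i++) {
		int id = ev[i].object;

		assert(id >= 0 && id < MAX_OBJECTS);
		if (ev[i].op == TRACE_ALLOC) {
			alloc_cpu[id] = ev[i].cpu;
			s.allocs++;
		} else if (alloc_cpu[id] >= 0) {
			s.frees++;
			if (alloc_cpu[id] != ev[i].cpu)
				s.remote_frees++;
			alloc_cpu[id] = -1;
		}
	}
	return s;
}

/* Made-up trace: three CPUs allocate, CPU 0 frees everything. */
long demo_remote_frees(void)
{
	static const struct trace_event trace[] = {
		{ TRACE_ALLOC, 0, 1 },
		{ TRACE_ALLOC, 1, 2 },
		{ TRACE_ALLOC, 2, 3 },
		{ TRACE_FREE,  0, 1 },
		{ TRACE_FREE,  0, 2 },
		{ TRACE_FREE,  0, 3 },
	};

	return summarize_trace(trace,
			       sizeof(trace) / sizeof(trace[0])).remote_frees;
}
```

A real model would also need object sizes, timing and per-cache breakdowns;
the remote free ratio is just the one statistic this thread keeps coming
back to.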



-

From: David Miller
Date: Friday, October 5, 2007 - 2:31 pm

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

The other important bit is likely to generate a lot of DMA traffic
such that the L2 cache bandwidth is getting used on the bus
side by the PCI controller doing invalidations of both dirty
and clean L2 cache lines as devices DMA to/from them.

This will also be exercising the memory controller, further contending
with the cpu when SLAB touches cold data structures.
-

From: Arjan van de Ven
Date: Thursday, October 4, 2007 - 10:50 am

On Thu, 4 Oct 2007 10:38:15 -0700 (PDT)


Ok every time someone says anything not 100% positive about SLUB you
come back with "but it's fixed in the next patch set"... *every time*.

To be honest, to me that sounds like SLUB isn't ready for prime time
yet, or at least not ready to be the only one in town...

The day that the answer is "the kernel.org slub is fixing all the
issues" is when it's ready..
-

From: Christoph Lameter
Date: Thursday, October 4, 2007 - 10:58 am

All I ask is that people test the fixes that have been out there for the 
known issues. If there are remaining performance issues then let's figure 
them out and address them.
-

From: Peter Zijlstra
Date: Thursday, October 4, 2007 - 11:26 am

Arjan, to be honest, there has been some confusion on _what_ code has
been tested with what results. And with Christoph not able to reproduce
these results locally, it is very hard for him to fix it properly.



-

From: David Miller
Date: Thursday, October 4, 2007 - 1:48 pm

From: Arjan van de Ven <arjan@infradead.org>

I think this is partly Christoph subconsciously venting his
frustration that he's never given a reproducible test case he can use
to fix the problem.

There comes a point where it is the reporter's responsibility to help
the developer come up with a publishable test case the developer can
use to work on fixing the problem and help ensure it stays fixed.

Using an unpublishable benchmark, whose results cannot even be
published, really stretches the limits of "reasonable" don't you
think?

This "SLUB isn't ready yet" bullshit is just a shaman's dance which
distracts attention away from the real problem, which is that a
reproducible, publishable test case is not being provided to the
developer so he can work on fixing the problem.

I can tell you this thing would be fixed overnight if a proper test
case had been provided by now.
-

From: Matthew Wilcox
Date: Thursday, October 4, 2007 - 1:58 pm

That's a lot of effort.  Is it more effort than doing some remote

Yet here we stand.  Christoph is aggressively trying to get slab removed
from the tree.  There is a testcase which shows slub performing worse
than slab.  It's not my fault I can't publish it.  And just because I
can't publish it doesn't mean it doesn't exist.

Slab needs to not get removed until slub is as good a performer on this
benchmark.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-

From: David Miller
Date: Thursday, October 4, 2007 - 2:05 pm

From: Matthew Wilcox <matthew@wil.cx>

That's a good question and an excellent point.  I'm sure that,
either way, Christoph will be more than willing to engage and
assist.
-

From: Christoph Lameter
Date: Thursday, October 4, 2007 - 2:11 pm

I agree with this .... SLAB will stay until we have worked through all the 
performance issues.

-

From: David Schwartz
Date: Thursday, October 4, 2007 - 4:39 pm

I would just like to echo what you said just a bit angrier. This is the same
as someone asking him to fix a bug that they can only see with a binary-only
kernel module. I think he's perfectly justified in simply responding "the
bug is as likely to be in your code as mine".

Now, just because he's justified in doing that doesn't mean he should. I
presume he has an honest desire to improve his own code and if they've found
a real problem, I'm sure he'd love to fix it.

But this is just a preposterous position to put him in. If there's no
reproducible test case, then why should he care that one program he can't
even see works badly? If you care, you fix it.


It means it may or may not exist. All we have is your word that slub is the
problem. If I said I found a bug in the Linux kernel that caused it to panic
but I could only reproduce it with the nVidia driver, I'd be laughed at.

It may even be that slub is better, your benchmark simply interprets this as
worse. Without the details of your benchmark, we can't know. For example,
I've seen benchmarks that (usually unintentionally) actually do a *variable*
amount of work and details of the implementation may result in the benchmark
actually doing *more* work, so it taking longer does not mean it ran slower.

DS


-

From: Chuck Ebbert
Date: Thursday, October 4, 2007 - 4:49 pm

People have been trying for years to make reproducible test cases
for huge and complex workloads. It doesn't work. The tests that do
work take weeks to run and need to be carefully validated before
they can be officially released. The open source community can and
should be working on similar tests, but they will never be simple.
-

From: David Schwartz
Date: Thursday, October 4, 2007 - 9:18 pm

That's true, but irrelevant. Either the test can identify a problem that
applies generally, or it's doing nothing but measuring how good the system
is at doing the test. If the former, it should be possible to create a
simple test case once you know from the complex test where the problem is.
If the latter, who cares about a supposed regression?

It should be possible to identify exactly what portion of the test shows the
regression the most and exactly what the system is doing during that moment.
The test may be great at finding regressions, but once it finds them, they
should be forever *found*.

Did you follow the recent incident when iperf found what seemed to be a
significant CFS networking regression? The only way to identify that it was
a quirk in what iperf was doing was by looking at exactly what iperf was
doing. The only efficient way was to look at iperf's source and see that
iperf's weird yielding meant it didn't replicate typical use cases like it
was supposed to.

DS


-
