OK, I have these numbers to show I'm not completely off my rocker to suggest
we merge SLQB :) Given these results, how about I ask to merge SLQB as default
in linux-next, then if nothing catastrophic happens, merge it upstream in the
next merge window, then a couple of releases after that, given some time to
test and tweak SLQB, then we plan to bite the bullet and emerge with just one
main slab allocator (plus SLOB).

System is a 2-socket, 4-core AMD. All debug and stats options turned off for
all the allocators; default parameters (ie. SLUB using higher order pages,
and the others tend to be using order-0). SLQB is the version I recently
posted, with some of the prefetching removed according to Pekka's review
(probably a good idea to only add things like that in if/when they prove to
be an improvement).

time fio examples/netio (10 runs, lower better):
SLAB AVG=13.19 STD=0.40
SLQB AVG=13.78 STD=0.24
SLUB AVG=14.47 STD=0.23

SLAB makes a good showing here. The allocation/freeing pattern seems to be
very regular and easy (fast allocs and frees). So it could be some "lucky"
caching behaviour, I'm not exactly sure. I'll have to run more tests and
profiles here.

hackbench (10 runs, lower better):

1 GROUP
SLAB AVG=1.34 STD=0.05
SLQB AVG=1.31 STD=0.06
SLUB AVG=1.46 STD=0.07

2 GROUPS
SLAB AVG=1.20 STD=0.09
SLQB AVG=1.22 STD=0.12
SLUB AVG=1.21 STD=0.06

4 GROUPS
SLAB AVG=0.84 STD=0.05
SLQB AVG=0.81 STD=0.10
SLUB AVG=0.98 STD=0.07

8 GROUPS
SLAB AVG=0.79 STD=0.10
SLQB AVG=0.76 STD=0.15
SLUB AVG=0.89 STD=0.08

16 GROUPS
SLAB AVG=0.78 STD=0.08
SLQB AVG=0.79 STD=0.10
SLUB AVG=0.86 STD=0.05

32 GROUPS
SLAB AVG=0.86 STD=0.05
SLQB AVG=0.78 STD=0.06
SLUB AVG=0.88 STD=0.06

64 GROUPS
SLAB AVG=1.03 STD=0.05
SLQB AVG=0.90 STD=0.04
SLUB AVG=1.05 STD=0.06

128 GROUPS
SLAB AVG=1.31 STD=0.19
SLQB AVG=1.16 STD=0.36
SLUB AVG=1.29 STD=0.11

SLQB tends to be the winner here. SLAB is close at lower numbers of
groups, but drops behind a bit more as they increase.

t...
I'm guessing, but then are these Mbit/s figures? Would that be the sending
throughput or the receiving throughput?

I love to see netperf used, but why UDP and loopback? Also, how about the
service demands?

rick jones
--
You're right ;)
But at least it is exercising the NUMA paths in the allocator, and
represents a pretty common size of system...

I can run some tests on bigger systems at SUSE, but it is not always
easy to set up "real" meaningful workloads on them or configure

Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair
No really good reason. I guess I was hoping to keep other variables as
small as possible. But I guess a real remote test would be a lot more
realistic as a networking test. Hmm, but I could probably set up a test

Well, over loopback and using CPU binding, I was hoping it wouldn't
change much... but I see netperf does some measurements for you. I
will consider those in future too.

BTW. is it possible to do parallel netperf tests?
--
Not sure if I know enough git to pull your trees, or if this cobbler's child will
have much in the way of bigger systems, but there is a chance I might - contact

Mega *bits* per second? And those were 4K sends right? That seems rather low
for loopback - I would have expected nearly two orders of magnitude more. I
wonder if the intra-stack flow control kicked-in? You might try adding test
specific -S and -s options to set much larger socket buffers to try to avoid
that. Or simply use TCP.

If bandwidth is an issue, that is to say one saturates the link before much of
anything "interesting" happens in the host you can use something like aggregate
TCP_RR - ./configure with --enable_burst and then something like

netperf -H <remote> -t TCP_RR -- -D -b 32

and it will have as many as 33 discrete transactions in flight at one time on the
one connection. The -D is there to set TCP_NODELAY to preclude TCP chunking the
single-byte (default, take your pick of a more reasonable size) transactions into

Yes, by (ab)using the confidence intervals code. Poke around in
http://www.netperf.org/svn/netperf2/doc/netperf.html in the "Aggregates" section,
and I can go into further details offline (or here if folks want to see the
discussion).

rick jones
--
Can you think of anything with which it will be the loser?
--
Here are some more performance numbers with "slub_test" kernel module.
It's basically a really tiny microbenchmark, so I don't really consider
it gives too useful results, except it does show up some problems in
SLAB's scalability that may start to bite as we continue to get more
threads per socket.

(I ran a few of these tests on one of Dave's 2 socket, 128 thread
systems, and slab gets really painful... these kinds of thread counts
may only be a couple of years away from x86).

All numbers are in CPU cycles.
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate 10000 objs then free them
obj size     SLAB        SLQB        SLUB
8           77+ 128     69+  47     61+  77
16          69+ 104    116+  70     77+  80
32          66+ 101     82+  81     71+  89
64          82+ 116     95+  81     94+ 105
128        100+ 148    106+  94    114+ 163
256        153+ 136    134+  98    124+ 186
512        209+ 161    170+ 186    134+ 276
1024       331+ 249    236+ 245    134+ 283
2048       608+ 443    380+ 386    172+ 312
4096      1109+ 624    678+ 661    239+ 372
8192      1166+1077    767+ 683    535+ 433
16384     1213+1160    914+ 731    577+ 682

We can see SLAB has a fair bit more overhead in this case. SLUB starts
doing higher order allocations I think around size 256, which reduces
costs there. Don't know what the SLQB artifact at 16 is caused by...

2. Kmalloc: alloc/free test (repeatedly allocate and free)
obj size  SLAB  SLQB  SLUB
8           98    90    94
16          98    90    93
32          98    90    93
64          99    90    94
128        100    92    93
256        104    93    95
512        105    94    97
1024       106    93    97
2048       107    95    95
4096       111    92    97
8192       111    94   631
16384      114    92   741

Here we see SLUB's allocator passthrough (or is it the lack of queueing?).
Straight line speed at small sizes is probably due to instructions in the
fastpaths. It's pretty meaningless though because it probably changes if
there is any actual load on the CPU, or another CPU architecture. Doesn't
look bad fo...
Well, that fio test showed it was behind SLAB. I just discovered that
yesterday during running these tests, so I'll take a look at that. The
Intel performance guys I think have one or two cases where it is slower.
They don't seem to be too serious, and tend to be specific to some
machines (eg. the same test with a different CPU architecture turns out
to be faster). So I'll be looking into these things, but I haven't seen
anything too serious yet. I'm mostly interested in macro benchmarks and
more real world workloads.

At a higher level, SLAB has some interesting features. It has
"crossbars" of queues that provide queues for allocating and
freeing to and from different CPUs and nodes. This is what bloats up
the kmem_cache data structures to tens or hundreds of gigabytes each
on SGI size systems. But it also has good properties. On smaller
multiprocessor and NUMA systems, it might be the case that SLAB does
better in workloads that involve objects being allocated on one CPU and
freed on another. I haven't actually observed problems here, but I don't
have a lot of good tests.

SLAB is also fundamentally different from SLUB and SLQB in that it uses
arrays to store pointers to objects in its queues, rather than having
a linked list using pointers embedded in the objects. This might in some
cases make it easier to prefetch objects in parallel with finding the
object itself. I haven't actually been able to attribute a particular
regression to this interesting difference, but it might turn up as an
issue.

These are two big differences between SLAB and SLQB.
The linked lists of objects were used in favour of arrays again because of
the memory overhead, and to have a better ability to tune the size of the
queues, and reduced overhead in copying around arrays of pointers (SLQB can
just copy the head of one list to the tail of another in order to move
objects around), and eliminated the need to have additional metadata beyond
the struct page for each slab....
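
To illustrate the point about moving objects by relinking list ends, here is a minimal userspace sketch. It is not SLQB's code: the names free_obj, obj_list, list_push, list_pop and list_splice_tail are made up for this example. The link pointer lives inside each free object, so a whole queue can be appended to another in constant time, with no array of pointers to copy.

```c
#include <stddef.h>

/* A free object stores the link to the next free object in its own memory. */
struct free_obj {
	struct free_obj *next;
};

/* Hypothetical queue of free objects: head, tail and a count. */
struct obj_list {
	struct free_obj *head;
	struct free_obj *tail;
	size_t nr;
};

/* Put a freed object at the head of the list. */
static void list_push(struct obj_list *l, void *obj)
{
	struct free_obj *o = obj;

	o->next = l->head;
	if (!l->head)
		l->tail = o;
	l->head = o;
	l->nr++;
}

/* Take one object from the head, or NULL if the list is empty. */
static void *list_pop(struct obj_list *l)
{
	struct free_obj *o = l->head;

	if (!o)
		return NULL;
	l->head = o->next;
	if (!l->head)
		l->tail = NULL;
	l->nr--;
	return o;
}

/* Append all of src to the tail of dst in O(1); src ends up empty. */
static void list_splice_tail(struct obj_list *dst, struct obj_list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	dst->nr += src->nr;
	src->head = src->tail = NULL;
	src->nr = 0;
}
```

The cost of the splice does not depend on how many objects are queued, which is the property the paragraph above describes. An array-based queue would have to copy one pointer per object instead.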
I think I can speak with some measure of confidence for at least the
OLTP-testing part of my company when I say that I have no objection to
Nick's planned merge scheme.

I believe the kernel benchmark group have also done some testing with
SLQB and have generally positive things to say about it (Yanmin added to
the gargantuan cc).

Did slabtop get fixed to work with SLQB?
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
We did run lots of benchmarks with SLQB. Compared with SLUB, one highlight of
SLQB is netperf UDP-U-4k. On my x86-64 machines, if I start 1 client and 1 server
process and bind them to different physical cpus, the result of SLQB is about 20% better
than SLUB's. If I start CPU_NUM clients and the same number of servers without binding,
the result of SLQB is about 100% better than SLUB's. I think that's because SLQB
doesn't pass through big object allocation to page allocator.
netperf UDP-U-1k has less improvement with SLQB.

The results of other benchmarks have variations. They are good on some machines,
but bad on other machines. However, the variation is small. For example, hackbench's result
with SLQB is about 1 second slower than with SLUB on 8-core stoakley. After we worked with
Nick to make small code changes, SLQB's result is a little better than SLUB's
with hackbench on stoakley.

We consider other variations as fluctuation.
--
That sounds like just the page allocator needs to be improved.
That would help everyone. We talked a bit about this earlier,
some of the heuristics for hot/cold pages are quite outdated
and have been tuned for obsolete machines and also its fast path
is quite long. Unfortunately no code currently.

-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
Andi,
Thanks for your kind information. I did more investigation with SLUB
on netperf UDP-U-4k issue.

oprofile shows:
328058 30.1342 linux-2.6.29-rc2 copy_user_generic_string
134666 12.3699 linux-2.6.29-rc2 __free_pages_ok
125447 11.5231 linux-2.6.29-rc2 get_page_from_freelist
22611 2.0770 linux-2.6.29-rc2 __sk_mem_reclaim
21442 1.9696 linux-2.6.29-rc2 list_del
21187 1.9462 linux-2.6.29-rc2 __ip_route_output_key

So __free_pages_ok and get_page_from_freelist consume too much cpu time.
With SLQB, these 2 functions almost don't consume time.

Command 'slabinfo -AD' shows:
So kmem_cache :0000256 is very active.
Kernel stack dump in __free_pages_ok shows
[<ffffffff8027010f>] __free_pages_ok+0x109/0x2e0
[<ffffffff8024bb34>] autoremove_wake_function+0x0/0x2e
[<ffffffff8060f387>] __kfree_skb+0x9/0x6f
[<ffffffff8061204b>] skb_free_datagram+0xc/0x31
[<ffffffff8064b528>] udp_recvmsg+0x1e7/0x26f
[<ffffffff8060b509>] sock_common_recvmsg+0x30/0x45
[<ffffffff80609acd>] sock_recvmsg+0xd5/0xed

The callchain is:
__kfree_skb =>
kfree_skbmem =>
kmem_cache_free(skbuff_head_cache, skb);

kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
with :0000256. Their order is 1 which means every slab consists of 2 physical pages.

netperf UDP-U-4k is a UDP stream test. The client process keeps sending 4k-size packets
to the server process and the server process just receives the packets one by one.

If we start CPU_NUM clients and the same number of servers, every client will send lots
of packets within one sched slice, then process scheduler schedules the server to receive
many packets within one sched slice; then client resends again. So there are many packets
in the queue. When server receive the packets, it frees skbuff_head_cache. When the slab's
objects are all free, the slab will be released by calling __free_...
of 2 physical pages.
That order can be changed. Try specifying slub_max_order=0 on the kernel
command line to force an order 0 alloc.

The queues of the page allocator are of limited use due to their overhead.
Order-1 allocations can actually be 5% faster than order-0. order-0 makes
sense if pages are pushed rapidly to the page allocator and are then
reissued elsewhere. If there is a linear consumption then the page
... benefit from the page buffer.

That usually does not matter because of partial list avoiding page
SLUB has a percpu freelist but it's bounded by the basic allocation unit.
You can increase that by modifying the allocation order. Writing a 3 or 5
into the order value in /sys/kernel/slab/xxx/order would do the trick.
I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.

I checked my instrumentation in kernel and found it's caused by large object allocation/free
whose size is more than PAGE_SIZE. Here its order is 1.

The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
So this case isn't the issue that batch of allocation/free might erase partial page
functionality.

'#slabinfo -AD' couldn't show statistics of large object allocation/free. Can we add
such info? That will be more helpful.
--
So is this the kfree(skb->head) in skb_release_data() or the put_page()
calls in the same function in a loop?

If it's the former, with big enough size passed to __alloc_skb(), the
networking code might be taking a hit from the SLUB page allocator
pass-through.

Pekka
--
Do we know what kind of size is being passed to __alloc_skb() in this
case? Maybe we want to do something like this.

Pekka

SLUB: revert page allocator pass-through
This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
direct pass through of page size or higher kmalloc requests").
---
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..3bd3662 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -124,7 +124,7 @@ struct kmem_cache {
* We keep the general caches in an array of slab caches that are used for
* 2^x bytes of allocations.
*/
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];

/*
* Sorry that the following has to be that ugly but some versions of GCC
@@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size)
if (!size)
return 0;

+ if (size > KMALLOC_MAX_SIZE)
+ return -1;
+
if (size <= KMALLOC_MIN_SIZE)
return KMALLOC_SHIFT_LOW;

@@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size)
if (size <= 1024) return 10;
if (size <= 2 * 1024) return 11;
if (size <= 4 * 1024) return 12;
-/*
- * The following is only needed to support architectures with a larger page
- * size than 4k.
- */
if (size <= 8 * 1024) return 13;
if (size <= 16 * 1024) return 14;
if (size <= 32 * 1024) return 15;
@@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size)
if (size <= 512 * 1024) return 19;
if (size <= 1024 * 1024) return 20;
if (size <= 2 * 1024 * 1024) return 21;
+ if (size <= 4 * 1024 * 1024) return 22;
+ if (size <= 8 * 1024 * 1024) return 23;
+ if (size <= 16 * 1024 * 1024) return 24;
+ if (size <= 32 * 1024 * 1024) return 25;
return -1;

/*
@@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
if (index == 0)
return NULL...
In function __alloc_skb, original parameter size=4155,
SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
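
The figures above can be checked with a little arithmetic. This is only a sketch under stated assumptions: the 128-byte alignment is inferred from 4155 rounding up to 4224, PAGE_SIZE is taken to be 4096, and the helper names (align_up, skb_alloc_bytes, order_for) are hypothetical, not kernel functions.

```c
#include <stddef.h>

/* Round n up to a multiple of a (a must be a power of two). */
static size_t align_up(size_t n, size_t a)
{
	return (n + a - 1) & ~(a - 1);
}

/* Bytes requested for the skb data area: aligned payload plus the
 * shared-info trailer that follows it. */
static size_t skb_alloc_bytes(size_t size, size_t align, size_t shinfo_bytes)
{
	return align_up(size, align) + shinfo_bytes;
}

/* Smallest page order whose span (page_size << order) covers bytes. */
static int order_for(size_t bytes, size_t page_size)
{
	int order = 0;

	while ((page_size << order) < bytes)
		order++;
	return order;
}
```

With the numbers from this mail, skb_alloc_bytes(4155, 128, 472) comes to 4224 + 472 = 4696 bytes. That is more than one 4096-byte page, and order_for(4696, 4096) is 1, i.e. two pages per data buffer.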
This patch almost fixes the netperf UDP-U-4k issue.

#slabinfo -AD
So kmalloc-8192 appears. Without the patch, kmalloc-8192 hides.
kmalloc-8192's default order on my 8-core stoakley is 2.

1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
2) If I start 1 client and 1 server, and bind them to different physical cpu, SLQB's result
is about 10% better than SLUB's.
--
I'll have to look into this too. Could be evidence of the possible
TLB improvement from using bigger pages and/or page-specific freelist,
I suppose.

Do you have a script used to start netperf in that configuration?
--
See the attachment.
Steps to run testing:
1) compile netperf;
2) Change PROG_DIR to path/to/netperf/src;
3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.
> 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.
The -T option takes arguments of the form:
N - bind both netperf and netserver to core N
N, - bind only netperf to core N, float netserver
,M - float netperf, bind only netserver to core M
N,M - bind netperf to core N and netserver to core M

Without a comma between N and M knuth only knows what the command line parser
Same thing here for the -P option - there needs to be a comma between the two
port numbers otherwise, the best case is that the second port number is ignored.
Worst case is that netperf starts doing knuth only knows what.

To get quick profiles, that form of aggregate netperf is OK - just the one
iteration with background processes using a moderately long run time. However,
for result reporting, it is best to (ab)use the confidence intervals
functionality to try to avoid skew errors. I tend to add-in a global -i 30
option to get each netperf to repeat its measurements 30 times. That way one is
reasonably confident that skew issues are minimized.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf...
And I would probably add the -c and -C options to have netperf report service
The documented-only-in-source :( "omni" tests in top-of-trunk netperf:
http://www.netperf.org/svn/netperf2/trunk
./configure --enable-omni
allow one to specify which result values one wants, in which order, either as
more or less traditional netperf output (test-specific -O), CSV (test-specific
-o) or keyval (test-specific -k). All three take an optional filename as an
argument with the file containing a list of desired output values. You can give
a "filename" of '?' to get the list of output values known to that version of
netperf.

Might help simplify parsing and whatnot.
happy benchmarking,
--
Thanks. I wanted to run the testing to get result quickly as long as
I'm not sure if prior running might leave any impact on later running, so
Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need
Yes. That's good. I'm used to start vmstat or mpstat to monitor cpu utilization
--
Feel free to wander over to netperf-talk over at netperf.org if you want to talk
some more about the care and feeding of netperf.

happy benchmarking,
rick jones
--
On Fri, Jan 23, 2009 at 10:40 AM, Rick Jones <rick.jones2@hp.com> wrote:
For performance analysis, the service demand is often more interesting
than the absolute performance (which typically only varies a few Mb/s
for gigE NICs). I strongly encourage adding -c and -C.

grant
--
Christoph, should we merge my patch as-is or do you have an alternative
fix in mind? We could, of course, increase kmalloc() caches one level up

Maybe we can use the perfstat and/or kerneltop utilities of the new perf
counters patch to diagnose this:

http://lkml.org/lkml/2009/1/21/273
And do oprofile, of course. Thanks!
Pekka
--
I assume binding the client and the server to different physical CPUs
also means that the SKB is always allocated on CPU 1 and freed on CPU
2? If so, we will be taking the __slab_free() slow path all the time on
kfree() which will cause cache effects, no doubt.

But there's another potential performance hit we're taking because the
object size of the cache is so big. As allocations from CPU 1 keep
coming in, we need to allocate new pages and unfreeze the per-cpu page.
That in turn causes __slab_free() to be more eager to discard the slab
(see the PageSlubFrozen check there).

So before going for cache profiling, I'd really like to see an oprofile
report. I suspect we're still going to see much more page allocator
activity there than with SLAB or SLQB which is why we're still behaving
so badly here.

Pekka
--
That's bit surprising, actually. FWIW, I've included a patch for empty
Looking at __slab_free(), unless page->inuse is constantly zero and we
discard the slab, it really is just cache effects (10% sounds like a
lot, though!). AFAICT, the only way to optimize that is with Christoph's
unfinished pointer freelists patches or with a remote free list like in
SLQB.

Pekka
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3bd3662..41a4c1a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -48,6 +48,9 @@ struct kmem_cache_node {
unsigned long nr_partial;
unsigned long min_partial;
struct list_head partial;
+ unsigned long nr_empty;
+ unsigned long max_empty;
+ struct list_head empty;
#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 8fad23f..5a12597 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,11 @@
*/
#define MAX_PARTIAL 10

+/*
+ * Maximum number of empty slabs.
+ */
+#define MAX_EMPTY 1
+
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)

@@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
free_slab(s, page);
}

+static void discard_or_cache_slab(struct kmem_cache *s, struct page *page)
+{
+ struct kmem_cache_node *n;
+ int node;
+
+ node = page_to_nid(page);
+ n = get_node(s, node);
+
+ dec_slabs_node(s, node, page->objects);
+
+ if (likely(n->nr_empty >= n->max_empty)) {
+ free_slab(s, page);
+ } else {
+ n->nr_empty++;
+ list_add(&page->lru, &n->partial);
+ }
+}
+
/*
* Per slab locking using the pagelock
*/
@@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
}

/*
- * Lock slab and remove from the partial list.
+ * Lock slab and remove from the partial or empty list.
*
* Must hold list_lock.
*/
@@ -1261,7 +1284,6 @@ static inl...
No there is another way. Increase the allocator order to 3 for the
kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
larger chunks of data gotten from the page allocator. That will allow slub
to do fast allocs.
--
After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.

But when trying to increase it to 4, I got:
[root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
[root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
-bash: echo: write error: Invalid argument

Compared with SLQB, it seems SLUB needs too much investigation/manual fine-tuning
against specific benchmarks. One hard part is tuning the page order number. Although SLQB also
has many tuning options, I almost never tune it manually, just run benchmarks and
collect results to compare. Does that mean the scalability of SLQB is better?
--
This is because 4 is more than the maximum allowed order. You can
reconfigure that by setting

slub_max_order=5

or so on boot.
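
The "Invalid argument" error is what a simple range check on the written value would produce. A minimal sketch, assuming the sysfs store handler rejects any order outside [slub_min_order, slub_max_order]; the function name and signature here are hypothetical and not taken from the kernel source.

```c
#include <errno.h>

/*
 * Model of the range check: returns 0 if the requested order is
 * acceptable, -EINVAL otherwise. min_order and max_order stand in for
 * the slub_min_order and slub_max_order boot parameters.
 */
static int check_order_write(unsigned long order, unsigned long min_order,
			     unsigned long max_order)
{
	if (order > max_order || order < min_order)
		return -EINVAL;
	return 0;
}
```

With a maximum of 3, writing 3 passes and writing 4 fails with -EINVAL. With slub_max_order=5 on the command line, 4 would be accepted.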
--
With slub_max_order=5, the default order of kmalloc-8192 becomes
5. I tested it with netperf UDP-U-4k and the result difference from
SLAB/SLQB is less than 1% which is really fluctuation.
--
On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
Are you interested in doing that?

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
That's probably because max order is capped to 3. You can change that
by passing slub_max_order=<n> as kernel parameter.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
One thing is sure, SLUB seems to be hard to tune. Probably because
it's dependent on the page order so much.
--
Pekka,
Sorry for the late update.
The default order of kmalloc-8192 on 2*4 stoakley is really an issue of calculate_order.

slab_size   order   name
-------------------------------------------------
4096        3       sgpool-128
8192        2       kmalloc-8192
16384       3       kmalloc-16384

kmalloc-8192's default order is smaller than sgpool-128's.
On 4*4 tigerton machine, a similar issue appears on another kmem_cache.
Function calculate_order uses 'min_objects /= 2;' to shrink. Plus size calculation/checking
in slab_order, sometimes above issue appear.

Below patch against 2.6.29-rc2 fixes it.
I checked the default orders of all kmem_cache and they don't become smaller than before. So
the patch wouldn't hurt performance.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
---
diff -Nraup linux-2.6.29-rc2/mm/slub.c linux-2.6.29-rc2_slubcalc_order/mm/slub.c
--- linux-2.6.29-rc2/mm/slub.c 2009-02-11 00:49:48.000000000 -0500
+++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c 2009-02-12 00:08:24.000000000 -0500
@@ -1856,6 +1856,7 @@ static inline int calculate_order(int si
min_objects = slub_min_objects;
if (!min_objects)
min_objects = 4 * (fls(nr_cpu_ids) + 1);
+ min_objects = min(min_objects, (PAGE_SIZE << slub_max_order)/size);
while (min_objects > 1) {
fraction = 16;
while (fraction >= 4) {
@@ -1865,7 +1866,7 @@ static inline int calculate_order(int si
return order;
fraction /= 2;
}
- min_objects /= 2;
+ min_objects --;
}

/*
--
Oh, previous patch has a compiling warning. Pls. use below patch.
From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
The default order of kmalloc-8192 on 2*4 stoakley is an issue of calculate_order.
slab_size   order   name
-------------------------------------------------
4096        3       sgpool-128
8192        2       kmalloc-8192
16384       3       kmalloc-16384

kmalloc-8192's default order is smaller than sgpool-128's.
On 4*4 tigerton machine, a similar issue appears on another kmem_cache.
Function calculate_order uses 'min_objects /= 2;' to shrink. Plus size calculation/checking
in slab_order, sometimes above issue appear.

Below patch against 2.6.29-rc2 fixes it.
I checked the default orders of all kmem_cache and they don't become smaller than before. So
the patch wouldn't hurt performance.

Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
---
--- linux-2.6.29-rc2/mm/slub.c 2009-02-11 00:49:48.000000000 -0500
+++ linux-2.6.29-rc2_slubcalc_order/mm/slub.c 2009-02-12 00:47:52.000000000 -0500
@@ -1844,6 +1844,7 @@ static inline int calculate_order(int si
int order;
int min_objects;
int fraction;
+ int max_objects;

/*
* Attempt to find best configuration for a slab. This
@@ -1856,6 +1857,9 @@ static inline int calculate_order(int si
min_objects = slub_min_objects;
if (!min_objects)
min_objects = 4 * (fls(nr_cpu_ids) + 1);
+ max_objects = (PAGE_SIZE << slub_max_order)/size;
+ min_objects = min(min_objects, max_objects);
+
while (min_objects > 1) {
fraction = 16;
while (fraction >= 4) {
@@ -1865,7 +1869,7 @@ static inline int calculate_order(int si
return order;
fraction /= 2;
}
- min_objects /= 2;
+ min_objects --;
}

/*
--
Applied to the 'topic/slub/perf' branch. Thanks!
Pekka
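
To see why halving min_objects lands on order 2 for kmalloc-8192, here is a userspace model of the order search. It is a sketch under assumptions, not the kernel code. Only the calculate_order() loop is taken from the hunks above. The slab_order() logic is a simplified reconstruction with no slub_min_order, no object-count cap and no MAX_ORDER fallback. The 4 KiB page size and 8 CPUs are assumed for the stoakley box, and every name ending in _sketch is made up.

```c
#define PAGE_SHIFT 12                    /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Userspace stand-in for fls(): 1-based index of the highest set bit. */
static int fls_sketch(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/*
 * Simplified model of slab_order(): starting from the first order that
 * could hold min_objects objects, return the first order that wastes at
 * most 1/fract_leftover of the slab. A result above max_order means
 * "no fit".
 */
static int slab_order_sketch(int size, int min_objects, int max_order,
			     int fract_leftover)
{
	int order = fls_sketch(min_objects * size - 1) - PAGE_SHIFT;

	if (order < 0)
		order = 0;
	for (; order <= max_order; order++) {
		unsigned long slab_size = PAGE_SIZE << order;

		if (slab_size < (unsigned long)min_objects * size)
			continue;
		if (slab_size % size <= slab_size / fract_leftover)
			break;
	}
	return order;
}

/* patched == 0: halve min_objects on failure (old behaviour).
 * patched != 0: clamp to max_objects first, then decrement by one. */
static int calculate_order_sketch(int size, int nr_cpus, int max_order,
				  int patched)
{
	int min_objects = 4 * (fls_sketch(nr_cpus) + 1);
	int order;

	if (patched) {
		int max_objects = (int)((PAGE_SIZE << max_order) / size);

		if (max_objects < min_objects)
			min_objects = max_objects;
	}
	while (min_objects > 1) {
		int fraction;

		for (fraction = 16; fraction >= 4; fraction /= 2) {
			order = slab_order_sketch(size, min_objects,
						  max_order, fraction);
			if (order <= max_order)
				return order;
		}
		if (patched)
			min_objects--;
		else
			min_objects /= 2;
	}
	/* Last resort: a single object per slab. */
	order = slab_order_sketch(size, 1, max_order, 1);
	return order <= max_order ? order : -1;
}
```

Under those assumptions, calculate_order_sketch(8192, 8, 3, 0) gives 2 and calculate_order_sketch(8192, 8, 3, 1) gives 3. With halving, min_objects goes 20, 10, 5, 2 and skips 4 and 3, either of which fits an order-3 slab. The same model gives order 3 for sizes 4096 and 16384, matching the table in the patch description, and order 5 for size 8192 when max_order is 5.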
--
You reverted the page allocator passthrough patch before this right?
Otherwise kmalloc-8192 should not exist and allocation calls for 8192
bytes would be converted inline to request of an order 1 page from the
page allocator.
Hi Christoph,
On Thu, Feb 12, 2009 at 5:25 PM, Christoph Lameter
Yup, I assume that's the case here.
--
I wonder why that doesn't happen already, actually. The slub_max_order
knob is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously
order 3 should be as good fit as order 2 so 'fraction' can't be too high
either. Hmm.

Pekka
--
The kmalloc-8192 is new. Look at slabinfo output to see what allocation
orders are chosen.
--
Yes, yes, I know the new cache is a result of my patch. I'm just saying
that AFAICT, the existing logic should set the order to 3 but IIRC
Yanmin said it's 2.

Pekka
--
Yes the old slabtop that works on /proc/slabinfo works with SLQB (ie. SLQB
implements /proc/slabinfo).

Lin Ming recently also ported the SLUB /sys/kernel/slab/ specific slabinfo
tool to SLQB. Basically it reports in-depth internal event counts etc. and
can operate on individual caches, making it very useful for performance
"observability" and tuning.

It is hard to come up with a single set of statistics that apply usefully
to all the allocators. FWIW, it would be a useful tool to port over to
SLAB too, if we end up deciding to go with SLAB.
--
