Hi,

Just ran some tbench numbers (from dbench-3.04), on a 2 socket, 8 core x86 system, with 1 NUMA node per socket. With kernel 2.6.24-rc2, comparing slab vs slub allocators.

I run from 1 to 16 client threads, 5 times each, restarting the tbench server between every run. I'm just taking the highest of each of the 5 tests (because the scheduler placement can sometimes be poor). It's not completely scientific, but from the graph you can guess it is relatively stable and seems significant.

Summary: slub is consistently slower. When all CPUs are saturated, it is around 20% slower.

Attached is a graph (x is nrclients, y is throughput MB/s).

If I can help with reproducing it or testing anything, let me know. I'll be trying out a few other benchmarks too... anything you want me to test specifically and I can try.

Thanks,
Nick
You saw the discussion at http://marc.info/?l=linux-kernel&m=119354245426072&w=2 and the patches / configurations that were posted to address the issues? Could you try these? -
On an 8p 2.6.24-rc2 I see even a 50% regression on tbench SLAB vs. SLUB when specifying 8 threads. Interestingly nothing changes the performance numbers regardless of debugging on or off etc etc. Usually debugging should reduce performance but nada. May have something to do with the localhost interface? Something is effectively throttling SLUB here.

2.6.23 SLUB
2159.62 MB/sec

2.6.24-rc2-slab head SLUB
1260.80 MB/sec

2.6.24 SLUB should be faster than 2.6.23 SLUB. Still trying to figure out what is going on.... -
commit deea84b0ae3d26b41502ae0a39fe7fe134e703d0 seems to cause a drop
in SLUB tbench performance:
8p x86_64 system:
2.6.24-rc2:
1260.80 MB/sec
After reverting the patch:
2350.04 MB/sec
SLAB performance (which is at 2435.58 MB/sec, ~3% better than SLUB) is not
affected by the patch.
Since this is an alignment change it seems that tbench performance is
sensitive to the data layout? SLUB packs data more tightly than SLAB. So
8 byte allocations could result in cacheline contention if adjacent
objects are allocated from different cpus. SLAB's minimum size is 32
bytes so the cacheline contention is likely more limited.
Maybe we need to allocate a minimum of one cacheline to the skb head? Or
pad it out to a full cacheline?
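
To put rough numbers on the contention hypothesis above, here is a minimal sketch. The 64-byte cacheline is an assumption (typical for x86-64, not stated in the thread), and both helper names are made up for illustration. It only shows how many minimum-size objects can share one line under each allocator's minimum object size.

```c
#include <assert.h>
#include <stdint.h>

#define CACHELINE 64u  /* assumed cacheline size in bytes */

/* How many objects of obj_size bytes fit into one cacheline. */
static unsigned objects_per_cacheline(unsigned obj_size)
{
	assert(obj_size > 0 && obj_size <= CACHELINE);
	return CACHELINE / obj_size;
}

/* Two addresses share a line when they fall into the same 64-byte block. */
static int same_cacheline(uintptr_t a, uintptr_t b)
{
	return (a / CACHELINE) == (b / CACHELINE);
}
```

With these assumptions, 8-byte objects pack eight to a line while 32-byte objects pack two, so adjacent small objects handed to different cpus are more likely to land on the same line with the smaller minimum.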
commit deea84b0ae3d26b41502ae0a39fe7fe134e703d0
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun Oct 21 16:27:46 2007 -0700
[NET]: Fix SKB_WITH_OVERHEAD calculation
The calculation in SKB_WITH_OVERHEAD is incorrect in that it can cause
an overflow across a page boundary which is what it's meant to prevent.
In particular, the header length (X) should not be lumped together with
skb_shared_info. The latter needs to be aligned properly while the header
has no choice but to sit in front of wherever the payload is.
Therefore the correct calculation is to take away the aligned size of
skb_shared_info, and then subtract the header length. The resulting
quantity L satisfies the following inequality:
SKB_DATA_ALIGN(L + X) + sizeof(struct skb_shared_info) <= PAGE_SIZE
This is the quantity used by alloc_skb to do the actual allocation.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f93f22b..369f60a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,8 +41,7 @@
#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES ...

cc'ed linux-netdev

The data should already be cacheline aligned. It is kmalloced, and with a minimum size of somewhere around 200 bytes on a 64-bit machine. So it will hit a cacheline aligned kmalloc slab AFAIKS -- cacheline interference is probably not the problem. (To verify, I built slub with minimum kmalloc size set to 32 like slab and it's no real difference.)

But I can't see why restricting the allocation to PAGE_SIZE would help either. Maybe the macros are used in some other areas.

BTW. your size-2048 kmalloc cache is order-1 in the default setup, whereas kmalloc(1024) or kmalloc(4096) will be order-0 allocations. And SLAB also uses order-0 for size-2048. It would be nice if SLUB did the -
You can try to see the effect that order 0 would have by booting with slub_max_order=0 -
Yeah, that didn't help much, but in general I think it would give more consistent and reliable behaviour from slub. -
From: Nick Piggin <nickpiggin@yahoo.com.au> Just a note that I'm not ignoring this issue, I just don't have time to get to it yet. I suspect the issue is about having a huge skb->data linear area for TCP sends over loopback. We're likely getting a much smaller skb->data linear data area after the patch in question, the rest using the sk_buff scatterlist pages which are a little bit more expensive to process. -
No problem. I would like to have helped more, but it's slow going given my lack of network stack knowledge. If I get any more interesting data, It didn't seem to be noticeable at 1 client. Unless scatterlist processing is going to cause cacheline bouncing, I don't see why this hurts more as you add CPUs? -
From: Nick Piggin <nickpiggin@yahoo.com.au> Is your test system using HIGHMEM? That's one thing the page vector in the sk_buff can do a lot, kmaps. -
No, it's an x86-64, so no highmem. What's also interesting is that SLAB apparently doesn't have this condition. The first thing that sprang to mind is that SLAB caches order > 0 allocations, while SLUB does not. However if anything, that should actually favour the SLUB numbers if network is avoiding order > 0 allocations. I'm doing some oprofile runs now to see if I can get any more info. -
From: Nick Piggin <nickpiggin@yahoo.com.au>
Here are some other things you can play around with:
1) Monitor the values of skb->len and skb->data_len for packets going over loopback.
2) Try removing NETIF_F_SG in drivers/net/loopback.c's dev->features setting.
-
OK, in vanilla kernels, the page allocator definitely shows higher in the results (than with Herbert's patch reverted).

27516 2.7217 get_page_from_freelist
21677 2.1442 __rmqueue_smallest
20513 2.0290 __free_pages_ok
18725 1.8522 get_pageblock_flags_group

Just these account for nearly 10% of cycles. __alloc_skb shows up higher too. free_hot_cold_page() shows a lot lower though, which might indicate that actually there is more higher order allocation activity (I'll check that next).

**** SLUB, avg throughput 1548
CPU: AMD64 family10, speed 1900 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
94636    9.3609   copy_user_generic_string
38932    3.8509   ipt_do_table
34746    3.4369   tcp_v4_rcv
29539    2.9218   skb_release_data
27516    2.7217   get_page_from_freelist
26046    2.5763   tcp_sendmsg
24482    2.4216   local_bh_enable
22910    2.2661   ip_queue_xmit
22113    2.1873   ktime_get
21677    2.1442   __rmqueue_smallest
20513    2.0290   __free_pages_ok
18725    1.8522   get_pageblock_flags_group
18580    1.8378   tcp_recvmsg
18108    1.7911   __napi_schedule
17593    1.7402   schedule
16998    1.6813   tcp_ack
16102    1.5927   dev_hard_start_xmit
15751    1.5580   system_call
15707    1.5536   net_rx_action
15150    1.4986   __switch_to
14988    1.4825   tcp_transmit_skb
13921    1.3770   kmem_cache_free
13398    1.3253   __mod_timer
13243    1.3099   tcp_rcv_established
13109    1.2967   __tcp_select_window
11022    1.0902   __tcp_push_pending_frames
10732    1.0615   set_normalized_timespec
10561    1.0446   netif_rx
8840     0.8744   netif_receive_skb
7816     0.7731   nf_iterate
7300     0.7221   __update_rq_clock
6683     0.6610   _read_lock_bh
6504     0.6433   ...
Doesn't help (with vanilla kernel -- Herbert's patch applied). data_len histogram drops to 0 and goes to len (I guess that's not surprising). Performance is pretty similar (ie. not good). I'll look at allocator patterns next. -
From: Nick Piggin <nickpiggin@yahoo.com.au>
Thanks for all of this data Nick.
So the thing that's being affected here in TCP is
net/ipv4/tcp.c:select_size(), specifically the else branch:
	int tmp = tp->mss_cache;
	...
	else {
		int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
		if (tmp >= pgbreak &&
		    tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
			tmp = pgbreak;
	}
This is deciding, in 'tmp', how much linear sk_buff space to
allocate. 'tmp' is initially set to the path MSS, which
for loopback is 16K - the space necessary for packet headers.
The SKB_MAX_HEAD() value has changed as a result of Herbert's
bug fix. I suspect this 'if' test is passing both with and
without the patch.
But pgbreak is now smaller, and thus the skb->data linear
data area size we choose to use is smaller as well.
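
The effect of a smaller pgbreak on the branch above can be modelled in userspace. This is only a sketch under stated assumptions: 4K pages, a MAX_SKB_FRAGS of 18, and a function name (select_linear_size) that is made up here. It is not the kernel's select_size(), just the quoted 'if' test in isolation.

```c
#include <assert.h>

#define PAGE_SIZE     4096  /* assumed: 4K pages */
#define MAX_SKB_FRAGS 18    /* assumed value for 4K pages */

/*
 * Model of the quoted else branch: when the MSS is at least pgbreak but
 * still fits into head plus page frags, cap the linear area at pgbreak
 * so the remainder goes into page fragments.
 */
static int select_linear_size(int mss, int pgbreak)
{
	int tmp = mss;

	assert(mss > 0 && pgbreak > 0);
	if (tmp >= pgbreak &&
	    tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
		tmp = pgbreak;
	return tmp;
}
```

For any MSS inside that window the result is simply pgbreak, so whatever shrinks pgbreak shrinks the linear area by the same amount; an MSS below pgbreak, or one too large for the frags, passes through unchanged.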
You can test if this is precisely what is causing the performance
regression by using the old calculation just here in select_size().
Add something like this local to net/ipv4/tcp.c:
#define OLD_SKB_WITH_OVERHEAD(X) \
(((X) - sizeof(struct skb_shared_info)) & \
~(SMP_CACHE_BYTES - 1))
#define OLD_SKB_MAX_ORDER(X, ORDER) \
OLD_SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
#define OLD_SKB_MAX_HEAD(X) (OLD_SKB_MAX_ORDER((X), 0))
And then use OLD_SKB_MAX_HEAD() in select_size().
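
To see how the two calculations differ, they can be compared outside the kernel. The sketch below is an illustration, not kernel code: the 4K page, the 64-byte cacheline, and especially the 344-byte stand-in for sizeof(struct skb_shared_info) are assumed values, and all function names are hypothetical. old_max_head() follows OLD_SKB_MAX_HEAD() above; new_max_head() follows the wording of Herbert's commit message (take away the aligned size of skb_shared_info, then subtract the header length).

```c
#include <assert.h>

#define PAGE_SIZE       4096UL  /* assumed */
#define SMP_CACHE_BYTES 64UL    /* assumed */
#define SHINFO_SIZE     344UL   /* hypothetical sizeof(struct skb_shared_info) */

/* Round x up to a multiple of the cacheline size. */
static unsigned long data_align(unsigned long x)
{
	return (x + (SMP_CACHE_BYTES - 1)) & ~(SMP_CACHE_BYTES - 1);
}

/* Old calculation: header and shared info lumped together, then rounded down. */
static unsigned long old_max_head(unsigned long hdr)
{
	assert(hdr + SHINFO_SIZE <= PAGE_SIZE);
	return ((PAGE_SIZE - hdr) - SHINFO_SIZE) & ~(SMP_CACHE_BYTES - 1);
}

/* New calculation as described in the commit message. */
static unsigned long new_max_head(unsigned long hdr)
{
	assert(hdr + data_align(SHINFO_SIZE) <= PAGE_SIZE);
	return PAGE_SIZE - data_align(SHINFO_SIZE) - hdr;
}

/* Left-hand side of the inequality in the commit message. */
static unsigned long footprint(unsigned long len, unsigned long hdr)
{
	return data_align(len + hdr) + SHINFO_SIZE;
}
```

With a header length that is not a multiple of the cacheline size (200 here, again made up), the old result gives a footprint above PAGE_SIZE, while the new result is slightly smaller and stays within the page. That matches both the overflow the commit message describes and the smaller pgbreak.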
-
OK, that makes sense. BTW, are you taking advantage of kmalloc's "quantization" into slabs WRT the linear data area? I wonder if That brings performance back up! I wonder why it isn't causing a problem for SLAB... -
Thanks for the pointer. Indeed there is a bug in that area. I'm not sure whether it's causing the problem at hand but it's certainly suboptimal.

[TCP]: Fix size calculation in sk_stream_alloc_pskb

We round up the header size in sk_stream_alloc_pskb so that TSO packets get zero tail room. Unfortunately this rounding up is not coordinated with the select_size() function used by TCP to calculate the second parameter of sk_stream_alloc_pskb.

As a result, we may allocate more than a page of data in the non-TSO case when exactly one page is desired.

In fact, rounding up the head room is detrimental in the non-TSO case because it makes memory that would otherwise be available to the payload head room. TSO doesn't need this either, all it wants is the guarantee that there is no tail room.

So this patch fixes this by adjusting the skb_reserve call so that exactly the requested amount (which all callers have calculated in a precise way) is made available as tail room.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/include/net/sock.h b/include/net/sock.h
index 5504fb9..567e468 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1235,14 +1235,16 @@ static inline struct sk_buff *sk_stream_alloc_pskb(struct sock *sk,
 						   gfp_t gfp)
 {
 	struct sk_buff *skb;
-	int hdr_len;
-	hdr_len = SKB_DATA_ALIGN(sk->sk_prot->max_header);
-	skb = alloc_skb_fclone(size + hdr_len, gfp);
+	skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
 	if (skb) {
 		skb->truesize += mem;
 		if (sk_stream_wmem_schedule(sk, skb->truesize)) {
-			skb_reserve(skb, hdr_len);
+			/*
+			 * Make sure that we have exactly size bytes
+			 * available to the caller, no more, no less.
+			 */
+			skb_reserve(skb, skb_tailroom(skb) - ...
This looks like it fixes the problem! -
From: Nick Piggin <nickpiggin@yahoo.com.au> Great, thanks for testing. I'll apply Herbert's patch tomorrow. Yes, I wonder why too. I bet objects just got packed differently. There is this fugly "LOOPBACK_OVERHEAD" macro define in drivers/net/loopback.c that is trying to figure out the various overheads that we should subtract from the loopback MTU we use by default. It's almost guaranteed to be wrong for the way the allocators work now. -
The objects are packed tightly in SLUB and SLUB can allocate smaller objects (minimum is 8, SLAB minimum is 32). On free a SLUB object goes directly back to the slab where it came from. We have no queues in SLUB so we use the first word of the object as a freepointer. In SLAB the objects first go onto queues and then are drained later into the slab. On free in SLAB there is usually no need to touch the object itself. The object pointer is simply moved onto the queue (works well in SMP, in NUMA we have overhead identifying the queue and overhead due to the number of queues needed). -
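
The "first word of the object as a freepointer" idea can be sketched as a toy allocator in userspace. This is a simplified illustration, not SLUB's actual code: the toy_slab structure and the three function names are invented here, and there is no locking, no per-cpu state and no page handling.

```c
#include <assert.h>
#include <stddef.h>

/* Toy slab: free objects are chained through their own first word. */
struct toy_slab {
	void *freelist;  /* first free object, or NULL when exhausted */
};

/* Carve mem into nr objects of obj_size bytes and link them all as free. */
static void toy_slab_init(struct toy_slab *s, void *mem,
			  size_t obj_size, size_t nr)
{
	char *p = mem;
	size_t i;

	assert(obj_size >= sizeof(void *) && obj_size % sizeof(void *) == 0);
	s->freelist = NULL;
	/* Link back to front so the lowest address ends up at the head. */
	for (i = nr; i-- > 0; ) {
		void *obj = p + i * obj_size;

		*(void **)obj = s->freelist;
		s->freelist = obj;
	}
}

/* Pop the head; the next pointer is read out of the object itself. */
static void *toy_alloc(struct toy_slab *s)
{
	void *obj = s->freelist;

	if (obj)
		s->freelist = *(void **)obj;
	return obj;
}

/* Push obj back; note that free has to write into the object. */
static void toy_free(struct toy_slab *s, void *obj)
{
	*(void **)obj = s->freelist;
	s->freelist = obj;
}
```

The write in toy_free() is the point of contrast with the queue approach described above: a queue-based free only stores the object's address elsewhere and leaves the object's memory untouched.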
From: Herbert Xu <herbert@gondor.apana.org.au> Applied and I'll queue it up for -stable too. -
From: Nick Piggin <nickpiggin@yahoo.com.au> This case is a good example to use the next time a stupid thread starts up about bug reports not being looked into. To me it seems clearly more a matter of the quality of the bug report. -
Well this is likely the result of the SLUB regression. If you allocate an order 1 page then the zone locks need to be taken. SLAB queues a couple of higher order pages and can thus serve a couple of requests without going into the page allocator, whereas SLUB has to go directly to the page allocator for allocate and free. I guess that needs fixing in the page allocator. Or do I need to add a mechanism to buffer higher order page allocations to SLUB? -
Actually this serves to discourage people from using high-order allocations which IMHO is a good thing :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -
Yeah, it appears this is what happened. The lockless page allocator fastpath appears on the list and the slowpaths disappear after Herbert's patches. SLAB is doing its own thing, so it avoids that Yeah I completely agree. The right fix is in the caller... The bug / suboptimal allocation would not have been found in tcp if not for this ;) -
Damn your misleading subject! I thought this was going to be about something interesting. -- Mathematics is the supreme nostalgia of our time. -
Actually I did test slob as well -- it's competitive with slab and slub up to about 4 cores, which is nice. -
