[PATCH] SLUB The unqueued slab allocator v4

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after
  all.
- More bug fixes and stabilization of diagnostic functions. This seems
  to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's
  idea)
- Add two new modifications (separate patches) to guarantee
  a minimum number of objects per slab and to pass through large
  allocations.

Note that SLUB will warn on zero sized allocations. SLAB just allocates
some memory. So some traces from the usb subsystem etc should be expected.
There are very likely also issues remaining in SLUB.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile
  time enabled. Runtime debugging can be controlled via kernel boot options
  on an individual slab cache basis or globally.
- Slab Trace support (For individual slab caches).
- Resiliency support: If basic sanity checks are enabled (via F f.e.)
  (boot option) then SLUB will do the best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including clash of SLUB's use of page
  flags with i386 arch use for pmd and pgds (which are managed
  as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we ...
Unlimited kmalloc size and removal of general caches >=4.
We can directly use the page allocator for all allocations 4K and larger. This
means that no general slabs are necessary and the size of the allocation passed
to kmalloc() can be arbitrarily large. Remove the useless general caches over 4k.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.21-rc2-mm1/mm/slub.c
===================================================================
--- linux-2.6.21-rc2-mm1.orig/mm/slub.c 2007-03-06 17:56:50.000000000 -0800
+++ linux-2.6.21-rc2-mm1/mm/slub.c 2007-03-06 17:57:11.000000000 -0800
@@ -1101,6 +1101,13 @@ void kmem_cache_free(struct kmem_cache *
 	if (unlikely(PageCompound(page)))
 		page = page->first_page;
+	if (unlikely(!PageSlab(page))) {
+		if (x == page_address(page)) {
+			put_page(page);
+			return;
+		}
+	}
+
 	if (!s)
 		s = page->slab;
@@ -1678,7 +1685,8 @@ static struct kmem_cache *get_slab(size_
 	/* SLAB allows allocations with zero size. So warn on those */
 	WARN_ON(size == 0);
 	/* Allocation too large? */
-	BUG_ON(index < 0);
+	if (index < 0)
+		return NULL;
 #ifdef CONFIG_ZONE_DMA
 	if ((flags & SLUB_DMA)) {
@@ -1722,15 +1730,32 @@ static struct kmem_cache *get_slab(size_
 void *__kmalloc(size_t size, gfp_t flags)
 {
-	return kmem_cache_alloc(get_slab(size, flags), flags);
+	struct kmem_cache *s = get_slab(size, flags);
+	struct page *page;
+
+	if (s)
+		return kmem_cache_alloc(s, flags);
+
+	page = alloc_pages(flags, get_order(size));
+	if (!page)
+		return NULL;
+	return page_address(page);
 }
 EXPORT_SYMBOL(__kmalloc);
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
-	return kmem_cache_alloc_node(get_slab(size, flags),
-							flags, node);
+	struct kmem_cache *s = get_slab(size, flags);
+	struct page *page;
+
+	if (s)
+		return kmem_cache_alloc_node(s, flags, node);
+
+	page = alloc_pages_node(node, flags, get_order(size));
+	if (!page)
+		return NULL;
+	return ...
I've been meaning to do this in SLOB as well. Perhaps it warrants doing
in stock kmalloc? I've got a grand total of 18 of these objects here.
The downside is this makes them suddenly disappear off the slabinfo radar.
--
Mathematics is the supreme nostalgia of our time.
-
The number increases with the number of NUMA nodes. We have had trouble
with the maximum kmalloc size before and this will get rid of it for good.
-
Perhaps do something with PAGE_SIZE here; as you know, there are
platforms/configs where PAGE_SIZE != 4k :-)
-
Any allocation > 2k just uses a regular allocation which will waste space.
I have a patch here to make this dependent on page size using a loop. The
problem is that it does not work with some versions of gcc. On the other
hand we really need this since one arch can actually have an order 22 page
size!
-
You don't need a loop, you need an if (s >= PAGE_SIZE) at the head of your
static list.
--
Mathematics is the supreme nostalgia of our time.
-
As I just said: PAGE_SIZE may be quite high. So I would need a looong
static list. We already check for the size being bigger than 2048 which is
half the usual page size. Anything larger will get passed through.
-
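The size check discussed above can be illustrated in userspace. This is
only a sketch under stated assumptions: 4 KB pages, and a cutoff of half a
page taken from the "bigger than 2048" remark. The sketch_* names are
hypothetical and do not appear in the patch; sketch_get_order() mimics what
the kernel's get_order() computes for the pass-through path.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KB pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Smallest order such that (PAGE_SIZE << order) >= size.
 * Zero-sized requests are rejected here; the patch itself only warns.
 */
static int sketch_get_order(size_t size)
{
	int order = 0;

	assert(size != 0);
	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/*
 * Returns 1 if a request of this size would skip the general caches
 * and go to the page allocator. The half-page cutoff is an assumption
 * taken from the discussion, not from the patch text.
 */
static int sketch_passes_through(size_t size)
{
	return size > SKETCH_PAGE_SIZE / 2;
}
```

With these assumptions a 2049 byte request already takes a whole page,
which is the wasted space mentioned above, and a 16 KB request maps to an
order 2 allocation.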
Guarantee a minimum number of objects per slab
The number of objects per slab is important for SLUB because it determines
the number of allocations that can be performed without having to consult
per node slab lists. Add another boot option "slub_min_objects=xx" that
allows the configuration of the objects per slab. This is similar
to SLAB's queue configurations.
Set the default of objects to 4. This will increase the page order for
certain slab objects.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.21-rc2-mm1/mm/slub.c
===================================================================
--- linux-2.6.21-rc2-mm1.orig/mm/slub.c 2007-03-06 17:57:11.000000000 -0800
+++ linux-2.6.21-rc2-mm1/mm/slub.c 2007-03-06 17:57:15.000000000 -0800
@@ -1201,6 +1201,12 @@ static __always_inline struct page *get_
 static int slub_min_order = 0;
 /*
+ * Minimum number of objects per slab. This is necessary in order to
+ * reduce locking overhead. Similar to the queue size in SLAB.
+ */
+static int slub_min_objects = 4;
+
+/*
  * Merge control. If this is set then no merging of slab caches will occur.
  */
 static int slub_nomerge = 0;
@@ -1232,7 +1238,7 @@ static int calculate_order(int size)
 			order < MAX_ORDER; order++) {
 		unsigned long slab_size = PAGE_SIZE << order;
-		if (slab_size < size)
+		if (slab_size < slub_min_objects * size)
 			continue;
 		rem = slab_size % size;
@@ -1624,6 +1630,15 @@ static int __init setup_slub_min_order(c
 __setup("slub_min_order=", setup_slub_min_order);
+static int __init setup_slub_min_objects(char *str)
+{
+	get_option (&str, &slub_min_objects);
+
+	return 1;
+}
+
+__setup("slub_min_objects=", setup_slub_min_objects);
+
 static int __init setup_slub_nomerge(char *str)
 {
 	slub_nomerge = 1;
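The effect of the changed check in calculate_order() can be tried out in
userspace. The following is a rough sketch, not the kernel function: the
page size, the order limit and in particular the 1/8 waste limit are
assumptions made here, since the hunk above does not show how the
remainder is judged. All sketch_* names are hypothetical.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KB pages */
#define SKETCH_MAX_ORDER 11	/* assumption: orders 0..10 are tried */

/*
 * Return the smallest order >= min_order whose slab holds at least
 * min_objects objects of the given size and wastes at most 1/8 of
 * the slab (the waste limit is an assumption). Returns -1 if no
 * order qualifies.
 */
static int sketch_calculate_order(unsigned long size, int min_order,
				  unsigned long min_objects)
{
	int order;

	if (size == 0)
		return -1;

	for (order = min_order; order < SKETCH_MAX_ORDER; order++) {
		unsigned long slab_size = SKETCH_PAGE_SIZE << order;
		unsigned long rem;

		/* The check changed by the patch: room for min_objects? */
		if (slab_size < min_objects * size)
			continue;

		rem = slab_size % size;
		if (rem <= slab_size / 8)
			return order;
	}
	return -1;
}
```

Under these assumptions a 2048 byte object fits an order 0 slab when only
one object is required, but needs an order 1 slab once four objects are
required. That is the increase in page order the description refers to.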
-
SLUB core
Basic new slab allocator. See overview for details
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Index: linux-2.6.21-rc2-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.21-rc2-mm1.orig/fs/proc/proc_misc.c 2007-03-06 17:59:44.000000000 -0800
+++ linux-2.6.21-rc2-mm1/fs/proc/proc_misc.c 2007-03-06 18:03:49.000000000 -0800
@@ -399,6 +399,21 @@ static const struct file_operations proc
 };
 #endif
+#ifdef CONFIG_SLUB
+extern struct seq_operations slubinfo_op;
+static int slubinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slubinfo_op);
+}
+static const struct file_operations proc_slubinfo_operations = {
+	.open		= slubinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+#endif
+
+
 #ifdef CONFIG_SLAB
 static int slabinfo_open(struct inode *inode, struct file *file)
 {
@@ -789,6 +804,9 @@ void __init proc_misc_init(void)
 #endif
 	create_seq_entry("stat", 0, &proc_stat_operations);
 	create_seq_entry("interrupts", 0, &proc_interrupts_operations);
+#ifdef CONFIG_SLUB
+	create_seq_entry("slubinfo",S_IWUSR|S_IRUGO,&proc_slubinfo_operations);
+#endif
 #ifdef CONFIG_SLAB
 	create_seq_entry("slabinfo",S_IWUSR|S_IRUGO,&proc_slabinfo_operations);
 #ifdef CONFIG_DEBUG_SLAB_LEAK
Index: linux-2.6.21-rc2-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.21-rc2-mm1.orig/include/linux/mm_types.h 2007-03-06 17:59:44.000000000 -0800
+++ linux-2.6.21-rc2-mm1/include/linux/mm_types.h 2007-03-06 18:03:49.000000000 -0800
@@ -19,10 +19,16 @@ struct page {
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
 	atomic_t _count;		/* Usage count, see below. */
-	atomic_t _mapcount;		/* Count of ptes mapped in mms,
+	union {
+		atomic_t _mapcount;	/* Count of ptes mapped in mms,
 					 * to show when page is mapped
 					 * & limit reverse map ...
Hi Christoph,
I shoved these patches through a few tests on x86, x86_64, ia64 and ppc64
last night to see how they got on. I enabled slub_debug to catch any
surprises that may be creeping about.
The results are mixed.
On x86_64, it completed successfully and looked reliable. There was a 5%
performance loss on kernbench and aim9 figures were way down. However,
with slub_debug enabled, I would expect that so it's not a fair comparison
performance wise. I'll rerun the tests without debug and see what it looks
like if you're interested and do not think it's too early to worry about
performance instead of clarity. This is what I have for bl6-13 (machine
appears on test.kernel.org so additional details are there).
KernBench Comparison
--------------------
                  2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-list-based    %diff
User   CPU time                  84.32                      86.03   -2.03%
System CPU time                  32.97                      38.21  -15.89%
Total  CPU time                 117.29                     124.24   -5.93%
Elapsed    time                  34.95                      37.31   -6.75%
AIM9 Comparison
---------------
                2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-list-based
 1 creat-clo      160706.55     62918.54    -97788.01  -60.85%  File Creations and Closes/second
 2 page_test      190371.67    204050.99     13679.32    7.19%  System Allocations & Pages/second
 3 brk_test      2320679.89   1923512.75   -397167.14  -17.11%  System Memory Allocations/second
 4 jmp_test     16391869.38  16380353.27    -11516.11   -0.07%  Non-local gotos/second
 5 signal_test    492234.63    235710.71   -256523.92  -52.11%  Signal Traps/second
 6 exec_test         232.26       220.88       -11.38   -4.90%  Program Loads/second
 7 fork_test        4514.25 ...
No, it's good to start worrying about performance now. There are still some
performance issues to be ironed out, in particular on NUMA. I am not sure
This was a single node box?
Note that the 16kb page size has a major impact on SLUB performance. On IA64
slub will use only 1/4th the locking
We have some additional patches here that reduce the max order for some
Hmmm... Looks like something is zapping an object. Try to rerun with
Someone did a kmalloc(0, ...). Zero sized allocations are not flagged
More page table trouble.
-
Ok, I've sent off a bunch of tests - two of which are on NUMA (numaq and
x86_64). It'll take them a long time to complete though as there is a
Yes, memory looks like this;
Zone PFN ranges:
DMA 1024 -> 262144
Normal 262144 -> 262144
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
0: 1024 -> 30719
0: 32768 -> 65413
0: 65440 -> 65505
On node 0 totalpages: 62405
Node 0 memmap at 0xe000000001126000 size 3670016 first pfn 0xe000000001134000
DMA zone: 220 pages used for memmap
DMA zone: 0 pages reserved
DMA zone: 62185 pages, LIFO batch:7
Normal zone: 0 pages used for memmap
It'll be interesting to see the kernbench tests then with debugging
I've queued up a few tests. One completed as I wrote this and it didn't
explode with SLAB_DEBUG set. Maybe the others will be different. I'll
kick it around for a bit.
I'll chase up what's happening here. It will be "reproducable" independent
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
You can get a similar effect on 4kb platforms by specifying
slub_min_order=2 on bootup. This means that we have to rely on your patches
to allow higher order allocs to work reliably though. The higher the order
of slub the less locking overhead. So the better your patches deal with
fragmentation the more we can reduce locking overhead in slub.
-
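The arithmetic behind the order versus locking overhead argument can be
written down as a toy model. It assumes, as a simplification made here,
that the per node partial list (and its lock) is visited once for every
slab that gets used up. The helper names are hypothetical and not part of
SLUB.

```c
#include <stddef.h>

/* Objects that fit into one slab of the given order. */
static unsigned long sketch_objects_per_slab(unsigned long page_size,
					     int order,
					     unsigned long object_size)
{
	if (object_size == 0)
		return 0;
	return (page_size << order) / object_size;
}

/*
 * Number of slab refills (and thus list lock acquisitions in this
 * model) needed to serve n_allocs allocations, rounded up.
 */
static unsigned long sketch_slab_refills(unsigned long n_allocs,
					 unsigned long objs_per_slab)
{
	if (objs_per_slab == 0)
		return 0;
	return (n_allocs + objs_per_slab - 1) / objs_per_slab;
}
```

For 64 byte objects, a 4 KB page at order 0 holds 64 objects, while a 4 KB
page at order 2 and a 16 KB page at order 0 both hold 256. In this model
1024 allocations then need 4 refills instead of 16, which matches the
"1/4th the locking" figure quoted for IA64.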
It should work out, because the way buddy always selects the minimum page
size will tend to cluster the slab allocations together whether they are
reclaimable or not. It's something I can investigate when slub has
stabilised a bit.
However, in general, high order kernel allocations remain a bad idea.
Depending on high order allocations that do not group could potentially
lead to a situation where the movable areas are used more and more by
kernel allocations. I cannot think of a workload that would actually break
I can certainly kick it around a lot and see what happens. It's best that
slub_min_order=2 remain an optional performance enhancing switch though.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
Note that I am amazed that the kernbench even worked. On small machines I
seem to be getting into trouble with order 1 allocations. SLAB seems to be
able to avoid the situation by keeping higher order pages on a freelist and
reducing the alloc/frees of higher order pages that the page allocator has
to deal with. Maybe we need per order queues in the page allocator?
There must be something fundamentally wrong in the page allocator if the
SLAB queues fix this issue. I was able to fix the issue in V5 by forcing
SLUB to keep a minimum number of objects around regardless of the fit to a
page order page. Pass through is deadly since the crappy page allocator
cannot handle it.
Higher order page allocation failures can be avoided by using kmalloc.
Yuck! Hopefully your patches fix that fundamental problem.
-
How small? The machines I am testing on aren't "big" but they aren't
That in itself is pretty incredible. From what I see, allocations up to 3
generally work unless they are atomic even with the vanilla kernel. That
said, it could be because slab is holding onto the high order pages for
I'm not sure what you mean by per-order queues. The buddy allocator
One way to find out for sure.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
-
Somehow they do not seem to work right. SLAB (and now SLUB too) can avoid
(or defer) fragmentation by keeping its own queues.
-
The results without slub_debug were not good except for IA64. x86_64 and
ppc64 both blew up for a variety of reasons. The IA64 results were
KernBench Comparison
--------------------
                  2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-slub    %diff
User   CPU time                1084.64              1032.93    4.77%
System CPU time                  73.38                63.14   13.95%
Total  CPU time                1158.02              1096.07    5.35%
Elapsed    time                 307.00               285.62    6.96%
AIM9 Comparison
---------------
                2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-slub
 1 creat-clo      425460.75    438809.64     13348.89    3.14%  File Creations and Closes/second
 2 page_test     2097119.26   3398259.27   1301140.01   62.04%  System Allocations & Pages/second
 3 brk_test      7008395.33   6728755.72   -279639.61   -3.99%  System Memory Allocations/second
 4 jmp_test     12226295.31  12254966.21     28670.90    0.23%  Non-local gotos/second
 5 signal_test   1271126.28   1235510.96    -35615.32   -2.80%  Signal Traps/second
 6 exec_test         395.54       381.18       -14.36   -3.63%  Program Loads/second
 7 fork_test       13218.23     13211.41        -6.82   -0.05%  Task Creations/second
 8 link_test       64776.04      7488.13    -57287.91  -88.44%  Link/Unlink Pairs/second
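For reading the two tables: the sign conventions differ. KernBench reports
times, so a positive %diff means the patched kernel was faster, while AIM9
reports rates, so a positive %diff means more operations per second. The
helpers below are a hypothetical sketch of how the percentages appear to be
derived; they reproduce the figures above but are not taken from the
benchmark scripts.

```c
/* KernBench style: times, lower is better, positive = improvement. */
static double sketch_time_diff_pct(double clean, double patched)
{
	return (clean - patched) / clean * 100.0;
}

/* AIM9 style: rates, higher is better, positive = improvement. */
static double sketch_rate_diff_pct(double clean, double patched)
{
	return (patched - clean) / clean * 100.0;
}
```

For example, the elapsed time of 307.00 versus 285.62 gives about 6.96%,
and the link_test rate of 64776.04 versus 7488.13 gives about -88.44%.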
An example console log from x86_64 is below. It's not particularly clear why
it went blamo and I haven't had a chance all day to kick it around for a
bit due to a variety of other hilarity floating around.
Linux version 2.6.21-rc2-mm2-autokern1 (root@bl6-13.ltc.austin.ibm.com) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #1 SMP Thu Mar 8 12:13:27 CST 2007
Command line: ro root=/dev/VolGroup00/LogVol00 rhgb console=tty0 ...
Yuck, that is the dst issue that Adrian is also looking at. Likely an issue
Crap. Maybe we straddled a slab boundary here?
-
Lower bits must be clear right?
Looks like the pud was released and then reused for a 64 byte cache or so.
This is likely a freelist pointer that slub put there after allocating the
page for the 64 byte
Data overwritten after free or after slab was allocated. So this may be the
same issue. pud was zapped after it was freed destroying the poison of
another object in the 64 byte cache.
Hmmm.. Maybe I should put the pad checks before the object checks. That way
we detect that the whole slab was corrupted and do not flag just a single
object.
-
