Re: [SLUB 0/3] SLUB: The unqueued slab allocator V4

Previous thread: [PATCH] [RSDL-mm 6/6] sched: document rsdl cpu scheduler by Con Kolivas on Tuesday, March 6, 2007 - 6:29 pm. (1 message)

Next thread: Re: [PATCH] proc: maps protection by Kees Cook on Tuesday, March 6, 2007 - 8:14 pm. (12 messages)
From: Christoph Lameter
Date: Tuesday, March 6, 2007 - 7:35 pm

[PATCH] SLUB The unqueued slab allocator v4

V3->V4
- Rename /proc/slabinfo to /proc/slubinfo. We have a different format after
  all.
- More bug fixes and stabilization of diagnostic functions. This seems
  to be finally something that works wherever we test it.
- Serialize kmem_cache_create and kmem_cache_destroy via slub_lock (Adrian's
  idea)
- Add two new modifications (separate patches) to guarantee
  a minimum number of objects per slab and to pass through large
  allocations.

Note that SLUB will warn on zero sized allocations. SLAB just allocates
some memory. So some traces from the usb subsystem etc should be expected.
There are very likely also issues remaining in SLUB.

V2->V3
- Debugging and diagnostic support. This is runtime enabled and not compile
  time enabled. Runtime debugging can be controlled via kernel boot options
  on an individual slab cache basis or globally.
- Slab Trace support (For individual slab caches).
- Resiliency support: If basic sanity checks are enabled (f.e. via the F
  boot option) then SLUB will do its best to perform diagnostics and
  then continue (i.e. mark corrupted objects as used).
- Fix up numerous issues including the clash of SLUB's use of page
  flags with the i386 arch use for pmds and pgds (which are managed
  as slab caches, sigh).
- Dynamic per CPU array sizing.
- Explain SLUB slabcache flags

V1->V2
- Fix up various issues. Tested on i386 UP, X86_64 SMP, ia64 NUMA.
- Provide NUMA support by splitting partial lists per node.
- Better Slab cache merge support (now at around 50% of slabs)
- List slab cache aliases if slab caches are merged.
- Updated descriptions of /proc/slabinfo output

This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we ...
From: Christoph Lameter
Date: Tuesday, March 6, 2007 - 7:35 pm

Unlimited kmalloc size and removal of general caches >=4.

We can directly use the page allocator for all allocations 4K and larger. This
means that no general slabs are necessary and the size of the allocation passed
to kmalloc() can be arbitrarily large. Remove the useless general caches over 4k.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc2-mm1/mm/slub.c
===================================================================
--- linux-2.6.21-rc2-mm1.orig/mm/slub.c	2007-03-06 17:56:50.000000000 -0800
+++ linux-2.6.21-rc2-mm1/mm/slub.c	2007-03-06 17:57:11.000000000 -0800
@@ -1101,6 +1101,13 @@ void kmem_cache_free(struct kmem_cache *
 	if (unlikely(PageCompound(page)))
 		page = page->first_page;
 
+	if (unlikely(!PageSlab(page))) {
+		if (x == page_address(page)) {
+			put_page(page);
+			return;
+		}
+	}
+
 	if (!s)
 		s = page->slab;
 
@@ -1678,7 +1685,8 @@ static struct kmem_cache *get_slab(size_
 	/* SLAB allows allocations with zero size. So warn on those */
 	WARN_ON(size == 0);
 	/* Allocation too large? */
-	BUG_ON(index < 0);
+	if (index < 0)
+		return NULL;
 
 #ifdef CONFIG_ZONE_DMA
 	if ((flags & SLUB_DMA)) {
@@ -1722,15 +1730,32 @@ static struct kmem_cache *get_slab(size_
 
 void *__kmalloc(size_t size, gfp_t flags)
 {
-	return kmem_cache_alloc(get_slab(size, flags), flags);
+	struct kmem_cache *s = get_slab(size, flags);
+	struct page *page;
+
+	if (s)
+		return kmem_cache_alloc(s, flags);
+
+	page = alloc_pages(flags, get_order(size));
+	if (!page)
+		return NULL;
+	return page_address(page);
 }
 EXPORT_SYMBOL(__kmalloc);
 
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
-	return kmem_cache_alloc_node(get_slab(size, flags),
-							flags, node);
+	struct kmem_cache *s = get_slab(size, flags);
+	struct page *page;
+
+	if (s)
+		return kmem_cache_alloc_node(s, flags, node);
+
+	page = alloc_pages_node(node, flags, get_order(size));
+	if (!page)
+		return NULL;
+	return ...
From: Matt Mackall
Date: Tuesday, March 6, 2007 - 7:40 pm

I've been meaning to do this in SLOB as well. Perhaps it warrants
doing in stock kmalloc? I've got a grand total of 18 of these objects
here.

The downside is this makes them suddenly disappear off the slabinfo
radar.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Christoph Lameter
Date: Tuesday, March 6, 2007 - 8:22 pm

The number increases with the number of NUMA nodes. We have had trouble with
the maximum kmalloc size before and this will get rid of it for good.
 
-

From: Peter Zijlstra
Date: Wednesday, March 7, 2007 - 2:01 am

Perhaps do something with PAGE_SIZE here, as you know there are
platforms/configs where PAGE_SIZE != 4k :-)

-

From: Christoph Lameter
Date: Wednesday, March 7, 2007 - 8:34 am

Any allocation > 2k just uses a regular allocation which will waste space.

I have a patch here to make this dependent on page size using a loop. The 
problem is that it does not work with some versions of gcc. On the 
other hand we really need this since one arch can 
actually have an order 22 page size!

-

From: Matt Mackall
Date: Wednesday, March 7, 2007 - 11:03 am

You don't need a loop, you need an if (s >= PAGE_SIZE) at the head of
your static list.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Christoph Lameter
Date: Wednesday, March 7, 2007 - 11:23 am

As I just said: PAGE_SIZE may be quite high. So I would need a looong 
static list. We already check for the size being bigger than 2048 which is 
half the usual page size. Anything larger will get passed through.
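
The threshold discussed above can be sketched in userspace C. This is only an illustration of the idea, not the patch Christoph mentions: the names sketch_kmalloc_index() and sketch_get_order(), the 4K page size and the 8-byte smallest size class are all assumptions made for the example. Requests up to half a page map to a power-of-two size class found with a loop; anything larger returns -1, meaning it would be passed through to the page allocator at the order computed by the second helper.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; a real kernel takes these from the arch. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Return the power-of-two size-class index for a request, or -1 when the
 * request is larger than half a page and would go to the page allocator.
 */
static int sketch_kmalloc_index(size_t size)
{
	int index = 3;		/* smallest class assumed here: 8 bytes */

	assert(size > 0);
	if (size > SKETCH_PAGE_SIZE / 2)
		return -1;
	while (((size_t)1 << index) < size)
		index++;
	return index;
}

/* Smallest page order whose span covers size bytes (get_order()-like). */
static int sketch_get_order(size_t size)
{
	int order = 0;

	assert(size > 0);
	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}
```

Because the cut-off is derived from SKETCH_PAGE_SIZE rather than a fixed list of sizes, changing SKETCH_PAGE_SHIFT moves the pass-through point with it.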

-

From: Christoph Lameter
Date: Tuesday, March 6, 2007 - 7:35 pm

Guarantee a minimum number of objects per slab

The number of objects per slab is important for SLUB because it determines
the number of allocations that can be performed without having to consult
per node slab lists. Add another boot option "slub_min_objects=xx" that
allows the configuration of the objects per slab. This is similar
to SLAB's queue configurations.

Set the default minimum to 4 objects. This will increase the page order for
certain slab objects.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc2-mm1/mm/slub.c
===================================================================
--- linux-2.6.21-rc2-mm1.orig/mm/slub.c	2007-03-06 17:57:11.000000000 -0800
+++ linux-2.6.21-rc2-mm1/mm/slub.c	2007-03-06 17:57:15.000000000 -0800
@@ -1201,6 +1201,12 @@ static __always_inline struct page *get_
 static int slub_min_order = 0;
 
 /*
+ * Minimum number of objects per slab. This is necessary in order to
+ * reduce locking overhead. Similar to the queue size in SLAB.
+ */
+static int slub_min_objects = 4;
+
+/*
  * Merge control. If this is set then no merging of slab caches will occur.
  */
 static int slub_nomerge = 0;
@@ -1232,7 +1238,7 @@ static int calculate_order(int size)
 			order < MAX_ORDER; order++) {
 		unsigned long slab_size = PAGE_SIZE << order;
 
-		if (slab_size < size)
+		if (slab_size < slub_min_objects * size)
 			continue;
 
 		rem = slab_size % size;
@@ -1624,6 +1630,15 @@ static int __init setup_slub_min_order(c
 
 __setup("slub_min_order=", setup_slub_min_order);
 
+static int __init setup_slub_min_objects(char *str)
+{
+	get_option (&str, &slub_min_objects);
+
+	return 1;
+}
+
+__setup("slub_min_objects=", setup_slub_min_objects);
+
 static int __init setup_slub_nomerge(char *str)
 {
 	slub_nomerge = 1;
-
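
The effect of the slub_min_objects check in the patch above can be tried out in userspace. The sketch below is an assumption-laden illustration: sketch_min_order_for() is a hypothetical name, the 4K page size and the order limit of 11 are assumed values, and the remainder (waste) test that follows in calculate_order() is left out because the excerpt does not show it in full. It returns the first order whose slab can hold at least min_objects objects.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_MAX_ORDER 11

/*
 * Return the lowest order >= min_order whose slab holds at least
 * min_objects objects of the given size, or -1 if no order below
 * SKETCH_MAX_ORDER is large enough.
 */
static int sketch_min_order_for(unsigned long size, int min_order,
				int min_objects)
{
	int order;

	assert(size > 0 && min_objects > 0);
	for (order = min_order; order < SKETCH_MAX_ORDER; order++) {
		unsigned long slab_size = SKETCH_PAGE_SIZE << order;

		/* Same test as the patched line: skip slabs that are too small. */
		if (slab_size < (unsigned long)min_objects * size)
			continue;
		return order;
	}
	return -1;
}
```

With a minimum of 4 objects, a 2048-byte object no longer fits an order 0 slab and moves to order 1, which is the page order increase the changelog refers to.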

From: Christoph Lameter
Date: Tuesday, March 6, 2007 - 7:35 pm

SLUB core

Basic new slab allocator. See overview for details

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc2-mm1/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.21-rc2-mm1.orig/fs/proc/proc_misc.c	2007-03-06 17:59:44.000000000 -0800
+++ linux-2.6.21-rc2-mm1/fs/proc/proc_misc.c	2007-03-06 18:03:49.000000000 -0800
@@ -399,6 +399,21 @@ static const struct file_operations proc
 };
 #endif
 
+#ifdef CONFIG_SLUB
+extern struct seq_operations slubinfo_op;
+static int slubinfo_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &slubinfo_op);
+}
+static const struct file_operations proc_slubinfo_operations = {
+	.open		= slubinfo_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+#endif
+
+
 #ifdef CONFIG_SLAB
 static int slabinfo_open(struct inode *inode, struct file *file)
 {
@@ -789,6 +804,9 @@ void __init proc_misc_init(void)
 #endif
 	create_seq_entry("stat", 0, &proc_stat_operations);
 	create_seq_entry("interrupts", 0, &proc_interrupts_operations);
+#ifdef CONFIG_SLUB
+	create_seq_entry("slubinfo",S_IWUSR|S_IRUGO,&proc_slubinfo_operations);
+#endif
 #ifdef CONFIG_SLAB
 	create_seq_entry("slabinfo",S_IWUSR|S_IRUGO,&proc_slabinfo_operations);
 #ifdef CONFIG_DEBUG_SLAB_LEAK
Index: linux-2.6.21-rc2-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.21-rc2-mm1.orig/include/linux/mm_types.h	2007-03-06 17:59:44.000000000 -0800
+++ linux-2.6.21-rc2-mm1/include/linux/mm_types.h	2007-03-06 18:03:49.000000000 -0800
@@ -19,10 +19,16 @@ struct page {
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
 	atomic_t _count;		/* Usage count, see below. */
-	atomic_t _mapcount;		/* Count of ptes mapped in mms,
+	union {
+		atomic_t _mapcount;	/* Count of ptes mapped in mms,
 					 * to show when page is mapped
 					 * & limit reverse map ...
From: Mel Gorman
Date: Thursday, March 8, 2007 - 3:54 am

Hi Christoph,

I shoved these patches through a few tests on x86, x86_64, ia64 and ppc64 
last night to see how they got on. I enabled slub_debug to catch any 
surprises that may be creeping about.

The results are mixed.

On x86_64, it completed successfully and looked reliable. There was a 5% 
performance loss on kernbench and aim9 figures were way down. However, 
with slub_debug enabled, I would expect that, so it's not a fair comparison 
performance-wise. I'll rerun the tests without debug and see what it looks 
like if you're interested and do not think it's too early to worry about 
performance instead of clarity. This is what I have for bl6-13 (machine 
appears on test.kernel.org so additional details are there).

KernBench Comparison
--------------------
                           2.6.21-rc2-mm2-clean 2.6.21-rc2-mm2-list-based      %diff
User   CPU time                          84.32                     86.03     -2.03%
System CPU time                          32.97                     38.21    -15.89%
Total  CPU time                         117.29                    124.24     -5.93%
Elapsed    time                          34.95                     37.31     -6.75%

AIM9 Comparison
---------------
                  2.6.21-rc2-mm2-clean  2.6.21-rc2-mm2-list-based
  1 creat-clo                160706.55                   62918.54  -97788.01 -60.85% File Creations and Closes/second
  2 page_test                190371.67                  204050.99   13679.32  7.19% System Allocations & Pages/second
  3 brk_test                2320679.89                 1923512.75 -397167.14 -17.11% System Memory Allocations/second
  4 jmp_test               16391869.38                16380353.27  -11516.11 -0.07% Non-local gotos/second
  5 signal_test              492234.63                  235710.71 -256523.92 -52.11% Signal Traps/second
  6 exec_test                   232.26                     220.88     -11.38 -4.90% Program Loads/second
  7 fork_test                  4514.25             ...
From: Christoph Lameter
Date: Thursday, March 8, 2007 - 9:48 am

No, it's good to start worrying about performance now. There are still some 
performance issues to be ironed out in particular on NUMA. I am not sure

This was a single node box? Note that the 16kb page size has a major 
impact on SLUB performance. On IA64 slub will use only 1/4th the locking 

We have some additional patches here that reduce the max order for some 

Hmmm... Looks like something is zapping an object. Try to rerun with 



Someone did a kmalloc(0, ...). Zero sized allocations are not flagged

More page table trouble.

-

From: Mel Gorman
Date: Thursday, March 8, 2007 - 10:40 am

Ok, I've sent off a bunch of tests - two of which are on NUMA (numaq and
x86_64). It'll take them a long time to complete though as there is a

Yes, memory looks like this;

Zone PFN ranges:
  DMA          1024 ->   262144
  Normal     262144 ->   262144
Movable zone start PFN for each node
early_node_map[3] active PFN ranges
    0:     1024 ->    30719
    0:    32768 ->    65413
    0:    65440 ->    65505
On node 0 totalpages: 62405
Node 0 memmap at 0xe000000001126000 size 3670016 first pfn 0xe000000001134000
  DMA zone: 220 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 62185 pages, LIFO batch:7
  Normal zone: 0 pages used for memmap

It'll be interesting to see the kernbench tests then with debugging


I've queued up a few tests. One completed as I wrote this and it didn't
explode with SLAB_DEBUG set. Maybe the others will be different. I'll
kick it around for a bit.


I'll chase up what's happening here. It will be "reproducible" independent

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Christoph Lameter
Date: Thursday, March 8, 2007 - 11:16 am

You can get a similar effect on 4kb platforms by specifying slub_min_order=2 on bootup.
This means that we have to rely on your patches to allow higher order 
allocs to work reliably though. The higher the order of slub the less 
locking overhead. So the better your patches deal with fragmentation the 
more we can reduce locking overhead in slub.
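
As a rough illustration of why a higher order means less locking, assuming the per node lists are consulted about once per slab: the hypothetical helper below counts how many objects fit in one slab. The function name and the 256-byte object size used in the examples are assumptions, not taken from the patches.

```c
#include <assert.h>

/*
 * Number of objects of object_size bytes that fit into one slab of the
 * given page order. More objects per slab means more allocations can be
 * served before another slab has to be fetched.
 */
static unsigned long sketch_objects_per_slab(unsigned long page_size,
					     int order,
					     unsigned long object_size)
{
	assert(object_size > 0);
	return (page_size << order) / object_size;
}
```

For 256-byte objects, a 4K page at order 0 holds 16 objects, while order 2 holds 64, the same as a single 16K page at order 0. That matches the factor of four mentioned for IA64 earlier in the thread.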

-

From: Mel Gorman
Date: Friday, March 9, 2007 - 6:55 am

It should work out because the way buddy always selects the minimum 
page size will tend to cluster the slab allocations together, whether they 
are reclaimable or not. It's something I can investigate when slub has 
stabilised a bit.

However, in general, high order kernel allocations remain a bad idea. 
Depending on high order allocations that do not group could potentially 
lead to a situation where the movable areas are used more and more by 
kernel allocations. I cannot think of a workload that would actually break 

I can certainly kick it around a lot and see what happens. It's best that 
slub_min_order=2 remain an optional performance enhancing switch though.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Christoph Lameter
Date: Thursday, March 8, 2007 - 2:54 pm

Note that I am amazed that the kernbench even worked. On small machines I 
seem to be getting into trouble with order 1 allocations. SLAB seems to be 
able to avoid the situation by keeping higher order pages on a freelist, 
which reduces the allocs/frees of higher order pages that the page allocator
has to deal with. Maybe we need per order queues in the page allocator? 

There must be something fundamentally wrong in the page allocator if the 
SLAB queues fix this issue. I was able to fix the issue in V5 by forcing 
SLUB to keep a minimum number of objects around regardless of the fit to
a page order page. Pass through is deadly since the crappy page allocator 
cannot handle it.

Higher order page allocation failures can be avoided by using kmalloc. 
Yuck! Hopefully your patches fix that fundamental problem.

-

From: Mel Gorman
Date: Friday, March 9, 2007 - 7:00 am

How small? The machines I am testing on aren't "big" but they aren't 

That in itself is pretty incredible. From what I see, allocations up to 3 
generally work unless they are atomic even with the vanilla kernel. That 
said, it could be because slab is holding onto the high order pages for 

I'm not sure what you mean by per-order queues. The buddy allocator 

One way to find out for sure.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Christoph Lameter
Date: Friday, March 9, 2007 - 9:40 am

Somehow they do not seem to work right. SLAB (and now SLUB too) can avoid 
(or defer) fragmentation by keeping its own queues.
-

From: Mel Gorman
Date: Friday, March 9, 2007 - 8:06 am

The results without slub_debug were not good except for IA64. x86_64 and 
ppc64 both blew up for a variety of reasons. The IA64 results were

KernBench Comparison
--------------------
                           2.6.21-rc2-mm2-clean       2.6.21-rc2-mm2-slub      %diff
User   CPU time                        1084.64                   1032.93      4.77%
System CPU time                          73.38                     63.14     13.95%
Total  CPU time                        1158.02                   1096.07      5.35%
Elapsed    time                         307.00                    285.62      6.96%

AIM9 Comparison
---------------
                  2.6.21-rc2-mm2-clean        2.6.21-rc2-mm2-slub
  1 creat-clo                425460.75                  438809.64   13348.89  3.14% File Creations and Closes/second
  2 page_test               2097119.26                 3398259.27 1301140.01 62.04% System Allocations & Pages/second
  3 brk_test                7008395.33                 6728755.72 -279639.61 -3.99% System Memory Allocations/second
  4 jmp_test               12226295.31                12254966.21   28670.90  0.23% Non-local gotos/second
  5 signal_test             1271126.28                 1235510.96  -35615.32 -2.80% Signal Traps/second
  6 exec_test                   395.54                     381.18     -14.36 -3.63% Program Loads/second
  7 fork_test                 13218.23                   13211.41      -6.82 -0.05% Task Creations/second
  8 link_test                 64776.04                    7488.13  -57287.91 -88.44% Link/Unlink Pairs/second

An example console log from x86_64 is below. It's not particularly clear why 
it went blamo and I haven't had a chance all day to kick it around for a 
bit due to a variety of other hilarity floating around.

Linux version 2.6.21-rc2-mm2-autokern1 (root@bl6-13.ltc.austin.ibm.com) (gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)) #1 SMP Thu Mar 8 12:13:27 CST 2007
Command line: ro root=/dev/VolGroup00/LogVol00 rhgb console=tty0 ...
From: Christoph Lameter
Date: Friday, March 9, 2007 - 10:21 am

Yuck that is the dst issue that Adrian is also looking at. Likely an issue 



Crap. Maybe we straddled a slab boundary here?
-

From: Christoph Lameter
Date: Thursday, March 8, 2007 - 10:46 am

Lower bits must be clear right? Looks like the pud was released
and then reused for a 64 byte cache or so. This is likely a freelist 
pointer that slub put there after allocating the page for the 64 byte 

Data overwritten after free or after slab was allocated. So this may be 
the same issue. pud was zapped after it was freed destroying the poison 
of another object in the 64 byte cache.

Hmmm.. Maybe I should put the pad checks before the object checks. 
That way we detect that the whole slab was corrupted and do not flag just 
a single object.

-
