[S+Q3 13/23] slub: Move gfpflag masking out of the hotpath

Previous thread: [S+Q3 17/23] slub: Remove MAX_OBJS limitation by Christoph Lameter on Tuesday, August 3, 2010 - 7:45 pm. (1 message)

Next thread: [PATCH] ARM: uaccess: Implement strict user copy checks by Stephen Boyd on Tuesday, August 3, 2010 - 8:02 pm. (30 messages)
From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

The following is a first release of an allocator based on SLAB
and SLUB that integrates the best approaches from both allocators. The
per cpu queuing is like the two prior releases. The NUMA facilities
were much improved vs V2. Shared and alien cache support was added to
track the cache hot state of objects. 

After this patches SLUB will track the cpu cache contents
like SLAB attemped to. There are a number of architectural differences:

1. SLUB accurately tracks cpu caches instead of assuming that there
   is only a single cpu cache per node or system.

2. SLUB object expiration is tied into the page reclaim logic. There
   is no periodic cache expiration.

3. SLUB caches are dynamically configurable via the sysfs filesystem.

4. There is no per slab page metadata structure to maintain (aside
   from the object bitmap that usually fits into the page struct).

5. Keeps all the other good features of SLUB as well.

SLUB+Q is a merging of SLUB with some queuing concepts from SLAB and a
new way of managing objects in the slabs using bitmaps. It uses a percpu
queue so that free operations can be properly buffered and a bitmap for
managing the free/allocated state in the slabs. It is slightly more
inefficient than SLUB (due to the need to place large bitmaps --sized
a few words--in some slab pages if there are more than BITS_PER_LONG
objects in a slab) but in general does not increase space use too much.

The SLAB scheme of not touching the object during management is adopted.
SLUB+Q can efficiently free and allocate cache cold objects without
causing cache misses.

I have had limited time for benchmarking this release so far since I
was more focused on getting SLAB features merged in and making it
work reliably with all the usual SLUB bells and whistles. The queueing
scheme from the SLUB+Q V1/V2 releases was not changed so that the basic
SMP performance is still the same. V1 and V2 did not have NUMA clean
queues and therefore the performance on NUMA system was not ...
From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

count_free() == available()

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-30 18:44:54.767739966 -0500
+++ linux-2.6/mm/slub.c	2010-07-30 18:45:24.248349179 -0500
@@ -1697,11 +1697,6 @@
 	return 1;
 }
 
-static int count_free(struct page *page)
-{
-	return available(page);
-}
-
 static unsigned long count_partial(struct kmem_cache_node *n,
 					int (*get_count)(struct page *))
 {
@@ -1750,7 +1745,7 @@
 		if (!n)
 			continue;
 
-		nr_free  = count_partial(n, count_free);
+		nr_free  = count_partial(n, available);
 		nr_slabs = node_nr_slabs(n);
 		nr_objs  = node_nr_objs(n);
 
@@ -3906,7 +3901,7 @@
 			x = atomic_long_read(&n->total_objects);
 		else if (flags & SO_OBJECTS)
 			x = atomic_long_read(&n->total_objects) -
-				count_partial(n, count_free);
+				count_partial(n, available);
 
 			else
 				x = atomic_long_read(&n->nr_slabs);
@@ -4792,7 +4787,7 @@
 		nr_partials += n->nr_partial;
 		nr_slabs += atomic_long_read(&n->nr_slabs);
 		nr_objs += atomic_long_read(&n->total_objects);
-		nr_free += count_partial(n, count_free);
+		nr_free += count_partial(n, available);
 	}
 
 	nr_inuse = nr_objs - nr_free;

--

From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

Allow resizing of cpu queue and batch size. This is done in the
basic steps that are also followed by SLAB.

Careful: The ->cpu pointer is becoming volatile. References
to the ->cpu pointer either

A. Occur with interrupts disabled. This guarantees that nothing on the
   processor itself interferes. This only serializes access to a single
   processor specific area.

B. Occur with slub_lock taken for operations on all per cpu areas.
   Taking the slub_lock guarantees that no resizing operation will occur
   while accessing the percpu areas. The data in the percpu areas
   is volatile even with slub_lock since the alloc and free functions
   do not take slub_lock and will operate on fields of kmem_cache_cpu.

C. Are racy: Tolerable for statistics. The ->cpu pointer must always
   point to a valid kmem_cache_cpu area.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    9 -
 mm/slub.c                |  218 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 197 insertions(+), 30 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-31 18:25:53.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-31 19:02:05.003563067 -0500
@@ -195,10 +195,19 @@
 
 #endif
 
+/*
+ * We allow stat calls while slub_lock is taken or while interrupts
+ * are enabled for simplicities sake.
+ *
+ * This results in potential inaccuracies. If the platform does not
+ * support per cpu atomic operations vs. interrupts then the counters
+ * may be updated in a racy manner due to slab processing in
+ * interrupts.
+ */
 static inline void stat(struct kmem_cache *s, enum stat_item si)
 {
 #ifdef CONFIG_SLUB_STATS
-	__this_cpu_inc(s->cpu_slab->stat[si]);
+	__this_cpu_inc(s->cpu->stat[si]);
 #endif
 }
 
@@ -303,7 +312,7 @@
 
 static inline int queue_full(struct kmem_cache_queue *q)
 {
-	return q->objects == QUEUE_SIZE;
+	return q->objects == ...
From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

kmalloc caches are statically defined and may take up a lot of space just
because the sizes of the node array has to be dimensioned for the largest
node count supported.

This patch makes the size of the kmem_cache structure dynamic throughout by
creating a kmem_cache slab cache for the kmem_cache objects. The bootstrap
occurs by allocating the initial one or two kmem_cache objects from the
page allocator.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/slub_def.h |    7 -
 mm/slub.c                |  181 +++++++++++++++++++++++++++++++++++------------
 2 files changed, 139 insertions(+), 49 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2010-07-26 14:25:16.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2010-07-26 14:26:24.000000000 -0500
@@ -136,19 +136,16 @@ struct kmem_cache {
 
 #ifdef CONFIG_ZONE_DMA
 #define SLUB_DMA __GFP_DMA
-/* Reserve extra caches for potential DMA use */
-#define KMALLOC_CACHES (2 * SLUB_PAGE_SHIFT)
 #else
 /* Disable DMA functionality */
 #define SLUB_DMA (__force gfp_t)0
-#define KMALLOC_CACHES SLUB_PAGE_SHIFT
 #endif
 
 /*
  * We keep the general caches in an array of slab caches that are used for
  * 2^x bytes of allocations.
  */
-extern struct kmem_cache kmalloc_caches[KMALLOC_CACHES];
+extern struct kmem_cache *kmalloc_caches[SLUB_PAGE_SHIFT];
 
 /*
  * Sorry that the following has to be that ugly but some versions of GCC
@@ -213,7 +210,7 @@ static __always_inline struct kmem_cache
 	if (index == 0)
 		return NULL;
 
-	return &kmalloc_caches[index];
+	return kmalloc_caches[index];
 }
 
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-26 14:26:20.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-26 14:26:24.000000000 -0500
@@ -179,7 ...
From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

Extract the code that memory checkers and other verification tools use from
the hotpaths. Makes it easier to add new ones and reduces the disturbances
of the hotpaths.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   49 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-26 14:26:24.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-26 14:26:33.000000000 -0500
@@ -793,6 +793,37 @@ static void trace(struct kmem_cache *s, 
 }
 
 /*
+ * Hooks for other subsystems that check memory allocations. In a typical
+ * production configuration these hooks all should produce no code at all.
+ */
+static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
+{
+	lockdep_trace_alloc(flags);
+	might_sleep_if(flags & __GFP_WAIT);
+
+	return should_failslab(s->objsize, flags, s->flags);
+}
+
+static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, void *object)
+{
+	kmemcheck_slab_alloc(s, flags, object, s->objsize);
+	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, flags);
+}
+
+static inline void slab_free_hook(struct kmem_cache *s, void *x)
+{
+	kmemleak_free_recursive(x, s->flags);
+}
+
+static inline void slab_free_hook_irq(struct kmem_cache *s, void *object)
+{
+	kmemcheck_slab_free(s, object, s->objsize);
+	debug_check_no_locks_freed(object, s->objsize);
+	if (!(s->flags & SLAB_DEBUG_OBJECTS))
+		debug_check_no_obj_freed(object, s->objsize);
+}
+
+/*
  * Tracking of fully allocated slabs for debugging purposes.
  */
 static void add_full(struct kmem_cache_node *n, struct page *page)
@@ -1698,10 +1729,7 @@ static __always_inline void *slab_alloc(
 
 	gfpflags &= gfp_allowed_mask;
 
-	lockdep_trace_alloc(gfpflags);
-	might_sleep_if(gfpflags & __GFP_WAIT);
-
-	if (should_failslab(s->objsize, gfpflags, s->flags))
+	if ...
From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

From: Tejun Heo <tj@kernel.org>

In pcpu_build_alloc_info() and pcpu_embed_first_chunk(), @dyn_size was
ssize_t, -1 meant auto-size, 0 forced 0 and positive meant minimum
size.  There's no use case for forcing 0 and the upcoming early alloc
support always requires non-zero dynamic size.  Make @dyn_size always
mean minimum dyn_size.

While at it, make pcpu_build_alloc_info() static which doesn't have
any external caller as suggested by David Rientjes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Index: work/include/linux/percpu.h
===================================================================
--- work.orig/include/linux/percpu.h
+++ work/include/linux/percpu.h
@@ -104,16 +104,11 @@ extern struct pcpu_alloc_info * __init p
 							     int nr_units);
 extern void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai);

-extern struct pcpu_alloc_info * __init pcpu_build_alloc_info(
-				size_t reserved_size, ssize_t dyn_size,
-				size_t atom_size,
-				pcpu_fc_cpu_distance_fn_t cpu_distance_fn);
-
 extern int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
 					 void *base_addr);

 #ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
-extern int __init pcpu_embed_first_chunk(size_t reserved_size, ssize_t dyn_size,
+extern int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
 				size_t atom_size,
 				pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
 				pcpu_fc_alloc_fn_t alloc_fn,
Index: work/mm/percpu.c
===================================================================
--- work.orig/mm/percpu.c
+++ work/mm/percpu.c
@@ -1013,20 +1013,6 @@ phys_addr_t per_cpu_ptr_to_phys(void *ad
 		return page_to_phys(pcpu_addr_to_page(addr));
 }

-static inline size_t pcpu_calc_fc_sizes(size_t static_size,
-					size_t reserved_size,
-					ssize_t *dyn_sizep)
-{
-	size_t size_sum;
-
-	size_sum = PFN_ALIGN(static_size + reserved_size +
-			     ...
From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

Move the gfpflags masking into the hooks for checkers and into the slowpaths.
gfpflag masking requires access to a global variable and thus adds an
additional cacheline reference to the hotpaths.

If no hooks are active then the gfpflag masking will result in
code that the compiler can toss out.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-26 14:26:33.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-26 14:26:47.000000000 -0500
@@ -798,6 +798,7 @@ static void trace(struct kmem_cache *s, 
  */
 static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
 {
+	flags &= gfp_allowed_mask;
 	lockdep_trace_alloc(flags);
 	might_sleep_if(flags & __GFP_WAIT);
 
@@ -806,6 +807,7 @@ static inline int slab_pre_alloc_hook(st
 
 static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags, void *object)
 {
+	flags &= gfp_allowed_mask;
 	kmemcheck_slab_alloc(s, flags, object, s->objsize);
 	kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, flags);
 }
@@ -1679,6 +1681,7 @@ new_slab:
 		goto load_freelist;
 	}
 
+	gfpflags &= gfp_allowed_mask;
 	if (gfpflags & __GFP_WAIT)
 		local_irq_enable();
 
@@ -1727,8 +1730,6 @@ static __always_inline void *slab_alloc(
 	struct kmem_cache_cpu *c;
 	unsigned long flags;
 
-	gfpflags &= gfp_allowed_mask;
-
 	if (!slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 

--

From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

Serialize kmem_cache_create and kmem_cache_destroy using the slub_lock. Only
possible after the use of the slub_lock during dynamic dma creation has been
removed.

Then make sure that the setup of the slab sysfs entries does not race
with kmem_cache_create and kmem_cache destroy.

If a slab cache is removed before we have setup sysfs then simply skip over
the sysfs handling.

V1->V2:
- Do proper synchronization to address race conditions

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |   24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-27 22:51:41.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-27 22:51:43.000000000 -0500
@@ -2482,7 +2482,6 @@ void kmem_cache_destroy(struct kmem_cach
 	s->refcount--;
 	if (!s->refcount) {
 		list_del(&s->list);
-		up_write(&slub_lock);
 		if (kmem_cache_close(s)) {
 			printk(KERN_ERR "SLUB %s: %s called for cache that "
 				"still has objects.\n", s->name, __func__);
@@ -2491,8 +2490,8 @@ void kmem_cache_destroy(struct kmem_cach
 		if (s->flags & SLAB_DESTROY_BY_RCU)
 			rcu_barrier();
 		sysfs_slab_remove(s);
-	} else
-		up_write(&slub_lock);
+	}
+	up_write(&slub_lock);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
@@ -3147,14 +3146,12 @@ struct kmem_cache *kmem_cache_create(con
 		 */
 		s->objsize = max(s->objsize, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
-		up_write(&slub_lock);
 
 		if (sysfs_slab_alias(s, name)) {
-			down_write(&slub_lock);
 			s->refcount--;
-			up_write(&slub_lock);
 			goto err;
 		}
+		up_write(&slub_lock);
 		return s;
 	}
 
@@ -3163,14 +3160,12 @@ struct kmem_cache *kmem_cache_create(con
 		if (kmem_cache_open(s, name,
 				size, align, flags, ctor)) {
 			list_add(&s->list, ...
From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

The cacheline with the flags is reachable from the hot paths after the
percpu allocator changes went in. So there is no need anymore to put a
flag into each slab page. Get rid of the SlubDebug flag and use
the flags in kmem_cache instead.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 include/linux/page-flags.h |    2 --
 mm/slub.c                  |   33 ++++++++++++---------------------
 2 files changed, 12 insertions(+), 23 deletions(-)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2010-07-28 12:03:19.000000000 -0500
+++ linux-2.6/include/linux/page-flags.h	2010-07-28 12:44:57.000000000 -0500
@@ -128,7 +128,6 @@ enum pageflags {
 
 	/* SLUB */
 	PG_slub_frozen = PG_active,
-	PG_slub_debug = PG_error,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -215,7 +214,6 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEAR
 __PAGEFLAG(SlobFree, slob_free)
 
 __PAGEFLAG(SlubFrozen, slub_frozen)
-__PAGEFLAG(SlubDebug, slub_debug)
 
 /*
  * Private page markings that may be used by the filesystem that owns the page
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-28 12:44:56.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-28 12:44:57.000000000 -0500
@@ -107,11 +107,17 @@
  * 			the fast path and disables lockless freelists.
  */
 
+#define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
+		SLAB_TRACE | SLAB_DEBUG_FREE)
+
+static inline int kmem_cache_debug(struct kmem_cache *s)
+{
 #ifdef CONFIG_SLUB_DEBUG
-#define SLABDEBUG 1
+	return unlikely(s->flags & SLAB_DEBUG_FLAGS);
 #else
-#define SLABDEBUG 0
+	return 0;
 #endif
+}
 
 /*
  * Issues still to be resolved:
@@ -1157,9 +1163,6 @@ static struct page *new_slab(struct kmem
 	inc_slabs_node(s, page_to_nid(page), page->objects);
 	page->slab = s;
 ...
From: Christoph Lameter
Date: Tuesday, August 3, 2010 - 7:45 pm

Remove the dynamic dma slab allocation since this causes too many issues with
nested locks etc etc. The change avoids passing gfpflags into many functions.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 mm/slub.c |  151 ++++++++++++++++----------------------------------------------
 1 file changed, 40 insertions(+), 111 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-07-27 22:51:36.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-07-27 22:51:36.000000000 -0500
@@ -2065,7 +2065,7 @@ init_kmem_cache_node(struct kmem_cache_n
 
 static DEFINE_PER_CPU(struct kmem_cache_cpu, kmalloc_percpu[KMALLOC_CACHES]);
 
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
+static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 {
 	if (s < kmalloc_caches + KMALLOC_CACHES && s >= kmalloc_caches)
 		/*
@@ -2092,7 +2092,7 @@ static inline int alloc_kmem_cache_cpus(
  * when allocating for the kmalloc_node_cache. This is used for bootstrapping
  * memory on a fresh node that has no slab structures yet.
  */
-static void early_kmem_cache_node_alloc(gfp_t gfpflags, int node)
+static void early_kmem_cache_node_alloc(int node)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
@@ -2100,7 +2100,7 @@ static void early_kmem_cache_node_alloc(
 
 	BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));
 
-	page = new_slab(kmalloc_caches, gfpflags, node);
+	page = new_slab(kmalloc_caches, GFP_KERNEL, node);
 
 	BUG_ON(!page);
 	if (page_to_nid(page) != node) {
@@ -2144,7 +2144,7 @@ static void free_kmem_cache_nodes(struct
 	}
 }
 
-static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
+static int init_kmem_cache_nodes(struct kmem_cache *s)
 {
 	int node;
 
@@ -2152,11 +2152,11 @@ static int init_kmem_cache_nodes(struct 
 		struct kmem_cache_node *n;
 
 		if (slab_state == DOWN) {
-			early_kmem_cache_node_alloc(gfpflags, ...
From: David Rientjes
Date: Tuesday, August 3, 2010 - 9:39 pm

This insta-reboots on my netperf benchmarking servers (but works with 
numa=off), so I'll have to wait until I can hook up a serial before 
benchmarking this series.
--

From: Christoph Lameter
Date: Wednesday, August 4, 2010 - 9:17 am

There are potential issues with

1. The size of per cpu reservation on bootup and the new percpu code that
allows allocations for per cpu areas during bootup. Sometime I wonder if I
should just go back to static allocs for that.

2. The topology information provided by the machine for the cache setup.

3. My code of course.

Bootlog would be appreciated.

--

From: David Rientjes
Date: Thursday, August 5, 2010 - 1:38 am

I bisected this to patch 8 but still don't have a bootlog.  I'm assuming 
in the meantime that something is kmallocing DMA memory on this machine 
prior to kmem_cache_init_late() and get_slab() is returning a NULL 
pointer.
--

From: Christoph Lameter
Date: Thursday, August 5, 2010 - 10:33 am

There is a kernel option "earlyprintk=..." that allows you to see early
boot messages.

If this indeed is a problem with the DMA caches then try the following
patch:



Subject: slub: Move dma cache initialization up

Do dma kmalloc initialization in kmem_cache_init and not in kmem_cache_init_late()

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-08-05 12:24:21.000000000 -0500
+++ linux-2.6/mm/slub.c	2010-08-05 12:28:58.000000000 -0500
@@ -3866,13 +3866,8 @@ void __init kmem_cache_init(void)
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
 #endif
-}

-void __init kmem_cache_init_late(void)
-{
 #ifdef CONFIG_ZONE_DMA
-	int i;
-
 	/* Create the dma kmalloc array and make it operational */
 	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
 		struct kmem_cache *s = kmalloc_caches[i];
@@ -3891,6 +3886,10 @@ void __init kmem_cache_init_late(void)
 #endif
 }

+void __init kmem_cache_init_late(void)
+{
+}
+
 /*
  * Find a mergeable slab cache
  */
--

From: David Rientjes
Date: Monday, August 16, 2010 - 9:56 pm

Ok, so this is panicking because of the error handling when trying to 
create sysfs directories with the same name (in this case, :dt-0000064).  
I'll look into while this isn't failing gracefully later, but I isolated 
this to the new code that statically allocates the DMA caches in 
kmem_cache_init_late().

The iteration runs from 0 to SLUB_PAGE_SHIFT; that's actually incorrect 
since the kmem_cache_node cache occupies the first spot in the 
kmalloc_caches array and has a size, 64 bytes, equal to a power of two 
that is duplicated later.  So this patch tries creating two DMA kmalloc 
caches with 64 byte object size which triggers a BUG_ON() during 
kmem_cache_release() in the error handling later.

The fix is to start the iteration at 1 instead of 0 so that all other 
caches have their equivalent DMA caches created and the special-case 
kmem_cache_node cache is excluded (see below).

I'm really curious why nobody else ran into this problem before, 
especially if they have CONFIG_SLUB_DEBUG enabled so 
struct kmem_cache_node has the same size.  Perhaps my early bug report 
caused people not to test the series...

I'm adding Tejun Heo to the cc because of another thing that may be 
problematic: alloc_percpu() allocates GFP_KERNEL memory, so when we try to 
allocate kmem_cache_cpu for a DMA cache we may be returning memory from a 
node that doesn't include lowmem so there will be no affinity between the 
struct and the slab.  I'm wondering if it would be better for the percpu 
allocator to be extended for kzalloc_node(), or vmalloc_node(), when 
allocating memory after the slab layer is up.

There're a couple more issues with the patch as well:

 - the entire iteration in kmem_cache_init_late() needs to be protected by 
   slub_lock.  The comment in create_kmalloc_cache() should be revised 
   since you're no longer calling it only with irqs disabled.  
   kmem_cache_init_late() has irqs enabled and, thus, slab_caches must be 
   protected.

 - a BUG_ON(!name) needs to ...
From: Tejun Heo
Date: Tuesday, August 17, 2010 - 12:55 am

Hello,


Hmmm... do you mean adding @gfp_mask to percpu allocation function?
I've been thinking about adding it for atomic allocations (Christoph,
do you still want it?).  I've been sort of against it because I
primarily don't really like atomic allocations (it often just pushes
error handling complexities elsewhere where it becomes more complex)
and it would also require making vmalloc code do atomic allocations.

Most of percpu use cases seem pretty happy with GFP_KERNEL allocation,
so I'm still quite reluctant to change that.  We can add a semi
internal interface w/ @gfp_mask but w/o GFP_ATOMIC support, which is a
bit ugly.  How important would this be?

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Tuesday, August 17, 2010 - 6:56 am

DMA caches may only exist on certain nodes because others do not have a
DMA zone. Their role is quite limited these days. DMA caches allocated on
nodes without DMA zones would have their percpu area allocated on the node
but the DMA allocations would be redirected to the closest node with DMA

At this point I would think that we do not need that support.

--

From: Christoph Lameter
Date: Tuesday, August 17, 2010 - 10:23 am

The kmem_cache_node cache is no longer at position 0.


I moved it to kmem_cache_init() which is run when we only have one
execution thread. That takes care of the issue and ensures that the dma

Ok.

--

From: Christoph Lameter
Date: Tuesday, August 17, 2010 - 10:29 am

If you do not apply all patches then you can be at a stage were
kmalloc_caches[0] is still used for kmem_cache_node. Then things break.
--

From: David Rientjes
Date: Tuesday, August 17, 2010 - 11:02 am

They do after patch 11 when you introduce dynamically sized kmalloc 
caches, but not after only patches 1-8 were applied.  Since this wasn't 
booting on my system, I bisected the problem to patch 8 where 
kmem_cache_init_late() would create two DMA caches of size 64 bytes: one 
becauses of kmalloc_caches[0] (kmem_cache_node) and one because of 
kmalloc_caches[6] (2^6 = 64).  So my fixes are necessary for patch 8 but 

I didn't know if that was a debugging patch for me or if you wanted to 
push that as part of your series, I'm not sure if you actually need to 
move it to kmem_cache_init() now that slub_state is protected by 
slub_lock.  I'm not sure if we want to allocate DMA objects between 
kmem_cache_init() and kmem_cache_init_late().
--

From: Christoph Lameter
Date: Tuesday, August 17, 2010 - 11:47 am

Drivers may allocate dma buffers during initialization.

--

From: David Rientjes
Date: Tuesday, August 17, 2010 - 11:54 am

Ok, I moved the DMA cache creation from kmem_cache_init_late() to 
kmem_cache_init().  Note: the kasprintf() will need to use GFP_NOWAIT and 
not GFP_KERNEL now.
--

From: Christoph Lameter
Date: Tuesday, August 17, 2010 - 12:34 pm

ok. I have revised the patch since there is also a problem with the
indirection on kmalloc_caches.


--

Previous thread: [S+Q3 17/23] slub: Remove MAX_OBJS limitation by Christoph Lameter on Tuesday, August 3, 2010 - 7:45 pm. (1 message)

Next thread: [PATCH] ARM: uaccess: Implement strict user copy checks by Stephen Boyd on Tuesday, August 3, 2010 - 8:02 pm. (30 messages)