Re: [PATCH 12/30] mm: memory reserve management

To: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.zijlstra@...>, Neil Brown <neilb@...>
Date: Thursday, July 24, 2008 - 10:00 am

Generic reserve management code.

It provides methods to reserve and charge. Upon this, generic alloc/free style
reserve pools could be built, which could fully replace mempool_t
functionality.

It should also allow for a Banker's algorithm replacement of __GFP_NOFAIL.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/reserve.h | 146 +++++++++++
include/linux/slab.h | 20 -
mm/Makefile | 2
mm/reserve.c | 594 ++++++++++++++++++++++++++++++++++++++++++++++++
mm/slub.c | 4
5 files changed, 755 insertions(+), 11 deletions(-)

Index: linux-2.6/include/linux/reserve.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/reserve.h
@@ -0,0 +1,146 @@
+/*
+ * Memory reserve management.
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_RESERVE_H
+#define _LINUX_RESERVE_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+
+struct mem_reserve {
+	struct mem_reserve *parent;
+	struct list_head children;
+	struct list_head siblings;
+
+	const char *name;
+
+	long pages;
+	long limit;
+	long usage;
+	spinlock_t lock;	/* protects limit and usage */
+
+	wait_queue_head_t waitqueue;
+};
+
+extern struct mem_reserve mem_reserve_root;
+
+void mem_reserve_init(struct mem_reserve *res, const char *name,
+		      struct mem_reserve *parent);
+int mem_reserve_connect(struct mem_reserve *new_child,
+			struct mem_reserve *node);
+void mem_reserve_disconnect(struct mem_reserve *node);
+
+int mem_reserve_pages_set(struct mem_reserve *res, long pages);
+int mem_reserve_pages_add(struct mem_reserve *res, long pages);
+int mem_reserve_pages_charge(struct mem_reserve *res, long pages);
+
+int mem_reserve_kmalloc_set(struct mem_r...
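
The patch body is cut off here, so the real charge logic is not visible. As a rough illustration of what "reserve and charge" could mean, here is a minimal userspace sketch. Everything in it is an assumption: struct reserve_sketch and reserve_sketch_charge() are hypothetical names, there is no locking, no parent/child propagation and no waitqueue, and the only convention borrowed from the thread is that a charge returns non-zero on success and a negative amount uncharges.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct mem_reserve; no locking. */
struct reserve_sketch {
	long limit;	/* most that may be charged at any one time */
	long usage;	/* amount currently charged */
};

/*
 * Charge (positive amount) or uncharge (negative amount) the reserve.
 * Returns true on success, false if a charge would exceed the limit.
 * Uncharging always succeeds.
 */
static bool reserve_sketch_charge(struct reserve_sketch *res, long charge)
{
	if (charge > 0 && res->usage + charge > res->limit)
		return false;

	res->usage += charge;
	assert(res->usage >= 0);	/* uncharged more than was charged */
	return true;
}
```

With a limit of 4, charging 3 succeeds, a further charge of 2 is refused and leaves usage at 3, and charging -3 returns the reserve to empty.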

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <mpm@...>, <cl@...>
Date: Monday, July 28, 2008 - 6:06 am

Hi Peter,

Hmm, I'm not sure I like the use of __kmalloc_track_caller() (even
though you do add the wrappers for SLUB). The functions really are SLAB
internals so I'd prefer to see kmalloc_reserve() moved to the

If the allocation fails, we try again (but nothing has changed, right?).

We're trying to get rid of kfree() so I'd __kfree_reserve() could to

--

To: Pekka Enberg <penberg@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <cl@...>
Date: Monday, July 28, 2008 - 12:49 pm

I think you mean ksize there. My big issue is that we need to make it
clear that ksize pairs -only- with kmalloc and that
ksize(kmem_cache_alloc(...)) is a categorical error. Preferably, we do
this by giving it a distinct name, like kmalloc_size(). We can stick an

SLOB doesn't do this, of course. But does that matter? I think you want
to charge the actual allocation size to the reserve in all cases, no?
That probably means calling ksize() on both alloc and free.
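
A userspace analogy of charging the actual allocation size on both alloc and free, as a sketch only: glibc's malloc_usable_size() stands in for ksize(), and charged_bytes, alloc_charged() and free_charged() are hypothetical names that appear nowhere in the patch.

```c
#include <assert.h>
#include <malloc.h>	/* malloc_usable_size(), glibc-specific */
#include <stdlib.h>

/* Hypothetical stand-in for the reserve's usage counter. */
static size_t charged_bytes;

/* Allocate and charge what the allocator really handed out. */
static void *alloc_charged(size_t size)
{
	void *p = malloc(size);

	if (p)
		charged_bytes += malloc_usable_size(p);
	return p;
}

/* Uncharge the same usable size that was charged, then free. */
static void free_charged(void *p)
{
	size_t usable;

	if (!p)
		return;

	usable = malloc_usable_size(p);
	assert(charged_bytes >= usable);
	charged_bytes -= usable;
	free(p);
}
```

Because both sides query the allocator for the object's size, the counter returns to zero after a matching free no matter how far the request was rounded up.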

--
Mathematics is the supreme nostalgia of our time.

--

To: Matt Mackall <mpm@...>
Cc: Pekka Enberg <penberg@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <cl@...>
Date: Monday, July 28, 2008 - 1:13 pm

As I said, I still need to do all the SLOB reservation stuff. That
includes coming up with an upper bound on fragmentation loss.

For SL[UA]B I use roundup_power_of_two for kmalloc sizes. Thus with the
above ksize(), if we did p=kmalloc(x), then we'd account
roundup_power_of_two(x), and that should be equal to
roundup_power_of_two(ksize(p)), as ksize will always be smaller than or
equal to the roundup.
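
The claim can be checked in userspace with a small sketch. The assumption, taken from the paragraph above, is that x <= ksize(p) <= roundup(x). roundup_pow2_sketch() is a hypothetical stand-in for the kernel helper (spelled roundup_pow_of_two() in the tree), and accounting_matches() is likewise made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two >= n; userspace stand-in, requires n >= 1. */
static size_t roundup_pow2_sketch(size_t n)
{
	size_t p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * For every request size x in [1, max] and every possible usable size k
 * with x <= k <= roundup(x), check that rounding k up gives the same
 * amount as rounding x up. Returns 1 if that always holds, 0 otherwise.
 */
static int accounting_matches(size_t max)
{
	for (size_t x = 1; x <= max; x++) {
		size_t top = roundup_pow2_sketch(x);

		for (size_t k = x; k <= top; k++)
			if (roundup_pow2_sketch(k) != top)
				return 0;
	}
	return 1;
}
```

So charging roundup(x) at allocation time and uncharging roundup(ksize(p)) at free time balance out, provided the allocator never returns an object larger than the rounded-up request.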

I'm guessing the power of two upper bound is good for SLOB too -
although I haven't tried proving it wrong or tightening it.

Only the kmem_cache_* reservation stuff would need some extra attention
with SLOB.

--

To: Pekka Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <mpm@...>, <cl@...>
Date: Monday, July 28, 2008 - 6:17 am

Yes, my latest does have those.. let me paste the relevant bit:

+void *___kmalloc_reserve(size_t size, gfp_t flags, int node, void *ip,
+			 struct mem_reserve *res, int *emerg)
+{
+	void *obj;
+	gfp_t gfp;
+
+	/*
+	 * Try a regular allocation, when that fails and we're not entitled
+	 * to the reserves, fail.
+	 */
+	gfp = flags | __GFP_NOMEMALLOC | __GFP_NOWARN;
+	obj = __kmalloc_node_track_caller(size, gfp, node, ip);
+
+	if (obj || !(gfp_to_alloc_flags(flags) & ALLOC_NO_WATERMARKS))
+		goto out;
+
+	/*
+	 * If we were given a reserve to charge against, try that.
+	 */
+	if (res && !mem_reserve_kmalloc_charge(res, size)) {
+		/*
+		 * If we failed to charge and we're not allowed to wait for
+		 * it to succeed, bail.
+		 */
+		if (!(flags & __GFP_WAIT))
+			goto out;
+
+		/*
+		 * Wait for a successful charge against the reserve. All
+		 * uncharge operations against this reserve will wake us up.
+		 */
+		wait_event(res->waitqueue,
+			   mem_reserve_kmalloc_charge(res, size));
+
+		/*
+		 * After waiting for it, again try a regular allocation.
+		 * Pressure could have lifted during our sleep. If this
+		 * succeeds, uncharge the reserve.
+		 */
+		obj = __kmalloc_node_track_caller(size, gfp, node, ip);
+		if (obj) {
+			mem_reserve_kmalloc_charge(res, -size);
+			goto out;
+		}
+	}
+
+	/*
+	 * Regular allocation failed, and we've successfully charged our
+	 * requested usage against the reserve. Do the emergency allocation.
+	 */
+	obj = __kmalloc_node_track_caller(size, fla...

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <mpm@...>, <cl@...>
Date: Monday, July 28, 2008 - 6:29 am

Hi Peter,

But if it *does* fail, it doesn't help that we mess up the reservation

Right, I guess we could just rename ksize() to something else then and

--

To: Pekka Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <mpm@...>, <cl@...>
Date: Monday, July 28, 2008 - 6:39 am

That would be nice - we can stuff it into mm/internal.h or somesuch.

Also, you might have noticed, I still need to do everything SLOB. The
last time I rewrote all this code I was still hoping Linux would 'soon'
have a single slab allocator, but evidently we're still going with 3 for
now.. :-/

So I guess I can no longer hide behind that and will have to bite the

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Pekka Enberg <penberg@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <cl@...>
Date: Monday, July 28, 2008 - 12:59 pm

I haven't seen the rest of this thread, but I presume this is part of
your OOM-avoidance for network I/O framework?

SLOB can be pretty easily expanded to handle a notion of independent
allocation arenas as there are only a couple global variables to switch
between. kfree will also return allocations to the page list (and
therefore arena) from whence they came. That may make it pretty simple
to create and prepopulate reserve pools.

--
Mathematics is the supreme nostalgia of our time.

--

To: Matt Mackall <mpm@...>
Cc: Pekka Enberg <penberg@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <cl@...>
Date: Monday, July 28, 2008 - 1:13 pm

Right - currently we let all the reserves sit on the free page list. The
advantage there is that it also helps the anti-frag stuff, due to having
larger free lists.

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>, <netdev@...>, <trond.myklebust@...>, Daniel Lezcano <dlezcano@...>, Neil Brown <neilb@...>, <mpm@...>, <cl@...>
Date: Monday, July 28, 2008 - 6:41 am

Hi Peter,

Oh, I don't expect SLOB to go away anytime soon. We are still trying to
get rid of SLAB, but there are some TPC regressions that we don't have
a reproducible test case for, so that effort has stalled a bit.

Pekka

--
