Since 2.6.25-rc7, I've been seeing an occasional livelock on one
x86_64 machine, copying kernel trees to tmpfs, paging out to swap.Signature: 6000 pages under writeback but never getting written;
most tasks of interest trying to reclaim, but each get_swap_bio
waiting for a bio in mempool_alloc's io_schedule_timeout(5*HZ);
every five seconds an atomic page allocation failure report from
kblockd failing to allocate a sense_buffer in __scsi_get_command.__scsi_get_command has a (one item) free_list to protect against
this, but rc1's [SCSI] use dynamically allocated sense buffer
de25deb18016f66dcdede165d07654559bb332bc upset that slightly.
When it fails to allocate from the separate sense_slab, instead
of giving up, it must fall back to the command free_list, which
is sure to have a sense_buffer attached.Either my earlier -rc testing missed this, or there's some recent
contributory factor. One very significant factor is SLUB, which
merges slab caches when it can, and on 64-bit happens to merge
both bio cache and sense_slab cache into kmalloc's 128-byte cache:
so that under this swapping load, bios above are liable to gobble
up all the slots needed for scsi_cmnd sense_buffers below.That's disturbing behaviour, and I tried a few things to fix it.
Adding a no-op constructor to the sense_slab inhibits SLUB from
merging it, and stops all the allocation failures I was seeing;
but it's rather a hack, and perhaps in different configurations
we have other caches on the swapout path which are ill-merged.Another alternative is to revert the separate sense_slab, using
cache-line-aligned sense_buffer allocated beyond scsi_cmnd from
the one kmem_cache; but that might waste more memory, and is
only a way of diverting around the known problem.While I don't like seeing the allocation failures, and hate the
idea of all those bios piled up above a scsi host working one by
one, it does seem to emerge fairly soon with the livelock fix.
So lacking better ideas, stick with that one clear fix...
A reliance on free slots that the slab allocator may provide? That is a
rather bad dependency since it is up to the slab allocator to implement
the storage layout for the objects and thus the availability of slots may
vary depending on the layout for the objects chosen by the allocator.Looking at mempool_alloc: Mempools may be used to do atomic allocations
until they fail thereby exhausting reserves and available object in the
partial lists of slab caches?In order to make this a significant factor we need to have already
exhausted reserves right? Thus we are already operating at the boundary of
what memory there is. Any non atomic alloc will then allocate a new page
with N elements in order to get one object. The mempool_allocs from the
atomic context will then gooble up the N-1 remaining objects? So the
nonatomic alloc will then have to hit the page allocator again...--
I'm not sure that I understand you. Yes, different slab allocators
may lay out slots differently. But a significant departure from
existing behaviour may be a bad idea in some circumstances.Mempools may be used for atomic allocations, but I think that's not
the case here. swap_writepage's get_swap_bio says GFP_NOIO, which
allows (indeed is) __GFP_WAIT, and does not give access to __GFP_HIGH
reserves.Whereas at the __scsi_get_command end, there are GFP_ATOMIC sense_slab
allocations, which do give access to __GFP_HIGH reserves.My supposition is that once a page has been allocated from __GFP_HIGH
reserves to a scsi sense_slab, swap_writepages are liable to gobble up
the rest of the page with bio allocations which they wouldn't have had
access to traditionally (i.e. under SLAB).So an unexpected behaviour emerges from SLUB's slab merging.
Though of course the same might happen in other circumstances, even
without slab merging: if some kmem_cache allocations are made with
GFP_ATOMIC, those can give access to reserves to non-__GFP_HIGH
allocations from the same kmem_cache.Maybe PF_MEMALLOC and __GFP_NOMEMALLOC complicate the situation:
I've given little thought to mempool_alloc's fiddling with theWe need to have already exhausted reserves, yes: so this isn't an
issue hitting everyone all the time, and it may be nothing worse
than a surprising anomaly; but I'm pretty sure it's not how bio
and scsi command allocation is expected to interact.What do you think a SLAB_NOMERGE flag? The last time I suggested
something like that (but I was thinking of debug), your comment
was "Ohh..", which left me in some doubt ;)If we had a SLAB_NOMERGE flag, would we want to apply it to the
bio cache or to the scsi_sense_cache or to both? My difficulty
in answering that makes me wonder whether such a flag is right.Hugh
--
Looks like that one of the issues here is that swap_writepage()
does not perform enough reclaim? If it would free more pages then
__scsi_get_command would still have pages to allocate and not drainMempool_alloc()s use of the gfp_mask here suggests that it can potentially
drain all reserves and exhaust all available "slots" (partial slabs). Thus
it may regularly force any other user of the slab to hit the slow path
and potentially trigger reclaim. Could be a bit unfair.
--
Somewhere along the line of my swap over network patches I made
'robustified' SLAB to ensure these sorts of things could not happen - it
came at a cost though.It would basically fail[*] allocations that had a higher low watermark
than what was used to allocate the current slab.[*] - well, it would attempt to allocate a new slab to raise the current
My latest series ensures that SLABs allocated using PF_MEMALLOC will not
distribute objects to allocation contexts that are not entitled for as
long as the memory shortage lasts.I'm not sure how applicable this is to the problem at hand, just letting
Relying on this is highly dubious, who is to say that first __GFP_HIGH
alloc came SCSI layer (there could be another merged slab).Also, when one of these users is PF_MEMALLOC the other users will gobble
If this is critical to avoid memory deadlocks, I would suggest using
mempools (or my reserve framework).--
Thanks, Peter: that sounds just right to me; but a larger change than
we'd want to jump into for this one particular issue - it might haveNo, the critical part of it has been dealt with (small fix to scsi
free_list handling: which resembles a mempool, but done its own way).What remains is about "unsightly" behaviour, the system having a
tendency to collapse briefly into far-from-efficient operation
when out of memory.Hugh
--
Hi Hugh,
Although you weren't convinced by my arguments, I still have
difficulties understanding why this kind of bad behavior would be
acceptable in an embedded environment and why we don't need to fix it
for the SLOB case as well.But you do bring up a good point of SLUB changing the behavior on OOM
situations for which SLAB_NOMERGE sounds like a good-enough stop-gap
measure for the short term. I would prefer some other fix even if it
means getting rid of slab merging competely (which would suck as it's
very nice for making memory footprint smaller).Pekka
--
I wonder if we can get away with a SLAB_IO flag that you can use to
annotate caches that participate in writeout and the allocator could
keep some spare pages around that can be handed out for them in case of
OOM...Pekka
--
If we do decide that it's seriously bad behaviour, then of course
it should be fixed in SLOB also. I just meant that the fact that
SLOB's been merging slabs too doesn't give me great confidence thatWell, the work of keeping spares around to be handed out in case of
OOM was done in 2.5 with mempools. They seem to be working well
(until you try swapping over network as Peter is pursuing),
I don't think we need to duplicate that.I've the uncomfortable feeling that I'm making a big fuss over this
behaviour on the one hand, but utterly failing to justify that fuss
on the other. It seems to come down to me wanting fewer backtraces
in my logs: I'd better find stronger reasons than that if we're to
pursue this.Hugh
--
No, I do agree that it's a problem. I just don't like the proposed
no-merge fix ;-).Pekka
--
Actually, I don't think this is SLAB-specific, since there are potentially
the same issues for pages.I suspect the right thing to do is not to mark them for "IO", but mark
them for "short-lived", and allow short-lived allocations that don't have
extended lifetimes to succeed even when a "real" allocation wouldn't.When we are under memory pressure, a normal allocation generally needs to
clear up enough memory from elsewhere to succeed. But a short-lived
allocation without any long-term lifetimes would be known to release its
memory back to the pool, so it can be allowed to go ahead when a normal
memory allocation would not.Examples of short-lived allocations:
- IO requests
- temporary network packets that don't get queued up (eg "ACK" packet) as
opposed to those that do get queued (incoming _or_ outgoing queues)
- things like the temporary buffers for "getname()/putname()" etc.That said, even a lot of temporary allocations can at times have issues.
If your IO layer is dead, your IO request queues that _should_ have been
very temporary may end up staying around for a long time. But I do think
that it makes sense to prioritize allocations that are known to be short-
lived.(It might also allows us to allocate them from different pools and just
generally have other heuristics for cache behaviour - short-lived
allocations simply have different behaviour)Linus
--
Hi Linus,
Yeah, makes sense. We do have GFP_TEMPORARY so we could associate this
new semantics with that. But the real problem here is how to do the
"allocate harder" part which, btw, sounds very similar to what Peter's
kmalloc reserve patches try to do...Pekka
--
Actually, a trivial way to implement that is to have a few "emergency
kmalloc" caches say for sizes 64, 128, 256, and 512 that have some
pre-allocated pages into which these GFP_TEMPORARY allocations are
allowed to dip into on OOM and OOM only.Pekka
--
So something like the following (totally untested) patch modulo the
pre-allocation bits.Pekka
diff --git a/mm/slub.c b/mm/slub.c
index acc975f..e8f7cf3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -264,6 +264,67 @@ static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
#endif
}+/*
+ * Emergency caches are for GFP_TEMPORARY allocations to dip into when we're
+ * OOM to make sure we can make some progress for writeback.
+ */
+
+#define MIN_EMERGENCY_SIZE 32
+#define NR_EMERGENCY_CACHES 4
+
+struct kmem_cache *emergency_caches[NR_EMERGENCY_CACHES];
+
+static void init_emergency_caches(void)
+{
+ unsigned long size = MIN_EMERGENCY_SIZE;
+ int i;
+
+ for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+ struct kmem_cache *cache;
+ char *name;
+
+ name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+ BUG_ON(!name);
+
+ cache = kmem_cache_create(name, size, 0, 0, NULL);
+ BUG_ON(!cache);
+ kfree(name);
+
+ emergency_caches[i] = cache;
+
+ size *= 2;
+ }
+}
+
+static struct kmem_cache *lookup_emergency_cache(unsigned long size)
+{
+ unsigned long cache_size = MIN_EMERGENCY_SIZE;
+ int i;
+
+ for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+ if (size < cache_size)
+ return emergency_caches[i];
+
+ cache_size *= 2;
+ }
+ return NULL;
+}
+
+static void *emergency_alloc(struct kmem_cache *cache, gfp_t gfp)
+{
+ struct kmem_cache *emergency_cache;
+ void *p;
+
+ emergency_cache = lookup_emergency_cache(cache->objsize);
+ if (!emergency_cache)
+ return NULL;
+
+ p = kmem_cache_alloc(emergency_cache, gfp);
+ if (p && cache->ctor)
+ cache->ctor(cache, p);
+ return p;
+}
+
/********************************************************************
* Core slab cache functions
*******************************************************************/
@@ -1528,6 +1589,9 @@ new_slab:
goto load_freelist;
}+ if ((gfpflags & GFP_TEMPORARY) == GFP_TEMPORARY)
+ return emergency_alloc(s, gfpf...
Hmm. I didn't check, but won't this cause problems on the freeing path
when we call kmem_cache_free() on the result but with the wrong "struct
kmem_cache" pointer?Linus
--
So something like this fugly patch on top of the SLUB variable order
patches that let the allocator fall-back to smaller page orders. Survives
OOM in my testing and the box seems to be bit more responsive even under X
with this.Pekka
---
mm/slub.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 92 insertions(+), 1 deletion(-)Index: slab-2.6/mm/slub.c
===================================================================
--- slab-2.6.orig/mm/slub.c 2008-04-08 22:52:02.000000000 +0300
+++ slab-2.6/mm/slub.c 2008-04-08 23:27:27.000000000 +0300
@@ -1549,6 +1549,88 @@
}/*
+ * Emergency caches are reserved for GFP_TEMPORARY allocations on OOM.
+ */
+
+#define MIN_EMERGENCY_SIZE 32
+#define NR_EMERGENCY_CACHES 4
+
+struct kmem_cache *emergency_caches[NR_EMERGENCY_CACHES];
+
+static void pre_alloc_cpu_slabs(struct kmem_cache *s)
+{
+ int cpu;
+
+ /*
+ * FIXME: CPU hot-plug, stats, debug?
+ */
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+ struct page *new;
+ void **object;
+ int node;
+
+ c = get_cpu_slab(s, cpu);
+ node = cpu_to_node(cpu);
+
+ new = new_slab(s, GFP_KERNEL, node);
+ BUG_ON(!new);
+
+ slab_lock(new);
+ SetSlabFrozen(new);
+ c->page = new;
+ object = c->page->freelist;
+ c->freelist = object[c->offset];
+ c->page->inuse = c->page->objects;
+ c->page->freelist = NULL;
+ c->node = page_to_nid(c->page);
+ slab_unlock(c->page);
+ }
+}
+
+static void init_emergency_caches(void)
+{
+ unsigned long size = MIN_EMERGENCY_SIZE;
+ int i;
+
+ for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+ struct kmem_cache *cache;
+ char *name;
+
+ name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+ BUG_ON(!name);
+
+ cache = kmem_cache_create(name, size, 0, 0, NULL);
+ BUG_ON(!cache);
+ kfree(name);
+
+ pre_alloc_cpu_slabs(cache);
+ emergency_caches[i] = cache;
+
+ size *= 2;
+ }
+}
+
+static void *emergency...
Hmmmm... Peter has the most experience with these issues. Maybe the best
would be to have this sort of logic in a more general way in the page
allocator? Similar issues surely exist with the page allocator and a fix
there would fix it for all users.--
This needs some support in the slab allocator anyway. Keep in mind that
the patch is specifically addressing writeback in OOM conditions so we
must (1) prioritize GFP_TEMPORARY allocations over everyone else (which
just get NULL) and (2) use the remaining available memory as efficiently
as possible for _all_ GFP_TEMPORARY allocations.Peter is, however, bringing up a good point that my patch doesn't
actually _guarantee_ anything so I'm still wondering if this approach
makes any sense... But I sure do like Linus' ideas of marking
short-lived allocations and trying harder for them in OOM.Pekka
--
Also, this scheme so far does not provide for a means to detect the end
of pressure situation.I need both triggers, enter pressure and exit pressure. Enter pressure
is easy enough, that's when normal allocations start failing. Exit
pressure however is more difficult - that is basically when allocations
start succeeding again. You'll see that my patches basically always
attempt a regular allocation as long as we're in the emergency state.Also, the requirement for usage of emergency memory (GFP_MEMALLOC,
PF_MEMALLOC) is that it will be freed without external conditions. So
while it might be delayed for a while (it might sit in the fragment
assembly cache for a while) it doesn't need any external input to get
freed again:
- it will either get reclaimed from this cache;
- it will exit the cache as a full packet and:
- get dropped, or
- get processed.--
...uh-oh hand over the paper bag. I have no GFP_TEMPORARY allocations
going on...
--
Aah, yes. I was thinking of kmalloc() here for which it works as
expected because kfree() will return the page to the proper cache. But
we can relax the rules of kmem_cache_free() a bit to make this work (but
perhaps add a WARN_ON() there if cache doesn't match page->slab).Pekka
--
Right, but I doubt we'd ever get something like that merged though -
esp. as it will basically destroy the SLUB fast-path.SLAB allocation fairness:
http://lkml.org/lkml/2007/1/16/61I abandoned this approach because it was too expensive; it was reduced
to the ALLOC_NO_WATERMARKS state transition. Which is much more unlikely
to happen and it's generally accepted we're in a slow path once we
really dive so low into the reserves.The latest posting:
http://lkml.org/lkml/2008/3/20/214--
On Sun, 6 Apr 2008 23:56:57 +0100 (BST)
Reverting the separate sense_slab is fine for now but we need the
separation shortly anyway. We need to support larger sense buffer (260
bytes). The current 96 byte sense buffer works for the majority of us,As you and James agreed, the patch in scsi-misc looks good to me.
Thanks for finding this bug.
--
No, it's brought attention to this interesting slab merge issue;
I don't believe you _need_ a separate sense_slab even for that:
what I meant was that you just need something like
pool->cmd_slab = kmem_cache_create(pool->cmd_name,
cache_line_align(
sizeof(struct scsi_cmnd)) +
max_scsi_sense_buffersize,
0, pool->slab_flags, NULL);
then point cmd->sense_buffer to (unsigned char *) cmd +
cache_line_align(sizeof(struct scsi_cmnd));
where cache_line_align and max_scsi_sense_buffersize are preferably
determined at runtime.Now, it may well be that over the different configurations, at least
some would waste significant memory by putting it all in the one big
buffer, and you're better off with the separate slabs: so I didn't
want to interfere with your direction on that.Hugh
--
On Mon, 7 Apr 2008 19:07:56 +0100 (BST)
Yeah, seems that it led to an interesting discussion (using cache
behavior like ephemeral sounds useful, I think) though surely this isYes, if we have only 96 and 260 bytes sense buffers, it would be a
solution. Well, evne if we have various length sense buffers, we can
have a pool per driver (or device, scsi_host, etc).Another reason why we separated them is that we could allocate a sense
buffer only when it's necessary (though I'm not sure we will do itYes, this was about wasting memory (with the one big buffer)
vs. overheads due to allocating two buffers. After some performance
tests, we chose the latter but we might change this again in the
future.
--
This was sort of accidentally fixed in scsi-misc by commit
commit c5f73260b289cb974928eac05f2d84e58ddfc020
Author: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Thu Mar 13 11:16:33 2008 -0500[SCSI] consolidate command allocation in a single place
Could you check that:
master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
and see if it alleviates the problem? ... if so, we can work out which
pieces to backport.Thanks,
James
--
Precisely that patch seems appropriate to 2.6.25-rc8-git, so I'm now
running the test with just that applied to 2.6.25-rc8 (plus cfq rcu
fix). Not quite what you asked, but...Strictly speaking, it'd take a couple of days to be reasonably sure
that the livelock is gone (it appeared to reproduce quicker once I
moved to -rc8 plus cfq rcu fix; but I'm not entirely convinced that
wasn't just coincidence). But if nothing bad appears overnight,
let's assume your patch is the one to push: I'll report tomorrow.Hugh
--
Right, as expected, that seems to run fine: I've seen no problem with
it in 17 hours - beyond, of course, all the page allocation failures
which will plague such tests until something else is changed.Somewhat irrelevant since Linus put my patch into 2.6.25-rc8-git
(probably worth it for the various issues noted in the comment);
but reassuring for 2.6.26.I'll stop that test now and put it to work on something else.
Hugh
--
Well ... just remember that to merge scsi-misc with what Linus has now
done, I'm effectively reversing your patch, so testing current scsi-misc
is very valuable if you want the problem not to recur in 2.6.26 ...James
--
Exactly.
Hugh
--
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Srivatsa Vaddagiri | containers (was Re: -mm merge plans for 2.6.23) |
| Benjamin Herrenschmidt | Re: [linux-pm] [PATCH] Remove process freezer from suspend to RAM pathway |
git: | |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Patrick McHardy | Re: [GIT]: Networking |
| Gerrit Renker | [PATCH 6/7] [CCID-2/3]: Fix sparse warnings |
