Since 2.6.25-rc7, I've been seeing an occasional livelock on one x86_64 machine, copying kernel trees to tmpfs, paging out to swap. Signature: 6000 pages under writeback but never getting written; most tasks of interest trying to reclaim, but each get_swap_bio waiting for a bio in mempool_alloc's io_schedule_timeout(5*HZ); every five seconds an atomic page allocation failure report from kblockd failing to allocate a sense_buffer in __scsi_get_command. __scsi_get_command has a (one item) free_list to protect against this, but rc1's [SCSI] use dynamically allocated sense buffer de25deb18016f66dcdede165d07654559bb332bc upset that slightly. When it fails to allocate from the separate sense_slab, instead of giving up, it must fall back to the command free_list, which is sure to have a sense_buffer attached. Either my earlier -rc testing missed this, or there's some recent contributory factor. One very significant factor is SLUB, which merges slab caches when it can, and on 64-bit happens to merge both bio cache and sense_slab cache into kmalloc's 128-byte cache: so that under this swapping load, bios above are liable to gobble up all the slots needed for scsi_cmnd sense_buffers below. That's disturbing behaviour, and I tried a few things to fix it. Adding a no-op constructor to the sense_slab inhibits SLUB from merging it, and stops all the allocation failures I was seeing; but it's rather a hack, and perhaps in different configurations we have other caches on the swapout path which are ill-merged. Another alternative is to revert the separate sense_slab, using cache-line-aligned sense_buffer allocated beyond scsi_cmnd from the one kmem_cache; but that might waste more memory, and is only a way of diverting around the known problem. While I don't like seeing the allocation failures, and hate the idea of all those bios piled up above a scsi host working one by one, it does seem to emerge fairly soon with the livelock fix. So lacking better ideas, stick with that one clear fix...
A reliance on free slots that the slab allocator may provide? That is a rather bad dependency since it is up to the slab allocator to implement the storage layout for the objects and thus the availability of slots may vary depending on the layout for the objects chosen by the allocator. Looking at mempool_alloc: Mempools may be used to do atomic allocations until they fail thereby exhausting reserves and available object in the partial lists of slab caches? In order to make this a significant factor we need to have already exhausted reserves right? Thus we are already operating at the boundary of what memory there is. Any non atomic alloc will then allocate a new page with N elements in order to get one object. The mempool_allocs from the atomic context will then gooble up the N-1 remaining objects? So the nonatomic alloc will then have to hit the page allocator again... --
I'm not sure that I understand you. Yes, different slab allocators may lay out slots differently. But a significant departure from existing behaviour may be a bad idea in some circumstances. Mempools may be used for atomic allocations, but I think that's not the case here. swap_writepage's get_swap_bio says GFP_NOIO, which allows (indeed is) __GFP_WAIT, and does not give access to __GFP_HIGH reserves. Whereas at the __scsi_get_command end, there are GFP_ATOMIC sense_slab allocations, which do give access to __GFP_HIGH reserves. My supposition is that once a page has been allocated from __GFP_HIGH reserves to a scsi sense_slab, swap_writepages are liable to gobble up the rest of the page with bio allocations which they wouldn't have had access to traditionally (i.e. under SLAB). So an unexpected behaviour emerges from SLUB's slab merging. Though of course the same might happen in other circumstances, even without slab merging: if some kmem_cache allocations are made with GFP_ATOMIC, those can give access to reserves to non-__GFP_HIGH allocations from the same kmem_cache. Maybe PF_MEMALLOC and __GFP_NOMEMALLOC complicate the situation: I've given little thought to mempool_alloc's fiddling with the We need to have already exhausted reserves, yes: so this isn't an issue hitting everyone all the time, and it may be nothing worse than a surprising anomaly; but I'm pretty sure it's not how bio and scsi command allocation is expected to interact. What do you think a SLAB_NOMERGE flag? The last time I suggested something like that (but I was thinking of debug), your comment was "Ohh..", which left me in some doubt ;) If we had a SLAB_NOMERGE flag, would we want to apply it to the bio cache or to the scsi_sense_cache or to both? My difficulty in answering that makes me wonder whether such a flag is right. Hugh --
Looks like that one of the issues here is that swap_writepage() does not perform enough reclaim? If it would free more pages then __scsi_get_command would still have pages to allocate and not drain Mempool_alloc()s use of the gfp_mask here suggests that it can potentially drain all reserves and exhaust all available "slots" (partial slabs). Thus it may regularly force any other user of the slab to hit the slow path and potentially trigger reclaim. Could be a bit unfair. --
Somewhere along the line of my swap over network patches I made 'robustified' SLAB to ensure these sorts of things could not happen - it came at a cost though. It would basically fail[*] allocations that had a higher low watermark than what was used to allocate the current slab. [*] - well, it would attempt to allocate a new slab to raise the current My latest series ensures that SLABs allocated using PF_MEMALLOC will not distribute objects to allocation contexts that are not entitled for as long as the memory shortage lasts. I'm not sure how applicable this is to the problem at hand, just letting Relying on this is highly dubious, who is to say that first __GFP_HIGH alloc came SCSI layer (there could be another merged slab). Also, when one of these users is PF_MEMALLOC the other users will gobble If this is critical to avoid memory deadlocks, I would suggest using mempools (or my reserve framework). --
Thanks, Peter: that sounds just right to me; but a larger change than we'd want to jump into for this one particular issue - it might have No, the critical part of it has been dealt with (small fix to scsi free_list handling: which resembles a mempool, but done its own way). What remains is about "unsightly" behaviour, the system having a tendency to collapse briefly into far-from-efficient operation when out of memory. Hugh --
Hi Hugh, Although you weren't convinced by my arguments, I still have difficulties understanding why this kind of bad behavior would be acceptable in an embedded environment and why we don't need to fix it for the SLOB case as well. But you do bring up a good point of SLUB changing the behavior on OOM situations for which SLAB_NOMERGE sounds like a good-enough stop-gap measure for the short term. I would prefer some other fix even if it means getting rid of slab merging competely (which would suck as it's very nice for making memory footprint smaller). Pekka --
I wonder if we can get away with a SLAB_IO flag that you can use to annotate caches that participate in writeout and the allocator could keep some spare pages around that can be handed out for them in case of OOM... Pekka --
If we do decide that it's seriously bad behaviour, then of course it should be fixed in SLOB also. I just meant that the fact that SLOB's been merging slabs too doesn't give me great confidence that Well, the work of keeping spares around to be handed out in case of OOM was done in 2.5 with mempools. They seem to be working well (until you try swapping over network as Peter is pursuing), I don't think we need to duplicate that. I've the uncomfortable feeling that I'm making a big fuss over this behaviour on the one hand, but utterly failing to justify that fuss on the other. It seems to come down to me wanting fewer backtraces in my logs: I'd better find stronger reasons than that if we're to pursue this. Hugh --
No, I do agree that it's a problem. I just don't like the proposed no-merge fix ;-). Pekka --
Actually, I don't think this is SLAB-specific, since there are potentially the same issues for pages. I suspect the right thing to do is not to mark them for "IO", but mark them for "short-lived", and allow short-lived allocations that don't have extended lifetimes to succeed even when a "real" allocation wouldn't. When we are under memory pressure, a normal allocation generally needs to clear up enough memory from elsewhere to succeed. But a short-lived allocation without any long-term lifetimes would be known to release its memory back to the pool, so it can be allowed to go ahead when a normal memory allocation would not. Examples of short-lived allocations: - IO requests - temporary network packets that don't get queued up (eg "ACK" packet) as opposed to those that do get queued (incoming _or_ outgoing queues) - things like the temporary buffers for "getname()/putname()" etc. That said, even a lot of temporary allocations can at times have issues. If your IO layer is dead, your IO request queues that _should_ have been very temporary may end up staying around for a long time. But I do think that it makes sense to prioritize allocations that are known to be short- lived. (It might also allows us to allocate them from different pools and just generally have other heuristics for cache behaviour - short-lived allocations simply have different behaviour) Linus --
Hi Linus, Yeah, makes sense. We do have GFP_TEMPORARY so we could associate this new semantics with that. But the real problem here is how to do the "allocate harder" part which, btw, sounds very similar to what Peter's kmalloc reserve patches try to do... Pekka --
Actually, a trivial way to implement that is to have a few "emergency kmalloc" caches say for sizes 64, 128, 256, and 512 that have some pre-allocated pages into which these GFP_TEMPORARY allocations are allowed to dip into on OOM and OOM only. Pekka --
So something like the following (totally untested) patch modulo the
pre-allocation bits.
Pekka
diff --git a/mm/slub.c b/mm/slub.c
index acc975f..e8f7cf3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -264,6 +264,67 @@ static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
#endif
}
+/*
+ * Emergency caches are for GFP_TEMPORARY allocations to dip into when we're
+ * OOM to make sure we can make some progress for writeback.
+ */
+
+#define MIN_EMERGENCY_SIZE 32
+#define NR_EMERGENCY_CACHES 4
+
+struct kmem_cache *emergency_caches[NR_EMERGENCY_CACHES];
+
+static void init_emergency_caches(void)
+{
+ unsigned long size = MIN_EMERGENCY_SIZE;
+ int i;
+
+ for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+ struct kmem_cache *cache;
+ char *name;
+
+ name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+ BUG_ON(!name);
+
+ cache = kmem_cache_create(name, size, 0, 0, NULL);
+ BUG_ON(!cache);
+ kfree(name);
+
+ emergency_caches[i] = cache;
+
+ size *= 2;
+ }
+}
+
+static struct kmem_cache *lookup_emergency_cache(unsigned long size)
+{
+ unsigned long cache_size = MIN_EMERGENCY_SIZE;
+ int i;
+
+ for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+ if (size < cache_size)
+ return emergency_caches[i];
+
+ cache_size *= 2;
+ }
+ return NULL;
+}
+
+static void *emergency_alloc(struct kmem_cache *cache, gfp_t gfp)
+{
+ struct kmem_cache *emergency_cache;
+ void *p;
+
+ emergency_cache = lookup_emergency_cache(cache->objsize);
+ if (!emergency_cache)
+ return NULL;
+
+ p = kmem_cache_alloc(emergency_cache, gfp);
+ if (p && cache->ctor)
+ cache->ctor(cache, p);
+ return p;
+}
+
/********************************************************************
* Core slab cache functions
*******************************************************************/
@@ -1528,6 +1589,9 @@ new_slab:
goto load_freelist;
}
+ if ((gfpflags & GFP_TEMPORARY) == GFP_TEMPORARY)
+ return emergency_alloc(s, gfpf...Hmm. I didn't check, but won't this cause problems on the freeing path when we call kmem_cache_free() on the result but with the wrong "struct kmem_cache" pointer? Linus --
So something like this fugly patch on top of the SLUB variable order
patches that let the allocator fall-back to smaller page orders. Survives
OOM in my testing and the box seems to be bit more responsive even under X
with this.
Pekka
---
mm/slub.c | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 92 insertions(+), 1 deletion(-)
Index: slab-2.6/mm/slub.c
===================================================================
--- slab-2.6.orig/mm/slub.c 2008-04-08 22:52:02.000000000 +0300
+++ slab-2.6/mm/slub.c 2008-04-08 23:27:27.000000000 +0300
@@ -1549,6 +1549,88 @@
}
/*
+ * Emergency caches are reserved for GFP_TEMPORARY allocations on OOM.
+ */
+
+#define MIN_EMERGENCY_SIZE 32
+#define NR_EMERGENCY_CACHES 4
+
+struct kmem_cache *emergency_caches[NR_EMERGENCY_CACHES];
+
+static void pre_alloc_cpu_slabs(struct kmem_cache *s)
+{
+ int cpu;
+
+ /*
+ * FIXME: CPU hot-plug, stats, debug?
+ */
+ for_each_online_cpu(cpu) {
+ struct kmem_cache_cpu *c;
+ struct page *new;
+ void **object;
+ int node;
+
+ c = get_cpu_slab(s, cpu);
+ node = cpu_to_node(cpu);
+
+ new = new_slab(s, GFP_KERNEL, node);
+ BUG_ON(!new);
+
+ slab_lock(new);
+ SetSlabFrozen(new);
+ c->page = new;
+ object = c->page->freelist;
+ c->freelist = object[c->offset];
+ c->page->inuse = c->page->objects;
+ c->page->freelist = NULL;
+ c->node = page_to_nid(c->page);
+ slab_unlock(c->page);
+ }
+}
+
+static void init_emergency_caches(void)
+{
+ unsigned long size = MIN_EMERGENCY_SIZE;
+ int i;
+
+ for (i = 0; i < NR_EMERGENCY_CACHES; i++) {
+ struct kmem_cache *cache;
+ char *name;
+
+ name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
+ BUG_ON(!name);
+
+ cache = kmem_cache_create(name, size, 0, 0, NULL);
+ BUG_ON(!cache);
+ kfree(name);
+
+ pre_alloc_cpu_slabs(cache);
+ emergency_caches[i] = cache;
+
+ size *= 2;
+ }
+}
+
+static void *emergency...Hmmmm... Peter has the most experience with these issues. Maybe the best would be to have this sort of logic in a more general way in the page allocator? Similar issues surely exist with the page allocator and a fix there would fix it for all users. --
This needs some support in the slab allocator anyway. Keep in mind that the patch is specifically addressing writeback in OOM conditions so we must (1) prioritize GFP_TEMPORARY allocations over everyone else (which just get NULL) and (2) use the remaining available memory as efficiently as possible for _all_ GFP_TEMPORARY allocations. Peter is, however, bringing up a good point that my patch doesn't actually _guarantee_ anything so I'm still wondering if this approach makes any sense... But I sure do like Linus' ideas of marking short-lived allocations and trying harder for them in OOM. Pekka --
Also, this scheme so far does not provide for a means to detect the end
of pressure situation.
I need both triggers, enter pressure and exit pressure. Enter pressure
is easy enough, that's when normal allocations start failing. Exit
pressure however is more difficult - that is basically when allocations
start succeeding again. You'll see that my patches basically always
attempt a regular allocation as long as we're in the emergency state.
Also, the requirement for usage of emergency memory (GFP_MEMALLOC,
PF_MEMALLOC) is that it will be freed without external conditions. So
while it might be delayed for a while (it might sit in the fragment
assembly cache for a while) it doesn't need any external input to get
freed again:
- it will either get reclaimed from this cache;
- it will exit the cache as a full packet and:
- get dropped, or
- get processed.
--...uh-oh hand over the paper bag. I have no GFP_TEMPORARY allocations going on... --
Aah, yes. I was thinking of kmalloc() here for which it works as expected because kfree() will return the page to the proper cache. But we can relax the rules of kmem_cache_free() a bit to make this work (but perhaps add a WARN_ON() there if cache doesn't match page->slab). Pekka --
Right, but I doubt we'd ever get something like that merged though - esp. as it will basically destroy the SLUB fast-path. SLAB allocation fairness: http://lkml.org/lkml/2007/1/16/61 I abandoned this approach because it was too expensive; it was reduced to the ALLOC_NO_WATERMARKS state transition. Which is much more unlikely to happen and it's generally accepted we're in a slow path once we really dive so low into the reserves. The latest posting: http://lkml.org/lkml/2008/3/20/214 --
On Sun, 6 Apr 2008 23:56:57 +0100 (BST) Reverting the separate sense_slab is fine for now but we need the separation shortly anyway. We need to support larger sense buffer (260 bytes). The current 96 byte sense buffer works for the majority of us, As you and James agreed, the patch in scsi-misc looks good to me. Thanks for finding this bug. --
No, it's brought attention to this interesting slab merge issue; I don't believe you _need_ a separate sense_slab even for that: what I meant was that you just need something like pool->cmd_slab = kmem_cache_create(pool->cmd_name, cache_line_align( sizeof(struct scsi_cmnd)) + max_scsi_sense_buffersize, 0, pool->slab_flags, NULL); then point cmd->sense_buffer to (unsigned char *) cmd + cache_line_align(sizeof(struct scsi_cmnd)); where cache_line_align and max_scsi_sense_buffersize are preferably determined at runtime. Now, it may well be that over the different configurations, at least some would waste significant memory by putting it all in the one big buffer, and you're better off with the separate slabs: so I didn't want to interfere with your direction on that. Hugh --
On Mon, 7 Apr 2008 19:07:56 +0100 (BST) Yeah, seems that it led to an interesting discussion (using cache behavior like ephemeral sounds useful, I think) though surely this is Yes, if we have only 96 and 260 bytes sense buffers, it would be a solution. Well, evne if we have various length sense buffers, we can have a pool per driver (or device, scsi_host, etc). Another reason why we separated them is that we could allocate a sense buffer only when it's necessary (though I'm not sure we will do it Yes, this was about wasting memory (with the one big buffer) vs. overheads due to allocating two buffers. After some performance tests, we chose the latter but we might change this again in the future. --
This was sort of accidentally fixed in scsi-misc by commit
commit c5f73260b289cb974928eac05f2d84e58ddfc020
Author: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Thu Mar 13 11:16:33 2008 -0500
[SCSI] consolidate command allocation in a single place
Could you check that:
master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
and see if it alleviates the problem? ... if so, we can work out which
pieces to backport.
Thanks,
James
--Precisely that patch seems appropriate to 2.6.25-rc8-git, so I'm now running the test with just that applied to 2.6.25-rc8 (plus cfq rcu fix). Not quite what you asked, but... Strictly speaking, it'd take a couple of days to be reasonably sure that the livelock is gone (it appeared to reproduce quicker once I moved to -rc8 plus cfq rcu fix; but I'm not entirely convinced that wasn't just coincidence). But if nothing bad appears overnight, let's assume your patch is the one to push: I'll report tomorrow. Hugh --
Right, as expected, that seems to run fine: I've seen no problem with it in 17 hours - beyond, of course, all the page allocation failures which will plague such tests until something else is changed. Somewhat irrelevant since Linus put my patch into 2.6.25-rc8-git (probably worth it for the various issues noted in the comment); but reassuring for 2.6.26. I'll stop that test now and put it to work on something else. Hugh --
Well ... just remember that to merge scsi-misc with what Linus has now done, I'm effectively reversing your patch, so testing current scsi-misc is very valuable if you want the problem not to recur in 2.6.26 ... James --
Exactly. Hugh --
| Karl Meyer | PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out" |
| Greg Kroah-Hartman | [PATCH 040/196] kobject: add kobject_add_ng function |
| Steven Rostedt | [RFC PATCH v4] Unified trace buffer |
| Dave Airlie | [git pull] drm patches for 2.6.27 final |
| Krzysztof Halasa | Re: [PATCH v2] Re: WAN: new PPP code for generic HDLC |
| David Miller | Re: [PATCH] Expose netdevice dev_id through sysfs |
| Jay Cliburn | Re: atl1 64-bit => 32-bit DMA borkage (reproducible, bisected) |
| Evgeniy Polyakov | [resend take 2 0/4] Distributed storage. |
git: | |
| Andrew Morton | Untracked working tree files |
| Miklos Vajna | [rfc] git submodules howto |
| Ben Collins | Re: [kernel.org users] [RFD] On deprecating "git-foo" for builtins |
| Jon Smirl | ! [rejected] master -> master (non-fast forward) |
| rancor | How to copy/pipe console buffert to file? |
| Pieter Verberne | File collision while using pkg_add |
| Greg Thomas | Re: Is it possible to fix a stale NFS hadle without rebooting? |
| Didier Wiroth | win32-codecs, avi and amd64 question |
| Netfilter kernel module | 9 hours ago | Linux kernel |
| serial driver xmit problem | 12 hours ago | Linux kernel |
| Why Windows is better than Linux | 12 hours ago | Linux general |
| How can I see my kernel messages in vt12? | 19 hours ago | Linux kernel |
| Grub | 1 day ago | Linux general |
| vmalloc_fault handling in x86_64 | 1 day ago | Linux kernel |
| epoll_wait()ing on epoll FD | 1 day ago | Linux kernel |
| Framebuffer in x86_64 causes problems to multiseat | 1 day ago | Linux kernel |
| Difference between 2.4 and 2.6 regarding thread creation | 1 day ago | Linux general |
| Compiling gfs2 on kernel 2.6.27 | 2 days ago | Linux kernel |
