login
Header Space

 
 

Re: [PATCH] scsi: fix sense_slab/bio swapping livelock

Previous thread: [GIT PATCHES] V4L/DVB new USB ID's for 2.6.25-rc8 by Mauro Carvalho Chehab on Sunday, April 6, 2008 - 6:19 pm. (1 message)

Next thread: Re: how to disable partition search? by Ferenc Wagner on Sunday, April 6, 2008 - 6:25 pm. (1 message)
To: James Bottomley <James.Bottomley@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Christoph Lameter <clameter@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.ziljstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Sunday, April 6, 2008 - 6:56 pm

Since 2.6.25-rc7, I've been seeing an occasional livelock on one
x86_64 machine, copying kernel trees to tmpfs, paging out to swap.

Signature: 6000 pages under writeback but never getting written;
most tasks of interest trying to reclaim, but each get_swap_bio
waiting for a bio in mempool_alloc's io_schedule_timeout(5*HZ);
every five seconds an atomic page allocation failure report from
kblockd failing to allocate a sense_buffer in __scsi_get_command.

__scsi_get_command has a (one item) free_list to protect against
this, but rc1's [SCSI] use dynamically allocated sense buffer
de25deb18016f66dcdede165d07654559bb332bc upset that slightly.
When it fails to allocate from the separate sense_slab, instead
of giving up, it must fall back to the command free_list, which
is sure to have a sense_buffer attached.

Either my earlier -rc testing missed this, or there's some recent
contributory factor.  One very significant factor is SLUB, which
merges slab caches when it can, and on 64-bit happens to merge
both bio cache and sense_slab cache into kmalloc's 128-byte cache:
so that under this swapping load, bios above are liable to gobble
up all the slots needed for scsi_cmnd sense_buffers below.

That's disturbing behaviour, and I tried a few things to fix it.
Adding a no-op constructor to the sense_slab inhibits SLUB from
merging it, and stops all the allocation failures I was seeing;
but it's rather a hack, and perhaps in different configurations
we have other caches on the swapout path which are ill-merged.

Another alternative is to revert the separate sense_slab, using
cache-line-aligned sense_buffer allocated beyond scsi_cmnd from
the one kmem_cache; but that might waste more memory, and is
only a way of diverting around the known problem.

While I don't like seeing the allocation failures, and hate the
idea of all those bios piled up above a scsi host working one by
one, it does seem to emerge fairly soon with the livelock fix.
So lacking better ideas, stick with that one clear fix...
To: Hugh Dickins <hugh@...>
Cc: James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.ziljstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 1:26 am

A reliance on free slots that the slab allocator may provide? That is a 
rather bad dependency since it is up to the slab allocator to implement
the storage layout for the objects and thus the availability of slots may
vary depending on the layout for the objects chosen by the allocator.

Looking at mempool_alloc: Mempools may be used to do atomic allocations 
until they fail thereby exhausting reserves and available object in the 
partial lists of slab caches?

In order to make this a significant factor we need to have already 
exhausted reserves right? Thus we are already operating at the boundary of 
what memory there is. Any non atomic alloc will then allocate a new page 
with N elements in order to get one object. The mempool_allocs from the 
atomic context will then gooble up the N-1 remaining objects? So the
nonatomic alloc will then have to hit the page allocator again...

--
To: Christoph Lameter <clameter@...>
Cc: James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.zijlstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 3:40 pm

I'm not sure that I understand you.  Yes, different slab allocators
may lay out slots differently.  But a significant departure from
existing behaviour may be a bad idea in some circumstances.

Mempools may be used for atomic allocations, but I think that's not
the case here.  swap_writepage's get_swap_bio says GFP_NOIO, which
allows (indeed is) __GFP_WAIT, and does not give access to __GFP_HIGH
reserves.

Whereas at the __scsi_get_command end, there are GFP_ATOMIC sense_slab
allocations, which do give access to __GFP_HIGH reserves.

My supposition is that once a page has been allocated from __GFP_HIGH
reserves to a scsi sense_slab, swap_writepages are liable to gobble up
the rest of the page with bio allocations which they wouldn't have had
access to traditionally (i.e. under SLAB).

So an unexpected behaviour emerges from SLUB's slab merging.

Though of course the same might happen in other circumstances, even
without slab merging: if some kmem_cache allocations are made with
GFP_ATOMIC, those can give access to reserves to non-__GFP_HIGH
allocations from the same kmem_cache.

Maybe PF_MEMALLOC and __GFP_NOMEMALLOC complicate the situation:
I've given little thought to mempool_alloc's fiddling with the

We need to have already exhausted reserves, yes: so this isn't an
issue hitting everyone all the time, and it may be nothing worse
than a surprising anomaly; but I'm pretty sure it's not how bio
and scsi command allocation is expected to interact.

What do you think a SLAB_NOMERGE flag?  The last time I suggested
something like that (but I was thinking of debug), your comment
was "Ohh..", which left me in some doubt ;)

If we had a SLAB_NOMERGE flag, would we want to apply it to the
bio cache or to the scsi_sense_cache or to both?  My difficulty
in answering that makes me wonder whether such a flag is right.

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.zijlstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 4:43 pm

Looks like that one of the issues here is that swap_writepage() 
does not perform enough reclaim? If it would free more pages then 
__scsi_get_command would still have pages to allocate and not drain

Mempool_alloc()s use of the gfp_mask here suggests that it can potentially 
drain all reserves and exhaust all available "slots" (partial slabs). Thus 
it may regularly force any other user of the slab to hit the slow path 
and potentially trigger reclaim. Could be a bit unfair.
--
To: Hugh Dickins <hugh@...>
Cc: Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Pekka Enberg <penberg@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 3:55 pm

Somewhere along the line of my swap over network patches I made
'robustified' SLAB to ensure these sorts of things could not happen - it
came at a cost though.

It would basically fail[*] allocations that had a higher low watermark
than what was used to allocate the current slab.

[*] - well, it would attempt to allocate a new slab to raise the current

My latest series ensures that SLABs allocated using PF_MEMALLOC will not
distribute objects to allocation contexts that are not entitled for as
long as the memory shortage lasts.

I'm not sure how applicable this is to the problem at hand, just letting

Relying on this is highly dubious, who is to say that first __GFP_HIGH
alloc came SCSI layer (there could be another merged slab).

Also, when one of these users is PF_MEMALLOC the other users will gobble

If this is critical to avoid memory deadlocks, I would suggest using
mempools (or my reserve framework).

--
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Pekka Enberg <penberg@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 4:31 pm

Thanks, Peter: that sounds just right to me; but a larger change than
we'd want to jump into for this one particular issue - it might have

No, the critical part of it has been dealt with (small fix to scsi
free_list handling: which resembles a mempool, but done its own way).

What remains is about "unsightly" behaviour, the system having a
tendency to collapse briefly into far-from-efficient operation
when out of memory.

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 5:00 pm

Hi Hugh,




Although you weren't convinced by my arguments, I still have 
difficulties understanding why this kind of bad behavior would be 
acceptable in an embedded environment and why we don't need to fix it 
for the SLOB case as well.

But you do bring up a good point of SLUB changing the behavior on OOM 
situations for which SLAB_NOMERGE sounds like a good-enough stop-gap 
measure for the short term. I would prefer some other fix even if it 
means getting rid of slab merging competely (which would suck as it's 
very nice for making memory footprint smaller).

		Pekka
--
To: Hugh Dickins <hugh@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 5:05 pm

I wonder if we can get away with a SLAB_IO flag that you can use to 
annotate caches that participate in writeout and the allocator could 
keep some spare pages around that can be handed out for them in case of 
OOM...

		Pekka
--
To: Pekka Enberg <penberg@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 5:30 pm

If we do decide that it's seriously bad behaviour, then of course
it should be fixed in SLOB also.  I just meant that the fact that
SLOB's been merging slabs too doesn't give me great confidence that

Well, the work of keeping spares around to be handed out in case of
OOM was done in 2.5 with mempools.  They seem to be working well
(until you try swapping over network as Peter is pursuing),
I don't think we need to duplicate that.

I've the uncomfortable feeling that I'm making a big fuss over this
behaviour on the one hand, but utterly failing to justify that fuss
on the other.  It seems to come down to me wanting fewer backtraces
in my logs: I'd better find stronger reasons than that if we're to
pursue this.

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 5:36 pm

No, I do agree that it's a problem. I just don't like the proposed 
no-merge fix ;-).

		Pekka
--
To: Pekka Enberg <penberg@...>
Cc: Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 5:15 pm

Actually, I don't think this is SLAB-specific, since there are potentially 
the same issues for pages.

I suspect the right thing to do is not to mark them for "IO", but mark 
them for "short-lived", and allow short-lived allocations that don't have 
extended lifetimes to succeed even when a "real" allocation wouldn't.

When we are under memory pressure, a normal allocation generally needs to 
clear up enough memory from elsewhere to succeed. But a short-lived 
allocation without any long-term lifetimes would be known to release its 
memory back to the pool, so it can be allowed to go ahead when a normal 
memory allocation would not.

Examples of short-lived allocations:
 - IO requests
 - temporary network packets that don't get queued up (eg "ACK" packet) as 
   opposed to those that do get queued (incoming _or_ outgoing queues)
 - things like the temporary buffers for "getname()/putname()" etc.

That said, even a lot of temporary allocations can at times have issues. 
If your IO layer is dead, your IO request queues that _should_ have been 
very temporary may end up staying around for a long time. But I do think 
that it makes sense to prioritize allocations that are known to be short- 
lived.

(It might also allows us to allocate them from different pools and just 
generally have other heuristics for cache behaviour - short-lived 
allocations simply have different behaviour)

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 5:34 pm

Hi Linus,


Yeah, makes sense. We do have GFP_TEMPORARY so we could associate this 
new semantics with that. But the real problem here is how to do the 
"allocate harder" part which, btw, sounds very similar to what Peter's 
kmalloc reserve patches try to do...

		Pekka
--
To: Linus Torvalds <torvalds@...>
Cc: Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 5:39 pm

Actually, a trivial way to implement that is to have a few "emergency 
kmalloc" caches say for sizes 64, 128, 256, and 512 that have some 
pre-allocated pages into which these GFP_TEMPORARY allocations are 
allowed to dip into on OOM and OOM only.

		Pekka
--
To: Linus Torvalds <torvalds@...>
Cc: Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 6:05 pm

So something like the following (totally untested) patch modulo the 
pre-allocation bits.

		Pekka

diff --git a/mm/slub.c b/mm/slub.c
index acc975f..e8f7cf3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -264,6 +264,67 @@ static inline void stat(struct kmem_cache_cpu *c, enum stat_item si)
 #endif
 }
 
+/*
+ * Emergency caches are for GFP_TEMPORARY allocations to dip into when we're
+ * OOM to make sure we can make some progress for writeback.
+ */
+
+#define MIN_EMERGENCY_SIZE 32
+#define NR_EMERGENCY_CACHES 4
+
+struct kmem_cache *emergency_caches[NR_EMERGENCY_CACHES];
+
+static void init_emergency_caches(void)
+{
+	unsigned long size = MIN_EMERGENCY_SIZE;
+	int i;
+
+	for (i = 0; i &lt; NR_EMERGENCY_CACHES; i++) {
+		struct kmem_cache *cache;
+		char *name;
+
+		name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 &lt;&lt; i);
+		BUG_ON(!name);
+
+		cache = kmem_cache_create(name, size, 0, 0, NULL);
+		BUG_ON(!cache);
+		kfree(name);
+
+		emergency_caches[i] = cache;
+
+		size *= 2;
+	}
+}
+
+static struct kmem_cache *lookup_emergency_cache(unsigned long size)
+{
+	unsigned long cache_size = MIN_EMERGENCY_SIZE;
+	int i;
+
+	for (i = 0; i &lt; NR_EMERGENCY_CACHES; i++) {
+		if (size &lt; cache_size)
+			return emergency_caches[i];
+
+		cache_size *= 2;
+	}
+	return NULL;
+}
+
+static void *emergency_alloc(struct kmem_cache *cache, gfp_t gfp)
+{
+	struct kmem_cache *emergency_cache;
+	void *p;
+
+	emergency_cache = lookup_emergency_cache(cache-&gt;objsize);
+	if (!emergency_cache)
+		return NULL;
+
+	p = kmem_cache_alloc(emergency_cache, gfp);
+	if (p &amp;&amp; cache-&gt;ctor)
+		cache-&gt;ctor(cache, p);
+	return p;
+}
+
 /********************************************************************
  * 			Core slab cache functions
  *******************************************************************/
@@ -1528,6 +1589,9 @@ new_slab:
 		goto load_freelist;
 	}
 
+	if ((gfpflags &amp; GFP_TEMPORARY) == GFP_TEMPORARY)
+		return emergency_alloc(s, gfpf...
To: Pekka J Enberg <penberg@...>
Cc: Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 6:17 pm

Hmm. I didn't check, but won't this cause problems on the freeing path 
when we call kmem_cache_free() on the result but with the wrong "struct 
kmem_cache" pointer?

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 4:42 pm

So something like this fugly patch on top of the SLUB variable order 
patches that let the allocator fall-back to smaller page orders. Survives 
OOM in my testing and the box seems to be bit more responsive even under X 
with this.

		Pekka

---
 mm/slub.c |   93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 92 insertions(+), 1 deletion(-)

Index: slab-2.6/mm/slub.c
===================================================================
--- slab-2.6.orig/mm/slub.c	2008-04-08 22:52:02.000000000 +0300
+++ slab-2.6/mm/slub.c	2008-04-08 23:27:27.000000000 +0300
@@ -1549,6 +1549,88 @@
 }
 
 /*
+ * Emergency caches are reserved for GFP_TEMPORARY allocations on OOM.
+ */
+
+#define MIN_EMERGENCY_SIZE 32
+#define NR_EMERGENCY_CACHES 4
+
+struct kmem_cache *emergency_caches[NR_EMERGENCY_CACHES];
+
+static void pre_alloc_cpu_slabs(struct kmem_cache *s)
+{
+	int cpu;
+
+	/*
+	 * FIXME: CPU hot-plug, stats, debug?
+	 */
+	for_each_online_cpu(cpu) {
+		struct kmem_cache_cpu *c;
+		struct page *new;
+		void **object;
+		int node;
+
+		c = get_cpu_slab(s, cpu);
+		node = cpu_to_node(cpu);
+
+		new = new_slab(s, GFP_KERNEL, node);
+		BUG_ON(!new);
+
+		slab_lock(new);
+		SetSlabFrozen(new);
+		c-&gt;page = new;
+		object = c-&gt;page-&gt;freelist;
+		c-&gt;freelist = object[c-&gt;offset];
+		c-&gt;page-&gt;inuse = c-&gt;page-&gt;objects;
+		c-&gt;page-&gt;freelist = NULL;
+		c-&gt;node = page_to_nid(c-&gt;page);
+		slab_unlock(c-&gt;page);
+	}
+}
+
+static void init_emergency_caches(void)
+{
+	unsigned long size = MIN_EMERGENCY_SIZE;
+	int i;
+
+	for (i = 0; i &lt; NR_EMERGENCY_CACHES; i++) {
+		struct kmem_cache *cache;
+		char *name;
+
+		name = kasprintf(GFP_KERNEL, "kmalloc-%d", 1 &lt;&lt; i);
+		BUG_ON(!name);
+
+		cache = kmem_cache_create(name, size, 0, 0, NULL);
+		BUG_ON(!cache);
+		kfree(name);
+
+		pre_alloc_cpu_slabs(cache);
+		emergency_caches[i] = cache;
+
+		size *= 2;
+	}
+}
+
+static void *emergency...
To: Pekka J Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 4:45 pm

Hmmmm... Peter has the most experience with these issues. Maybe the best 
would be to have this sort of logic in a more general way in the page 
allocator? Similar issues surely exist with the page allocator and a fix 
there would fix it for all users.

--
To: Christoph Lameter <clameter@...>
Cc: Linus Torvalds <torvalds@...>, Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 5:11 pm

This needs some support in the slab allocator anyway. Keep in mind that 
the patch is specifically addressing writeback in OOM conditions so we 
must (1) prioritize GFP_TEMPORARY allocations over everyone else (which 
just get NULL) and (2) use the remaining available memory as efficiently 
as possible for _all_ GFP_TEMPORARY allocations.

Peter is, however, bringing up a good point that my patch doesn't 
actually _guarantee_ anything so I'm still wondering if this approach 
makes any sense... But I sure do like Linus' ideas of marking 
short-lived allocations and trying harder for them in OOM.

			Pekka
--
To: Pekka Enberg <penberg@...>
Cc: Christoph Lameter <clameter@...>, Linus Torvalds <torvalds@...>, Hugh Dickins <hugh@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 5:40 pm

Also, this scheme so far does not provide for a means to detect the end
of pressure situation.

I need both triggers, enter pressure and exit pressure. Enter pressure
is easy enough, that's when normal allocations start failing. Exit
pressure however is more difficult - that is basically when allocations
start succeeding again. You'll see that my patches basically always
attempt a regular allocation as long as we're in the emergency state.

Also, the requirement for usage of emergency memory (GFP_MEMALLOC,
PF_MEMALLOC) is that it will be freed without external conditions. So
while it might be delayed for a while (it might sit in the fragment
assembly cache for a while) it doesn't need any external input to get
freed again:
  - it will either get reclaimed from this cache;
  - it will exit the cache as a full packet and:
    - get dropped, or
    - get processed.



--
To: Linus Torvalds <torvalds@...>
Cc: Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 4:44 pm

...uh-oh hand over the paper bag. I have no GFP_TEMPORARY allocations 
going on...
--
To: Linus Torvalds <torvalds@...>
Cc: Hugh Dickins <hugh@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 6:42 pm

Aah, yes. I was thinking of kmalloc() here for which it works as 
expected because kfree() will return the page to the proper cache. But 
we can relax the rules of kmem_cache_free() a bit to make this work (but 
perhaps add a WARN_ON() there if cache doesn't match page-&gt;slab).

			Pekka
--
To: Hugh Dickins <hugh@...>
Cc: Christoph Lameter <clameter@...>, James Bottomley <James.Bottomley@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Pekka Enberg <penberg@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 4:47 pm

Right, but I doubt we'd ever get something like that merged though -
esp. as it will basically destroy the SLUB fast-path.

SLAB allocation fairness:
  http://lkml.org/lkml/2007/1/16/61

I abandoned this approach because it was too expensive; it was reduced
to the ALLOC_NO_WATERMARKS state transition. Which is much more unlikely
to happen and it's generally accepted we're in a slow path once we
really dive so low into the reserves.

The latest posting:
  http://lkml.org/lkml/2008/3/20/214


--
To: <hugh@...>
Cc: <James.Bottomley@...>, <torvalds@...>, <akpm@...>, <fujita.tomonori@...>, <jens.axboe@...>, <clameter@...>, <penberg@...>, <a.p.ziljstra@...>, <rjw@...>, <linux-kernel@...>
Date: Sunday, April 6, 2008 - 10:48 pm

On Sun, 6 Apr 2008 23:56:57 +0100 (BST)


Reverting the separate sense_slab is fine for now but we need the
separation shortly anyway. We need to support larger sense buffer (260
bytes). The current 96 byte sense buffer works for the majority of us,

As you and James agreed, the patch in scsi-misc looks good to me.


Thanks for finding this bug.
--
To: FUJITA Tomonori <fujita.tomonori@...>
Cc: <James.Bottomley@...>, <torvalds@...>, <akpm@...>, <jens.axboe@...>, <clameter@...>, <penberg@...>, <a.p.zijlstra@...>, <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 2:07 pm

No, it's brought attention to this interesting slab merge issue;

I don't believe you _need_ a separate sense_slab even for that:
what I meant was that you just need something like
	pool-&gt;cmd_slab = kmem_cache_create(pool-&gt;cmd_name,
					   cache_line_align(
					   sizeof(struct scsi_cmnd)) +
					   max_scsi_sense_buffersize,
					   0, pool-&gt;slab_flags, NULL);
then point cmd-&gt;sense_buffer to (unsigned char *) cmd +
			cache_line_align(sizeof(struct scsi_cmnd));
where cache_line_align and max_scsi_sense_buffersize are preferably
determined at runtime.

Now, it may well be that over the different configurations, at least
some would waste significant memory by putting it all in the one big
buffer, and you're better off with the separate slabs: so I didn't
want to interfere with your direction on that.

Hugh
--
To: <hugh@...>
Cc: <fujita.tomonori@...>, <James.Bottomley@...>, <torvalds@...>, <akpm@...>, <jens.axboe@...>, <clameter@...>, <penberg@...>, <a.p.zijlstra@...>, <rjw@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 10:04 am

On Mon, 7 Apr 2008 19:07:56 +0100 (BST)

Yeah, seems that it led to an interesting discussion (using cache
behavior like ephemeral sounds useful, I think) though surely this is

Yes, if we have only 96 and 260 bytes sense buffers, it would be a
solution. Well, evne if we have various length sense buffers, we can
have a pool per driver (or device, scsi_host, etc).

Another reason why we separated them is that we could allocate a sense
buffer only when it's necessary (though I'm not sure we will do it

Yes, this was about wasting memory (with the one big buffer)
vs. overheads due to allocating two buffers. After some performance
tests, we chose the latter but we might change this again in the
future.
--
To: Hugh Dickins <hugh@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Christoph Lameter <clameter@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.ziljstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Sunday, April 6, 2008 - 7:35 pm

This was sort of accidentally fixed in scsi-misc by commit 

commit c5f73260b289cb974928eac05f2d84e58ddfc020
Author: James Bottomley &lt;James.Bottomley@HansenPartnership.com&gt;
Date:   Thu Mar 13 11:16:33 2008 -0500

    [SCSI] consolidate command allocation in a single place

Could you check that:

master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git

and see if it alleviates the problem? ... if so, we can work out which
pieces to backport.

Thanks,

James


--
To: James Bottomley <James.Bottomley@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Christoph Lameter <clameter@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.zijlstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Sunday, April 6, 2008 - 9:01 pm

Precisely that patch seems appropriate to 2.6.25-rc8-git, so I'm now
running the test with just that applied to 2.6.25-rc8 (plus cfq rcu
fix).  Not quite what you asked, but...

Strictly speaking, it'd take a couple of days to be reasonably sure
that the livelock is gone (it appeared to reproduce quicker once I
moved to -rc8 plus cfq rcu fix; but I'm not entirely convinced that
wasn't just coincidence).  But if nothing bad appears overnight,
let's assume your patch is the one to push: I'll report tomorrow.

Hugh

--
To: James Bottomley <James.Bottomley@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Christoph Lameter <clameter@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.zijlstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 1:51 pm

Right, as expected, that seems to run fine: I've seen no problem with
it in 17 hours - beyond, of course, all the page allocation failures
which will plague such tests until something else is changed.

Somewhat irrelevant since Linus put my patch into 2.6.25-rc8-git
(probably worth it for the various issues noted in the comment);
but reassuring for 2.6.26.

I'll stop that test now and put it to work on something else.

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Christoph Lameter <clameter@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.zijlstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 2:04 pm

Well ... just remember that to merge scsi-misc with what Linus has now
done, I'm effectively reversing your patch, so testing current scsi-misc
is very valuable if you want the problem not to recur in 2.6.26 ...

James


--
To: James Bottomley <James.Bottomley@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, FUJITA Tomonori <fujita.tomonori@...>, Jens Axboe <jens.axboe@...>, Christoph Lameter <clameter@...>, Pekka Enberg <penberg@...>, Peter Zijlstra <a.p.zijlstra@...>, Rafael J. Wysocki <rjw@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 2:26 pm

Exactly.

Hugh
--
Previous thread: [GIT PATCHES] V4L/DVB new USB ID's for 2.6.25-rc8 by Mauro Carvalho Chehab on Sunday, April 6, 2008 - 6:19 pm. (1 message)

Next thread: Re: how to disable partition search? by Ferenc Wagner on Sunday, April 6, 2008 - 6:25 pm. (1 message)
speck-geostationary