Re: [PATCH 0/5] make slab gfp fair

Previous thread: [PATCH 1/5] mm: page allocation rank by Peter Zijlstra on Monday, May 14, 2007 - 9:19 am. (1 message)

Next thread: [PATCH 3/5] mm: slub allocation fairness by Peter Zijlstra on Monday, May 14, 2007 - 9:19 am. (4 messages)
To: <linux-kernel@...>, <linux-mm@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Christoph Lameter <clameter@...>, Matt Mackall <mpm@...>
Date: Monday, May 14, 2007 - 9:19 am

In the interest of creating a reserve-based allocator, we need to make the slab
allocator (*sigh*, all three) fair with respect to GFP flags.

That is, we need to protect memory from being used by easier gfp flags than it
was allocated with. If our reserve is placed below GFP_ATOMIC, we do not want a
GFP_KERNEL allocation to walk away with it - a scenario that is perfectly
possible with the current allocators.


-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Matt Mackall <mpm@...>
Date: Wednesday, May 16, 2007 - 11:02 pm

And the solution is to fail the allocation of the process which tries to 
walk away with it. The failing allocation will lead to the killing of the 
process right?

We already have an OOM killer which potentially kills random processes. We 
hate it.

Could you please modify the patchset to *avoid* failure conditions. This 
patchset here only manages failure conditions. The system should not get 
into the failure conditions in the first place! For that purpose you may 
want to put processes to sleep etc. But in order to do so you need to 
figure out which processes you need to make progress.



-
To: Christoph Lameter <clameter@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Matt Mackall <mpm@...>
Date: Thursday, May 17, 2007 - 3:08 am

Not necessarily, we have this fault injection system that can fail

Those that have __GFP_WAIT set will go to sleep - or do whatever
__GFP_WAIT allocations do best; the other allocations must handle
failure anyway. (even __GFP_WAIT allocations must handle failure for
that matter)

I'm really not seeing why you're making such a fuss about it; normally
when you push the system this hard we're failing allocations left, right
and center too. It's just that the block IO path has some mempools which
allow it to write out some (swap) pages and slowly get back to sanity.

This really is not much different; the system is in dire need of
memory; those allocations that cannot sleep will fail, simple.

All I'm wanting to do is limit the reserve to PF_MEMALLOC processes;
those that are in charge of cleaning memory; not every other random
process that just wants to do its thing - that doesn't seem like a weird
thing to do at all.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Matt Mackall <mpm@...>
Date: Thursday, May 17, 2007 - 1:29 pm

I am weirdly confused by these patches. Among other things you told me 
that the performance does not matter since it's never (or rarely) being 
used (why do it then?). Then we do these strange swizzles with reserve 
slabs that may contain an indeterminate number of objects.

-
To: Christoph Lameter <clameter@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 1:53 pm

Because it's a failsafe.

Simply stated, the problem is sometimes it's impossible to free memory
without allocating more memory. Thus we must keep enough protected
reserve that we can guarantee progress. This is what mempools are for
in the regular I/O stack. Unfortunately, mempools are a bad match for
network I/O.

It's absolutely correct that performance doesn't matter in the case
this patch is addressing. All that matters is digging ourselves out of
OOM. The box either survives the crisis or it doesn't.

It's also correct that we should hardly ever get into a situation
where we trigger this problem. But such cases are still fairly easy to
trigger in some workloads. Swap over network is an excellent example,
because we typically don't start swapping heavily until we're quite
low on freeable memory.

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Matt Mackall <mpm@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 2:02 pm

Well we fail allocations in order to do so and these allocations may be 

Is it not possible to avoid failing allocs? Instead put processes to 
sleep? Run synchronous reclaim?

-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 3:18 pm

These allocations didn't have right to the memory they would otherwise
get. Also they will end up in the page allocator just like they normally
would. So from that point, it's no different than what happens now; only
they will not eat away the very last bit of memory that could be used to

That would radically change the way we do reclaim and would be much
harder to get right. Such things could be done independently of this.

The proposed patch doesn't change how the kernel functions at this
point; it just enforces an existing rule better.



-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 3:24 pm

Well I'd say it controls the allocation failures. And that only works if 
one can consider the system having a single zone.

Let's say the system has two cpusets A and B. A allocs from node 1 and B 
allocs from node 2. Two processes one in A and one in B run on the same 
processor.

Node 1 gets very low in memory so your patch kicks in and sets up the 
global memory emergency situation with the reserve slab.

Now the process in B will either fail although it has plenty of memory on 
node 2.

Or it may just clear the emergency slab and then the next critical alloc 
of the process in A that is low on memory will fail.
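
The two-cpuset scenario above can be sketched as a toy model. This is not
kernel code: struct toy_system, the single global_emergency flag and both
allocation helpers are hypothetical names made up for illustration, and the
nodes are numbered from 0 (index 0 stands for "node 1", index 1 for "node 2").
It only contrasts a single global emergency state with a per-node check.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 2

/* Hypothetical toy model of the scenario, not the kernel's data structures. */
struct toy_system {
	long free_pages[NR_NODES];	/* free pages per node */
	long min_pages[NR_NODES];	/* per-node low-memory threshold */
	bool global_emergency;		/* one flag shared by every node */
};

/* Any single node dropping below its threshold trips the global flag. */
static void toy_update_emergency(struct toy_system *sys)
{
	sys->global_emergency = false;
	for (int n = 0; n < NR_NODES; n++)
		if (sys->free_pages[n] < sys->min_pages[n])
			sys->global_emergency = true;
}

/* Policy 1: refuse non-entitled callers whenever the global flag is set. */
static bool toy_alloc_global(struct toy_system *sys, int node, bool entitled)
{
	assert(node >= 0 && node < NR_NODES);
	toy_update_emergency(sys);
	if (sys->global_emergency && !entitled)
		return false;
	if (sys->free_pages[node] == 0)
		return false;
	sys->free_pages[node]--;
	return true;
}

/* Policy 2: look only at the node the caller is restricted to. */
static bool toy_alloc_per_node(struct toy_system *sys, int node, bool entitled)
{
	assert(node >= 0 && node < NR_NODES);
	bool node_low = sys->free_pages[node] < sys->min_pages[node];

	if (node_low && !entitled)
		return false;
	if (sys->free_pages[node] == 0)
		return false;
	sys->free_pages[node]--;
	return true;
}
```

With index 0 nearly empty and index 1 full, a non-entitled caller bound to
index 1 is refused under the global policy but served under the per-node one,
which is the first failure mode described above.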


-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 5:26 pm

The way I read the cpuset page allocator, it will only respect the
cpuset if there is memory aplenty. Otherwise it will grab whatever. So
still, it will only ever use ALLOC_NO_WATERMARKS if the whole system is
in distress.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 6:27 pm

Sorry no. The purpose of the cpuset is to limit memory for an application. 
If the boundaries would be fluid then we would not need cpusets.

But the same principles also apply to allocations from different zones in an 
SMP system. There are 4 zones (DMA, DMA32, NORMAL and HIGHMEM) and we have 
general slabs for DMA and NORMAL. A slab that uses zone NORMAL falls back 
to DMA32 and DMA depending on the watermarks of the 3 zones. So a 
ZONE_NORMAL slab can exhaust memory available for ZONE_DMA.
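
A minimal sketch of that fallback, under stated assumptions: the toy_* names,
the single watermark per zone and the "stay at or above the watermark" test
are all invented for illustration and much simpler than the real page
allocator. It shows a NORMAL request draining the DMA zone until a DMA-only
request has nothing left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zone identifiers for the toy model. */
enum toy_zone { TOY_DMA, TOY_DMA32, TOY_NORMAL, TOY_NR_ZONES };

struct toy_zone_state {
	long free_pages;
	long watermark;	/* allocation must leave at least this many pages */
};

/* A NORMAL request may fall back to DMA32 and then DMA. */
static const enum toy_zone toy_normal_zonelist[] = { TOY_NORMAL, TOY_DMA32, TOY_DMA };
/* A DMA request can only ever use the DMA zone. */
static const enum toy_zone toy_dma_zonelist[] = { TOY_DMA };

/*
 * Take one page from the first zone in the list that would stay at or
 * above its watermark. Returns the zone used, or -1 if none qualifies.
 */
static int toy_alloc_page(struct toy_zone_state *zones,
			  const enum toy_zone *zonelist, size_t len)
{
	assert(zones != NULL && zonelist != NULL);
	for (size_t i = 0; i < len; i++) {
		struct toy_zone_state *z = &zones[zonelist[i]];

		if (z->free_pages - 1 >= z->watermark) {
			z->free_pages--;
			return (int)zonelist[i];
		}
	}
	return -1;
}
```

Which zone's watermark "counts" for a given slab therefore depends on where
the walk ends up, which is the question raised in the next paragraph.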

Again the question is the watermarks of which zone? In case of the 
ZONE_NORMAL allocation you have 3 to pick from. Is it the last one? Then it's 
the same as ZONE_DMA, and you get a collision with the corresponding
DMA slab. Depending on which zone the system decides to allocate the 
page from, you may get a different watermark situation.

On x86_64 systems you have the additional complication that there are 
even multiple DMA32 or NORMAL zones per node. Some will have DMA32 and 
NORMAL, others DMA32 alone or NORMAL alone. Which watermarks are we 
talking about?

The use of ALLOC_NO_WATERMARKS depends on the constraints of the allocation 
in all cases. You can only compare the stresslevel (rank?) of allocations 
that have the same allocation constraints. The allocation constraints are
a result of gfp flags, cpuset configuration and memory policies in effect.




-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Friday, May 18, 2007 - 5:54 am

Right, I see that I missed an ALLOC_CPUSET yesterday; but like Paul
said, cpusets are ignored when in dire straits for a kernel alloc.

Just not enough to make inter-cpuset interaction on slabs go away wrt

Isn't the zone mask the same for all allocations from a specific slab?
If so, then the slab wide ->reserve_slab will still dtrt (barring

Watermarks like used by the page allocator given the slabs zone mask.
The page allocator will only fall back to ALLOC_NO_WATERMARKS when all

The gfp zone mask is constant per slab, no? It has to, because the zone
mask is only used when the slab is extended, other allocations live off

Yes, I see now that these might become an issue, I will have to think on
this.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Friday, May 18, 2007 - 1:11 pm

All allocations from a single slab have the same set of allowed types of 
zones. I.e. a DMA slab can access only ZONE_DMA a regular slab 

That works if zones do not vary between slab requests. So on SMP (without 

The gfp zone mask is used to select the zones in an SMP config. But not in 
a NUMA configuration; there the zones can come from multiple nodes.

Ok in an SMP configuration the zones are determined by the allocation 
flags. But then there are also the gfp flags that influence reclaim 
behavior. These also have an influence on the memory pressure.

These are

__GFP_IO
__GFP_FS
__GFP_NOMEMALLOC
__GFP_NOFAIL
__GFP_NORETRY
__GFP_REPEAT

An allocation that can call into a filesystem or do I/O will have much 
less memory pressure to contend with. Are the ranks for an allocation
with __GFP_IO|__GFP_FS really comparable with an allocation that does not 

Note that we have not yet investigated what weird effect memory policy 
constraints can have on this. There are issues with memory policies only 
applying to certain zones.....
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Paul Jackson <pj@...>
Date: Sunday, May 20, 2007 - 4:39 am

Ok, full reset.

I care about kernel allocations only. In particular about those that
have PF_MEMALLOC semantics.

The thing I need is that any memory allocated below
  ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
is only ever used by processes that have ALLOC_NO_WATERMARKS rights;
for the duration of the distress.

What this patch does:
 - change the page allocator to try ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER
   if ALLOC_NO_WATERMARKS, before the actual ALLOC_NO_WATERMARKS alloc

 - set page->reserve nonzero for each page allocated with
   ALLOC_NO_WATERMARKS; which by the previous point implies that all
   available zones are below ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER

 - when a page->reserve slab is allocated store it in s->reserve_slab
   and do not update the ->cpu_slab[] (this forces subsequent allocs to
   retry the allocation).

All ALLOC_NO_WATERMARKS enabled slab allocations are served from
->reserve_slab, up until the point where a !page->reserve slab alloc
succeeds, at which point the ->reserve_slab is pushed into the partial
lists and ->reserve_slab set to NULL.

Since only the allocation of a new slab uses the gfp zone flags, and
other allocations placement hints they have to be uniform over all slab
allocs for a given kmem_cache. Thus the s->reserve_slab/page->reserve
status is kmem_cache wide.
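
The first two points (try the lowest regular watermark first, then mark any
page obtained without watermarks as reserve) can be sketched roughly as
follows. This is a toy model, not the patch itself: struct toy_zone,
struct toy_page, toy_lowest_watermark() and the halving/quartering
arithmetic are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a zone and for the bit of page state we need. */
struct toy_zone {
	long free_pages;
	long min_pages;	/* the zone's minimum watermark */
};

struct toy_page {
	int zone;	/* index of the zone the page came from */
	bool reserve;	/* true if it was taken without watermarks */
};

/*
 * Lowest watermark a regular allocation may reach in this model:
 * the "high" bit drops half of min, the "harder" bit a further quarter.
 */
static long toy_lowest_watermark(long min_pages)
{
	long min = min_pages;

	min -= min / 2;	/* models the effect of ALLOC_HIGH */
	min -= min / 4;	/* models the effect of ALLOC_HARDER */
	return min;
}

static bool toy_alloc_page(struct toy_zone *zones, size_t nr,
			   bool no_watermarks, struct toy_page *out)
{
	assert(zones != NULL && out != NULL);

	/* Stage 1: everyone first tries at the lowest regular watermark. */
	for (size_t i = 0; i < nr; i++) {
		if (zones[i].free_pages > toy_lowest_watermark(zones[i].min_pages)) {
			zones[i].free_pages--;
			out->zone = (int)i;
			out->reserve = false;
			return true;
		}
	}

	if (!no_watermarks)
		return false;

	/*
	 * Stage 2: entitled callers only. Getting here means every zone
	 * failed stage 1, so the page is flagged as reserve memory.
	 */
	for (size_t i = 0; i < nr; i++) {
		if (zones[i].free_pages > 0) {
			zones[i].free_pages--;
			out->zone = (int)i;
			out->reserve = true;
			return true;
		}
	}
	return false;
}
```

In this model a page can only come back with reserve set when all zones in
the walked list are below the lowest regular watermark, which is the
implication the second point relies on.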

Any holes left?

---

Index: linux-2.6-git/mm/internal.h
===================================================================
--- linux-2.6-git.orig/mm/internal.h
+++ linux-2.6-git/mm/internal.h
@@ -12,6 +12,7 @@
 #define __MM_INTERNAL_H
 
 #include <linux/mm.h>
+#include <linux/hardirq.h>
 
 static inline void set_page_count(struct page *page, int v)
 {
@@ -37,4 +38,50 @@ static inline void __put_page(struct pag
 extern void fastcall __init __free_pages_bootmem(struct page *page,
 						unsigned int order);
 
+#define ALLOC_HARDER		0x01 /* try to alloc harder */
+#define ALLOC_HIGH		0x02 /* __GFP_HIGH set */
+#define ALLOC_WMARK_M...
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Paul Jackson <pj@...>, <npiggin@...>
Date: Monday, May 21, 2007 - 12:45 pm

Ok that adds a new field to the page struct. I suggested a page flag in 


So the original issue is still not fixed. A slab alloc may succeed without
watermarks if that particular allocation is restricted to a different set 
of nodes. Then the reserve slab is dropped despite the memory scarcity on

No the gfp zone flags are not uniform and placement of page allocator 
allocs through SLUB do not always have the same allocation constraints.
SLUB will check the node of the page that was allocated when the page 
allocator returns and put the page into that nodes slab list. This varies
depending on the allocation context.

Allocations can be particular to uses of a slab in particular situations. 
A kmalloc cache can be used to allocate from various sets of nodes in 
different circumstances. kmalloc will allow serving a limited number of 
objects from the wrong nodes for performance reasons but the next 
allocation from the page allocator (or from the partial lists) will occur 
using the current set of allowed nodes in order to ensure a rough 
obedience to the memory policies and cpusets. kmalloc_node behaves 
differently and will enforce using memory from a particular node.

SLAB is very strict in that area and will not allow serving objects from 
the wrong node even with only kmalloc. Changing policy will immediately 
change the per node queue that SLAB takes its objects from.
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Paul Jackson <pj@...>, <npiggin@...>
Date: Monday, May 21, 2007 - 3:33 pm

No it doesn't; it overloads page->index. It's just used as extra return

I can't see how. This extra ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER alloc will
first deplete all other zones. Once that starts failing no node should
still have pages accessible by any allocation context other than

It has to; since it can serve the allocation from a pre-existing slab

Yes, it keeps slabs on per node lists. I'm just not seeing how this puts
hard constraints on the allocations.

As far as I can see there cannot be a hard constraint here, because
allocations from interrupt context are at best node local. And node
hit it with PF_MEMALLOC. If the page allocation doesn't use ALLOC_CPUSET
the page can come from pretty much anywhere.


-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Paul Jackson <pj@...>, <npiggin@...>
Date: Monday, May 21, 2007 - 3:43 pm

The constraints come from the context of memory policies and cpusets. See

Interrupt context is something different. If we do not have a process 
context then no cpuset and memory policy constraints can apply since we
have no way of determining that. If you restrict your use of the reserve 

No it cannot. Once the current cpuslab is exhausted (which can be anytime) 
it will enforce the contextual allocation constraints. See 
get_any_partial() in slub.c.
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Paul Jackson <pj@...>, <npiggin@...>
Date: Monday, May 21, 2007 - 4:08 pm

mempolicy code.

Note that disobeying these constraints is not new behaviour. PF_MEMALLOC

Say the slab gets allocated by an allocation from interrupt context; no
cpuset, no policy. This same slab must be valid for whatever allocation
comes next, right? Regardless of whatever policy or GFP_ flags are in

but get_partial() will only be called if the cpu_slab is full, up until

No, what I'm saying is that if the slab gets refilled from interrupt
context the next process context alloc will have to work with whatever

If it finds no partial slabs it goes back to the page allocator; and
when you allocate a page under PF_MEMALLOC and the normal allocations
are exhausted it takes a page from pretty much anywhere.


-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Paul Jackson <pj@...>, <npiggin@...>
Date: Monday, May 21, 2007 - 4:32 pm

In an interrupt context we do not have a process context. But there is

Yes sure if we do not have a context then no restrictions originating 
there can be enforced. So you want to restrict the logic now to

Correct. That is an optimization but it may be called anytime from the 
perspective of an execution thread and that may cause problems with your 

It will work with whatever was left behind in the case of SLUB and a 
kmalloc alloc (optimization there). It won't if it's SLAB (which is 
stricter) or a kmalloc_node alloc. A kmalloc_node alloc will remove the 

If it finds no partial slab then it will go to the page allocator which 
will allocate given the current contextual alloc constraints. In the case 
of a memory policy we may have limited the allocations to a single node 
where there is no escape (the zonelist does *not* contain zones of other 
nodes). The only chance to bypass this is by only dealing with allocations 
during interrupt that have no allocation context.
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Paul Jackson <pj@...>, <npiggin@...>
Date: Monday, May 21, 2007 - 4:54 pm

I'm not seeing how this would interfere; if the alloc can be handled


Ah, this is the point I was missing; I assumed each zonelist would
always include all zones, but would just continue/break the loop using
things like cpuset_zone_allowed_*().

This might indeed foil the game.

I could 'fix' this by doing the PF_MEMALLOC allocation from the regular
node zonelist instead of from the one handed down....

/me thinks out loud.. since direct reclaim runs in whatever process
context was handed out we're stuck with whatever policy we started from;
but since the allocations are kernel allocs - not userspace allocs, and
we're in dire straits, it makes sense to violate the task's restraints
in order to keep the machine up.

memory policies are the only ones with 'short' zonelists, right? CPU
sets are on top of whatever zonelist is handed out, and the normal

But you just said that interrupts are not exempt from memory policies,
and policies are the only ones that have 'short' zonelists. /me
confused.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Paul Jackson <pj@...>, <npiggin@...>
Date: Monday, May 21, 2007 - 5:04 pm

I wonder if this makes any sense at all given that the only point of 

The memory policy constraints may have been set up to cage in an 
application. It was set up to *stop* the application from using memory on 
other nodes. If you now allow that then the semantics of memory policies
are significantly changed. The cpuset constraints are sometimes not that 


No I said that in an interrupt allocation we have no process context and 
therefore no cpuset or memory policy context. Thus no policies or cpusets
are applied to an allocation. You can allocate without restrictions.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <clameter@...>, <mpm@...>, <linux-kernel@...>, <linux-mm@...>, <tgraf@...>, <davem@...>, <akpm@...>, <phillips@...>, <penberg@...>
Date: Friday, May 18, 2007 - 1:11 pm

No - most kernel allocations never ignore cpusets.

The ones marked NOFAIL or ATOMIC can ignore cpusets in dire straits
and the ones off interrupts lack an applicable cpuset context.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <clameter@...>, <mpm@...>, <linux-kernel@...>, <linux-mm@...>, <tgraf@...>, <davem@...>, <akpm@...>, <phillips@...>, <penberg@...>
Date: Thursday, May 17, 2007 - 5:44 pm

Wrong.  Well, only a little right.

For allocations that can't fail (the kernel could die if it failed)
then yes, the kernel will eventually take any damn page it can find,
regardless of cpusets.

Allocations for user space are hardwall enforced to be in the current
task's cpuset.

Allocations off interrupts ignore the current task's cpuset (such allocations
don't have a valid current context.)

Most kernel space allocations will try to fit in the
current task's cpuset, but may come from the possibly larger context of
the closest ancestor cpuset that is marked memory_exclusive.
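
The four cases above can be summarized as a small lookup. The enum names and
the toy_cpuset_scope() helper are hypothetical and exist only to restate the
rules in one place; the real cpuset code makes this decision from gfp flags
and calling context rather than from an explicit "kind" argument.

```c
#include <assert.h>

/* Hypothetical classification of an allocation request. */
enum toy_alloc_kind {
	TOY_ALLOC_USER,		/* memory for user space */
	TOY_ALLOC_KERNEL,	/* ordinary kernel allocation */
	TOY_ALLOC_INTERRUPT,	/* no valid current task context */
	TOY_ALLOC_CANNOT_FAIL,	/* kernel could die if this fails */
};

/* Widest set of memory the request may eventually be served from. */
enum toy_cpuset_scope {
	TOY_SCOPE_CURRENT_CPUSET,	/* hardwall: current task's cpuset only */
	TOY_SCOPE_EXCLUSIVE_ANCESTOR,	/* nearest memory_exclusive ancestor */
	TOY_SCOPE_ANY_NODE,		/* cpusets not applied */
};

static enum toy_cpuset_scope toy_cpuset_scope(enum toy_alloc_kind kind)
{
	switch (kind) {
	case TOY_ALLOC_USER:
		return TOY_SCOPE_CURRENT_CPUSET;
	case TOY_ALLOC_KERNEL:
		/* Tries the current cpuset first, may widen to the ancestor. */
		return TOY_SCOPE_EXCLUSIVE_ANCESTOR;
	case TOY_ALLOC_INTERRUPT:
		/* No task context, so no cpuset to enforce. */
		return TOY_SCOPE_ANY_NODE;
	case TOY_ALLOC_CANNOT_FAIL:
		/* Eventually takes any page it can find. */
		return TOY_SCOPE_ANY_NODE;
	}
	assert(0 && "unknown allocation kind");
	return TOY_SCOPE_CURRENT_CPUSET;
}
```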

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-
To: Christoph Lameter <clameter@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Matt Mackall <mpm@...>
Date: Thursday, May 17, 2007 - 1:52 pm

When we are very low on memory and do access the reserves by means of
ALLOC_NO_WATERMARKS, we want to avoid processes that are not entitled to
use such memory from running away with the little we have.

That is the whole and only point; restrict memory allocated under
ALLOC_NO_WATERMARKS to those processes that are entitled to it.


-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Matt Mackall <mpm@...>
Date: Thursday, May 17, 2007 - 1:59 pm

For me low memory conditions are node or zone specific and may be 
particular to certain allocation constraints. For some reason you have 
this simplified global picture in mind.

The other statement is weird. It is bad to fail allocation attempts, they 
may lead to a process being terminated. Memory should be reclaimed 
earlier to avoid these situations.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Matt Mackall <mpm@...>
Date: Monday, May 14, 2007 - 11:53 am

Why does this have to be handled by the slab allocators at all? If you have 
free pages in the page allocator then the slab allocators will be able to 
use that reserve.

-
To: Christoph Lameter <clameter@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 12:12 pm

If I understand this correctly:

privileged thread                      unprivileged greedy process
kmem_cache_alloc(...)
   adds new slab page from lowmem pool
do_io()
                                       kmem_cache_alloc(...)
                                       kmem_cache_alloc(...)
                                       kmem_cache_alloc(...)
                                       kmem_cache_alloc(...)
                                       kmem_cache_alloc(...)
                                       ...
                                          eats it all
kmem_cache_alloc(...) -> ENOMEM
   who ate my donuts?!

But I think this solution is somehow overkill. If we only care about
this issue in the OOM avoidance case, then our rank reduces to a
boolean.
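
The timeline above, with the rank reduced to a boolean as suggested, might
look like this as a toy replay. Everything here (struct toy_slab, the
entitled and enforce parameters, toy_replay()) is a hypothetical model for
illustration and has nothing to do with the actual slab data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slab: a count of free objects plus a boolean "rank". */
struct toy_slab {
	int free_objects;
	bool reserve;	/* slab page came out of the emergency reserve */
};

/*
 * Hand out one object. When enforce is set, objects in a reserve slab
 * only go to entitled callers; otherwise first come, first served.
 */
static bool toy_slab_alloc(struct toy_slab *slab, bool entitled, bool enforce)
{
	assert(slab != NULL);
	if (slab->free_objects == 0)
		return false;
	if (enforce && slab->reserve && !entitled)
		return false;
	slab->free_objects--;
	return true;
}

/*
 * Replay the timeline: the privileged thread takes the first object of a
 * fresh reserve slab, the greedy process then tries greedy_attempts times,
 * and finally the privileged thread allocates again. Returns whether that
 * last allocation succeeds.
 */
static bool toy_replay(int objects_per_slab, int greedy_attempts, bool enforce)
{
	struct toy_slab slab = { objects_per_slab, true };
	bool first = toy_slab_alloc(&slab, true, enforce);

	assert(first);
	for (int i = 0; i < greedy_attempts; i++)
		(void)toy_slab_alloc(&slab, false, enforce);
	return toy_slab_alloc(&slab, true, enforce);
}
```

Without enforcement the final privileged allocation fails once the greedy
process has made enough attempts; with the boolean check it still finds an
object in the reserve slab.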

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Matt Mackall <mpm@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 12:29 pm

Yes but it returns an object for the privileged thread. Is that not 
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 1:40 pm

No, because we reserved memory for n objects, and like Matt illustrates,
most of those will be eaten by the greedy process.


I tried to slim it down to a two state affair; but last time I tried
performance runs that actually slowed it down some.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 1:57 pm

1 slab per object not one page. But yes that's some bloat.

You can pull the big switch (only on a SLUB slab I fear) to switch 
off the fast path. Do SetSlabDebug() when allocating a precious 
allocation that should not be gobbled up by lower level processes. 
Then you can do whatever you want in the __slab_alloc debug section and we 
won't care because it's not the hot path.

SLAB is a bit different. There we already have issues with the fast path 
due to the attempt to handle numa policies at the object level. SLUB fixes 
that issue (if we can avoid your hot path patch). It intentionally does 
defer all special object handling to the slab level to increase NUMA 
performance. If you do the same to SLAB then you will get the NUMA 
troubles propagated to the SMP and UP level.




-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 3:28 pm

One allocator is all I need; it would just be grand if all could be
supported.

So what you suggest is not placing the 'emergency' slab into the regular
place so that normal allocations will not be able to find it. Then if an
emergency allocation cannot be satisfied by the regular path, we fall

I could hack in a similar reserve slab; by catching the failure of the
regular allocation path. It'd not make it prettier though.

The thing is; I'm not needing any speed, as long as the machine stays
alive I'm good. However others are planning to build a full reserve based
allocator to properly fix the places that now use __GFP_NOFAIL and
situations such as in add_to_swap().

Ah well, one thing at a time. I'll hack this up.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 3:56 pm

Hmmm.. Maybe we could do that.... But what I had in mind was simply to 
set a page flag (DebugSlab()) if you know in alloc_slab that the slab 
should be only used for emergency allocation. If DebugSlab is set then the
fastpath will not be called. You can trap all allocation attempts and 
insert whatever fancy logic you want in the debug path since it's not 

Well I have a version of SLUB here that allows you to redirect the alloc 
calls at will. It adds a kmem_cache_ops structure, and in the kmem_cache_ops 
structure you can redirect allocation and freeing of slabs (not objects!) 
at will. Would that help?
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 4:03 pm

I might have missed some detail when I looked at SLUB, but I did not see

I'm not sure; I need kmalloc as well.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 4:25 pm

We could add a kmalloc_ops structure to allow redirects?

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 4:06 pm

Ok it's not evident in slab_alloc. But if SlabDebug is set then 
page->lockless_list is always NULL and we always fall back to 
__slab_alloc. There we check for SlabDebug and go to the debug: label. 
There you can insert any fancy processing you want.
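
A rough sketch of that dispatch, as a toy model rather than the SLUB source:
struct toy_slab, its debug flag (standing in for SlabDebug), the two lists
and the toy_path enum are all assumptions made up for illustration. The
point it shows is only that a slab whose lockless list is kept NULL lands in
the slow path on every allocation, where extra checks could be hooked in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which branch served the request, reported for illustration only. */
enum toy_path { TOY_FAST_PATH, TOY_SLOW_PATH, TOY_DEBUG_PATH };

struct toy_object {
	struct toy_object *next;
};

struct toy_slab {
	bool debug;				/* stands in for SlabDebug(page) */
	struct toy_object *lockless_list;	/* stays NULL while debug is set */
	struct toy_object *freelist;		/* objects seen only by the slow path */
};

static struct toy_object *toy_slab_alloc(struct toy_slab *slab, enum toy_path *path)
{
	assert(slab != NULL && path != NULL);

	struct toy_object *obj = slab->lockless_list;

	if (obj) {
		/* Fast path: pop from the lockless list, no further checks. */
		slab->lockless_list = obj->next;
		*path = TOY_FAST_PATH;
		return obj;
	}

	/* Slow path: debug slabs branch off here on every allocation. */
	*path = slab->debug ? TOY_DEBUG_PATH : TOY_SLOW_PATH;
	obj = slab->freelist;
	if (!obj)
		return NULL;

	if (slab->debug) {
		/* Hook point: per-allocation policy checks could go here. */
		slab->freelist = obj->next;
	} else {
		/* Regular slab: move the remainder over for later fast hits. */
		slab->lockless_list = obj->next;
		slab->freelist = NULL;
	}
	return obj;
}
```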

-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 4:12 pm

-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Tuesday, May 15, 2007 - 1:27 pm

How about something like this; it seems to sustain a little stress.


Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slub_def.h |    3 +
 mm/slub.c                |   73 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 68 insertions(+), 8 deletions(-)

Index: linux-2.6-git/include/linux/slub_def.h
===================================================================
--- linux-2.6-git.orig/include/linux/slub_def.h
+++ linux-2.6-git/include/linux/slub_def.h
@@ -47,6 +47,9 @@ struct kmem_cache {
 	struct list_head list;	/* List of slab caches */
 	struct kobject kobj;	/* For sysfs */
 
+	spinlock_t reserve_lock;
+	struct page *reserve_slab;
+
 #ifdef CONFIG_NUMA
 	int defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-git/mm/slub.c
===================================================================
--- linux-2.6-git.orig/mm/slub.c
+++ linux-2.6-git/mm/slub.c
@@ -20,11 +20,13 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include "internal.h"
 
 /*
  * Lock order:
- *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   1. slab->reserve_lock
+ *   2. slab_lock(page)
+ *   3. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -981,7 +983,7 @@ static void setup_object(struct kmem_cac
 		s->ctor(object, s, SLAB_CTOR_CONSTRUCTOR);
 }
 
-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *rank)
 {
 	struct page *page;
 	struct kmem_cache_node *n;
@@ -999,6 +1001,7 @@ static struct page *new_slab(struct kmem
 	if (!page)
 		goto out;
 
+	*rank = page->rank;
 	n = get_node(s, page_to_nid(page));
 	if (n)
 		atomic_long_inc(&n->nr_slabs);
@@ -1286,7 +1289,7 @@ static void putback_slab(struct kmem_cac
 /...
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Tuesday, May 15, 2007 - 6:02 pm

Argh again mods to kmem_cache.

Could we do this with a new slab page flag? F.e. SlabEmergPool.


in alloc_slab() do

if (is_emergency_pool_page(page)) {
	SetSlabDebug(page);
	SetSlabEmergPool(page);
}

So now you can intercept allocs to the SlabEmergPool slab in __slab_alloc:

debug:

if (SlabEmergPool(page)) {
	if (mem_no_longer_critical()) {
		/* Avoid future trapping */
		ClearSlabDebug(page);
		ClearSlabEmergPool(page);
	} else if (process_not_allowed_this_memory()) {
		do_something_bad_to_the_caller();
	} else {
		/* Allocation permitted */
	}
}

....

-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 2:59 am

Hmm, I had not understood you minded that very much; I did stay away
from all the fast paths this time.

The thing is, I wanted to fold all the emergency allocs into a single
slab, not a per cpu thing. And once you lose the per cpu thing, you
need some extra serialization. Currently the top level lock is
slab_lock(page), but that only works because we have interrupts disabled
and work per cpu.

Why is it bad to extend kmem_cache a bit?

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 2:43 pm

SLUB can only allocate from a per cpu slab. You will have to reserve one 
slab per cpu anyways unless we flush the cpu slab after each access. Same 

Because it is for all practical purposes a heavily accessed read only 
structure. Modifications only occur to per node and per cpu structures.
On a 4k-processor system, any write will kick the kmem_cache cacheline
out of the caches of all 4k processors.
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 3:25 pm

If this 4k cpu system ever gets to touch the new lock it is in way
deeper problems than a bouncing cache-line.

Please look at it more carefully.

We differentiate pages allocated at the level where GFP_ATOMIC starts to
fail. By not updating the percpu slabs those are retried every time,
except for ALLOC_NO_WATERMARKS allocations; those are served from the
->reserve_slab.

Once a regular slab allocation succeeds again, the ->reserve_slab is
cleaned up and never looked at again until we're in distress again.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/slub_def.h |    2 +
 mm/slub.c                |   85 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 78 insertions(+), 9 deletions(-)

Index: linux-2.6-git/include/linux/slub_def.h
===================================================================
--- linux-2.6-git.orig/include/linux/slub_def.h
+++ linux-2.6-git/include/linux/slub_def.h
@@ -46,6 +46,8 @@ struct kmem_cache {
 	struct list_head list;	/* List of slab caches */
 	struct kobject kobj;	/* For sysfs */
 
+	struct page *reserve_slab;
+
 #ifdef CONFIG_NUMA
 	int defrag_ratio;
 	struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-git/mm/slub.c
===================================================================
--- linux-2.6-git.orig/mm/slub.c
+++ linux-2.6-git/mm/slub.c
@@ -20,11 +20,13 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include "internal.h"
 
 /*
  * Lock order:
- *   1. slab_lock(page)
- *   2. slab->list_lock
+ *   1. reserve_lock
+ *   2. slab_lock(page)
+ *   3. node->list_lock
  *
  *   The slab_lock protects operations on the object of a particular
  *   slab and its metadata in the page struct. If the slab lock
@@ -259,6 +261,8 @@ static int sysfs_slab_alias(struct kmem_
 static void sysfs_slab_remove(struct kmem_cache *s) {}
 #endif
 
+static DEFINE_SPINLOCK(reserve_lock);
+
 /***************...
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 3:53 pm

A single slab? This may only give you a single object in an extreme
case. Are you sure that this solution is generic enough?

The problem here is that you may spinlock and take out the slab for one 
cpu but then (AFAICT) other cpus can still not get their high priority 

So you want to spill back the lockless_freelist without deactivating the 
slab? Why are you using the lockless_freelist at all? If you do not use it 

Ok so we are trying to allocate a slab and do not get one thus ->
try_reserve. But this is only working if we are using the slab after
explicitly flushing the cpuslabs. Otherwise the slab may be full and we


Remove the above two lines (they are wrong regardless) and simply make 



-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 4:18 pm

It is; it's just that we're swapping very heavily at that point, a
bouncing cache-line will not significantly slow down the box compared to

Well, single as in a single active; it gets spilled into the full list

All cpus are redirected to ->reserve_slab when the regular allocations



/me fails to parse.

When we need a new_slab: 
 - we try the partial lists,
 - we try the reserve (if ALLOC_NO_WATERMARKS)

No, no, we did get a page, and it was !ALLOC_NO_WATERMARKS hard to get

It need not be the same node; the reserve_slab is node agnostic.
So here the free page watermarks are good again, and we can forget all
about the ->reserve_slab. We just push it on the free/partial lists and
forget about it.

But like you said above: unfreeze_slab() should be good, since I don't

So this is when we get a page and it was ALLOC_NO_WATERMARKS hard to get
it. Instead of updating the cpu_slab we leave that unset, so that
subsequent allocations will try to allocate a slab again thereby testing

__deactivate_slab() doesn't do putback_slab, and now I see that whole



-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 4:27 pm

How does all of this interact with

1. cpusets

2. dma allocations and highmem?



s->cpu[cpu] is only NULL if the cpu slab was flushed. This is a pretty

You could completely bypass the regular allocation functions and do

object = s->reserve_slab->freelist;
s->reserve_slab->freelist = object[s->reserve_slab->offset];
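
For illustration only, the two lines above can be tried out in user space. This is a minimal sketch, not kernel code: `fl_page` and `fl_freelist_pop` are hypothetical names, the NULL check is an addition, and it assumes the offset counts pointer-sized words, as the array indexing in the snippet suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two page fields the snippet uses. */
struct fl_page {
	void **freelist;      /* first free object, or NULL when exhausted */
	unsigned int offset;  /* position of the next-free link, in pointer-sized words (assumed) */
};

/* Pop one object off the freelist: the two quoted lines plus a NULL check. */
static void *fl_freelist_pop(struct fl_page *page)
{
	void **object;

	assert(page != NULL);
	object = page->freelist;
	if (object == NULL)
		return NULL;
	/* The word at object[offset] holds the address of the next free object. */
	page->freelist = object[page->offset];
	return object;
}
```

Each pop returns the current head and advances the list by following the link stored inside the object itself, so no separate bookkeeping array is needed.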

-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 4:40 pm

Much like the normal kmem_cache would do; I'm not changing any of the
page allocation semantics.

For containers it could be that the machine is not actually swapping but


Ah, right:
 - !page || !page->freelist
 - and no available partial slabs.


That is basically what happens at the end; if an object is returned from
the reserve slab.

But it is desirable to try the normal cpu_slab path first to detect that the
situation has subsided and we can resume normal operation.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 4:44 pm

So if we run out of memory on a cpuset then network I/O will still fail?

I do not see any distinction between DMA and regular memory. If we need 

Is there some indicator somewhere that indicates that we are in trouble? I 
just see the ranks.

-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 4:54 pm

If network relies on slabs that are cpuset constrained and the page

Yes, and page->rank will only ever be 0 if the page was allocated with
ALLOC_NO_WATERMARKS, and that only ever happens if we're in dire
straits and entitled to it.

Otherwise it'll be ALLOC_WMARK_MIN or somesuch.



-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 4:59 pm

How do we know that we are out of trouble? Just try another alloc and see? If
that is the case then we may be failing allocations after the memory 
situation has cleared up.

-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 5:04 pm

I hope the network stack already uses the appropriate allocator flags.
If the slab was GFP_DMA that doesn't change, the ->reserve_slab will

No, no, for each regular allocation we retry to populate ->cpu_slab with
a new slab. If that works we're out of the woods and the ->reserve_slab
is cleaned up.
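
As a loose user-space model of that retry-and-recover behaviour (a sketch under stated assumptions, not the patch itself): `rs_zone`, `rs_cache`, `rs_page_alloc` and `rs_slow_path_alloc` are hypothetical names, there is a single zone with a single watermark, and object accounting inside the slabs is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ranks: 0 means the page came out of the reserves. */
enum { RS_NO_PAGE = -1, RS_RANK_RESERVE = 0, RS_RANK_NORMAL = 1 };

struct rs_zone {
	int free_pages;
	int watermark_min;
};

struct rs_cache {
	bool has_cpu_slab;      /* regular per-cpu slab installed */
	bool has_reserve_slab;  /* emergency slab currently in use */
};

/* Toy page allocator: only entitled callers may dip below the watermark. */
static int rs_page_alloc(struct rs_zone *z, bool entitled)
{
	if (z->free_pages > z->watermark_min) {
		z->free_pages--;
		return RS_RANK_NORMAL;
	}
	if (entitled && z->free_pages > 0) {
		z->free_pages--;
		return RS_RANK_RESERVE;
	}
	return RS_NO_PAGE;
}

/* Every slow-path allocation retries the page allocator to gauge pressure. */
static bool rs_slow_path_alloc(struct rs_cache *c, struct rs_zone *z, bool entitled)
{
	int rank;

	assert(c != NULL && z != NULL);
	rank = rs_page_alloc(z, entitled);
	if (rank == RS_RANK_NORMAL) {
		/* Out of the woods: drop the reserve slab, resume normal operation. */
		c->has_reserve_slab = false;
		c->has_cpu_slab = true;
		return true;
	}
	if (rank == RS_RANK_RESERVE) {
		/* Still in distress: leave the cpu slab unset so the next call retries. */
		c->has_reserve_slab = true;
		return true;
	}
	return false;
}
```

The point of the model is that no separate "crisis over" signal is needed: the first regular page allocation that succeeds is itself the signal.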

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 5:13 pm

Hmmm.. so we could simplify the scheme by storing the last rank 
somewhere.

If the alloc has less priority and we can extend the slab then
clear up the situation.

If we cannot extend the slab then the alloc must fail.

Could you put the rank into the page flags? On 64 bit at least there 
should be enough space.
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 5:20 pm

That is exactly what is done; and as mpm remarked the other day, it's a
binary system; we don't need full gfp fairness, just ALLOC_NO_WATERMARKS.

And that is already found in ->reserve_slab; if present the last

Currently I stick the newly allocated page's rank in page->rank (yet
another overload of page->index). I've not yet seen the need to keep it
around longer.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Wednesday, May 16, 2007 - 5:42 pm

One does not have a way of determining the current process's
priority? Just need to do an alloc?

If we had the current process's "rank" then we could simply compare.
If rank is okay give them the object. If not try to extend slab. If that
succeeds clear the rank. If extending fails fail the alloc. There would be 
no need for a reserve slab.
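
A minimal user-space sketch of that compare-then-extend decision, for illustration only. All names (`rk_cache`, `rk_alloc`, the `RK_*` constants, the extend callbacks) are hypothetical, lower rank is assumed to mean "harder to get", and "clear the rank" is modelled as resetting it to a normal value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical rank values: lower means the memory was harder to get. */
enum { RK_RANK_RESERVE = 0, RK_RANK_NORMAL = 1 };
#define RK_OBJS_PER_SLAB 8

struct rk_cache {
	int rank;          /* rank of the most recently allocated slab page */
	int free_objects;  /* objects still available in the cache */
};

/* Stand-ins for the page allocator: report whether the slab could be extended. */
static bool rk_extend_ok(int caller_rank)   { (void)caller_rank; return true; }
static bool rk_extend_fail(int caller_rank) { (void)caller_rank; return false; }

static bool rk_alloc(struct rk_cache *c, int caller_rank,
		     bool (*try_extend)(int caller_rank))
{
	assert(c != NULL && try_extend != NULL);

	/* Rank is okay: hand out an object. */
	if (caller_rank <= c->rank && c->free_objects > 0) {
		c->free_objects--;
		return true;
	}
	/* Otherwise try to extend the slab; success clears the rank. */
	if (try_extend(caller_rank)) {
		c->rank = RK_RANK_NORMAL;
		c->free_objects += RK_OBJS_PER_SLAB - 1;
		return true;
	}
	/* Extending failed: fail the allocation. */
	return false;
}
```

In this model the only extra state is one rank per cache, which is what would make a dedicated reserve slab unnecessary.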

What worries me about this whole thing is


1. It is designed to fail an allocation rather than guarantee that all 
   succeed. Is it not possible to figure out which processes are not 
   essential and simply put them to sleep until the situation clears up?

2. It seems to be based on global ordering of allocations which is
   not possible given large systems and the relativistic constraints
   of physics. Ordering of events gets more expensive the bigger the
   system is.

   How does this system work if you can just order events within
   a processor? Or within a node? Within a zone?

3. I do not see how this integrates with other allocation constraints:
   DMA constraints, cpuset constraints, memory node constraints,
   GFP_THISNODE, MEMALLOC, GFP_HIGH.

-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 3:28 am

We need that alloc anyway, to gauge the current memory pressure.
Sure you could perhaps not do that for allocations that are entitled to
the reserve if we still have one; but I'm not sure that is worth the

Well, that is currently not done either (in as far as that __GFP_WAIT
doesn't sleep indefinitely). When you run very low on memory, some
allocations just need to fail, there is nothing very magical about that,
the system seems to cope just fine. It happens today.

Disable the __GFP_NOWARN logic and create a swap storm, see what

/me fails again..

It's about ensuring ALLOC_NO_WATERMARKS memory only reaches PF_MEMALLOC

It works exactly as it used to; if you can currently get out of a swap
storm you still can.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 1:30 pm

Watermarks are per zone?
-
To: Christoph Lameter <clameter@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 1:53 pm

Yes, but the page allocator might address multiple zones in order to
obtain a page.

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Thursday, May 17, 2007 - 2:01 pm

And then again it may not because the allocation is constrained to a
particular node, a NORMAL zone or a DMA zone. One zone may be below the
watermark and another may not. Different allocations may be allowed to 
tap into various zones for various reasons.

-
To: Matt Mackall <mpm@...>
Cc: Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 3:44 pm

On Mon, 14 May 2007 11:12:24 -0500

Yes, that's my understanding also.

I can see why it's a problem in theory, but I don't think Peter has yet
revealed to us why it's a problem in practice.  I got all excited when
Christoph asked "I am not sure what the point of all of this is.", but
Peter cunningly avoided answering that ;)

What observed problem is being fixed here?
-
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 4:01 pm

(From my recollection of looking at this problem a few years ago:)

There are various critical I/O paths that aren't protected by
mempools that need to dip into reserves when we approach OOM.

If, say, we need some number of SKBs in the critical I/O cleaning path
while something else is cheerfully sending non-I/O data, that second
stream can eat the SKBs that the first had budgeted for in its
reserve.

I think the simplest thing to do is to make everyone either fail or
sleep if they're not marked critical and a global memory crisis flag
is set.

To make this not impact the fast path, we could pull some trick like
swapping out and hiding all the real slab caches when turning the crisis
flag on.
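
A minimal user-space sketch of the crisis-flag idea, covering only the "fail" half (sleeping is not modelled). The names `crisis_flag`, `crisis_set` and `crisis_alloc` are hypothetical, and `malloc` stands in for the slab allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical global flag, set when the system approaches OOM. */
static bool crisis_flag;

static void crisis_set(bool on)
{
	crisis_flag = on;
}

/* While the flag is set, only callers marked critical get memory. */
static void *crisis_alloc(size_t size, bool critical)
{
	assert(size > 0);
	if (crisis_flag && !critical)
		return NULL;
	return malloc(size);
}
```

This version tests the flag on every allocation; the cache-swapping trick mentioned above is meant to avoid exactly that cost on the fast path.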

-- 
Mathematics is the supreme nostalgia of our time.
-
To: Andrew Morton <akpm@...>
Cc: Matt Mackall <mpm@...>, Christoph Lameter <clameter@...>, <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>
Date: Monday, May 14, 2007 - 4:05 pm

I'm moving towards swapping over networked storage. Admittedly a new
feature.

Like with pretty much all other swap solutions; there is the fundamental
vm deadlock: freeing memory requires memory. Current block devices get
around that by using mempools. This works well.

However with network traffic mempools are not easily usable; the network
stack uses kmalloc. By using reserve based allocation we can keep
operating in a similar manner.



-
To: Christoph Lameter <clameter@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Matt Mackall <mpm@...>
Date: Monday, May 14, 2007 - 12:10 pm

Yes, too freely. GFP flags are only ever checked when you allocate a new
page. Hence, if you have a low reaching alloc allocating a slab page;
subsequent non critical GFP_KERNEL allocs can fill up that slab. Hence
you would need to reserve a slab per object instead of the normal
packing.
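
To make the failure mode concrete, here is a minimal user-space sketch, not kernel code. `toy_slab`, `toy_alloc_unchecked` and `toy_alloc_fair` are hypothetical names, and the rank convention (0 meaning the page came from the reserve) is an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: one slab page holding a few fixed-size objects. */
#define TOY_OBJECTS 4

struct toy_slab {
	int rank;   /* how hard the backing page was to get; 0 = reserve (assumed) */
	int used;   /* objects handed out so far */
	char objects[TOY_OBJECTS][32];
};

/* Behaviour as described above: nothing is checked once the page exists. */
static void *toy_alloc_unchecked(struct toy_slab *slab)
{
	assert(slab != NULL);
	if (slab->used == TOY_OBJECTS)
		return NULL;
	return slab->objects[slab->used++];
}

/* Fair variant: a caller with an easier (higher) rank may not take objects. */
static void *toy_alloc_fair(struct toy_slab *slab, int caller_rank)
{
	assert(slab != NULL);
	if (caller_rank > slab->rank)
		return NULL;
	return toy_alloc_unchecked(slab);
}
```

With the unchecked variant, a non-critical caller can drain a slab whose page was only obtainable by a critical one; the fair variant keeps the remaining objects for callers at least as entitled as the one that obtained the page.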

-
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <linux-kernel@...>, <linux-mm@...>, Thomas Graf <tgraf@...>, David Miller <davem@...>, Andrew Morton <akpm@...>, Daniel Phillips <phillips@...>, Pekka Enberg <penberg@...>, Matt Mackall <mpm@...>
Date: Monday, May 14, 2007 - 12:37 pm

This is all about making one thread fail rather than another? Note that 
the allocations are a rather complex affair in the slab allocators. Per
node and per cpu structures play a big role.


-