Re: [patch 21/21] slab defrag: Obsolete SLAB

To: <akpm@...>
Cc: <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <andi@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Friday, May 9, 2008 - 11:08 pm

Slab defragmentation introduces new functionality not supported by SLAB and
SLOB.

Make slab depend on EXPERIMENTAL and note its obsoleteness and that
various functionality is not supported by SLAB.

Also update SLOB's description a bit to indicate that certain OS
support is limited by design.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 init/Kconfig |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig	2008-05-09 18:41:41.000000000 -0700
+++ linux-2.6/init/Kconfig	2008-05-09 18:46:13.000000000 -0700
@@ -749,12 +749,16 @@ choice
 	   This option allows to select a slab allocator.
 
 config SLAB
-	bool "SLAB"
+	bool "SLAB (Obsolete)"
+	depends on EXPERIMENTAL
 	help
-	  The regular slab allocator that is established and known to work
-	  well in all environments. It organizes cache hot objects in
-	  per cpu and per node queues. SLAB is the default choice for
-	  a slab allocator.
+	  The old slab allocator that is being replaced by SLUB.
+	  SLAB does not support slab defragmentation and has limited
+	  debugging support. There is no sysfs support for /sys/kernel/slab.
+	  SLAB requires order 1 allocations for some caches which may under
+	  extreme circumstances fail. New general object debugging methods
+	  (such as kmemcheck) do not support SLAB. The code is complex,
+	  difficult to comprehend and has a history of subtle bugs.
 
 config SLUB
 	bool "SLUB (Unqueued Allocator)"
@@ -771,7 +775,10 @@ config SLOB
 	help
 	   SLOB replaces the stock allocator with a drastically simpler
 	   allocator. SLOB is generally more space efficient but
-	   does not perform as well on large systems.
+	   does not perform as well on large systems. SLOBs functionality
+	   is limited by design (no sysfs support, no defrag, no debugging
+	   etc).
+
 
 endchoice
 

-- 
--
To: Christoph Lameter <clameter@...>
Cc: <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Rik van Riel <riel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Saturday, May 10, 2008 - 5:53 am

What about the TPC performance regressions? My understanding was that
slub still performed worse in the "object allocated on one CPU, freed on
the other CPU" type workloads due to less batching.

-Andi
--
To: Andi Kleen <andi@...>
Cc: Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Saturday, May 10, 2008 - 10:15 pm

On Sat, 10 May 2008 11:53:30 +0200

Which can be the majority of object allocations and frees in some
workloads.  It definitely wants fixing.

-- 
All rights reversed.
--
To: Rik van Riel <riel@...>
Cc: Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Pekka Enberg <penberg@...>, <mpm@...>
Date: Monday, May 12, 2008 - 3:38 am

Agreed, that situation happens very frequently, IMHO.
--
To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: Rik van Riel <riel@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Monday, May 12, 2008 - 3:54 am

On Mon, May 12, 2008 at 10:38 AM, KOSAKI Motohiro

Christoph fixed a tbench regression that was in the same ballpark as
the TPC regression reported by Matthew which is why we've asked the
Intel folks to re-test. But yeah, we're working on it.
--
To: Pekka Enberg <penberg@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, Andi Kleen <andi@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 1:29 pm

I suspect that the TPC regression was due to the page allocator order 0 
inefficiencies like the tbench regression but we have no data yet to 
establish that.

Fundamentally there is no way to avoid complex queueing on free() unless 
one directly frees the object. This is serialized in SLUB by taking a page 
lock. If we can establish that the object is from the current cpu slab 
then no lock is taken because the slab is reserved for the current 
processor. So the bad case is a free of an object with a long life span or
an object freed on a remote processor.

However, the "slow" case in SLUB is still much less complex
than comparable processing in SLAB. It is quite fast.
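
The free-path decision described above can be sketched as a toy model. This
is a minimal Python illustration under stated assumptions, not kernel code:
ToySlab, owner_cpu and slub_free are invented names, and the page lock is
represented only by a counter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySlab:
    """Hypothetical stand-in for a slab page (not a kernel structure)."""
    owner_cpu: Optional[int] = None  # CPU whose per cpu slab this is, if any
    page_lock_acquisitions: int = 0  # how often the slow path locked the page


def slub_free(slab: ToySlab, current_cpu: int) -> str:
    """Classify one free the way the text describes SLUB's behaviour."""
    if slab.owner_cpu == current_cpu:
        # The slab is reserved for this processor, so no lock is taken.
        return "fast"
    # Long-lived object or remote free: serialize on the page lock.
    slab.page_lock_acquisitions += 1
    return "slow"
```

In this model only frees that miss the current cpu slab touch the page
lock, which is the "bad case" named in the paragraph above.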

SLAB freeing can avoid taking a lock if

1. We can establish that the object is node local (trivial if !NUMA 
otherwise we need to get the node information from the page struct and 
compare to the current node).

2. There is space in the per cpu queue

If the object is *not* node local then we have to take an alien lock for 
the remote node in order to put the object in an alien queue. That is much 
less efficient than the SLUB case. SLAB then needs to run the cache reaper
to expire these objects into the remote node's queues (later the cache
reaper may then actually free these objects). This management overhead
does not exist in SLUB. The cache reaper causes processors to not be
available for short time frames (the reaper scans through all slab
caches!) which in turn causes regressions in applications that need to
respond in a short time frame (HPC applications, network applications that
are timing critical).
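
The two conditions and the alien-queue case above can be written down as a
small decision function. This is only a sketch in Python: slab_free_path and
its parameters are hypothetical names, and the final branch (queue full)
merely records that the lock-free conditions were not met, since the text
does not spell out what happens there.

```python
def slab_free_path(object_node: int, current_node: int,
                   queue_len: int, queue_limit: int,
                   numa: bool = True) -> str:
    """Toy classification of a SLAB free, following the text above."""
    if numa and object_node != current_node:
        # Not node local: the object goes to an alien queue under a lock.
        return "alien lock"
    if queue_len < queue_limit:
        # Node local and room in the per cpu queue: no lock needed.
        return "no lock"
    # Node local but the per cpu queue is full: locking cannot be avoided.
    return "lock"
```

With numa=False the node comparison is skipped entirely, which mirrors the
"trivial if !NUMA" remark in condition 1.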

Note that the lock granularity in SLUB is finer than the locks in SLAB. 
SLUB can concurrently free multiple objects to the same remote node etc 
etc. If the objects belong to different slabs then there is no dirtying of 
any shared cachelines.

The main issue for SLAB vs. SLUB on free is likely the !NUMA case in which 
SLAB can avoid the overhead of the node check (which does not exist in 
SLUB) and i...
To: Christoph Lameter <clameter@...>
Cc: Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 1:49 pm

iirc profiling analysis showed that the problem was the page lock
serialization (in particular the slab_lock() in __slab_free). That


Ignoring NUMA is no option unfortunately. And with integrated memory

I think the problem is that this atomic operation thrashes cache lines
around. Really counting cycles on instructions is not that interesting,
but minimizing the cache thrashing is. And for that it looks like slub

What is the big problem of having a batched free queue? If the expiry
is done at a good bounded time (e.g. on interrupt exit or similar)
locally on the CPU it shouldn't be a big issue, should it?

-Andi
--
To: Andi Kleen <andi@...>
Cc: Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 2:05 pm

The issue of object expiration holdoffs also affects applications 
running on pure SMP systems.
--
To: Andi Kleen <andi@...>
Cc: Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 4:46 pm

Some more on SMP scaling:

There is also the issue in SLAB that global locks (SMP case) need to be 
taken for a pretty long timeframe. With a sufficiently high allocation 
frequency from multiple processors you can cause lock contention on the 
list_lock that will then degrade performance.

SLUB does not take global locks for continued allocations. Global locks 
are taken for a short time frame if the partial lists need to be updated 
(which is avoided as much as possible with various measures). This can 
yield orders of magnitude higher performance.

The above is possible because of locking at the page level. A queue must 
either be processor specific or global (or per node) and then would 
need locks.
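
The effect of lock granularity can be illustrated by counting how many
frees land on the same lock. This is a rough Python sketch, not a
measurement: busiest_lock is an invented helper, and the (node, slab_id)
pairs stand in for whatever identifies a lock in a real allocator.

```python
from collections import Counter


def busiest_lock(frees, granularity):
    """Return the highest number of acquisitions on any single lock.

    frees: iterable of (node, slab_id) pairs, one per freed object.
    granularity: "node" for one lock per node, "page" for one per slab page.
    """
    if granularity not in ("node", "page"):
        raise ValueError("granularity must be 'node' or 'page'")
    hits = Counter()
    for node, slab_id in frees:
        lock = (node,) if granularity == "node" else (node, slab_id)
        hits[lock] += 1
    return max(hits.values(), default=0)
```

Freeing objects from four different slabs on the same node hits one
per-node lock four times, but four per-page locks once each.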

Another issue is storage density. SLAB needs a metadata structure that
is placed either in the slab page itself or in a separate slab cache. In
some cases this is advantageous over SLUB (e.g. a series of pointers to
objects exists in a single cacheline, so allocation of objects that are
not immediately used could be faster). In others it is not: it increases
cache footprint, requires touching two slab caches if the metadata is off
slab, makes alignment of objects in the slab pages difficult and increases
memory overhead. SLUB generally is faster if the object is/was immediately
used, since the free pointer overlays the data and thus the cacheline is
hot both on alloc and free.
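
The idea of a free pointer that overlays the object data can be shown with
a small model in which each free slot stores the index of the next free
slot, so no separate metadata array is needed. This is a minimal Python
sketch; EmbeddedFreelistSlab and its fields are invented names and say
nothing about the actual kernel layout.

```python
class EmbeddedFreelistSlab:
    """Toy slab whose free list lives inside the free objects themselves."""

    IN_USE = "in use"  # placeholder for caller data in an allocated slot

    def __init__(self, nr_objects: int):
        if nr_objects < 1:
            raise ValueError("need at least one object")
        # Each free slot holds the index of the next free slot; None ends it.
        self.slots = list(range(1, nr_objects)) + [None]
        self.freelist = 0  # index of the first free slot

    def alloc(self) -> int:
        idx = self.freelist
        if idx is None:
            raise MemoryError("slab is full")
        # The next pointer is read from the object that is being handed out.
        self.freelist = self.slots[idx]
        self.slots[idx] = self.IN_USE
        return idx

    def free(self, idx: int) -> None:
        # The link overwrites the object's own storage.
        self.slots[idx] = self.freelist
        self.freelist = idx
```

Because alloc and free both touch the slot itself, an object that was just
used is also the one whose memory is accessed for list maintenance.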
--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 4:58 pm

These are all great theories, and you mentioned that you'd fixed the
regressions with tbench, but did you fix the regression with the io-gen
program I sent you?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 5:00 pm

No. I thought you were satisfied with the performance increase you saw 
when pinning the process to a single processor?

--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 5:21 pm

Er, no.  That program emulates a TPC-C run from the point of view of
doing as much IO as possible from all CPUs.  Pinning the process to one
CPU would miss the point somewhat.

I seem to remember telling you that you might get more realistic
performance numbers by pinning the scsi_ram_0 kernel thread to a single
CPU (ie emulating an interrupt tied to one CPU rather than letting the
scheduler choose to run the thread on the 'best' CPU).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 5:33 pm

Oh. The last message I got was an enthusiastic report on the performance
gains you saw by pinning the process after we looked at slub statistics 
that showed that the behavior of the tests was different from your 
expectations. I got messages here that indicate that this was a scsi 
testing program that you had under development. And yes we saw the remote 

If this is a stand in for the TPC then why did you not point that 
out when Pekka and I recently asked you to retest some configurations?
--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 5:43 pm

Note the complete lack of comparison between slub and slab here!  As far
as I know, slub still loses against slab by a few % -- but I haven't

I thought you'd already run this test and were asking for the results of
this to be validated against a real TPC run.

I'm rather annoyed by this.  You demand a test-case to reproduce the
problem and then when I come up with one, you ignore it!

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 5:53 pm

This indicated to me that you were still developing a test here and 

Indeed remote frees are slightly slower in some situations. Don't really
dispute that. I am just not sure that the TPC test is really suffering 
from that symptom. I thought for a long time that the tbench regression 


Ignore it? That is a pretty strange statement given that I helped you
analyze the behavior of your test and understand what was going on in the
system.

--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 6:00 pm

Since there's no way we've found to date to get the TPC test to you,
how about we settle for analysing _this_ testcase which did show a
significant performance degradation for slub?

I don't think it's an unreasonable testcase either -- effectively it's
allocating memory on all CPUs and then freeing it all on one.  If that's
a worst-case scenario for slub, then slub isn't suitable for replacing
slab yet.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 6:34 pm

Could I get the latest version of the test? Or was the March 31st version 
the latest? No later changes?

--
To: Matthew Wilcox <matthew@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 6:32 pm

Indeed that is a worst case scenario due to finer grained locking. The 
opposite side of that is that fast concurrent freeing of objects from two 
processors will have higher performance in slub since there is 
significantly less global lock contention and less work with expiring 
objects and moving them around (if you hit the queue limits then SLAB 
will do synchronous merging of objects into slabs, it's then no longer
able to hide the object handling overhead in cache_reap().)

--
To: Andi Kleen <andi@...>
Cc: Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 2:03 pm

It can thrash cachelines if objects from the same slab page are freed 
simultaneously on multiple processors. That occurred in the hackbench 
regression that we addressed with the dynamic configuration of slab sizes.

However, typically long lived objects freed from multiple processors 

Interrupt exit in general would have to inspect the per cpu structures of 
all slab caches on the system?
--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>
Date: Wednesday, May 14, 2008 - 11:26 pm

The hackbench regression is due to slow allocation rather than slow freeing.
With dynamic configuration of slab sizes, the fast allocation rate rises to
97% (it was 68% in the bad case), but the fast free rate stays at 8~9% with
or without the patch.
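
Percentages like these are the share of operations that took the fast
path. A minimal sketch of that arithmetic in Python follows;
fast_path_percent is an invented helper, and the counts in the usage line
are placeholders rather than measurements from this thread.

```python
def fast_path_percent(fast: int, slow: int) -> float:
    """Share of operations that took the fast path, in percent."""
    total = fast + slow
    if total == 0:
        return 0.0  # no operations recorded yet
    return 100.0 * fast / total


# Placeholder counts, chosen only to show the calculation.
example = fast_path_percent(fast=97, slow=3)
```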


--
To: Zhang, Yanmin <yanmin_zhang@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>
Date: Thursday, May 15, 2008 - 1:05 pm

Thanks for using the slab statistics. I wish I had these numbers for the
TPC benchmark. That would allow us to understand what is going on while it
is running.

The frees in the hackbench were slow because partial list updates occurred
too frequently. The first fix was to let slabs sit longer on the partial
list. The other was the increase of the slab sizes which also increases
the per cpu slab size and therefore the objects allocatable without a
round trip to the page allocator. Freeing to a per cpu slab never requires
partial list updates. So the frees also benefitted from the larger slab
sizes. But the effect shows up in the count of partial list updates not in
the fast/free column.
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>
Date: Friday, May 16, 2008 - 1:16 am

I agree. It might be better if SLUB could be optimized further for the case
where the slow free percentage is high, because the page lock might ping-pong
among processors if multiple processors access the same slab at the same time.

--
To: Christoph Lameter <clameter@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 1:49 pm

Hang on, you want slab statistics for the TPC run?  You didn't tell me
that.  We're trying to gather oprofile data (and having trouble because
the machine crashes when we start using oprofile -- this is with the git
tree you/pekka put together for us to test).

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Christoph Lameter <clameter@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 2:29 pm

Hi,

oprofile was recently fixed, maybe cherry-picking these will help:

http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=7ded2dcf5f2...
http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=08bc5caced1...
http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=2b56af59ed2...

Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--
To: Matthew Wilcox <matthew@...>
Cc: Christoph Lameter <clameter@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 2:19 pm

Hum, you might try to apply commit 
44c81433e8b05dbc85985d939046f10f95901184 or commit
8b8b498836942c0c855333d357d121c0adeefbd9

oprofile data are definitely wanted :)

--
To: Matthew Wilcox <matthew@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 1:58 pm

Well we talked about this when you sent me the test program. I just
thought that it would be logical to do the same for the real case.

Details of the crash please?

You could just start with 2.6.25.X which already contains the slab 
statistics.

Also re: the test program: pinning a process does increase the
performance by orders of magnitude. Are you sure that the application was
properly tuned for an 8p configuration? Pinning is usually not necessary
for lower numbers of processors because the scheduler thrashing effect is
less of an issue.  If the test program is an accurate representation of
the TPC-C benchmark then you can drastically increase its performance by
doing the same to the real test.

--
To: Christoph Lameter <clameter@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 2:13 pm

You ran the test ... you didn't say "It would be helpful if you could


Certainly.  Exactly how does collecting these stats work?  Am I supposed
to zero the counters after the TPC has done its initial ramp-up?  What

The application does nothing except submit IO and wait for it to complete.
It doesn't need to be tuned.  It's not an accurate representation of
TPC-C, it just simulates the amount of IO that a TPC-C run will generate
(and simulates it coming from all CPUs, which is accurate).

I don't want to get into details of how a TPC benchmark is tuned, because
it's not relevant.  Trust me, there are people who dedicate months of
their lives per year to tuning how TPC runs are scheduled.

The pinning I was talking about was pinning the scsi_ram_0 kernel thread
to one CPU to simulate interrupts being tied to one CPU.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 2:43 pm

Well the amount of information we are getting has always been the main 

Compile slabinfo and then do f.e. slabinfo -AD (this is documented in the 
help text provided when enabling statistics).
--
To: Christoph Lameter <clameter@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 2:51 pm

Or possibly your assumptions have been the main factor.  I gave you a
reproducer for this problem 6 weeks ago.  As far as I can tell, you

That's an utterly unhelpful answer.  Let me try asking again.

Exactly how does collecting these stats work?  Am I supposed to zero
the counters after the TPC has done its initial ramp-up?  What commands
should I run, and at exactly which points?

Otherwise I'll get something wrong and these numbers will be useless to
you.  Or that's what you'll claim anyway.

For reference the helptext says:

          SLUB statistics are useful to debug SLUBs allocation behavior in
          order find ways to optimize the allocator. This should never be
          enabled for production use since keeping statistics slows down
          the allocator by a few percentage points. The slabinfo command
          supports the determination of the most active slabs to figure
          out which slabs are relevant to a particular load.
          Try running: slabinfo -DA

By the way, when you say 'compile slabinfo', you mean the file shipped
as Documentation/vm/slabinfo.c (rather than, say, something out of tree?)

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 3:09 pm

Assumptions may be the issue. My own "reproducer" for remote frees is 
available from my git tree and I usually prefer to run my own. We 
discussed the results of that program last fall. You stated yesterday that 
your code is proprietary. I am not sure what I am allowed to do with the 
code. I did not know that it was proprietary before yesterday and I would 
have just forwarded that code to Pekka yesterday if I had not
caught that message in time.

I thought that what you provided was a test program to exercise

There is no way of zeroing the counters. Run slabinfo -AD after the 
test application has been running for a while. If you want a differential

No. I guess I will end up with a lot of guess work of what is going on on 

Yes. 
--
To: Christoph Lameter <clameter@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 3:29 pm

No doubt you prefer to run a test which fails to show a problem with

I'm surprised you're so cavalier about copyright.  There was nothing in

Why would you think that?  The subject of the email was "Slub test


They're your statistics.  Tell me what you need.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--
To: Matthew Wilcox <matthew@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Friday, May 16, 2008 - 3:06 pm

The test was designed to show the worst case effect of the additional 


The output of slabinfo -AD.... 


--
To: Christoph Lameter <clameter@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 4:14 pm

This is rather interesting.  Since Christoph refuses to, here's my
results with 8f40f67, first with slab:

willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 0 completed 1000000 ops in 52.817 seconds; 18933 ops per second
CPU 2 completed 1000000 ops in 56.391 seconds; 17733 ops per second
CPU 3 completed 1000000 ops in 57.009 seconds; 17541 ops per second
CPU 1 completed 1000000 ops in 57.591 seconds; 17363 ops per second
willy@piggy:~$ sudo taskset -p 1 941
pid 941's current affinity mask: f
pid 941's new affinity mask: 1
willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 2 completed 1000000 ops in 46.740 seconds; 21394 ops per second
CPU 0 completed 1000000 ops in 48.716 seconds; 20527 ops per second
CPU 3 completed 1000000 ops in 59.255 seconds; 16876 ops per second
CPU 1 completed 1000000 ops in 60.473 seconds; 16536 ops per second

(the pid is that of scsi_ram_0)
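
The affinity masks printed by taskset are hexadecimal bitmasks with one
bit per CPU. A small Python sketch of decoding them (cpus_in_mask is an
invented helper, not part of any tool):

```python
def cpus_in_mask(mask_hex: str) -> list:
    """List the CPU numbers whose bits are set in a hex affinity mask."""
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]
```

Mask "f" therefore covers CPUs 0 to 3, and mask "1" restricts the thread to
CPU 0, which is the pinning applied to the scsi_ram_0 thread above.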

Now, change the config to slub:

--- 64-slab/.config     2008-05-15 15:21:31.000000000 -0400
+++ 64-slub/.config     2008-05-15 15:37:45.000000000 -0400
-# Thu May 15 15:21:31 2008
+# Thu May 15 15:37:45 2008
-CONFIG_SLAB=y
-# CONFIG_SLUB is not set
+CONFIG_SLUB_DEBUG=y
+# CONFIG_SLAB is not set
+CONFIG_SLUB=y
-# CONFIG_DEBUG_SLAB is not set
+# CONFIG_SLUB_DEBUG_ON is not set
+# CONFIG_SLUB_STATS is not set

and we get slightly better results:

willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 0 completed 1000000 ops in 45.848 seconds; 21811 ops per second
CPU 2 completed 1000000 ops in 50.789 seconds; 19689 ops per second
CPU 3 completed 1000000 ops in 55.876 seconds; 17896 ops per second
CPU 1 completed 1000000 ops in 56.941 seconds; 17562 ops per second
willy@piggy:~$ sudo taskset -p 1 1001
pid 1001's current affinity mask: f
pid 1001's new affinity mask: 1
willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4
CPU 2 completed 1000000 ops in 45.713 seconds; 21875 ops per second
CPU 0 completed 1000000 ops in 47.020 seconds; 21267 ops per second
CPU 3 completed 1000000 ops in 58.692 seconds; 17038 ops per second
...
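
As a rough way to compare the unpinned runs above, the per-CPU rates can
be summed. The Python sketch below does only that arithmetic: the figures
are copied from the two unpinned runs, the pinned runs are left out because
the SLUB one is cut off here, and the sum of per-CPU rates is only an
approximation of aggregate throughput from a single run each.

```python
# ops per second per CPU, copied from the unpinned runs quoted above
SLAB_UNPINNED = [18933, 17733, 17541, 17363]
SLUB_UNPINNED = [21811, 19689, 17896, 17562]


def aggregate(ops_per_second):
    """Sum of the per-CPU rates."""
    return sum(ops_per_second)


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


slab_total = aggregate(SLAB_UNPINNED)  # 71570
slub_total = aggregate(SLUB_UNPINNED)  # 76958
gain = percent_change(slab_total, slub_total)
```

On these numbers the SLUB total is about 7.5% above the SLAB total, which
matches the "slightly better results" remark but says nothing about
run-to-run variance.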
To: Matthew Wilcox <matthew@...>
Cc: Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Friday, May 16, 2008 - 3:17 pm

I sure wish you would follow the discussions instead of having paranoid 
thoughts about me not running tests that show regressions. See my 

Hmmm... Interesting. Could you post the output of slabinfo -AD?
--
To: Matthew Wilcox <matthew@...>
Cc: Christoph Lameter <clameter@...>, Zhang, Yanmin <yanmin_zhang@...>, Andi Kleen <andi@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>
Date: Thursday, May 15, 2008 - 4:30 pm

Slabinfo -A -r before and after the run would be nice (you need
CONFIG_SLUB_STATS and Documentation/vm/slabinfo.c for the latter). A
separate oprofile would be nice as well. Thanks!
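
Since the counters cannot be zeroed, a before/after pair of snapshots can
be turned into per-run numbers by subtraction. A minimal Python sketch,
assuming each snapshot has already been parsed into a dict of counter name
to value (the parsing step and the counter names are not shown and are
hypothetical):

```python
def counter_delta(before: dict, after: dict) -> dict:
    """Per-counter difference between two snapshots of the statistics."""
    # Counters that first appear in the second snapshot count from zero.
    return {name: after[name] - before.get(name, 0) for name in sorted(after)}
```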

		Pekka
--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 2:18 pm

Why's that? When we're not under pressure (fast path), we can delay (and
batch) remote frees. When we are under pressure (slow path), we can do
everything immediately.

-- 
Mathematics is the supreme nostalgia of our time.

--
To: Matt Mackall <mpm@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 3:21 pm

Fastpath is what? I guess slow path means we called into the page 
allocator?

--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 3:49 pm

No, slow path here means we're already under memory pressure, so we
don't care if something takes longer if it saves memory.

-- 
Mathematics is the supreme nostalgia of our time.

--
To: Matt Mackall <mpm@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 4:33 pm

So expire the queues under memory pressure only? Trigger queue cleanup 
from reclaim?


--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 5:02 pm

This wouldn't be my first thought. The batch size could be potentially
huge and we'd have to worry about latency issues.

But here are some other thoughts:

First, we should obviously always expire all queues when we hit low
water marks as it'll be cheaper/faster than other forms of reclaim.

Second, if our queues were per-slab (this might be hard, I realize), we
can sweep the queue at alloc time.

We can also sweep before falling back to the page allocator. That should
guarantee that delayed frees don't negatively impact fragmentation.

And lastly, we can always have a periodic thread/timer/workqueue
operation.

So far this is a bunch of hand-waving but I think this ends up basically
being an anti-magazine. A magazine puts a per-cpu queue on the alloc
side which costs on both the alloc and free side, regardless of whether
the workload demands it. This puts a per-cpu queue on the free side that
we can bypass in the cache-friendly case. I think that's a step in the
right direction.
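
The free-side queue described above can be sketched in plain userspace C. This is only an illustration of the idea and not code from SLUB, SLAB, or any patch in this thread: every name in it (`sketch_free`, `sketch_alloc`, `sweep_remote_queue`, `BATCH_MAX`, `under_pressure`) is hypothetical, the "CPUs" are just array indices, and there is no locking or atomics, so it shows only the control flow. Local frees bypass the queue, remote frees are delayed and batched, the batch size is capped to bound latency, and queues are swept under pressure or before an allocation would fall back to the page allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS   4   /* hypothetical CPU count for this sketch */
#define BATCH_MAX 16  /* hypothetical cap on queued frees, bounds sweep latency */

struct object {
	struct object *next;
	int home_cpu;             /* CPU whose slab this object came from */
};

struct cpu_cache {
	struct object *freelist;          /* objects ready for reuse on this CPU */
	struct object *remote[BATCH_MAX]; /* delayed frees of other CPUs' objects */
	size_t nr_remote;
};

static struct cpu_cache caches[NR_CPUS];
static bool under_pressure;   /* stand-in for "we hit the low water marks" */

/* The expensive step: hand an object back to the CPU that owns it. */
static void return_to_home(struct object *obj)
{
	struct cpu_cache *home = &caches[obj->home_cpu];

	obj->next = home->freelist;
	home->freelist = obj;
}

/* Expire one CPU's queue of delayed remote frees as a single batch. */
static void sweep_remote_queue(int cpu)
{
	struct cpu_cache *c = &caches[cpu];

	for (size_t i = 0; i < c->nr_remote; i++)
		return_to_home(c->remote[i]);
	c->nr_remote = 0;
}

static void sketch_free(int cpu, struct object *obj)
{
	struct cpu_cache *c = &caches[cpu];

	if (obj->home_cpu == cpu) {
		/* Cache-friendly case: bypass the queue entirely. */
		obj->next = c->freelist;
		c->freelist = obj;
		return;
	}
	if (under_pressure) {
		/* Slow path: do everything immediately. */
		sweep_remote_queue(cpu);
		return_to_home(obj);
		return;
	}
	/* Fast path: delay the remote free and batch it. */
	assert(c->nr_remote < BATCH_MAX);
	c->remote[c->nr_remote++] = obj;
	if (c->nr_remote == BATCH_MAX)
		sweep_remote_queue(cpu);
}

/* Returns NULL where a real allocator would go to the page allocator. */
static struct object *sketch_alloc(int cpu)
{
	struct cpu_cache *c = &caches[cpu];
	struct object *obj;

	if (!c->freelist) {
		/* Sweep every queue before falling back. */
		for (int i = 0; i < NR_CPUS; i++)
			sweep_remote_queue(i);
	}
	obj = c->freelist;
	if (obj)
		c->freelist = obj->next;
	return obj;
}
```

A periodic timer or workqueue would simply call `sweep_remote_queue()` for each CPU. Whether the queueing cost is repaid by the avoided remote operations is the open question in this discussion, and the sketch does not settle it.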

-- 
Mathematics is the supreme nostalgia of our time.

--
To: Matt Mackall <mpm@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 5:26 pm

Hmmm... I tried a scheme like that awhile back but it did not improve 
performance. The cost of queuing the object degraded the fast path (Note 
SLUB object queuing is fundamentally different due to no in slab 

In that case we dirty the same cacheline that we also need to take the 
page lock. Wonder if there would be any difference? The freelist is 

That would introduce additional complexity for the NUMA case because now 
we would need to distinguish between the nodes that these objects came 
from. So we would have to scan the queue and classify the objects? Or 
determine the object node when queueing them and put them into a remote 
node queue? Sounds similar to all the trouble that we ended up with 

I have had enough trouble in the last years with the 2 second hiccups that 
come with SLAB and that affect timing sensitive operations between 
processors in a SMP configuration and also cause trouble for applications 

I think if you want queues for an SMP only system, do not care too much 
about memory use, don't do any frequent allocations on multicore systems 
and can tolerate the hiccups because your application does not care (most 
enterprise apps are constructed that way) or if you are running benchmarks 
that only access a limited dataset that fits into SLAB's queues and 
avoid touching the contents of objects then the SLAB concept is the right way 
to go.

If we would strip the NUMA stuff out and make it an SMP only allocator for 
enterprise apps then the code may become much smaller and simpler. I guess 
Arjan suggested something similar in the past. But that would result in 
SLAB no longer being a general allocator.
--
To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 5:54 pm

What does this have to do with anything? I'm not talking about going
back to SLAB. I'm talking about plugging the use cases where SLUB
currently loses to SLAB. That's what has to happen before SLAB can be
obsoleted.

I'll certainly grant you that queueing might not break even.

-- 
Mathematics is the supreme nostalgia of our time.

--
To: Matt Mackall <mpm@...>
Cc: Andi Kleen <andi@...>, Pekka Enberg <penberg@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Thursday, May 15, 2008 - 1:15 pm

Both allocators have a different design which leads to different behavior. 
I do not think the expectation that one must always best the other is 
reasonable or even possible.

I'd be glad if we had some means of increasing the performance in the 
currently known cases where remote slab free becomes an issue by avoiding 
the atomic op.

AFAICT we so far have been able to compensate for the additional atomic op 
with a reduced cache footprint and less complexity overall on remote frees 
and also through improvements in alloc behavior. I hope that the current 
improvements in 2.6.26 are sufficient to address the concerns with TP-C 
(which I do not have direct access to and frankly I know very little about 
the setup etc). We are still not sure exactly why TP-C has a problem. The 
slab statistics were added to figure that one out. We can get a view of 
what is going on without having access to the system.

I think the current way of compensating for that atomic op is better than 
getting back to the queue mess. Maybe there is a way of limited use of 
queues that avoids the atomic op but so far I have not 
found one. Maybe someone else looking at it will have better ideas.
--
To: Pekka Enberg <penberg@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Monday, May 12, 2008 - 6:08 am

What are your plans to fix it?

-Andi
--
To: Andi Kleen <andi@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, Christoph Lameter <clameter@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Monday, May 12, 2008 - 6:23 am

I don't have a reproducible test and my boxes are so tiny the issues
probably won't show up anyway. So all I can do at this point is to
make sure Matthew et al can easily re-test whenever we fix some other
regression that potentially affects his workload.

I only recently started tracking this issue so I have no idea where
we're at with this. Christoph? Matthew?
--
To: Pekka Enberg <penberg@...>
Cc: Andi Kleen <andi@...>, KOSAKI Motohiro <kosaki.motohiro@...>, Rik van Riel <riel@...>, <akpm@...>, <linux-kernel@...>, <linux-fsdevel@...>, Mel Gorman <mel@...>, <mpm@...>, Matthew Wilcox <matthew@...>, Zhang, Yanmin <yanmin_zhang@...>
Date: Wednesday, May 14, 2008 - 1:30 pm

I suspect that this is the same issue as tbench. I have explained the SLAB 
vs. SLUB free situation in another email in this thread.

--