Re: Major regression on hackbench with SLUB (more numbers)

Previous thread: [PATCH 01/13] bring driver version number to 47 by Ed L. Cashin on Friday, December 7, 2007 - 1:47 pm. (13 messages)

Next thread: [git pull] scheduler fixes by Ingo Molnar on Friday, December 7, 2007 - 2:21 pm. (1 message)
To: LKML <linux-kernel@...>
Cc: Christoph Lameter <clameter@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Friday, December 7, 2007 - 1:50 pm

Hi,

I've been doing some benchmarks for RT task migrations and I stumbled
over this regression. I first thought it was a regression with CFS and
started doing a git bisect. But I later found that the "good" kernel
also failed. Then I started looking at the config options and discovered
that SLUB has become the default.

When I put SLAB back, the regression went away, even on the lastest git
tree.

So I ran the hackbench benchmarks with the latest git commit
(f194d132e4971111f85c18c96067acffb13cee6d at the time of this writing).
And put up the results on my web page.

http://people.redhat.com/srostedt/slab/

I ran Ingo's cfs-debug-info.sh and have the results of that on the web
page. That script records lots of good information to see what kind of
machine I ran this on. This is a 64way box (yes lots of CPUS!).

The hackbench code and the script I used to run and record the results
is also present. I ran "hackbench 90" 20 times, 10 times as SCHED_OTHER
and 10 times as SCHED_FIFO (chrt -f 10 hackbench 90). The graph shows
both runs (min, max and average). The versions that start with
"rt:" (although the graph somehow truncated it to just "t:") are the
SCHED_FIFO runs. (click on the graph to see a bigger version)

The regression is that hackbench slows down tremendously. It goes from
running just under 2 seconds to running around 14 seconds.

Hackbench seems to show this regression the most. In my tests I didn't
see much change with kernel builds and such, but the focus was on
scheduling not memory management. I'll run my kernel tests next for both
SLAB and SLUB and see if there's any difference there.

But this regression might be a reason to not have SLUB as default for
now. At least until we find out why this is affected so badly.

-- Steve

--

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, Christoph Lameter <clameter@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Tuesday, December 11, 2007 - 10:33 am

i just ran various benchmarks on an 8-way (8x 700 MHz Xeon, 4GB RAM):

AVG v2.6.24.slab v2.6.24.slub [ smaller is better ]
-----------------------------------------
mmap: 1052.66 1049.33 ( 0%)
ctx-2: 4.32 4.30 ( 0%)
select: 41.95 43.69 ( 4%)
proc-exec: 394.45 391.92 ( 0%)
hackbench-10: 1.12 2.99 (166%)
hackbench-20: 2.04 6.67 (226%)
hackbench-50: 5.03 17.50 (247%)

and hackbench overhead stands out, by a huge margin. Other stuff is
within measurement noise. Neither SLUB nor SLAB debugging was turned on,
all other debugging options were off too.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Thursday, December 13, 2007 - 6:22 pm

I just came back from vacation. The non linear growth of regression
indicates lock contention somewhere. Must be something special that was
triggered by hackbench.

We have had a regression like that in 2.6.24 before due to order 1 allocs
in the network layer. The .24 modifications for SLUB introduced page
allocator pass through for allocations > PAGESIZE/2. Order 1 allocs could
serialize on zone locks.
--

To: Ingo Molnar <mingo@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Friday, December 14, 2007 - 12:24 am

Hmmmm... Some tests here on an 8p 8G machine:

SLAB

N=10 Time: 0.341
N=20 Time: 0.605
N=50 Time: 1.487

SLUB

N=10 Time: 0.675
N=20 Time: 1.434
N=50 Time: 3.996

So its factor 2 for untuned SLUB. Looking at hackbench: This is a
test that allocates objects that are then consumed by N cpus that then
return them. The allocating processor can allocate from the per cpu slab
(the freelist is copied to a per cpu structure when its activated)
avoiding touching the page struct. However, the freeing processors
must update the freelist of the slab page directly. So they all content
for the cacheline of the page struct.

Since we do directly free there is the chance of lots of contention
between the N cpus that free the objects. This is in particular high in a
synthetic benchmark like this.

However, if the object to be freed is still in a cpu slab then the freeing
action will reduce the taking of the listlock. So we can actually
decrease the overhead by increasing the slab page size.

In an extreme case (boot with slub_min_order=9 to get huge page sized
slabs) SLUB can win against SLAB:

N=10 Time: 0.338 Minimally faster
N=20 Time: 0.560 10% faster
N=50 Time: 1.353 15% faster

Not sure how to get that mainstream yet but there certainly is a way to
get there. Need to get an idea how to reduce listlock contention in the
remote free case.
--

To: Christoph Lameter <clameter@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 8:09 am

what's up with this regression? There's been absolutely no activity
about it in the last 8 days: upstream still regresses, -mm still
regresses and there are no patches posted for testing.

being able to utilize order-0 pages was supposed to be one of the big
plusses of SLUB, so booting with _2MB_ sized slabs cannot be seriously
the "fix", right?

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Christoph Lameter <clameter@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 1:44 pm

Hi,

Christoph, did you see Steven's oprofile results for the hackbench
runs (http://lkml.org/lkml/2007/12/8/198)? Any ideas why we're hitting
add_partial like crazy?

Pekka
--

To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:22 pm

Hmmm... SLUB is cycling through partial slabs. If it gets fed objects with
a single free object from the free list again and again then this is the
behavior that one would see.

The worst case scenario would be.

1. Processor 0 gets slab with one free entry from the partial list.

2. Processor 0 allocates object and deactivates the slab since it is full.

3. Processor 1 frees the object and finds that it was not on the partial
list since there were no free objects. Call put_partial()

4. processor 0 gets slab with one free entry from the partial list.

If we would make the partial list a mininum size (which is already done
through MIN_PARTIAL maybe just increase it) then we may be able to avoid
this particular case. Also we could make sure that the slab is not put at
the beginning of the partial list on free in order to increase the time
that the slab spends on the partial list. That way it will gather more
objects before it is picked up.

Hmmmm... This is a bit different from what I got to here.

--

To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:37 pm

Actually thanks for pointing that out Pekka.... Here is a patch that takes
the regression almost completely away (Too much jetlag combined with flu
seems to have impaired my thinking I should have seen this earlier). I
still need to run my slab performance tests on this. But hackbench
improves.

SLUB: Improve hackbench speed

Increase the mininum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
mm/slub.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-12-21 14:30:11.096324462 -0800
+++ linux-2.6/mm/slub.c 2007-12-21 14:30:33.656441994 -0800
@@ -172,7 +172,7 @@ static inline void ClearSlabDebug(struct
* Mininum number of partial slabs. These will be left on the partial
* lists even if they are empty. kmem_cache_shrink may reclaim them.
*/
-#define MIN_PARTIAL 2
+#define MIN_PARTIAL 5

/*
* Maximum number of desirable partial slabs.
@@ -1616,7 +1616,7 @@ checks_ok:
* then add it.
*/
if (unlikely(!prior))
- add_partial(get_node(s, page_to_nid(page)), page);
+ add_partial_tail(get_node(s, page_to_nid(page)), page);

out_unlock:
slab_unlock(page);
--

To: Christoph Lameter <clameter@...>
Cc: Pekka Enberg <penberg@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:56 pm

FYI, this gives me:

mm/slub.c: In function 'kfree':
mm/slub.c:2604: warning: passing argument 3 of 'slab_free' discards qualifiers from pointer target type

looks harmless though at first sight.

Ingo
--

To: Christoph Lameter <clameter@...>
Cc: Pekka Enberg <penberg@...>, Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:52 pm

FYI, I'm currently on holiday till Jan 2cd, and I also lost access to the
machine I did the orignal benchmarks on. But I'll try to reserve it again
after the New Year, and I will run the benchmarks including this patch and
report my findings.

Thanks,

-- Steve

--

To: Christoph Lameter <clameter@...>
Cc: Pekka Enberg <penberg@...>, Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:51 pm

Now *this* is what I wanted to see - rather than argue against other
peoples performance regression reports, an actual patch that acknowledges
the problem.

Thanks.

And now, can the people who made the problem reports and complained about
SLUB please test the patch - the ball is now in your court!

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Christoph Lameter <clameter@...>, Pekka Enberg <penberg@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 7:17 pm

yep, and i ran a quick comparison test on a 2-core box with 3 kernels:

[ best of 5 runs in a row which had a relative jitter of less than 10% ]

MIN v2.6.24.slab v2.6.24.slub v2.6.24.slub.fix
----------------------------------------------------------
mmap: 429.00 402.00 ( -6%) 385.00 (-10%)
select: 11.38 10.46 ( -8%) 11.41 ( 0%)
proc-exec: 121.52 116.77 ( -3%) 120.77 ( 0%)
proc-fork: 106.84 106.19 ( 0%) 107.92 ( 1%)
syscall-open: 3.09 3.13 ( 1%) 3.25 ( 4%)
hackbench-50: 2.85 3.47 ( 21%) 2.88 ( 1%)

and the regression seems to be largely fixed! Not only is the hackbench
one fixed, but mmap shows an above-noise improvement as well.

Acked-by: Ingo Molnar <mingo@elte.hu>

And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson
;-)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Pekka Enberg <penberg@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 7:54 pm

Well lets use the version attached to this patch. It turns out that it is
not important to increase the number of partial slabs. Just putting the

He is already the co-maintainer of the slab allocators. See the
MAINTAINERS file.

SLUB: Improve hackbench speed

Increase the mininum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
mm/slub.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-12-21 14:30:11.096324462 -0800
+++ linux-2.6/mm/slub.c 2007-12-21 15:10:17.611022381 -0800
@@ -1611,12 +1611,12 @@ checks_ok:
goto slab_empty;

/*
- * Objects left in the slab. If it
- * was not on the partial list before
- * then add it.
+ * Objects left in the slab. If it was not on the partial list before
+ * then add it. Add it to the end since there is only a single object
+ * which would make slab_alloc inefficient.
*/
if (unlikely(!prior))
- add_partial(get_node(s, page_to_nid(page)), page);
+ add_partial_tail(get_node(s, page_to_nid(page)), page);

out_unlock:
slab_unlock(page);
--

To: Christoph Lameter <clameter@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Pekka Enberg <penberg@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 8:02 pm

You forgot to update the commit message.

Should be:

Put partial slabs to the end of the partial queue so that they can add
more objects.
--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Christoph Lameter <clameter@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 7:40 pm

Hi,

I thought it might be a bug but couldn't figure it out. The patch
looks good to me too, Christoph.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

Heh, thanks, but grep me from MAINTAINERS some day, Ingo :-).

Pekka
--

To: Ingo Molnar <mingo@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 12:18 pm

I added a test that does this 1 alloc N free behavior to the slab
in kernel benchmark (cpu 0 continually allocs, cpu 1-7 continously free)
(all order 0). The test makes the regression worse because it run the
scenario with raw slab allocs/frees in the kernel context without the use
of the scheduler:

SLAB
1 alloc N free(8): 0=2115 1=549 2=551 3=550 4=550 5=550 6=551 7=549 Average=746
1 alloc N free(16): 0=2159 1=573 2=574 3=573 4=574 5=573 6=574 7=573 Average=772
1 alloc N free(32): 0=2168 1=577 2=576 3=577 4=579 5=578 6=577 7=576 Average=776
1 alloc N free(64): 0=2276 1=555 2=554 3=554 4=555 5=556 6=555 7=554 Average=770
1 alloc N free(128): 0=2635 1=549 2=548 3=548 4=550 5=549 6=548 7=549 Average=809
1 alloc N free(256): 0=3261 1=543 2=542 3=542 4=544 5=542 6=542 7=542 Average=882
1 alloc N free(512): 0=4642 1=571 2=572 3=570 4=573 5=571 6=572 7=572 Average=1080
1 alloc N free(1024): 0=7116 1=582 2=581 3=583 4=581 5=582 6=582 7=583 Average=1399
1 alloc N free(2048): 0=11612 1=667 2=667 3=667 4=668 5=667 6=668 7=667 Average=2035
1 alloc N free(4096): 0=20135 1=673 2=674 3=674 4=675 5=674 6=675 7=674 Average=3107

SLUB
1 alloc N free test
===================
1 alloc N free(8): 0=1223 1=645 2=417 3=643 4=538 5=612 6=557 7=564 Average=650
1 alloc N free(16): 0=3828 1=3140 2=3116 3=3166 4=3122 5=3172 6=3129 7=3172 Average=3231
1 alloc N free(32): 0=4116 1=3207 2=3180 3=3210 4=3183 5=3202 6=3182 7=3214 Average=3312
1 alloc N free(64): 0=4116 1=3017 2=3005 3=3013 4=3006 5=3016 6=3004 7=3020 Average=3149
1 alloc N free(128): 0=5282 1=3013 2=3010 3=3014 4=3009 5=3010 6=3005 7=3013 Average=3295
1 alloc N free(256): 0=5271 1=2750 2=2751 3=2751 4=2751 5=2751 6=2752 7=2751 Average=3066
1 alloc N free(512): 0=5304 1=2293 2=2295 3=2295 4=2295 5=2294 6=2295 7=2294 Average=2671
1 alloc N free(1024): 0=6017 1=2044 2=2043 3=2043 4=2043 5=2045 6=2044 7=2044 Average=2541
1 alloc N free(2048): 0=7162 1=2086 2=2084 3=2086 4=2086 5=2085 6=2086 7=2086 Average=2720
1 alloc N free(4096): 0=7133 1=1795 ...

To: Christoph Lameter <clameter@...>
Cc: Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 12:44 pm

[ ... ]

Christoph, *NONE* of these arguments matter at all.

The only thing that matters is the simple fact that real-life benchmarks
show that SLUB is worse. It doesn't matter one whit that you can say that
it's better under some circumstance that either isn't triggered, or when

.. and this seems to be the single biggest reason to just say "revert SLUB
entirely".

If you aren't even motivated to fix the problems that have been reported,
then SLUB isn't even a _potential_ solution, it's purely a problem, and
since I am not IN THE LEAST interested in having three different
allocators around, SLUB is going to get axed.

In other words, here's the low-down:

a) either SLUB can replace SLAB, or SLUB is going away

This isn't open for discussion, Christoph. This was the rule back when
it got merged in the first place.

b) if SLUB performs worse than SLAB on real loads (TPC-C certainly being
one, and while hackbench may be borderline, it's certainly less
borderline than others), then SLUB cannot replace SLAB.

See (a)

c) If you aren't even interested in trying to fix it and are ignoring
the reports, there is not even a _potential_ for SLUB for getting over
these problems, and I should disable it and we should get over it as
soon as possible, and this whole discussion is pretty much ended.

See? It really is that simple. Either you say "Hell yes, I'll fix it", or
SLUB goes away. There simply is no orther choice.

When you say "not worth addressing", that to me is a clear an unambiguous
"let's remove SLUB".

The main reason for SLUB in the first place was SGI concerns. You seem to
think that _only_ SGI concerns matter. Wrong. If SLUB remains a SGI-only
thing and you don't fix it for other users, then SLUB gets reverted from
the standard kernel.

That's all. And it's not really worth discussing.

Linus

--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:11 pm

Not motivated? I have analyzed the problem in detail and when it comes
down to it there is not much impact that I can see in real life
applications. I have always responded to the regression reported also via
TPC-C. There is still no answer to my post at
http://lkml.org/lkml/2007/10/27/245

SLAB has similar issues when freeing to remote nodes under NUMA. Its
even worse since the lock is not per slab page but per slab cache. The

Looks like I need to find someone to get that going here to be able to

I have looked at this issue under various circumstances for the last week.
Nothing really convinced me that this is so serious. Could not figure
out if its really worth anything about but if you put it that way then of

The reason for SLUB was dysfunctional behavior of SLAB in many areas. It

Well still SLUB is really superior to SLAB in many ways....

- SLUB is performance wise much faster than SLAB. This can be more than a
factor of 10 (case of concurrent allocations / frees on multiple
processors). See http://lkml.org/lkml/2007/10/27/245

- Single threaded allocation speed is up to double that of SLAB

- Remote freeing of objectcs in a NUMA systems is typically 30% faster.

- Debugging on SLAB is difficult. Requires recompile of the kernel
and the resulting output is difficult to interpret. SLUB can apply
debugging options to a subset of the slabcaches in order to allow
the system to work with maximum speed. This is necessary to detect
difficult to reproduce race conditions.

- SLAB can capture huge amounts of memory in its queues. The problem
gets worse the more processors and NUMA nodes are in the system.

- SLAB requires a pass through all slab caches every 2 seconds to
expire objects. This is a problem both for realtime and MPI jobs
that cannot take such a processor outage.

- SLAB does not have a sophisticated slabinfo tool to report the
state of slab objects on the system. Can provide details of
object use.

- SLAB requires the ...

To: Christoph Lameter <clameter@...>
Cc: Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:49 pm

No. If there are 6% performance regressions on TPC-C, then it CAN NOT

No.

Christoph, statements like this is *exactly* why I don't think SLUB can
make it.

You're closing your eyes to real performace *regression* reports, and then
you claim thast SLUB is faster, because you find your own microbenchmarks
that show so for specific loads.

But those specific loads are apparetly never the issue.

So stop saying that SLUB is "much faster", as long as hackbench shows that
that is simply NOT THE CASE.

It doesn't matter one whit if SLUB is lots faster, if it's faster for
cases that never matter. So far, I don't think we've *ever* seen any
actual benchmarks that involve any kind of real use where SLUB is
really faster, and we have some major examples of SLUB being disastrously
slower!

Your special-case kmalloc performance tests don't matter, when real user
programs show the exact opposite effect.

And the fact that you dismiss those real user programs just because you
have your own test harness is why I'm ready to throw in the towel on SLUB.

I don't mind performance regressions, but I *do* mind performance
regressions when the main author then says "look, a helicopter!" and
points to some totally different thing and claims that the performance
regression doesn't even exist!

Because those kinds of performance regressions never get fixed, because
they are ignored.

Linus
--

To: Christoph Lameter <clameter@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:16 pm

But you are dismissing the hackbench regression, which is not a small
one. It runs an astonishing 10x slower.

--

To: Christoph Lameter <clameter@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:17 pm

BTW, does /proc/slabinfo exist again? I thought we set that as a
requirement for SLUB to be the default and a full replacement for SLAB.

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:27 pm

Well the information that would be exposed in /proc/slabinfo would only be
faked up stuff from SLUB. The queues that are described in /proc/slabinfo do not
exist f.e. (tunables do not make sense) and there are more details available via /sys/kernel/slab.

There is a discussion on linux-mm with some of the people working on tools
to make these to use the information that SLUB provides.

--

To: Christoph Lameter <clameter@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 6:54 pm

Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop
relies on it, people use it every day to monitor memory consumption.

I'm really getting worried that you are apparently incapable of grasping
such _SIMPLE_ concepts. Who the heck cares whether you put in zeros or
whatever else in some of the fields? People use it to know how many
objects are allocated and sure SLUB knows that count, sheesh. How on
earth can you come up with a lame excuse like that? You dont like the
'SLAB' portion of the name perhaps? Is it NIH again?

Really, if your behavior is representative of how our SLAB allocator
will be maintained in the future then i'm very, very worried :-( You
ignore and downplay clear-cut regressions, you insult and attack
testers, you are incredibly stupid about user ABIs (or pretend to be so)
and you distort and mislead all the way. What will you be able to do in
the much less clear-cut cases??

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 8:28 pm

It's definitely not a stable ABI. slabtop tends to exit without any
error message on any slabinfo version number increase and I've seen
that happen several times in not so old kernels.

Requiring just another slabtop update isn't really a big deal.

Also it's not that it's a critical functionality like udev.

-Andi

--

To: Andi Kleen <andi@...>
Cc: Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 6:03 am

so because we sucked in the past we can continue to suck? ;)

Why are we still arguing about this? We kernel developers are foxes
amongst the hens and if a compatibility issue comes up we have to act
_doubly_ conservatively.

You think it's reasonable to not offer /proc/slabinfo? You can fairly
assume that a user considers it absolutely unreasonable. If other kernel
developers tell you: "no, it's fine", then it's as if other foxes told

certainly. But consider this from the user's perspective who tries one
of our devel kernels. He suspects a memory leak. Runs slabtop and
gets:

$ slabtop
fopen /proc/slabinfo: No such file or directory

and would fairly conclude: "ok, this new Linux kernel looks quite
apparently unfinished, i'm outta here".

We do this way too frequently and many silly details like this _do_

Sure, we can argue about details that not all fields in /proc/slabinfo
are relevant, and that slabtop should be a bit more careful, etc., but
we've got what we've got because _we_ built the current code, so we
might as well accept the consequences it brings. The most of the basic
output of slabtop:

Active / Total Objects (% used) : 648754 / 747076 (86.8%)
Active / Total Slabs (% used) : 39394 / 39394 (100.0%)
Active / Total Caches (% used) : 103 / 143 (72.0%)
Active / Total Size (% used) : 138884.36K / 151075.96K (91.9%)
Minimum / Average / Maximum Object : 0.01K / 0.20K / 128.00K

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
261928 239808 91% 0.13K 9032 29 36128K dentry_cache
222048 174144 78% 0.05K 3084 72 12336K buffer_head
187232 178929 95% 0.48K 23404 8 93616K ext3_inode_cache
24416 17908 73% 0.27K 1744 14 6976K radix_tree_node

could be offered on SLUB too.

'top' isnt critical functionality either like udev, and the ABI does not
only cover 'critical' functionality. A utility suddenly not working
gives Linux a pretty amateurish fe...

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 2:01 pm

Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, and
the fact that there hasn't been a lot of complaints about it going away
does seem to imply that it wasn't a very important ABI.

I'm the first to stand up for backwards compatibility, but I also try to
be realistic, and so far nobody has really complained about the fact that
/proc/slabinfo went away on any grounds but on the general principle of
"it was a user-visible ABI".

We've changed user-visible ABI's in the past in the hope that they weren't
actually used. Sometimes it worked, sometimes it didn't. In this case, I
think it still has the potential of working.

That said, I'm not a *huge* fan of /sys/slab/ either. I can get the info I
as a developer tend to want from there, but it's really rather less
convenient than I'd like. It is really quite hard, for example, to get any
kind of "who actually uses the most *memory*" information out of there.

You have to use something like this:

for i in *
do
order=$(cat "$i/order")
slabs=$(cat "$i/slabs")
object_size=$(cat "$i/object_size")
objects=$(cat "$i/objects")
truesize=$(( $slabs<<($order+12) ))
size=$(( $object_size*$objects ))
percent=$(( $truesize/100 ))
if [ $percent -gt 0 ]; then
percent=$(( $size / $percent ))
fi
mb=$(( $size >> 10 ))
printf "%10d MB %3d %s\n" $mb $percent $i
done | sort -n | tail

which works fine, but while it's actually _much_ more readable than doing
the same thing with /proc/slabinfo was, I worry about the lack of
atomicity in getting the statistics.

I dunno. I do think /sys/slab/ is a better interface than /proc/slabinfo
was. I just wonder if it could be better still.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 3:46 pm

As a followup to the thread regarding slabinfo and Andrew's
earlier query about updating slabtop.

watch -n1 slabinfo -S

seems to be a reasonable replacement for slabtop. The only
missing functionality are some hotkeys missing: you have to restart
now to get a different sort order (and to read the slabinfo source
to find out the correct option), but other than that it works.

So I would propose to just retire slabtop and replace
it with a simple shell script doing that.

A manpage for slabinfo would be useful though. Anybody
volunteering to write one?

-Andi

--

To: Andi Kleen <andi@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 7:28 pm

That would be me.
I'm a newbie and never wrote a man page before, so it will take a few
days, but I'm bored and out of ideas for any new code for the moment
being. Should be fun.

Karol Swietlicki
--

To: Andi Kleen <andi@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 29, 2007 - 2:08 pm

I'm dropping out from this one.
I worked for a good while, and I really see no way of improving what
the usage information (slabinfo -h) already provides. It's a good
description, and I'm starting to think that adding a man page is
overkill. If I am misunderstanding something and what is desired is in
fact usage information in the man page format, I can provide such a
page, no problem.

At least now I know how to write man pages.

Karol Swietlicki
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 3:25 pm

That's probably because the people who are most likely to be using
/proc/slabinfo tend to people people using kernels in production
environments, and they probably haven't started playing with SLUB yet.
So the fact we haven't heard the yelling doesn't necessarily mean that
people won't notice.

I know of a number of debugging scripts that periodically poll
/proc/slabinfo to find out what is using memory and to track trends
and predict potential problems with memory usage and balance. Indeed,
I've left several such scripts behind at various customer
installations after debugging memory usage problems, and some system
administrators very much like being able to collect that information
easily. So I can say for sure there at there are scripts out there
that depend on /proc/slabinfo, but they generally tend to be at sites

I have a general problem with things in /sys/slab, and that's just
because they are *ugly*. So yes, you can write ugly shell scripts

But sometimes when trying to eyeball what is going on, it's a lot
nicer just to use "cat /proc/slabinfo".

Another problem with using /sys/slab is that it is downright *ugly*.
Why is it for example, that /sys/slab/dentry is a symlink to
../slab/:a-0000160? Because of this, Linus's script reports dentry twice:

10942 MB 86 buffer_head
10942 MB 86 journal_head
10943 MB 86 :a-0000056
21923 MB 86 dentry
21924 MB 86 :a-0000160
78437 MB 94 ext3_inode_cache

Once as "dentry" and once as ":a-0000160". A similar situation
happens with journal_head and ":a-0000056". I'm sure this could be
filtered out with the right shell or perl script, but now we're
getting even farther from /proc/slabinfo.

Further, if you look at Documentation/sysfs-rules.txt (does anyone
ever read it or bother to update it?), /sys/slab isn't documented so
it falls in the category of "everything else is a kernel-level
implementation detail which is not guaranteed to be stable". Nice....

Also, looking at sysfs-rules....

To: Theodore Tso <tytso@...>
Cc: Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 5:00 pm

.. and I call BS on this claim.

/proc/slabinfo was (and is) totally pointless for "trying to eyeball
what's going on". The output is totally unreadable, and useless. You end
up with exactly the same script as above, except it reads as

cat /proc/slabinfo | (read headerline
while read name active num objsize objsperslab pagesperslab rest
do
realsize=$(( nul * objsize ))
size=$(( active * objsize ))
.. exact same rest of loop ..
done | sort -n | ..

so no, "cat /proc/slabinfo" was almost never practical on its own.

The *one* advantage it does have is that you can forward it to others.
That's a big advantage. But no, it wasn't ever readable for eyeballing,
because it doesn't even give you a memory usage thing (just "number of
objects and object size" as separate numbers).

But the "everything in one file" indubitably did make it a lot easier for

That's the only really ugly thing there. Otherwise, it's pretty nice, but
having a million files makes for problems when trying to send somebody
else the full info.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Theodore Tso <tytso@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 6:10 pm

I don't agree with you Linus. I'm one of those used to quickly take a
look at its contents when I don't know where all my memory has gone.
I know by experience that if I find 6-digit values in dentry_cache or
inode_cache, all I was looking for is there, and that's enough for a
quick diag. It doesn't give anything accurate, but it's very useful
to quickly diag out some problems on production systems.

It was this week that I noticed for the first time that slabinfo did
not exist anymore, and did not know what replaced it. Lacking time,
I finally gave up and *supposed* that the memory was eaten by the
usual suspects.

I can understand that it has to go away for technical reasons, but Ted
is right, please don't believe that nobody uses it just because you got
no complaint. While people are not likely to perform all computations
in scripts, at least they're used to find some quickly identifiable
patterns there.

Willy

--

To: Willy Tarreau <w@...>
Cc: Linus Torvalds <torvalds@...>, Theodore Tso <tytso@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 8:59 pm

[/me sneaks away from the family]

I know when I'm looking for memory leaks, I've asked customers to give
snapshots of slabinfo at periodic times (once a day even, matters how bad
the leak is). This has been helpful in seeing if something did indeed
leak.

If you have a slab cache that constantly grows, and never shrinks, that's
a good indication that something might be leaking. Not always, since there
can be legitimate reasons for that, but sometimes it helps.

But I still scratch my head when ever I need to touch sysfs.

[/me runs back to the family]

-- Steve

--

To: Steven Rostedt <rostedt@...>
Cc: Linus Torvalds <torvalds@...>, Theodore Tso <tytso@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Sunday, December 23, 2007 - 1:12 am

On Sat, Dec 22, 2007 at 07:59:47PM -0500, Steven Rostedt wrote:

Same here. In fact, I've always considered that procfs was for
humans while sysfs was for tools. sysfs reminds me too much the
unexploitable /devices in Solaris. With the proper tools, I think
we can do a lot with it, but it's not as intuitive to find the
proper tools as it was to do "ls" followed by "cat" in /proc.

Willy

--

To: Willy Tarreau <w@...>
Cc: Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Theodore Tso <tytso@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Sunday, December 23, 2007 - 10:15 am

find /sys/... -type f | while read i ; do echo "$i: $(<$i)" ; done

tends to work reasonably well for a quick overview, but yes
cat was nicer for humans.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Sunday, December 23, 2007 - 11:45 pm

Until you start to wonder what the heck :a-0000136 is:

/sys/slab/:a-0000136/objs_per_slab: 30

Sigh...

- Ted

--

To: Theodore Tso <tytso@...>
Cc: Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Monday, December 24, 2007 - 3:21 pm

That is why there is a slabinfo tool that does all the nice formatting.

Do a

gcc -o slabinfo Documentation/vm/slabinfo.c

Then run slabinfo instead of doing cat /proc/slabinfo

--

To: Christoph Lameter <clameter@...>
Cc: Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Monday, December 24, 2007 - 7:37 pm

So two questions: why isn't -f the default? And is /sys/slab
guaranteed to be a stable and permanent interface if the SLAB
implementation ever gets ripped out? If so, maybe this should go into
util-linux-ng? I am aa bit concerned about the lack of atomicity of
/sys/slab, but this is a heck of a lot better of many kernel drivers
or subsystems which use /sys and the completely punt on any kind of
userspace utility, forcing users to type crazy things like

echo 5 > /sys/bus/pci/drivers/iwl4965/*/power_level

- Ted
--

To: Theodore Tso <tytso@...>
Cc: Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Wednesday, December 26, 2007 - 5:31 pm

Because it gives misleading output. It displays the name of the first
of multiple slabs that share the same storage structures.
--

To: Christoph Lameter <clameter@...>
Cc: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Wednesday, December 26, 2007 - 6:16 pm

Erm... Let me spell it out: current lifetime rules are completely broken.
As it is, create/destroy/create cache sequence will do kobject_put() on
kfree'd object. Even without people playing with holding sysfs files
open or doing IO on those.

a) you have kobject embedded into struct with the lifetime rules of its
own. When its refcount hits zero you kfree() the sucker, even if you
still have references to embedded kobject.

b) your symlinks stick around. Even when cache is long gone you still
have a sysfs symlink with its embedded kobject as a target. They are
eventually removed when cache with the same name gets created. _Then_
you get the target kobject dropped - when the memory it used to be in
had been freed for hell knows how long and reused by something that would
not appreciate slub.c code suddenly deciding to decrement some word in
that memory.

c) you leak references to these kobject; kobject_del() only removes it
from the tree undoing the effect of kobject_add() and you still need
kobject_put() to deal with the last reference.
--

To: Al Viro <viro@...>
Cc: Christoph Lameter <clameter@...>, Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 28, 2007 - 5:00 am

as a sidenote: bugs like this seem to be reoccuring. People implement
sysfs bindings (without being sysfs internals experts - and why should
they be) - and create hard to debug problems. We've seen that with the
scheduler's recent sysfs changes too.

shouldnt the sysfs code be designed in a way to not allow such bugs? The
primary usecase of sysfs is by people who do _not_ deal with it on a
daily basis. So if they pick APIs that look obvious and create hard to
debug problems (and userspace incompatibilities) that's a primary
failure of sysfs, not a failure of those who utilize it.

At a minimum there should be some _strong_ debugging facility that
transparently detects and reports such bugs as they occur.
CONFIG_DEBUG_KOBJECT is totally unusable right now, it spams the syslog
(so no distro ever enables it - i disable it in random bootups as well
because it takes _ages_ to even get to a boot prompt) and never finds
any of these hard-to-find-but-easy-to-explain bugs.

or if sysfs/kobjects should be scrapped and rewritten, do you have any
insight into what kind of abstraction could/should replace it? Should we
go back to procfs and get rid of kobjects altogether? (as it's slowly
turning into a /proc problem of itself, with worse compatibility and
sneakier bugs.)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Al Viro <viro@...>, Christoph Lameter <clameter@...>, Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 28, 2007 - 10:52 am

The few times I've used kobjects/sysfs, I would always need to "relearn"
how to use it. It's not trivial at all, and even when I finally get it
working, I'm not sure it's working correctly.

Although something like debugfs took me a few minutes to use. Now when
ever I need to do something that userspace uses, I simply choose debugfs.

I thought the switch to sysfs was to keep procfs cleaner and only used for
processes. Why did it need something so complex as the kobject structure?
Was it so that udev could integrate with it? If sysfs was as simple as
debugfs, I think a lot of these bugs would not have occurred.

-- Steve

--

To: Al Viro <viro@...>
Cc: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 1:51 am

Hmmmm.. That would also be addressed by a having a separately allocated

Patches would be appreciated.... Had a hard time to get the sysfs support
to work right.
--

To: Al Viro <viro@...>
Cc: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 4:28 pm

Hmmm.. If I separately allocate the kobject then I can no longer get to
the kmem_cache structure from the kobject.

I need to add a second kobject_del to sysfs_slab_remove() to make sysfs
completely forget about the object?

Probably should track down any remaining symlinks at that point and nuke
them too. Isnt there some way to convince sysfs to remove the symlinks
if the target vanishes?

--

To: Christoph Lameter <clameter@...>
Cc: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 6:59 pm

Don't bother with separate allocation.

a) remove symlink when slab goes away
b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff
c) have ->release() of these kobjects do kfree()
--

To: Al Viro <viro@...>
Cc: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 7:22 pm

Hmmmm.... Okay. Patch follows but its strange to do a kobject_put after a

In slab_ktype?

---
mm/slub.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-12-27 15:09:38.000000000 -0800
+++ linux-2.6/mm/slub.c 2007-12-27 15:21:12.000000000 -0800
@@ -247,7 +247,10 @@ static void sysfs_slab_remove(struct kme
static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
{ return 0; }
-static inline void sysfs_slab_remove(struct kmem_cache *s) {}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+ kfree(s);
+}
#endif

/********************************************************************
@@ -2322,7 +2325,6 @@ void kmem_cache_destroy(struct kmem_cach
if (kmem_cache_close(s))
WARN_ON(1);
sysfs_slab_remove(s);
- kfree(s);
} else
up_write(&slub_lock);
}
@@ -3940,6 +3942,13 @@ static ssize_t slab_attr_store(struct ko
return err;
}

+static void slab_release(struct kobject *kobj)
+{
+ struct kmem_cache *s = to_slab(kobj);
+
+ kfree(s);
+}
+
static struct sysfs_ops slab_sysfs_ops = {
.show = slab_attr_show,
.store = slab_attr_store,
@@ -3947,6 +3956,7 @@ static struct sysfs_ops slab_sysfs_ops =

static struct kobj_type slab_ktype = {
.sysfs_ops = &slab_sysfs_ops,
+ .release = slab_release
};

static int uevent_filter(struct kset *kset, struct kobject *kobj)
@@ -4048,6 +4058,7 @@ static void sysfs_slab_remove(struct kme
{
kobject_uevent(&s->kobj, KOBJ_REMOVE);
kobject_del(&s->kobj);
+ kobject_put(&s->kobj);
}

/*

--

To: Christoph Lameter <clameter@...>
Cc: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 7:54 pm

Huh? Just call it from kmem_cache_destroy(); what business does that symlink

kobject_del() undoes the effect of kobject_add(). And then you are left
with the refcount you had before kobject_add(), i.e. from kobject_init().

Think of it that way: kobject refcount controls the lifetime of structure
the sucker's embedded into. Depending on kobject methods, you might
be unable to do cleanup of non-kobject parts of that animal until after
kobject_del(). IOW, you _can't_ have kobject_del() dropping the (hopefully)
final reference to kobject - only the caller knows what might have to be
done in between.
--

To: Al Viro <viro@...>
Cc: Steven Rostedt <rostedt@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 7:53 pm

How do I iterate over all symlinks in /sys/kernel/slab/*?

I remember trying to do it before and not being able to find a sysfs
method for that.

--

To: Christoph Lameter <clameter@...>
Cc: Steven Rostedt <rostedt@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 7:58 pm

What the hell do you need that for? You have an obvious moment when
/sys/kernel/slab/cache_name -> <whatever> should be gone - it's when
you remove the slab called cache_name. And at that point you don't
need to iterate over anything - you have the exact name, for fsck sake...

Why would you want these symlinks to stick around for longer than that?
--

To: Al Viro <viro@...>
Cc: Steven Rostedt <rostedt@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 8:02 pm

/sys/kernel/slab/cache_name is a real directory but then there are the
aliases in /sys/kernel/slab/alias_name pointing to that directory that
also have to be removed.

So I need to scan for symlinks in /sys/kernel/slab/* pointing to
/sys/kernel/slab/cache_name.

--

To: Christoph Lameter <clameter@...>
Cc: Steven Rostedt <rostedt@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 8:45 pm

Oh, lovely. So we can have module A do kmem_cache_create(), calling
cache "foo". Then module B does (without any knowlegde about A)
completely unrelated kmem_cache_create(), calling the sucker "bar".
mm/slub decides that they are mergable. Then we get rmmod A... and
no way to find out if that's foo or bar going away - kmem_cache_destroy()
doesn't have enough information to figure that out. So we have to keep
both slab/foo and slab/bar around until both are gone or until somebody
kind enough creates a cache called foo. Better yet, on some systems you
get things like slab/nfsd4_delegations sticking around long after nfsd is
gone just because slub.c decides that it's sharable, but in the next
kernel version the thing it's shared with gets different object size and
behaviour suddenly changes - now it's gone as soon as nfsd is gone.
Brilliant interface, that...
--

To: Al Viro <viro@...>
Cc: Steven Rostedt <rostedt@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 10:19 pm

Just enable slab debugging and all of that goes away and you get the usual
error messages. Otherwise slub tries to compress the memory use as much as
possible. The advantage here is that you can add an abitrary amount of
kmem_cache_create() without increasing memory use significantly. Its cheap
to do kmem_cache creations. Thus you can create open arbitrary caches not
worrying about memory use and have meaningful variables on which you do
kmem_cache_alloc instead of kmalloc/kfree. kmalloc/kfree wastes space
since they go to powers of two. And kmalloc/kfree have to calculate the
cache at runtime whereas kmem_cache_alloc/free does the lookup once.

The amount of per cpu stuff we keep around that is idle is reduced by 50%.
Increasing cpu cache usage and reduces memory footprint. per cpu stuff
has to hold per cpu reserves and it is not good to waste too much memory
there.

nfsd4_delegations? What is this about?

How do I scan for the symlinks in sysfs?

--

To: Christoph Lameter <clameter@...>
Cc: Steven Rostedt <rostedt@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Rafael J. Wysocki <rjw@...>
Date: Thursday, December 27, 2007 - 11:26 pm

At which point are you going to do that? AFAICS, the fundamental problem
is that you
* have aliases indistinguishable, so kmem_cache_destroy() can't tell
which one is going away, no matter what
* have per-alias objects in sysfs
As the result, you have a user-visible mess in that directory in sysfs.
And I don't see how you would deal with that - on the "the contents of
directory changes in so-and-so way when such-and-such operation is
done", not the implementation details one.

BTW, I'm rather sceptical about free use of slabs; keep in mind that their
names have to be unique with your sysfs layout, so...
--

To: Al Viro <viro@...>
Cc: Steven Rostedt <rostedt@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Rafael J. Wysocki <rjw@...>
Date: Wednesday, January 2, 2008 - 4:25 pm

If the alias count of a kmem_cache structure reaches zero then all aliases

What do you mean by free use of slabs? Creation of new slab caches to
avoid the rounding up to order 2 size?

--

To: Theodore Tso <tytso@...>, Christoph Lameter <clameter@...>, Andi Kleen <andi@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Monday, December 24, 2007 - 11:34 pm

Debugging feature and "stable and permanent" just don't
fit together. It's like requesting stable and permanent
sysrq output.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Christoph Lameter <clameter@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Tuesday, January 1, 2008 - 8:47 am

Yeah, unfortunately, querying to find out how much memory is being
used by which parts of the system is something which I think needs to
be more than just a debugging feature. One could argue that "vmstat",
"iostat", and "sar" are debugging features as well, but other people
would consider them "critical programs to get information necessary to
monitor the health of their system".

Perhaps /proc/slabinfo should be that relatively stable interface, if
/sys/slab can't be guaranteed to be stable. But there *are* people
who will want to monitor information like this on an ongoing fashion
on production systems. One of the major downsides of /sys seems to be
that it's very hard to keep it stable, so maybe it's not the right
tool for this. /proc/slabinfo has the advantage that it's a simple
text file, and its format is relatively well-understood, and doesn't a
patch to provide /proc/slabinfo for SLUB already exist?

- Ted
--

To: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Tuesday, January 1, 2008 - 12:23 pm

yes, the complete (and tested) patch is below. It has been through a few
thousand random-bootup tests on x86 32-bit and 64-bit so it's v2.6.24
material i think.

Ingo

----------------->
Subject: slub: provide /proc/slabinfo
From: Pekka J Enberg <penberg@cs.helsinki.fi>

This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work.

[ mingo@elte.hu: build fix. ]

Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/slub_def.h | 2
mm/slub.c | 105 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 94 insertions(+), 13 deletions(-)

Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -200,4 +200,6 @@ static __always_inline void *kmalloc_nod
}
#endif

+extern const struct seq_operations slabinfo_op;
+
#endif /* _LINUX_SLUB_DEF_H */
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -3076,6 +3076,19 @@ void *__kmalloc_node_track_caller(size_t
return slab_alloc(s, gfpflags, node, caller);
}

+static unsigned long count_partial(struct kmem_cache_node *n)
+{
+ unsigned long flags;
+ unsigned long x = 0;
+ struct page *page;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ list_for_each_entry(page, &n->partial, lru)
+ x += page->inuse;
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return x;
+}
+
#if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG)
static int validate_slab(struct kmem_cache *s, struct page *page,
unsigned long *map)
@@ -3458,19 +3471,6 @@ static int list_locations(struct kmem_ca
return n;
}

-sta...

To: Ingo Molnar <mingo@...>
Cc: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, January 4, 2008 - 8:27 pm

On Tue, 1 Jan 2008 17:23:28 +0100

this is the part I'm not very thrilled about... at least on first sight
it looks like a user can now hold the read sem over system calls, and for as long as it wants.
Either that's no problem (but then we wouldn't need the lock in the first place) or it
is a problem that then wants to be fixed
--

To: Arjan van de Ven <arjan@...>
Cc: Ingo Molnar <mingo@...>, Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, January 4, 2008 - 8:49 pm

No, you misunderstand how seq-files work.

The start/stop sequence is done only overone single kernel buffer
instance, not over the whole open/close (or even the copy to user space).

It's expressly designed so that you can hold locks (including spinlocks
etc), and will not do any blocking ops (well - your *callbacks* can you
blocking ops, but the seq_file stuff itself won't) between start/stop.

Linus
--

To: Theodore Tso <tytso@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Willy Tarreau <w@...>, Steven Rostedt <rostedt@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Tuesday, January 1, 2008 - 11:19 am

slabinfo is about querying kernel internals and requiring stable
and permanent interfaces to that would imply freezing the kernel
internals.

People who want to have information about slabs and other similar
kernel internal allocations will always have to deal with version
dependencies.

That would require at least never changing any slab caches in any
subsystem. Obviously not an option.

-Andi
--

To: Linus Torvalds <torvalds@...>
Cc: Theodore Tso <tytso@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 5:50 pm

*raised brows*

I would say that there's that really ugly thing with embedding kobject
into a struct with the lifetime rules of its own and then having that
struct kfree()d while kobject might still have references, but hey,
maybe it's just me and my lack of appreciation of the glory that is
sysfs.

Al, too tired of ranting about sysfs being a major architecture
mistake and a recurring source of turds like that one...
--

To: Linus Torvalds <torvalds@...>
Cc: Theodore Tso <tytso@...>, Ingo Molnar <mingo@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 7:29 pm

BTW, the fact that presence of that kobject is conditional makes life even
uglier - we have to either kfree() or drop a reference to kobject leaving
actual kfree() to its ->release(). While we are at it, when do we remove
the symlinks? That got to add even more fun for the lifetime rules...

Sigh... How does one set a script that would mail a warning upon git pull
that introduces any instances of keyword from given set? I hadn't noticed
that slub was using sysfs when it went into the tree back in May ;-/
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 3:09 pm

Yes, i agree that me calling it an 'ABI' stretches the term - we dont
want to put ourselves into a forced compatibility corner in every case
where /proc does something really stupid. But /proc/slabinfo is being
used, slabtop is installed and deployed, so why break it unnecessarily?
It's not like we couldnt remove /proc/slabinfo later on, via the normal
route of feature removal. I think Pekka's patch that adds /proc/slabinfo
is simple and reasonable enough for 2.6.24.

We can get rid of /proc/slabinfo but then it should i think be done via
the normal route of Documentation/feature-removal-schedule.txt. Was
there any particular problem with /proc/slabinfo, at least as far as the
read-only access that slabtop does goes? I dont think there was. So why
should we go out on a limb _not_ providing it when there's a patch
available, etc. I just googled a bit and people _have_ asked about
slabtop availability, and the impression was that it would be offered by
the time SLUB becomes the default. (and this was mentioned by others in
this discussion)

I'd also not rely on the fact that only a few people are complaining.
Most people, even 2.6.24-rc early adopters, still use SLAB because early
adopters typically use their .23 .config and do a 'make oldconfig' -
which picks up SLAB. So SLUB use will become more widespread only once
2.6.24 is out and is packaged in distros. Distros will likely pick SLUB
if there's no performance worries and if it's the default. Fedora
rawhide already uses SLUB.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 6:52 pm

>>>>> "IM" == Ingo Molnar <mingo@elte.hu> writes:

IM> Distros will likely pick SLUB if there's no performance worries
IM> and if it's the default. Fedora rawhide already uses SLUB.

Actually, it seems to me that not only does Fedora rawhide use SLUB,
but Fedora 8 and 7 use it as well. They don't have /proc/slabinfo and
kernel/devel/config-generic:CONFIG_SLUB=y
kernel/F-7/configs/config-generic:CONFIG_SLUB=y
kernel/F-8/config-generic:CONFIG_SLUB=y

- J<
--

To: Jason L Tibbitts III <tibbs@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Sunday, December 23, 2007 - 11:59 pm

Also true for at least recent versions of FC6 (which my bittorrent
machine is still on), which currently run 2.6.22-14 based kernels.
And no, I didn't notice - that box usually hits triple-digit uptime
before I reboot; in fact, it's still running a 2.6.22-9 based kernel.

--alessandro

"Damaged people are dangerous. They know they can survive."

(Anna Barton/Juliette Binoche, 'Damage')
--

To: Linus Torvalds <torvalds@...>
Cc: Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 3:29 pm

here's some silly statistics about allocator coverage on lkml, based on
configs reported to lkml in the past 4 months:

$ for N in SLAB SLUB SLOB; do printf "%s: " $N; grep ^CONFIG_$N=y linux-kernel | wc -l; done
SLAB: 70
SLUB: 77
SLOB: 4

so SLUB and SLAB is utilized about equally amongst people who reported
configs to lkml.

But people who use SLUB enabled it intentionally - and they are thus
much less likely to complain about this choice of them.

Reporting:

"I just enabled SLUB instead of SLAB, and hey guys, it does not have
SLABinfo"

has a foolish ring to it, doesnt it? I'd rather ask carefully, like this
person did:

http://kerneltrap.org/mailarchive/linux-kernel/2007/10/12/335765

This is one of the reasons why i think the whole SLAB->SLUB renaming was
bad - we should have done SLAB2 or SLAB^2 instead, so that people expect
_at least as good_ behavior (performance, features, etc.) from SLUB as
from SLAB. Instead of "something different".

anyway ... i think we still generally suck _alot_ at providing
near-transparent kernel upgrades to users. (kernel upgrades are still a
pain and risk, and often just due to poor developer choices on our
side.)

It's nowhere near as bad as the 2.4->2.6 transition was (in fact it
shouldnt even be mentioned in the same sentence), and it's getting
better gradually, but i think we should just by default be 10 times more
careful about these things - whenever it is borderline technically sane
to do so. We induce enough unintentional breakage of user-space, we
shouldnt compound it with intentional breakages. The kernel community
still has _a lot_ of user and distro trust to win back. So being seen as
over-cautious wont harm.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 8:37 am

> could be offered on SLUB too.

Sure (I argued that in a earlier mail in fact), but it doesn't make

While I sympatise with this point in practice non critial ABIs change

Removing an interface that exposes lots of internal details when you
rewrite the subsystem and those internal details all change seems
like a good reason to me.

Besides the original interface was broken anyways. The version
number was one of the interface worst ideas ever imho and slabtop's handling
of it quite dumb. The better way would have been to always add new
fields and never remove old ones. Hopefully the new /sys based interface
will be better.

-Andi

--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 9:01 am

Could we please drop "kset: move /sys/slab to /sys/kernel/slab" from -mm?
Old slabinfo complains that SLUB_DEBUG is not on and refuses to do anything.
--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 8:51 am

Hi,

Yeah but then again removing an interface that has been around for
ever is a real PITA for users. Besides, emulating /proc/slabinfo ain't
so bad:

http://lkml.org/lkml/2007/12/22/58

Pekka
--

To: Pekka Enberg <penberg@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 8:54 am

Version 2.1 hasn't been around forever and at least slabtop does not
work over version number changes in my experience.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Saturday, December 22, 2007 - 9:27 am

Hi Andi,

True but the default user-visible ABI (the one without statistics) is
unchanged since version 2.0:

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=c...

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=c...

But anyway, I don't care too much either way so I'll just let Andrew
sort this out, as usual.

Pekka
--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 10:19 pm

Might as well mention that SLOB also doesn't provide this functionality,
nor would it make any sense for it to try (because there are no
slab-like objects to report about).

--
Mathematics is the supreme nostalgia of our time.

--

To: Matt Mackall <mpm@...>
Cc: Andi Kleen <andi@...>, Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 10:44 pm

% slabinfo
Name Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
...
anon_vma 1551 24 81.9K 20/18/2 128 0 90 45

Seems certainly like information that slabtop could report, even if it's
not exactly the same.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Ingo Molnar <mingo@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 8:33 pm

There's also Documentation/vm/slabinfo.c, which really belongs in
some kind of 'kernel utilities' package. Since it's so intimately
tied to kernel version I don't think it belongs in util-linux.

--

To: Ingo Molnar <mingo@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Linus Torvalds <torvalds@...>, Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 7:51 pm

NIH? I wrote major portions of SLAB. I would be hating my own product.

I analyzed the issue and argued that the issues that one test showed in
SLUB is a really special case and then you conclude that I ignore all
regressions? I have addressed and responded to all reports of regressions
that came to me.
--

To: Ingo Molnar <mingo@...>
Cc: <clameter@...>, <a.p.zijlstra@...>, <torvalds@...>, <rostedt@...>, <linux-kernel@...>, <hch@...>, <rjw@...>
Date: Friday, December 21, 2007 - 7:27 pm

On Fri, 21 Dec 2007 23:54:13 +0100

That would be really bad.

/proc/slabinfo intimately exposes implementation-specific internals and
strictly speaking should never have been introduced. Obviously that isn't
a _practical_ position, but there are no easy solutions here.

We don't want to have to drag a pile of be-compatible-with-slab gunk along

slabtop needs to be changed asap.

Possibly we could look at adding some interrim make-it-look-like-slab hack
into slub.c for a couple of releases but no longer.

If we'd updated slabtop six months ago this issue wouldn't have arisen now.

But we didn't do that. We're quite bad at this sort of thing.
--

To: <mingo@...>
Cc: <clameter@...>, <a.p.zijlstra@...>, <torvalds@...>, <rostedt@...>, <linux-kernel@...>, <akpm@...>, <hch@...>, <rjw@...>
Date: Friday, December 21, 2007 - 7:20 pm

From: Ingo Molnar <mingo@elte.hu>

Actually, to the contrary, I actually think Christoph responds to
every problem I've ever reported to him about his per-cpu counters
work and SLUB much better than most people who call themselves
"maintainers" around here.

And I say that without any reservations.

He doesn't deserve the ad-hominem attacks he is getting today, because
he does resolve every problem reported to him.

The guy wrote test cases, he analyzed every problem, he wrote test
patches, and he doesn't stop doing any of that until the issue really
is reported as resolved by the testers.

I'll take Christoph as the implementor and maintainer of anything, any
day of the week. He rocks.
--

To: David Miller <davem@...>
Cc: <clameter@...>, <a.p.zijlstra@...>, <torvalds@...>, <rostedt@...>, <linux-kernel@...>, <akpm@...>, <hch@...>, <rjw@...>
Date: Saturday, December 22, 2007 - 5:45 am

well, maybe i got unlucky, this hackbench thing being the first time i'm
exposed to a major SLUB regression. The hackbench problem was dead easy
to reproduce, i (and others) offered immediate testing of whatever test
patches, it also matched the profiles of the TPC-C regression but still
i was only offered explanations about why this workload does not matter
and how others suck because they are unable to give immediate test
feedback from millions-of-dollars test equipment that is barely able to
run our devel kernels. The regression is fixed now and i'm a happy
camper!

Christoph, i'd like to apologize for all overly harsh words i said. (and
i said quite a few :-/ )

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: David Miller <davem@...>, <a.p.zijlstra@...>, <torvalds@...>, <rostedt@...>, <linux-kernel@...>, <akpm@...>, <hch@...>, <rjw@...>
Date: Monday, December 24, 2007 - 3:24 pm

Looks like a bad interaction due to overload, vacation, flu epidemic
hitting me and family and also Christmas coming up so that I could not do
my usual workload. Need to find some way to cut back on the stuff that I
am involved with it seems. Thanks for the SLUB patches.

--

To: Christoph Lameter <clameter@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 8:26 am

and this is not the only regression:

http://lkml.org/lkml/2007/10/4/290

_6%_ TPC-C regression. That's _a lot_ in TPC-C terms.

and just like in this case there were very clear profiles posted. I
proffer, reading back the whole thread, that if you fix hackbench you
have fixed TPC-C as well.

So i believe you should either send some sensible fixes _NOW_, or admit
that the "no queues" NIH nonsense of SLUB doesnt work and do an edible,
incremental patchset against SLAB to bring in the debuggability features
of SLUB without killing SLAB's performance. (And fix the NUMA alien
cache problem along the lines suggested before - perhaps initially by
making 'noaliencache' the default bootup option.) And we obviously must
revert the default in 2.6.24 to SLAB as well.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 12:26 pm

There are patches pending to address these issues. AFAICT Intel is
testing if the regression is still there. There is no way for me to verify what
is going on there and there is the constant difficulty of getting
detailed information about what is going on at Intel. Every couple of
month I get a result from that test. Its a really crappy situation where a

The fixes to the alien that you proposed do not work since we would still
have to check for the need to put the object into alien cache. The checks
for the locality of each object are the main cause of trouble when freeing
in SLAB.

SLUB is generally performance wise superior to SLAB as demonstrated by my
measurements that you reviewed too. In addition SLUB has many improvements
over SLAB. Its not only the runtime debuggability which would be difficult
to add since there would be numerous runtime checks in critical code paths
that would cause regressions.

--

To: Christoph Lameter <clameter@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 12:33 pm

of course there is a way to find out, and that's why i mailed you: fix
the hackbench regression and i'm quite sure you'll improve the TPC-C
numbers as well. It shows the same kind of overhead in the profile and
takes just a few seconds to run. Are your pending SLUB patches in
2.6.24-rc5-mm1 already?

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>, Rafael J. Wysocki <rjw@...>
Date: Friday, December 21, 2007 - 5:56 pm

The tests that I wrote emulate the test behavior that was described to me
by me.

The fixes in 2.6.24-rc5-mm1 improved those numbers. See
http://lkml.org/lkml/2007/10/27/245 which I quoted earlier to you.
However, I have no TPC-C setup here and from what I hear it takes weeks to
run and requires a large support team for tuning.

You can find the slab test suite for that at

http://git.kernel.org/?p=linux/kernel/git/christoph/vm.git;a=shortlog;h=...

AFAICT the fixes in 2.6.25-rc5-mm1 result in double the alloc performance
(fastpath) of SLAB.

There are fixes that are not merged yet (the cpu alloc patchset) that
seem to make that factor 3 because we can use the segment register to
avoid per cpu array lookups in the fast path.
--

To: Christoph Lameter <clameter@...>
Cc: Ingo Molnar <mingo@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Friday, December 14, 2007 - 2:15 am

Christoph,

Welcome back from your vacation! :-)

I booted back to slub with slub_min_order=9 on the 64way and you are
indeed right about this.

# tests/hackbench 90
Time: 2.109
# chrt -f 10 tests/hackbench 90
Time: 3.775
# tests/hackbench 90
Time: 1.583
# chrt -f 10 tests/hackbench 90
Time: 1.833
# tests/hackbench 90
Time: 1.601
# chrt -f 10 tests/hackbench 90
Time: 3.078
# tests/hackbench 90
Time: 1.321
# chrt -f 10 tests/hackbench 90
Time: 1.933

A flux between 1.321 and 3.775 (*)(the rt runs were higher). But still,
good job at narrowing this down. Now the hard part. Figuring out a way
that will be nice to all users.

-- Steve

(*) The previous runs (without the command line option) were 13 seconds or
so.
--

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, Christoph Lameter <clameter@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Friday, December 7, 2007 - 1:59 pm

Can you do one run with oprofile, and see exactly where the cost is? It
should hopefully be pretty darn obvious, considering your timing.

Linus
--

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, Christoph Lameter <clameter@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Friday, December 7, 2007 - 2:36 pm

So I checked hackbench on my machine with oprofile, and while I'm not
seeing anything that says "ten times slower", I do see it hitting the slow
path all the time, and __slab_alloc() is 15% of the profile.

With __slab_free() being another 8-9%.

I assume that with many CPU's, those get horrendously worse, with cache
trashing.

The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but
that may not be the one that causes problems in a 64-cpu setup, so it
would be good to have that verified.

Anyway, it's unclear whether the reason it goes into the slow-path because
the freelist is just always empty, or whether it hits the

... || !node_match(c, node)

case which can trigger on NUMA. That's another potential explanation of
why you'd see such a *huge* slowdown (ie if the whole node-match thing
doesn't work out, slub just never gets the fast-case at all). That said,
the number of places that actually pass a specific node to slub is very
limited, so I suspect it's not the node matching. But just disabling that
test in slab_alloc() might be one thing to test.

[ The whole node match thing is likely totally bogus. I suspect we pay
*more* in trying to match nodes than we'd ever likely pay in just
returning the wrong node for an allocation, but that's neither here nor
there ]

But yeah, I'm not entirely sure SLUB is working out.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Steven Rostedt <rostedt@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Thursday, December 13, 2007 - 6:11 pm

Node match is necessary in order to make the allocator able to get memory
for the node that the caller requested.
--

To: Ingo Molnar <mingo@...>
Cc: LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Christoph Lameter <clameter@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Saturday, December 8, 2007 - 8:28 pm

Ingo,

Due to email problems, I never got your reply on this thread. But I saw
your reply when reading this thread on lkml.org (nor did I receive Linus's
first reply).

Anyway, I booted 2.6.24-rc4-mm1 with the same config, and accepted the
defaults to the new config options as the git version I checked. And it
gave me an average of 12.5 second runs.

Here's the oprofile for it:

http://people.redhat.com/srostedt/slub/results/slub-mm.op

-- Steve

--

To: Linus Torvalds <torvalds@...>
Cc: LKML <linux-kernel@...>, Christoph Lameter <clameter@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Saturday, December 8, 2007 - 6:16 pm

The results are here:

http://people.redhat.com/srostedt/slub/results/slab.op

I did two things. First I tried making node_match always return true,
the second was just commenting out the above check. They both had pretty
much the exact same results. It slowed down by double! Instead of taking
15 seconds to complete, both took 30 seconds to complete.

Here's the oprofiles of those runs.

node_match return 1:
http://people.redhat.com/srostedt/slub/results/slub-nonode.op

comment out node_match check:

I don't think it's bogus by the result that removing it slowes it down
by half.

I'm wondering if there's not some other cache thrashing happening
somewhere else in the slub code. I'll start adding some stats in the code
to see where the congestion might lie. I'll look into this on Monday.

Seems that both SLAB and SLUB run kernel compiles the same. Here's the
results of my compile test. (make -j256 and chrt -f 10 make -j256)

http://people.redhat.com/srostedt/slub/

-- Steve

--

To: Steven Rostedt <rostedt@...>
Cc: Linus Torvalds <torvalds@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Monday, December 10, 2007 - 3:38 am

Hm, you seem to be hitting the "another_slab" stuff in __slab_alloc
alot. I wonder if !node_match triggers too often. We always start with
the per cpu slab, if that one is on the wrong node, you'll always hit
that "another_slab" path.

After searching for way too long (given that I have no clue about that
stuff anyway and just read the code out of curiousness), I noticed that
the the cpu_to_node stuff on x86_64 seems to be initialized to 0xff
(arch/x86/mm/numa_64.c), and Google brought me this dmesg output [1],
which, AFAICT, shows that the per cpu slab setup is done _before_
cpu_to_node is correctly setup. That would lead to the per cpu slabs all
having node == 0xff, which looks pretty bad.

Disclaimer: I read the slub/numa/$WHATEVER_I_SAW_THERE for the first
time, so this might be total bull ;-)

Björn

[1] http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-10/msg04648.html
--

To: Björn <B.Steinbrink@...>
Cc: Linus Torvalds <torvalds@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Monday, December 10, 2007 - 10:00 am

Well, I commented out the node_match part and it got 100% worse. It took

I didn't check to see if the internal set up of the node is correct

Well I'm a newbie on NUMA stuff too. I just got lucky enough to be able to
reserve this box ;-)

-- Steve

--

To: Linus Torvalds <torvalds@...>
Cc: LKML <linux-kernel@...>, Christoph Lameter <clameter@...>, Ingo Molnar <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Friday, December 7, 2007 - 2:58 pm

[ Note: I'm currently having problems with my email:
http://www.domaindirect.com/network.html

I kicked off my kernel tests which will take a few hours.
I run "make -j256" (4*nr_CPUS) 10 times at SCHED_OTHER and 10 times
at SCHED_FIFO (chrt -f 10) for both SLUB and SLAB.

This is automated, so I need to wait for it to finish before I can

I'll try that after the oprofile.

Thanks,

-- Steve

--

To: Steven Rostedt <rostedt@...>
Cc: LKML <linux-kernel@...>, Christoph Lameter <clameter@...>, Linus Torvalds <torvalds@...>, Peter Zijlstra <a.p.zijlstra@...>, Christoph Hellwig <hch@...>
Date: Friday, December 7, 2007 - 1:55 pm

ouch ...

it might make sense to check the SLUB patches from -mm ontop of vanilla
as well - while that might be of little comfort as far as 2.6.24 goes,
it at least gives us the info whether this is something that has been
fixed in the latest SLUB code.

Ingo
--

Previous thread: [PATCH 01/13] bring driver version number to 47 by Ed L. Cashin on Friday, December 7, 2007 - 1:47 pm. (13 messages)

Next thread: [git pull] scheduler fixes by Ingo Molnar on Friday, December 7, 2007 - 2:21 pm. (1 message)