Slab defragmentation introduces new functionality not supported by SLAB and
SLOB. Make slab depend on EXPERIMENTAL and note its obsoleteness and that
various functionality is not supported by SLAB. Also update SLOB's description
a bit to indicate that certain OS support is limited by design.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 init/Kconfig |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig	2008-05-09 18:41:41.000000000 -0700
+++ linux-2.6/init/Kconfig	2008-05-09 18:46:13.000000000 -0700
@@ -749,12 +749,16 @@ choice
 	  This option allows to select a slab allocator.
 
 config SLAB
-	bool "SLAB"
+	bool "SLAB (Obsolete)"
+	depends on EXPERIMENTAL
 	help
-	  The regular slab allocator that is established and known to work
-	  well in all environments. It organizes cache hot objects in
-	  per cpu and per node queues. SLAB is the default choice for
-	  a slab allocator.
+	  The old slab allocator that is being replaced by SLUB.
+	  SLAB does not support slab defragmentation and has limited
+	  debugging support. There is no sysfs support for /sys/kernel/slab.
+	  SLAB requires order 1 allocations for some caches which may under
+	  extreme circumstances fail. New general object debugging methods
+	  (such as kmemcheck) do not support SLAB. The code is complex,
+	  difficult to comprehend and has a history of subtle bugs.
 
 config SLUB
 	bool "SLUB (Unqueued Allocator)"
@@ -771,7 +775,10 @@ config SLOB
 	help
 	   SLOB replaces the stock allocator with a drastically simpler
 	   allocator. SLOB is generally more space efficient but
-	   does not perform as well on large systems.
+	   does not perform as well on large systems. SLOBs functionality
+	   is limited by design (no sysfs support, no defrag, no debugging
+	   etc).
+
 
 endchoice
 
--
--
What about the TPC performance regressions? My understanding was that slub still performed worse in the "object allocated on one CPU, freed on the other CPU" type workloads due to less batching. -Andi --
On Sat, 10 May 2008 11:53:30 +0200 Which can be the majority of object allocations and frees in some workloads. It definitely wants fixing. -- All rights reversed. --
Agreed, that situation happens very frequently, IMHO. --
On Mon, May 12, 2008 at 10:38 AM, KOSAKI Motohiro Christoph fixed a tbench regression that was in the same ballpark as the TPC regression reported by Matthew which is why we've asked the Intel folks to re-test. But yeah, we're working on it. --
I suspect that the TPC regression was due to the page allocator order 0 inefficiencies like the tbench regression but we have no data yet to establish that.

Fundamentally there is no way to avoid complex queueing on free() unless one directly frees the object. This is serialized in SLUB by taking a page lock. If we can establish that the object is from the current cpu slab then no lock is taken because the slab is reserved for the current processor. So the bad case is a free of an object with a long life span or an object freed on a remote processor. However, the "slow" case in SLUB is still much less complex than comparable processing in SLAB. It is quite fast.

SLAB freeing can avoid taking a lock if

1. We can establish that the object is node local (trivial if !NUMA otherwise we need to get the node information from the page struct and compare to the current node).

2. There is space in the per cpu queue.

If the object is *not* node local then we have to take an alien lock for the remote node in order to put the object in an alien queue. That is much less efficient than the SLUB case. SLAB then needs to run the cache reaper to expire these objects into the remote node's queues (later the cache reaper may then actually free these objects). This management overhead does not exist in SLUB. The cache reaper causes processors to not be available for short time frames (the reaper scans through all slab caches!) which in turn causes regressions in applications that need to respond in a short time frame (HPC applications, network applications that are timing critical).

Note that the lock granularity in SLUB is finer than the locks in SLAB. SLUB can concurrently free multiple objects to the same remote node etc etc. If the objects belong to different slabs then there is no dirtying of any shared cachelines.

The main issue for SLAB vs. SLUB on free is likely the !NUMA case in which SLAB can avoid the overhead of the node check (which does not exist in SLUB) and i...
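The free-path rules described above can be condensed into a small decision model. This is only a sketch for following the argument: the enum, the struct and both function names are made up for illustration and are not kernel types or kernel functions. The "queue full" outcome for SLAB is an assumption taken from the later remark in this thread that SLAB merges objects into slabs synchronously once the queue limits are hit.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_free_cost {
    TOY_FREE_NO_LOCK,     /* lockless fast path */
    TOY_FREE_PAGE_LOCK,   /* SLUB: per slab page lock */
    TOY_FREE_ALIEN_LOCK,  /* SLAB: alien queue lock of the remote node */
    TOY_FREE_QUEUE_FLUSH  /* SLAB: per cpu queue full, objects merged into slabs */
};

struct toy_free_ctx {
    bool numa;                /* kernel built with NUMA support */
    int  object_node;         /* node the object's page belongs to */
    int  current_node;        /* node of the freeing processor */
    bool in_current_cpu_slab; /* SLUB: object is in this cpu's active slab */
    bool cpu_queue_has_space; /* SLAB: room left in the per cpu queue */
};

/* SLUB: no lock if the object is in the current cpu slab, else a page lock. */
enum toy_free_cost toy_slub_free_cost(const struct toy_free_ctx *c)
{
    return c->in_current_cpu_slab ? TOY_FREE_NO_LOCK : TOY_FREE_PAGE_LOCK;
}

/* SLAB: lockless only if the object is node local AND the queue has space. */
enum toy_free_cost toy_slab_free_cost(const struct toy_free_ctx *c)
{
    /* Without NUMA every object is node local, so the check is trivial. */
    bool node_local = !c->numa || c->object_node == c->current_node;

    if (!node_local)
        return TOY_FREE_ALIEN_LOCK;
    if (c->cpu_queue_has_space)
        return TOY_FREE_NO_LOCK;
    return TOY_FREE_QUEUE_FLUSH;
}
```

Feeding the same remote free into both functions gives a page lock for SLUB and an alien lock for SLAB, which is the comparison the message is making. The model says nothing about how expensive each outcome is in cycles or cacheline traffic.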
iirc profiling analysis showed that the problem was the page lock serialization (in particular the slab_lock() in __slab_free). That Ignoring NUMA is no option unfortunately. And with integrated memory I think the problem is that this atomic operation thrashes cache lines around. Really counting cycles on instructions is not that interesting, but minimizing the cache thrashing is. And for that it looks like slub What is the big problem of having a batched free queue? If the expiry is done at a good bounded time (e.g. on interrupt exit or similar) locally on the CPU it shouldn't be a big issue, should it? -Andi --
The issue of object expiration holdoffs also affects applications running on pure SMP systems. --
Some more on SMP scaling: There is also the issue in SLAB that global locks (SMP case) need to be taken for a pretty long timeframe. With a sufficiently high allocation frequency from multiple processors you can cause lock contention on the list_lock that will then degrade performance. SLUB does not take global locks for continued allocations. Global locks are taken for a short time frame if the partial lists need to be updated (which is avoided as much as possible with various measures). This can yield orders of magnitude higher performance. The above is possible because of locking at the page level. A queue must either be processor specific or global (or per node) and then would need locks. Another issue is storage density. SLAB needs a metadata structure that either is placed in the slab page itself or in a separate slab cache. In some cases this is advantageous over SLUB (f.e. a series of pointers to objects exist in a single cacheline thus allocation of objects that are not immediately used could be faster) in others it is not (because it increases cache footprint, requires the touching of two slabcaches if the metadata is off slab, makes alignment of objects in the slab pages difficult and increases memory overhead). SLUB generally is faster if the object is/was immediately used since the freepointer overlays the data and thus the cacheline is hot both on alloc and free. --
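To make "the freepointer overlays the data" concrete, here is a minimal userspace sketch of a freelist that lives inside the free objects themselves, so no separate metadata array is needed. The struct, the sizes and the function names are illustrative assumptions, not the SLUB implementation, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_OBJ_SIZE 64   /* illustrative object size */
#define TOY_NR_OBJS  4    /* illustrative objects per slab */

struct toy_slab {
    void *freelist;                                /* first free object or NULL */
    unsigned char mem[TOY_OBJ_SIZE * TOY_NR_OBJS]; /* the objects themselves */
};

/* Store the "next free" pointer in the first bytes of a free object. */
static void toy_set_next(void *obj, void *next)
{
    memcpy(obj, &next, sizeof(next));
}

static void *toy_get_next(const void *obj)
{
    void *next;
    memcpy(&next, obj, sizeof(next));
    return next;
}

/* Chain all objects so that object 0 ends up at the head of the freelist. */
void toy_slab_init(struct toy_slab *s)
{
    s->freelist = NULL;
    for (int i = TOY_NR_OBJS - 1; i >= 0; i--) {
        void *obj = s->mem + (size_t)i * TOY_OBJ_SIZE;
        toy_set_next(obj, s->freelist);
        s->freelist = obj;
    }
}

/* Pop the head; reading the link touches the object the caller gets back. */
void *toy_slab_alloc(struct toy_slab *s)
{
    void *obj = s->freelist;
    if (obj)
        s->freelist = toy_get_next(obj);
    return obj;
}

/* Push the object; writing the link touches the object just released. */
void toy_slab_free(struct toy_slab *s, void *obj)
{
    toy_set_next(obj, s->freelist);
    s->freelist = obj;
}
```

Because the link is written into the object on free and read from it on alloc, the most recently freed object is the next one handed out, and its first cacheline has just been touched in both directions. That is the "hot on alloc and free" property the message refers to.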
These are all great theories, and you mentioned that you'd fixed the regressions with tbench, but did you fix the regression with the io-gen program I sent you? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
No. I thought you were satisfied with the performance increase you saw when pinning the process to a single processor? --
Er, no. That program emulates a TPC-C run from the point of view of doing as much IO as possible from all CPUs. Pinning the process to one CPU would miss the point somewhat. I seem to remember telling you that you might get more realistic performance numbers by pinning the scsi_ram_0 kernel thread to a single CPU (ie emulating an interrupt tied to one CPU rather than letting the scheduler choose to run the thread on the 'best' CPU). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Oh. The last message I got was an enthusiastic report on the performance gains you saw by pinning the process after we looked at slub statistics that showed that the behavior of the tests was different from your expectations. I got messages here that indicate that this was a scsi testing program that you had under development. And yes we saw the remote If this is a stand in for the TPC then why did you not point that out when Pekka and I recently asked you to retest some configurations? --
Note the complete lack of comparison between slub and slab here! As far as I know, slub still loses against slab by a few % -- but I haven't I thought you'd already run this test and were asking for the results of this to be validated against a real TPC run. I'm rather annoyed by this. You demand a test-case to reproduce the problem and then when I come up with one, you ignore it! -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
This indicated to me that you were still developing a test here and Indeed remote frees are slightly slower in some situations. Don't really dispute that. I am just not sure that the TPC test is really suffering from that symptom. I thought for a long time that the tbench regression Ignore it? That is a pretty strange statement given that I helped you analyze the behavior of your test and understand what was going on in the system. --
Since there's no way we've found to date to get the TPC test to you, how about we settle for analysing _this_ testcase which did show a significant performance degradation for slub? I don't think it's an unreasonable testcase either -- effectively it's allocating memory on all CPUs and then freeing it all on one. If that's a worst-case scenario for slub, then slub isn't suitable for replacing slab yet. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Could I get the latest version of the test? Or was the March 31st version the latest? No later changes? --
Indeed that is a worst case scenario due to finer grained locking. The opposite side of that is that fast concurrent freeing of objects from two processors will have higher performance in slub since there is significantly less global lock contention and less work with expiring objects and moving them around (if you hit the queue limits then SLAB will do synchronous merging of objects into slabs, it's then no longer able to hide the object handling overhead in cache_reap().) --
It can thrash cachelines if objects from the same slab page are freed simultaneously on multiple processors. That occurred in the hackbench regression that we addressed with the dynamic configuration of slab sizes. However, typically long lived objects freed from multiple processors Interrupt exit in general would have to inspect the per cpu structures of all slab caches on the system? --
hackbench regression is because of slow allocation instead of slow freeing. With dynamic configuration of slab sizes, fast allocation becomes 97% (the bad one is 68%), but fast free is always 8~9% with/without the patch. --
Thanks for using the slab statistics. I wish I had these numbers for the TPC benchmark. That would allow us to understand what is going on while it is running. The frees in the hackbench were slow because partial list updates occurred too frequently. The first fix was to let slabs sit longer on the partial list. The other was the increase of the slab sizes which also increases the per cpu slab size and therefore the objects allocatable without a round trip to the page allocator. Freeing to a per cpu slab never requires partial list updates. So the frees also benefitted from the larger slab sizes. But the effect shows up in the count of partial list updates not in the fast/free column.
I agree. It might be better if SLUB could be optimized again to have more consideration when the slow free percentage is high, because the page lock might ping-pong among processors if multiple processors access the same slab at the same time. --
Hang on, you want slab statistics for the TPC run? You didn't tell me that. We're trying to gather oprofile data (and having trouble because the machine crashes when we start using oprofile -- this is with the git tree you/pekka put together for us to test). -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Hi, oprofile was recently fixed, maybe cherry-picking these will help: http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=7ded2dcf5f2... http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=08bc5caced1... http://git.kernel.org/?p=linux/kernel/git/smurf/linux-trees.git;a=commit;h=2b56af59ed2... Vegard -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
Hum, you might try to apply commit 44c81433e8b05dbc85985d939046f10f95901184 or commit 8b8b498836942c0c855333d357d121c0adeefbd9 oprofile data are definitely wanted :) --
Well we talked about this when you sent me the test program. I just thought that it would be logical to do the same for the real case. Details of the crash please? You could just start with 2.6.25.X which already contains the slab statistics. Also re: the test program since pinning a process does increase the performance by orders of magnitude. Are you sure that the application was properly tuned for an 8p configuration? Pinning is usually not necessary for lower numbers of processors because the scheduler thrashing effect is less of an issue. If the test program is an accurate representation of the TP-C benchmark then you can drastically increase its performance by doing the same to the real test. --
You ran the test ... you didn't say "It would be helpful if you could Certainly. Exactly how does collecting these stats work? Am I supposed to zero the counters after the TPC has done its initial ramp-up? What The application does nothing except submit IO and wait for it to complete. It doesn't need to be tuned. It's not an accurate representation of TPC-C, it just simulates the amount of IO that a TPC-C run will generate (and simulates it coming from all CPUs, which is accurate). I don't want to get into details of how a TPC benchmark is tuned, because it's not relevant. Trust me, there are people who dedicate months of their lives per year to tuning how TPC runs are scheduled. The pinning I was talking about was pinning the scsi_ram_0 kernel thread to one CPU to simulate interrupts being tied to one CPU. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Well the amount of information we are getting has always been the main Compile slabinfo and then do f.e. slabinfo -AD (this is documented in the help text provided when enabling statistics). --
Or possibly your assumptions have been the main factor. I gave you a
reproducer for this problem 6 weeks ago. As far as I can tell, you
That's an utterly unhelpful answer. Let me try asking again.
Exactly how does collecting these stats work? Am I supposed to zero
the counters after the TPC has done its initial ramp-up? What commands
should I run, and at exactly which points?
Otherwise I'll get something wrong and these numbers will be useless to
you. Or that's what you'll claim anyway.
For reference the helptext says:
SLUB statistics are useful to debug SLUBs allocation behavior in
order find ways to optimize the allocator. This should never be
enabled for production use since keeping statistics slows down
the allocator by a few percentage points. The slabinfo command
supports the determination of the most active slabs to figure
out which slabs are relevant to a particular load.
Try running: slabinfo -DA
By the way, when you say 'compile slabinfo', you mean the file shipped
as Documentation/vm/slabinfo.c (rather than, say, something out of tree?)
--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
Assumptions may be the issue. My own "reproducer" for remote frees is available from my git tree and I usually prefer to run my own. We discussed the results of that program last fall. You stated yesterday that your code is proprietary. I am not sure what I am allowed to do with the code. I did not know that it was proprietary before yesterday and I would have just forwarded that code to Pekka yesterday if I would not have caught that message in time. I thought that what you provided was a test program to exercise There is no way of zeroing the counters. Run slabinfo -AD after the test application has been running for awhile. If you want a differential No. I guess I will end up with a lot of guess work of what is going on on Yes. --
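Since the counters cannot be zeroed, one way to look at only the steady-state part of a run is to record them twice and subtract. The sketch below assumes the four fast/slow path counts have already been read into a struct by some means. The struct layout and the function names are made up for illustration; this is not how slabinfo itself works and it is not a procedure the thread spells out.

```c
#include <assert.h>

/* Illustrative snapshot of four per-cache counters (assumed layout). */
struct toy_slub_stats {
    unsigned long alloc_fastpath;
    unsigned long alloc_slowpath;
    unsigned long free_fastpath;
    unsigned long free_slowpath;
};

/* Events that happened between two snapshots of ever-increasing counters. */
struct toy_slub_stats toy_stats_delta(const struct toy_slub_stats *before,
                                      const struct toy_slub_stats *after)
{
    struct toy_slub_stats d;

    d.alloc_fastpath = after->alloc_fastpath - before->alloc_fastpath;
    d.alloc_slowpath = after->alloc_slowpath - before->alloc_slowpath;
    d.free_fastpath  = after->free_fastpath  - before->free_fastpath;
    d.free_slowpath  = after->free_slowpath  - before->free_slowpath;
    return d;
}

/* Share of operations that took the fast path, in percent. */
double toy_fast_percent(unsigned long fast, unsigned long slow)
{
    unsigned long total = fast + slow;

    return total ? 100.0 * (double)fast / (double)total : 0.0;
}
```

With a snapshot taken after ramp-up and another at the end of the run, toy_fast_percent() on the delta gives figures of the kind quoted for hackbench earlier in the thread (fast allocation percentage, fast free percentage), restricted to the interval of interest.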
No doubt you prefer to run a test which fails to show a problem with I'm surprised you're so cavalier about copyright. There was nothing in Why would you think that? The subject of the email was "Slub test They're your statistics. Tell me what you need. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
The test was designed to show the worst case effect of the additional The output of slabinfo -AD.... --
This is rather interesting. Since Christoph refuses to, here's my results with 8f40f67, first with slab: willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4 CPU 0 completed 1000000 ops in 52.817 seconds; 18933 ops per second CPU 2 completed 1000000 ops in 56.391 seconds; 17733 ops per second CPU 3 completed 1000000 ops in 57.009 seconds; 17541 ops per second CPU 1 completed 1000000 ops in 57.591 seconds; 17363 ops per second willy@piggy:~$ sudo taskset -p 1 941 pid 941's current affinity mask: f pid 941's new affinity mask: 1 willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4 CPU 2 completed 1000000 ops in 46.740 seconds; 21394 ops per second CPU 0 completed 1000000 ops in 48.716 seconds; 20527 ops per second CPU 3 completed 1000000 ops in 59.255 seconds; 16876 ops per second CPU 1 completed 1000000 ops in 60.473 seconds; 16536 ops per second (the pid is that of scsi_ram_0) Now, change the config to slub: --- 64-slab/.config 2008-05-15 15:21:31.000000000 -0400 +++ 64-slub/.config 2008-05-15 15:37:45.000000000 -0400 -# Thu May 15 15:21:31 2008 +# Thu May 15 15:37:45 2008 -CONFIG_SLAB=y -# CONFIG_SLUB is not set +CONFIG_SLUB_DEBUG=y +# CONFIG_SLAB is not set +CONFIG_SLUB=y -# CONFIG_DEBUG_SLAB is not set +# CONFIG_SLUB_DEBUG_ON is not set +# CONFIG_SLUB_STATS is not set and we get slightly better results: willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4 CPU 0 completed 1000000 ops in 45.848 seconds; 21811 ops per second CPU 2 completed 1000000 ops in 50.789 seconds; 19689 ops per second CPU 3 completed 1000000 ops in 55.876 seconds; 17896 ops per second CPU 1 completed 1000000 ops in 56.941 seconds; 17562 ops per second willy@piggy:~$ sudo taskset -p 1 1001 pid 1001's current affinity mask: f pid 1001's new affinity mask: 1 willy@piggy:~$ sudo ./io-gen -d /dev/sda -j4 CPU 2 completed 1000000 ops in 45.713 seconds; 21875 ops per second CPU 0 completed 1000000 ops in 47.020 seconds; 21267 ops per second CPU 3 completed 1000000 ops in 58.692 seconds; 17038 ops per second ...
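The per-CPU rates printed by io-gen are just operations divided by elapsed seconds, so the runs can be compared by summing them. The helpers below are a hypothetical sketch for doing that arithmetic on the numbers quoted above; they are not part of io-gen.

```c
#include <assert.h>
#include <stddef.h>

/* Rate for one CPU: completed operations divided by elapsed seconds. */
double toy_ops_per_second(unsigned long ops, double seconds)
{
    assert(seconds > 0.0);
    return (double)ops / seconds;
}

/* Aggregate rate over all CPUs of one run. */
double toy_total_rate(const double *rates, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += rates[i];
    return sum;
}
```

For the unpinned runs shown, the four slab rates add up to 71570 ops per second and the four slub rates to 76958, roughly 7.5% more in aggregate. This is a single run per configuration, so it says nothing about run-to-run variance.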
I sure wish you would follow the discussions instead of having paranoid thoughts about me not running tests that show regressions. See my Hmmm... Interesting. Could you post the output of slabinfo -AD? --
Slabinfo -A -r before and after the run would be nice (you need CONFIG_SLUB_STATS and Documentation/vm/slabinfo.c for the latter). A separate oprofile would be nice as well. Thanks! Pekka --
Why's that? When we're not under pressure (fast path), we can delay (and batch) remote frees. When we are under pressure (slow path), we can do everything immediately. -- Mathematics is the supreme nostalgia of our time. --
Fastpath is what? I guess slow path means we called into the page allocator? --
No, slow path here means we're already under memory pressure, so we don't care if something takes longer if it saves memory. -- Mathematics is the supreme nostalgia of our time. --
So expire the queues under memory pressure only? Trigger queue cleanup from reclaim? --
This wouldn't be my first thought. The batch size could be potentially huge and we'd have to worry about latency issues. But here are some other thoughts: First, we should obviously always expire all queues when we hit low water marks as it'll be cheaper/faster than other forms of reclaim. Second, if our queues were per-slab (this might be hard, I realize), we can sweep the queue at alloc time. We can also sweep before falling back to the page allocator. That should guarantee that delayed frees don't negatively impact fragmentation. And lastly, we can always have a periodic thread/timer/workqueue operation. So far this is a bunch of hand-waving but I think this ends up basically being an anti-magazine. A magazine puts a per-cpu queue on the alloc side which costs on both the alloc and free side, regardless of whether the workload demands it. This puts a per-cpu queue on the free side that we can bypass in the cache-friendly case. I think that's a step in the right direction. -- Mathematics is the supreme nostalgia of our time. --
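The free-side queue proposed above could look roughly like this. It is a hand-waving sketch in the same spirit as the proposal: the queue size, the struct and the function names are invented, the locked slow path is replaced by a counter, and NUMA placement is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DEFER_MAX 8   /* illustrative batch limit */

/* One such queue per cpu would hold remote frees not yet processed. */
struct toy_deferred_frees {
    void  *objs[TOY_DEFER_MAX];
    size_t count;    /* objects currently waiting in the queue */
    size_t flushed;  /* objects already handed to the slow path */
};

/* Stand-in for taking the lock once and freeing the whole batch. */
static void toy_flush(struct toy_deferred_frees *q)
{
    q->flushed += q->count;
    q->count = 0;
}

/*
 * Returns true if the free went through the deferred queue.
 * Cache-friendly (local) frees bypass the queue entirely.
 * Under memory pressure the queue is emptied immediately.
 */
bool toy_deferred_free(struct toy_deferred_frees *q, void *obj,
                       bool local, bool under_pressure)
{
    if (local)
        return false;

    q->objs[q->count++] = obj;
    if (under_pressure || q->count == TOY_DEFER_MAX)
        toy_flush(q);
    return true;
}
```

The point of the design is that the queue only costs something on the remote-free side. The local case never touches it, which is what distinguishes it from a magazine on the alloc side.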
Hmmm... I tried a scheme like that awhile back but it did not improve performance. The cost of queuing the object degraded the fast path (Note SLUB object queuing is fundamentally different due to no in slab In that case we dirty the same cacheline that we also need to take the page lock. Wonder if there would be any difference? The freelist is That would introduce additional complexity for the NUMA case because now we would need to distinguish between the nodes that these objects came from. So we would have to scan the queue and classify the objects? Or determine the object node when queueing them and put them into a remote node queue? Sounds similar to all the trouble that we ended up with I have had enough trouble in the last years with the 2 second hiccups that come with SLAB and that affect timing sensitive operations between processors in a SMP configuration and also cause trouble for applications I think if you want queues for an SMP only system, do not care too much about memory use, don't do any frequent allocations on multicore systems and can tolerate the hiccups because your application does not care (most enterprise apps are constructed that way) or if you are running benchmarks that only access a limited dataset that fits into SLAB's queues and avoid touching the contents of objects then the SLAB concept is the right way to go. If we would strip the NUMA stuff out and make it an SMP only allocator for enterprise apps then the code may become much smaller and simpler. I guess Arjan suggested something similar in the past. But that would result in SLAB no longer being a general allocator. --
What does this have to do with anything? I'm not talking about going back to SLAB. I'm talking about plugging the use cases where SLUB currently loses to SLAB. That's what has to happen before SLAB can be obsoleted. I'll certainly grant you that queueing might not break even. -- Mathematics is the supreme nostalgia of our time. --
Both allocators have a different design which leads to different behavior. I do not think the expectation that one must always best the other is reasonable or even possible. I'd be glad if we had some means of increasing the performance in the currently known cases where remote slab free becomes an issue by avoiding the atomic op. AFAICT we so far have been able to compensate for the additional atomic op with a reduced cache footprint and less complexity overall on remote frees and also through improvements in alloc behavior. I hope that the current improvements in 2.6.26 are sufficient to address the concerns with TP-C (which I do not have direct access to and frankly I know very little about the setup etc). We are still not sure exactly why TP-C has a problem. The slab statistics were added to figure that one out. We can get a view of what is going on without having access to the system. I think the current way of compensating for that atomic op is better than getting back to the queue mess. Maybe there is a way of limited use of queues that avoids the atomic op but so far I have not found one. Maybe someone else looking at it will have better ideas. --
What are your plans to fix it? -Andi --
I don't have a reproducible test and my boxes are so tiny the issues probably won't show up anyway. So all I can do at this point is to make sure Matthew et al can easily re-test whenever we fix some other regression that potentially affects his workload. I only recently started tracking this issue so I have no idea where we're at with this. Christoph? Matthew? --
I suspect that this is the same issue as tbench. I have explained the SLAB vs. SLUB free situation in another email in this thread. --
