This message contains a list of some regressions from 2.6.23 which have been reported since 2.6.24-rc1 was released and for which there are no fixes in the mainline that I know of. If any of them have been fixed already, please let me know. If you know of any other unresolved regressions from 2.6.23, please let me know either and I'll add them to the list. Subject : EHCI causes system to resume instantly from S4 Submitter : Maxim Levitsky <maximlevitsky@gmail.com> References : http://lkml.org/lkml/2007/10/27/66 http://bugzilla.kernel.org/show_bug.cgi?id=9258 Handled-By : "Rafael J. Wysocki" <rjw@sisk.pl> David Brownell <david-b@pacbell.net> Alan Stern <stern@rowland.harvard.edu> Patch : Note: : the problem appears to heavily depend on hardware Subject : leds: ledtrig-timer calls sleeping function from invalid context Submitter : Márton Németh <nm127@freemail.hu> References : http://bugzilla.kernel.org/show_bug.cgi?id=9264 Handled-By : Richard Purdie <rpurdie@rpsys.net> Patch : http://bugzilla.kernel.org/attachment.cgi?id=13493&action=view Subject : PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object Submitter : Hans de Bruin <bruinjm@xs4all.nl> References : http://bugzilla.kernel.org/show_bug.cgi?id=9320 Handled-By : Robert Moore <Robert.Moore@intel.com> Tejun Heo <htejun@gmail.com> Fu Michael <michael.fu@intel.com> Patch : Subject : snd_hda_intel 2.6.24-rc2 bug: interrupts don't always work on Lenovo X60s Submitter : Roland Dreier <rdreier@cisco.com> References : http://lkml.org/lkml/2007/11/8/255 http://bugzilla.kernel.org/show_bug.cgi?id=9332 Handled-By : Patch : Subject : system hangs after a few minutes Submitter : Marcus Better <marcus@better.se> References : http://bugzilla.kernel.org/show_bug.cgi?id=9335 Handled-By : Andrew Morton <akpm@linux-foundation.org> Alan Stern <stern@rowland.harvard.edu> Patch : ...
The fix is now in mainline: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dc47206e55... Cheers, Richard --
Yes, already dropped. Thanks, Rafael --
Twenty nine, huh? It would be useful if these records were sorted in date-of-reportage order and had a date stamp so we could see how long they've been hanging about. Something to think about for the post-2.6.24 regression if you'll be handling That patch has been merged (dc47206e552c0850ad11f7e9a1fca0a3c92f5d65) and assuming Márton has tested the latest git snapshot (ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots) successfully we can cross it off? --
They are sorted by the bugzilla number which reflects the date-of-reportage order pretty well. For a techincal reason, it's easier to me if they're sorted like this. Adding date stamps should be easy, tough, I'll try to add them to the next Yes, I'm going to handle the post-2.6.24 regressions too (in the hope there Yes, will drop. Thanks, Rafael --
A number of other people are seeing the same thing and Tejun is putting in a blacklist of machines which cannot use libata+acpi. That patch is not yet in any git tree which I pull. AFACIT the machines kepe working OK - there's just some nasty dmesg spew. If any machines _are_ breaking then this could cause real problems and I'd prefer that we either go for a whitelist or arrange to detect the condition and fall back to non-acpi ata. --
Hi, Does this report now win me the lucky draw, pretty please? ;) STD regression rc1 -> rc234, suspend fails completely, recovering is pretty much useless since HDD is DEAD from this point on anyway. Managed to capture -rc2 suspend logging via still-alive ssh session. 2.6.24-rc1 suspend/resume log, successful (well, a couple seconds delay, most likely due to well-recovered AML failure): swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000 swsusp: Basic memory bitmaps created Syncing filesystems ... done. Freezing user space processes ... (elapsed 0.00 seconds) done. Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done. Shrinking memory... done (0 pages freed) Freed 0 kbytes in 0.02 seconds (0.00 MB/s) Suspending console(s) hub 4-0:1.0: hub_suspend usb usb4: bus suspend ehci_hcd 0000:00:10.3: suspend root hub hub 3-0:1.0: hub_suspend usb usb3: bus suspend usb usb3: suspend_rh hub 2-0:1.0: hub_suspend usb usb2: bus suspend usb usb2: suspend_rh hub 1-0:1.0: hub_suspend usb usb1: bus suspend usb usb1: suspend_rh sd 0:0:0:0: [sda] Synchronizing SCSI cache parport_pc 00:09: disabled serial 00:08: disabled serial 00:07: disabled ACPI: PCI interrupt for device 0000:00:11.5 disabled ACPI handle has no context! ACPI: PCI interrupt for device 0000:00:11.1 disabled ACPI: PCI interrupt for device 0000:00:10.3 disabled ehci_hcd 0000:00:10.3: --> PCI D3/wakeup uhci_hcd 0000:00:10.2: uhci_suspend ACPI: PCI interrupt for device 0000:00:10.2 disabled uhci_hcd 0000:00:10.2: --> PCI D3 uhci_hcd 0000:00:10.1: uhci_suspend ACPI: PCI interrupt for device 0000:00:10.1 disabled uhci_hcd 0000:00:10.1: --> PCI D3 uhci_hcd 0000:00:10.0: uhci_suspend ACPI: PCI interrupt for device 0000:00:10.0 disabled uhci_hcd 0000:00:10.0: --> PCI D3 ACPI: PCI interrupt for device 0000:00:0d.0 disabled ACPI handle has no context! ACPI: PCI interrupt for device 0000:00:0c.0 disabled ACPI handle has no context! pci_set_power_state(): 0000:00:00.0: state=3, current ...
nah, you have to cc the acpi guys to get a prize ;) Len&co, could you please take a look? Andreas, please do separately report that WOL problem too.. --
037f6bb79f753c014bc84bca0de9bf98bb5ab169 ought to have fixed this? -- Matthew Garrett | mjg59@srcf.ucam.org --
Please post full kernel boot log and the result of 'lspci -nn'. Thanks. -- tejun --
Hi, Done, on #9530. Will try some of the promising patches/suggestions now, hopefully this will show me what's up. Will add further results there. Andreas Mohr --
Hello, The pending patchset should make ATA ACPI quite resistant to failures. Known bad boards can be blacklisted (currently only one is on the list), ATA ACPI is disabled quicker if ACPI evalution fails, execution errors are handled better and commands which are intended to help the vendor instead of the user are filtered. So, I think we have enough safety nets. Thanks. -- tejun --
Yeah, I'm hoping so. Maybe command filtering should wait till 2.6.25 but the rest, yeap. -- tejun --
Actually, not gone cold, but I stopped posting about it because it's been solved and I thought agreement had been reached that it should be pushed to mainline before 2.6.25. I am very happily running with Ingo's "snd hda suspend latency: shorten codec read" patch, which was originally intended to speed up resuming from hibernation, but which as I discovered, also has the nice side effect of eliminating the reported error. On 11/23, Takashi replied to my note (http://lkml.org/lkml/2007/11/23/17) and suggested that Jaroslav push this patch to Linus immediately instead of waiting for 2.6.25, since it appearly solves two problems with one stone. However, I just checked, as of Linus's public, and Ingo's patch is *not* in mainline. However, as far as I am concerned, Ingo's patch, first posted to LKML here: http://lkml.org/lkml/2007/11/16/66 should be listed as fixing the above regression. Rafael, could you please make a note of this in your regression list, and could we please get this patch pushed into mainline? Thanks!! - Ted --
ha! I'd never have expected _that_ to happen. Cool. Fixing a driver bug by accident :-) Ingo --
Great, thanks. I should add that technically this wasn't a regression since I had been seeing this since before 2.6.23. Also, it isn't a big deal, since aside from noise in the syslog, falling back to polling more doesn't make any functional or user-visible difference (although I guess it's less efficient). Regardless of whether it is a regression, it would be nice to get the patch applied and and this issue fixed for 2.6.25! - Ted --
At Thu, 13 Dec 2007 11:49:51 +0100, Jaroslav, what about this now? Takashi --
This one we have a confirmed fix from Alan but it doesn't appear to be in anyone's tree. There is a second bug in here, applicable to core x86: Marcus's machine won't boot with nmi_watchdog=1. --
An expanded version of that fix is in Greg's queue: http://marc.info/?l=linux-usb-devel&m=119697043410947&w=2 Since he's away until Tuesday, nothing will happen for a few days. Alan Stern --
Nasty one. Tejun and several diligent reporters are doing sterling work there and things have improved. I don't know whether any of Tejun's patches have been merged yet, but we'll probably be OK on this one. What is unclear (to me) is what actually caused those people's machines to break? --
Hello, (cc'ing Alan) I'm still trying to find out what's really going on. That drive is It's introduced by setting ATAPI transfer chunk size to actual transfer size which is the right thing to do generally. However, with the change, the ATAPI HSM should be ready to drain full extra transfer chunks which libata HSM wasn't doing. With that part changed, most regressions should go away. Unfortunately, simply adding that doesn't fix the case in bug 9346 and I'm still trying to find out why. The good news is that the drive works fine with proposed more extensive improvements to libata ATAPI which will probably be included into 2.6.25, so we at least have long term solution. If we fail to find out the solution in time, we always have the alternative of backing out the ATAPI transfer chunk size update. This will break some other cases which were fixed by the change but those won't be regressions at least and we can add transfer chunk size update with other changes to 2.6.25. Thanks. -- tejun --
Which will break far more controllers and drives than it fixes, so Great, make everyone else wait another three months for a working CD drive. The one off regression appears far less harmful than a revert. Tejun - instead of backing out important updates for 2.6.24 we should just blacklist that specific drive for now and sort it nicely in 2.6.25, not revert stuff and break everyone elses ATAPI devices. Alan --
Hello, Alan. Newly broken ones will be regressions. How many do we fix by the change? On SATA, setting the correct transfer chunk size doesn't seem We'll need to blacklist setting transfer chunk size, eek, and let's leave that as the last resort and hope that we find the solution soon. Blacklist takes time to develop and temporary blacklist for just one release doesn't sound like a good idea. -- tejun --
Regressions are not some kind of grand evil. Better to regress the odd It seems to be sensible to me *if* it is just this one device we are somehow confusing and that one device is holding up fixing everything else. --
We need to put more weight on regressions as it at least makes releases predictable to users. Anyways, I wasn't saying it was some absolute maxim. I was literally asking how many so that we can evaluate the Yeah, if it's this one device, I fully agree. Let's see how debugging turns out. -- tejun --
No. Regressions are worse. It doesn't matter AT ALL if you think that it breaks ten times more devices, if it's a regression and those devices didn't work in the past, they simply DO NOT COUNT. Linus --
Btw, Alan, that "math" is total and utter BULLSH*T, and you should know that. "The one off regression" is likely the tip of an iceberg. If something regresses for one person, for that one person who tested and noticed and made a bug-report, there's probably a thousand people who haven't even tested the development kernel, or who had problems and just went back to the previous version. In contrast, reverting something will be guaranteed to not have those kinds of issues, since the only people who could notice are people for who it never worked in the first place. There's no "silent mass of people" that can be affected. This is why regressions are so important. They don't trump _everything_, but basically ignoring and letting them slide is *much* more painful than just reverting it. The biggest reason to ignore a regression is if nobody can even figure out where it came from, or reverting simply isn't an option for some really deep and fundamental issue. That doesn't seem to be the case here. So we should revert unless there is some known acceptable real fix. Linus --
Alan, you're so full of shit that it's not even funny.
Have you even *read* the thread?
Tejun already reported that this apparently gets fixed _properly_ with the
more extensive cleanups and fixes that are pending for 2.6.25.
In other words, the stuff you call so critically important (yet we've been
able to live without it until now!) is apparently simply NOT YET READY.
It's breaking things.
In this case, Tejun seems to be right on the money. I also agree 100%
with him when he says
"Blacklist takes time to develop and temporary blacklist for just one
release doesn't sound like a good idea."
because if we create some blacklist for that one reported device, not only
is it likely going to be wrong (it's almost never just one firmware or one
chip that has a particular issue), but we tend to create thee blacklists
and later realize that we shouldn't have blacklisted things at all, we
should just have done things differently.
For examples of that, see the NCQ blacklist that was just _us_ doing
things wrong (over-reacting to things we shouldn't care about), and
there's currently another totally unrelated discussion on a very similar
thing wrt libata and the ACPI startup commands for an unused controller
.. and what the hell does that matter? If the code doesn't work, it
doesn't work, and you might as well point to some random scribblings done
by a three-year-old on toilet paper rather than any "specs".
Real life matters more. Regressions matter more.
We apparently do have a full fix, but it seems to be too invasive for
2.6.24, which means that the thing that currently DOES NOT WORK and
causes regressions should be reverted, so that 2.6.24 is at least no worse
than 2.6.23 (and all earlier kernels) in this respect.
And then we should just hope that the more complete fix that Tejun has
doesn't cause any issues on its own. I would suggest that if you care so
deeply about this issue, you press Fedora into putting Tejun's tree into ...Its your kernel. Its your call, and your privilege to be wrong. And anyone with ATAPI problems should probably test the -mm tree before reporting anything. Alan --
> Have you even *read* the thread? In detail, as it unfolds and while testing variants of Tejun's code on the hardware I have access to - none of which has this bug making it And as I keep pointing out but you keep ignoring - not doing it breaks The code without the changes doesn't work either. So pick your toilet Which as the distro bug lists for ATAPI will tell you - aint good. Still Linus, the kernel regresses all over the place every release. If it didn't do that you'd never get any changes in. Your kernel would fossilize like RHEL or SLES and you'd be spending weeks analysing each changeset for possible side effects, or - as happens by neccessity - adding code paths so a fix vital to one driver ceases to share core code with another driver - to reduce regression risk. Been there, done that Have fun. I trust you'll be fixing the other 11 I think it was listed regressions before 2.6.24 - or backing out every changeset that could be responsible ? No I thought not - because that wouldn't be sensible either. Alan --
btw., how extensive are those cleanups and fixes in reality, is there a rollup somewhere one could take a look at? Those fixes and cleanups were deferred to v2.6.25 in the knowledge of having the current code included in v2.6.24 - but now that the current approach seems to regress, maybe those cleanups are still safe enough. (compared to an outright revert) Ingo --
The following git tree contains patches pending review for 2.6.25. http://git.kernel.org/?p=linux/kernel/git/tj/libata-dev.git;a=shortlog;h=improve-ATAPI... And we're getting close to fixing the regression. I don't think there's too much worry about this one. Just need a bit more time to test few more things. Thanks. -- tejun --
ah, i see, the joys of the kernel running BIOS written code (AML): http://bugzilla.kernel.org/attachment.cgi?id=13932&action=view cute! Ingo --
Matt, the above bug is still occuring en masse during random bootups: [ 67.828312] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() [ 41.270905] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() [ 82.378714] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() [ 35.308126] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() [ 35.484155] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() and the patch does not seem to be upstream yet. Ingo --
err, umm, OK, I'll go hunt it down and take a closer look at it. --
I was hoping for some discussion about whether it was the best fix.
The current kzalloc thing strikes me as a step backwards for all
allocators. We'd do better to have a single non-inline kzalloc
function rather than an extra branch in the normal kmalloc fast path.
But here's the patch again, with my sign-off:
Avoid calling page allocator with __GFP_ZERO, as we might be in atomic
context and this will make thing unhappy on highmem systems. Instead,
manually zero allocations from the page allocator.
Signed-off-by: Matt Mackall <mpm@selenic.com>
diff -r f7edf7226317 mm/slob.c
--- a/mm/slob.c Wed Dec 05 15:57:06 2007 -0600
+++ b/mm/slob.c Wed Dec 05 15:57:51 2007 -0600
@@ -223,6 +231,7 @@ static void *slob_new_page(gfp_t gfp, in
{
void *page;
+ gfp &= ~__GFP_ZERO;
#ifdef CONFIG_NUMA
if (node != -1)
page = alloc_pages_node(node, gfp, order);
@@ -457,6 +470,8 @@ void *__kmalloc_node(size_t size, gfp_t
page = virt_to_page(ret);
page->private = size;
}
+ if (unlikely((gfp & __GFP_ZERO) && ret))
+ memset(ret, 0, size);
return ret;
}
}
--
Mathematics is the supreme nostalgia of our time.
--
I think this is fine, but didn't we fix the warning already? Calling page allocators with __GFP_ZERO should be fine, as long as __GFP_HIGHMEM isn't set, and slab/slub/slob/kmalloc cannot use GFP_HIGHMEM *anyway*. But I'll apply it anyway, because it looks "obviously correct" from the standpoint that the _other_
It's a kmap_atomic() debugging patch which I wrote ages ago and whcih Ingo sucked into his tree. I don't _think_ this warning is present in your tree at all. http://lkml.org/lkml/2007/11/29/157 is where it starts. I had a lenghty back-and-forth with Christoph on this within the past couple of months and I cannot locate the thread and I don't recall what the upshot was and Christoph is still offline. Knocking out __GFP_ZERO at the point where the slab allocator(s) call the page allocator seems like a good approach to me. But I don't think we need to do anything for 2.6.24.. --
Good. Although we should perhaps look at that reported performance problem with SLUB. It looks like SLUB will do a memclear() for the area twice (first for the whole page, then for the thing it allocated) for the slow case. Maybe that exacerbates the problem. Linus --
yes, i understand the initial announcement (and the Kconfig entry still says the same), but that is not matched up by the reality i see in the actual code - SLUB clearly uses a queue/list of objects (as cited in my previous mail), for obvious performance reasons. unless i'm missing something obvious (and i easily might), i see SLUB as SLAB reimplemented with a different queueing model. Not "without i'm curious about the real facts behind this "alien cache problem". I heard about it and asked around and was told that there's some sort of bad quadratic behavior of memory consumption on NUMA - but i cannot actually see that in the code. The alien caches feature of SLAB i see as a spread out clustered index/cache of objects on other nodes. It's not increasing the average per object memory consumption per se! The number of alien caches increases with increasing number of nodes, but _of course_, as memory size increases too so there's more stuff and a larger expected spreadout of memory to keep track of. ("Fixing" that would be like reintroducing a single runqueue for the scheduler, based on the argument that it's an O(1) number of runqueues against O(N) number of runqueues - which would be complete nonsense.) so i see SLAB alien caches as an automatic self-partitioning mechanism ... which has complexities but which also has _obvious_ performance benefits. Yes, it has some disadvantages like all caching schemes do - there's more cached memory tied up in the allocator at any given moment - but arguing against that would be like arguing against a 2MB L2 cache purely on the basis that a 1MB L2 cache is smaller and hence more space-efficient. Caches are there to cache stuff, and more caches ... use more memory. It's all a question of proportion and tuning, but the _design_ should be based on having as thorough caching as possible. Ingo --
Hi Ingo,
Sure, the emphasis is on "no *such* queues."
I mostly live in the legacy 32-bit UMA/UP land still so I cannot verify this
myself but the kind folks at SGI claim the following (again from the
announcement):
"On our systems with 1k nodes / processors we have several gigabytes
just tied up for storing references to objects for those queues This does
not include the objects that could be on those queues. One fears that the
whole memory of the machine could one day be consumed by those
queues."
The problem is that for each cache, you have an "per-node alien queues"
for each node (see struct kmem_cache nodelists -> struct kmem_list3 alien).
Moving slab metadata to struct page solves this but now you can only have
one "queue" thats part of the same struct.
Pekka
--
Yes, you can find gigs tied up on systems that have 100 GB of RAM, or you can have gigs tied up if you over-size your caches. I'd like to see yes, it's what i referred to as "distributed, per node cache". It has no "quadratic overhead". It has SLAB memory spread out amongst nodes. I.e. 1 million pages are distributed amongst 1k nodes with 1000 pages per node with each node having 1 page. But that memory is not lost and it's disingenous to call it 'overhead' and it very much comes handy for performance _IF_ there's global workload that involves cross-node allocations. It's simply a cache that is mis-sized and mis-constructed on large node count systems but i bet it makes quite a performance difference on low-node-count systems. On high node-count systems it might make sense to reduce the amount of cross-node caching and to _structure_ the distributed NUMA SLAB cache in an intelligent way (perhaps along cpuset boundaries) - but a total, design level _elimination_ of this caching concept, using very misleading arguments, just looks stupid to me ... Ingo --
i think much of this could be achieved by delaying the creation of alien
caches up until the point a { node X } -> { node Y } alloc/free
relationship gets established. So if a system is partitioned along
cpusets, vast areas of the NxN matrix never gets populated with alien
caches.
Ingo
--
The "queue" that you are talking about is the freelist of a slab. It exist for each slab. SLAB uses a table of free entries there. The freelist is limited to a slab and thus the locality of the freelist is bound to the same page which avoids NUMA locality checks. --
On Sun, 9 Dec 2007 10:20:19 +0200 "Pekka Enberg" <penberg@cs.helsinki.fi> wrote: .. and they make slab slower on numa systems for database workloads To be honest, a SLAB without the alien stuff might be one of the best performers today .... ;) (on database loads.. where slub is quite a disaster as everyone knows) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
On Sat, Dec 08, 2007 at 08:52:11PM +0100, Ingo Molnar wrote: > so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and > SLUB have essentially the same linecount: > > $ wc -l mm/slab.c mm/slub.c > 4478 mm/slab.c > 4125 mm/slub.c > > (and while linecount != complexity, there is a strong relationship.) > > With SLAB having 10 years more test coverage and tuning. FWIW, the one thing slub does that slab doesn't that I find really nice is being enable to enable debugging at boot time rather than compile time. We don't get many people running benchmarks against the Fedora kernel, so any scalability differences between slub/slab probably won't reach us until we start shipping betas of the next RHEL based on the same kernel. Which leaves my only other gripe. It broke slabtop. There's an alternative implementation in Documentation/vm/slabinfo.c (why there not say, util-linux, home of current slabtop?) Dave -- http://www.codemonkey.org.uk --
yes, but that's largely due to "dont change SLAB because we've got SLUB" resistence to SLAB patches. It's a 2 minute hack to implement this for that's actually a _bad_ ABI regression. Rafael, could you please add the kernel should output /proc/slabinfo data with the same formatting, period. Ingo --
I can remember it being said that would be one of the things to-do before SLUB could become default. --
Yes, SLUB should be the way to go, but some issues are not yet solved. I had to switch back to SLAB on a production NUMA server, with 2 nodes and 8GB ram. Using a lot of sockets, so a large part of memory was used by kernel. SLUB kernel was hitting OOM after 2 or 3 days of uptime. SLAB kernel never hit this. Unfortunatly I dont have a test machine to reproduce the setup. Maybe the problem is not related to SLUB at all, but an underlying VM/NUMA bug. The /proc/buddyinfo showed that : Node 0 contained two zones (DMA and DMA32) total 4 GB Node 1 contained one zone (Normal) total 4 GB So Node 0 contained no (Normal) zone part of /proc/meminfo Slab: 3338512 kB SReclaimable: 789716 kB SUnreclaim: 2548796 kB I remember network interrupts were taken by CPU 1, so most allocations were done by CPU 1 (node 1), and many freeing were done on CPU 0 Hope this helps --
huh? You got the ordering wrong ;-) SLUB needs to resolve all regressions relative to SLAB. (or at least have a really good which is of little help if it regresses on other workloads. As we've seen it, SLUB can be more than 10 times slower on hackbench. You can tune SLUB to use 2MB pages but of course that's not a production level that's not a fundamental property of SLAB. It would be an about 10 lines hack to enable SLAB debugging switchable-on runtime, with the boot flag well that's the nature of caches, but it could be improved: restrict the moment you start capturing more memory in SLUB's per cpu queues i do appreciate that :-) SLUB was rather easy to "port" to PREEMPT_RT: it did not need a single line of change. The SLAB portion is a lot scarier: dione:~linux-rt.q> diffstat patches/rt-slab-new.patch all of these are neat. How about renaming it to SLAB2 instead of SLUB? The "unqueued" bit is just stupid NIH syndrome. It's _of course_ queued because it has to. "It actually, this might be a bug. the DMA caches should be created right away and filled with a small amount of objects due to stupid 16MB limitations with certain hardware. Later on a GFP_DMA request might not be fulfillable. (because that zone is filled up pretty quickly) Ingo --
ok - Rafael, please strike this off the regressions list, there is no problem in .24 other than the double memset for some SLOB metadata. Ingo --
But what about:
(c) stop passing GFP_ZERO to kmalloc allocators
stop fiddling with flags inside allocators
remove ugly if (unlikely(gfp & GFP_ZERO)) from kmalloc allocators
make kzalloc and kcalloc non-inline functions that do the memset
That should:
- make both kmalloc and kzalloc/kcalloc faster (one less branch)
- reduce kernel size
GFP_ZERO is a bit of an abuse here, given that we don't even intend to
pass it to the underlying "GFP".
--
Mathematics is the supreme nostalgia of our time.
--
While I think the whole GFP_ZERO thing is a bit broken here, this is an improvement on my original patch, so: -- Mathematics is the supreme nostalgia of our time. --
Hi Linus, In case you didn't already merge this: Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> --
SLUB does not pass __GFP_ZERO to the page allocator. --
Hi. I don't think that this is a regression: I reported on RedHat bugzilla when I switched from F7 to F8 and I was using 2.6.23.8 at that time. It looks to me an HAL regression, but of course I may be wrong :-) as the reported bisected to a bad commit. https://bugzilla.redhat.com/show_bug.cgi?id=373041 By the way, I now switched to Fedrora Rawhide with a 2.6.24-rc4-git5 custom kernel and Gnome desktop and the problem is still present, even with gnome-power-manager. Hope this helps. Regards, Fabio --
to me this looks like an ABI regression - utilities should work without
change. Something changed in /sys output that caused HAL to think that
there are two batteries:
| The output of lshal shows that there are two UDI's with
| info.capabilities = { 'battery' }:
|
| udi = '/org/freedesktop/Hal/devices/acpi_BAT0'
| udi = '/org/freedesktop/Hal/devices/computer_power_supply_0'
whether it's a HAL bug or a kernel bug, the original state should be
restored and it should be worked out without breaking users of older HAL
versions.
grumble: way too many times do various system utilities break when i
upgrade the kernel on my laptop. Maybe a new debug mechanism: we should
start fingerprinting the exact /sys and /proc output and enforce that
it's immutable across kernel releases as long as the hardware is
unmodified?
Ingo
--
"breaking users of older HAL versions" == "breaking machines". That would be neat. It would need to be executed on a lot of different machines. I wonder if there's something sneaky we can do here. Install the script in /lib/modules/$(uname -r) and then run it from the kernel when the fork count reaches 1000 ;) (hey, I've seen worse: /proc files which start with #!/bin/sh) --
Hm, that wouldn't allow us to add new attributes ... --
Here's one for you - I have a new Lenovo t61p with which to irritate everyone. suspend-to-ram is a wipeout, but suspend-to-disk works OK under 2.6.23. However under 2.6.24-rc1 and -rc4 the machine reboots right at the end of resume-from-disk. Am trying to do a git-disect on it but it seems that someone has been screwing with ata Kconfig and I'm hitting a pile of cant-find-root-disk bisection points and I can't immediately work out why. I'll try to find time to look at it again next week. --
the way i solve such bisection problems is to have the patch like below applied by a "git-bisect run" scriptlet (and popped off after the test). This way all must-have drivers and kernel features are selected for that particular testbox, no matter what Kconfig complication there are. (except outright config option renaming but those are rare) Ingo Index: linux/arch/x86/Kconfig.needed =================================================================== --- /dev/null +++ linux/arch/x86/Kconfig.needed @@ -0,0 +1,88 @@ +config FORCE_MINIMAL_CONFIG + bool + default y + +select EXPERIMENTAL + +select EXT3_FS +select EXT3_FS_XATTR +select EXT3_FS_POSIX_ACL +select EXT3_FS_SECURITY +select BLOCK +select HOTPLUG +#select INOTIFY +#select INOTIFY_USER + +# so that capset() works (sudo, etc.): +select SECURITY +select SECURITY_CAPABILITIES + +select BINFMT_ELF +select MSDOS_PARTITION +select PARTITION_ADVANCED +select BSD_DISKLABEL + +select SYSFS +select SYSFS_DEPRECATED +select PROC_FS +select FUTEX + +select ATA +select SATA_AHCI +select ATA_PIIX +select PATA_AMD +select PATA_OLDPIIX +select BLK_DEV_SD + +select E100 +select E1000 +select NET_ETHERNET +select NET_PCI +select MII +select CRC32 + +select 8139TOO +select FORCEDETH + +select PACKET + +select NETPOLL +select NETCONSOLE +select NET_POLL_CONTROLLER +select INET +select NET +select UNIX +select NETDEVICES + +select SERIAL_8250 +select SERIAL_8250_CONSOLE +select MAGIC_SYSRQ + +select INPUT +select INPUT_MOUSEDEV +select INPUT_POLLDEV +select INPUT_KEYBOARD +select KEYBOARD_ATKBD +select SERIO +select SERIO_I8042 + +select VT +select VT_CONSOLE +select HW_CONSOLE +select VGA_CONSOLE +select EARLY_PRINTK +select PRINTK +select UNIX98_PTYS + +select USB +select USB_MOUSE +select USB_EHCI_HCD +select USB_OHCI_HCD +select USB_UHCI_HCD +select USB_SUPPORT + +select PCI + +select STANDALONE +select PREVENT_FIRMWARE_BUILD + Index: ...
It's http://bugzilla.kernel.org/show_bug.cgi?id=9258 , I think. Does it do that if you unload ehci-hcd before the hibernation? --
Linus, Andrew, i need some help deciding what to do about this regression. The fixes for this have been tested and resolve the regression, but they change printk and other code that runs by default and is thus rather invasive so late in the v2.6.24 cycle. This bug should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug option) - although some claimed effect was on udelay()/mdelay() too. i've attached below the queue of 5 patches that fix this problem. They have been build and boot tested with more than 1000 random kernels in the past few days, so i certainly trust the core and x86 bits of this. what do you think? Right now i've got them queued up for 2.6.25 in both the scheduler-devel and the x86-devel git trees - but can submit them for 2.6.24 if it's better if we did them there. I've got no strong opinion either way. Ingo --------------------> Subject: x86: scale cyc_2_nsec according to CPU frequency From: "Guillaume Chazarain" <guichaz@yahoo.fr> scale the sched_clock() cyc_2_nsec scaling factor according to CPU frequency changes. [ mingo@elte.hu: simplified it and fixed it for SMP. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/tsc_32.c | 43 ++++++++++++++++++++++++++++++----- arch/x86/kernel/tsc_64.c | 57 ++++++++++++++++++++++++++++++++++++++--------- include/asm-x86/timer.h | 23 ++++++++++++++---- 3 files changed, 102 insertions(+), 21 deletions(-) Index: linux/arch/x86/kernel/tsc_32.c =================================================================== --- linux.orig/arch/x86/kernel/tsc_32.c +++ linux/arch/x86/kernel/tsc_32.c @@ -5,6 +5,7 @@ #include <linux/jiffies.h> #include <linux/init.h> #include <linux/dmi.h> +#include <linux/percpu.h> #include <asm/delay.h> #include <asm/tsc.h> @@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable); * * -johnstul@us.ibm.com "math is hard, lets go shopping!" */ -unsigned long ...
Any specific report? The jumping sched_clock on frequency change caused some scheduling oddities for me, but CFS attenuated the effect. Thanks. -- Guillaume --
On Mon, 10 Dec 2007 21:42:12 +0100 printk_clock() doesn't seem terribly important but what's this stuff about effects on udelay/mdelay? That can be serious if they're getting shortened. --
since udelay depends on loops_per_jiffy, which is fixed up time_cpufreq_notifier(), i dont see how it could be affected by frequency changes. (but that's the theory - practice might be different) Ingo --
Stefano Brivio reported udelay()/mdelay() effects in the b43 driver. (and it caused driver failures for him.) Stefano, could you please try to sum up your experiences with that issue? Is it reproducable, and the 5 patches i did fix it? (if yes, could you try to re-do the mdelay verifications perhaps, to make sure it's not some other effect interacting here. In theory sched-clock scaling has no effect on udelay behavior.) Ingo --
On Tue, 11 Dec 2007 00:04:25 +0100 Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in delays isn't that good when using my crappy unstable TSC (mdelay(2000) causes delays between 2 and 2.9 seconds) but it's not depending on frequency changes anymore. So I'd say it's fixed, but please tell me if you want me to do any other test so as to be sure it is. -- Ciao Stefano --
Do you mean that the cpufreq code can be confused about the actual frequency of the cores? That sounds like a big problem. Thanks for any insight. -- Guillaume --
On Tue, 11 Dec 2007 01:01:25 +0100 it'll get way worse going forward. (but even on todays systems, the tsc no longer represents frequency, but is some fixed clock totally unrelated to cpu frequency) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Ingo, it seems you dropped http://lkml.org/lkml/2007/12/7/100 (cpu_clock() based udelay), so how udelay can be affected by your proposed changes? Thanks. -- Guillaume --
was this needed for you to get stable udelay()? (that cpu_clock() based udelay patch was buggy, i got the units wrong. udelay does wacky conversions between various units. So i dropped it for the time being.) the last rollup you tested didnt show udelay problems, and it didnt include the sched_clock() based udelay patch. so it would be nice if you could re-examine exactly what is needed. Please try latest -git and the concatenation of the 4 patches below. What would be the best info is to see which (if any!) patches are needed against latest -git to get a stable udelay() on your box. Ingo ---------------------------------------> Linus, Andrew, i need some help deciding what to do about this regression. The fixes for this have been tested and resolve the regression, but they change printk and other code that runs by default and is thus rather invasive so late in the v2.6.24 cycle. This bug should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug option) - although some claimed effect was on udelay()/mdelay() too. i've attached below the queue of 5 patches that fix this problem. They have been build and boot tested with more than 1000 random kernels in the past few days, so i certainly trust the core and x86 bits of this. what do you think? Right now i've got them queued up for 2.6.25 in both the scheduler-devel and the x86-devel git trees - but can submit them for 2.6.24 if it's better if we did them there. I've got no strong opinion either way. Ingo --------------------> Subject: x86: scale cyc_2_nsec according to CPU frequency From: "Guillaume Chazarain" <guichaz@yahoo.fr> scale the sched_clock() cyc_2_nsec scaling factor according to CPU frequency changes. [ mingo@elte.hu: simplified it and fixed it for SMP. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/tsc_32.c | 43 ++++++++++++++++++++++++++++++----- arch/x86/kernel/tsc_64.c | 57 ...
ok, just to make sure we are all synced up. I made 8 patches related to this problem category (and all the trickle effects). 3 are upstream already, 5 are pending for v2.6.25. One out of those 5 is an immaterial cleanup patch - which leaves us 4 patches to sort out. So i'd suggest for you to try latest -git - that will tell us whether udelay() is acceptable on your box right now. i've attached those 4 patches: x86-sched_clock-re-scheduler-fix-x86-regression-in-native-sched-clock.patch x86-cpu-clock-idle-event.patch sched-printk-recursion-fix.patch sched-printk-clock-fix.patch none of them is _supposed_ to have any effect on udelay(), but the interactions in this area are weird. [ note: CONFIG_PRINTK_TIME will be broken and only fixed in v2.6.25, so use some other time metric for determining mdelay quality. ] plus then there's this patch: http://lkml.org/lkml/2007/12/7/100 is it perhaps this one that fixed udelay for you? [ which would be much more expected, as this patch changes udelay ;-) ] Ingo
On Tue, 11 Dec 2007 10:01:20 +0100 Yes, it is (msleep(2000), as said, gives delays between 2 and 2.9s on my box, and drivers are happy). The commit which fixed this (it seems) is fa2dd441df28b9fdfc68f84ae66f1b507cfff0e4. I'll bisect and tell you more in the Will try it ASAP, again, in the next few days anyway. -- Ciao Stefano --
